Boolean Model of Information Retrieval

The Boolean model is a very primitive model, but it is still a good starting point for learning information retrieval techniques. Let's start with an example:

Suppose we have 3 documents, D1, D2 and D3, with the following contents:

D1: Shipment of gold damaged in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck

The term vocabulary we are considering is as follows:

V={Fire, Gold, Silver, Truck}

The term vocabulary is the list of terms/words we allow users to search for. In a real application, it can be as large as a full English dictionary.

The first step is to build the Term x Document matrix, which is as follows:

Term     D1   D2   D3
Fire     1    0    0
Gold     1    0    1
Silver   0    1    0
Truck    0    1    1

The matrix is pretty intuitive. A cell contains 1 if the term is present in the respective document; otherwise it contains 0.

Reading each row from left to right (D1, D2, D3), we get the following binary representation of the terms:

Fire: 100, Gold: 101, Silver: 010, Truck: 011
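
The offline step can be sketched in a few lines of Python. This is only a minimal illustration: the function name build_incidence and the naive whitespace tokenizer are assumptions made for this example, not part of the model itself.

```python
# Documents from the example above, in the order D1, D2, D3.
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
vocabulary = ["fire", "gold", "silver", "truck"]


def build_incidence(docs, vocabulary):
    """Return {term: bit string}, one bit per document in insertion order."""
    # Lowercase and split on whitespace; a real tokenizer would do much more.
    tokens = {name: set(text.lower().split()) for name, text in docs.items()}
    return {
        term: "".join("1" if term in tokens[name] else "0" for name in docs)
        for term in vocabulary
    }


incidence = build_incidence(docs, vocabulary)
for term, bits in incidence.items():
    print(f"{term:<7}{bits}")
```

Each printed row should match the bit strings listed above, for example 101 for gold.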

This part is offline processing. Next comes the online query processing.

Let's say someone submits a query that is transformed into the following logical expression over the vocabulary terms:

Query 1: (fire OR gold) AND (truck OR NOT silver)

If we replace each term with its binary representation and evaluate the expression bit by bit, we get the following:

(100 OR 101) AND (011 OR NOT 010)

= (100 OR 101) AND (011 OR 101)

= 101 AND 111

= 101

Now we have some binary bits. But how do we interpret these bits?

Pretty simple: each bit position corresponds to one document, and a 1 marks a document to be retrieved as a result of the query.

In this case, the search returns documents D1 and D3, as indicated by the 1s in the first and third positions.
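
The same evaluation can be sketched with Python integers as bit vectors. The helper names bit_not and matching_docs, and the constant MASK, are assumptions made for this illustration. The mask matters because Python's ~ operator on its own yields a negative number rather than a three-bit complement.

```python
N_DOCS = 3
MASK = (1 << N_DOCS) - 1  # 0b111, keeps NOT within three bits

# Bit vectors from the term-document matrix; the leftmost bit is D1.
fire, gold, silver, truck = 0b100, 0b101, 0b010, 0b011


def bit_not(v):
    """Complement restricted to N_DOCS bits."""
    return ~v & MASK


def matching_docs(v):
    """Translate a result vector into document names, reading bits left to right."""
    return [f"D{i + 1}" for i in range(N_DOCS) if v & (1 << (N_DOCS - 1 - i))]


# Query 1: (fire OR gold) AND (truck OR NOT silver)
query1 = (fire | gold) & (truck | bit_not(silver))
print(format(query1, "03b"), matching_docs(query1))  # 101 ['D1', 'D3']
```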

Similarly, we can perform the following search.

Query 2: (fire OR NOT silver) AND (NOT truck OR NOT fire).

(100 OR NOT 010) AND (NOT 011 OR NOT 100)

= (100 OR 101) AND (100 OR 011)

= 101 AND 111

= 101

Again the result is 101, so this query also retrieves D1 and D3.
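
As a cross-check, Query 2 can also be evaluated with sets of document names instead of bit strings: OR becomes union, AND becomes intersection, and NOT becomes the complement with respect to all documents. This set-based form is an alternative representation chosen for this sketch, and the names postings and NOT are assumptions.

```python
ALL_DOCS = {"D1", "D2", "D3"}

# Documents containing each term, read off the term-document matrix.
postings = {
    "fire": {"D1"},
    "gold": {"D1", "D3"},
    "silver": {"D2"},
    "truck": {"D2", "D3"},
}


def NOT(docs):
    """Complement with respect to the whole collection."""
    return ALL_DOCS - docs


fire, silver, truck = postings["fire"], postings["silver"], postings["truck"]

# Query 2: (fire OR NOT silver) AND (NOT truck OR NOT fire)
left = fire | NOT(silver)        # {D1} | {D1, D3} = {D1, D3}
right = NOT(truck) | NOT(fire)   # {D1} | {D2, D3} = {D1, D2, D3}
result = left & right
print(sorted(result))  # ['D1', 'D3']
```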

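One disadvantage of the Boolean model is that it can only answer based on the presence or absence of terms: a document either matches the query or it does not. There is no way to answer in a probabilistic way, scoring or ranking the matching documents, as modern search engines do.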
