- A model for information retrieval supporting query in the form of a Boolean expression of terms which are combined with the operators AND, OR, and NOT.
- The model views each document as just a set of words. ( —— similar to Bag of words (BoW) model )
to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia
- term-document incidence matrix
- Each document is identified by an index in advance to avoid linearly scanning the texts for each query.
- For each document ( here a play of Shakespeare’s ), whether it contains each word out of all the words Shakespeare used is recorded.
- The result is a binary term-document incidence matrix, as in Figure 1.
DOC (PLAY) | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth | ... | |
TERM | ||||||||
Antony | 1 | 1 | 0 | 0 | 0 | 1 | ||
Brutus | 1 | 1 | 0 | 1 | 0 | 0 | ||
Caesar | 1 | 1 | 0 | 1 | 1 | 1 | ||
Calpurnia | 0 | 1 | 0 | 0 | 0 | 0 | ||
Cleopatra | 1 | 0 | 0 | 0 | 0 | 0 | ||
mercy | 1 | 0 | 1 | 1 | 1 | 1 | ||
worser | 1 | 0 | 1 | 1 | 1 | 0 | ||
... |
◮ Figure 1 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.
- query Brutus AND Caesar AND NOT Calpurnia
- The matrix rows or columns perspective provide a vector for each term, which shows the documents it appears in; or a vector for each document, showing the terms that occur in it.
- To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100
- The answers for this query are thus Antony and Cleopatra and Hamlet.
term-document incidence matrix is often extremely sparse with few non-zero entries.
A much better representation is to record only the things that do occur, that is, the 1 positions.
- An index always maps back from terms to the parts of a document where they occur.
- Sometimes also named inverted file.
Brutus | 1 | 2 | 4 | 11 | 31 | 45 | 173 | 174 | ... | |
Caesar | 1 | 2 | 4 | 5 | 6 | 16 | 57 | 132 | ||
Calpurnia | 2 | 31 | 54 | 101 | ||||||
... | ||||||||||
◮Figure 2 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk.
Doc 1
I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me.
Doc 2
So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:
-
To build the index, the major steps in this are:
- Collect the documents to be indexed:
Friends, Romans, countrymen. So let it be with Caesar . . .- Within a document collection, we assume that each document has a unique serial number, known as the document identifier (docID).
- During index construction, we can simply assign successive integers to each new document when it is first encountered.
- Tokenize the text, turning each document into a list of tokens:
Friends Romans countrymen So . . . - Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
- The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 3 below.
- The core indexing step is sorting this list so that the terms are alphabetical.
- Multiple occurrences of the same term from the same document are then merged.
- Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the Figure 4.
- The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of pairs of term and docID, as in Figure 3 below.
- Collect the documents to be indexed:
-
The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency, which is here also the length of each postings list).
Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements of the index.
term | docID | term | docID | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
I | 1 | ambitious | 2 | ||||||||
did | 1 | be | 2 | ||||||||
enact | 1 | brutus | 1 | ||||||||
julius | 1 | brutus | 2 | ||||||||
caesar | 1 | capitol | 1 | ||||||||
I | 1 | caesar | 1 | ||||||||
was | 1 | caesar | 2 | ||||||||
killed | 1 | caesar | 2 | ||||||||
i’ | 1 | did | 1 | ||||||||
the | 1 | enact | 1 | ||||||||
capitol | 1 | hath | 1 | ||||||||
brutus | 1 | I | 1 | ||||||||
killed | 1 | I | 1 | ||||||||
me | 1 | =⇒ | i’ | 1 | |||||||
so | 2 | it | 2 | ||||||||
let | 2 | julius | 1 | ||||||||
it | 2 | killed | 1 | ||||||||
be | 2 | killed | 1 | ||||||||
with | 2 | let | 2 | ||||||||
caesar | 2 | me | 1 | ||||||||
the | 2 | noble | 2 | ||||||||
noble | 2 | so | 2 | ||||||||
brutus | 2 | the | 1 | ||||||||
hath | 2 | the | 2 | ||||||||
told | 2 | told | 2 | ||||||||
you | 2 | you | 2 | ||||||||
caesar | 2 | was | 1 | ||||||||
was | 2 | was | 2 | ||||||||
ambitious | 2 | with | 2 |
◮Figure 3 Building an index by sorting and grouping. The sequence of terms in each document, tagged by their documentID (left) is sorted alphabetically (right). Instances of the same term are then grouped by word and then by documentID.
term doc. freq. | postings lists |
||||||
---|---|---|---|---|---|---|---|
ambitious 1 | 2 | ||||||
be 1 | 2 | ||||||
brutus 2 | 1 | 2 | |||||
capitol 1 | 1 | ||||||
caesar 2 | 1 | 2 | |||||
did 1 | 1 | ||||||
enact 1 | 1 | ||||||
hath 1 | 2 | ||||||
I 1 | 1 | ||||||
i’ 1 | 1 | ||||||
it 1 | 2 | ||||||
julius 1 | 1 | ||||||
killed 1 | 1 | ||||||
let 1 | 2 | ||||||
me 1 | 1 | ||||||
noble 1 | 2 | ||||||
so 1 | 2 | ||||||
the 2 | 1 | 2 | |||||
told 1 | 2 | ||||||
you 1 | 2 | ||||||
was 2 | 1 | 2 | |||||
with 1 | 2 |