An
Introduction
to
Information
Retrieval
Christopher D. Manning
Prabhakar Raghavan
Hinrich Schütze
Cambridge University Press
Cambridge, England
Preliminary draft (c)2007 Cambridge UP
DRAFT!
DO NOT DISTRIBUTE WITHOUT PRIOR PERMISSION
© 2007 Cambridge University Press
By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze
Printed on December 12, 2007
Website: http://www.informationretrieval.org/
Comments, corrections, and other feedback most welcome at:
informationretrieval@yahoogroups.com
Preliminary draft (c)2007 Cambridge UP
Solutions to exercises
v
This document contains solutions to most exercises in Introduc-
tion to Information Retrieval. The solutions were provided by
a student and many have not been checked by the authors yet.
Please send comments and corrections to informationretrieval
(at) yahoogroups.com .
Requests for this document should be directed to the address
given in the section “Problem sets” at http://informationretrieval.org.
Please do not distribute these solutions without prior permis-
sion.
Preliminary draft (c)2007 Cambridge UP
Preliminary draft (c)2007 Cambridge UP
DRAFT! © December 12, 2007 Cambridge University Press. Feedback welcome.
vii
Boolean retrieval
? Exercise 0.1
[]
Draw the inverted index that would be built for the following document collection.
(See Figure 1.3 for an example.)
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
SOLUTION.
increase->3 july->2->3 new->1->4 rise->2->4 sale->1->2->3->4 top->1
forecast->1 home->1->2->3->4 in->2->3
Inverted Index:
Exercise 0.2
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
[]
a. Draw the term-document incidence matrix for this document collection.
b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
SOLUTION.
Term-Document matrix: d1 d2 d3 d4 Approach 0 0 1 0 breakthrough 1 0 0 0
drug 1 1 0 0 for 1 0 1 1 hopes 0 0 0 1 new 0 1 1 1 of 0 0 1 0 patients 0 0 0 1
schizophrenia 1 1 1 1 treatment 0 0 1 0
Inverted Index: Approach -> 3 breakthrough ->1 drug ->1->2 for ->1->3-
>4 hopes ->4 new -.>2->3->4 of ->3 patients ->4 schizophrenia ->1->2->3->4
treatment >3
[]
Exercise 0.3
For the document collection shown in Exercise 1.2, what are the returned results for
these queries:
a. schizophrenia AND drug
Preliminary draft (c)2007 Cambridge UP
viii
Boolean retrieval
b.
for AND NOT(drug OR approach)
SOLUTION.
(i) doc1, doc2 (ii) doc4
? Exercise 0.4
[]
For the queries below, can we still run through the intersection in time O(x + y),
where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what
can we achieve?
a. Brutus AND NOT Caesar
b. Brutus OR NOT Caesar
SOLUTION. a. Time is O(x+y). Instead of collecting documents that oc-
cur in both postings lists, collect those that occur in the first one and not in
the second. b. Time is O(N) (where N is the total number of documents in the
collection) assuming we need to return a complete list of all documents satis-
fying the query. This is because the length of the results list is only bounded
by N, not by the length of the postings lists.
[]
Exercise 0.5
Extend the postings merge algorithm to arbitrary Boolean query formulas. What is
its time complexity? For instance, consider:
c. (Brutus OR Caesar) AND NOT (Anthony OR Cleopatra)
Can we always merge in linear time? Linear in what? Can we do better than this?
SOLUTION. We can always intersect in O(qN) where q is the number
of query terms and N the number of documents, so the intersection time is
linear in the number of documents and query terms. Since the tightest bound
for the size of the results list is N, the number of documents, you cannot do
better than O(N).
Exercise 0.6
We can use distributive laws for AND and OR to rewrite queries.
[]
a. Show how to rewrite the above query into disjunctive normal form using the dis-
tributive laws.
b. Would the resulting query be more or less efficiently evaluated than the original
form of this query?
c. Is this result true in general or does it depend on the words and the contents of
the document collection?
Preliminary draft (c)2007 Cambridge UP
SOLUTION. Query in disjunctive normal form: (brutus and not anthony
and not cleopatra) or (caesar and not anthony and not cleopatra). In this case,
disjunctive normal form is more efficient than conjunctive normal form. In
the former case, we compute intersections first and then, in the last step, one
union of two postings lists that (hopefully) are small. In the latter case, we
start with a union of postings lists and have to deal with potentially large
intermediate results.
The above reasoning is probably not true for some words, e.g., (rare-word-1
or rare-word-2) and not (hong or kong), assuming hong and kong are very
frequent and occur in the same documents.
The above is not true if there are only negated query words in the disjunctive
normal form.
Exercise 0.7
Recommend a query processing order for
d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:
ix
[]
Term
eyes
kaleidoscope
marmalade
skies
tangerine
trees
Postings size
213312
87009
107913
271658
46653
316812
SOLUTION. Using the conservative estimate of the length of unioned
postings lists, the recommended order is: (kaleidoscope OR eyes) (300,321)
AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571)
However, depending on the actual distribution of postings, (tangerine OR
trees) may well be longer than (marmalade OR skies) because the two com-
ponents of the former are more asymmetric. For example, the union of 11 and
9990 is expected to be longer than the union of 5000 and 5000 even though
the conservative estimate predicts otherwise.
S. Singh’s solution
1.7Time for processing :
(i) (tangerine OR trees) = O(46653+316812) =
O(363465) (ii) (marmalade OR skies) = O(107913+271658) = O(379571) (iii)
(kaleidoscope OR eyes) = O(46653+87009) = O(300321)
Order of processing: a. Process (i), (ii), (iii) in any order as first 3 steps (total
time for these steps is O(363465+379571+300321) in any case)
b. Merge (i) AND (iii) = (iv):
In case of AND operator, the complexity
of merging postings list depends on the length of the shorter postings list.
Therefore, the more short the smaller postings list, the lesser the time spent.
The reason for choosing (i) instead of (ii) is that the output list (iv) is more
probable to be shorter if (i) is chosen.
c. Merge (iv) AND (ii): This is the only merging operation left.
Preliminary draft (c)2007 Cambridge UP
x
Exercise 0.8
If the query is:
Boolean retrieval
[]
friends AND romans AND (NOT countrymen)
e.
how could we use the frequency of countrymen in evaluating the best query evaluation
order? In particular, propose a way of handling negation in determining the order of
query processing.
SOLUTION. For very frequent negated terms, use N-(length of postings
list) instead of (length of postings list). For infrequent negated terms, use
(length of postings list) for ordering. Process the latter group last. (Need to
say what to do with very frequent non-negated terms.)
[]
Exercise 0.9
For a conjunctive query, is processing postings lists in order of size guaranteed to be
optimal? Explain why it is, or give an example where it isn’t.
SOLUTION. The order is not guaranteed to be optimal. Consider three
terms with postings list sizes s1 = 100, s2 = 105 and s3 = 110. Suppose the in-
tersection of s1 and s2 has length 100 and the intersection of s1 and s3 length
0. The ordering s1, s2, s3 requires 100+105+100+110=315 steps through the
postings lists. The ordering s1,s3,s2 requires 100+110+0+0=210 steps through
the postings lists.
[]
Exercise 0.10
Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y
query.
SOLUTION. UNION(x, y)
1answer<- ( )
2while x!=NIL and y!=NIL
3do if docID(x)=docID(y)
4 then ADD(answer,docID(x))
5 x<- next(x)
6 y<-next(y)
7 else if docID(x)