logo资料库

Introduction to Information Retrieval Solution-Manual.pdf

第1页 / 共119页
第2页 / 共119页
第3页 / 共119页
第4页 / 共119页
第5页 / 共119页
第6页 / 共119页
第7页 / 共119页
第8页 / 共119页
资料共119页,剩余部分请下载后查看
An Introduction to Information Retrieval Christopher D. Manning Prabhakar Raghavan Hinrich Schütze Cambridge University Press Cambridge, England Preliminary draft (c)2007 Cambridge UP
DRAFT! DO NOT DISTRIBUTE WITHOUT PRIOR PERMISSION © 2007 Cambridge University Press By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze Printed on December 12, 2007 Website: http://www.informationretrieval.org/ Comments, corrections, and other feedback most welcome at: informationretrieval@yahoogroups.com Preliminary draft (c)2007 Cambridge UP
Solutions to exercises v This document contains solutions to most exercises in Introduc- tion to Information Retrieval. The solutions were provided by a student and many have not been checked by the authors yet. Please send comments and corrections to informationretrieval (at) yahoogroups.com . Requests for this document should be directed to the address given in the section “Problem sets” at http://informationretrieval.org. Please do not distribute these solutions without prior permis- sion. Preliminary draft (c)2007 Cambridge UP
Preliminary draft (c)2007 Cambridge UP
DRAFT! © December 12, 2007 Cambridge University Press. Feedback welcome. vii Boolean retrieval ? Exercise 0.1 [] Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.) Doc 1 new home sales top forecasts Doc 2 home sales rise in july Doc 3 increase in home sales in july Doc 4 july new home sales rise SOLUTION. increase->3 july->2->3 new->1->4 rise->2->4 sale->1->2->3->4 top->1 forecast->1 home->1->2->3->4 in->2->3 Inverted Index: Exercise 0.2 Consider these documents: Doc 1 breakthrough drug for schizophrenia Doc 2 new schizophrenia drug Doc 3 new approach for treatment of schizophrenia Doc 4 new hopes for schizophrenia patients [] a. Draw the term-document incidence matrix for this document collection. b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7). SOLUTION. Term-Document matrix: d1 d2 d3 d4 Approach 0 0 1 0 breakthrough 1 0 0 0 drug 1 1 0 0 for 1 0 1 1 hopes 0 0 0 1 new 0 1 1 1 of 0 0 1 0 patients 0 0 0 1 schizophrenia 1 1 1 1 treatment 0 0 1 0 Inverted Index: Approach -> 3 breakthrough ->1 drug ->1->2 for ->1->3- >4 hopes ->4 new -.>2->3->4 of ->3 patients ->4 schizophrenia ->1->2->3->4 treatment >3 [] Exercise 0.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries: a. schizophrenia AND drug Preliminary draft (c)2007 Cambridge UP
viii Boolean retrieval b. for AND NOT(drug OR approach) SOLUTION. (i) doc1, doc2 (ii) doc4 ? Exercise 0.4 [] For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve? a. Brutus AND NOT Caesar b. Brutus OR NOT Caesar SOLUTION. a. Time is O(x+y). Instead of collecting documents that oc- cur in both postings lists, collect those that occur in the first one and not in the second. b. Time is O(N) (where N is the total number of documents in the collection) assuming we need to return a complete list of all documents satis- fying the query. This is because the length of the results list is only bounded by N, not by the length of the postings lists. [] Exercise 0.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider: c. (Brutus OR Caesar) AND NOT (Anthony OR Cleopatra) Can we always merge in linear time? Linear in what? Can we do better than this? SOLUTION. We can always intersect in O(qN) where q is the number of query terms and N the number of documents, so the intersection time is linear in the number of documents and query terms. Since the tightest bound for the size of the results list is N, the number of documents, you cannot do better than O(N). Exercise 0.6 We can use distributive laws for AND and OR to rewrite queries. [] a. Show how to rewrite the above query into disjunctive normal form using the dis- tributive laws. b. Would the resulting query be more or less efficiently evaluated than the original form of this query? c. Is this result true in general or does it depend on the words and the contents of the document collection? Preliminary draft (c)2007 Cambridge UP
SOLUTION. Query in disjunctive normal form: (brutus and not anthony and not cleopatra) or (caesar and not anthony and not cleopatra). In this case, disjunctive normal form is more efficient than conjunctive normal form. In the former case, we compute intersections first and then, in the last step, one union of two postings lists that (hopefully) are small. In the latter case, we start with a union of postings lists and have to deal with potentially large intermediate results. The above reasoning is probably not true for some words, e.g., (rare-word-1 or rare-word-2) and not (hong or kong), assuming hong and kong are very frequent and occur in the same documents. The above is not true if there are only negated query words in the disjunctive normal form. Exercise 0.7 Recommend a query processing order for d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes) given the following postings list sizes: ix [] Term eyes kaleidoscope marmalade skies tangerine trees Postings size 213312 87009 107913 271658 46653 316812 SOLUTION. Using the conservative estimate of the length of unioned postings lists, the recommended order is: (kaleidoscope OR eyes) (300,321) AND (tangerine OR trees) (363,465) AND (marmalade OR skies) (379,571) However, depending on the actual distribution of postings, (tangerine OR trees) may well be longer than (marmalade OR skies) because the two com- ponents of the former are more asymmetric. For example, the union of 11 and 9990 is expected to be longer than the union of 5000 and 5000 even though the conservative estimate predicts otherwise. S. Singh’s solution 1.7Time for processing : (i) (tangerine OR trees) = O(46653+316812) = O(363465) (ii) (marmalade OR skies) = O(107913+271658) = O(379571) (iii) (kaleidoscope OR eyes) = O(46653+87009) = O(300321) Order of processing: a. Process (i), (ii), (iii) in any order as first 3 steps (total time for these steps is O(363465+379571+300321) in any case) b. Merge (i) AND (iii) = (iv): In case of AND operator, the complexity of merging postings list depends on the length of the shorter postings list. Therefore, the more short the smaller postings list, the lesser the time spent. The reason for choosing (i) instead of (ii) is that the output list (iv) is more probable to be shorter if (i) is chosen. c. Merge (iv) AND (ii): This is the only merging operation left. Preliminary draft (c)2007 Cambridge UP
x Exercise 0.8 If the query is: Boolean retrieval [] friends AND romans AND (NOT countrymen) e. how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing. SOLUTION. For very frequent negated terms, use N-(length of postings list) instead of (length of postings list). For infrequent negated terms, use (length of postings list) for ordering. Process the latter group last. (Need to say what to do with very frequent non-negated terms.) [] Exercise 0.9 For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn’t. SOLUTION. The order is not guaranteed to be optimal. Consider three terms with postings list sizes s1 = 100, s2 = 105 and s3 = 110. Suppose the in- tersection of s1 and s2 has length 100 and the intersection of s1 and s3 length 0. The ordering s1, s2, s3 requires 100+105+100+110=315 steps through the postings lists. The ordering s1,s3,s2 requires 100+110+0+0=210 steps through the postings lists. [] Exercise 0.10 Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query. SOLUTION. UNION(x, y) 1answer<- ( ) 2while x!=NIL and y!=NIL 3do if docID(x)=docID(y) 4 then ADD(answer,docID(x)) 5 x<- next(x) 6 y<-next(y) 7 else if docID(x)
分享到:
收藏