A Large-scale Evaluation and Analysis of Personalized
Search Strategies
Zhicheng Dou∗
Nankai University
Tianjin 300071, China
Ruihua Song
Microsoft Research Asia
Beijing 100080, China
JiRong Wen
Microsoft Research Asia
Beijing 100080, China
douzc@yahoo.com.cn
rsong@microsoft.com
jrwen@microsoft.com
ABSTRACT
Although personalized search has been proposed for many
years and many personalization strategies have been inves-
tigated, it is still unclear whether personalization is consis-
tently effective on different queries for different users, and
under different search contexts.
In this paper, we study
this problem and provide some preliminary conclusions. We
present a large-scale evaluation framework for personalized
search based on query logs, and then evaluate five person-
alized search strategies (including two click-based and three
profile-based ones) using 12-day MSN query logs. By an-
alyzing the results, we reveal that personalized search yields
significant improvements over common web search on some
queries but has little effect on other queries (e.g., queries
with small click entropy).
It even harms search accuracy
under some situations. Furthermore, we show that straight-
forward click-based personalization strategies perform con-
sistently and considerably well, while profile-based ones are
unstable in our experiments. We also reveal that both long-
term and short-term contexts are very important in improv-
ing search performance for profile-based personalized search
strategies.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—information filtering, Search process;
H.3.4 [Information Storage and Retrieval]: Systems
and Software—Performance evaluation (efficiency and ef-
fectiveness)
General Terms
Algorithms, Experimentation, Human Factors, Theory
Keywords
Click-through, Personalization, Personalized Search, Query
Log, Re-ranking
1. INTRODUCTION
One criticism of search engines is that when queries are is-
sued, most return the same results to all users. In fact, the vast
∗Work was done when the author was visiting Microsoft
Research Asia.
majority of queries to search engines are short [27, 12] and
ambiguous [16, 7], and different users may have completely
different information needs and goals under the same query
[12, 26, 23, 34]. For example, a biologist may use query
“mouse” to get information about rodents, while program-
mers may use the same query to find information about
computer peripherals. When such a query is submitted to a
search engine, it takes a moment for a user to choose which
information he/she wishes to get. For the query “free mp3
download”, users’ selections can also vary even though al-
most all of them are looking for websites to download free
MP3s: one may select the website “www.yourmp3.net”, while
another may prefer the website “www.seekasong.com”.
Personalized search is considered a solution to this problem
since it provides different search results based on the prefer-
ences of individual users. Various personalization strategies in-
cluding [21, 22, 26, 31, 14, 9, 35, 30, 19] have been pro-
posed, and personalized web search systems have been de-
veloped, but they are far from optimal. One problem of
current personalized search is that most proposed methods
are uniformly applied to all users and queries. In fact, we
think that queries should not be handled in the same manner
because we find:
(1) Personalization may lack effectiveness on some queries,
and there is no need for personalization on such queries.
This has also been found by [34]. For example on the query
“mouse” mentioned above, using personalization based on
user interest profile, we could achieve greater relevance for
individual users than common web search. Clearly, person-
alization brings significant benefit to users in this case. In
contrast, for the query “Google”, which is a typical naviga-
tional query as defined in [3, 17], almost all users consistently
select the result leading to Google’s homepage, and therefore
no personalization strategy could provide significant benefits
to users.
(2) Different strategies may have different effects on differ-
ent queries. For the query “free mp3 download”, using the
typical user interest profile-based personalization such as the
method proposed in [6], which led to better results for the
query “mouse”, we may achieve poor results because the
results for query “free mp3 download” are mostly classified
into one topic category and the profile-based personaliza-
tion is too coarse to filter out the desired results. In such a
case, simply leveraging pages visited by this user in the past
may achieve better performance. Furthermore, simply ap-
plying one personalization strategy on some queries without
any consideration may harm user experience. For example,
when a sports fan submits the query “office”, he/she may
not be seeking information on sports, but may be seeking
help on Microsoft Office software or any number of other
office-related topics. In this situation, if interest-based
personalization is done, many irrelevant results could erro-
neously be moved to the front and the user may become
confused.
(3) Personalization strategies may have different effective-
ness for different search histories and under varying con-
texts. For example, it could be difficult to learn the inter-
ests of users who have done few searches. Furthermore, as
Shen et al. [25] noted, users often search for documents to
satisfy short-term information needs, which may be incon-
sistent with general user interests. In such cases, long-term
user profiles may be useless and short-term query context
may be more useful.
In short, the effectiveness of a specific personalized search
strategy may show great improvement over that of non-
personalized search on some queries for some users, and un-
der some search contexts, but it can also be unnecessary
and even harmful to search under some situations. Until
now, little investigation has been done on how personaliza-
tion strategies perform under different situations.
In this
paper, we draw some preliminary conclusions on this problem and make
the following contributions:
(1) We develop a large-scale personalized search evalua-
tion framework based on query logs. In this framework, dif-
ferent personalized re-ranking strategies are simulated and
the search accuracy is approximately evaluated by real user
clicks recorded in query logs automatically. The framework
enables us to evaluate personalization on a large scale.
(2) We propose two click-based personalized search strate-
gies and three profile-based personalized search strategies.
We evaluate all five approaches in the evaluation framework
using 12-day query logs from the MSN search engine1 and pro-
vide an in-depth analysis of the results.
(3) We reveal that personalization has different effective-
ness on different queries, users, and search contexts. Person-
alization brings significant search accuracy improvements on
the queries with large click entropy, and has little effect on
the queries with small click entropy. Personalization strate-
gies can even harm the search accuracy on some queries.
Therefore, we conclude that not all queries should be per-
sonalized equally.
(4) We show that click-based personalization strategies
perform consistently and considerably well, though they can
only work on repeated queries. We find that our profile-
based strategies are unstable because of the straightforward
implementation. We also find the profile-based methods be-
come more unstable when a user's search history grows. We
reveal that both long-term and short-term contexts are very
important in improving search performance for profile-based
personalization.
The remaining sections are organized as follows. In Sec-
tion 2, we discuss related work. We present a re-ranking
framework and introduce how to use this framework to eval-
uate the personalization strategies in Section 3. In Section 4,
we present several personalization strategies at both the person
and group levels. In Section 5, we introduce the dataset used in
our experiments and detailed data statistics. We compare
and analyze the results of these strategies in Section 6. We
conclude our work in Section 7.
2. RELATED WORK
There have been several prior attempts at personalizing web
search. One approach is to ask users to specify general in-
terests. The user interests are then used to filter search
results by checking content similarity between returned web
pages and user interests [22, 6]. For example, [6] used ODP2
entries to implement personalized search based on user pro-
files corresponding to topic vectors from the ODP hierarchy.
Unfortunately, studies have also shown that the vast major-
ity of users are reluctant to provide any explicit feedback
on search results and their interests [4]. Many later works
on personalized web search focused on how to automati-
cally learn user preferences without any user effort [22, 19,
29, 26]. User profiles are built in the form of user interest
categories or term lists/vectors. In [19], user profiles were
represented by a hierarchical category tree based on ODP
and corresponding keywords associated with each category.
User profiles were automatically learned from search his-
tory. In [29], user preferences were built as vectors of distinct
terms and constructed by accumulating past preferences, in-
cluding both long-term and short-term preferences. Tan et
al.
[31] used the methods of statistical language modeling
to mine contextual information from long-term search his-
tory. In this paper, user profiles are represented as weighted
topic categories, similar to those given in [28, 6, 22], and
these profiles are also automatically learned from users’ past
clicked web pages.
Many personalized web search strategies based on hy-
perlink structure of the Web have also been investigated. Per-
sonalized PageRank, which is a modification of the global
PageRank algorithm, was first proposed for personalized
web search in [20]. In [10], multiple Personalized PageRank
scores, one for each main topic of ODP, were used to en-
able “topic sensitive” web search. Jeh and Widom [14] gave
an approach that could scale well with the size of hub vec-
tors to realize personalized search based on Topic-Sensitive
PageRank. The authors of [32] extended the well-known
HITS algorithm by artificially increasing the authority and
hub scores of the pages marked relevant by the user in pre-
vious searches. Most recently, [17] developed a method to
automatically estimate user hidden interests based on Topic-
Sensitive PageRank scores of the user’s past clicked pages.
In most of the above personalized search strategies, only the
information provided by the user himself/herself is used to cre-
ate user profiles. There are also some strategies that in-
corporate the preferences of a group of users to accomplish
personalized search. In these approaches, the search his-
tories of users who have similar interests to the test user are
used to refine the search. Collaborative filtering is a typical
group-based personalization method and has been used in
personalized search in [29] and [30]. In [29], users’ profiles
were constructed based on a modified collaborative fil-
tering algorithm [15]. In [30], the authors proposed a novel
method, CubeSVD, to personalize web search by an-
alyzing the correlation among users, queries, and web pages
contained in click-through data. In this paper, we also intro-
duce a method which incorporates click histories of a group
of users to personalize web search.
Others have also found that personalization has vary-
ing effectiveness on different queries. For instance, Teevan
et al. [34] suggested that not all queries should be handled
in the same manner. For less ambiguous queries, current
web search ranking might be sufficient and thus personaliza-
tion is unnecessary. In [6] and [5], test queries were divided
into three types: clear queries, semi-ambiguous queries, and
ambiguous queries. The authors also concluded that per-
sonalization significantly increased output quality for am-
biguous and semi-ambiguous queries, but for clear queries,
one should prefer common web search. In [31], queries were
divided into fresh queries and recurring queries. The au-
thors found that recent history tended to be much more
useful than remote history especially for fresh queries while
the entire history was helpful for improving the search accu-
racy of recurring queries. This also gave us a sense that not
all queries should be personalized in the same way. These
conclusions inspired our detailed analysis.

1 MSN Search, http://search.msn.com
2 Open Directory Project, http://dmoz.org/
3. EXPERIMENT METHODOLOGY
The typical evaluation method used in existing personal-
ized search research is to conduct user studies [23, 26, 6,
34, 28, 29, 19, 5, 31]. Usually, a certain number of peo-
ple participate in the evaluated personalized search system
over several days. The user profiles are manually specified
by participants themselves [6] or automatically learned from
search histories. To evaluate the performance of personal-
ized search, each participant is required to issue a certain
number of test queries and determine whether each result is
relevant. The advantage of this approach is that the rele-
vance can be explicitly specified by participants. Unfortu-
nately, this method still has some drawbacks: the limited
number of participants and test queries may bias the accuracy
and reliability of the evaluation.
We propose a framework that enables large-scale evalu-
ation of personalized search.
In this framework, we use
click-through data recorded in query logs to simulate user
experience in web search. In general, when a user issues a
query, he/she usually checks the documents in the result list
from top to bottom. He/she clicks one or more documents
which look more relevant to him/her, and skips the docu-
ments which he/she is not interested in. If a specific per-
sonalization method can re-rank the “relevant” documents
closer to the front of the result list, the user will be more satisfied with
the search. Therefore, we utilize clicking decisions as rel-
evance judgments to evaluate the search accuracy. Since
click-through data can be collected at low cost, it is possible
to do large-scale evaluation using this framework. Further-
more, since click-through data reflect real-world distribu-
tions of queries, users, and user selections, evaluating per-
sonalized search with click-through data could be more ac-
curate than relying on user surveys.
One potential concern about the evaluation framework is
that the original user selections may be influenced by initial
result rankings [2], and thus it could be unfair to evaluate
a reordering of the original search results using the original
click data. Our framework may fail to evaluate the re-ranking
of documents that are relevant but were not clicked by users,
and this may bias the evaluation. However, our framework is
still effective for evaluating approximate search accuracy. It
is the best method we could adopt to enable large-scale eval-
uation of personalized search. We will investigate a more
robust methodology in future work.
In the evaluation framework, we use MSN query logs to
simulate and evaluate the personalized re-ranking. In MSN
query logs, each user is identified by a “Cookie GUID”, which
remains the same on a machine as long as the cookie is not
cleared. For each query, the MSN search engine logs the query
terms and records all click-through information, including
clicked web pages and their ranks. A “Browser GUID”,
which remains the same until the browser is closed and re-
opened, is also recorded for each query. It is used as a simple
identifier of
a session, which includes a series of queries made by a single
user within a small range of time and is usually meant to
capture a single user’s attempt to fulfill a single information
need [27, 12].
3.1 Re-ranking Evaluation Framework
In this evaluation framework, we first download search
results from the MSN search engine, then use one personaliza-
tion strategy to re-rank the results. The click-through data
recorded in the test set are then used to evaluate the re-ranking
performance. In more detail, query re-ranking and evalua-
tion are completed in the following steps:
(1) Download the top 50 search results from MSN search
engine for the test query. We denote the downloaded web
pages with U and denote the rank list that contains the
rankings of the web pages with τ1.
(2) Compute a personalized score for each web page xi ∈
U using a personalization algorithm and then generate a new
rank list τ2 with respect to U sorted by descending person-
alized scores. The personalized strategies are introduced in
Section 4.
(3) Combine the rankings in τ1 and τ2 using Borda's rank-
ing fusion method [13, 8] and sort the web pages with the
combined rankings. The final rank list is denoted with τ .
τ is the final personalized search result list for the query.
Notice that we use the rank-based ranking fusion method
because we are unable to get the relevance scores from MSN
search engine.
(4) Use the measurements introduced in Section 3.2 to
evaluate the personalization performance on τ .
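To make the pipeline concrete, the following is a minimal Python sketch of steps (2) and (3): scoring the downloaded pages with a personalization strategy and fusing the personalized ranking τ2 with the original ranking τ1 by Borda count. The function names, the equal weighting of the two lists, and the tie-breaking rule are illustrative assumptions; the exact Borda variant is not specified above.

```python
def borda_fuse(tau1, tau2):
    """Fuse two rank lists over the same URLs by Borda count.

    tau1, tau2: lists of URLs ordered from best to worst. Each URL
    receives (n - rank) points from each list; the fused list tau is
    sorted by total points, with ties broken by the original order tau1.
    """
    n = len(tau1)
    points = {url: 0.0 for url in tau1}
    for ranking in (tau1, tau2):
        for rank, url in enumerate(ranking):
            points[url] += n - rank
    return sorted(tau1, key=lambda url: -points[url])  # stable sort keeps tau1 order on ties


def personalized_rerank(results, score_fn):
    """results: the top-50 URLs downloaded from the search engine (tau1).
    score_fn: a personalization strategy returning a score for each URL.
    Returns the final personalized rank list tau."""
    tau1 = list(results)
    tau2 = sorted(tau1, key=score_fn, reverse=True)  # tau2: sorted by personalized score
    return borda_fuse(tau1, tau2)
```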
In this paper, we assume the results downloaded from
MSN search engine are consistent with those returned to
the user when the query was submitted. We use the most
recent MSN query logs, from August 2006, and download
search results in early September 2006 so that changes in
the search results can be ignored (we also tried rebuilding
the search results from the query logs, but this failed be-
cause of the sparseness of queries and user clicks).
We downloaded only the top 50 search results because, in the
test set, most users never look beyond the top 50 entries.
3.2 Evaluation Measurements
We use two measurements to evaluate the personalized
search accuracy of different strategies: rank scoring metric
introduced in [30, 15] and average rank metric introduced
in [23].
3.2.1 Rank Scoring
The rank scoring metric proposed by Breese et al. [15] was
originally used to evaluate the effectiveness of collaborative
filtering systems, which return an ordered list of recommended
items. Sun et al. [30] used it to evaluate personalized web
search accuracy, and we also use it in this paper.
The expected utility of a ranked list of web pages is defined
as

$$R_s = \sum_j \frac{\delta(s, j)}{2^{(j-1)/(\alpha-1)}}$$

where j is the rank of a page in the list, δ(s, j) is 1 if the page
at rank j is clicked in test query s and 0 otherwise, and α is
set to 5 as the original authors did. The final rank scoring
reflects the utilities of all test queries:

$$R = 100 \, \frac{\sum_s R_s}{\sum_s R_s^{Max}} \qquad (1)$$

Here, $R_s^{Max}$ is the maximum possible utility, obtained when
all pages which have been clicked appear at the top of the
ranked list. A larger rank scoring value indicates better per-
formance of personalized search.
3.2.2 Average Rank
The average rank metric is used to measure the quality of
personalized search in [23, 28]. The average rank of a query
s is defined as below:
$$AvgRank_s = \frac{1}{|P_s|} \sum_{p \in P_s} R(p)$$
Here Ps denotes the set of clicked web pages on test query
s, R(p) denotes the rank of page p. The final average rank
on test query set S is computed as:
$$AvgRank = \frac{1}{|S|} \sum_{s \in S} AvgRank_s \qquad (2)$$
A smaller average rank value indicates better placement of
relevant results, or better result quality.
In fact, the rank scoring metric and the average rank metric have
similar effectiveness in evaluating personalization perfor-
mance, and our experimental results show that they are
consistent.
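For concreteness, both metrics can be computed as in the minimal sketch below, which assumes each test query is given as a ranked result list plus the set of clicked URLs, with α = 5 as above; this data representation is our own illustrative choice.

```python
ALPHA = 5  # half-life parameter of the rank scoring metric

def rank_scoring(test_queries):
    """test_queries: iterable of (ranked_urls, clicked_urls) pairs.
    Returns the overall rank scoring R of Equation (1)."""
    total, total_max = 0.0, 0.0
    for ranked_urls, clicked in test_queries:
        # R_s: utility of the actual ranking.
        r_s = sum(1.0 / 2 ** ((j - 1) / (ALPHA - 1))
                  for j, url in enumerate(ranked_urls, start=1) if url in clicked)
        # R_s^Max: utility if all clicked pages were ranked at the top.
        r_max = sum(1.0 / 2 ** ((j - 1) / (ALPHA - 1))
                    for j in range(1, len(clicked) + 1))
        total += r_s
        total_max += r_max
    return 100.0 * total / total_max

def average_rank(test_queries):
    """Returns the average rank of Equation (2) over all test queries."""
    per_query = []
    for ranked_urls, clicked in test_queries:
        ranks = [j for j, url in enumerate(ranked_urls, start=1) if url in clicked]
        per_query.append(sum(ranks) / len(ranks))  # AvgRank_s
    return sum(per_query) / len(per_query)
```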
4. PERSONALIZATION STRATEGIES
As we described in Section 2, personalized search methods
can be categorized into person level and group level. In this
paper, we propose several re-ranking methods in both levels
to accomplish personalized search. These strategies are used
to re-rank search results by computing a personalized score
S(q, p, u) for each page p in the results returned to user u on
query q, as introduced in Section 3. In the following sections,
we will introduce the strategies.
4.1 Person-level Re-ranking
4.1.1 Person-level Re-ranking Based on Historical Clicks
We suppose that for a query q submitted by a user u,
the web pages frequently clicked by u in the past are more
relevant to u than those seldom clicked by u, and thus, the
personalized score on page p can be computed by:
$$S_{P\text{-}Click}(q, p, u) = \frac{|Clicks(q, p, u)|}{|Clicks(q, \bullet, u)| + \beta} \qquad (3)$$
Here, |Clicks(q, p, u)| is the number of clicks on web page p by
user u on query q in the past, |Clicks(q, •, u)| is the total
number of clicks on query q by u, and β is a smoothing factor
(β = 0.5 in this paper). Notice that |Clicks (q, p, u)| actually
decides the ranking of the page, while |Clicks (q, •, u)| and
β are only used for normalization.
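A minimal sketch of the P-Click score of Equation 3 follows; it assumes the user's click history is stored in a nested dictionary clicks[user][query][url] of counts, which is an illustrative data structure rather than part of the original method.

```python
BETA = 0.5  # smoothing factor used in Equation (3)

def p_click_score(clicks, user, query, url):
    """S_P-Click(q, p, u) = |Clicks(q, p, u)| / (|Clicks(q, ., u)| + beta)."""
    query_clicks = clicks.get(user, {}).get(query, {})  # past clicks of this user on this query
    page_clicks = query_clicks.get(url, 0)               # |Clicks(q, p, u)|
    total_clicks = sum(query_clicks.values())            # |Clicks(q, ., u)|
    return page_clicks / (total_clicks + BETA)
```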
A disadvantage to this approach is that the re-ranking
will fail when the user has never asked this query. We find
that in our dataset, about one-third of the test queries are
repeated by the same user, and this approach will only bring
benefits to these queries. This approach is denoted with P-Click.
4.1.2 Person-level Re-ranking Based on User Interests
As introduced in Section 2, many current studies use
interest profiles to personalize search results [22, 19, 6]. In
this paper, we also propose a personalization method based
on the user interest profile (we denote this method with L-
Profile). A user's profile cl(u) is represented as a weight
vector over 67 pre-defined topic categories provided by KDD
Cup-2005 [18]. When a user submits a query, each of the
returned web pages is also mapped to a weighted category
vector. The similarity between the user profile vector and the
page category vector is then used to re-rank search results:
$$S_{L\text{-}Profile}(q, p, u) = \frac{c_l(u) \cdot c(p)}{\|c_l(u)\| \, \|c(p)\|} \qquad (4)$$
Here c(p) is the category vector of web page p. c(p) is gener-
ated by a tool using the query and web page classification
method introduced in [24]. Given a web page p, the tool re-
turns the top 6 categories to which p belongs, with corresponding
confidences. Each component c (p)i of c (p) is the classifi-
cation confidence returned by the tool, which means the
probability that page p should be classified into category i.
If category i is not in the 6 categories returned by the tool,
then we set c (p)i = 0. User’s profile cl (u) is automatically
learned from his/her past clicked web pages as the following
equation:
$$c_l(u) = \sum_{p \in P(u)} P(p|u) \, w(p) \, c(p)$$
Here P(u) is the collection of web pages visited by user u in
the past. P (p|u) can be thought of as the probability that
user u clicks web page p, i.e.,
$$P(p|u) = \frac{|Clicks(\bullet, p, u)|}{|Clicks(\bullet, \bullet, u)|}$$
Here, |Clicks(•, •, u)| is the total number of clicks made by u
and |Clicks(•, p, u)| is the number of clicks on web page p made
by u. w(p) is the impact weight of page p when generating
user profiles. We assume that web pages visited by
many users are less important when building user profiles,
thus,
$$w(p) = \log \frac{|U|}{|U(p)|}$$
|U| is the total number of users; |U(p)| is the number of users
who have ever visited web page p.
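The long-term profile c_l(u) can be accumulated as in the following minimal sketch; it assumes c(p) is available as a sparse dictionary of classification confidences for each page (the classifier of [24] is not reproduced here), and the argument names are illustrative.

```python
import math
from collections import defaultdict

def long_term_profile(user_clicks, page_categories, page_visitors, num_users):
    """user_clicks: dict url -> number of clicks by this user (for P(p|u)).
    page_categories: dict url -> dict category -> confidence, i.e. c(p).
    page_visitors: dict url -> number of distinct users who visited the page, |U(p)|.
    num_users: total number of users |U|.
    Returns c_l(u) as a dict mapping category -> weight."""
    total_clicks = sum(user_clicks.values())
    profile = defaultdict(float)
    for url, clicks in user_clicks.items():
        p_given_u = clicks / total_clicks                  # P(p|u)
        weight = math.log(num_users / page_visitors[url])  # w(p) = log(|U| / |U(p)|)
        for category, confidence in page_categories.get(url, {}).items():
            profile[category] += p_given_u * weight * confidence
    return dict(profile)
```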
In method L-Profile, the user's profile cl(u) is accumulated
from the web pages the user visited in the past. This profile
is called a long-term profile in previous work [29, 26, 31]. In
fact, as investigated by [26], a short-term user profile is more
useful for improving search in the current session. In this
paper, we use the clicks on previous queries in the current
session to build a user's short-term profile. A user's short-
term profile cs(u) is computed as below:
$$c_s(u) = \frac{1}{|P_s(q)|} \sum_{p \in P_s(q)} c(p)$$
Here Ps(q) is the collection of pages visited on previous queries
in the current session. The personalized score of page p using
the short-term profile is computed as in the following equation:
$$S_{S\text{-}Profile}(q, p, u) = \frac{c_s(u) \cdot c(p)}{\|c_s(u)\| \, \|c(p)\|} \qquad (5)$$
This approach is denoted with S-Profile.
We can also fuse the long-term personalized score and the
short-term personalized score using a simple linear combi-
nation:
$$S_{LS\text{-}Profile}(q, p, u) = \theta \, S_{L\text{-}Profile}(q, p, u) + (1 - \theta) \, S_{S\text{-}Profile}(q, p, u) \qquad (6)$$
We denote this approach with LS-Profile. Methods L-
Profile, S-Profile, and LS-Profile are generally called profile-
based methods for short in this paper.
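Equations 4-6 reduce to cosine similarities between sparse category vectors plus a simple linear combination, as in the minimal sketch below; representing profiles and page vectors as dictionaries of category weights is an illustrative choice.

```python
import math

def cosine(v1, v2):
    """Cosine similarity between two sparse category vectors (dicts)."""
    dot = sum(w * v2.get(cat, 0.0) for cat, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def ls_profile_score(long_profile, short_profile, page_vector, theta=0.3):
    """Linear fusion of the long-term and short-term scores (Equation 6)."""
    s_long = cosine(long_profile, page_vector)    # S_L-Profile, Equation (4)
    s_short = cosine(short_profile, page_vector)  # S_S-Profile, Equation (5)
    return theta * s_long + (1 - theta) * s_short
```

With θ = 1 the score reduces to L-Profile and with θ = 0 to S-Profile; θ = 0.3 is the empirical setting used in Section 6.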
4.2 Group-level Re-ranking
We use the K-Nearest Neighbor Collaborative Filtering
algorithm to test group-based personalization. Due to the
data sparsity in our dataset, using traditional CF methods
on web search is inadequate. Instead, we compute the user
similarity based on long-term user profiles:
$$Sim(u_1, u_2) = \frac{c_l(u_1) \cdot c_l(u_2)}{\|c_l(u_1)\| \, \|c_l(u_2)\|}$$
The K nearest neighbors are obtained based on the user
similarity computed as follows:

$$S_u(u_a) = \{u_s \mid rank(Sim(u_a, u_s)) \le K\}$$
Then we use the historical clicks made by similar users to
re-rank the search results:
$$S_{G\text{-}Click}(q, p, u) = \frac{\sum_{u_s \in S_u(u)} Sim(u_s, u) \, |Clicks(q, p, u_s)|}{\beta + \sum_{u_s \in S_u(u)} |Clicks(q, \bullet, u_s)|} \qquad (7)$$
We denote this approach with G-Click.
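A minimal sketch of G-Click follows, covering both the neighbor selection and the score of Equation 7; it reuses the cosine helper from the profile-based sketch above and the nested click dictionary assumed in the P-Click sketch, both of which are illustrative assumptions.

```python
BETA = 0.5  # same smoothing factor as in P-Click

def k_nearest_neighbors(test_user, profiles, k=50):
    """profiles: dict user -> long-term profile c_l(u) (dict of category weights).
    Returns the K users most similar to test_user as (user, Sim) pairs, i.e. S_u(u)."""
    sims = [(other, cosine(profiles[test_user], profile))
            for other, profile in profiles.items() if other != test_user]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:k]

def g_click_score(clicks, neighbors, query, url):
    """Implements Equation (7). clicks: dict user -> query -> url -> count;
    neighbors: the (user, Sim) pairs returned by k_nearest_neighbors."""
    numerator, denominator = 0.0, BETA
    for similar_user, sim in neighbors:
        query_clicks = clicks.get(similar_user, {}).get(query, {})
        numerator += sim * query_clicks.get(url, 0)   # Sim(u_s, u) * |Clicks(q, p, u_s)|
        denominator += sum(query_clicks.values())      # |Clicks(q, ., u_s)|
    return numerator / denominator
```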
5. DATASET
In this section, we introduce the dataset used in our ex-
periments.
5.1 Statistics about Dataset
We collect a set of MSN query logs for 12 days in August
2006 for our experiments. Because the entire log set is too
large, we randomly sample 10,000 distinct users (identified
by “Cookie GUID”) from the users in the United States
on August 19, 2006. These users and their click-through
logs are extracted as our dataset. In addition, the queries
without any clicks (about 34.6% of all queries) are excluded
from the dataset because they are useless in our experiments.
The entire dataset is split into two parts: a training set and
a test set. The training set contains the log data of the first
11 days and the log data of the last day is used for testing.
Table 1 summarizes the statistics. Notice that all 10,000
users have search activities in the training set because users
are sampled from the logs of training days.
Because users are randomly sampled, this dataset could
reflect the characteristics of the entire logs. It also has simi-
lar characteristics to those in existing reports [27, 37, 11, 36,
1]. We show detailed statistics of the dataset in the following
sections.
Table 1: Basic statistics of dataset

Item              | ALL    | Training | Test
#days             | 12     | 11       | 1
#users            | 10,000 | 10,000   | 1,792
#queries          | 55,937 | 51,334   | 4,639
#distinct queries | 34,203 | 31,777   | 3,465
#Clicks           | 93,566 | 85,642   | 7,924
#Clicks/#queries  | 1.6727 | 1.6683   | 1.7081
#sessions         | 49,839 | 45,981   | 3,865
[Figure 1: Query popularity distributions. (a) Distribution of query frequency (log scale); x-axis: Query ID (ordered by query times in descending order), y-axis: query times. (b) Distribution of the number of users per query (log scale); x-axis: Query ID (ordered by number of users in descending order), y-axis: number of users.]
5.2 Statistics about Queries
In our dataset, more than 80% of distinct queries are only is-
sued once in the 12-day period, and about 90% of distinct query
strings are issued by only one user. The 3% most popular
distinct queries are issued by more than 47% of users. These
statistics are similar to those given in [27, 37, 11], and this
indicates that information needs on the Web are quite di-
verse. Furthermore, we find that query frequency can also
be characterized by Zipf distributions, consistent with that
found by [37]. Figure 1(a) plots the distributions of query
frequency. In this figure, queries are sorted by query times
in descending order: the first query is the most popular one,
and the last is the least popular one. Figure 1(b) plots
the distribution of the number of users for each query.
5.3 Statistics about Test Users
As Table 1 shows, 1,792 users have search activities on the
test day. Figure 2 plots the distribution of historical (i.e. in
training days) query days and query times for the users in
the test set. Because users are sampled on one of the training
days, each user has at least a one-day query history. We find
that about 30% of users in the test set have more than 5 days
of query history, and about 50% of them submit more than 10
queries in the training days.

[Figure 2: Distributions of user search frequency in training days for test users. (a) Distribution of user historical query days; x-axis: user historical query days, y-axis: percentage of users. (b) Distribution of user historical query times; x-axis: user historical query times, y-axis: percentage of users.]

[Figure 3: Distribution of query number per session; x-axis: queries per session, y-axis: percentage of sessions.]
5.4 Statistics about Query Repetition
In our dataset, 2,130 (about 46%) of all 4,639 queries in
the test set are repeated ones that have been submitted in
the training days, either by the same user or by different
users. Furthermore, 1,535 queries (72% of repeated ones
and 33% of test queries) are repeated by the same user.
These results are consistent with those given in [33] and are
helpful for personalized search.
5.5 Statistics about Sessions
In our dataset, we use “Browser GUID” as a simple iden-
tifier of a session. A user opens a browser, issues one or
more queries, and then closes the browser: the whole process
is considered a session in this paper. Figure 3 shows the
distribution of the number of queries in a session. About 30%
of sessions contain at least two queries. This indicates that
users sometimes submit several queries to fulfill an informa-
tion need.
5.6 Distribution of Query Click Entropies
As found by [34], for queries that show less variation
among individuals, personalization may provide little benefit.
Click entropy is a direct indication of query click varia-
tion. In this paper, we define the click entropy of a query as
Equation 8:

$$ClickEntropy(q) = \sum_{p \in P(q)} -P(p|q) \log_2 P(p|q) \qquad (8)$$

Here ClickEntropy(q) is the click entropy of query q, and
P(q) is the collection of web pages clicked on query q. P(p|q)
is the percentage of the clicks on web page p among all the
clicks on q, i.e.,

$$P(p|q) = \frac{|Clicks(q, p, \bullet)|}{|Clicks(q, \bullet, \bullet)|}$$

If all users click only one and the same page on query q, then
we have ClickEntropy(q) = 0. A smaller click entropy means
that the majority of users agree with each other on a small
number of web pages.
In such cases, there is no need to
do personalization. Large click entropy indicates that many
web pages were clicked for the query. This may mean: (1) a
user has to select several pages to satisfy this query, which
means the query is an informational query [3, 17]. Person-
alization can help to filter the pages that are more relevant
to users by making use of historical selections. (2) Different
users have different selections on this query, which means
the query is an ambiguous query. In such cases, personal-
ization can be used to provide different web pages to each
individual.
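Equation 8 can be computed directly from the per-page click counts observed for a query, as in the following minimal sketch (the dictionary representation of click counts is an illustrative assumption):

```python
import math

def click_entropy(page_click_counts):
    """page_click_counts: dict url -> number of clicks on that page for the query.
    Returns ClickEntropy(q) = sum_p -P(p|q) * log2(P(p|q))."""
    total = sum(page_click_counts.values())
    entropy = 0.0
    for count in page_click_counts.values():
        if count > 0:
            p = count / total           # P(p|q)
            entropy -= p * math.log2(p)
    return entropy
```

For example, a navigational query on which every user clicks the same single result has entropy 0, while a query whose clicks are spread evenly over four pages has entropy 2.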
Figure 4(a) shows the click entropy distribution. More
than 65% of queries have low click entropy, between 0 and 0.5.
We find many of these queries are submitted only by one user
and the user only clicks one web page. Figure 4(b) shows
the click entropy distribution for queries asked more than
five times. Figure 4(c) plots the click entropy distribution
for queries submitted by at least three users. From these
figures, we also find that the majority of the more popular
queries have low click entropies.
6. RESULTS AND DISCUSSIONS
In this section, we will give detailed evaluation and anal-
ysis of the five personalized search strategies. Notice that
the original web search method without any personalization,
which is used for comparing with the personalized methods,
is denoted with “WEB”. We let K = 50 for method G-Click
and θ = 0.3 for method LS-Profile; both settings are em-
pirical.
[Figure 4: Distribution of query click entropy; y-axis: percentage of queries, x-axis: click entropy bins from 0-0.5 to >5. (a) All queries. (b) Queries with query times > 5. (c) Queries with user number > 2.]

In our experiments, we find that 676 of the total 4,639 test
queries lost their clicked web pages in the downloaded search
results. This is because the MSN search results have changed
for these queries. We excluded these queries when reporting
the experimental results in the following sections. Further-
more, we find that for 57% (2,256/3,963) of the remaining
queries, users select only the top results. In other words, the
original search method WEB has already done the best on
these queries and personalization does not provide improve-
ments. We call the other 1,707 queries, on which users select
not only the top results, not-optimal queries.
6.1 Overall Performance of Strategies
Table 2 shows the overall effectiveness of the personaliza-
tion strategies on the test queries. We find:
(1) Click-based personalization methods G-Click and P-
Click outperform method WEB on the whole. For instance,
on the not-optimal queries, method P-Click has a signifi-
cant (p < 0.01) 3.68% improvement over method WEB and
method G-Click has a significant (p < 0.01) 3.62% improve-
ment over WEB (using rank scoring metric). P-Click and
G-Click methods also have significant improvements (1.39%
and 1.37%) over WEB on all test queries including both
not-optimal and optimal queries. These results show that
click-based personalization methods can generally improve
web search performance.
(2) Methods P-Click and G-Click show no significant dif-
ference in performance on the whole. In our experiments, we
sample 10,000 users and select the 50 most similar users for
each test user in the G-Click approach (we also tried selecting
20 and 100 users, but they showed no significant difference).
Because of high query sparsity among users, the selected
similar users may have few search histories on the queries
submitted by the test user. This makes group-level personal-
ization show no significant improvement over person-level
personalization. If more days' logs are given and more users
are selected, method G-Click may perform better.
(3) Profile-based methods L-Profile, S-Profile, and LS-
Profile perform less well on average. We compute rank scor-
ings of all the methods for each single test query and then
plot the distributions of rank scoring increment over WEB
method for each personalization strategy in Figure 5. We
find that though L-Profile, S-Profile, and LS-Profile meth-
ods improve the search accuracy on many queries, they also
harm the performance on more queries, which makes them
perform worse on average. This indicates that the straight-
forward implementation of profile-based strategies we em-
ploy in this paper does not work well, or at least not as stably
as the click-based ones. We will give some analysis of why our
profile-based methods are unstable in Section 6.5.
Table 2: Overall performance of personalization strategies. R.S. denotes the rank scoring metric and A.R. denotes the average rank metric.

method     | all R.S. | all A.R. | not-optimal R.S. | not-optimal A.R.
WEB        | 69.4669  | 3.9240   | 47.2623          | 7.7879
P-Click    | 70.4350  | 3.7338   | 49.0051          | 7.3380
L-Profile  | 66.7378  | 4.5466   | 45.8485          | 8.3861
S-Profile  | 66.7822  | 4.4244   | 45.1679          | 8.3222
LS-Profile | 68.5958  | 4.1322   | 46.6518          | 8.0445
G-Click    | 70.4168  | 3.7361   | 48.9728          | 7.3433
6.2 Performance on Different Click Entropies
We give the average search accuracy improvements of dif-
ferent personalization strategies on the test queries with dif-
ferent click entropy in Figure 6. We use only the queries
asked by at least three users to make the click entropy more
reliable.
We see that the improvement of the personalized search
performance increases when the click entropy of query be-
comes larger, especially when click entropy ≥ 1.5. For the
click-based methods P-Click and G-Click, the improvement
of personalization is very limited on the queries with click
entropies between 0 and 0.5. The G-Click method, which
gets the best performance for these queries, has only a non-
significant 0.37% improvement over the WEB method in the rank
scoring metric. This means users have small variance on
these queries, and the search engine has already done well for
these queries. On the queries with click entropy ≥ 2.5, the re-
sult is quite different: both P-Click and G-Click achieve sub-
stantial improvements. In the rank scoring metric, method
G-Click has a significant (p < 0.01) 23.37% improvement
over method WEB and method P-Click has a significant
(p < 0.01) 23.68% improvement over method WEB. Profile-
based methods L-Profile, S-Profile and LS-Profile worsen
search performance when click entropy < 1.5, while L-Profile
and LS-Profile achieve better performance on queries
with click entropy ≥ 1.5 (we wonder why method L-Profile
also worsens search accuracy when click entropy ≥ 2.5 and
will provide additional analysis on this in future work).
All these results indicate that on the queries with small
click entropy (which means that these queries are less am-
biguous or more navigational), the benefit of personalization
is limited and thus personalization is unnecessary.
[Figure 5: Distributions of rank scoring increment over the WEB method for each strategy: (a) P-Click - WEB, (b) L-Profile - WEB, (c) S-Profile - WEB, (d) LS-Profile - WEB, (e) G-Click - WEB. The count of test queries within each rank scoring increment range is plotted on the y-axis with a log scale; x-axis: rank scoring increment (-100 to 100).]
[Figure 6: Search accuracy improvements over the WEB method on queries with different click entropies. Only the queries asked by at least three users are included. (a) Rank scoring improvement; (b) Average rank improvement (note that smaller average rank means higher search accuracy). x-axis: click entropy bins from 0.0-0.5 to >=2.5; series: WEB, P-Click, L-Profile, S-Profile, LS-Profile, G-Click.]
6.3 Performance on Repeated Queries
In Subsection 5.4, we found that about 46% of test queries
are repeated by the same user or different users, and 33% of
queries are repeated by the same user. This means that users
often revisit queries and results they have referred to be-
fore. Teevan et al. [33] have also observed that re-finding
behavior is common, and have shown that repeat clicks can
often be predicted based on a user's previous queries and
clicks.
In this pa-
per, methods P-Click and G-Click are based on historical
clicks. The high repetition ratio in the real world makes these
click-based personalization strategies work well.
Table 3(a) shows the personalization performance on the
not-optimal queries repeated by either the same user or dif-
ferent users, and Table 3(b) gives the results on the queries
repeated by the same user. We find that the personaliza-
tion methods P-Click and G-Click have significant improve-
ments over the WEB method on queries repeated by the
same user. This means that when a user re-submits a query,
his/her selections are highly consistent with the past
and the personalization based on his/her past clicks per-
forms well. These results tell us that we should record user
query and click histories and use them to improve future
search if no privacy problems exist. We should also provide
convenient ways for users to review their search histories,
just like those provided by some current search engines.
6.4 Performance on Variant Search Histories
Do users who frequently use the search engine benefit more
from personalized search? Do profile-based personalized
search strategies perform better as the search history
grows? To answer these questions, we plot the improve-
ments in rank scoring on queries issued by users with differ-
ent search frequencies in Figure 7. We find:
(1) Using the click-based methods P-Click and G-Click, users
who have more search activity in the training days do not
consistently benefit more than users who do less searching.
This is because users who frequently use the search engine
may have more varied information needs. They may repeat
old queries, but they may also submit many fresh queries,
which makes our click-based methods P-Click and G-Click
achieve similar average performance for users with different
search frequencies (notice that the two series for methods
P-Click and G-Click are very close to each other).
(2) Method L-Profile, which uses a user's long-term inter-
est profile, performs better when a user has more queries,
especially when the number of queries grows from 0 to 70.
This is because we can capture users' long-term interests more
accurately when their search histories are long enough. At
the same time, we find that the performance of L-Profile
becomes more unstable when the user has more and more
queries, especially when they have more than 80 queries.
This is because there is more noise in the queries and, further-
more, the users have more varied information needs. This tells
us that as a user's search history grows, we should analyze
the user's real information needs more carefully and select only
the appropriate parts of the search history to build user profiles. Tan et
al. [31] find that the best performance of profile-based per-
sonalized search methods they proposed is achieved when