logo资料库

数据索引与轨迹推荐.docx

第1页 / 共13页
第2页 / 共13页
第3页 / 共13页
第4页 / 共13页
第5页 / 共13页
第6页 / 共13页
第7页 / 共13页
第8页 / 共13页
资料共13页,剩余部分请下载后查看
Spatial Activities Trajectory Recommendation with Semantic Wei Sun Abstract With the development of mobile Internet technology, mobile social networks based on the mobile terminals have blown up in the recent years. Therefore, spatial keyword search and activity trajectory recommendation have been very important technologies widely used in travel planning service. Existing spatial keyword indexing querying methodologies mainly focus on the spatial and textual similarities, while lacking in the semantic understanding of spatial objects. At the aspect of activity trajectory recommendation, simple path search’s efficiency is so low that it cannot meet users’ need. In this paper, we study the problem of semantic based spatial keyword querying and spatial activity trajectory recommending with cluster. For returning k objects most similar to the query subjects to not only their spatial and textual properties, but also the coherence of semantic meanings, we propose a novel indexing structure called NIQ-tree which integrates spatial, textual and semantic information to prune the search space effectively in query processing. And some experiments are carried out to evaluate the efficiency of the algorithm we propose. For returning the optimal trajectory, we apply the cluster idea in our trajectory recommendation. Therefore we propose Spatial Sketch-Based Approximate Trajectory Recommending Algorithm which make it more efficient for activity trajectory recommendation. Finally, we introduce our developed activity trajectory recommendation system that integrates spatial keyword querying with semantic and spatial trajectory recommendation to certificate our work’s validity. Introduction With the development of mobile Internet technology, mobile social networks based on the mobile terminals has blown up in the recent years. Quantities of users can share their location so massive Geo-tagging data appear swiftly (e.g. users of Micro-blog Sina can show their locations, comments and photos). From the spatial-temporal features, this kind of data can be considered as the activity trajectory. As we know, traveling is a significant part of our daily life. Thanks to the route recommendation system and service, we can travel to unknown places and simply and accurately find our location. Behind the service, it uses spatial keyword querying technique and some efficient search for activity trajectories. Spatial keyword querying is a significant technique for location based services (LBS) system, such as Baidu Map and Google Map. Extensive efforts from academic and industrial communities have been made many effective work and support on keyword query. However, some early work[1-4] mainly focus on Spatial Keyword Boolean Query (SKBQ) that requires exact keyword match. Obviously it can lead no results to be returned on account of different textual expression. To solve the problem, researches proposed some novel indexing structures to support Spatial Keyword Approximate Query (SKAQ) more recently in [5-8], which are able to handle the spelling mistakes (e.g.. theater vs theatre) that frequently
appear in real applications. But still, they cannot retrieve the objects that are synonym but literally different to the keywords in query, such as ‘theater’ and ‘cinema’ or ‘coffee’ and ‘Starbucks’. This gap motivates us to investigate other approaches that can capture the semantic meanings of spatial keywords. In the meantime, increasing geo-textual objects are becoming available on the web that represent Point-of-Interest (POI). Each POI has a spatial location and a semantic description stored in the trajectory database. Some applications (e.g. Foursquare, Facebook Place and Flickr) are refining and enriching the traditional trajectory databases by associating locations with semantic meanings. In this paper, based on the Beijing road network data we design a activity trajectory recommendation system to exhibition Spatial keyword search and trajectory recommendation functions. On the Spatial keyword search, traditional methods fail to retrieve the objects that are synonym but literally different to the keywords in query. To sum up, our main contributions can be briefly summarized as follows: - We introduce and formalize a new probabilistic topic model based similarity measure between a query and a object. - We propose a novel hierarchical indexing structure called NIQ-tree to integrate the spatial, semantic and textual information of the objects. - We conduct an extensive experiment to make the comparisons with different algorithms to demonstrate the efficiency of our method. On activity trajectory search, our main contributions can be summarized as the follows: - We use Spatial Sketch-Based Approximate Trajectory Recommending Algorithm to query trajectory finally that can return the optimal trajectory. - We show our trajectory recommendation system and evaluate the efficiency. Problem statement Preliminaries In this section, we introduce some preliminaries and give some problems of this paper. Probabilistic Topic Model [9] is a mature natural language processing technique that has been proven to be successful on theme interpretation and document classification. In this paper, we apply the Latent Dirichlet Allocation model, i.e. one of the most frequently used probabilistic topic model, to understand the semantic of textual description by words with respect to the topics. Therefore topic can be understood as possible semantic meaning of textual data defined as follows: Definition 1. (Topic): A topicz represents a type of intended activity that a user can be interested in such as ‘Chinese restaurant’, ‘coffee shop’, ‘supermarket’ and so on.Z is a preprocessed topic set, which is the union of all topics used to describe the meaningful semantic of textual descriptions. By carrying out statistical analysis on the large amount of textual descriptions, the LDA model derives the semantic relevance of a topic to all relevant words, known as words distribution defined as follows:
Definition 2. (Words Distribution): Given the topic setZ and the set of all possible wordsV, a matrixM=Z×(⊆V) is used to represent the distribution of topics inZ over all words in V , where is words collection associated with topicz , apparently ⊆ . Each represents a word distribution, i.e. a distribution of a topicz∈Z over all words relevant to this ∈ topic. We use [] to denote the proportion of words distribution which satisfies =1. Definition 3. (Topic Distribution): Given a textual descriptionW, the topic distribution ofW, denoted asT, is the statistical proportion for each word inW, where the topic proportion T fromW to topicz is calculated as: T = ∈∩+ +× Where∈∩ is number of keywords belongs to the given textual description ofW in ; α is the symmetric Dirichlet prior and generally set to 0.1.|W| and|Z| are the number of keywords inW and topics inZ respectively. Definition 4. (Topic Distance): Given two textual descriptions W andW', their topic distance(,') is defined as: distance can be quantified by several similarity measures (e.g., Euclidean and Cosine Distance). Here, we adapt the cosine distance to measure their distance in high dimensional topic space, and normalize the result to range [0,1] by using the sigmoid function. The topic ,' = ∈(×'[]) ||||×||'|| (2) (1) (3) Definition 5. (Spatial Web Object): Given a spatial web object o, it can be formalized as In the real-world applications, a spatial web object is a shop, a supermarket, a restaurant or other points whose position and textual description are easy to get through the Internet. where two arbitrary textual descriptions is, the more relevant they are in semantics according to the LDA interpretation. |||| is modulus of in theZ dimensions. It is obvious that less topic distance of o=(o.λ,o.φ), whereo.λ is the position of o ando.φ is the textual description for describing o. Definition 6. (Spatial Keyword Query): Consider a queryq=(q.λ,q.φ,τ) whereq.λ is the τ is space;q.φ is a group of words that describe user’s intention, such as ‘Japanese restaurant’; denoted asTD(,), which is measured by Edit Distance[10] in this paper. and if only its textual distance to q is no more than the threshold of q, i.e.TD(,) <τ. query position of user which can be represented by longitude and latitude in a two-dimensional a user-specified threshold of textual distance. The distance between query q and spatial object o is Definition 7. (Candidate Object): Given a query q, a spatial is said to be a candidate object, if 1+−(q.λ,q.λ)−1 2 By combing the spatial distance and the topic distance, we further define a distance function of q Definition 8.(Distance Function): Given a query q and a spatial object o, their spatial distance are calculated by the Euclidean distance of their geographical locations. We normalize it to range [0,1] as shown in Equation 3. , = and o, denoted asD(q,o), as Dq,o =×, + 1− ×(,) Whereλ is a user-specified parameter to balance the weights of the spatial and semantic distance toq. (4)
represents a spatial web object, which is Problem Definition Definition 9. (Semantic Trajectory): A semantic trajectoryTr is defined as a sequence of geo-textual objects, i.e.Tr=(1,2,…,). Each associated with a semantic trajectoryTr∈T whereTis a pre-defined trajectory vocabulary. Q= 1,2,…, (1,2,3,…,−1, are the points which you need to pass by from the starting point q1 and terminal point q ). These points consist of a semantic trajectory. Meanwhile, a positive integer k and a semantic trajectory setT are given. We return the k candidate trajectories fromT that have the smallest minimum match distances with respect toQ. a query destination q 1 , Given a start point and a Baseline Algorithm Quadtree based Algorithm The first baseline algorithm uses the Quardtree[11] to prune the search space in spatial dimension. In the method, the Quadtree, which only utilizes the spatial coordinates of the points, is used to index these points in two-dimensional space directly. Given a query q, the first baseline traverse the index structure to find the spatial nearest object incrementally in terms of the spatial best match distance which is computed as follows: , =×(,) (5) It is easy to see that the spatial best match distance is always the lower bound of the distance between q and o. In the processing of the query, we keep finding the next nearest point o’(based on spatial best match distance) and computing its distance. IR-Tree Index Algorithm The second baseline adopts the IR-tree[12] as the indexing structure, which is used to support efficient spatial keyword search on static point set. The IR-tree is essentially an R-tree extended with inverted files[13]. The search algorithm based on IR-tree proceeds similarly with the R-tree based method, i.e., trying to find the most spatially close trajectories and then computing the minimum match distance with respect to the query. The only modification is, before probing the entries in a node of IR-tree, we first check its inverted file to see if it contains any point of query. If not, all places enclosed in this node can be pruned directly. By this means, this baseline is expected to examine fewer nodes than the R-tree base method, thus can achieve better efficiency. NIQ-tree based Algorithm In this section, we proposed an improved hybrid indexing structure NIQ-tree based on iDistance[14,15]. The iDistance is a well-known index scheme for high dimensional similarity search, with a basic idea to group all objects by clustering, which enables us to achieve superior pruning effect in querying. By utilizing iDistance to sketch the topic distributions, the NIQ-tree
is expected to support effective pruning on spatial, simultaneously. textual and semantic dimensions Indexing Structure. The NIQ-tree is a three layered hybrid indexing structure shown as Figure 2, where the spatial, semantic and textual layers are integrated in a vertical way. In designing NIQ-tree, we adopt a spatial first method due to the better pruning effect in spatial domains, which can be explained by its two dimensional nature(v.s. high dimensions of topic and textual domains). The basic form of a NIQ-tree node isN= (p.rect,o,r) , where p is the pointer(s) to its child node(s); rect is the minimum bounding rectangle(MBR) in spatial of all objects contained by N; o and r are the center point and radius of a topic space hyper-sphere that covers the topic distributions of all objects contained by N respectively. On top of all spatial web objects, we use Quadtree to index them in the spatial domain according to their spatial closeness first. For each leaf node of Quadtree, all objects are further organized by iDistance index in the semantic layer, such that objects are grouped and managed by their topic coherence. For each leaf nodes N of the semantic lists in the textual layer, and similar to MHR-tree, the n-gram inverted lists functionally sketch out the textual descriptions of objects contained in N , so that it is possible to filter irrelevant ob jects layer,it is referenced to a set of n-gram based inverted according to the q.φ andq.τ specified in query. Fig. 1 In constructing the NIQ-tree, we build up a Quadtree for all points in spatial database first like the first baseline algorithm. Then for the points in every leaf node of the Quardtree, we use iDistance to cluster these points based on their topic distributions of all contained objects, and construct a+−tree to organize the nodes(each node represents a cluster) according to the key wherei is the identifier of the partition,c is a constant to partition the single dimension space key=i×c+(,) value computed as follows:
i+1 ×c], is a reference point to nodes in minimum topic space cost. Finally, we build the inverted lists for every leaf node of the following formula: Query Processing. Algorithm 1 illustrates the query processing mechanism over the NIQ-tree. Given a query q, the objects retrieval is carried out on the spatial, topic and textual domains of the index alternately. Starting from the root of index, we traverse the spatial layer into regions so that will be mapped to the range[i×c, andp is the point in this partition. In this method, the high dimensional topic space, and+− tree can thus be applied directly. Next we set theN.o andN.r for all spatial layer nodes of the NIQ-tree in an bottom up fashion, such that they can cover the center point and radius ofN’s child +−tree by the n-gram method. nodes in the ascending order of the best match distance(,) with respect to q defined as the , =×∈.. + 1− ×(,) Where×∈.. and (,) denote the minimum possible spatial and topic distance from q to any object contained in the nodeN. Let||.,N.o|| be the cosine distance between textual descriptionq.φ and reference point o in topic layer, the minimum possible topic distance(q,N) can be computed as follows: .,N.o ≤. min, = 0 .,N.o >. It is noted that(,) is the lower bound distance toq for all unvisited points according process further search. During the search in topic layer, the leaf node in the B+− can be to its definition.. In the processing of the research, when we pop a non-leaf from the priority queue, we add its child nodes to queue, and when a leaf node is popped out, we go to the topic layer and .,N.o −. (6) (7)
dynamically maintain the top-k minimum distance for all scanned points and keep the k-th the following condition holds for all unvisited topic layer leaf node: (8) points have no opportunity to be better than the current top-k results. identified according to its key value. Similar to iDistanceKNN search [16,17], we browse the space by expanding the radius R of the hyper-sphere centered at the query point q. At each time, R ×∈.. + 1− ×≥D is increased by△R(i.e..R=R+△R). If the leaf node of this layer intersects with the searching sphere, we traverse the node according to its key value in the range of[i×c+disq,o −R,i× c+distq, +R], wheredisq,o is the distance betweenq and reference pointo. Then, by checking its inverted list in textual layer, we find the point the points whose textual distance toq is no more than the threshold as candidates and compute their distance toq . Especially, we minimum distance as an upper bound D. The radius of search sphereR stops increasing when When. is the MBR of a spatial layer leaf nodeN. Obviously,(×∈.. + 1− ×) is a lower bound distance from q to a topic layer leaf node rooted inN. The whole search algorithm terminates when is no less than because the remaining unvisited Algorithm 2 Algorithm is called Spatial Sketch-Based Approximate Trajectory recommmending attention to offer the optimal trajectory for users from the trajectory databaseT. According to the former spatial keyword query with semantic, we regard a trajectory as a set of POI(s)o= (o.λ,o.φ) with spatial and textual information. We have investigated the problem of trajectory In the 2 algorithm[16], we process the trajectory search by three main steps as shown in Algorithm. Cluster algorithm has been applied in various fields. With respect to the trajectory recommendation, we apply optimized cluster algorithm and optimized Dijkstra algorithm in search on categorical POI sites, which returns a trajectory that maximizes user satisfaction score within a given distance threshold . Figure 2. The first step is the preprocessing step, in which the sites are clustered to spatial clusters, and the useful inter with intra cluster information is extracted as well. Figure 2 (a) illustrates the preprocessing step. Afterwards, the second step is to search on the granularity of clusters, without touching at the site level. As shown in Figure 2 (b), we first discuss the priority of the candidate clusters to be checked. Then we consider the issue of sequence of the candidate clusters to go through (sketched trips in Figure 2 (b)), so that the hopeless and unpromising ones can be filtered out. Furthermore, we analyze the contents of the sketched trips to derive the execution plan based on the cost analysis of the required shortest path search, which contributes to avoid vast unnecessary computational cost in trip search.
At last, the trip search goes to the site level as Figure 2 (c) shows. Given a candidate clusters, we scan the corresponding candidate sites based on the execution plan derived from the previous step, and some bounds are used to guarantee the effectiveness and efficiency of trip search in Fig. 2 . Spatial keyword search Experiment In this section, we conduct extensive experiments on real datasets to evaluate the performance of our proposed index and search algorithms. Experiment Settings We create the real object datasets by using the online check-in records of Foursquare within the area of New York City. Each check-in records of Foursquare contains the user ID, venue with geo-location (place of interest), time of check-in and the tips written in plain English. We put the records belonging to the same object to form textual descriptions of the objects. The topic distributions over words are obtained by the textual descriptions in the tips associated with the location, and then the textual descriptions for each place are interpreted into a probabilistic topic distribution by LDA model. The number of objects in the whole dataset is 422030 in sum. Paramete r k λ τ D c m Default Value 10 0.5 3 200K 4K 20 Desccription top-k results weight factor threshold distance No. of objects of edit capacity of Quadtree leaf node number of clusters Table.1: default values of parameters We compare the query time cost and number of visited objects during the search processing of the proposed method (NIQ-tree) with the two baseline algorithms proposed in Section 3. The default values for parameters are given in Table 2. In the experiments, we vary one parameter and
分享到:
收藏