Spatial Activities Trajectory Recommendation
with Semantic
Wei Sun
Abstract
With the development of mobile Internet technology, mobile social networks based on the
mobile terminals have blown up in the recent years. Therefore, spatial keyword search and activity
trajectory recommendation have been very important technologies widely used in travel planning
service. Existing spatial keyword indexing querying methodologies mainly focus on the spatial
and textual similarities, while lacking in the semantic understanding of spatial objects. At the
aspect of activity trajectory recommendation, simple path search’s efficiency is so low that it
cannot meet users’ need. In this paper, we study the problem of semantic based spatial keyword
querying and spatial activity trajectory recommending with cluster. For returning k objects most
similar to the query subjects to not only their spatial and textual properties, but also the coherence
of semantic meanings, we propose a novel indexing structure called NIQ-tree which integrates
spatial, textual and semantic information to prune the search space effectively in query processing.
And some experiments are carried out to evaluate the efficiency of the algorithm we propose. For
returning the optimal trajectory, we apply the cluster idea in our trajectory recommendation.
Therefore we propose Spatial Sketch-Based Approximate Trajectory Recommending Algorithm
which make it more efficient for activity trajectory recommendation. Finally, we introduce our
developed activity trajectory recommendation system that integrates spatial keyword querying
with semantic and spatial trajectory recommendation to certificate our work’s validity.
Introduction
With the development of mobile Internet technology, mobile social networks based on the
mobile terminals has blown up in the recent years. Quantities of users can share their location so
massive Geo-tagging data appear swiftly (e.g. users of Micro-blog Sina can show their locations,
comments and photos). From the spatial-temporal features, this kind of data can be considered as
the activity trajectory.
As we know,
traveling is a significant part of our daily life. Thanks to the route
recommendation system and service, we can travel to unknown places and simply and accurately
find our location. Behind the service, it uses spatial keyword querying technique and some
efficient search for activity trajectories. Spatial keyword querying is a significant technique for
location based services (LBS) system, such as Baidu Map and Google Map. Extensive efforts
from academic and industrial communities have been made many effective work and support on
keyword query. However, some early work[1-4] mainly focus on Spatial Keyword Boolean Query
(SKBQ) that requires exact keyword match. Obviously it can lead no results to be returned on
account of different textual expression. To solve the problem, researches proposed some novel
indexing structures to support Spatial Keyword Approximate Query (SKAQ) more recently in
[5-8], which are able to handle the spelling mistakes (e.g.. theater vs theatre) that frequently
appear in real applications. But still, they cannot retrieve the objects that are synonym but literally
different to the keywords in query, such as ‘theater’ and ‘cinema’ or ‘coffee’ and ‘Starbucks’. This
gap motivates us to investigate other approaches that can capture the semantic meanings of spatial
keywords.
In the meantime, increasing geo-textual objects are becoming available on the web that
represent Point-of-Interest (POI). Each POI has a spatial location and a semantic description
stored in the trajectory database.
Some applications (e.g. Foursquare, Facebook Place and Flickr) are refining and enriching
the traditional trajectory databases by associating locations with semantic meanings.
In this paper, based on the Beijing road network data we design a activity trajectory
recommendation system to exhibition Spatial keyword search and trajectory recommendation
functions. On the Spatial keyword search, traditional methods fail to retrieve the objects that are
synonym but literally different to the keywords in query. To sum up, our main contributions can be
briefly summarized as follows:
- We introduce and formalize a new probabilistic topic model based similarity
measure between a query and a object.
- We propose a novel hierarchical indexing structure called NIQ-tree to integrate the
spatial, semantic and textual information of the objects.
- We conduct an extensive experiment to make the comparisons with different
algorithms to demonstrate the efficiency of our method.
On activity trajectory search, our main contributions can be summarized as the follows:
- We use Spatial Sketch-Based Approximate Trajectory Recommending Algorithm to
query trajectory finally that can return the optimal trajectory.
- We show our trajectory recommendation system and evaluate the efficiency.
Problem statement
Preliminaries
In this section, we introduce some preliminaries and give some problems of this paper.
Probabilistic Topic Model [9] is a mature natural language processing technique that has been
proven to be successful on theme interpretation and document classification. In this paper, we
apply the Latent Dirichlet Allocation model, i.e. one of the most frequently used probabilistic
topic model, to understand the semantic of textual description by words with respect to the topics.
Therefore topic can be understood as possible semantic meaning of textual data defined as
follows:
Definition 1. (Topic): A topicz represents a type of intended activity that a user can be
interested in such as ‘Chinese restaurant’, ‘coffee shop’, ‘supermarket’ and so on.Z is a
preprocessed topic set, which is the union of all topics used to describe the meaningful semantic of
textual descriptions.
By carrying out statistical analysis on the large amount of textual descriptions, the LDA
model derives the semantic relevance of a topic to all relevant words, known as words distribution
defined as follows:
Definition 2. (Words Distribution): Given the topic setZ and the set of all possible wordsV, a
matrixM=Z×(⊆V) is used to represent the distribution of topics inZ over all words in
V , where
is words collection associated with topicz , apparently ⊆ . Each
represents a word distribution, i.e. a distribution of a topicz∈Z over all words relevant to this
∈
topic. We use [] to denote the proportion of words distribution which satisfies
=1.
Definition 3. (Topic Distribution): Given a textual descriptionW, the topic distribution ofW,
denoted asT, is the statistical proportion for each word inW, where the topic proportion
T fromW to topicz is calculated as:
T = ∈∩+
+×
Where∈∩
is number of keywords belongs to the given textual description ofW in
; α is the symmetric Dirichlet prior and generally set to 0.1.|W| and|Z| are the number
of keywords inW and topics inZ respectively.
Definition 4. (Topic Distance): Given two textual descriptions W andW', their topic
distance(,') is defined as:
distance can be quantified by several similarity measures (e.g., Euclidean and Cosine
Distance). Here, we adapt the cosine distance to measure their distance in high dimensional
topic space, and normalize the result to range [0,1] by using the sigmoid function. The topic
,' = ∈(×'[])
||||×||'||
(2)
(1)
(3)
Definition 5. (Spatial Web Object): Given a spatial web object o, it can be formalized as
In the real-world applications, a spatial web object is a shop, a supermarket, a restaurant or other
points whose position and textual description are easy to get through the Internet.
where
two arbitrary textual descriptions is, the more relevant they are in semantics according to the LDA
interpretation.
|||| is modulus of in theZ dimensions. It is obvious that less topic distance of
o=(o.λ,o.φ), whereo.λ is the position of o ando.φ is the textual description for describing o.
Definition 6. (Spatial Keyword Query): Consider a queryq=(q.λ,q.φ,τ) whereq.λ is the
τ is
space;q.φ is a group of words that describe user’s intention, such as ‘Japanese restaurant’;
denoted asTD(,), which is measured by Edit Distance[10] in this paper.
and if only its textual distance to q is no more than the threshold of q, i.e.TD(,) <τ.
query position of user which can be represented by longitude and latitude in a two-dimensional
a user-specified threshold of textual distance. The distance between query q and spatial object o is
Definition 7. (Candidate Object): Given a query q, a spatial is said to be a candidate object, if
1+−(q.λ,q.λ)−1
2
By combing the spatial distance and the topic distance, we further define a distance function of q
Definition 8.(Distance Function): Given a query q and a spatial object o, their spatial distance
are calculated by the Euclidean distance of their geographical locations. We normalize it to range
[0,1] as shown in Equation 3.
, =
and o, denoted asD(q,o), as
Dq,o =×, + 1− ×(,)
Whereλ is a user-specified parameter to balance the weights of the spatial and semantic distance
toq.
(4)
represents a spatial web object, which is
Problem Definition
Definition 9. (Semantic Trajectory): A semantic trajectoryTr is defined as a sequence of
geo-textual objects, i.e.Tr=(1,2,…,). Each
associated with a semantic trajectoryTr∈T whereTis a pre-defined trajectory vocabulary.
Q= 1,2,…,
(1,2,3,…,−1, are the points which you need to pass by from the
starting point q1
and terminal point q ). These points consist of a semantic trajectory.
Meanwhile, a positive integer k and a semantic trajectory setT are given. We return the k
candidate trajectories fromT that have the smallest minimum match distances with respect toQ.
a
query
destination
q
1
,
Given
a
start
point
and
a
Baseline Algorithm
Quadtree based Algorithm
The first baseline algorithm uses the Quardtree[11] to prune the search space in spatial
dimension. In the method, the Quadtree, which only utilizes the spatial coordinates of the points,
is used to index these points in two-dimensional space directly. Given a query q, the first baseline
traverse the index structure to find the spatial nearest object incrementally in terms of the spatial
best match distance which is computed as follows:
, =×(,)
(5)
It is easy to see that the spatial best match distance is always the lower bound of the distance
between q and o. In the processing of the query, we keep finding the next nearest point o’(based
on spatial best match distance) and computing its distance.
IR-Tree Index Algorithm
The second baseline adopts the IR-tree[12] as the indexing structure, which is used to support
efficient spatial keyword search on static point set. The IR-tree is essentially an R-tree extended
with inverted files[13]. The search algorithm based on IR-tree proceeds similarly with the R-tree
based method, i.e., trying to find the most spatially close trajectories and then computing the
minimum match distance with respect to the query. The only modification is, before probing the
entries in a node of IR-tree, we first check its inverted file to see if it contains any point of query.
If not, all places enclosed in this node can be pruned directly. By this means, this baseline is
expected to examine fewer nodes than the R-tree base method, thus can achieve better efficiency.
NIQ-tree based Algorithm
In this section, we proposed an improved hybrid indexing structure NIQ-tree based on
iDistance[14,15]. The iDistance is a well-known index scheme for high dimensional similarity
search, with a basic idea to group all objects by clustering, which enables us to achieve superior
pruning effect in querying. By utilizing iDistance to sketch the topic distributions, the NIQ-tree
is expected to support effective pruning on spatial,
simultaneously.
textual and semantic dimensions
Indexing Structure. The NIQ-tree is a three layered hybrid indexing structure shown as
Figure 2, where the spatial, semantic and textual layers are integrated in a vertical way. In
designing NIQ-tree, we adopt a spatial first method due to the better pruning effect in spatial
domains, which can be explained by its two dimensional nature(v.s. high dimensions of topic and
textual domains). The basic form of a NIQ-tree node isN= (p.rect,o,r) , where p is the
pointer(s) to its child node(s);
rect is the minimum bounding rectangle(MBR) in spatial of all
objects contained by N; o and r are the center point and radius of a topic space hyper-sphere that
covers the topic distributions of all objects contained by N respectively. On top of all spatial web
objects, we use Quadtree to index them in the spatial domain according to their spatial closeness
first. For each leaf node of Quadtree, all objects are further organized by iDistance index in the
semantic layer, such that objects are grouped and managed by their topic coherence. For each leaf
nodes N of the semantic
lists in the
textual layer, and similar to MHR-tree, the n-gram inverted lists functionally sketch out the
textual descriptions of objects contained in N , so that it is possible to filter irrelevant ob jects
layer,it is referenced to a set of n-gram based inverted
according to the q.φ andq.τ specified in query.
Fig. 1
In constructing the NIQ-tree, we build up a Quadtree for all points in spatial database first
like the first baseline algorithm. Then for the points in every leaf node of the Quardtree, we use
iDistance to cluster these points based on their topic distributions of all contained objects, and
construct a+−tree to organize the nodes(each node represents a cluster) according to the key
wherei is the identifier of the partition,c is a constant to partition the single dimension space
key=i×c+(,)
value computed as follows:
i+1 ×c],
is a reference point to
nodes in minimum topic space cost. Finally, we build the inverted lists for every leaf node of the
following formula:
Query Processing. Algorithm 1 illustrates the query processing mechanism over the
NIQ-tree. Given a query q, the objects retrieval is carried out on the spatial, topic and textual
domains of the index alternately. Starting from the root of index, we traverse the spatial layer
into regions so that will be mapped to the range[i×c,
andp is the point in this partition. In this method, the high dimensional topic space, and+−
tree can thus be applied directly. Next we set theN.o andN.r for all spatial layer nodes of the
NIQ-tree in an bottom up fashion, such that they can cover the center point and radius ofN’s child
+−tree by the n-gram method.
nodes in the ascending order of the best match distance(,) with respect to q defined as the
, =×∈.. + 1− ×(,)
Where×∈.. and (,) denote the minimum possible spatial and topic
distance from q to any object contained in the nodeN. Let||.,N.o|| be the cosine distance
between textual descriptionq.φ and reference point o in topic layer, the minimum possible topic
distance(q,N) can be computed as follows:
.,N.o ≤.
min, = 0
.,N.o >.
It is noted that(,) is the lower bound distance toq for all unvisited points according
process further search. During the search in topic layer, the leaf node in the B+− can be
to its definition.. In the processing of the research, when we pop a non-leaf from the priority queue,
we add its child nodes to queue, and when a leaf node is popped out, we go to the topic layer and
.,N.o −.
(6)
(7)
dynamically maintain the top-k minimum distance for all scanned points and keep the k-th
the following condition holds for all unvisited topic layer leaf node:
(8)
points have no opportunity to be better than the current top-k results.
identified according to its key value. Similar to iDistanceKNN search [16,17], we browse the
space by expanding the radius R of the hyper-sphere centered at the query point q. At each time, R
×∈.. + 1− ×≥D
is increased by△R(i.e..R=R+△R). If the leaf node of this layer intersects with the searching
sphere, we traverse the node according to its key value in the range of[i×c+disq,o −R,i×
c+distq, +R], wheredisq,o
is the distance betweenq and reference pointo. Then, by
checking its inverted list in textual layer, we find the point the points whose textual distance toq is
no more than the threshold as candidates and compute their distance toq . Especially, we
minimum distance as an upper bound D. The radius of search sphereR stops increasing when
When. is the MBR of a spatial layer leaf nodeN. Obviously,(×∈.. +
1− ×) is a lower bound distance from q to a topic layer leaf node rooted inN. The whole
search algorithm terminates when is no less than because the remaining unvisited
Algorithm
2 Algorithm is called Spatial Sketch-Based Approximate Trajectory recommmending
attention to offer the optimal trajectory for users from the trajectory databaseT. According to the
former spatial keyword query with semantic, we regard a trajectory as a set of POI(s)o=
(o.λ,o.φ) with spatial and textual information. We have investigated the problem of trajectory
In the 2 algorithm[16], we process the trajectory search by three main steps as shown in
Algorithm. Cluster algorithm has been applied in various fields. With respect to the trajectory
recommendation, we apply optimized cluster algorithm and optimized Dijkstra algorithm in
search on categorical POI sites, which returns a trajectory that maximizes user satisfaction score
within a given distance threshold .
Figure 2. The first step is the preprocessing step, in which the sites are clustered to spatial clusters,
and the useful inter with intra cluster information is extracted as well. Figure 2 (a) illustrates the
preprocessing step. Afterwards, the second step is to search on the granularity of clusters, without
touching at the site level. As shown in Figure 2 (b), we first discuss the priority of the candidate
clusters to be checked. Then we consider the issue of sequence of the candidate clusters to go
through (sketched trips in Figure 2 (b)), so that the hopeless and unpromising ones can be filtered
out. Furthermore, we analyze the contents of the sketched trips to derive the execution plan based
on the cost analysis of the required shortest path search, which contributes to avoid vast
unnecessary computational cost in trip search.
At last, the trip search goes to the site level as Figure 2 (c) shows. Given a candidate clusters,
we scan the corresponding candidate sites based on the execution plan derived from the previous
step, and some bounds are used to guarantee the effectiveness and efficiency of trip search in
Fig. 2
.
Spatial keyword search Experiment
In this section, we conduct extensive experiments on real datasets to evaluate the
performance of our proposed index and search algorithms.
Experiment Settings
We create the real object datasets by using the online check-in records of Foursquare within
the area of New York City. Each check-in records of Foursquare contains the user ID, venue with
geo-location (place of interest), time of check-in and the tips written
in plain English. We put the
records belonging to the same object to form textual descriptions of the objects. The topic
distributions over words are obtained by the textual descriptions in the tips associated with the
location, and then the
textual descriptions for each place are interpreted into a probabilistic topic
distribution by LDA model. The number of objects in the whole dataset is 422030 in sum.
Paramete
r
k
λ
τ
D
c
m
Default
Value
10
0.5
3
200K
4K
20
Desccription
top-k
results
weight factor
threshold
distance
No. of objects
of
edit
capacity of Quadtree leaf
node
number of clusters
Table.1: default values of parameters
We compare the query time cost and number of visited objects during the search processing
of the proposed method (NIQ-tree) with the two baseline algorithms proposed in Section 3. The
default values for parameters are given in Table 2. In the experiments, we vary one parameter and