Journal of Software Engineering and Applications, 2019, 12, 35-50
http://www.scirp.org/journal/jsea
ISSN Online: 1945-3124
ISSN Print: 1945-3116
Prediction of Criminal Suspects Based on
Association Rules and Tag Clustering
Bo Cheng1, Weihong Li1*, Haoxin Tong2
1School of Geography, South China Normal University, Guangzhou, China
2R & D Department, Aerospace Finest (Guangdong) Information Technology Co. Ltd., Guangzhou, China
How to cite this paper: Cheng, B., Li,
W.H. and Tong, H.X. (2019) Prediction of
Criminal Suspects Based on Association
Rules and Tag Clustering. Journal of Soft-
ware Engineering and Applications, 12,
35-50.
https://doi.org/10.4236/jsea.2019.123003
Received: February 26, 2019
Accepted: March 24, 2019
Published: March 27, 2019
Copyright © 2019 by author(s) and
Scientific Research Publishing Inc.
This work is licensed under the Creative
Commons Attribution International
License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Open Access
Abstract
To date, not many studies have been conducted on criminal prediction. In
this study, the criminal data related to city S is divided into a training data set
and a validation data set at a 1:1 ratio in light of the personal tag data and the
travel and accommodation data of criminals and ordinary people in city S.
Firstly, the FP-growth algorithm is adopted to calculate association rules be-
tween the criminals and the ordinary people in their travel and hotel accom-
modation data, in order to discover criminal suspects based on association
rules. Secondly, the DBSCAN algorithm is employed for clustering of the tag
data of the criminals and the ordinary people, followed by similarity calcula-
tion, in order to discover criminal suspects based on tag clustering. Lastly, in-
tersection operation is performed on the above two sets of criminal suspects,
and the resulting intersection is verified against the criminal validation set for
elimination of criminals who appear in the intersection so as to obtain final
criminal suspects. Results show that a set of 648 criminal suspects is retrieved
based on the association rules calculated by the FP-growth algorithm, while a
set of 973 criminal suspects is retrieved based on DBSCAN clustering and co-
sine similarity of the personal tags; the number of criminal suspects is nar-
rowed down to 567 after the intersection operation of the two sets, and 419 of
the 567 criminal suspects are further verified to be criminals using the valida-
tion set, thereby leaving the other 148 to be the final criminal suspects and
giving a prediction accuracy of 73.9%. The data mining method of criminal
suspects based on association rules and tag clustering in this study has been
successfully applied to the police system of city S, and the experiment proves
the effectiveness of this method in detecting criminal suspects.
Keywords
FP-Growth, Association Rule, DBSCAN, Tag Clustering, Criminal Suspects
DOI: 10.4236/jsea.2019.123003 Mar. 27, 2019
35
Journal of Software Engineering and Applications
B. Cheng et al.
DOI: 10.4236/jsea.2019.123003
1. Introduction
Nowadays, crime situation is becoming increasingly serious across the globe
with more crime types and a higher number of criminals, posing a threat to hu-
man lives and property as well as social stability. Public security authorities are
increasingly tasked with maintaining public security and fighting crimes with an
ever-growing requirement on law enforcement. Given the constantly generated
crime data, it is necessary for data analysts to reveal hidden patterns in the data,
analyze implicit relationships between the data, predict occurrence of crimes and
discover potential criminals, so as to improve the efficiency of law enforcement
efficiency of public security authorities and prevent occurrence of crimes.
Association rule mining (ARM) [1] is a research focus in the field of data
mining. As one of the classic algorithms for data mining, ARM has been widely
used in crime research. It is possible to extract relevant criminal evidence from
association rules of a large number of data items, and further explore the pat-
terns, trends and links between different crimes, so as to provide support for the
police in case investigation and crime prevention. ARM performs well in ex-
ploring the causes of crime, identifying the main crime suspects, and having a
deeper insight into crime series. A variety of ARM algorithms have been playing
an important role in crime analysis and crime prediction. In the international
research community of association-rule-based crime mining, Ng et al. [2] intro-
duced temporal association rules and proposed an incremental algorithm to
solve the problem of how to process time series whose association rules contain
time expressions, and employed the new algorithm to discover crime patterns in
Hong Kong. Buczak et al. [3] explored applying fuzzy association rule mining to
recognition of community crime patterns, and such application promoted the
efficiency of local law enforcement. Tan et al. [4] were the first to analyze the
role of the FP-growth algorithm in computer crime forensics, specifying the li-
mitation that the FP-growth algorithm had when used to discover the latest
crimes and serious crimes, and making some improvements to the FP-growth
algorithm and its test. Joshi et al. [5] proposed the FP-Tree similarity algorithm
and found it more effective than the Apriori algorithm. Usha et al. [6] [7] tested
the Apriori algorithm and other algorithms such as the Fp-growth algorithm in
real and synthetic crime data sets, and found that each algorithm had its own
advantages. Shekhar et al. [8] explored spatial frequent pattern mining (SFPM)
based on criminal pattern analysis (CPA) and validated this mining method with
a spatial crime dataset. Isafiade et al. [9] revisited frequent pattern growth
models for crime pattern mining, proposing a revised frequent pattern growth
(RFPG) model, and also proposed a descriptive statistical approach based on a
quartile (floor-ceil) function, which was used to identify recurring trends in
criminal activity. Based on geographic and demographic factors, Asmai et al.
[10] used ARM to generate a crime mapping model for crime analysis, employ-
ing the model to examine the occurrence of crime at a specific location, and de-
monstrating that the model could be used to analyze future crime locations with
36
Journal of Software Engineering and Applications
B. Cheng et al.
relatively high crime potential and improve crime prevention implementation.
Extensive research has been conducted in China on ARM-based crime mining.
Based on fuzzy set and rough set theory, Chen and Yu [11] [12] employed ARM
to quantitatively analyze a criminal dataset, make deductions, and extract rules,
providing theoretical guidance for crime prevention. ARM has gained wide-
spread attention and application in the fields of criminal portraits and criminal
forensic analysis [13]. Moreover, ARM has been widely used in crime research
such as crime investigation [14], criminal suspect analysis [15], criminal beha-
vior analysis [16], and reoffending [17] [18]. In view of the temporal and spatial
characteristics of crime data, many studies have proposed a number of improved
ARM algorithms such as spatio-temporal association rules [19] [20], cluster as-
sociation rules [21], and data cube-based association rules [22], as well as other
improved algorithms such as incremental association rules [23].
In summary, association rules have been extensively applied to crime mining
in relevant studies, the effectiveness of ARM in different fields of crime mining
has been investigated in depth, and a large number of improved ARM algo-
rithms have been proposed. These studies are mainly focused on using crime
data for association rules mining, but in practice more frequently encountered is
business data, which is not necessarily related to crime but is easier to obtain,
thereby making it crucial to extract crime information from a large volume of
ordinary business data. Moreover, existing studies are largely focused on crime
case analysis and crime pattern mining instead of the mining and prediction of
potential criminals, and most of the studies are focused on macro-level factors
that affect the occurrence of crimes while not considering micro-level characte-
ristics of criminals, while the micro-level characteristics are the internal factors
that determine whether an individual would commit a crime. This study com-
bines ARM algorithm and clustering algorithm to deal with crime and related
data. This method is different from previous crime mining methods. It can di-
rectly find criminal suspects instead of criminal hotspots or others, and it makes
full use of ordinary business data, which is easier to access and handle. In this
study, ARM is performed on city S-related travel data and accommodation data
of criminals and ordinary people, and meanwhile, tag clustering is performed on
criminals and ordinary people, in order to discover as criminal suspects, the or-
dinary people who not only frequently travel with criminals but also have highly
similar tags to them. Given that criminal acts are often carried out in the form of
gangs, the people who are discerned to travel with the criminals and have a high
similarity to the criminals in tags can be considered as potential key individuals.
Monitoring criminal suspects can greatly reduce the incidence of infraction and
crimes and improve public safety. In Section 2, we briefly describe the data used
in the study and the preprocessing of the data. In Section 3, we provide the de-
tails of our business process modeling methodology. In Section 4, we describe
the research results and analysis in detail, and test and compare the algorithm.
In Section 5, we summarize the conclusions of this work.
37
Journal of Software Engineering and Applications
DOI: 10.4236/jsea.2019.123003
B. Cheng et al.
2. Research Data
City S-related travel and accommodation data of criminals and ordinary people
in 2016 as well as their personal tag data at that time are collected. The travel and
accommodation data consist of shuttle ticketing data and hotel accommodation
data (Table 1). A total of 600,000 and 2 million personal attribute data are col-
lected on criminals and ordinary people, respectively, with the data consisting of
personal ID numbers and relevant personal tags (Table 2). The criminal data is
divided into a training data set and a validation data set at a 1:1 ratio. The train-
ing data set is used for criminal suspect-related computation, while the valida-
tion data set is used for verification of the method effectiveness by checking
whether actual criminals are among the criminal suspects.
The personal attributes in Table 2, except for ID number, are treated as per-
sonnel tags, and the meanings of some personal tags are shown in Table 3.
Clustering analysis and similarity calculation are performed on vector inputs.
Tag vectorization is conducted according to the publicly accessible corpus of
Chinese word vectors developed by Beijing Normal University and Renmin
University of China—a corpus providing pre-trained 300-dimensional (300 d)
character vectors. Each tag vector is calculated according to the following for-
mula:
1V
m
= ∑
m =
i
1
v
i
where V is a calculated tag vector, with m representing the number of Chinese
characters in the tag and
a character vector of the tag’s i-th
character. Each individual has 12 tags, and therefore the tag matrix of each indi-
vidual is 12 × 300 in size.
1
2
v v
,
i
i
300
v
i
v
i
,
,
=
(
)
3. Research Methods
This paper first uses the FP-growth algorithm to mine the ordinary people who
frequently travel with the criminals as criminal suspects based on travel data and
hotel accommodation data; Then use the DBSCAN algorithm to cluster and
analyze the label data of criminals and ordinary people, and obtain several tag
clusters. Carry out the similarity calculation of the criminals and ordinary per-
sonnel in each cluster, and find some ordinary persons with the highest similar-
ity with the criminals as criminal suspects; Finally, the criminal suspects based
on the association rules and the criminal suspects based on the label clustering
are interdigitated to obtain the final criminal suspects, and the criminals test da-
ta is used to check whether calculated criminal suspects obtain actual criminals.
Table 1. Travel accommodation data.
Data sheet
Shuttle
Hotel
Attribute field
Route code, departure time, starting station, arrival station, passenger ID number
Check-in ID number, hotel code, check-in time, hotel name
38
Journal of Software Engineering and Applications
DOI: 10.4236/jsea.2019.123003
Table 2. Personal tags and meaning of the tag values.
Tag
Value
ID number
6cbe2819c3xxxxxxxx
Gender
Age
Marital status
Employment status
Income
M
F
Minor
Youth
Middle aged
Elderly
Married
Unmarried
Employed
Unemployed
Low
Middle
High
Educational level
Single-parent family
Offspring
House property
Foster family
Household registration
Migrant
Primary school
Junior high school
Senior high school
University
Yes
No
Yes
No
Tenant
Property owner
Yes
No
Urban
Rural
Yes
No
B. Cheng et al.
Meaning
Male
Female
<18 years old
18 - 40 years old
41 - 60 years old
>61 years old
first marriage, remarriage, and remarriage
unmarried, divorced, widowed
Currently employed
Currently unemployed
Below the minimum wage
Between the minimum wage and the per capita wage
Higher than the per capita wage
Primary school education
Junior high school education
Senior high school education
University education
Raised in a single-parent family
Raised in a non-single-parent family
Having offspring
Having no offspring
Having no house property
Having house property
Fostered
Non-fostered
Urban household registration
Rural household registration
Migrant
Non-migrant
DOI: 10.4236/jsea.2019.123003
The research method flow is shown in Figure 1.
3.1. Association Rule Mining Based on FP-Growth
Association rule mining is a procedure which is meant to find frequent patterns,
correlations, associations, or causal structures from data sets found in various
kinds of databases such as relational databases, transactional databases, and other
39
Journal of Software Engineering and Applications
B. Cheng et al.
Figure 1. Research method flow chart.
forms of data repositories. Association rules are created by thoroughly analyzing
data and looking for frequent if/then patterns. Then, depending on the following
two parameters, the important relationships are observed:
1) Support: Indication of how frequently the itemset appears in the database.
It is defined as the fraction of records that contain X∪Y to the total number of
records in the database. Suppose, the support of an item is 0.1%, it means only
0.1% of the transactions contain that item.
)
Support XY Support count of XY Total number of transaction in D
=
(
)
(
2) Confidence: Fraction of the number of transactions that contain X∪Y to
the total number of records that contain X.
It’s is a measure of strength of the association rules. Suppose, the confidence
of the association rule X ⇒ Y is 80%, it means that 80% of the transactions that
contain X also contain Y together.
)
Confidence X Y Support XY Supp
ort
X
=
(
(
|
)
(
)
The FP-Growth Algorithm, proposed by Han in [24], is an efficient and scala-
ble method for mining the complete set of frequent patterns by pattern fragment
growth, using an extended prefix-tree structure for storing compressed and cru-
cial information about frequent patterns named frequent-pattern tree (FP-tree).
In his study, Han proved that his method outperforms other popular methods
for mining frequent patterns, e.g. the Apriori Algorithm [25] and the TreePro-
40
Journal of Software Engineering and Applications
DOI: 10.4236/jsea.2019.123003
B. Cheng et al.
jection [26]. In some later works [27] [28] it was proved that FP-Growth has
better performance than other methods, including Eclat [29] and Relim [30].
The popularity and efficiency of FP-Growth Algorithm contributes with many
studies that propose variations to improve his performance.
The FP-Growth Algorithm is an alternative way to find frequent itemsets
without using candidate generations, thus improving performance. For so much
it uses a divide-and-conquer strategy. The core of this method is the usage of a
special data structure named frequent-pattern tree (FP-tree), which retains the
itemset association information.
In simple words, this algorithm works as follows: first it compresses the input
database creating an FP-tree instance to represent frequent items. After this first
step it divides the compressed database into a set of conditional databases, each
one associated with one frequent pattern. Finally, each such database is mined
separately. Using this strategy, the FP-Growth reduces the search costs looking
for short patterns recursively and then concatenating them in the long frequent
patterns, offering good selectivity.
Based on travel data and hotel accommodation data, this study compares each
car or daily hotel accommodation to a shopping basket, where the item is the ID
of the person and the item set is all the people in the car or hotel. With the
minimum support and confidence, the FP-growth algorithm is used to calculate
the association rules between criminals and ordinary people. That is to mine or-
dinary people who frequently travel or stay with criminals as criminal suspects.
3.2. DBSCAN Tag Clustering and Cosine Similarity
As one of the most cited of the density-based clustering algorithms, DBSCAN is
likely the best known density-based clustering algorithm in the scientific com-
munity today. The central idea behind DBSCAN and its extensions and revisions
is the notion that points are assigned to the same cluster if they are densi-
ty-reachable from each other. To understand this concept, we will go through
the most important definitions used in DBSCAN and related algorithms. The
definitions and the presented pseudo code follows the original by Ester et al., but
are adapted to provide a more consistent presentation with the other algorithms
discussed in the paper. Clustering starts with a dataset D containing a set of
points p D∈ . Density-based algorithms need to obtain a density estimate over
the data space. DBSCAN estimates the density around a point using the concept
of ϵ-neighborhood.
Definition 1. -Neighborhood. The ϵ-neighborhood, N(p), of a data point
p is the set of points within a specified radius around p.
(
)
<
}
)
=
q d p q
,
|
(
N p
{
where d is some distance measure and
. Note that the point p is always in
its own ϵ-neighborhood, i.e.,
always holds.Following this defini-
∈
)
tion, the size of the neighborhood
can be seen as a simple unnorma-
lized kernel density estimate around p using a uniform kernel and a bandwidth
R+∈
(
)
p N p
(
N p
DOI: 10.4236/jsea.2019.123003
41
Journal of Software Engineering and Applications
B. Cheng et al.
of ϵ. DBSCAN uses
gions and to classify the points in a data set into core, border, or noise points.
N p and a threshold called minPts to detect dense re-
(
)
Definition 2. Point classes. A point p D∈ is classified as
• a core point if
has high density, i.e.,
minPts Z+∈ is a user-specified density threshold,
(
N p
)
(
N p ≥
) minPts
where
• a border point if p is not a core point, but it is in the neighborhood of a core
point q D∈ , i.e.,
∈
• a noise point, otherwise.
(
p N q
)
, or
Definition 3. Directly density-reachable. A point q D∈ is directly densi-
)
.
ty-reachable from a point p D∈ with respect to and minPts if, and only if,
, and
) minPts
(
N p ≥
(
q N p
∈
1)
2)
That is, p is a core point and q is in its ϵ-neighborhood.
Definition 4. Density-reachable. A point p is density-reachable from q if
with q = p1 and
−
)
there exists in D an ordered sequence of points (
p
p p
,
,
,
n
2
1
{
p = pn such that pi+1 directly density-reachable from pi
i
1,2,
∀ ∈
}
1
.
n
,
Definition 5. Density-connected. A point p D∈ is density-connected to a
point q D∈ if there is a point o D∈ such that both p and q are densi-
ty-reachable from o.
Definition 6. Cluster. A cluster C is a non-empty subset of D satisfying the
following conditions:
∀
,p q C
∈ , p is density-connected to q.
1) Maximality: If p C∈ and q is density-reachable from p, then q C∈ ; and
2) Connectivity:
The DBSCAN algorithm identifies all such clusters by finding all core points
and expanding each to all density-reachable points. The algorithm begins with
an arbitrary point p and retrieves its ϵ-neighborhood. If it is a core point then it
will start a new cluster that is expanded by assigning all points in its neighbor-
hood to the cluster. If an additional core point is found in the neighborhood,
then the search is expanded to include also all points in its neighborhood. If no
more core points are found in the expanded neighborhood, then the cluster is
complete and the remaining points are searched to see if another core point can
be found to start a new cluster. After processing all points, points which were
not assigned to a cluster are considered noise.
This study uses cosine similarity to calculate the distance. Refer to the existing
study [31] to set ε to 0.6 and MinPts to 255. The DBSCAN algorithm clusters the
vectorized criminals and ordinary people’s tag matrix, and calculates the simi-
larity and sorting of the criminals and criminal suspects for each cluster. Then,
ordinary people with similarities greater than 0.8 are selected from various clus-
ters as criminal suspects. The cosine similarity calculation formula is as follows:
cos
(
)
θ
=
n
∑
i
1
=
(
x
i
×
y
i
)
n
∑
i
1
=
(
x
i
2
)
×
n
∑
i
1
=
(
y
i
2
)
DOI: 10.4236/jsea.2019.123003
42
Journal of Software Engineering and Applications