logo资料库

论文研究 - 基于关联规则和标签聚类的犯罪嫌疑人预测.pdf

第1页 / 共16页
第2页 / 共16页
第3页 / 共16页
第4页 / 共16页
第5页 / 共16页
第6页 / 共16页
第7页 / 共16页
第8页 / 共16页
资料共16页,剩余部分请下载后查看
Prediction of Criminal Suspects Based on Association Rules and Tag Clustering
Abstract
Keywords
1. Introduction
2. Research Data
3. Research Methods
3.1. Association Rule Mining Based on FP-Growth
3.2. DBSCAN Tag Clustering and Cosine Similarity
4. Results and Analysis
4.1. Analysis of Association Rule Results
4.2. Result Analysis of DBSCAN Clustering
4.3. Result Validation
5. Conclusions
Conflicts of Interest
References
Journal of Software Engineering and Applications, 2019, 12, 35-50 http://www.scirp.org/journal/jsea ISSN Online: 1945-3124 ISSN Print: 1945-3116 Prediction of Criminal Suspects Based on Association Rules and Tag Clustering Bo Cheng1, Weihong Li1*, Haoxin Tong2 1School of Geography, South China Normal University, Guangzhou, China 2R & D Department, Aerospace Finest (Guangdong) Information Technology Co. Ltd., Guangzhou, China How to cite this paper: Cheng, B., Li, W.H. and Tong, H.X. (2019) Prediction of Criminal Suspects Based on Association Rules and Tag Clustering. Journal of Soft- ware Engineering and Applications, 12, 35-50. https://doi.org/10.4236/jsea.2019.123003 Received: February 26, 2019 Accepted: March 24, 2019 Published: March 27, 2019 Copyright © 2019 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access Abstract To date, not many studies have been conducted on criminal prediction. In this study, the criminal data related to city S is divided into a training data set and a validation data set at a 1:1 ratio in light of the personal tag data and the travel and accommodation data of criminals and ordinary people in city S. Firstly, the FP-growth algorithm is adopted to calculate association rules be- tween the criminals and the ordinary people in their travel and hotel accom- modation data, in order to discover criminal suspects based on association rules. Secondly, the DBSCAN algorithm is employed for clustering of the tag data of the criminals and the ordinary people, followed by similarity calcula- tion, in order to discover criminal suspects based on tag clustering. Lastly, in- tersection operation is performed on the above two sets of criminal suspects, and the resulting intersection is verified against the criminal validation set for elimination of criminals who appear in the intersection so as to obtain final criminal suspects. Results show that a set of 648 criminal suspects is retrieved based on the association rules calculated by the FP-growth algorithm, while a set of 973 criminal suspects is retrieved based on DBSCAN clustering and co- sine similarity of the personal tags; the number of criminal suspects is nar- rowed down to 567 after the intersection operation of the two sets, and 419 of the 567 criminal suspects are further verified to be criminals using the valida- tion set, thereby leaving the other 148 to be the final criminal suspects and giving a prediction accuracy of 73.9%. The data mining method of criminal suspects based on association rules and tag clustering in this study has been successfully applied to the police system of city S, and the experiment proves the effectiveness of this method in detecting criminal suspects. Keywords FP-Growth, Association Rule, DBSCAN, Tag Clustering, Criminal Suspects DOI: 10.4236/jsea.2019.123003 Mar. 27, 2019 35 Journal of Software Engineering and Applications
B. Cheng et al. DOI: 10.4236/jsea.2019.123003 1. Introduction Nowadays, crime situation is becoming increasingly serious across the globe with more crime types and a higher number of criminals, posing a threat to hu- man lives and property as well as social stability. Public security authorities are increasingly tasked with maintaining public security and fighting crimes with an ever-growing requirement on law enforcement. Given the constantly generated crime data, it is necessary for data analysts to reveal hidden patterns in the data, analyze implicit relationships between the data, predict occurrence of crimes and discover potential criminals, so as to improve the efficiency of law enforcement efficiency of public security authorities and prevent occurrence of crimes. Association rule mining (ARM) [1] is a research focus in the field of data mining. As one of the classic algorithms for data mining, ARM has been widely used in crime research. It is possible to extract relevant criminal evidence from association rules of a large number of data items, and further explore the pat- terns, trends and links between different crimes, so as to provide support for the police in case investigation and crime prevention. ARM performs well in ex- ploring the causes of crime, identifying the main crime suspects, and having a deeper insight into crime series. A variety of ARM algorithms have been playing an important role in crime analysis and crime prediction. In the international research community of association-rule-based crime mining, Ng et al. [2] intro- duced temporal association rules and proposed an incremental algorithm to solve the problem of how to process time series whose association rules contain time expressions, and employed the new algorithm to discover crime patterns in Hong Kong. Buczak et al. [3] explored applying fuzzy association rule mining to recognition of community crime patterns, and such application promoted the efficiency of local law enforcement. Tan et al. [4] were the first to analyze the role of the FP-growth algorithm in computer crime forensics, specifying the li- mitation that the FP-growth algorithm had when used to discover the latest crimes and serious crimes, and making some improvements to the FP-growth algorithm and its test. Joshi et al. [5] proposed the FP-Tree similarity algorithm and found it more effective than the Apriori algorithm. Usha et al. [6] [7] tested the Apriori algorithm and other algorithms such as the Fp-growth algorithm in real and synthetic crime data sets, and found that each algorithm had its own advantages. Shekhar et al. [8] explored spatial frequent pattern mining (SFPM) based on criminal pattern analysis (CPA) and validated this mining method with a spatial crime dataset. Isafiade et al. [9] revisited frequent pattern growth models for crime pattern mining, proposing a revised frequent pattern growth (RFPG) model, and also proposed a descriptive statistical approach based on a quartile (floor-ceil) function, which was used to identify recurring trends in criminal activity. Based on geographic and demographic factors, Asmai et al. [10] used ARM to generate a crime mapping model for crime analysis, employ- ing the model to examine the occurrence of crime at a specific location, and de- monstrating that the model could be used to analyze future crime locations with 36 Journal of Software Engineering and Applications
B. Cheng et al. relatively high crime potential and improve crime prevention implementation. Extensive research has been conducted in China on ARM-based crime mining. Based on fuzzy set and rough set theory, Chen and Yu [11] [12] employed ARM to quantitatively analyze a criminal dataset, make deductions, and extract rules, providing theoretical guidance for crime prevention. ARM has gained wide- spread attention and application in the fields of criminal portraits and criminal forensic analysis [13]. Moreover, ARM has been widely used in crime research such as crime investigation [14], criminal suspect analysis [15], criminal beha- vior analysis [16], and reoffending [17] [18]. In view of the temporal and spatial characteristics of crime data, many studies have proposed a number of improved ARM algorithms such as spatio-temporal association rules [19] [20], cluster as- sociation rules [21], and data cube-based association rules [22], as well as other improved algorithms such as incremental association rules [23]. In summary, association rules have been extensively applied to crime mining in relevant studies, the effectiveness of ARM in different fields of crime mining has been investigated in depth, and a large number of improved ARM algo- rithms have been proposed. These studies are mainly focused on using crime data for association rules mining, but in practice more frequently encountered is business data, which is not necessarily related to crime but is easier to obtain, thereby making it crucial to extract crime information from a large volume of ordinary business data. Moreover, existing studies are largely focused on crime case analysis and crime pattern mining instead of the mining and prediction of potential criminals, and most of the studies are focused on macro-level factors that affect the occurrence of crimes while not considering micro-level characte- ristics of criminals, while the micro-level characteristics are the internal factors that determine whether an individual would commit a crime. This study com- bines ARM algorithm and clustering algorithm to deal with crime and related data. This method is different from previous crime mining methods. It can di- rectly find criminal suspects instead of criminal hotspots or others, and it makes full use of ordinary business data, which is easier to access and handle. In this study, ARM is performed on city S-related travel data and accommodation data of criminals and ordinary people, and meanwhile, tag clustering is performed on criminals and ordinary people, in order to discover as criminal suspects, the or- dinary people who not only frequently travel with criminals but also have highly similar tags to them. Given that criminal acts are often carried out in the form of gangs, the people who are discerned to travel with the criminals and have a high similarity to the criminals in tags can be considered as potential key individuals. Monitoring criminal suspects can greatly reduce the incidence of infraction and crimes and improve public safety. In Section 2, we briefly describe the data used in the study and the preprocessing of the data. In Section 3, we provide the de- tails of our business process modeling methodology. In Section 4, we describe the research results and analysis in detail, and test and compare the algorithm. In Section 5, we summarize the conclusions of this work. 37 Journal of Software Engineering and Applications DOI: 10.4236/jsea.2019.123003
B. Cheng et al. 2. Research Data City S-related travel and accommodation data of criminals and ordinary people in 2016 as well as their personal tag data at that time are collected. The travel and accommodation data consist of shuttle ticketing data and hotel accommodation data (Table 1). A total of 600,000 and 2 million personal attribute data are col- lected on criminals and ordinary people, respectively, with the data consisting of personal ID numbers and relevant personal tags (Table 2). The criminal data is divided into a training data set and a validation data set at a 1:1 ratio. The train- ing data set is used for criminal suspect-related computation, while the valida- tion data set is used for verification of the method effectiveness by checking whether actual criminals are among the criminal suspects. The personal attributes in Table 2, except for ID number, are treated as per- sonnel tags, and the meanings of some personal tags are shown in Table 3. Clustering analysis and similarity calculation are performed on vector inputs. Tag vectorization is conducted according to the publicly accessible corpus of Chinese word vectors developed by Beijing Normal University and Renmin University of China—a corpus providing pre-trained 300-dimensional (300 d) character vectors. Each tag vector is calculated according to the following for- mula: 1V m = ∑ m = i 1 v i where V is a calculated tag vector, with m representing the number of Chinese characters in the tag and a character vector of the tag’s i-th character. Each individual has 12 tags, and therefore the tag matrix of each indi- vidual is 12 × 300 in size. 1 2 v v , i i 300 v i v i , , = ( ) 3. Research Methods This paper first uses the FP-growth algorithm to mine the ordinary people who frequently travel with the criminals as criminal suspects based on travel data and hotel accommodation data; Then use the DBSCAN algorithm to cluster and analyze the label data of criminals and ordinary people, and obtain several tag clusters. Carry out the similarity calculation of the criminals and ordinary per- sonnel in each cluster, and find some ordinary persons with the highest similar- ity with the criminals as criminal suspects; Finally, the criminal suspects based on the association rules and the criminal suspects based on the label clustering are interdigitated to obtain the final criminal suspects, and the criminals test da- ta is used to check whether calculated criminal suspects obtain actual criminals. Table 1. Travel accommodation data. Data sheet Shuttle Hotel Attribute field Route code, departure time, starting station, arrival station, passenger ID number Check-in ID number, hotel code, check-in time, hotel name 38 Journal of Software Engineering and Applications DOI: 10.4236/jsea.2019.123003
Table 2. Personal tags and meaning of the tag values. Tag Value ID number 6cbe2819c3xxxxxxxx Gender Age Marital status Employment status Income M F Minor Youth Middle aged Elderly Married Unmarried Employed Unemployed Low Middle High Educational level Single-parent family Offspring House property Foster family Household registration Migrant Primary school Junior high school Senior high school University Yes No Yes No Tenant Property owner Yes No Urban Rural Yes No B. Cheng et al. Meaning Male Female <18 years old 18 - 40 years old 41 - 60 years old >61 years old first marriage, remarriage, and remarriage unmarried, divorced, widowed Currently employed Currently unemployed Below the minimum wage Between the minimum wage and the per capita wage Higher than the per capita wage Primary school education Junior high school education Senior high school education University education Raised in a single-parent family Raised in a non-single-parent family Having offspring Having no offspring Having no house property Having house property Fostered Non-fostered Urban household registration Rural household registration Migrant Non-migrant DOI: 10.4236/jsea.2019.123003 The research method flow is shown in Figure 1. 3.1. Association Rule Mining Based on FP-Growth Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures from data sets found in various kinds of databases such as relational databases, transactional databases, and other 39 Journal of Software Engineering and Applications
B. Cheng et al. Figure 1. Research method flow chart. forms of data repositories. Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed: 1) Support: Indication of how frequently the itemset appears in the database. It is defined as the fraction of records that contain X∪Y to the total number of records in the database. Suppose, the support of an item is 0.1%, it means only 0.1% of the transactions contain that item. ) Support XY Support count of XY Total number of transaction in D = ( ) ( 2) Confidence: Fraction of the number of transactions that contain X∪Y to the total number of records that contain X. It’s is a measure of strength of the association rules. Suppose, the confidence of the association rule X ⇒ Y is 80%, it means that 80% of the transactions that contain X also contain Y together. ) Confidence X Y Support XY Supp ort X = ( ( | ) ( ) The FP-Growth Algorithm, proposed by Han in [24], is an efficient and scala- ble method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and cru- cial information about frequent patterns named frequent-pattern tree (FP-tree). In his study, Han proved that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm [25] and the TreePro- 40 Journal of Software Engineering and Applications DOI: 10.4236/jsea.2019.123003
B. Cheng et al. jection [26]. In some later works [27] [28] it was proved that FP-Growth has better performance than other methods, including Eclat [29] and Relim [30]. The popularity and efficiency of FP-Growth Algorithm contributes with many studies that propose variations to improve his performance. The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate generations, thus improving performance. For so much it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the itemset association information. In simple words, this algorithm works as follows: first it compresses the input database creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. Using this strategy, the FP-Growth reduces the search costs looking for short patterns recursively and then concatenating them in the long frequent patterns, offering good selectivity. Based on travel data and hotel accommodation data, this study compares each car or daily hotel accommodation to a shopping basket, where the item is the ID of the person and the item set is all the people in the car or hotel. With the minimum support and confidence, the FP-growth algorithm is used to calculate the association rules between criminals and ordinary people. That is to mine or- dinary people who frequently travel or stay with criminals as criminal suspects. 3.2. DBSCAN Tag Clustering and Cosine Similarity As one of the most cited of the density-based clustering algorithms, DBSCAN is likely the best known density-based clustering algorithm in the scientific com- munity today. The central idea behind DBSCAN and its extensions and revisions is the notion that points are assigned to the same cluster if they are densi- ty-reachable from each other. To understand this concept, we will go through the most important definitions used in DBSCAN and related algorithms. The definitions and the presented pseudo code follows the original by Ester et al., but are adapted to provide a more consistent presentation with the other algorithms discussed in the paper. Clustering starts with a dataset D containing a set of points p D∈ . Density-based algorithms need to obtain a density estimate over the data space. DBSCAN estimates the density around a point using the concept of ϵ-neighborhood. Definition 1. -Neighborhood. The ϵ-neighborhood, N(p), of a data point p is the set of points within a specified radius around p. ( ) < }  ) = q d p q , | ( N p  { where d is some distance measure and . Note that the point p is always in its own ϵ-neighborhood, i.e., always holds.Following this defini- ∈  ) tion, the size of the neighborhood can be seen as a simple unnorma- lized kernel density estimate around p using a uniform kernel and a bandwidth R+∈ ( ) p N p ( N p DOI: 10.4236/jsea.2019.123003 41 Journal of Software Engineering and Applications
B. Cheng et al. of ϵ. DBSCAN uses gions and to classify the points in a data set into core, border, or noise points. N p and a threshold called minPts to detect dense re- ( ) Definition 2. Point classes. A point p D∈ is classified as • a core point if has high density, i.e., minPts Z+∈ is a user-specified density threshold, ( N p ) ( N p ≥  ) minPts where • a border point if p is not a core point, but it is in the neighborhood of a core point q D∈ , i.e., ∈  • a noise point, otherwise. ( p N q ) , or Definition 3. Directly density-reachable. A point q D∈ is directly densi- ) . ty-reachable from a point p D∈ with respect to and minPts if, and only if, , and ) minPts ( N p ≥  ( q N p ∈  1) 2) That is, p is a core point and q is in its ϵ-neighborhood. Definition 4. Density-reachable. A point p is density-reachable from q if with q = p1 and − ) there exists in D an ordered sequence of points ( p p p , , , n 2 1 { p = pn such that pi+1 directly density-reachable from pi i 1,2, ∀ ∈ } 1 . n , Definition 5. Density-connected. A point p D∈ is density-connected to a point q D∈ if there is a point o D∈ such that both p and q are densi- ty-reachable from o. Definition 6. Cluster. A cluster C is a non-empty subset of D satisfying the following conditions: ∀ ,p q C ∈ , p is density-connected to q. 1) Maximality: If p C∈ and q is density-reachable from p, then q C∈ ; and 2) Connectivity: The DBSCAN algorithm identifies all such clusters by finding all core points and expanding each to all density-reachable points. The algorithm begins with an arbitrary point p and retrieves its ϵ-neighborhood. If it is a core point then it will start a new cluster that is expanded by assigning all points in its neighbor- hood to the cluster. If an additional core point is found in the neighborhood, then the search is expanded to include also all points in its neighborhood. If no more core points are found in the expanded neighborhood, then the cluster is complete and the remaining points are searched to see if another core point can be found to start a new cluster. After processing all points, points which were not assigned to a cluster are considered noise. This study uses cosine similarity to calculate the distance. Refer to the existing study [31] to set ε to 0.6 and MinPts to 255. The DBSCAN algorithm clusters the vectorized criminals and ordinary people’s tag matrix, and calculates the simi- larity and sorting of the criminals and criminal suspects for each cluster. Then, ordinary people with similarities greater than 0.8 are selected from various clus- ters as criminal suspects. The cosine similarity calculation formula is as follows: cos ( ) θ = n ∑ i 1 = ( x i × y i ) n ∑ i 1 = ( x i 2 ) × n ∑ i 1 = ( y i 2 ) DOI: 10.4236/jsea.2019.123003 42 Journal of Software Engineering and Applications
分享到:
收藏