Application of Cluster-Based in AML.pdf-资料库

Application of Cluster-Based Local Outlier Factor Algorithm in Anti-Money Laundering Gao Zengan Post Doctoral Station of Theoretical Economics China Center for Anti-Money Laundering Studies Fudan University Shanghai, P. R. China School of Economics and Management Southwest Jiaotong University Chengdu, P. R. China E-mail address: gaozengan133@163.com institutions’ capability Abstract—Financial in recognizing suspicious money laundering transactional behavioral patterns (SMLTBPs) is critical to anti-money laundering. Combining distance-based unsupervised clustering and local outlier detection, this paper designs a new cluster-based local outlier factor (CBLOF) algorithm to identify SMLTBPs and use authentic and synthetic data experimentally its applicability and effectiveness. test to Keywords-clustering; outlier detection; local outlier factor (LOF); suspicious money laundering transactional behavioral patterns (SMLTBPs); anti-money laundering (AML) I. INTRODUCTION Anti-money laundering (AML) in financial industry is based on the analysis and processing of Suspicious Activity Reports (SARs) filed by financial institutions (FIs), but the very large number of SARs usually makes financial intelligence units’ (FIUs’) analysis a waste of time and resources simply because only a few transactions are really suspicious in a given amount [1], so financial AML is far from a real-time, dynamic, and self-adaptable recognition of suspicious money laundering transactional behavioral patterns (SMLTBPs). Literature review finds that artificial intelligence [2], support vector machine (SVM) [3], outlier detection [4], and break-point analysis (BPA) [5] are used to improve FIs’ ability in processing suspicious data, various approaches to novelty detection on time series data are examined in [6], outlier detection methodologies are surveyed by [7], and a data mining-based framework for AML research is proposed in [8] after a comprehensive comment is made on relative studies. But the effectiveness and efficiency of SMLTBP identification remains a hot spot for research since the passage of the USA Patriot Act and the creation of the U.S. Department of Homeland Security signaled a new era in applying information technology and data mining in detecting money laundering and terrorist financing [9]. As SMLTBP recognition is short of training data, the number of clusters is usually unknown, and the result of clustering is always changing dynamically [10, 11], this paper designs a cluster-based local outlier factor (CBLOF) algorithm to help FIUs concentrate on a desirable number of SMLTBPs having a proper degree of suspiciousness as determined by their actual needs and resources endowments. Following the introduction, Section II describes the design of the algorithm, Section III is about the experimental process, and Section IV ends the paper with a suggestion for future research. II. ALGORITHM DESIGN combines algorithm The CBLOF distance-based unsupervised clustering and local outlier [12] detection, and clustering is for the purpose of pre-processing data for the consequent anomaly identification. A. Clustering As far as the nature of money laundering (ML) is concerned, the chosen clustering algorithm should be able to generate the number of clusters automatically (with no need for pre-establishment) and all the clusters are to be ranked according to the number of the components in each. Thus we propose the following procedures: 1) Start with any object (say p) in a dataset and create a cluster. The initial cluster is supposed to be C1. 2) Choose any other object q, calculate its distance to the existing clusters C1, C2, C3, …, Ci and denote it by distance q Ci , and then figure out the minimal distance ( , value ( , distance q Cmin . ) ) 3) Let the threshold be ε. If distance q Cmin ε≤ holds and “q has never been clustered” satisfies, add q to the cluster Ci which is assumed to be nearest to q when compared with all ( , ) The research is supported by the National Social Science Foundation of China (No. 08BGJ013). 978-1-4244-4639-1/09/$25.00 ©2009 IEEE

( , distance q Cmin ε> , the other known clusters. Conversely, if implying that q has not yet been clustered into any category, build a new cluster Cj and embed q into it. Nevertheless, distance q Cm ε≤ , suppose there is a cluster Cm outside Ci, if then integrate Cm and Ci into one cluster, that is, if we have distance q Cm ε≤ and distance q Ci ε≤ simultaneously, then clusters Cm and Ci are merged into one. ( , ( , ( , ) ) ) ) 4) Repeat Steps 2) and 3) until all the objects have been clustered. 5) Rank all the clusters in the decreasing number of their components involved. B. Outlier Detection An outlier is a point that deviates so much from surrounding “normal” points as to arouse suspicion that it was generated by a different mechanism. After clustering, all the samples have been categorized into mutually exclusive clusters ranked as per the number of their components. As most transactions in an account are usually normal or legal, the clusters generated from above are divided into Large Category (LC) and Small Category (SC) in this paper, with the former being supposed to represent normal transactional behavioral patterns free of ML suspicion and the latter, on the contrary, for anomalous patterns worth notice. + 1 C b for LC, that is, } b c i { i > b } ≤ = where and c = . While (1) represents the majority of the SC c j { j objects in the dataset, (2) indicates that the number of LC components is greatly different from that of SC components. c i i LC = ≤ b { } Furthermore, the points in SC are all outliers when compared with those in LC [13, 14]. But for AML research, seasonal industries and some special industries must be exempted because abnormal phenomena in a particular period can never be treated as ML red flags. So the paper will study n number of data points with top local outlier factor (LOF) values because they are more of ML suspicion. Also, this can effectively improve AML pertinence. In the light of the local outlier definition in [12], LOF can be employed to measure the deviant degree of SC points from LC, i.e., how far the transactional behavioral patterns represented by the points in SC deviate from the normal or legitimate patterns, where LOF value is determined by the number of the components in the clusters sample data belong to and the distance from sample data to the nearest LC. Given a point o in the dataset, its LOF value is: LOF( o ) = ∗ c min[distance( o,c )] i i (3) 2 k , c } of c the , dataset D, For { = c c , 1 clustering c c and 1 let > . Given any two result > C parameters α and β, we have: + ≥ (2) β (1) D α = ∑ c C b T a c b ta + + ≥ ∗ c ij = n 1 2 1 2 k i j j o ∈ ∈ LC c c , i i the farther ∈ SC c , the point o deviates from . The higher the LOF value where the normal is, transactional behavioral patterns. Once the LOF value is fixed for each object, we can get to know how suspicious the transactional behavioral patterns are in the given account. Rank the data points as per their LOF values, we can get a feature-oriented ordering of SMLTBPs to help FIs choose n number of objects as they like for a detailed exploration and finally determine what to file to FIUs under the restraints of investigation resources in labor, capital, and technology, etc. III. EXPERIMENTAL PROCESS The experimental process mainly includes extracting research variables, preparing data samples, actualize the algorithm, and discussing the experimental results. A. Choose and Define the Features to Be Studied We are more interested in the transactional behavioral attributes like amount and frequency than in the account owner’s subjective characters, transaction amount, transaction amount deviation coefficiency, and transaction frequency (i.e., withdrawal frequency and deposit frequency) are chosen to be research variables with the following definitions: thus Definition 1: Transaction amount (Tai) is the total amount of all the transactional segments or subsequences, that is, (4) where, Tai is the transaction amount of the ith transactional segment of a given account, Taij is the amount of the jth transaction in the ith segment (and so and so forth hereinafter). Transaction amount is a critical criterion for us to determine whether a transaction is suspicious or not since a large cash transaction is viewed as a special kind of suspicious transaction in this research. Definition 2: Transaction amount deviation coefficiency 2) (Tadi) is the ratio between transaction amount variation (Tsi and average transaction amount ( 2 Ts i = Tad i = n i − n i 1 • Ta i iTa ), that is, ∑ Ta i ta ij − ( ) in = 1 j n ∑ = 1 j ta ij 2 (5) Tadi is used to measure the degree of equalization of transaction amount (i.e., structuring) which means a large cash transaction, either deposit or withdrawal, is purposefully divided into several transactions of a nearly equal amount in order to exempt from filing Currency Transaction Reports (CTRs) as required by the authority. The less the Tadi value is, the more equalized the transaction amount is, and the more suspicious the transaction is as far as CTR regulations are concerned. Definition 3: Withdrawal/deposit frequency is the ratio of

the number of withdrawal/deposit transfers to the aggregated frequency of transactions. Denote the withdrawal the deposit frequency of the ith transactional segment of a given account by Tfwi and Tfdi, respectively, and denote the total frequency of transactions by tf, we have the following formulas: frequency and t f = T f w T f d i i t f w = i t f w + i t f d i = t f d i t f t f (6) is capital centralized Analyzing withdrawal frequency and deposit frequency can identify two novel capital flows within a short time frame: followed by one in-transfers decentralized capital out-transfers, and the other is decentralized capital in-transfers followed by centralized out-transfers. Also, this analysis can compensate for the research of [5]. B. Prepare Data Samples Just like the authors of [6], we are most interested in data patterns that deviate from the normal operational data. So historical transaction records are to be transformed into several segments or subsequences of neighboring single transactions, with one segment (subsequence) representing one behavioral pattern, and the transactional data embedded in SMLTBPs are just the suspicious objects we hope to find out. For each feature as above mentioned, calculate its feature value for each segment and take the feature vectors composed of feature values as research samples. In this research, we have collected from 108 accounts of one commercial bank 34,303 authentic transactional data from January 1 through October 30, 2006 out of which the account data of 25 firms in 4 industries sharing similar turnover scale and transactional frequency are taken as experimental samples. Meanwhile, twenty segments of synthetic data are generated by the mechanism in Figure 1 [15, 16] to test the applicability of the algorithm in detecting abnormal objects. Each segment of artificial simulation data is employed twice. As per the Regulations for Financial Institutions to File Currency Transaction Reports and Suspicious Transaction Reports of the People’s Bank of China, ten days is accepted as transactional data. After the standard pre-processing the subsequences, and extracting the feature values, we obtain 696 experimental samples of 25 accounts, of which only 40 synthetic data samples are listed in Table 1 due to the limit of the paper. to the experimental data, segmenting segment C. Actualize the Algorithm Do experiment with the CBLOF algorithm on the sample set. As global outlier detection cannot mine all the outliers [12], give LOF value to each sample, and then identify n number of samples with the highest LOF values for further investigation and final reporting. D. Experimental Results and Discussions Let clustering threshold ε ＝ 0.15 and categorization parameters α＝75％ and β＝4, we will first of all standardize the dimensions of data samples, and then program with C++ language, cluster, categorize LC and SC, and compute LOF values of transaction segments. Once more only a part of the experimental results are shown in Table 2 due to the limit of the paper, where only the five samples with top LOF values are listed for each account. They are the five transactions with the highest degree of suspiciousness, as well. Figure 1. Synthetic Data Simulation Method

REFERENCES [1] H. G. Goldberg and R. W. H. Wong. “Restructuring transactional data for link analysis in the FinCEN AI System”. In: Proceedings of 1998 AAAI Fall Symposium on Artificial Intelligence and Link Analysis, Menlo Park, C. A.: AAAI Press, 1998. [2] H. G. Goldberg and T. E. Senator. “Restructuring database for knowledge discovery by consolidation and link formation”. In: Proceedings of the First International Conference on Knowledge Discovery in Database (KDD-95), Menlo Park, Calif.: AAAI Press, 1995: 136-141. J. Tang and J. Yin. “Developing an intelligent data discriminating system of antimoney laundering based on SVM”. In: Proc. of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 2005: 3453-3457. [3] [4] R. C. Watkins, K. M. Reynolds, R. Demara, M. Georgiopoulos, A. Gonzalez, and R. Eaglin. “Tracking dirty proceeds: exploring data mining technologies as tools to investigate money laundering”. Police Practice and Research, 2003, 4 (2): 163-178. [5] R. J. Bolton and D. J. Hand. “Unsupervised profiling methods for fraud detection”. Conference on Credit Scoring and Credit Control VII, Edinburgh, UK, 2001: 5-7. J. Anstey, D. Peters, and C. Dawson. “Discovering novelty in time series data”. In: Proceedings of the 15th Annual Newfoundland Electrical and Computer Engineering Conference, Canada, 2005. [6] [7] V. Hodge and J. Austin, “A survey of outlier detection methodologies.” Artif. Intell. Rev., 2004, 22 (2): 85-126. [8] Z. Gao and M. Ye. “A framework for data mining-based anti-money laundering research”. Journal of Money Laundering Control, 2007, 10 (2): 170-179. J. S. Zdanowicz. “Detecting money laundering and terrorist financing via data mining”. Communications of the ACM, 2004, 47 (5): 53–55. [9] [10] J. Han, Y. Huang, and N. Cercone. “Data mining: an overview from database perspective”. IEEE Transaction on Knowledge and Data Engineering, 1996, 8 (6): 1-40. [11] J. Han and M. Kamber. Data Mining: Concepts and Technique. Morgan Kaufmann Publishers, 2001. [12] M. M. Breuning, H. P. Kriegel, T. Ng. Raymond, and J. Sander. “LOF: identifying density-based local outliers”. In: Proc. ACM SIGMOD 2000. Int. Conf. on Management of Data. Dallas, TX, 2000: 93-104. [13] Z. He, X. Xu, and S. Deng. “Discovering cluster-based local outliers”. Pattern Recognition Letters. 2003, 24 (9-10): 1641-1650. [14] M. F. Jiang, S. S. Tseng, and C. M. Su. “Two-phase clustering process for outliers detection”. Pattern Cognition Letters, 2001, 22: 691-700. [15] E. Lundin, H. Kvarnström, and E. A. Jonsson. “Synthetic fraud data generation methodology”. In: Lecture Note in Computer Science, ICICS2002. Laboratories for Information Technology, Singapore, Springer Verlag, 2002: 265-277. [16] E. Barse, H. Kvarnström, and E. Jonsson. “Synthesizing, test data for fraud detection systems”. In: Proceedings of the 19th Annual Computer Security Applications Conference. 2003: 384-395. [17] Z. Gao. “Comments on anti-money laundering suspicious activity report regime”. In: Proceedings of 2006 International Conference on Public Administration. X. Zhu and S. Zhao, Eds. Chengdu: University of Electronic Science and Technology of China Press, 2006: 970–975. What is more, all the anomalous samples in the 25 accounts are ranked in the decreasing order of their LOF values. Particularly, 33 of 40 artificial samples are found with relatively higher LOF values and they even rank the top five of their accounts, which proves the effectiveness of the CBLOF algorithm in identifying suspicious data. IV. CONCLUSION Making a good use of the advantages of both distance-based unsupervised clustering and local outlier detecting, the CBLOF algorithm can effectively identify the synthetic data suspicious of ML transactions with a high processing speed and a satisfactory accuracy. Needing neither prior samples to serve as training data nor the number of clusters to be designated in advance can solve the problem that AML research is always in short of case data. In particular, the algorithm is self-adaptable to the evolution of ML methods and can recognize SMLTBPs that haven’t been detected before, which is quite beneficial in saving limited investigation resources and preventing FIs from filing defensive SARs [17]. However, only a few transactional behavioral features of amount and frequency are studied in this paper, so relative subjective characters of the account owner remains open to our future research.

资料库

Application of Cluster-Based in AML.pdf

相关推荐

行业

热门标签

最新资料