An Intelligent Model for Online Recruitment Fraud Detection
Abstract
Keywords
1. Introduction
2. Background
2.1. Online Recruitment System
2.2. Knowledge Discovery from Data (KDD)
2.3. Data Mining Tools
3. Literature Review
Data Mining Techniques Related Works
4. Research Methodology
4.1. Based Up on, the Main Questions of This Study Are
4.2. The Proposed Online Recruitment Fraud Detection Model
4.3. Pre-Processing
4.4. Selection Feature
4.5. Classification Algorithm
4.6. Evaluation Parameter
5. Results and Discussion
5.1. Pre-Processing Data Set
5.2. Selection Features
5.3. Ensemble Classification
6. Conclusion and Future Work
Conflicts of Interest
References
Journal of Information Security, 2019, 10, 155-176
http://www.scirp.org/journal/jis
ISSN Online: 2153-1242, ISSN Print: 2153-1234

An Intelligent Model for Online Recruitment Fraud Detection

Bandar Alghamdi, Fahad Alharby
Naif Arab University for Security Sciences (NAUSS), Riyadh, KSA

How to cite this paper: Alghamdi, B. and Alharby, F. (2019) An Intelligent Model for Online Recruitment Fraud Detection. Journal of Information Security, 10, 155-176. https://doi.org/10.4236/jis.2019.103009

Received: January 6, 2019. Accepted: July 8, 2019. Published: July 11, 2019.

Copyright © 2019 by author(s) and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). http://creativecommons.org/licenses/by/4.0/ Open Access

Abstract

This study aims to prevent privacy violations and loss of money for individuals and organizations by creating a reliable model that detects fraud exposure in online recruitment environments. Its main contribution is a reliable detection model that uses an ensemble approach based on the Random Forest classifier to detect Online Recruitment Fraud (ORF). ORF detection is distinguished from other types of electronic fraud detection by its novelty and by the scarcity of studies on the concept. The detection model was proposed to achieve the objectives of this study. A support vector machine method is used for feature selection, and an ensemble classifier using Random Forest is employed for classification and detection. The model is applied to a freely available dataset, the Employment Scam Aegean Dataset (EMSCAD). A pre-processing step is applied before feature selection and classification. The results show an accuracy of 97.41%.
Further, the findings identify the main features and important factors for detection: having a company profile, having a company logo, and the industry feature.

Keywords

Online Recruitment Fraud, Intelligent Model, Privacy

1. Introduction

In the last decade, often called the Internet and social era, the Internet and social media have become integral parts of the modern landscape. Modern organizations make wide use of the Internet and social media in employee recruitment [1]. Recently, the cloud has been integrated into the procedure for recruiting new members, with human resource managers using managed cloud services or solutions. Nevertheless, the wide interest in and adoption of such software have been accompanied by increased risks from scams and fraud [2].

Cybercrime is one of the most dangerous present-day crimes. It threatens the security of individuals and organizations and causes substantial losses [3]. According to a Cybersecurity Ventures report, cybercrime damages are expected to cost the world around $6 trillion annually by 2021 [4]. Thus, there is an urgent need for information security that ensures Confidentiality, Integrity and Availability (CIA) to combat these crimes. This can be achieved by implementing established information security strategies such as prevention, detection, and response.

In Saudi Arabia, Vision 2030 predicts growth in job creation [5]. This growth requires the government to ensure that CIA is built into the job recruitment process to protect individuals and organizations from cybercrime. ORF is a type of cybercrime that has appeared recently. It violates the privacy and finances of individuals and organizations by exploiting Internet technology and web services, and it allows illegitimate users to damage the reputations of organizations [6].

Data mining methods have contributed to data analysis, knowledge mining, and the prediction and detection of cybercrime. They can be used to create an intelligent model that is very effective in detecting network fraud and scams. ORF compromises the privacy of job seekers and harms the reputation of organizations. Moreover, it causes loss of money for individuals. It occurs when criminals post fake ads that exploit automated recruitment to trap job seekers. To the best of our knowledge, only one empirical study has been conducted so far on this kind of cybercrime.
This is the only research that has examined this online fraud issue and developed a new solution for detection. Vidros, Kolias, Kambourakis, & Akoglu (2017) added many ORF features to the public dataset (EMSCAD) and recommended that the research community find a reliable ORF detection model. Thus, a new reliable model is needed that enhances classification performance through the pre-processing and feature selection phases [2].

2. Background

2.1. Online Recruitment System

The online recruitment system uses web-based tools, over the public Internet or an intranet, to recruit staff [7]. Recruitment contributes to a firm's success by obtaining the best applicants in a short time, highlighting the professional requirements, assessing applicants via interviews and welcoming the newly selected employees. Furthermore, online recruitment makes the hiring process more affordable and productive. The critical components of online recruitment include tracking the status of candidates, the employer's website, job portals, online testing, and social networking [8]. The advantages of e-recruitment include effectiveness, high value, ease of use, and efficiency [9]. The literature highlights various vital benefits of e-recruitment, such as time reduction, cost reduction, reaching large numbers of employers and candidates, filter functions, and development of brand image.

Prasad & Kapoor (2016) define two methods of online recruitment: posting the company's profile and job requirements on job portals, and creating an online recruitment page on the company's website [10].

2.2. Knowledge Discovery from Data (KDD)

Knowledge Discovery from Databases (KDD) can be defined as the recognition and extraction of valuable, genuine, useful, unique, and comprehensible correlations or patterns in data [11]. Figure 1 shows the conventional model of the KDD process, which consists of a sequence of processes, each depending on the success and output of the preceding phase. The model also reflects the iterative nature of the process, with various feedback loops indicating revision activities [12].

The KDD model includes five main phases: Selection, which draws on relevant prior knowledge to acquire or create the target dataset; Preprocessing, which handles missing values, noise and errors in the data; Transformation, which puts the dataset into a form suitable for data mining algorithms; Data mining, the decision-making activity that defines models such as regression, classification or clustering to obtain patterns of interest, representational forms, or rule sets and trees; and Interpretation and Evaluation of the patterns and models with respect to their validity, together with their visualization.

Web mining investigates the quantities of data available on the World Wide Web by extracting information from web documents and services. Web content is very dynamic because of its rapid growth and frequent updates [13]. Text mining, or text classification, refers to text analytics that extracts high-quality information from digital text [14].
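The KDD phases described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the four toy records and their field names are invented and are not taken from EMSCAD, and the "data mining" step is a deliberately trivial one-rule learner rather than the model proposed in this paper.

```python
from collections import Counter

# Invented toy records; the field names are hypothetical, not EMSCAD attributes.
RAW_ADS = [
    {"title": "Data entry clerk", "has_logo": None, "has_profile": False, "fraudulent": True},
    {"title": "Software engineer", "has_logo": True, "has_profile": True, "fraudulent": False},
    {"title": "Work from home", "has_logo": False, "has_profile": False, "fraudulent": True},
    {"title": "Accountant", "has_logo": True, "has_profile": True, "fraudulent": False},
]


def select(records, fields):
    """Selection: keep only the attributes relevant to the task."""
    return [{f: r.get(f) for f in fields} for r in records]


def preprocess(records, default=False):
    """Preprocessing: replace missing values with a default."""
    return [{k: (default if v is None else v) for k, v in r.items()} for r in records]


def transform(records, target):
    """Transformation: encode booleans as 0/1 feature rows plus a label list."""
    features = sorted(k for k in records[0] if k != target)
    rows = [[int(r[f]) for f in features] for r in records]
    labels = [int(r[target]) for r in records]
    return features, rows, labels


def mine(rows, labels, column=0):
    """Data mining: learn the majority class for each value of one feature."""
    votes = {}
    for row, label in zip(rows, labels):
        votes.setdefault(row[column], Counter())[label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}


def evaluate(rule, rows, labels, column=0):
    """Evaluation: accuracy of the rule (here measured on the training data only)."""
    correct = sum(rule.get(row[column]) == label for row, label in zip(rows, labels))
    return correct / len(labels)


selected = select(RAW_ADS, ["has_logo", "has_profile", "fraudulent"])
cleaned = preprocess(selected)
features, rows, labels = transform(cleaned, target="fraudulent")
rule = mine(rows, labels)
accuracy = evaluate(rule, rows, labels)
print(features, rule, accuracy)
```

Each function consumes the output of the previous one, which mirrors the dependency between phases; a realistic study would evaluate on held-out data and loop back to earlier phases when the results are unsatisfactory.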
Classification, clustering and regression. Classification is a supervised learning method: it uses labelled datasets and involves a learning and training phase that assigns data to multiple classes based on attributes derived from the dataset [15]. Clustering groups items into classes based on the similarity among them. It can be used as a preprocessing phase for selection tasks [16]. Regression is a predictive statistical technique for numerical and continuous prediction; through a training process it determines the correlation among various attributes [17].

Figure 1. The conventional KDD model [2].

2.3. Data Mining Tools

Several open source tools are available for data mining, such as RapidMiner, Weka, and Orange. RapidMiner is an open source tool that uses the client-server model and handles various data file formats [18]. The Waikato Environment for Knowledge Analysis (Weka) is a machine learning tool launched by the University of Waikato [19] that offers a comprehensive range of data preprocessing and data modelling algorithms [18]. Orange is a very powerful open source data mining tool launched by the University of Ljubljana; it provides a wide range of widgets in its toolbox for visualization and data analysis [18].

Data mining uses different methods for trend, pattern and prediction tasks. Among the main methods deployed for data mining is classification, which finds the basic rules for assigning items to predefined classes and is regarded as a predictive task [20].

3. Literature Review

There is a rich literature on cybercrime detection models in different fields. However, only two studies, one descriptive and one empirical, have addressed fraud and scams in online recruitment. Related work has mostly studied data mining techniques for other detection purposes; few research efforts have addressed online recruitment fraud.

Vidros et al. (2016) examined the frauds to which job seekers are exposed through online recruitment services. They found that ORF is a new field with severe current vulnerabilities.
Three current methods of fraud and scams in online recruitment are fake job advertisements, economic trickery using fake job advertisements published on online recruitment boards, and reputable, real business enterprises publishing fake vacancy announcements [21].

Data Mining Techniques Related Works

Many studies have applied data mining. Yasin & Abuhasan (2016) provided an intelligent classification model to detect phishing emails using knowledge discovery, data mining and text processing techniques. A model based on knowledge discovery (KD) was proposed to build an intelligent email classifier that classifies a new email message as legitimate or spam. The model achieved high accuracy in classifying phishing emails and outperformed other schemes. The Random Forest algorithm and J48 achieved 99.1% and 98.4% accuracy, respectively. The MLP classifier achieved a TP rate of 0.977, an FP rate of 0.026, and an ROC area of 0.987. The results confirmed that the proposed model achieves high accuracy in classifying phishing emails [22].

Al-garadi et al. (2016) investigated cybercrime detection in online communications, especially cyberbullying on Twitter. The main aim was to develop a number of unique features derived from Twitter, covering network, activity, user, and tweet content. A model to detect cyberbullying on Twitter was proposed using engineered features. The number of friends (followers), the number of users being followed (following), the following-to-followers ratio, and account verification status were collected through a survey. User activity features were also employed to measure a user's online communication activity. The user features included personality, gender and age. Naïve Bayes (NB), support vector machine (SVM), Random Forest and KNN were applied. Random Forest achieved an F-measure of 93%. The results indicate that the proposed model provides a suitable solution for detecting cyberbullying in online communication environments [23].

Sharaff, Nagwani, & Swami (2015) investigated the impact of feature selection on email classification by studying the effect of two feature selection methods, Chi-square (χ2) and information gain. They compared the Bayes algorithm, the tree-based algorithm J48 and the support vector machine. The best overall results were obtained with SVM classification without any feature selection. Feature selection had no effect on Naïve Bayes. J48 showed a slight improvement with feature selection, and information gain performed better than Chi-square [24].
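As an illustration of Chi-square (χ2) feature scoring, the following minimal sketch computes the standard Pearson statistic for one binary feature against a binary class label. It is an assumption-laden illustration rather than the procedure of any cited study: the function name chi_square_score is hypothetical, and both inputs are assumed to be encoded as 0/1.

```python
from typing import Sequence


def chi_square_score(feature: Sequence[int], label: Sequence[int]) -> float:
    """Pearson chi-square statistic for a 0/1 feature against a 0/1 class label."""
    if len(feature) != len(label):
        raise ValueError("feature and label must have the same length")
    n = len(feature)
    # Observed counts of the 2x2 contingency table, indexed as observed[feature][class].
    observed = [[0, 0], [0, 0]]
    for f, c in zip(feature, label):
        observed[f][c] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in (0, 1)]
    score = 0.0
    for f in (0, 1):
        for c in (0, 1):
            # Expected count if the feature and the class were independent.
            expected = row_totals[f] * col_totals[c] / n
            if expected > 0:
                score += (observed[f][c] - expected) ** 2 / expected
    return score


# Invented toy vectors: the first feature mirrors the class, the second ignores it.
toy_label = [1, 1, 1, 1, 0, 0, 0, 0]
print(chi_square_score([1, 1, 1, 1, 0, 0, 0, 0], toy_label))
print(chi_square_score([1, 0, 1, 0, 1, 0, 1, 0], toy_label))
```

A feature selection step would compute this score for every candidate feature and keep the highest-ranked ones; a feature that is independent of the class scores zero.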
Sornsuwit & Jaiyen (2015) created an intrusion detection model based on ensemble learning for User to Root (U2R) and Remote to Local (R2L) attacks. Ensemble learning was used to detect network intrusions, and a correlation-based algorithm was used to reduce redundant features. This can improve classifier accuracy on the U2R and R2L attacks in the KDD Cup'99 intrusion detection dataset. They applied the AdaBoost algorithm to construct a strong classifier as a linear combination of weak classifiers. Naïve Bayes was used to compute the posterior probability for each class and to determine the appropriate class of unseen data. A multilayer perceptron (MLP) network was used to map the input space to a hidden space and the hidden space to the output space. The support vector machine (SVM) approach was used to solve classification problems based on an optimal hyperplane in a high-dimensional space. The results show that reducing features improves the efficiency of attack detection for many weak classifiers [25].

Gaikwad & Thool (2015) applied the bagging ensemble method of machine learning to provide a novel intrusion detection technique. Two instruments were used, based on five modules: feature selection, REPTree design, construction of the main classifier, packet sniffer and detector. In addition, they proposed an intrusion detection system based on the bagging ensemble method. A weak classifier was used as the base learner to improve classification accuracy. The bagging ensemble technique provided the highest classification accuracy, 99.67%. They also found that the model building time and false positives were lower than with the AdaBoost algorithm using Decision Stump base classifiers. The results confirm that the bagging ensemble with REPTree gives the highest classification accuracy. One advantage of bagging is that it takes less time to build the model. The proposed ensemble method also yields fewer false positives than other machine learning techniques [26].

Zuhaira, Selmat, & Salleh (2015) investigated the effect of feature selection on phishing website detection by examining how the feature selection approach affects classification performance. An empirical test was conducted on a specific test-bed to extract a large number of hybrid features. Four feature selection algorithms (FSAs) were used: CBF, WFS, χ2, and IG. Classification models were compared to quantify how the selected feature subsets shift the detection accuracy, specificity and sensitivity of the classification model towards the best rates. Some feature selection methods significantly outperformed their competitors, showing better robustness, prediction, and performance. The experiment showed a significant improvement in detection accuracy with low latency, along with observable gains in sensitivity, robustness and predictability. The study contributed the best possible feature subset for robust selection and effective phishing detection [27].

Nizamani et al. (2014) proposed a fraudulent email detection model based on advanced feature selection. The J48 classification algorithm was used for its simplicity and inductive nature. A support vector machine (SVM) was used to transform non-linearly separable data into linearly separable data by means of the kernel trick. The Naïve Bayes (NB) classification algorithm was used to calculate the probabilities of the feature values for each classification category. A cluster-based classification model (CCM) was applied to perform classification by grouping data points based on obvious features. The dataset contains 8000 emails in total. Frequency-based features attain high accuracy for fraudulent email detection regardless of the classification method. Using features extracted from the content of the emails, the model achieved accuracy as high as 96%. The results showed that accuracy was affected by the type of features chosen rather than by the type of classifier [28].

Shrivas & Dewangan (2014) presented an ANN-Bayesian Net-GR approach that combines an Artificial Neural Network (ANN) and a Bayesian Net with Gain Ratio (GR) feature selection. The authors used Classification and Regression Tree (CART) to build a binary decision tree by splitting the records at each node according to a function of a single attribute. The ANN was used to mine data for classification. An ensemble approach was used to build a hybrid model that improves classification accuracy, and feature selection was used to overcome bias. To reduce irrelevant features and improve classification accuracy, various classification techniques were applied to the NSL-KDD and KDD99 datasets. The proposed model achieved an accuracy of 99.42% on the KDD99 dataset and 98.07% on the NSL-KDD dataset [29].

Balakrishnan, Venkatalakshmi, & Kannan (2014) studied an intrusion detection system using feature selection and classification techniques. The study aimed to provide an intrusion detection system that deals with possible attacks. The authors adopted various techniques, with the data taken from the KDD'99 Cup dataset. A rule-based classifier was used to make effective decisions on intrusions, alongside a support vector machine for binary classification and regression estimation tasks. The proposed algorithm for optimal feature selection calculates the information gain ratio of each attribute. It selected only the important features, which reduced the time taken to detect and classify a record. The rule-based classifier and the support vector machine helped achieve greater accuracy. The resulting intrusion detection system reduced both the false positive rate and the computation time [30].
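The information gain ratio used for attribute selection in several of the studies above can be sketched as follows. The sketch assumes the standard C4.5-style definition (information gain divided by split information) and discrete feature values; the cited studies may use variants, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature: Sequence[Hashable], label: Sequence[Hashable]) -> float:
    """Information gain of the feature about the label, divided by split information."""
    if len(feature) != len(label):
        raise ValueError("feature and label must have the same length")
    n = len(label)
    # Group the class labels by feature value to obtain the conditional entropy.
    groups = {}
    for f, c in zip(feature, label):
        groups.setdefault(f, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(label) - conditional
    # Split information penalizes features that take many distinct values.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns: dict, label: Sequence[Hashable]) -> list:
    """Return feature names sorted from highest to lowest gain ratio."""
    return sorted(columns, key=lambda name: gain_ratio(columns[name], label), reverse=True)
```

Ranking every attribute with rank_features and keeping only the top few is one simple way to reduce dimensionality before classification; a constant feature has zero split information and is scored 0.0 here by convention.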
Riyad & Ahmed (2013) designed an ensemble classification approach for intrusion detection. The support vector machine (SVM) algorithm was used to classify new data by dividing the feature space into subspaces, the Random Forest tree predictor was used to construct trees from different bootstrap samples, and an Artificial Neural Network was used to process information. Combining these models can increase prediction accuracy over a single model. Each classification from the base algorithms is given a weight between 0 and 1 depending on its accuracy. The study indicated that the ensemble method is one of the main developments in the field of machine learning [31].

Ozarkar & Patwardhan (2013) implemented efficient spam classification using the Random Forest and Partial Decision Trees algorithms to classify spam vs. non-spam emails. A Chi-square test was used to decide whether effects were present. The Information Gain measure was applied to capture the reduction in entropy caused by partitioning the examples according to a given attribute. In addition, the Symmetrical Uncertainty measure was employed as a measure of feature-feature inter-correlation. The authors also used the OneR algorithm to infer a rule that predicts the class given the values of the attributes. The best accuracy, 99.918%, was obtained with Random Forest, which is 9% better than previous spam base approaches; an accuracy of 96.416% was also reported. The results showed that using the Random Forest and Partial Decision Trees algorithms to classify spam is more effective than the other implemented algorithms in terms of accuracy and time complexity [32].

Rathi & Pareek (2013) used data mining to investigate spam mail detection, analyzing various data mining approaches on a spam dataset to find the best classifier for email classification. A support vector machine was used mainly for classification. A Naïve Bayes classifier was also used; it assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature, given the class variable. Moreover, a feature selection method was used to remove irrelevant and redundant features from the data. The results showed a promising accuracy of 99.715% for the Random Tree classifier with the best-first feature selection algorithm; an accuracy of 90.93% was also reported [33].

4. Research Methodology

4.1. Based Up on, the Main Questions of This Study Are

Q1: How can the relevant features for Online Recruitment Fraud be determined?
Q2: What is the best classification algorithm for Online Recruitment Fraud?
Q3: Is the ensemble approach suitable for detecting Online Recruitment Fraud?

The key purpose of this research is to protect individuals and organizations from compromised privacy and loss of money by constructing a reliable model that detects fraud exposure in online recruitment environments. To achieve this objective, the research applies algorithms to detect this behavior.
This research aims to achieve the following objectives:
• To make use of the unique existing online recruitment dataset, EMSCAD, pre-processing the data to enhance the accuracy of the model.
• To determine the relevant features by applying feature selection techniques, which help reduce dimensionality.
• To build a reliable model that detects fraudulent ads effectively and with the highest accuracy.

This research is an empirical study based on observation, testing and evaluation. The Weka tool is used to implement the proposed model and evaluate its performance. The proposed model follows the steps below to solve the research problem. It involves three main stages:
• First stage: Pre-processing, conducted on EMSCAD.
• Second stage: Feature selection, where the support vector machine (SVM)