Journal of Information Security, 2019, 10, 155-176
http://www.scirp.org/journal/jis
ISSN Online: 2153-1242
ISSN Print: 2153-1234
An Intelligent Model for Online Recruitment
Fraud Detection
Bandar Alghamdi, Fahad Alharby
Naïf Arab University (NAUSS), Riyadh, KSA
How to cite this paper: Alghamdi, B. and
Alharby, F. (2019) An Intelligent Model for
Online Recruitment Fraud Detection. Jour-
nal of Information Security, 10, 155-176.
https://doi.org/10.4236/jis.2019.103009
Received: January 6, 2019
Accepted: July 8, 2019
Published: July 11, 2019
Copyright © 2019 by author(s) and
Scientific Research Publishing Inc.
This work is licensed under the Creative
Commons Attribution International
License (CC BY 4.0).
http://creativecommons.org/licenses/by/4.0/
Open Access
Abstract
This study aims to prevent privacy violations and loss of money for individuals
and organizations by creating a reliable model that can detect fraud exposure
in online recruitment environments. Its major contribution is a reliable
detection model that uses an ensemble approach based on a Random Forest
classifier to detect Online Recruitment Fraud (ORF). ORF detection is
distinguished from other types of electronic fraud detection by its novelty and
the scarcity of studies on the topic. A detection model is proposed to achieve
the objectives of this study. A support vector machine method is used for
feature selection, and an ensemble classifier using Random Forest is employed
for classification and detection. The model is applied to a freely available
dataset, the Employment Scam Aegean Dataset (EMSCAD). A pre-processing step was
applied before feature selection and classification. The results showed an
accuracy of 97.41%. Further, the findings showed that the main features and
important factors for detection include having a company profile, having a
company logo, and the industry feature.
Keywords
Online Recruitment Fraud, Intelligent Model, Privacy
1. Introduction
In the last decade, often called the Internet and social era, the Internet and
social media have become integral parts of the modern landscape. Modern
organizations make wide use of the Internet and social media in employee
recruitment [1]. Recently, the cloud has been integrated into the procedure of
recruiting new members, where managed cloud services or solutions are used
by human resource managers. Nevertheless, the wide interest in and adoption of
such embedded software has been accompanied by increased risks from scams and
fraud [2].
Cybercrime is one of the most serious crimes facing the world today. It
threatens the security of individuals and organizations and causes substantial
losses [3]. According to the Cybersecurity Ventures report for 2021, the cost of
cybercrime damages worldwide is around $6 trillion annually [4]. Thus, there is
an urgent need for information security that ensures Confidentiality, Integrity
and Availability (CIA) to combat these crimes. This can be done by implementing
known information security strategies such as prevention, detection, and
response.
In Saudi Arabia, Vision 2030 predicts growth in job creation [5]. This growth
requires the government to ensure that CIA is built into the job recruitment
process, to protect individuals and organizations from cybercrime. Online
Recruitment Fraud (ORF) is a type of cybercrime that has appeared recently. ORF
violates the privacy and finances of individuals and organizations by exploiting
Internet technology and web services. It also allows illegitimate users to
damage the reputations of organizations [6].
Data mining methods have contributed to data analysis, knowledge mining, and the
prediction and detection of cybercrime. They can be used to create an
intelligent model that is effective in detecting network fraud and scams. ORF
violates the privacy of job seekers and harms the reputation of organizations.
Moreover, it causes individuals to lose money. It occurs when criminals post
fake ads, exploiting automated recruitment to trap job seekers. To the best of
our knowledge, only one empirical study has been conducted on this kind of
cybercrime so far. It is the only research that has examined online fraud issues
in recruitment and developed a new solution for detection.
Vidros, Kolias, Kambourakis, & Akoglu (2017) added many ORF features to the
public dataset (EMSCAD) and recommended that the research community develop a
reliable ORF detection model. Thus, a new reliable model is needed that improves
classification performance through its pre-processing and feature selection
phases [2].
2. Background
2.1. Online Recruitment System
The online recruitment system uses web-based tools, over the public Internet or
an intranet, to recruit staff [7]. Recruitment contributes to a firm’s success
by finding the best applicants in a short time, highlighting the professional
requirements, assessing applicants via interviews and welcoming the newly
selected employees. Furthermore, online recruitment makes hiring more affordable
and productive. The critical components of online recruitment include tracking
the status of candidates, the employer’s website, job portals, online testing,
and social networking [8]. The advantages of e-recruitment include
effectiveness, high value, ease, and efficiency [9]. The
literature highlights several key benefits of e-recruitment, such as time
reduction, cost reduction, reaching large numbers of employers and candidates,
filter functions, and development of brand image.
Prasad & Kapoor (2016) define two methods for online recruitment: posting the
company’s profile and job requirements on job portals, and creating an online
recruitment page on the company’s website [10].
2.2. Knowledge Discovery from Data (KDD)
Knowledge Discovery from Databases (KDD) can be defined as the recognition and
extraction of valuable, genuine, useful, unique, and comprehensible correlations
or patterns in data [11]. Figure 1 summarizes the conventional KDD model, which
consists of a sequence of processes. Each process depends on the success and
output of the prior phase. The model emphasizes the iterative nature of the
process, with feedback loops that indicate revision activities [12].
The KDD model includes five main phases: Selection, which covers identifying
relevant prior knowledge and acquiring or creating the target dataset;
Preprocessing, to handle missing values, noise and errors in the data;
Transformation, to put the dataset into a form suitable for applying data mining
algorithms; Data mining, the decision-making activity that defines models such
as regression, classification or clustering to obtain patterns of interest,
representational forms, or rule sets and trees; and Interpretation and
Evaluation of the patterns and models with respect to their validity, including
their visualization.
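The phases above can be illustrated with a small Python sketch. It is only a toy walk-through under stated assumptions: the records, the field names (`has_logo`, `desc_len`), the length threshold and the one-attribute rule learner are all hypothetical and are not taken from the paper or from EMSCAD.

```python
from collections import Counter, defaultdict

# Hypothetical toy job-ad records; None marks a missing value.
raw = [
    {"id": 1, "has_logo": 1, "desc_len": 900, "label": "legit"},
    {"id": 2, "has_logo": 0, "desc_len": 120, "label": "fraud"},
    {"id": 3, "has_logo": 1, "desc_len": None, "label": "legit"},
    {"id": 4, "has_logo": 0, "desc_len": 80, "label": "fraud"},
    {"id": 5, "has_logo": 1, "desc_len": 640, "label": "legit"},
    {"id": 6, "has_logo": 0, "desc_len": 700, "label": "legit"},
]

def select(records, fields):
    """Selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def preprocess(records):
    """Preprocessing: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, threshold=300):
    """Transformation: discretize description length into a binary flag."""
    return [
        {
            "has_logo": r["has_logo"],
            "long_desc": int(r["desc_len"] >= threshold),
            "label": r["label"],
        }
        for r in records
    ]

def mine(records, feature):
    """Data mining: learn a one-attribute rule (majority label per value)."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[feature]][r["label"]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def evaluate(rule, records, feature):
    """Evaluation: fraction of records the rule labels correctly."""
    hits = sum(rule.get(r[feature]) == r["label"] for r in records)
    return hits / len(records)

data = transform(preprocess(select(raw, ["has_logo", "desc_len", "label"])))
rule = mine(data, "long_desc")
accuracy = evaluate(rule, data, "long_desc")
```

In a real study the evaluation phase would use held-out data rather than the training records; the sketch reuses them only to stay short.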
Web mining investigates the large quantities of data available on the World
Wide Web by extracting information from web documents and services. Web content
is highly dynamic because it grows and is updated rapidly [13]. Text mining, or
text classification, refers to text analytics that extracts high-quality
information from digital text [14]. The main data mining techniques are
classification, clustering and regression. Classification techniques use
labelled datasets in a supervised learning setting and involve a learning and
training phase that assigns data to multiple classes based on attributes derived
from the dataset [15]. Clustering defines similar classes of items based on the
similarity among objects or
Figure 1. The conventional KDD model [2].
items. It can be used as a preprocessing phase for selection tasks [16].
Regression is a predictive statistical technique used for numerical and
continuous prediction; through a training process it determines the correlation
among various attributes [17].
2.3. Data Mining Tools
Several open source tools support data mining, such as RapidMiner, Weka, and
Orange. RapidMiner is an open source tool that uses the client-server model and
handles various data file formats [18]. The Waikato Environment for Knowledge
Analysis (Weka) is a machine learning tool launched by the University of
Waikato [19]; it offers a comprehensive range of data preprocessing and data
modelling algorithms [18]. Orange is a powerful open source data mining tool
launched by the University of Ljubljana; it provides a wide range of widgets in
its toolbox for visualization and data analysis [18].
Data mining uses different methods for trend, pattern and prediction tasks. The
main method deployed is classification, which is based on finding the basic
rules that distribute items into specific defined classes and is considered a
predictive task [20].
3. Literature Review
There is a rich literature on cybercrime detection models in different fields.
However, only two studies, one descriptive and one empirical, have addressed
fraud and scams in online recruitment. Related work has mostly applied data
mining techniques to other detection purposes, and few research efforts have
addressed online recruitment fraud.
Vidros et al. (2016) identified the frauds to which job seekers are exposed
through online recruitment services. They found that ORF is a new and currently
severe area of vulnerability. Three current methods of fraud and scams in online
recruitment are fake job advertisements, economic trickery using fake job
advertisements published on online recruitment boards, and reputable, real
business enterprises publishing fake vacancy announcements [21].
Data Mining Techniques Related Works
Many studies have applied data mining. Yasin & Abuhasan (2016) provided an
intelligent classification model to detect phishing emails using knowledge
discovery, data mining and text processing techniques. A model based on
knowledge discovery (KD) was proposed to build an intelligent email classifier
that classifies a new email message as legitimate or spam. The knowledge
discovery model achieved high accuracy in classifying phishing emails and
outperformed other schemes. The Random Forest and J48 algorithms achieved
accuracies of 99.1% and 98.4% respectively. With the MLP classifier, the TP
rate and FP rate were 0.977 and 0.026 respectively, while MLP achieved an ROC
area of 0.987. The results of this study confirmed that the proposed model
achieves high rates of accuracy in the classification of phishing e-mail [22].
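The TP rate, FP rate and accuracy figures quoted above are all derived from the counts of a binary confusion matrix. A minimal Python sketch of those standard definitions follows; the toy labels and the function name `rates` are illustrative assumptions, not data or code from the cited study.

```python
def rates(y_true, y_pred, positive="phishing"):
    """Return TP rate, FP rate and accuracy for a binary task.

    Assumes both the positive and the negative class occur in y_true.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive correctly flagged
            else:
                fn += 1  # positive missed
        else:
            if pred == positive:
                fp += 1  # negative wrongly flagged
            else:
                tn += 1  # negative correctly passed
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical toy labels: 4 phishing and 6 legitimate emails.
y_true = ["phishing"] * 4 + ["legit"] * 6
y_pred = ["phishing"] * 3 + ["legit"] + ["phishing"] + ["legit"] * 5
result = rates(y_true, y_pred)
```

A high TP rate together with a low FP rate is what makes a detector usable in practice, which is why studies such as this one report both alongside accuracy.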
Al-garadi et al. (2016) investigated cybercrime detection in online
communications, especially cyberbullying on Twitter. The main aim was to develop
a set of unique features derived from Twitter, covering network, activity, user,
and tweet content. A model to detect cyberbullying on Twitter was proposed using
these engineered features. The number of friends (followers), the number of
users being followed (following), the following-to-followers ratio, and account
verification status were collected through a survey. Users’ activity features
were also employed to measure a user’s online communication activity. The
implemented features also included personality, gender and age. Naïve Bayes
(NB), Support Vector Machine (SVM), Random Forest and KNN were applied. Random
Forest achieved an F-measure of 93%. The results indicate that the proposed
model provides a suitable solution for detecting cyberbullying in online
communication environments [23].
Sharaff, Nagwani, & Swami (2015) investigated the impact of feature selection on
email classification by studying the effect of two feature selection methods,
Chi-square (χ2) and information gain. A comparison was conducted between a Bayes
algorithm, the tree-based algorithm J48 and a support vector machine. The best
overall results were obtained with the SVM classifier without any feature
selection. Feature selection had no effect on Naïve Bayes. Further, J48 showed a
slight improvement with feature selection, and information gain performed better
than Chi-square as a feature selection technique [24].
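Both feature selection scores mentioned above can be computed directly from label counts. The following Python sketch uses the textbook definitions of information gain and the Pearson chi-square statistic for categorical features; the toy spam/ham data and the function names are assumptions for illustration and do not reproduce the cited experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the entropy left after splitting on feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def chi_square(feature, labels):
    """Pearson chi-square statistic of the feature/label contingency table."""
    total = len(labels)
    f_counts = Counter(feature)
    l_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for f, fc in f_counts.items():
        for l, lc in l_counts.items():
            expected = fc * lc / total  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

# Hypothetical toy data: one perfectly informative and one useless feature.
labels = ["spam"] * 4 + ["ham"] * 4
informative = [1, 1, 1, 1, 0, 0, 0, 0]
useless = [1, 1, 0, 0, 1, 1, 0, 0]

ig_informative = information_gain(informative, labels)
ig_useless = information_gain(useless, labels)
chi_informative = chi_square(informative, labels)
chi_useless = chi_square(useless, labels)
```

Either score can be used to rank features and keep the top-scoring ones; both give the useless feature a score of zero here because it is statistically independent of the label.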
Sornsuwit & Jaiyen (2015) created an intrusion detection model based on ensemble
learning for User to Root (U2R) and Remote to Local (R2L) attacks. Ensemble
learning was applied to detect network intrusions, and a correlation-based
algorithm was used to reduce redundant features. This improves classifier
accuracy on the U2R and R2L attacks in the KDD Cup’99 intrusion detection
dataset. They applied the AdaBoost algorithm to construct a strong classifier as
a linear combination of weak classifiers. Naïve Bayes was used to compute the
posterior probability of each class and determine the appropriate class of
unseen data. A Multilayer Perceptron (MLP) network was also used to perform
linear mapping from input space to hidden space and from hidden space to output
space. A Support Vector Machine (SVM) was used to solve the classification
problems based on an optimal hyperplane in a high-dimensional space.
The results of this study show that reducing features contributes to improved
efficiency in detecting attacks when working with many weak classifiers [25].
Gaikwad & Thool (2015) applied the bagging ensemble method of machine
learning in order to provide a novel intrusion detection technique. Two instru-
ments were utilized based on five modules including feature selection, REP Tree
design, and construction of the main classifier, packet sniffer and detector. In
addition, they proposed an intrusion detection system based on the bagging
ensemble method. A weak classifier was used as the base learner to improve the
classification accuracy. The bagging ensemble technique provided the highest
classification accuracy, 99.67%. They also showed that the model building time
and false positives of the method were lower than those of the AdaBoost
algorithm with Decision Stump base classifiers. The results of the study confirm
that the bagging ensemble with REPTree gives the highest classification
accuracy. One advantage of the bagging method is that it takes less time to
build the model. The proposed ensemble method also yields fewer false positives
than other machine learning techniques [26].
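The bagging idea described above (train weak learners on bootstrap samples, then take a majority vote) can be sketched in a few lines of Python. This is a simplified illustration only: it uses a one-dimensional decision stump as a stand-in weak learner rather than the REPTree used in the cited study, and the data, seed and function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Weak learner: the single threshold (and direction) with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for above in (0, 1):  # label predicted when x >= threshold
            errors = sum(
                (above if x >= threshold else 1 - above) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above

def bagging(data, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) examples with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: class 1 when the feature value is 5 or more.
data = [(x, int(x >= 5)) for x in range(10)]
models = bagging(data)
```

Because each stump sees a slightly different sample, their individual thresholds differ; the vote averages out that variance, which is the effect bagging relies on.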
Zuhaira, Selmat, & Salleh (2015) investigated the effect of feature selection on
phishing website detection by examining how the feature selection approach
affects classification performance. An empirical test was conducted on a
specific test-bed to extract a large number of hybrid features. Four feature
selection algorithms (FSAs) were included: CBF, WFS, χ2, and IG. Classification
models were compared to assess how the selected feature subsets shift the
detection accuracy, specificity and sensitivity of the classification model
toward the best rates. Some feature selection methods significantly outperformed
their competitors, exhibiting better robustness, prediction, and performance.
The results of the experiment showed a significant improvement in detection
accuracy with low latency, along with gains in robustness and predictability.
The study contributed the best possible subset of features for robust selection
and effective phishing detection [27].
Nizamani et al. (2014) proposed a fraudulent email detection model based on
advanced feature selection. The J48 classification algorithm was used because of
its simplicity and inductive nature. A Support Vector Machine (SVM) was also
used to transform non-linearly separable data into linearly separable data by
means of the kernel trick. Moreover, the Naïve Bayes (NB) classification
algorithm was used to calculate the probabilities of the feature values for each
classification category. Further, a cluster-based classification model (CCM) was
applied to perform the classification task by grouping the data points based on
obvious features. The dataset contains 8000 emails in total. The frequency-based
features attain high accuracy for fraudulent email detection regardless of the
choice of classification method. The model employed features extracted from the
content of the emails and achieved accuracy as high as 96%. The results showed
that accuracy was affected by the type of features selected rather than by the
type of classifier [28].
Shrivas & Dewangan (2014) presented an ANN-Bayesian Net-GR approach that
combines an Artificial Neural Network (ANN) and a Bayesian Net with the Gain
Ratio (GR) feature selection approach. The authors used Classification and
Regression Tree (CART) to build a binary decision tree by splitting the records
at each node according to a function of a single attribute. The ANN was utilized
to mine data for classification. The ensemble approach was used to build a
hybrid model to improve classification accuracy. Further, a feature selection
approach was used to overcome bias, reduce irrelevant features and improve
classification accuracy. Various classification techniques were applied to the
NSL-KDD and KDD99 datasets. The proposed model achieved an accuracy of 99.42% on
the KDD99 dataset and 98.07% on the NSL-KDD dataset [29].
Balakrishnan, Venkatalakshmi, & Kannan (2014) studied an intrusion detection
system using feature selection and classification techniques. The study aimed to
provide an intrusion detection system that deals with possible attacks. The
authors adopted various techniques, with the data acquired from the KDD’99 cup
dataset. A rule-based classifier was used to make effective decisions on
intrusions, in addition to a support vector machine for binary classification
and regression estimation tasks. A proposed algorithm for optimal feature
selection calculated the information gain ratio for attribute selection. It
selected only the important features, which helped reduce the time taken to
detect and classify each record. The rule-based classifier and the support
vector machine helped achieve greater accuracy. The resulting intrusion
detection system reduced both the false positive rate and the computation
time [30].
Riyad & Ahmed (2013) designed an ensemble classification approach for intrusion
detection. The Support Vector Machine (SVM) algorithm was used to maximize
classification performance by subdividing the feature space into subspaces and
to classify new data. The Random Forest tree predictor was used to construct
trees from different bootstrap samples, and an Artificial Neural Network was
used to process information. Combining these models can increase prediction
accuracy over a single model. Each classification from the base algorithms is
given a weight from 0 to 1 depending on its accuracy. The results of the study
indicated that the ensemble method is one of the main developments in the field
of machine learning [31].
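The weighting scheme described above can be illustrated with a weighted-vote sketch in Python. The summary only states that each base classification receives a weight between 0 and 1 according to its accuracy; summing the weights per label and picking the largest total is an assumption made here for illustration, as are the model names and weight values.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine base-model predictions using per-model weights.

    predictions: {model_name: predicted_label}
    weights:     {model_name: weight in [0, 1]}, e.g. validation accuracy
    """
    scores = defaultdict(float)
    for name, label in predictions.items():
        scores[label] += weights[name]  # each model votes with its weight
    return max(scores, key=scores.get)

# Hypothetical weights, standing in for each model's measured accuracy.
weights = {"svm": 0.9, "random_forest": 0.8, "ann": 0.7}

# Two weaker models outvote the strongest one: 0.8 + 0.7 > 0.9.
decision_a = weighted_vote(
    {"svm": "attack", "random_forest": "normal", "ann": "normal"}, weights
)

# With a much more accurate model, a single vote can win: 0.95 > 0.4 + 0.4.
decision_b = weighted_vote(
    {"a": "attack", "b": "normal", "c": "normal"},
    {"a": 0.95, "b": 0.4, "c": 0.4},
)
```

Compared with a plain majority vote, weighting lets a reliable base model override several unreliable ones, which is one way an ensemble can beat each of its members.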
Ozarkar & Patwardhan (2013) implemented efficient spam classification using the
Random Forest and Partial Decision Trees algorithms to classify spam vs.
non-spam emails. A Chi-square test was used to decide whether effects were
present or not. The Information Gain measure was applied to quantify the
reduction in entropy caused by partitioning the examples according to a given
attribute. In addition, the Symmetrical Uncertainty measure was employed as a
measure of feature-feature inter-correlation. The authors also used the One R
algorithm to infer a rule that predicts the class given the values of
the attributes. The study obtained its best accuracy, 99.918%, with Random
Forest, which is 9% better than previous spam base approaches; a second accuracy
figure of 96.416% is also reported. The results showed that using the Random
Forest and Partial Decision Trees algorithms to classify spam is more effective
than the other algorithms implemented, in terms of accuracy and time
complexity [32].
Rathi & Pareek (2013) used data mining to investigate spam mail detection,
analyzing various data mining approaches on a spam dataset to find the best
classifier for email classification. A support vector machine was used to
analyze the data, mainly for classification. A Naïve Bayes classifier was also
used; it assumes that the presence or absence of a particular feature of a class
is unrelated to the presence or absence of any other feature, given the class
variable. Moreover, a feature selection method was used to remove irrelevant and
redundant features from the data. The results showed that the Random Tree
classifier achieved a promising accuracy of 99.715% with the best-first feature
selection algorithm; an accuracy of 90.93% is also reported [33].
4. Research Methodology
4.1. The Main Questions of This Study
Q1: How can the relevant features for detecting Online Recruitment Fraud be
determined?
Q2: What is the best classification algorithm to use for Online Recruitment
Fraud?
Q3: Is the ensemble approach suitable for detecting Online Recruitment Fraud?
The key purpose of this research is to protect individuals and organizations
from compromised privacy and loss of money by constructing a suitable, reliable
model to detect fraud exposure in online recruitment environments. To achieve
this, the research applies algorithms to detect this behavior. The research has
the following objectives:
To utilize the unique existing dataset for online recruitment, EMSCAD, and
preprocess its data to enhance the accuracy of the model.
To determine the relevant features by applying feature selection techniques,
which help to reduce dimensionality.
To build a reliable model that helps to effectively detect fraudulent ads with
the highest accuracy.
This research is an empirical study based on observation, testing and
evaluation. The Weka tool is used to implement the proposed model and evaluate
its performance. The proposed model follows the steps below to solve the
research problem. It involves three main stages of scrutiny:
First stage: Pre-processing, which is conducted on EMSCAD.
Second stage: Features Selections, where the support vector machine (SVM)