Journal of Transportation Technologies, 2017, 7, 190-205 
http://www.scirp.org/journal/jtts 
ISSN Online: 2160-0481 
ISSN Print: 2160-0473 
 
 
 
Road Crash Prediction Models: Different 
Statistical Modeling Approaches 
Azad Abdulhafedh 
University of Missouri-Columbia, MO, USA 
 
 
 
How to cite this paper: Abdulhafedh, A.
(2017) Road Crash Prediction Models:
Different Statistical Modeling Approaches.
Journal of Transportation Technologies, 7,
190-205.
https://doi.org/10.4236/jtts.2017.72014
 
Received: December 20, 2016 
Accepted: April 27, 2017 
Published: April 30, 2017 
 
Copyright © 2017 by author and   
Scientific Research Publishing Inc. 
This work is licensed under the Creative 
Commons Attribution International   
License (CC BY 4.0). 
http://creativecommons.org/licenses/by/4.0/     
  
Open Access
 
Abstract 
Road crash prediction models are very useful tools in highway safety, given
their potential for determining both the frequency of crash occurrence and the
degree of crash severity. Crash frequency refers to the prediction of the
number of crashes that would occur on a specific road segment or intersection
in a time period, while crash severity models generally explore the relationship
between crash injury severity and contributing factors such as driver
behavior, vehicle characteristics, roadway geometry, and road-environment
conditions. Effective interventions to reduce the crash toll include design of safer
infrastructure and incorporation of road safety features into land-use and
transportation planning; improvement of vehicle safety features; improvement
of post-crash care for victims of road crashes; and improvement of driver
behavior, such as setting and enforcing laws relating to key risk factors, and
raising public awareness. Despite the great efforts that transportation agencies
put into preventive measures, the annual number of traffic crashes has not yet
significantly decreased. For instance, 35,092 traffic fatalities were recorded in
the US in 2015, an increase of 7.2% compared to the previous year. Given
this trend, this paper presents an overview of road crash prediction models
used by transportation agencies and researchers to gain a better understanding
of the techniques used in predicting road accidents and the risk factors
that contribute to crash occurrence.
 
Keywords 
Crash Prediction Models, Poisson, Negative Binomial, Zero-Inflated, Logit 
and Probit, Neural Networks 
 
*PhD in Civil Engineering.

1. Introduction
Road traffic accidents are the world’s leading cause of death for individuals between
the ages of one and twenty-nine [1]. Throughout the world, cars, buses,
trucks, motorcycles, pedestrians, animals, taxis and other categories of travelers, 
share the roadways, contributing to economic and social development in many 
countries. Yet each year, many vehicles are involved in crashes that are responsible
for millions of deaths and injuries. Globally, every year, about 1.25 million
people are killed in motor vehicle crashes and approximately 50 million more
are injured. Following current trends, about two million people could be expected
to be killed in motor vehicle crashes each year by 2030 [1]. Currently,
road crashes are ranked as the ninth most serious cause of death in the world,
and without new initiatives to improve road safety, fatal crashes will likely rise to
the third place by the year 2020 [1]. In developed countries, road traffic death
rates have decreased since the 1960s because of successful interventions such as 
seat belt safety laws, enforcement of speed limits, warnings about the dangers of 
mixing alcohol consumption with driving, and safer design and use of roads and 
vehicles. For example, road traffic fatalities have declined by about 25.0 percent 
in the United States from 2005 to 2014 and the number of people injured has 
decreased 13.0 percent from 2005 to 2014 [2]. In Canada, the number of road 
traffic fatalities has declined by about 62.0 percent from 1990 to 2014, and the 
number of injuries has declined by about 68.0 percent during the same period 
[3]. However, traffic fatalities have increased in developing countries from 1990
to 2014 (e.g., by 44.0 percent in Malaysia and about 243.0 percent in China) [1].
Developing countries bear a large share of the burden, accounting for 85.0 percent
of annual deaths and 90.0 percent of the disability-adjusted life years. More than
one-half of all road traffic deaths globally involve people ages 15 to 44, during
their most productive earning years. Moreover, the disability burden for this age
group accounts for about 60.0 percent of all disability-adjusted life years. The
costs and consequences of these losses are significant. Three-quarters of all poor
families who lost a member in a traffic crash reported a decrease in their standard
of living, and about 61.0 percent reported having to borrow money to cover
expenses following their loss [4]. The World Bank estimates that road traffic injuries
cost 2.0 percent to 3.0 percent of the Gross National Product of developing
countries, or twice the total amount of development aid received worldwide by
developing countries [5]. Although transportation agencies often try to identify
the most hazardous road sites and put great efforts into preventive measures,
such as illumination and policy enforcement, the annual number of traffic
crashes has not yet significantly decreased. For instance, 35,092 traffic fatalities
were recorded in the US during 2015, an increase of 7.2% compared to the
previous year [6]. The fatality rate per 100 million vehicle miles traveled (VMT)
increased 3.7% between 2014 and 2015. Thirty-five States had more motor vehicle
fatalities in 2015 than in 2014. Every month except November saw increases in
fatalities from 2014 to 2015, and the highest increases occurred in July and
September [6]. Given this trend, it is imperative to gain a better understanding of
the risk factors that may be associated with traffic crashes. This paper aims to
present an overview of road crash prediction models used by transportation
agencies and researchers, to help in understanding the techniques used in
predicting road accidents and the risk factors that contribute to crash occurrence.
2. The Importance of Traffic Accidents Prediction Models 
Traffic accident prediction models are very useful tools in highway safety, given
their potential for determining both the frequency of accident occurrence and
the contributing factors that could then be addressed by transportation policies.
Vehicular crash data can be used to model both the frequency of crash occurrence
and the degree of crash severity. Crash frequency refers to the prediction
of the number of crashes that would occur on a specific road segment or
intersection in a time period [7]. Crash severity methods generally explore the
relationship between crash injury severity categories and contributing factors such
as driver behavior, vehicle characteristics, roadway geometry, and road-environment
conditions. Traffic accident-related fatalities and injuries can be prevented,
or at least minimized, by the joint involvement of multiple sectors (e.g.,
transportation agencies, police, health departments, education institutions) that oversee
road safety, vehicles, and the drivers themselves. Effective interventions include
design of safer infrastructure and incorporation of road safety features into
land-use and transport planning; improvement of vehicle safety features;
improvement of post-crash care for victims of road crashes; and improvement of driver
behavior, such as setting and enforcing laws relating to key risk factors, and
raising public awareness [8]. Transportation agencies and research institutions
often seek to identify the most dangerous road sites, which requires modeling
road crash data to determine both crash frequency and the degree of crash
severity. In addition, traffic accident prediction models can assist with the
development of generalized theories concerning road safety. A range of basic laws
has been put forth to help explain the relationship between the occurrence of
road crashes and potential risk factors, such as: the universal law of learning,
which implies that the crash rate tends to decline as the number of kilometers
travelled increases; the law of rare events, which states that rare events, such as
environmental hazards, would have more effect on crash rates than regular
events; and the law of complexity, which implies that the more complex the
traffic situation road users encounter, the higher the probability of crash occurrence
[9].
3. Factors Affecting Road Traffic Accidents 
A traffic accident may have many contributing factors, such as those related to 
driver behavior, road geometry, traffic volumes, vehicle, and environment. The 
influence  of  such  variables  on  crash  occurrence  could  significantly  vary  on  a 
case-by-case basis, but in general, both behavioral factors related to the driver’s 
errors, and non-behavioral factors related to road geometry, traffic flow condi-
tions, vehicle, and environment are thought to significantly affect traffic crashes 
[10]. Research has revealed that there are generally six major groups of risk fac-
tors affecting traffic crash occurrence [11] [12] [13] [14] [15]: 
1) Driver behavior: alcohol and drug use, reckless operation of the vehicle, failure
to properly use occupant protection devices, the use of cell phones or texting,
and fatigue.
2) Vehicle factors: vehicle type, and the engineering and safety design
standards for vehicle performance. For example, the design of windshield glass
and the location and durability of gas tanks can increase safety. Passenger
protection systems in vehicles (e.g., air bags, safety belts), if used, can eliminate
injuries or reduce their severity.
3) Roadway characteristics: road geometries and roadside conditions, such as
well-designed curves and grades, wide lanes, adequate sight distance, clearly
visible striping, flared guardrails, good quality shoulders, roadsides free of
obstacles, well-located crash attenuation devices, and well-planned use of traffic
signals.
4) Traffic volumes: average annual daily traffic (AADT) or the vehicle miles
travelled (VMT). AADT is the average number of vehicles passing a point along
a particular road section each day. Thus, AADT represents the vehicle flow over
a road section on an average day of the year. VMT refers to the distance travelled
by vehicles on roads. It is often used as an indicator of traffic demand and is
commonly applied to evaluate mobility patterns and travel trends.
5) Environmental factors: weather conditions and light conditions.
6) Time factors: the season of the year, the month of the year, weekdays, and
the hour of crash occurrence.
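The traffic-volume measures in item 4 can be illustrated with a short calculation. The sketch below is only an assumption-laden example: the daily counts, the 2.0-mile segment length, and the function names are hypothetical, and annual VMT is approximated as AADT multiplied by segment length and 365 days. The last helper expresses a crash count as a rate per 100 million VMT, the exposure unit used for the fatality rate quoted in the Introduction.

```python
def aadt(daily_counts):
    """Average annual daily traffic: mean of daily vehicle counts at one point."""
    if not daily_counts:
        raise ValueError("need at least one daily count")
    return sum(daily_counts) / len(daily_counts)


def annual_vmt(aadt_value, segment_length_miles, days=365):
    """Approximate annual vehicle miles travelled: AADT x length x days."""
    return aadt_value * segment_length_miles * days


def rate_per_100m_vmt(crashes, vmt):
    """Crashes per 100 million vehicle miles travelled."""
    return crashes / vmt * 100_000_000


# Hypothetical sample of daily counts (a real AADT uses a full year of data).
counts = [9_500, 10_200, 10_300, 10_000]
segment_aadt = aadt(counts)                 # vehicles per day
vmt = annual_vmt(segment_aadt, 2.0)         # vehicle miles per year
rate = rate_per_100m_vmt(5, vmt)            # 5 hypothetical crashes in the year

print(segment_aadt, vmt, round(rate, 2))
```

Normalizing by VMT in this way lets segments with very different traffic volumes be compared on a common exposure basis.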
4. The Costs of Road Traffic Accidents 
The highest cost of traffic crashes is in the loss of human lives; however, society 
also bears the consequences of many costs associated with motor vehicle crashes. 
Highway crashes currently cost the USA about $1078.0 billion a year, approximately
5.0 percent higher than in 2000. Total costs include both economic costs
and societal harm [16]. In the year 2010, 3.9 million people were injured and
32,999 killed in 13.6 million motor vehicle crashes in the US [2]. The economic
costs of these crashes totaled $242.0 billion, including lost productivity, medical
costs, legal and court costs, emergency service costs, insurance administration
costs, congestion costs, property damage, and workplace losses. The $242.0 billion
cost of motor vehicle crashes represents the equivalent of nearly $784.0 for
each person living in the United States, and 1.6 percent of the $14.96 trillion U.S.
Gross Domestic Product for 2010 [16]. When quality of life valuation is considered,
the total value of societal harm from motor vehicle crashes in 2010 was
$836.0 billion, roughly three and a half times the value measured by economic
impacts alone. Lost market and household productivity accounted for $77.0 billion
of the total $242.0 billion economic costs, while property damage accounted
for $76.0 billion. Medical expenses totaled $23.0 billion. Congestion caused by
crashes, including travel delay, excess fuel consumption, greenhouse gases and
criteria pollutants, accounted for $28.0 billion. Each fatality resulted in an average
discounted lifetime cost of $1.4 million. Each critically injured survivor cost
an average of $1.0 million [16].
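The 2010 figures quoted above can be cross-checked with simple arithmetic. The following sketch uses only numbers stated in this section (in billions of dollars); the variable names are illustrative, and the "implied population" is merely what the per-person figure implies, not a value taken from [16].

```python
# All monetary values in billions of US dollars, as quoted in the text for 2010.
economic_cost = 242.0
societal_harm = 836.0
gdp_2010 = 14_960.0            # $14.96 trillion
cost_per_person = 784.0        # dollars per person

share_of_gdp = economic_cost / gdp_2010 * 100          # percent of GDP
harm_ratio = societal_harm / economic_cost             # societal harm vs. economic cost
implied_population = economic_cost * 1e9 / cost_per_person

# Components itemized in the text; the remainder covers the other listed categories.
itemized = {"productivity": 77.0, "property damage": 76.0,
            "medical": 23.0, "congestion": 28.0}
other_categories = economic_cost - sum(itemized.values())

print(round(share_of_gdp, 1), round(harm_ratio, 1),
      round(implied_population / 1e6, 1), other_categories)
```

The results agree with the rounded statements in the text (about 1.6 percent of GDP and roughly three and a half times the economic cost).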
5. Literature Review 
Early crash analysis models were generally based on simple multiple linear
regression methods assuming normally distributed errors. However, researchers
soon discovered that crash occurrence could be better fitted with a Poisson
distribution. Hence, Poisson regression models based upon a generalized linear
framework were adopted in place of conventional multiple linear regression
techniques. Several such Poisson regression approaches for exploring the
relationship between the risk factors and crash frequency have been proposed [15] [16]
[17] [18] [19] [20]. However, Poisson regression has one important constraint:
the mean must equal the variance. If this constraint is violated, the standard
errors estimated by the maximum likelihood method will be biased, and the test
statistics derived from the model will be incorrect. Recent studies have shown
that crash data are usually over-dispersed (the variance exceeds the mean), so
applying the Poisson regression model could result in incorrect estimation of the
likelihood of crash occurrence [7]. To overcome the problem of over-dispersion,
researchers began to employ the Negative Binomial (NB) distribution (also called
the Poisson-Gamma) instead of the Poisson distribution. The NB distribution
relaxes the constraint that the mean equal the variance, and hence can accommodate
over-dispersion in crash count data [7]. NB models have been widely used in
crash frequency modeling [14] [15] [19] [21] [22] [23]. However, NB models have
some limitations, such as the inability to handle under-dispersion of crash counts,
when the mean of the crash counts is higher than the variance. Although rare, this
phenomenon can arise when the sample size is very small, leading to erroneous
parameter estimates [24] [25]. To address the limitations of NB models,
Poisson-lognormal models have been proposed, in which the error term is lognormal
rather than gamma-distributed, to better handle under-dispersed crash counts
[21] [26] [27]. Another widely used family of crash prediction models comprises
the zero-inflated Poisson and zero-inflated negative binomial models, which were
introduced mainly to deal with the over-dispersion caused by excess zeros
(i.e. locations where no crashes are observed) in traffic count data. The
zero-inflated models have shown great flexibility, although their applicability in
crash prediction has been criticized because the long-term mean equals zero in
the safe state, which could produce biased estimates [7] [22]. Generalized additive
modeling approaches, which provide smoothing functions for the explanatory
variables, have also been proposed. However, these models typically include
more parameters than the traditional count models, and therefore their applicability
to crash prediction has been very limited [28] [29]. Random-parameters
models have been applied to account for unobserved heterogeneity
from one roadway site to another; however, their application in practice has
been very limited [30] [31] [32]. The finding that road crashes are poorly
explained by linear functions of independent variables has encouraged the
exploration of non-linear approximators such as fuzzy logic and neural networks. For
example, a fuzzy logic approach was used for prediction of urban highway crash
occurrence, and it was found that the use of fuzzy sets in crash prediction is
indeed a viable approach [33]. Neural networks have been applied to highway
safety applications as predictive tools, such as in driver behavior analysis,
pavement maintenance, vehicle detection, traffic signal control, and vehicle
emissions; however, their application to crash analysis has been limited [28] [34]
[35]. For instance, an artificial neural network was utilized to analyze freeway
crash frequency in Taiwan, and the results indicated that an artificial neural
network can provide a consistent alternative method for analyzing crash
frequency [36]. Also, a group of artificial neural networks was applied to model the
non-linear relationships between injury severity levels and crash-related factors.
The findings indicated that artificial neural network models can predict
crashes more effectively than the traditional statistical methods [37]. In crash
severity models, a wide variety of statistical approaches, such as binary and
multinomial logit models, nested logit models, mixed logit models and ordered
probit models, have been investigated. For example, the ordered probit
model was applied to predict crash severity on roadway sections, signalized
intersections and toll plazas in Florida [38]. A mixed logit model was applied to
limited crash data, using the injury outcome of each crash, to investigate the
proportion of crashes of each severity level on a specific roadway segment over a
specified time period. The number of crashes by severity level was then determined
without the need for detailed crash-specific data [39]. Also, a multinomial
logistic regression was applied to model the injury severity of different vehicle
collision patterns on urban highways in Arkansas, and the researchers recommended
the use of the multinomial logit (MNL) model over other models [40].
6. A Review on the Statistical Approaches of Road Crash 
Prediction Models 
There are different statistical approaches for modeling traffic crashes. The
following subsections present some of the most commonly used methods.
6.1. Multiple Linear Regression 
Early traffic accident models were based on the simple multiple linear
regression approach, assuming normally distributed errors. The general form of
the linear crash prediction model can be expressed as follows:
\[ Y \sim \mathrm{Dist}(\theta) \quad \text{with} \quad \theta = f(X, \beta, \varepsilon) \tag{1} \]
where,
Y: the dependent variable (i.e. crash frequency), 
θ: the crash dataset, 
Dist(θ): the model distribution, 
X: a vector representing different independent variables (i.e. risk factors), 
β: a vector of regression coefficients,   
f(.): link function that relates X and Y together, 
ε: the disturbance or error terms of the model. 
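As a minimal illustration of Eq. (1) with an identity link and a single risk factor, the sketch below fits a straight line to crash counts by least squares. The data are hypothetical (traffic volume in thousands of vehicles per day against yearly crash counts), and the function name is illustrative. It is included only to show how a linear model fitted to count data can return a negative predicted crash frequency at low exposure.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: traffic volume (thousands of vehicles/day) and crashes/year.
aadt_thousands = [2, 4, 6, 8, 10, 12]
crashes = [0, 0, 1, 1, 3, 5]

b0, b1 = fit_simple_ols(aadt_thousands, crashes)
pred_low = b0 + b1 * 1          # predicted crashes at 1,000 vehicles/day

print(round(b0, 3), round(b1, 3), round(pred_low, 3))
```

With these numbers the fitted line crosses zero inside the plausible range of traffic volumes, so the prediction at low volume is negative and non-integer, which a crash count can never be.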
 
6.2. Poisson Regression 
Although multiple linear regression models have been widely applied, it has
been found that crash occurrence can often be better fitted with a Poisson
distribution. One frequent pitfall is to model crash data as continuous data by
applying ordinary least squares regression [41]. This approach is inappropriate
because such regression models can produce predicted values that are non-integers
and can also predict values that are negative, both of which are inconsistent with
count data. In addition, many distributions of crash data are positively
skewed, with many observations in the data set having a value of zero. The
high number of zeros in the data set prevents the transformation of a skewed
distribution into a normal one, which is a requirement of normal-theory regression.
An alternative is to use a Poisson distribution or one of its variants. Poisson
distributions have a number of advantages over an ordinary normal distribution,
including a skewed, discrete distribution and the restriction of predicted values to
non-negative numbers [41]. Hence, generalized linear modeling variants of the
Poisson regression model have been proposed to explore the relationship
between the risk factors and traffic accident occurrence [15] [17] [18] [19]. Poisson
regression has been applied to a wide range of transportation count data,
including crash frequency. A Poisson regression model is similar to an ordinary
linear regression, with two exceptions. First, it assumes that the errors follow a
Poisson (not normal) distribution. Second, rather than modeling the response
variable Y as a linear function of the regression coefficients, it models the natural
log of the expected response, ln(E(Y)), as a linear function of the coefficients [7].
The Poisson model can be expressed as follows:
\[ P(n_i) = \frac{\mathrm{EXP}(-\lambda_i)\,\lambda_i^{n_i}}{n_i!} \tag{2} \]
where, 
P(ni): the probability of ni crashes occurring on highway segment i per time period,
ni: the observed number of crashes on segment i per time period (such as a year),
λi: the expected crash frequency on road segment i per time period (i.e. the
mean of the distribution), which can be estimated as follows:
\[ \lambda_i = \mathrm{EXP}(\beta X_i) \tag{3} \]
where
Xi: a vector of the independent variables (i.e. risk factors), 
β: a vector of the estimates (coefficients) of the independent variables Xi. 
This model is estimable by standard maximum likelihood methods, with the 
log likelihood (LL) function given as: 
\[ LL(\beta) = \sum_{i=1}^{n} \left[ -\mathrm{EXP}(\beta X_i) + n_i \beta X_i - \mathrm{LN}(n_i!) \right] \tag{4} \]
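To make Eqs. (3) and (4) concrete, the following sketch maximizes the Poisson log likelihood for a model with an intercept and one risk factor. It is an illustration under stated assumptions, not the estimation routine of any cited study: the data are hypothetical, the function names are invented for this example, and Newton-Raphson with simple step halving is one of several standard ways to carry out the maximization.

```python
import math


def log_likelihood(beta, x, n):
    """Eq. (4) for lambda_i = EXP(b0 + b1 * x_i); lgamma(n + 1) equals LN(n!)."""
    b0, b1 = beta
    return sum(-math.exp(b0 + b1 * xi) + ni * (b0 + b1 * xi) - math.lgamma(ni + 1)
               for xi, ni in zip(x, n))


def fit_poisson(x, n, max_iter=100, tol=1e-10):
    """Newton-Raphson maximization of the Poisson log likelihood (two coefficients)."""
    b0, b1 = math.log(sum(n) / len(n)), 0.0        # start from the intercept-only fit
    for _ in range(max_iter):
        lam = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (first derivatives of the log likelihood).
        g0 = sum(ni - li for ni, li in zip(n, lam))
        g1 = sum((ni - li) * xi for ni, li, xi in zip(n, lam, x))
        # Information matrix (negative second derivatives).
        h00 = sum(lam)
        h01 = sum(li * xi for li, xi in zip(lam, x))
        h11 = sum(li * xi * xi for li, xi in zip(lam, x))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Halve the step if it would clearly reduce the log likelihood.
        step, current = 1.0, log_likelihood((b0, b1), x, n)
        while step > 1e-8 and \
                log_likelihood((b0 + step * d0, b1 + step * d1), x, n) < current - 1e-9:
            step /= 2
        b0, b1 = b0 + step * d0, b1 + step * d1
        if max(abs(step * d0), abs(step * d1)) < tol:
            break
    return b0, b1


# Hypothetical data: one risk factor (e.g. traffic volume in thousands) and crash counts.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
crashes = [0, 1, 1, 2, 2, 4, 5, 8]

ll_start = log_likelihood((math.log(sum(crashes) / len(crashes)), 0.0), x, crashes)
beta = fit_poisson(x, crashes)
ll_fit = log_likelihood(beta, x, crashes)
fitted = [math.exp(beta[0] + beta[1] * xi) for xi in x]    # Eq. (3)

print([round(b, 4) for b in beta], round(ll_fit, 4))
```

Because the model contains an intercept, the fitted expected frequencies sum to the observed number of crashes at the maximum, which gives a quick check that the iteration has converged.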
One assumption of Poisson models is that the mean and the variance are
equal, an assumption that is sometimes violated [7]. This can be dealt with by
using a dispersion parameter if the difference is small, or by using a negative
binomial regression model if the difference is large [42].
6.3. Negative Binomial Regression Model (NB) 
To overcome the problem of over-dispersion, the Negative Binomial
(NB) distribution (also called the Poisson-Gamma) has been investigated as an
alternative to the Poisson distribution. The NB model uses a gamma probability
distribution to relax the requirement that the mean equal the variance, and
hence can accommodate over-dispersion in crash count data [7] [43]. As a result,
NB models have been widely applied in crash frequency modeling [14] [15] [19]
[21] [22] [23]. Primary sources of over-dispersion are the clustering of data and
the possible omission of relevant independent variables influencing the Poisson
rate across observations [44]. To obtain the NB model, the Poisson regression can
be rewritten by adding an error term to its expected number of crashes, which
becomes [7]:
\[ \lambda_i = \mathrm{EXP}(\beta X_i + \varepsilon_i) \tag{5} \]
where EXP(εi) is a gamma-distributed error term with mean one and variance
α. The addition of this term allows the variance VAR(ni) to differ from
the mean E(ni), as shown in Eq. 6:
\[ \mathrm{VAR}(n_i) = E(n_i)\left[1 + \alpha E(n_i)\right] \tag{6} \]
The parameter α is called the over-dispersion parameter, and both α and β can
be estimated by maximum likelihood. When α is zero, the model
reduces to Poisson regression; if α is found to be significantly different from
zero, the NB regression can be used instead of the Poisson regression model
to handle the over-dispersion in crash data. However, the NB model also has
some limitations, such as its inability to handle under-dispersion of
the count data, when the mean of the crash counts is higher than the variance
[25] [44].
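The variance relationship in Eq. (6) can be checked by simulation. The sketch below draws counts from a Poisson distribution whose rate is multiplied by a gamma-distributed term with mean one and variance α, as in Eq. (5), and then recovers α from the sample mean and variance. Note the assumptions: the values of λ and α are hypothetical, the helper names are invented here, and the method-of-moments estimate is a simple stand-in for the maximum likelihood estimation described in the text.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_poisson_gamma(lam, alpha, size, seed=0):
    """Counts with rate lam * EXP(eps), where EXP(eps) ~ Gamma(mean 1, variance alpha)."""
    rng = random.Random(seed)
    shape, scale = 1.0 / alpha, alpha      # mean = shape * scale = 1, variance = alpha
    return [sample_poisson(lam * rng.gammavariate(shape, scale), rng)
            for _ in range(size)]


def moment_alpha(counts):
    """Solve VAR = E * (1 + alpha * E) for alpha using sample moments."""
    m = statistics.mean(counts)
    v = statistics.variance(counts)
    return (v - m) / m ** 2


lam, alpha = 3.0, 0.5                      # hypothetical mean and over-dispersion
counts = simulate_poisson_gamma(lam, alpha, size=20_000, seed=42)

sample_mean = statistics.mean(counts)
sample_var = statistics.variance(counts)
alpha_hat = moment_alpha(counts)
theoretical_var = lam * (1 + alpha * lam)  # Eq. (6)

print(round(sample_mean, 2), round(sample_var, 2), round(alpha_hat, 2), theoretical_var)
```

A sample variance well above the sample mean, as produced here, is the pattern that motivates replacing the Poisson model with the NB model.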
6.4. Poisson-Lognormal Regression Model 
To address the limitations of the NB models, the Poisson-lognormal model was
introduced, in which the error term is lognormal rather than gamma-distributed,
so as to better handle under-dispersed data counts [21] [26] [27].
The Poisson-lognormal model is similar to the negative binomial model; however,
the EXP(εi) term used in the model is lognormal rather than
gamma-distributed. The Poisson-lognormal model provides more flexibility than the
negative binomial model, but it does have some limitations, such as the complex
estimation of its parameters, because the Poisson-lognormal distribution
does not have a closed form [26].
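Because the Poisson-lognormal probabilities have no closed form, they are usually evaluated numerically. The sketch below approximates the probability of each count by integrating the Poisson probability over a normal density for εi with the trapezoid rule. Several choices are assumptions made for this illustration only: εi is taken as normal with mean −σ²/2 so that EXP(εi) has mean one (mirroring the gamma case), the values of λ and σ are hypothetical, and the integration grid is arbitrary.

```python
import math


def poisson_pmf(n, rate):
    """Poisson probability of n events, computed on the log scale for stability."""
    return math.exp(-rate + n * math.log(rate) - math.lgamma(n + 1))


def normal_pdf(e, mu, sigma):
    return math.exp(-0.5 * ((e - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def poisson_lognormal_pmf(n, lam, sigma, points=401, width=8.0):
    """Trapezoid-rule integral of Poisson(n | lam * EXP(e)) * Normal(e | mu, sigma)."""
    mu = -0.5 * sigma ** 2                  # assumption: gives EXP(e) a mean of one
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / (points - 1)
    total = 0.0
    for k in range(points):
        e = lo + k * h
        weight = 0.5 if k in (0, points - 1) else 1.0
        total += weight * poisson_pmf(n, lam * math.exp(e)) * normal_pdf(e, mu, sigma)
    return total * h


lam, sigma = 2.0, 0.5                       # hypothetical parameter values
pmf = [poisson_lognormal_pmf(n, lam, sigma) for n in range(61)]

total_prob = sum(pmf)
mean = sum(n * p for n, p in enumerate(pmf))
variance = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))

print(round(total_prob, 6), round(mean, 4), round(variance, 4))
```

Under this parameterization the mean stays at λ while the variance becomes λ + λ²(EXP(σ²) − 1), so the sketch also shows numerically that the lognormal term inflates the variance above the mean.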
6.5. Zero Inflated Poisson and Negative Binomial Regression 
Models 
Another  widely  used  crash  frequency  modeling  approach  is  the  zero-inflated 