Highway Traffic Crash Risk Prediction Method considering Temporal Correlation Characteristics
Zhao, Liping;Li, Feng;Sun, Dongye;Dai, Fei
2023-02-15 00:00:00
Hindawi
Journal of Advanced Transportation
Volume 2023, Article ID 9695433, 13 pages
https://doi.org/10.1155/2023/9695433

Research Article

Liping Zhao (1), Feng Li (1), Dongye Sun (2), and Fei Dai (1)

(1) Institute of Systems Engineering, Academy of Military Sciences, No. 2 Fengti South Road, Fengtai District, Beijing 100166, China
(2) National Engineering Research Center for Transportation Safety and Emergency Informatics, Telecommunications & Information Center, No. 1 Anwai Waiguan Houshen, Beijing 100011, China

Correspondence should be addressed to Liping Zhao; 825889797@qq.com and Dongye Sun; 14114218@bjtu.edu.cn

Received 22 April 2022; Revised 15 September 2022; Accepted 25 November 2022; Published 15 February 2023

Academic Editor: Fei Hui

Copyright © 2023 Liping Zhao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Crash risk analysis and prediction are considered the premise of highway traffic safety control, which directly affects the accuracy and effectiveness of traffic safety decisions. A highway traffic crash risk prediction method considering temporal correlation characteristics is proposed in this research. Firstly, the case-control sample analysis method is used to extract 6 time series sample data composed of crash traffic flow data and corresponding non-crash traffic flow data for crash risk analysis and prediction. Secondly, the multiparameter fusion clustering analysis method is used to indicate that the sample data of different time series have different effects on the crash risk. Then, the random forest model is used to screen several traffic flow variables that affect the highway crash risk.
Thereafter, the downstream mean speed (ASD1D2), the upstream mean occupancy (AOU1U2), and the speed difference (DSU1D1) on the nearest detector were determined as the explanatory variables of the crash risk prediction model. Finally, based on the three variables, the dynamic Bayesian network model for highway traffic crash risk prediction is proposed. The overall prediction accuracy of this model is 84.9%, the crash prediction accuracy is 60.8%, and the non-crash prediction accuracy is 92.3%. Also, the prediction results show that the dynamic Bayesian model has a better prediction effect than the static Bayesian model for the same sample data.

1. Introduction

Research on road traffic safety has a long history. Early studies of road traffic safety focused on crash cause mechanisms and the analysis of influencing factors. Yang et al. explored and analyzed the influence of different geographical conditions and environmental factors on highway crash risk by using the improved association rule algorithm [1–4]. Wang et al. [5–7] analyzed the traffic crash causative factors under different traffic modes from the microscopic perspective. However, it is not realistic to consider all the factors (drivers, vehicles, roads, and environments) in traffic crash cause modeling. The randomness of traffic crashes has brought crash causation analysis to a bottleneck.

In recent years, the concept of active safety has gradually entered the vision of researchers. More and more studies [8–10] have found that the traffic flow state has a strong correlation with the occurrence of road traffic crashes. For example, Lee et al. used the traffic flow data within 5 minutes before the crash to study the influence of traffic flow dynamic characteristics on the collision pattern of the crash [11]. Golob and Recker divided the traffic flow pattern before the crash into different traffic flow states and found that different traffic flow states likely led to different types of traffic crashes [12]. Golob et al.'s study first extracted six variables representing the traffic flow characteristics before the crash, then divided the traffic flow state by using these six variables as the clustering index, and finally analyzed the types of traffic crashes that are prone to occur in various traffic flow states [13, 14]. Golob et al.'s study indicates that the running state of traffic flow has a certain relationship with traffic crashes [15]. However, these studies are all based on the traffic flow data before the crashes. Such data can hardly reflect the randomness of traffic crashes, and they do not help to accurately distinguish the crash-prone traffic state from the normal traffic flow state.

The highway traffic system is a dynamic, undulating, and non-linear complex system. From the perspective of spatial dimension, the traffic flow variables closest to the crash site, that is, the traffic flow variables upstream and downstream of the crash site, are most correlated with crash risk. From the time dimension, the traffic flow state variables in a period of time before the crash are most likely to have a certain correlation with the crash occurrence [16–19]. At present, a large number of studies [20–22] have used traffic state parameters such as flow/speed/density as explanatory variables to analyze and predict the possibility of traffic crashes. However, the traffic flow data collected by coil, microwave radar, and floating vehicle have strong time-varying characteristics. Therefore, the temporal and spatial characteristics of traffic flow state variables have a certain degree of influence on crash risk modeling. Moreover, the essence of highway crash risk prediction and discrimination is the causal relationship between the running traffic flow state variables upstream and downstream and the possibility of crash occurrence. A linear or non-linear relationship model is then established to determine whether the future traffic running state is at risk. At the same time, traffic safety data have certain temporal correlation characteristics. A long-scale cumulative traffic flow sequence generally contains multiple subsequences with different stage characteristics. The time correlation between the sequences will have a certain influence on the crash risk prediction model. Conventional real-time crash risk prediction models do not consider the temporal characteristics of traffic flow sequences, which may affect the prediction accuracy of the model.

In order to solve this problem, this paper proposes a highway crash risk prediction model considering time sequence correlation. The dynamic Bayesian network model is used to characterize the influence of time correlation characteristics on the crash risk prediction model. Firstly, the random forest model is used to screen the traffic flow state variables that affect the highway crash risk. Then, the dynamic Bayesian network model is used to explain the influence of the temporal correlation of traffic flow state variables on the modeling of traffic safety crash risk prediction. The influence of time sequence characteristics between variables on highway crash risk modeling is illustrated by comparative analysis. The results show that the proposed method has a better identification rate and a lower error rate. It further shows that the dynamic Bayesian network model can better describe the dynamic time-varying characteristics of traffic flows before crashes.

The rest of the text is arranged as follows. Section 2 briefly introduces the study area, sample data sources, and temporal characteristics of crash traffic flow variables. Section 3 presents the research problems, solutions, and related model methods involved. Section 4 introduces the comparative analysis and discussion of the model operation results. The research results are summarized in Section 5.

2. Study Area and Data Survey

2.1. Data Collection and Processing. The main aim of this paper is to reveal the influence of traffic flow timing characteristics on highway crash risk prediction. The data involved include traffic crash data and corresponding upstream and downstream traffic flow data within a certain time range. The sample data used in this study are the traffic safety crash data and corresponding traffic flow state data on the 495.493–539.045 miles section of interstate highway I5 in California, USA. In order to control the influence of weather, road conditions, and other factors on crash risk prediction modeling, the case-control sample structure was used to match the sample data. Based on the location where each crash occurred, the traffic flow state data of the four detectors closest to the crash were extracted; the two detectors upstream were named U2 and U1, and the two detectors downstream were named D2 and D1, as shown in Figure 1(a). In order to accurately identify the influence of time series characteristics of crash traffic flow variables on the crash risk prediction model, the data of traffic flow variables within 30 minutes before the crash were extracted. They are divided into 6 time segments of 5 minutes each: time series 0 (i.e., 0–5 minutes before the crash), time series 1 (i.e., 5–10 minutes before the crash), time series 2 (i.e., 10–15 minutes before the crash), time series 3 (i.e., 15–20 minutes before the crash), time series 4 (i.e., 20–25 minutes before the crash), and time series 5 (i.e., 25–30 minutes before the crash), as shown in Figure 1(b). It should be noted that time series 0 may in fact fall after the crash, because the recorded crash time usually lags the actual occurrence by a few minutes; it is therefore only suitable for crash detection, not for crash risk estimation. In addition, identifying a traffic crash and taking corresponding measures require response time. Only the model established by using the traffic detection information in time series 2, series 3, and series 4 has practical value for active safety management.

Figure 1: Schematic diagram of sample data extraction based on temporal and spatial features. (a) Spatial feature: loop detectors U2, U1, D1, and D2 around the crash location, along the traffic direction. (b) Temporal feature: 5-minute series from 30 minutes before the crash to t = 0.

2.2. Initial Variable Extraction. In order to facilitate the establishment of the crash risk prediction model by using the dynamic Bayesian network model, the original traffic flow variables on the four detectors and the mean and difference values of upstream and downstream traffic variables are used as the initial variables of the model. Relevant studies show that the traffic safety state upstream and downstream of the crash site can comprehensively reflect the influence of various factors on crash risk. In order to determine the mechanism of this influence, the original data collected by the detectors are fused to further explore the impact of the upstream and downstream traffic state on crash risk. Therefore, the explanatory variables of the model can be divided into three types. The first type refers to the original traffic flow data extracted from the four detectors, the second type is the difference between traffic variables of upstream and downstream detectors, and the third type is the mean value of traffic variables of upstream and downstream detectors. The specific names of the three types of variables are shown in Table 1.
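The three variable types can be illustrated with a short sketch that derives the difference and mean variables of one 5-minute segment from the 12 original detector readings. The names follow the coding used in Table 1 (type letter, quantity letter, detector IDs). The function name `build_variables`, the sample readings, and the sign convention for differences (upstream minus downstream) are assumptions made for illustration; the paper does not state them.

```python
from itertools import product

QUANTITIES = ("S", "V", "O")   # speed, flow (volume), occupancy
UPSTREAM = ("U1", "U2")
DOWNSTREAM = ("D1", "D2")


def build_variables(original):
    """Return the 30 named variables for one 5-minute time segment.

    `original` maps names such as "OSU1" to the reading of that detector.
    """
    features = dict(original)  # type O: the 12 original readings
    for q in QUANTITIES:
        # Type D: upstream minus downstream (the sign is an assumption).
        for u, d in product(UPSTREAM, DOWNSTREAM):
            features[f"D{q}{u}{d}"] = original[f"O{q}{u}"] - original[f"O{q}{d}"]
        # Type A: mean over the two detectors on the same side of the crash.
        for a, b in (UPSTREAM, DOWNSTREAM):
            features[f"A{q}{a}{b}"] = (original[f"O{q}{a}"] + original[f"O{q}{b}"]) / 2
    return features


# Hypothetical readings, not taken from the paper's dataset.
sample = {
    "OSU1": 62.0, "OVU1": 110.0, "OOU1": 0.08,
    "OSU2": 64.0, "OVU2": 105.0, "OOU2": 0.07,
    "OSD1": 48.0, "OVD1": 120.0, "OOD1": 0.15,
    "OSD2": 52.0, "OVD2": 118.0, "OOD2": 0.13,
}
features = build_variables(sample)
print(len(features))  # 30
```

Counting the result gives 12 original, 12 difference, and 6 mean variables, which matches the 30 variables per time segment reported in the text.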
The first letter of the variable name represents the type of the variable: O represents the original variable, D represents the difference of the upstream and downstream detector variables, and A represents the mean value of the upstream and downstream detector variables. The second letter represents the traffic quantity: S for speed, V for flow, and O for occupancy. The remaining part of the variable name identifies the detectors: U1 represents the first upstream detector, U2 represents the second upstream detector, D1 represents the first downstream detector, and D2 represents the second downstream detector. According to this coding rule, DOU1D1 is the average occupancy difference between the first upstream detector and the first downstream detector, and other variables are named in the same way.

Table 1: The statistics of initial variables.

| Optional initial variable | Variable naming |
| --- | --- |
| Detector original variable | OSU1, OVU1, OOU1, OSU2, OVU2, OOU2, OSD1, OVD1, OOD1, OSD2, OVD2, OOD2 |
| The difference between the variables of upstream and downstream detectors | DSU1D1, DVU1D1, DOU1D1, DSU2D1, DVU2D1, DOU2D1, DSU1D2, DVU1D2, DOU1D2, DSU2D2, DVU2D2, DOU2D2 |
| Mean values of variables of upstream and downstream detectors | ASU1U2, AVU1U2, AOU1U2, ASD1D2, AVD1D2, AOD1D2 |

In this study, according to the data sample matching principle, the original traffic flow data (i.e., volume, speed, and occupancy) of U1, U2, D1, and D2 in the 30 minutes before the crash were extracted for 274 crash and 1096 non-crash cases. In total, 1370 sets of data samples were obtained for highway crash risk modeling, in which the ratio of crash traffic flow to non-crash traffic flow was 1 : 4. According to the above method, each sample is divided into 6 time series, and each segment contains a total of 30 traffic flow variables, which are used as the basic sample data for crash risk prediction modeling. It should be pointed out that in the extracted data samples, the traffic flow variables in each time segment contain not only the three original variables of cumulative flow, average speed, and average occupancy rate on the four upstream and downstream detectors but also the difference and mean value of the traffic flow variables on the upstream and downstream detectors. That is, each of the 6 time series contains 30 variable values as explanatory variables of the model.

2.3. Analysis of Traffic Flow Temporal Characteristics. In order to express more intuitively the influence of the temporal correlation characteristics of traffic flow state variables on the modeling of highway crash risk, the traffic safety state is divided by the traffic flow state variable data in the six time segments mentioned above. According to the classification results, we can see the changing trend of the traffic flow state in different time segments at the same place. In this paper, the average speed index of upstream and downstream in two adjacent time segments (time series 1 and series 2) is selected to draw scatter charts, so as to analyze the changes of traffic flow state in different time segments more intuitively. As shown in Figure 2, the horizontal axis is the average speed of upstream traffic flow, and the vertical axis is the average speed of downstream traffic flow.

Figure 2: Traffic flow state change in different time segments, plotted as downstream average traffic flow speed (km/h) against upstream average traffic flow speed (km/h), with regions labeled Free, Change, and Congestion. (a) Time series 1 (5–10 min before crash). (b) Time series 2 (10–15 min before crash).

It can be seen from the figure that the overall trend of traffic flow status did not change much in the two consecutive periods before the crash, mainly because non-crash traffic flow accounted for a large proportion of the sample. Further analysis shows that, across the two 5-minute time intervals, the proportion of non-crash traffic flow samples that change state is much smaller than the proportion of crash traffic flow samples that change state. Of all 274 crash traffic flow samples, 79 (28.8%) exhibited a state transition, while of all 1096 non-crash traffic flow samples, 36 (3.2%) exhibited a state change. The proportion of state transitions in adjacent time series is therefore much higher in the crash traffic flow than in the non-crash traffic flow. This indicates that, under the same conditions, the time factor has a greater impact on crash traffic flow than on non-crash traffic flow. In other words, the change of traffic flow state in different time series before the crash has a strong influence on the highway crash. Therefore, traffic flow characteristics of different time series have different impacts on road traffic risks. It is necessary to consider the state transition process captured by traffic data over multiple time intervals when building the crash risk prediction model.

3. Methodology

At present, there is no fixed procedure for constructing a highway crash risk prediction model. A reasonable crash risk prediction model usually depends strongly on the types of traffic safety data, the physical significance of the explanatory variables, and the purpose of establishing the model. On the basis of existing studies, the general steps to be followed in highway crash risk prediction are proposed in this study, as shown in Figure 3.

Figure 3: Framework and process of highway crash risk forecasting model, organized into a model preparation stage, a model establishment stage, and a model application stage.

Model preparation stage: the preparatory stage of highway crash risk modeling mainly includes defining modeling objectives, understanding the advantages and disadvantages of existing prediction models, and collecting relevant basic data according to the requirements of the models.

Model establishment stage: according to the purpose of modeling, analyze variable requirements and basic data, select appropriate models, assign corresponding physical meanings to model variables, and make corresponding model assumptions for problems that cannot be considered.

Model application stage: according to the established model, the model is solved for parameter estimation with the help of necessary mathematical software and computer technology, and the model results are discussed and analyzed, for example through error analysis, sensitivity analysis, and prediction accuracy analysis.

In addition, the following problems need to be considered and solved during highway crash risk prediction modeling. Firstly, there are many traffic flow variables that affect the highway crash risk. How should the variables with the highest correlation with the crash risk be selected as the explanatory variables of the crash risk prediction model? Secondly, how long a sequence of traffic flow data before the crash should be used to predict the highway traffic crash risk? Thirdly, what methods or models should be used to characterize the crash and non-crash traffic flow data samples with temporal correlation characteristics, so as to establish an effective highway crash risk prediction model?

Aiming at the problems in the modeling process mentioned above, we put forward the modeling steps of the highway crash risk assessment model in this paper. First of all, data matching was carried out by using the matched case-control sample structure to eliminate the influence of other factors on crash risk modeling to the maximum extent. Secondly, the random forest model is used to explore the correlation between the initial variables of the model and the traffic crash risk, and the variables with the largest correlation are extracted as the input variables of the crash risk prediction model. Then, the dynamic Bayesian network model is used to quantify the influence of traffic flow timing sequence correlation characteristics on highway crash risk, and the highway crash risk prediction model based on the dynamic Bayesian network is established. Finally, the effectiveness of the proposed model is verified by comparison with the prediction results of the static Bayesian network model. The steps of model analysis are as follows.

According to the above crash risk modeling process, the random forest model and dynamic Bayesian network model are used for variable selection and model structure construction. In order to facilitate understanding, the random forest model and dynamic Bayesian network model used in this paper are introduced in turn.

3.1. Random Forest. The random forest algorithm is an ensemble learning algorithm proposed by Cutler et al. in 2001 [23]. A major advantage of the model is that it is easy to measure the relative importance of each characteristic variable to the prediction [24]. The random forest algorithm is a supervised machine learning algorithm that builds multiple decision trees and merges them together to obtain more accurate and stable classification or regression results [25]. In the random forest algorithm, two indexes are used to evaluate the importance of variables: one based on the Gini value and one based on the OOB (out-of-bag) error rate. The influence of different traffic variables on highway crash risk was explored by using the OOB error rate evaluation index in this research [26]. Relevant studies have shown that the random forest algorithm can handle the multicollinearity problem without separate cross validation, and that it can effectively deal with data samples with multiple variables. Therefore, the random forest algorithm is used to extract the traffic flow variables highly related to the highway crash risk as the explanatory variables of the prediction model.

3.2. Dynamic Bayesian Networks. Based on the probability network, the dynamic Bayesian network combines the original static network structure with time information to form a new stochastic model for time sequence data. With the introduction of the time factor, the data formed at different moments reflect the change and development law of the descriptive variables. For Bayesian networks, the key problem is to make probabilistic inference about the hidden states of a group of random variables, and the random variables representing the hidden states in dynamic Bayesian networks have the characteristics of time series. These observed samples can be represented in terms of decomposition or distribution. In addition, because the dynamic Bayesian network is a typical directed acyclic graph model, the conditional probability distribution of each node in it can be estimated independently. Therefore, the dynamic Bayesian network model is easier to explain and learn [27].

As a more general spatial-temporal state analysis model, the dynamic Bayesian network model extends the static Bayesian network to a stochastic process model with a time factor. Such an extension would make the distribution of random variables very complicated and difficult to solve. In order to facilitate modeling and solving, it is generally necessary to make simplifications and assumptions [28]. First, the conditional probability process is assumed to be uniformly stable for all $t$ in finite time. Second, the dynamic probabilistic process is assumed to be Markov, that is, the future state satisfies $P(X[t+1] \mid X[1], X[2], \ldots, X[t]) = P(X[t+1] \mid X[t])$. Finally, the conditional probabilistic process of adjacent times is assumed to be stationary, that is, $P(X[t+1] \mid X[t])$ does not depend on the time $t$.

Based on the above assumptions, the dynamic Bayesian network model is defined as a random process of joint probability distribution on a time trajectory. It consists of a pair of networks $(B_0, B_{\rightarrow})$. $B_0$ is defined as the prior network, which is used to describe the joint probability $P(X_1)$ of the initial state. $B_{\rightarrow}$ is defined as the transition network, which is used to describe the variable transition probability $P(X_{t+1} \mid X_t)$. The network graphical expression is shown in Figure 4.

Figure 4: Graphic representation of dynamic Bayesian network (panels (a), (b), and (c)).

When there are only two time series, $(B_0, B_{\rightarrow})$ is a Bayesian network with two time series. It includes the functions of both transition probability and observation probability models. The node probability of this Bayesian network containing two time series can be calculated by

$$P(X_t \mid X_{t-1}) = \prod_{i=1}^{N} P\left(X_t^{(i)} \mid \mathrm{Pa}\left(X_t^{(i)}\right)\right), \tag{1}$$

where $X_t^{(i)}$ is the $i$th node in time series $t$ (hidden and observation nodes together, $N = N_h + N_o$), and $\mathrm{Pa}(X_t^{(i)})$ is the parent node of node $X_t^{(i)}$. The parent nodes include not only variables at time $t$ but also variables at time $t-1$.
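The factorization in equation (1) can be illustrated with a minimal sketch. It assumes a toy slice with two binary nodes: node 1 (hidden) has node 1 of the previous slice as its parent, and node 2 (observed) has node 1 of the same slice as its parent. The structure, the table names `P_X1` and `P_X2`, and all probability values are hypothetical and chosen only for illustration; they are not the network or the parameters used in the paper.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative values only).
# Node 1 (hidden): P(x1_t | x1_prev), keyed by (x1_t, x1_prev).
P_X1 = {
    (0, 0): 0.9, (1, 0): 0.1,
    (0, 1): 0.3, (1, 1): 0.7,
}
# Node 2 (observed): P(x2_t | x1_t), keyed by (x2_t, x1_t).
P_X2 = {
    (0, 0): 0.8, (1, 0): 0.2,
    (0, 1): 0.25, (1, 1): 0.75,
}


def transition_prob(curr, prev):
    """P(X_t = curr | X_{t-1} = prev) as the product over nodes in equation (1)."""
    x1, x2 = curr
    x1_prev, _ = prev  # node 2 of the previous slice is not a parent here
    return P_X1[(x1, x1_prev)] * P_X2[(x2, x1)]


prev_state = (0, 1)
# Summing over every configuration of the current slice should give 1
# (up to floating-point rounding), since each factor is a proper distribution.
total = sum(transition_prob(curr, prev_state) for curr in product((0, 1), repeat=2))
print(round(total, 12))
```

Each node contributes exactly one factor conditioned on its own parents, which is why the conditional probability table of each node can be estimated separately.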
In a Bayesian network containing two time series, the first time series has no parameters, and only the nodes in the second time series have node probability parameters. Similarly, for dynamic Bayesian networks containing $T$ time series, the node probability distribution can be calculated by

$$P\left(X_{1:T}^{(1:N)}\right) = \prod_{i=1}^{N} P_{B_0}\left(X_1^{(i)} \mid \mathrm{Pa}\left(X_1^{(i)}\right)\right) \times \prod_{t=2}^{T} \prod_{i=1}^{N} P_{B_{\rightarrow}}\left(X_t^{(i)} \mid \mathrm{Pa}\left(X_t^{(i)}\right)\right). \tag{2}$$

In dynamic Bayesian networks, the hidden state of time series $t$ is represented by a series of random variables, denoted as $H_t^{(i)}$, $i \in \{1, \ldots, N_h\}$. Meanwhile, the observation state can also be represented by a series of random variables, denoted as $E_t^{(j)}$, $j \in \{1, \ldots, N_o\}$. Each hidden and observed state variable can be a discrete or continuous variable. In spatial-temporal state analysis models such as the hidden Markov model and the state space model, there is usually a transition probability $P(H_t \mid H_{t-1})$, an observation probability $P(E_t \mid H_t)$, and an initial state distribution $P(H_1)$. However, this kind of model cannot accurately describe the causal relationship between variables, while the dynamic Bayesian network model can consider both the causal relationship between variables and the dynamic change of the causal relationship caused by time factors, which makes it more suitable for analyzing the highway crash risk considering the space-time characteristics. Therefore, the structure diagram of the dynamic Bayesian network model can be given as shown in Figure 5.

Figure 5: Structure diagram of dynamic Bayesian network model, with a hidden state variable $H$ and observation variables $E_1$, $E_2$, $E_3$ in each time slice. (a) SBN. (b) DBN (T = 1 and T = 2).

As shown in Figure 5, nodes are used to represent hidden state variables or observed state variables. The hidden state uses a discrete random variable to represent the probability of taking each possible value. The hidden state variable $H$ is a binary variable in the highway crash risk model. The crash and non-crash traffic flow variables are considered as the observed variables $E$. It should be pointed out that when only one time segment is considered, the structure is a static Bayesian network; when multiple time segments are considered, it is a dynamic Bayesian network. The figure takes two time series as an example. Its network structure extends the static network structure along the time dimension. The connecting lines between nodes are also divided into two types in the DBN. One is the connection line within the same time segment, which represents the instantaneous correlation between variables and is represented by a solid line. The other is the connection line between different time segments, which is used to describe the transition between the crash states of two time series and is represented by dashed lines.

Theoretically speaking, the more input variables the model has, the higher the model accuracy. However, the complexity of the corresponding model is higher, and the solving time and computing resources required for the model will also increase. Relevant research results show that, when the number of modeling variables exceeds a certain number, the improvement of model accuracy is very small and is far outweighed by the negative effect of the increased model solving time. Through several modeling experiments, it is found that three variables are the optimal choice when establishing the dynamic Bayesian network model: this ensures the accuracy of the model while keeping the computational cost low. Therefore, the three variables with the highest correlation with road accident risk were selected as modeling indicators in this study.

According to the different characteristics of specific research fields, the application of Bayesian network modeling technology is mainly designed from the following three aspects.

The first step is describing the variables of the research problem and their value range. In this study, hidden variables (including crash state and non-crash state) and observation variables (including traffic flow, speed, and occupancy on upstream and downstream detectors) are mainly included.

The second step is structural learning that represents the dependencies between variables. For dynamic Bayesian networks, structural learning should consider not only the causal relationship between variables in the same time segment but also the causal relationship between variables in different time segments. Therefore, static Bayesian network structure learning algorithms such as the hill climbing algorithm, simulated annealing algorithm, and genetic algorithm cannot be directly used in dynamic Bayesian network structure learning [29]. According to the above model variable selection results, there are not many traffic observation variables affecting crash risk. Also, there is a clear causal relationship between variables, so the network structure can be given directly. In addition to observation variables and state variables, the number of time segments is a relatively important influencing factor in the establishment of the dynamic Bayesian network model.

The third step is to learn the parameters of the conditional probability distributions between observed variables. Dynamic Bayesian network parameter learning is similar to the static Bayesian parameter learning algorithm. In order to avoid the error of parameter estimation caused by missing traffic flow observation variables, the expectation-maximization algorithm is used to estimate the maximum likelihood of parameters [30].

4. Case Study

4.1. Results

4.1.1. Variable Selection Results Based on Random Forest Model. There is usually a delay of a few minutes between the recorded time and the actual time of the crash occurrence [31]. Therefore, traffic flow data 5–10 minutes before the crash are used as the basic variables of the random forest model. This paper uses the R language computing platform to implement the random forest model. The 30 traffic flow variables were taken as the initial variables, and the traffic flow data 5–10 minutes before the crash were taken as the sample data. The random forest algorithm was used to calculate the importance of each variable. The calculation results are shown in Figure 6, where the horizontal axis is the name of the variable, and the vertical axis represents the average reduction in model accuracy attributed to the variable, namely, the importance of the variable.

Figure 6: Ranking of the importance of variables. Vertical axis: change in average model accuracy (0 to 0.04). Horizontal axis: variables, in plotted order ASD1D2, AOU1U2, DSU1D1, ASU1U2, DOU1D1, OSU1, OOU1, DSU2D2, DOU2D1, DOU1D2, OSD1, AOD1D2, DOU2D2, DSU2D1, OSU2, OSD2, OOD1, AVU1U2, OOD2, OOU2, OVU1, OVU2, DSU1D2, AVD1D2, OVD1, DVU2D2, DVU2D1, DVU1D1, OVD2, DVU1D2.

It can be seen from the figure that the index with the largest average variation of model classification accuracy is the downstream average speed. The second is the average occupancy rate of the upstream detectors, and the average variation of accuracy is between 0.03 and 0.035. The effect of the difference index of upstream and downstream detectors on the model classification accuracy is obviously weaker than that of the detector original index and mean index. At the same time, for the same type of indicators, the impact of speed and density indicators on model accuracy is significantly higher than that of flow indicators. The reason for this result may be that the flow data used for modeling are 5 min accumulated data, while the speed and density indicators are 5 min average data. This makes the data significantly different at the dimensional level, which may cause variables to have different effects on the model.

According to the ranking results of the importance of variables, the variables with high importance should be selected as the input variables of the crash risk model. Meanwhile, in order to reduce the complexity of the model, the number of input variables should not be too large. Therefore, the three variables with the highest importance are selected as the input variables of the prediction model, namely, the mean speed variable ASD1D2 of the downstream detectors, the mean occupancy variable AOU1U2 of the upstream detectors, and the speed difference variable DSU1D1 of the nearest upstream and downstream detectors.

4.1.2. Highway Crash Risk Prediction Results Based on Dynamic Bayesian Network. In order to ensure the generalization ability of the model, the sample data were randomly divided into a training dataset and a validation dataset. The proportion of crash and non-crash sample data in the two datasets remains unchanged, namely, the sample ratio of crash traffic flow to non-crash traffic flow is 1 : 4. The proportion of crash sample data in the dataset is much smaller than that of non-crash sample data. For a classification prediction model based on such unbalanced data, the overall classification accuracy alone cannot completely explain the quality of the model. It is necessary to construct a confusion matrix based on the prediction results to illustrate the prediction accuracy of such classification models based on unbalanced samples, and the confusion matrix is shown in Table 2.

Table 2: The confusion matrix of prediction result.

| Actual results | Predicted: Crash | Predicted: Non-crash |
| --- | --- | --- |
| Crash | True crash ($T_{\text{crash}}$) | False non-crash ($F_{\text{non-crash}}$) |
| Non-crash | False crash ($F_{\text{crash}}$) | True non-crash ($T_{\text{non-crash}}$) |

The overall prediction accuracy, crash prediction accuracy and non-crash prediction accuracy, and the F-value measure of crash prediction can be calculated as the evaluation indexes of the model prediction validity. The calculation formulas are as follows:

$$\text{Overall prediction accuracy} = \frac{T_{\text{crash}} + T_{\text{non-crash}}}{T_{\text{crash}} + F_{\text{crash}} + F_{\text{non-crash}} + T_{\text{non-crash}}} \times 100\%.$$

$$\text{Non-crash prediction accuracy} = \frac{T_{\text{non-crash}}}{F_{\text{non-crash}} + T_{\text{non-crash}}} \times 100\%.$$

$$\text{Crash prediction accuracy} = \frac{T_{\text{crash}}}{T_{\text{crash}} + F_{\text{crash}}} \times 100\%.$$

$$G\text{-value} = \sqrt{\text{crash prediction accuracy} \times \text{non-crash prediction accuracy}}.$$

$$\text{Crash accuracy rate} = \frac{T_{\text{crash}}}{T_{\text{crash}} + F_{\text{crash}}} \times 100\%.$$

$$\text{Crash recall rate} = \frac{T_{\text{crash}}}{T_{\text{crash}} + F_{\text{non-crash}}} \times 100\%.$$

$$F\text{-value} = \frac{2 \times \text{crash accuracy rate} \times \text{crash recall rate}}{\text{crash accuracy rate} + \text{crash recall rate}}.$$

From the 1370 groups of sample data collected above (274 groups of crash sample data and 1096 groups of non-crash sample data), 870 groups of data (174 groups of crash sample data and 696 groups of non-crash sample data) were randomly selected as training sample data. The remaining 500 groups of sample data (100 groups of crash samples and 400 groups of non-crash samples) were used as validation samples. The structure learning and parameter learning process of the dynamic Bayesian network is implemented on the R language statistical analysis platform. The validation dataset is input into the trained model to analyze the validity of the predictive evaluation model. The confusion matrix of a single prediction result is shown in Table 3.

Table 3: The confusion matrix of single prediction result.

| Actual results | Predicted: Crash | Predicted: Non-crash |
| --- | --- | --- |
| Crash | 72 ($T_{\text{crash}}$) | 28 ($F_{\text{non-crash}}$) |
| Non-crash | 39 ($F_{\text{crash}}$) | 361 ($T_{\text{non-crash}}$) |

Overall prediction accuracy = (72 + 361)/(72 + 39 + 361 + 28) × 100% = 86.6%.

Non-crash prediction accuracy = 361/(28 + 361) × 100% = 92.8%.
$$\text{Crash prediction accuracy} = \frac{72}{72 + 39} \times 100\% = 64.9\%$$

$$\text{G-value} = \sqrt{0.928 \times 0.649} = 0.776$$

$$\text{Crash accuracy rate} = \frac{72}{72 + 39} \times 100\% = 64.9\%$$

$$\text{Crash recall rate} = \frac{72}{72 + 28} \times 100\% = 72.0\%$$

$$\text{F-value} = \frac{2 \times 64.9\% \times 72\%}{64.9\% + 72\%} = 0.682$$

In order to reduce the error caused by the randomness of sample data extraction, 10 training datasets and validation datasets were randomly divided from the original dataset. The mean value of multiple model estimates is taken as the final estimate value of the model, and the mean value of multiple validation results is taken as the predicted value of the model to illustrate the validity of the model. The prediction results are shown in Table 4.

Table 4: The confusion matrix of multiple prediction results.

Serial number | Overall prediction accuracy (%) | Accuracy of crash prediction (%) | Accuracy of non-crash prediction (%) | G-value | F-value
1 | 86.6 | 64.9 | 92.8 | 0.776 | 0.682
2 | 82.2 | 54.8 | 90.4 | 0.704 | 0.586
3 | 83.4 | 56.4 | 93.2 | 0.725 | 0.644
4 | 84.2 | 59.1 | 91.7 | 0.736 | 0.633
5 | 86.8 | 68.5 | 90.0 | 0.789 | 0.656
6 | 86.2 | 62.6 | 93.9 | 0.767 | 0.691
7 | 84.8 | 60.1 | 92.0 | 0.746 | 0.645
8 | 85.6 | 63.7 | 91.2 | 0.762 | 0.644
9 | 82.6 | 55.5 | 91.1 | 0.711 | 0.603
10 | 86.4 | 61.8 | 95.6 | 0.768 | 0.712
Mean | 84.9 | 60.8 | 92.3 | 0.748 | 0.649

From the perspective of prediction accuracy, the overall prediction accuracy of the dynamic Bayesian network model is above 80% in every run, the prediction accuracy of crash ranges from 54.8% to 68.5% (mean 60.8%), and the prediction accuracy of non-crash is 90% or higher (mean 92.3%). This is mainly because the proportion of crash traffic flow to non-crash traffic flow in the sample data is 1 : 4, so the number of non-crash traffic flow samples is significantly higher than that of crash traffic flow samples. As a result, the accuracy of non-crash prediction is significantly higher than that of crash prediction. Nevertheless, from the perspective of the prediction accuracy indexes, the predictive ability of the dynamic Bayesian network model for traffic crash risk has reached a high level.

4.2. Discussion

4.2.1. Comparative Analysis of Prediction Results. In order to further illustrate the effectiveness of the Bayesian network model considering temporal correlation characteristics, the prediction results of the dynamic Bayesian network model and the static Bayesian network model are compared and analyzed in this study. The data used for the static Bayesian network model include time segment 1, that is, the traffic flow variable sample in 5–10 minutes before the crash. The dynamic Bayesian network uses two time series data, time segment 1 and time segment 2, for model training and verification. The structure learning and parameter learning of the static and dynamic Bayesian network models are realized by the R language package bnlearn. The final prediction results of the two models are shown in Figure 7.

Figure 7: Prediction result comparison between dynamic Bayesian network model and static Bayesian network model. Values shown (dynamic / static): overall prediction accuracy 84.90% / 83.10%; crash prediction accuracy 60.80% / 58.70%; non-crash prediction accuracy 92.30% / 90.40%; F-value 0.649 / 0.632; G-value 0.748 / 0.725.

As can be seen from Figure 7, the overall prediction accuracy of the dynamic Bayesian model is 1.8 percentage points higher than that of the static Bayesian model. The crash prediction accuracy of the dynamic Bayesian model is 2.1 percentage points higher than that of the static Bayesian model. The non-crash prediction accuracy of the dynamic Bayesian model is 1.9 percentage points higher than that of the static Bayesian model. The F-value of the dynamic Bayesian model is 0.017 higher than that of the static Bayesian model, and the G-value is 0.023 higher. Therefore, the prediction results show that the dynamic Bayesian model has a better prediction effect than the static Bayesian model for the same sample data.

4.2.2. Influence Analysis of Time Segment Number. As mentioned above, the selection of time segments is an important influencing factor in the process of establishing the dynamic Bayesian network model. The 5 min cumulative traffic flow state index is generally accepted as an explanatory variable of highway crash risk. Therefore, the unit time length of each time segment is set to 5 min. In addition, another important variable affecting model performance is the number of time segments, that is, how many time segments of variables each dynamic network model is composed of. In this paper, traffic flow index data of upstream and downstream detectors were extracted for the 30 min before the crash (a total of six time segments). Due to the error of 3–5 minutes between the recording time of the crash sample and the actual time of the crash, time segment 0, that is, the crash sample data accumulated 0–5 minutes before the crash occurred, has a certain error, so it is omitted. If five time series are considered in the model, the dynamic Bayesian network structure will be composed of five static Bayesian networks. Also, there is correlation between the variables of adjacent segments, so the computation and complexity of model parameter estimation are very high. If only two or three adjacent time segments are considered, the prediction results cannot fully reflect the influence of temporal correlation between traffic flow observation variables on crash risk.

In order to ensure the operation efficiency and prediction accuracy of the model, this paper establishes 10 dynamic Bayesian network models with 2, 3, 4, and 5 time series, respectively. Among them, there are four combinations of models including two time series, which are trained with sample data of series 1 and series 2, series 2 and series 3, series 3 and series 4, or series 4 and series 5, respectively. The model containing three time series has three combinations, which are modeled with sample data of series 1, series 2, and series 3; series 2, series 3, and series 4; or series 3, series 4, and series 5, respectively. The model containing four time series has two combinations, which are modeled with sample data of series 1, series 2, series 3, and series 4 or series 2, series 3, series 4, and series 5, respectively. There is only one combination for the model containing five time series, that is, modeling with sample data of series 1, series 2, series 3, series 4, and series 5. Therefore, the influence of the number of time series on the crash risk model can be illustrated by comparing the crash prediction accuracy of the 10 dynamic Bayesian network models. The prediction results are shown in Table 5.

Table 5: Prediction results of dynamic Bayesian network model considering different time series.

Model number | Contains time series | Overall prediction accuracy (%) | Accuracy of crash prediction (%) | Accuracy of non-crash prediction (%) | G-value | F-value
1 | 1, 2 | 84.9 | 60.8 | 92.3 | 0.748 | 0.649
2 | 2, 3 | 83.8 | 58.8 | 93.4 | 0.741 | 0.640
3 | 3, 4 | 83.5 | 57.7 | 91.8 | 0.728 | 0.634
4 | 4, 5 | 81.7 | 50.7 | 88.4 | 0.670 | 0.589
5 | 1, 2, 3 | 85.9 | 61.1 | 92.9 | 0.753 | 0.653
6 | 2, 3, 4 | 83.5 | 58.7 | 91.4 | 0.733 | 0.640
7 | 3, 4, 5 | 82.5 | 54.7 | 90.4 | 0.703 | 0.615
8 | 1, 2, 3, 4 | 84.8 | 59.7 | 92.8 | 0.744 | 0.645
9 | 2, 3, 4, 5 | 83.8 | 56.7 | 92.1 | 0.723 | 0.627
10 | 1, 2, 3, 4, 5 | 85.1 | 59.8 | 93.2 | 0.746 | 0.646

As can be seen in Table 5, the prediction results obtained by modeling with different numbers of time series differ in prediction accuracy. The model with the highest accuracy is the dynamic Bayesian network model with the sample data in time series 1, series 2, and series 3. Meanwhile, it is noted that the prediction accuracy of the model does not increase with the number of time series. This indicates that, in the modeling process, considering more time segments does not necessarily yield better prediction accuracy. In addition, it is found that the prediction accuracy of the models containing time series 1 is higher than that of the models without time series 1. Also, the prediction accuracy of the models containing both series 1 and series 2 is higher than that of the models without these two time series. This indicates that the variables closer to the crash occurrence time have a greater impact on the crash risk prediction model, and taking them as model training and validation data can effectively improve the prediction accuracy of the model.

5. Conclusion

Crash risk analysis and prediction are considered the premise of highway traffic safety control, which directly affects the accuracy and effectiveness of traffic safety decisions. This paper analyzes the influence of temporal correlation characteristics of traffic flow state variables on highway crash risk. A highway crash risk prediction method considering the time series correlation feature is proposed. Firstly, the "case-control" analysis method is used to extract crash traffic flow data and corresponding non-crash traffic flow data as sample data for crash risk prediction modeling. Meanwhile, the sample data of half an hour were divided into 6 time segments of 5 minutes each as sample data for model training and verification. Then, the random forest model is used to select the traffic flow variables highly correlated with the risk of highway crashes from 30 initial variables. The downstream mean speed variable ASD1D2, the upstream mean occupancy variable AOU1U2, and the speed difference variable DSU1D1 of the nearest upstream and downstream detectors are determined as the explanatory variables of the crash risk prediction model. Finally, based on the dynamic Bayesian network modeling method, the highway traffic crash risk prediction model considering the temporal correlation feature is proposed. The validity of the model is illustrated by comparing the prediction accuracy of the model with that of the static Bayesian network model on the test dataset. The results of the case study show that the prediction accuracy of the crash risk prediction model considering the temporal correlation features is higher than that of the static Bayesian network method. Also, the prediction model using the first three time series has the best effect.

Although the crash risk prediction model proposed in this study improves the accuracy of crash risk prediction to a certain extent, there is still a lot of room for improvement. Firstly, this model mainly considers the influence of time series factors on the crash risk prediction model. Besides the time factor, the space factor is also one of the important factors affecting the accuracy of the model. How to consider the influence of both time and space factors in the process of accident risk modeling is a direction of future research. In addition, the essence of the prediction model proposed in this research is a dichotomous prediction of crash and non-crash traffic flow, so only two kinds of prediction results, with or without risk, can be obtained. In order to meet the requirements of highway safety risk management practice, a more detailed crash risk classification level is needed. In the future, the crash severity index can be considered as the dependent variable of the prediction model. The dichotomy problem can be extended to a multiclassification problem to achieve the classification and prediction of highway crash risk level.

Data Availability

All datasets were collected from the Performance Measurement System, which can be freely downloaded from https://pems.dot.ca.gov/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Key R&D Program of China (2017YFC0803900).

References

[1] Y. Yang, Z. Yuan, J. Chen, and M. Guo, “Assessment of osculating value method based on entropy weight to transportation energy conservation and emission reduction,” Environmental Engineering & Management Journal, vol. 16, no. 10, pp. 2413–2424, 2017.
[2] Y. Yang, Z. Yuan, and R. Meng, “Exploring traffic crash occurrence mechanism towards cross-area freeways via an improved data mining approach,” Journal of Transportation Engineering Part A Systems, vol. 148, no. 9, Article ID 04022052, 2022.
[3] Y. Yang, K. He, Y. P. Wang, Z. Z. Yuan, Y. H. Yin, and M. Z. Guo, “Identification of dynamic traffic crash risk for cross-area freeways based on statistical and machine learning methods,” Physica A: Statistical Mechanics and Its Applications, vol. 595, Article ID 127083, 2022.
[4] Y. Yang, K. Wang, Z. Yuan, and D. Liu, “Predicting freeway traffic crash severity using XGBoost-bayesian network model with consideration of features interaction,” Journal of Advanced Transportation, Article ID 4257865, 2022.
[5] W. Wang, Z. Yuan, Y. Yang, X. Yang, and Y. Liu, “Factors influencing traffic accident frequencies on urban roads: a spatial panel time-fixed effects error model,” PLoS One, vol. 14, no. 4, Article ID e0214539, 2019.
[6] S. Yu, Y. Jia, and D. Sun, “Identify factors that influence the patterns of road-crashes by using association rules: a study case from Wisconsin, United States,” Sustainability, vol. 11, 2019.
[7] W. Wang, Z. Yuan, Y. Liu, X. Yang, and Y. Yang, “A random parameter logit model of immediate red-light running behavior of pedestrians and cyclists at major-major intersections,” Journal of Advanced Transportation, vol. 2019, Article ID 2345903, 13 pages, 2019.
[8] Y. Yang, N. Tian, Y. Wang, and Z. Yuan, “A parallel FP-growth mining algorithm with load balancing constraints for traffic crash data,” International Journal of Computers, Communications & Control, vol. 17, no. 4, p. 4806, 2022.
[9] D. Sun, Y. Ai, Y. Sun, and L. Zhao, “A highway crash risk assessment method based on traffic safety state division,” PLoS One, vol. 15, no. 1, Article ID e0227609, 2020.
[10] D. Sun, Y. Ai, and L. Wang, “Freeway traffic safety state classification method based on multi-parameter fusion clustering,” Modern Physics Letters B, vol. 36, no. 20, Article ID 2250088, 2022.
[11] A. H. Lee, K. Wang, J. A. Scott, K. K. Yau, and G. J. McLachlan, “Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros,” Statistical Methods in Medical Research, vol. 15, no. 1, pp. 47–61, 2006.
[12] T. F. Golob and W. W. Recker, “An analysis of truck-involved freeway accidents using log-linear modeling,” Journal of Safety Research, vol. 18, no. 3, pp. 121–136, 1987.
[13] T. F. Golob and W. W. Recker, “Relationships among urban freeway accidents, traffic flow, weather, and lighting conditions,” Journal of Transportation Engineering, vol. 129, no. 4, pp. 342–353, 2003.
[14] T. F. Golob, W. W. Recker, and V. M. Alvarez, “Freeway safety as a function of traffic flow,” Accident Analysis and Prevention, vol. 36, no. 6, pp. 933–946, 2004.
[15] T. F. Golob and W. W. Recker, “A method for relating type of crash to traffic flow characteristics on urban freeways,” Transportation Research, Part A (Policy and Practice), vol. 38, no. 1, 80 pages, 2004.
[16] L. Li, X. Sheng, B. Du, Y. Wang, and B. Ran, “A deep fusion model based on restricted Boltzmann machines for traffic accident duration prediction,” Engineering Applications of Artificial Intelligence, vol. 93, Article ID 103686, 2020.
[17] Y. Lin, L. Li, H. Jing, B. Ran, and D. Sun, “Automated traffic incident detection with a smaller dataset based on generative adversarial networks,” Accident Analysis & Prevention, vol. 144, Article ID 105628, 2020.
[18] L. Li, C. G. Prato, and Y. Wang, “Ranking contributors to traffic crashes on mountainous freeways from an incomplete dataset: a sequential approach of multivariate imputation by chained equations and random forest classifier,” Accident Analysis & Prevention, vol. 146, Article ID 105744, 2020.
[19] L. Li, Y. Lin, B. Du, F. Yang, and B. Ran, “Real-time traffic incident detection based on a hybrid deep learning model,” Transportmetrica: Transport Science, vol. 18, no. 1, pp. 78–98.
[20] D. Lord and F. Mannering, “The statistical analysis of crash-frequency data: a review and assessment of methodological alternatives,” Transportation Research Part A: Policy and Practice, vol. 44, no. 5, pp. 291–305, 2010.
[21] S. Roshandel, Z. Zheng, and S. Washington, “Impact of real-time traffic characteristics on freeway crash occurrence: systematic review and meta-analysis,” Accident Analysis & Prevention, vol. 79, pp. 198–211, 2015.
[22] J. Sun and J. Sun, “Proactive assessment of real-time traffic flow accident risk on urban expressway,” Journal of Tongji University, vol. 42, no. 6, pp. 873–879, 2014.
[23] A. Cutler, D. R. Cutler, and J. R. Stevens, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 157–176, 2004.
[24] R. Harb, X. Yan, E. Radwan, and X. Su, “Exploring precrash maneuvers using classification trees and random forests,” Accident Analysis and Prevention, vol. 41, no. 1, pp. 98–107, 2009.
[25] C. Strobl, A. L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, “Conditional variable importance for random forests,” BMC Bioinformatics, vol. 9, no. 1, p. 307, 2008.
[26] K. J. Archer and R. V. Kimes, “Empirical characterization of random forest variable importance measures,” Computational Statistics and Data Analysis, vol. 52, no. 4, pp. 2249–2260, 2008.
[27] C. G. Enright, M. G. Madden, and N. Madden, “Bayesian networks for mathematical models: techniques for automatic construction and efficient inference,” International Journal of Approximate Reasoning, vol. 54, no. 2, pp. 323–342, 2013.
[28] Q. Xiao and S. Gao, Application of Bayesian Network in Intelligent Information Processing, National Defense Industry Press, Beijing, China, 2012.
[29] I. N. Junejo, “Using dynamic Bayesian network for scene modeling and anomaly detection,” Signal, Image and Video Processing, vol. 4, no. 1, pp. 1–10, 2010.
[30] K. P. Murphy, “The Bayes net toolbox for matlab,” Computational Statistics, vol. 33, no. 2, pp. 1024–1034, 2001.
[31] H. Hassan and M. A. Abdelaty, “Exploring visibility-related crashes on freeways based on real-time traffic flow data,” Transportation Research Board Meeting, vol. 11, 2011.
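
Supplementary sketch for Section 4.1.2. The evaluation indexes defined there can be recomputed from the four cells of the confusion matrix. The following Python snippet is a minimal illustration, not the authors' R implementation; the function name `crash_metrics` and the dictionary keys are hypothetical names introduced here. It uses only the counts reported in Table 3.

```python
import math


def crash_metrics(t_crash, f_non_crash, f_crash, t_non_crash):
    """Evaluation indexes of Section 4.1.2 from confusion-matrix counts.

    Hypothetical helper (not from the paper's code). Arguments follow the
    Table 2 notation: true crash, false non-crash, false crash, true non-crash.
    """
    total = t_crash + f_non_crash + f_crash + t_non_crash
    overall = (t_crash + t_non_crash) / total
    # Crash prediction accuracy and "crash accuracy rate" share one formula.
    crash_acc = t_crash / (t_crash + f_crash)
    non_crash_acc = t_non_crash / (f_non_crash + t_non_crash)
    recall = t_crash / (t_crash + f_non_crash)
    g_value = math.sqrt(crash_acc * non_crash_acc)
    f_value = 2 * crash_acc * recall / (crash_acc + recall)
    return {
        "overall": overall,
        "crash_accuracy": crash_acc,
        "non_crash_accuracy": non_crash_acc,
        "crash_recall": recall,
        "g_value": g_value,
        "f_value": f_value,
    }


# Counts from Table 3: 72 true crash, 28 false non-crash,
# 39 false crash, 361 true non-crash.
metrics = crash_metrics(72, 28, 39, 361)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the printed values agree with the single-run figures reported in the text (86.6%, 64.9%, 92.8%, 72.0%, 0.776, and 0.682).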
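
Supplementary sketch for Section 4.2.2. The 10 dynamic Bayesian network models correspond to every run of at least two adjacent time series among series 1 to 5. The short Python snippet below enumerates those runs under that reading; the function name `contiguous_windows` and its parameters are hypothetical and are not taken from the paper.

```python
def contiguous_windows(n_series=5, min_len=2):
    """List every run of adjacent time series with at least min_len members.

    Hypothetical helper: series are numbered 1..n_series, and runs are
    ordered by length and then by starting series.
    """
    windows = []
    for length in range(min_len, n_series + 1):
        for start in range(1, n_series - length + 2):
            windows.append(tuple(range(start, start + length)))
    return windows


windows = contiguous_windows()
for model_number, series in enumerate(windows, start=1):
    print(model_number, series)
```

With the default arguments this yields 4 + 3 + 2 + 1 = 10 combinations, in the same order as the model numbers of Table 5 (for example, model 5 is series 1, 2, and 3).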