Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Time and Distance Gaps of Primary-Secondary Crashes Prediction and Analysis Using Random Forests and SHAP Model

Time and Distance Gaps of Primary-Secondary Crashes Prediction and Analysis Using Random Forests... Hindawi Journal of Advanced Transportation Volume 2023, Article ID 7833555, 19 pages https://doi.org/10.1155/2023/7833555 Research Article Time and Distance Gaps of Primary-Secondary Crashes Prediction and Analysis Using Random Forests and SHAP Model 1 1 1,2 1 Xinyuan Liu , Jinjun Tang , Fan Gao , and Xizhi Ding Smart Transportation Key Laboratory of Hunan Province, School of Trafc and Transportation Engineering, Central South University, Changsha 410075, China Department of Geography and Resource Management, Te Chinese University of Hong Kong, Shatin, N.T., Hong Kong Correspondence should be addressed to Jinjun Tang; jinjuntang@csu.edu.cn Received 18 September 2022; Revised 12 December 2022; Accepted 18 March 2023; Published 14 April 2023 Academic Editor: Wen Liu Copyright © 2023 Xinyuan Liu et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Secondary crashes (SCs) are typically defned as the crash that occurs within the spatiotemporal boundaries of the impact area of the primary crashes (PCs), which will intensify trafc congestion and induce a series of road safety issues. Predicting and analyzing the time and distance gaps between the SCs and PCs will help to prevent the occurrence of SCs. In this paper, a combined data- driven method of static and dynamic approaches is applied to identify SCs. Ten, the random forests (RF) method is implemented to predict the two gaps using temporal, primary crash, roadway, and real-time trafc characteristics data collected from 2016 to 2019 at California interstate freeways. Subsequently, the SHapley Additive explanation (SHAP) approach is employed to interpret the RF outputs. Te results show that the trafc volume, speed, lighting, and population are considered the most signifcant factors in both gaps. Furthermore, the main and interaction efects of factors are also quantifed. High volume possibly promotes the time and distance gaps, while low volume inhibits them. And volume afects the distance gap inconsiderably when it falls between 300 and 400veh/5min. Trafc conditions with high speed and low volume are strongly associated with short-time and short-distance gaps. Darker surroundings probably accelerate the occurrence of SCs. Moreover, crashes involving the violation categories of improper turns or unsafe lane changes likely result in long time and distance gaps. Tese results have important implications for proposing trafc management and improving road safety. United States. In this context, SC prevention has become 1. Introduction a major consideration in the trafc safety feld. Road trafc crashes pose a threat to normal trafc operations In the past decades, a large body of literature has been and safety and can cause property damage or even serious devoted to investigating the identifcation of SCs and injuries. According to the world health organization [1], modeling the risk of SC occurrence [5–13]. Various statis- approximately 1.3 million people die each year as a result of tical and machine learning (ML) methods were applied to road trafc crashes. Between 20 and 50 million more people explore these two aspects of SCs [9–12]. However, the time sufer nonfatal injuries, with many incurring a disability. gap (i.e., the time diference) and distance gap (i.e., the Furthermore, road trafc crashes cost most countries 3% of spatial separation) between an SC and the corresponding PC their gross domestic product [1]. SCs, happening in the have received less attention, which might hinder a better spatiotemporal impact area of primary crashes (PCs), understanding of the possible time and location of SCs. commonly result in an additional impact on trafc and extra Among the few methods applied to study these two gaps, personal injury [2, 3]. According to [4], SCs can account for statistical approaches subjected themselves to the possibility 20% of all crashes and 18% of all fatalities on freeways in the of predicting infnitely large gaps [14, 15], while ML methods 2 Journal of Advanced Transportation failed to provide satisfactory prediction performance on the historical average speed to brighten the impact area. Like- distance gap [16]. Moreover, the black-box models need wise [11, 13], we applied this method to identify SCs and considered recurrent congestion. more explanation to discuss the efects of contributing factors in detail [16]. Terefore, some promising methods In summary, static methods are easy to implement and and data experiments are required. quickly obtain identifcation results, while dynamic methods To better capture the characteristics of SCs, we frst achieve better performance but consume a lot of compu- developed a hybrid method (i.e., static spatiotemporal tational time. Combining these two methods for SC iden- threshold-based and speed contour map-based methods) to tifcation can improve efciency and accuracy [16, 25]. Tis identify SCs and obtain the time and distance gaps. Sub- paper proposes a two-stage strategy to identify SCs by in- sequently, random forests (RFs) were used to predict the corporating the fxed spatiotemporal threshold-based and time and distance gaps, which have high prediction per- speed contour map-based methods. formance and diversity. And an interpretation technique, namely the SHapley Additive explanation (SHAP) approach, 2.2. Secondary Crash Risk Modeling and Predicting. was applied to examine the model outputs and estimate the global and local efects of the infuencing factors. Un- Several statistical and ML models have been applied to derstanding time and distance gaps and their infuencing explore the relationship between SC occurrences and con- factors can provide management strategies for trans- tributing factors [9–12]. For example, [10] proposed a logit portation agencies and improve trafc operations and road model to predict SC likelihood, and their results revealed safety. that rear-end crashes could increase the SC likelihood [11] developed a random efects logit model to link the proba- bility of SCs with real-time trafc volume conditions, pri- 2. Literature Review mary crash characteristics, environmental conditions, and geometric characteristics. Similarly, [29] used the Bayesian 2.1. Secondary Crash Identifcation. Overall, two types of methods, static and dynamic methods, were widely used to complementary log-log model to predict the likelihood of SCs and examine their relationship with several variables. identify SCs. Static methods identify SCs by setting the fxed spatiotemporal thresholds, which means crashes are iden- However, previous studies focused less on the time and distance gaps between the SCs and PCs. Several studies have tifed as SCs if they fall within the spatiotemporal thresholds of another crash [17]. First introduced this method and made attempts using regression approaches. For example, defned the thresholds equivalent to one mile upstream of [14] selected the ordinary least-squares (OLS) regression to a PC and 15minutes after clearance time. Following this model the two gaps separately. Teir results showed that study, further research associated with static methods has time and distance gaps were closely associated with collision been explored [5–7, 18]. For example, some studies pro- type and the duration of the primary crash. Likewise, [15] posed a spatial threshold of 2 miles and time thresholds of 2, applied OLS regression to evaluate the relationship between the time and distance gaps concerning individual crash 1, and 2hours, respectively, to identify California secondary crashes [19–21]. SCs can be selected quickly and efectively characteristics. Tey found that the number of lanes, total vehicles involved in the crash, morning time, and AADT from massive crashes according to spatiotemporal thresh- olds [2, 16]. However, static methods have the problem of were the most signifcant factors afecting time and distance subjective judgment: overestimation or underestimation of gaps. Although most independent variables had a high the thresholds [2, 22]. As an improvement [7], we in- signifcance, traditionalstatistical models usually made more troduced three sets of spatiotemporal thresholds to identify prior assumptions for input variables, and they were unable SCs on Florida interstates. Te spatial thresholds for all three to predict the possibility of massive gaps. Moreover, [14, 15] sets were 2 miles, and the time thresholds were 2h, built an independent regression model for the time and 15minutes, and 30minutes after the PCs’ clearance time. distance gaps, ignoring the potential correlation of the two Teir results confrmed that the identifcation ratio of SCs gaps because they happen at the same time. Terefore, it is necessary to consider an alternative model to investigate varied for diferent sets. With the support of various sensor technologies, dy- gaps simultaneously. By contrast, ML methods have become increasingly namic methods are becoming increasingly popular and used because of an improvement in the misclassifcation of SCs attractive and have gained more attention due to their high [22]. Tere are three main dynamic methods: (a) queuing prediction power and low limitation on data [30]. Multiple theory-based method [23, 24]; and (b) shockwave-based ML methods have been employed in trafc safety studies approaches [25, 26]; (c) speed contour map-based method [8, 13, 16, 29], such as neural network models, genetic al- [11, 13, 18]. In practical application, due to the data quality gorithms, random forests, XGBoost, etc. In a small number and quantity requirements of methods (a) and (b), the of studies on the time and distance gaps [16], the authors models are often simplifed and set assumptions, failing to utilized a linear regression model and two ML algorithms, refect the actual condition in the real world. Nevertheless, including a back-propagation neural network (BPNN) and the least-squares support vector machine (LSSVM), to build the speed contour map-based method has performed well without any simplifcation or assumptions since it can ac- three prediction models. Te results indicated that the BPNN and LSSVM models outperformed the linear re- curately capture the impact area of PCs [13, 27, 28]. For example, [18] compared the crash state speed with the gression model, but these two ML models also failed to Journal of Advanced Transportation 3 provide adequate performance on distance gap prediction. number. A two-step matching strategy is devised to obtain Regarding ML models, many other promising approaches, trafc volume and average speed for each crash. Te frst step matches the nearest detector upstream for every crash based such as ensemble algorithms, combine several base learners to enhance the prediction performance [31–33]. on the latitude and longitude of the crashes and the loop Besides, relatively fewer studies have focused on SC detectors. Te second step is extracting the volume and prevention. As [2] summarized, available data and high costs speed for 5minutes before the crashes. have limited relevant investigations, so continued endeavors Referring to the previous studies on SCs [14, 16], 17 are still needed. Te main objective of this study is to develop variables were selected from 4 dimensions. Specifcally, a reliable model to predict the time and distance gaps and temporal characteristics consist of 5 variables, namely, peak, analyze associated infuencing factors, which can help with weekend, weather, lighting, and population, which refect proactive prevention and improve safety. Several existing the environment’s state. Population density has a relation- research gaps and insufciencies were mitigated and sup- ship with vehicle trips [36, 37]. Primary crash factors include plemented in this study. 8 variables: collision severity, collision type, violation cat- egory, part count, etc. Tese variables demonstrate all the 3. Data Preparation information associated with a crash. Road condition and surface refect the roadway characteristics, including In this study, crash data were collected from the Statewide whether the pavement is a maintenance area or free from Integrated Trafc Records System (SWITRS), which records abnormal conditions or whether the pavement is dry/wet. detailed description of crash-related information, such as the Trafc volume (veh/5min) and speed (mile/h) report the unique case identifer, location (state route, postmile, lati- trafc characteristics. Detailed descriptions and statistical tude and longitude), collision year and time, collision se- information are expressed in Table 1. Additionally, the verity and type, lighting, weather, etc. A total of 24643 Pearson correlation coefcients (PCCs) were applied to crasheswere collectedfromfreeways I-10, I-5,US-101, I-210, examine the multicollinearity between the 17 variables. and I-110 in Los Angeles County of California over four Figure 1 demonstrates the computed results. As shown, all years, from June 2016 to December 2019 [34]. Trough the absolute values of PPC are less than 0.8, indicating a low a detailed examination, we removed the issues of redundant linear correlation between variables. attributes and missing values from the crash data. In order to combine real-time trafc data into the 4. Methodology analysis of crashes, volume, and speed were extracted from the caltrans performance measurement system [35]. In 4.1. SC Identifcation. Te identifcation of SCs is the basis PeMS, data were gathered from a set of loop detectors on the for conducting SC modeling and analysis. Te static spa- road and transmitted to the management center for storage. tiotemporal threshold-based estimation is the frst stage to And the confguration information of the detector was in- identify SCs roughly, and it can be defned in the following tegrated, including the location and unique identifcation equation: 1, if􏼂t ∈ t , t + t 􏼁􏼃 ∪ 􏼂S ∈ S , S + S 􏼁􏼃, B A A threshold B A A threshold SC � 􏼨 (1) 0, others, where (t , S ) denotes the location and occurrence time of where V and V denote the current and the reference A A (t,S) (t,S) the crash A, (t , S ) denotes the location and occurrence speed of one cell; V � 1 denotes that the cell is afected; B B (t,S) time of another crash B that needs to be examined, and V � 0 denotes that the cell is not afected. Te size of (t,S) (t , S ) denotes the defned time threshold and the impact area was determined by the reference speed V . threshold threshold (t,S) spatial threshold, and the value of 1 means that crash B is Te detailed procedures of the identifcation method are as identifed as a secondary crash corresponding to crash A and follows: 0 otherwise. (i) Apply the fxed spatiotemporal thresholds to Speed contour map-based method estimates the impact identify the candidate SCs. Referring to previous area of the PC based on the change in trafc speed, and a SC is studies on SC analysis in California [19–21], 2 miles identifed when it is discovered in this area. Te speed contour and 2hours were selected as the thresholds in this map comprises grid cells split by defned time intervals and the study. Te initial identifcation on 24,643 crashes milepost of sensor stations [2]. Te impact area can be has yielded 563 possible SCs. ascertained by checking the speed of each cell near the crash. In (ii) Extract the 5-min speed data to develop a speed general, it can be written as the following equation: contour map for a potential PC. More specifcally, 1, if V < V , (t,S) b (t,S) given the fxed spatiotemporal thresholds that have V � 􏼨 (2) (t,S) been determined, the time period for extracting 0, others, speed data is between 2hours before and 2hours 4 Journal of Advanced Transportation Table 1: Description of variables used in crash analysis. Variables Types Description Count Percent Mean Std Temporal characteristics 0 �no 261 70.9 Peak Binary — — 1 �yes (7:00–9:00 or 17:00–19:00) 107 29.1 0 �no 259 70.4 Weekend Binary — — 1 �yes 109 29.6 0 �clear 306 83.2 Weather Categorical 1 �cloudy 46 12.5 — — 2 �rainy 16 4.3 0 �daylight 217 59.0 1 �dusk-dawn 17 4.6 Lighting Categorical — — 2 �dark-streetlights 92 25.0 3 �dark-no streetlights 42 11.4 0 �incorporated (less than 25,000) 10 2.7 1 �incorporated (25,000–100,000) 93 25.3 Population Categorical 2 �incorporated (100,000–250,000) 67 18.2 — — 3 �incorporated (over 250,000) 188 51.1 4 �unincorporated (rural) 10 2.7 Primary crash characteristic 0 �fatal 3 0.8 Collision 1 �severe injury 98 26.6 Categorical — — severity 2 �other visible injury 15 4.1 3 �complaint of pain 252 68.5 0 �head on 2 0.5 1 �sideswipe 53 14.4 2 �rear-end 242 65.8 Collision type Categorical 3 �broadside 11 3.0 — — 4 �hit object 45 12.2 5 �overturned 12 3.3 6 �vehicle/pedestrian 3 0.8 0 �alcohol or drug 22 6.0 1 �unsafe speed 247 67.1 Violation 2 �following too closely 4 1.1 Categorical — — category 3 �unsafe lane change 44 12.0 4 �improper turning 38 10.3 5 �other 13 3.5 Counting total parties in the collision — — 0 �1 party 44 12.0 1 �2 parties 202 54.9 Party count Discrete 2 �3 parties 87 23.6 — — 3 �4 parties 29 7.9 4 �5 parties 4 1.1 5 �6 parties 2 0.5 0 �no 128 34.8 Tow away Binary — — 1 �yes 240 65.2 0 �no 343 93.2 Truck involved Binary — — 1 �yes 25 6.8 0 �felony 29 7.9 Hit and run Categorical 1 �misdemeanor 14 3.8 — — 2 �no hit and run 325 88.3 Alcohol 0 �no 333 90.5 Binary — — involved 1 �yes 35 9.5 Roadway characteristic 0 �construction or repair zone 23 6.2 Road condition Binary — — 1 �no unusual condition 345 93.8 0 �dry 335 91.0 Road surface Binary — — 1 �wet 33 9.0 Trafc characteristics Volume Continuous Vehicle counts over the 5minutes period preceding PCs — — 369.73 158.82 (veh/5min) Speed (mile/h) Continuous Vehicle speed over the 5minutes period preceding PCs — — 48.22 17.16 Journal of Advanced Transportation 5 1.0 Peak Weekend 0.8 Weather Lighting Population 0.6 Collision Severity Collision Type 0.4 Violation Category Party Count Tow Away 0.2 Truck Involved Hit and Run 0.0 Alcohol Involved Road Condition –0.2 Road Surface Volume (veh/5 min) Speed (mph) –0.4 Figure 1: Pearson correlation coefcients of variables. after the PC, and the spatial period is 2 miles up- others will not be selected once. After k rounds of extraction, stream and 2 miles downstream of the PC location. k new sample sets are obtained. (2) Decision trees genera- To eliminate the efects of recurrent congestion, the tion: training k decision trees using k sample sets of data. historical average speed was calculated by collecting During each round of generating trees, m variables from speed data from the PC-free days in a year [13, 18]. M(m < M) features are selected for training. Te ran- domness of the training data and variable combinations (iii) Estimate the impact area of a potential PC using improves the prediction performance of the model and equation (2). Te crashes that occur in the impact essentially prevents overftting. (3) Result combination. area of PC are identifed as SCs. Since the decision trees generated are independent, they Following the two-stage identifcation method, 368 SCs have the same contribution to the predicted result. Tere- are identifed in this study. Te ratio of the number of SCs to fore, the fnal result is obtained by averaging the k predicted the number of all crashes is 1.49%, which is consistent with results. For multioutput problems, the following changes are the fndings of the references in this area that this ratio is required in the decision trees: First is to store several output around 1–1.6% [11–13, 18, 25, 38–40]. values instead of 1. Ten use splitting criteria that calculate the average reduction across all outputs. 4.2. Random Forests. Tis study used RF to predict the time and distance gaps, which has been widely used in the 4.3. SHAP Method. ML methods commonly demonstrate an transportation feld [41–46]. RF uses a bootstrap sampling outstanding prediction performance, while their abilities are method to change the training set to build an integration of limited due to their low interpretability. Although the RF regression trees [47]. Such a mechanism expresses the fol- model can obtain global explanations (i.e., the relative im- lowing advantages: gaining higher performance. Further- portance), it cannot quantify local explanations for indi- more, RF can perform multiple output modeling [48, 49], vidual predictions. Nevertheless, local explanations provide more detailed information than global ones [50, 51]. Shapley which is suitable for simultaneously predicting the time and distance gaps. additive explanations (SHAP) technology is a representative Te input vectors for the RF model are represented as local interpretation method that can explain the main local 􏼈x � [x , x , . . . , x ],y � [y , y ]􏼉, i � 1,2, . . . , N. M efects and interaction efects of independent variables on i1 i2 iM i1 i2 and N are the number of features and samples, y and y dependent variables, as proposed by [52]. Furthermore, [53] i1 i2 indicate the time gap and the distance gap of sample i, improved SHAP to better and faster explain tree-based ML respectively. Figure 2 expresses the structural framework of models, such as random forests and gradient boosted trees. RF,which consists of thefollowing threeparts: (1)Sample set SHAP value is the core of the method which is computed selection: using the resampling method p times on the based on the game-theoretic approach, and it represents the original dataset to generate a sample set. In other words, average marginal contributions of one variable on a single some samples are likely to be chosen multiple times, while prediction. SHAP value is defned as the following equation: Peak Weekend Weather Lighting Population Collision Severity Collision Type Violation Category Party Count Tow Away Truck Involved Hit and Run Alcohol Involved Road Condition Road Surface Volume (veh/5 min) Speed (mph) 6 Journal of Advanced Transportation Table 2: Optimal values of parameters of the RF model. Original data set Parameters Values n_estimators 110 max_depth 10 max_features “auto” ... Sample set 1 Sample set 2 Sample set k min_samples_split 2 min_samples_leaf 1 Tree 1 Tree 2 ... Tree k ... Result 1 Result 2 Result k model and the multilayer perceptron regression (MPR) model. All the models were trained and validated by ap- plying the same dataset to guarantee the reliability of the comparison results. Specifcally, at a ratio of 7:3, the raw Final result samples were split into a training set and a testing set for training and testing model. Two classical regression evalu- Figure 2: Structural framework of RF. ation measures, namely, mean absolute error (MAE) and mean squared error (MSE), were used to assess model performance. Te fnal evaluation results are presented in R R ϕ (f, x) � 􏽘 􏽨f 􏼐P ∪ i􏼑 − f 􏼐P 􏼑􏽩, (3) i x i x i Table 3. As shown, the RF model mostly outperformed the M! R∈R other two models on both the training and testing sets in terms of predicting the time and distance gaps. where R indicates the set of all variable orderings, P represents the set of all variables that rank before the variable i in the ordering R, M is the number of variables, x means 5.2. Global Importance of Variables. Figure 3 visualizes the the values of explanatory variables, and f refers to the global importance of variables on the time gap. In the left single prediction, which can be written by the following part, variables are sorted in descending order according to equation: their global importance, computed by averaging their ab- solute SHAP values per variable. Te left x-axis indicates the f(x) � ϕ (f) + 􏽘 ϕ (f, x), (4) 0 i mean(|SHAPvalue|). As shown, lighting is the most dom- i�1 inant variable on the time gap, and its average efect on the predicted value is 0.11, followed closely by volume and where ϕ (f) means the base value, i.e., the average value of speed, which change the predicted value by 0.093 and 0.056, overall predictions. respectively, on average. It suggested that the trafc char- Additionally, the global importance of variables is the acteristics signifcantly afect the time gap. Tis fnding is not sum of the contribution of one variable on all predictions, surprising; Trafc characteristics are the direct response of which is calculated by averaging absolute SHAP values as the trafc state, which largely infuences the travel sur- shown in the following equation: roundings and driver status. As [11] indicated, more than 􏼌 􏼌 􏼌 􏼌 (j) geometric characteristics and primary crash characteristics, 􏼌 􏼌 I � 􏽘 􏼌 􏼌 ϕ , (5) i 􏼌 􏼌 trafc characteristics could signifcantly afect the SC like- j�1 lihood. Subsequently, population has a greater contribution (j) where I represents the importance of variable i, ϕ in- than party count and collision severity, indicating that the i i dicates the SHAP value for variable i in the single prediction temporal characteristic of population impacts the time gap j, and n is the number of all predictions. more than the primary crash characteristic. By contrast, the Te proposed RF model and SHAP method were mainly roadway characteristics of road surface and condition have implemented in Python (3.8.8) using scikit-learn (0.24.1) a substantially minor efect on the time gap, with the and shap (0.40.0). Te SHAP package contains three ap- mean(|SHAPvalue|) less than 0.005. plications: force plot, summary plot and dependence plot. In In the right part, the diagram consists of points repre- this study, we apply the summary plot to describe the im- senting the samples, and the color visually reveals the portance of each variable and the dependence plot to refect magnitude of variables (red means a high value, while blue the main efects and the interaction efects of all variables. means a low value). Te right x-axis indicates the SHAP value, which refers to the efects of all variables on a single model output (i.e., the local efect). Tis diagram roughly 5. Results and Discussion illustrates the variation of efects with the change of either 5.1. Results. In this study, the grid-search with 5-fold cross- variable. Taking lighting as an example, its left side of the validation techniques (i.e., GridSearchCV) was used to vertical axis is covered with red points (indicate dark) and its determine the core parameters of the RF model. Table 2 right side is stacked with blue points (refer to daylight). Tis reports the optimal values of the parameters. In the appli- demonstrates that night may decrease the time gap, while the daytime probably promotes the time gap. In addition, high cation, the proposed RF model is compared with two tra- ditional multivariate models: the K-nearest neighbor (KNN) volume (red points) mostly has a positive SHAP value and Journal of Advanced Transportation 7 Table 3: Results of several models. Time gap Distance gap MAE MSE MAE MSE Training set Testing set Training set Testing set Training set Testing set Training set Testing set RF 0.22 0.46 0.07 0.31 0.45 0.45 0.33 0.31 KNN 0.44 0.47 0.28 0.32 0.49 0.47 0.36 0.36 MPR 0.45 0.46 0.30 0.32 0.46 0.47 0.31 0.32 Bold values refer to the maximum prediction performance in each circumstance. High Lighting Volume (veh/5 min) Speed (mph) Population Party Count Collision Severity Weather Tow Away Truck Involved Violation Category Peak Collision Type Hit and Run Weekend Alcohol Involved Road Surface Road Condition Low 0.00 0.02 0.04 0.06 0.08 0.10 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 mean (|SHAP value|) SHAP value Figure 3: Global importance of independent variables and a summary of local explanations for the time gap. low volume (blue points) mainly has a negative one, re- were calculated for each variable. In addition, considering vealing that high volume promotes the time gap while low the nontrivial efects of trafc characteristics on the time and volume inhibits it. distance gaps (see Figures 3 and 4 in the previous section), their interaction efects with the rest of the variables were Figure 4 represents the global importance of variables on the distance gap. As shown in the left part, volume is the also estimated. In this section, we select variables with strong efects for analysis. most signifcant contributor and has an overwhelming efect on the distance gap, changing the predicted value by 0.136. Figure 5 shows the local dependence plots for volume on Defnitely, volume size directly infuences the length of the the time and distance gaps. Specifcally, the frst two plots vehicle queueand, thus, the distance gap between the PCand reveal the main efects of volume, and the last two refect the SC. Lighting, speed, and population also rank at the top of interaction efects between volume and speed. Moreover, the the importance list. Road surface and condition are in the left column is for the time gap, while the right column is for bottom third and second places. Generally, the importance the distance gap. In each plot, every point corresponds to ranking of variables for the two gaps is diferent, but there a sample. Te x-axis represents the volume value; the left y- are overall similarities. Trafc features are always the most axis indicates the SHAP value (i.e., the local efect); the right y-axis and the diferent colored points in the last two plots important. Crash and temporal characteristics are com- monly distributed throughout the importance list. And road describe the speed value. As shown in Figures 5(a) and 5(b), traits contribute relatively small to both time and distance plots for volume reveal an overall upward trend. When gaps. Regarding the right part, it shows that high volume, volume is around 100veh/5min in the two plots, its local daylight, enormous speed, and a dense population have efects remain at the negative highest level, suggesting that a positive SHAP value, possibly increasing the distance gap. low volume may lead to a sharp decline in the time and distance gaps. One possible explanation is that low volume allows for such long distances between vehicles that drivers 5.3. Local Efects of Variables. In previous studies, the local tend to relax their vigilance generally. When faced with efects of a particular variable on the predicted outcome are a sudden crash, they are likely to react slowly and are unable often observed assuming that other variables are constant. to stop timely at high speed (as shown in the lower-left corner of Figures 5(c) and 5(d), the corresponding speed is Te drawback is that this way does not consider the issue that the changes of specifc variable likely cause variations in around 65mph). Another reasonable interpretation is that low volume does not contribute to long queue length for- other variables (rather than assuming that all other variables are constant). Te local dependence plot obtained based on mation, thus creating a short-distance gap. As volume grows to 500veh/5min, its local efects remain at the positive the SHAP method can quantify the variables’ efects while avoiding theabovementioneddisadvantage. Temain efects highest level, indicating that high volume is likely to rapidly Feature value 8 Journal of Advanced Transportation High Volume (veh/5 min) Lighting Speed (mph) Population Violation Category Collision Severity Party Count Weather Truck Involved Weekend Tow Away Hit and Run Peak Collision Type Road Surface Road Condition Alcohol Involved Low mean (|SHAP value|) SHAP value Figure 4: Global importance of independent variables and a summary of local explanations for the distance gap. 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (a) (b) 0.3 0.4 70 70 0.3 0.2 60 60 0.2 0.1 50 50 0.1 40 40 0.0 0.0 30 –0.1 30 –0.1 –0.2 20 20 –0.2 –0.3 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (c) (d) Figure 5: SHAP local dependence plots of volume. (a) Main efects of volume on the time gap. (b) Main efects of volume on the distance gap. (c) Interaction efects between volume and speed on the time gap. (d) Interaction efects between volume and speed on the distance gap. increase the time gap and distance gap. Tis fnding is Figures 6(a) and 6(b) show the main local efects of speed consistent with existing works [15]. Te reason might be that on the time and distance gaps, respectively. Te trends in the high volume makes the trafc situation entirely stressful, and two plots are similar in general (down then up), but the drivers have developed a cautious driving style under this infection points correspond to diferent speed values. In circumstance. When a PC occurs, drivers in the immediate Figure 6(a), as speed ranges between 0 and 50mph, its local vicinity upstream will not feel large shock, so SC does not efects on the time gap decline to negative from positive as it occur as quickly. Moreover, high volume can prolong queue increases. When speed falls 50–75mph, its local efects show length and thus increase the distance gap. When volume is a steep upward trend. As for Figure 6(b), when speed in- around 500veh/5min, its corresponding speed falls in an creased from 0 to 30mph, its local efects decline from 0.05 extensive range of 24–76mph. to −0.22, indicating that this value range of speed inhibits the SHAP value SHAP value 0.00 0.02 0.04 0.06 0.08 0.10 Speed (mph) 0.12 SHAP value SHAP value 0.14 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 Feature value Speed (mph) Journal of Advanced Transportation 9 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (a) (b) 0.3 0.3 800 800 0.2 0.2 600 600 0.1 0.1 0.0 400 400 0.0 –0.1 –0.1 200 200 –0.2 –0.2 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (c) (d) Figure 6: SHAP local dependence plots of speed. (a) Main efects of speed on the time gap. (b) Main efects of speed on the distance gap. (c) Interaction efects between speed and volume on the time gap. (d) Interaction efects between speed and volume on the distance gap. distance gap. As the speed continues to increase, the local In other words, a bright environment has a larger volume efects grow to be positive. Moreover, we found that when and positive local efects, while a dark condition has a rel- the speed ranges between 60 and 75mph (the average atively smaller volume and negative local efects. It makes volume for this speed range is 281veh/5min), the corre- sense that the vehicle trips are more during the day than at sponding efects for both time and distance gaps are stable night. Likewise, it is reasonable to consider that high volume around value 0, as observed from Figures 6(c) and 6(d). Such likely prolongs queue length and therefore increases the a fnding demonstrates that this trafc state has minor distance gap. promotion/inhibition on both gaps. Figures 8(a) and 8(b) represent the main efects of vi- Figures 7(a) and 7(b) demonstrate the main efects of olation category on the two gaps. As observed, improper turns (i.e., violation category �4) have the maximum SHAP lighting on the time and distance gaps; the two plots reveal an approximate concave trend. As shown in Figure 7(a), the value. Specifcally, its local efects on the time and instance local efects of daylight and dawn (i.e., lighting �0 and 1) on gaps roughly fall between 0–0.10 and 0–0.15, respectively; the time gap fall in the range of 0–0.20, while streetlights and such ranges indicate that this violation category promotes no streetlights (i.e., lighting �2 and 3) have the most neg- the time and distance gaps to a varying degree. Te reason ative efects. Such variations in local efects indicate that might be that the crashes in which the violation category is a darker environment will accelerate the occurrence of SCs. improper turns probably block turn lanes (usually on a one- Probably because the driver’s sight distance in dark situa- way road), thus afecting the vehicles behind and causing tions depends on the space illuminated by the streetlights a long queue length. Followed by another violation category and headlights, leading to a lack of timely and clear per- of unsafe lane changes (i.e., violation category �3), which ception of the current road condition, resulting in in- shows positive correlations with both gaps. Likewise, crashes caused by unsafelane changes likely block multiple lanesand sufcient avoidance of PCs and thus reducing the time gap. Figures 7(c) and 7(d) display the interaction efects between involve several vehicles, thus decreasing the road capacity lighting and volume. As observed, all points are approxi- signifcantly and extending the queue length. Besides, this mately divided by their color into upper-right and lower-left type of crash is more visible. Tat means drivers behind can parts, with most of the pale and dark blue points (i.e., catch the crash information at a distance and drive more representing daylight and dawn) being above the horizontal carefully, increasing the time and distance gaps. By contrast, axis where the local efect is −0.1, the red and orange points the other four violation categories have more negligible local (i.e., denoting streetlights and no streetlights) being below it. efects. As shown in Figures 8(c) and 8(d), it is the SHAP value SHAP value Volume (veh/5 min) SHAP value SHAP value Volume (veh/5 min) 10 Journal of Advanced Transportation 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 12 3 03 12 Lighting Lighting (a) (b) 3 3 0.3 0.4 0.3 0.2 0.2 2 2 0.1 0.1 0.0 0.0 1 1 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (c) (d) Figure 7: SHAP local dependence plots of lighting. (a) Main efects of lighting on the time gap. (b) Main efects of lighting on the distance gap. (c) Interaction efects between volume and lighting on the time gap. (d) Interaction efects between volume and lighting on the distance gap. fatal crashes) occur in the speed range of 60–70mph, sug- interaction efects between violation category and speed. We fnd a strong association between crashes involving alcohol gesting that serious crashes frequently occur at high speeds. (i.e., violation category �0) and high speed, because points Te main and interaction efects of other variables are are red on the frst vertical column. Another interesting presented in Figures 10 and 11. As shown, plots of pop- fnding is that the red points in the ffth vertical column (i.e., ulation reveal a broadly upward trend, varying from negative violation category �4) are concentrated at the bottom, il- to positive. A dense population (i.e., Population �3 and 4, lustrating that those crashes, which occurred due to unsafe indicating the population is more than 250000) promotes lane changes at high speeds, reduce the time and the time gap and distance gap. One possible explanation is distance gaps. that car ownership and travel trips may be relatively high in Figures 9(a) and 9(b) represent the main efects of these densely populated areas, leading to long queuing times collision severity. Te fatal crashes, severe injury crashes, and length. Te local efects of most weekdays (i.e., week- end �0) and peak periods (i.e., peak �1) on the distance gap and light injury crashes (i.e., collision severity �0, 1, and 2) have a promotion on the time and distance gaps, while only are greater than the value 0. It makes sense that weekdays complaining crashes (i.e., collision severity �3) mainly have and peak periods have many commuter trips, resulting in inhibition on the two gaps. One possible reason is that high volume on the road. Te plot for collision type shows serious crashes attract more attention, such as rapid rescue a v trend on the time gap while a downward trend on the and intervention by trafc police, so that SCs do not occur at distance gap. Te efects of clear days are around the value 0, a close time and distance. Figures 9(c) and 9(d) show the while the efects of most cloudy days are less than the value 0. interaction efects between collision severity and speed. As Such a comparison indicates that cloudy days will inhibit observed, most of the blue points (represent the sample of both gaps, i.e., SCs will occur sooner and closer on cloudy SHAP value SHAP value Lighting SHAP value SHAP value Lighting Journal of Advanced Transportation 11 0.10 0.150 0.08 0.125 0.06 0.100 0.04 0.075 0.02 0.050 0.00 0.025 –0.02 0.000 –0.04 –0.025 –0.06 –0.050 Violation Category Violation Category (a) (b) 0.150 0.10 70 70 0.125 0.08 60 60 0.100 0.06 0.075 0.04 50 50 0.050 0.02 0.025 0.00 0.000 –0.02 –0.04 20 –0.025 –0.050 –0.06 0 4 5 0 12345 Violation Category Violation Category (c) (d) Figure 8: SHAP local dependence plots of violation category. (a) Main efects of the violation category on the time gap. (b) Main efects of violation category on the distance gap. (c) Interaction efects between violation category and speed on the time gap. (d) Interaction efects between violation category and speed on the distance gap. 0.150 0.125 0.125 0.100 0.100 0.075 0.075 0.050 0.050 0.025 0.025 0.000 0.000 –0.025 –0.025 –0.050 –0.050 0123 0123 Collision Severity Collision Severity (a) (b) Figure 9: Continued. SHAP value SHAP value SHAP value Speed (mph) SHAP value SHAP value SHAP value Speed (mph) 12 Journal of Advanced Transportation 3 3 0.3 0.3 0.2 0.2 2 2 0.1 0.1 0.0 0.0 1 1 –0.1 –0.1 –0.2 –0.2 0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (c) (d) Figure 9: SHAP local dependence plots of collision severity. (a) Main efects of collision severity on the time gap. (b) Main efects of collision severity on the distance gap. (c) Interaction efects between collision severity and speed on the time gap. (d) Interaction efects between collision severity and speed on the distance gap. 0.10 0.10 0.05 0.05 0.00 0.00 –0.05 –0.05 –0.10 –0.10 Population Population (a) (b) 0.06 0.04 0.02 0.04 0.00 0.02 –0.02 –0.04 0.00 –0.06 –0.02 –0.08 01 0 1 Weekend Weekend (c) (d) Figure 10: Continued. SHAP value SHAP value SHAP value Collision Severity SHAP value SHAP value SHAP value Collision Severity Journal of Advanced Transportation 13 0.05 0.04 0.04 0.03 0.02 0.02 0.00 0.01 0.00 –0.02 –0.01 –0.02 –0.04 0 1 Peak Peak (e) (f) 0.12 0.06 0.10 0.04 0.08 0.02 0.06 0.04 0.00 0.02 –0.02 0.00 –0.04 –0.02 –0.06 0123456 0123456 Collision Type Collision Type (g) (h) 0.05 0.05 0.00 0.00 –0.05 –0.05 –0.10 –0.10 –0.15 –0.15 –0.20 012 012 Weather Weather (i) (j) 0.03 0.01 0.02 0.00 0.01 –0.01 0.00 –0.02 –0.01 –0.03 –0.02 –0.04 –0.03 –0.05 –0.04 –0.06 0 1 0 1 Alcohol Involved Alcohol Involved (k) (l) Figure 10: Continued. SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value 14 Journal of Advanced Transportation 0.01 0.01 0.00 0.00 –0.01 –0.01 –0.02 –0.02 –0.03 –0.03 –0.04 –0.04 –0.05 –0.05 –0.06 –0.06 0 1 0 1 Road Surface Road Surface (m) (n) 0.005 0.06 0.000 –0.005 0.04 –0.010 0.02 –0.015 –0.020 0.00 –0.025 –0.030 –0.02 0 1 Road Condition Road Condition (o) (p) Figure 10: SHAP main efects of variables on the time gap and the distance gap. (a) Population on the time gap. (b) Population on the distance gap. (c) Weekend on the time gap. (d) Weekend on the distance gap. (e) Peak on the time gap. (f) Peak on the distance gap. (g) Collision type on the time gap. (h) Collision type on the distance gap. (i) Weather on the time gap. (j) Weather on the distance gap. (k) Alcohol involved on the time gap. (l) Alcohol involved on the distance gap. (m) Road surface on the time gap. (n) Road surface on the distance gap. (o) Road condition on the time gap. (p) Road condition on the distance gap. 1.0 1 0.3 0.3 0.8 0.2 0.2 0.6 0.1 0.1 0.4 0.0 0.0 –0.1 –0.1 0.2 –0.2 –0.2 0.0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (a) (b) Figure 11: Continued. SHAP value SHAP value SHAP value Weekend SHAP value SHAP value SHAP value Weekend Journal of Advanced Transportation 15 0.12 0.06 0.10 800 800 0.04 0.08 600 0.02 600 0.06 0.04 0.00 400 400 0.02 –0.02 0.00 200 200 –0.04 –0.02 –0.06 01234 5 6 01234 5 6 Collision Type Collision Type (c) (d) 6 6 0.3 0.3 5 5 0.2 0.2 4 4 0.1 0.1 3 3 0.0 0.0 2 2 –0.1 –0.1 1 1 –0.2 –0.2 0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (e) (f) 2 2 0.3 0.4 0.3 0.2 0.2 0.1 0.1 1 1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (g) (h) Figure 11: Continued. SHAP value SHAP value SHAP value Weather Collision Type Volume (veh/5 min) SHAP value SHAP value SHAP value Weather Collision Type Volume (veh/5 min) 16 Journal of Advanced Transportation 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (i) (j) 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (k) (l) 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (m) (n) Figure 11: SHAP interaction efect plots among variables on the time gap and the distance gap. (a) Speed and weekend on the time gap. (b) Speed andweekend onthe distancegap.(c) Collisiontype and volumeon the timegap.(d)Collision type andvolume onthe distancegap. (e) Speed and collision type on the time gap. (f) Speed and collision type on the distance gap. (g) Volume and weather on the time gap. (h) Volume and weather on the distance gap. (i) Volume and alcohol involved on the time gap. (j) Volume and alcohol involved on the distance gap. (k) Volume and road surface on the time gap. (l) Volume and road surface on the distance gap. (m) Volume and road condition on the time gap. (n) Volume and road condition on the distance gap. SHAP value SHAP value SHAP value Road Condition Road Surface Alcohol Involved SHAP value SHAP value SHAP value Road Condition Road Surface Alcohol Involved Journal of Advanced Transportation 17 intensity of the lighting. Moreover, crashes involving the days. Drinking (i.e., alcohol involved �1) mostly has neg- ative local efects, meaning that drinking will reduce the time violation categories of improper turns or unsafe lane changes possibly cause long time and large distance gaps. and distance gap. Tis is consistent with reality. Te wet surface (i.e., road surface �1) inhibits both gaps, which is Te contributions of this study can be summarized in consistent with existing knowledge [16]. It makes sense that the following three aspects: (1) proposing a two-stage SC a wet road surface harms the vehicle’s stability, such as identifcation method, which combined the static and a brake failure, thus accelerating the occurrence of SCs. dynamic approaches. And the identifcation results on the test data are consistent with existing works, providing a reliable basis for SC analysis. (2) Applying random forest 6. Conclusions to simultaneously predict the time and distance gaps, Tis study aimed at predicting the time and distance gaps which facilitated understanding the relationship between the dependent and independent variables.17 independent between SCs and PCs on highways and to analyze how the infuencing factors contribute to the gaps comprehensively. variables selected from temporal, primary crash, roadway, and trafc characteristics and two dependent variables, First, a data-driven identifcation method combining the fxed spatiotemporal thresholds-based method and the namely time gap and distance gap, were used as inputs to speed contour map-based method was developed to identify train and test the random forest model. Te results SCs. A total of 368 SCs were sought out from the total achieved better performance compared with other number of 24643 crashes. Ten, the RF model was applied to models. (3) Using a brand-new interpretation technique predict the two gaps. Te data samples were split into SHAP to explain the RF model from global and local ways. training and testing sets at a ratio of 7:3. Te results showed We made several signifcant fndings which will be def- that the RF model performed better than KNN and MPR. nitely helpful for trafc decision makers to formulate Additionally, the SHAP method was conducted to explain strategies. Tis research also raises issues in need of further ex- the outputs of the RF model. Based on this local in- terpretation method, we revealed variables’ global impor- plorations in the future. First, 368 crashes were used in the model training. Although we applied ML models that are tance and main and interaction efects on the time and distance gaps. advantageous for handling sparse data, small sample sizes We found that trafc volume and speed are the im- may reduce the performance of the models. More data are portant contributors to the time and distance gaps; moni- expected to be required to improve the model performance. toring trafc conditions helps implement timely and Second, 17 variables were used, and future work will cover efective management to prevent SCs. Several temporal more types of factors. Tis study focused on temporal characteristics, such as lighting and population, contribute characteristics, primary crash factors, roadway conditions, and real-time trafc parameters. Other factors, such as more to both gaps than primary crash features and road factors. Compared with road factors, the primary crash shoulder width and truck proportion which have shown correlations with the time gap and distance gap of SCs, will characteristics of violation category, party count, and col- lision severity demonstrate more signifcant efects. With be considered in future research. Te SC factors are also worthy of being discussed. In the future, it is a potential idea these fndings about factor priorities, trafc managers and policymakers can develop prevention plans and allocate to combine the PC and SC characteristics to explore the time resources more efciently. and distance gaps. Te local dependence plots quantify the efects of var- iables. Plots for the continuous variables, i.e., volume and Data Availability speed, reveal developing trends and several infection points. For example, the local efects of volume increase mono- Te data used to support the fndings of this study are tonically from −0.3 to 0.4 as the volume grows. Such var- available from the corresponding author upon reasonable iation indicates that low volume sharply inhibits the time request. and distance gaps, while high volume boosts them signif- cantly. Additionally, the local efects on the distance gap are Conflicts of Interest around value 0 when volume falls between 300 and 400veh/ 5min, suggesting that the trafc state in this volume afects Te authors declare that there are no conficts of interest. the gap inconsiderably. Te plot for the main efects of speed on the distance gap shows an obvious infection point. Such Acknowledgments critical information above is considerable for trafc safety managers. As for plots about the discrete variables, dem- Tis research was funded in part by the National Natural onstrate the local efects and corresponding characteristics Science Foundation of China (grant no. 52172310), Hu- of diferent categories of variables. Take lighting as an ex- manities and Social Sciences Foundation of the Ministry of ample: the efects of daylight and dawn are positive, while Education (grant no. 21YJCZH147), Innovation-Driven those of streetlights and no streetlights are mostly having Project of Central South University (grant no. negative efects. Tat is to say, a darker environment 2020CX041), and the Fundamental Research Funds for the probably accelerates the occurrence of SCs. Where the Central Universities of Central South University (grant no. economic condition allows, it is advantageous to increase the 2022ZZTS0717). 18 Journal of Advanced Transportation [17] R. Raub, “Occurrence of secondary crashes on urban arterial References roadways,” Transportation Research Record: Journal of the [1] World health organization, “Road trafc injuries,” 2023, Transportation Research Board, vol. 1581, no. 1, pp. 53–58, https://www.who.int/news-room/fact-sheets/detail/road- trafc-injuries. [18] B. Yang, Y. Guo, and C. Xu, “Analysis of freeway secondary [2] H. Yang, Z. Wang, K. Xie, K. Ozbay, and M. Imprialou, crashes with a two-step method by loop detector data,” IEEE “Methodological evolution and frontiers of identifying, Access, vol. 7, pp. 22884–22890, 2019. modeling and preventing secondary crashes on highways,” [19] J. E. Moore, G. Giuliano, and S. Cho, “Secondary accident Accident Analysis and Prevention, vol. 117, pp. 40–54, 2018. rates on Los Angeles freeways,” Journal of Transportation [3] S. A. Tedesco, V. Alexiadis, W. R. Loudon, R. Margiotta, and Engineering, vol. 130, no. 3, pp. 280–285, 2004. D. Skinner, “Development of a 40 model to assess the safety [20] W. Hirunyanitiwattana and S. P. Mattingly, “Identifying Secondary Crash Characteristics for California Highway impacts of implementing IVHS user services, moving toward System,” in Proceedings of the Transportation Research Board deployment,” in Proceedings of the IVHS America Annual Meeting, Washington, DC, USA, January 2006. Meeting, pp. 343–352, Atlanta GA, USA, March 1994. [21] L. Kopitch and J.-D. M. Saphores, “Assessing efectiveness of [4] N. Owens, A. Armstrong, P. Sullivan et al., Trafc Incident changeable message signs on secondary crashes,” in Pro- Management Handbook, Federal Highway Administration, ceedings of the Transportation Research Board 90th Annual Washington, DC, USA, 2010. Meeting, Washington, DC, USA, January 2011. [5] J. G. Pigman, J. R. Walton, and E. R. Green, “Identifcation of [22] H. Yang, Z. Wang, K. Xie, and D. Dai, “Use of ubiquitous secondary crashes and recommended countermeasures,” probe vehicle data for identifying secondary crashes,” Crash Severity, Transport Research International Documen- Transportation Research Part C: Emerging Technologies, tation, Washington, DC, USA, 2011. vol. 82, pp. 138–160, 2017. [6] M. Jalayer, F. Baratian-Ghorghi, and H. Zhou, “Identifying [23] E. I. Vlahogianni, M. G. Karlaftis, and F. P. Orfanou, and characterizing secondary crashes on the Alabama state “Modeling the efects of weather and trafc on the risk of highway systems,” Advances in Transportation Studies, vol. 37, secondary incidents,” Journal of Intelligent Transportation pp. 129–140, 2015. Systems, vol. 16, no. 3, pp. 109–117, 2012. [7] Y. Tian, H. Chen, and D. Truong, “A case study to identify [24] M.-I. M. Imprialou, F. P. Orfanou, E. I. Vlahogianni, and secondary crashes on interstate highways in Florida by using M. G. Karlaftis, “Methods for defning spatiotemporal in- geographic information systems (gis),” Advances in Trans- fuence areas and secondary incident detection in freeways,” portation Studies, vol. 2, pp. 103–112, 2016. Journal of Transportation Engineering, vol. 140, no. 1, [8] H. Yang, Z. Wang, and K. Xie, “Impact of connected vehicles pp. 70–80, 2014. on mitigating secondary crash risk,” International Journal of [25] W. Junhua, L.Boya, Z. Lanfang, and D.R. Ragland, “Modeling Transportation Science and Technology, vol. 6, no. 3, secondary accidents identifed by trafc shock waves,” Acci- pp. 196–207, 2017a. dent Analysis and Prevention, vol. 87, pp. 141–147, 2016. [9] C. Zhan, L. Shen, M. A. Hadi, and A. Gan, “Understanding the [26] A. A. Sarker, R. Paleti, S. Mishra, M. M. Golias, and characteristics of secondary crashes on freeways,” in Pro- P. B. Freeze, “Prediction of secondary crash frequency on ceedings of the Transportation Research Board 87th Annual highway networks,” Accident Analysis and Prevention, vol. 98, Meeting, Washington, DC, USA, January 2008. pp. 108–117, 2017. [10] H. Yang, K. Ozbay, and K. Xie, “Assessing the risk of sec- [27] N. J. Goodall, “Probability of secondary crash occurrence on ondary crashes on highways,” Journal of Safety Research, freeways with the use of private-sector speed data,” Trans- vol. 49, no. 143, pp. 143.e1–149, 2014. portation Research Record: Journal of the Transportation Re- [11] C. Xu, P. Liu, B. Yang, and W. Wang, “Real-time estimation of search Board, vol. 2635, no. 1, pp. 11–18, 2017. secondary crash likelihood on freeways using high-resolution [28] H. Park, S. Gao, and A. Haghani, “Sequential interpretation loop detector data,” Transportation Research Part C: Emerging and prediction of secondary incident probability in real time,” Technologies, vol. 71, pp. 406–418, 2016. in Proceedings of the Transportation Research Board 96th [12] H. Park and A. Haghani, “Real-time prediction of secondary Annual Meeting, Washington, DC, USA, January 2017. incident occurrences using vehicle probe data,” Trans- [29] A. E. Kitali, P. Alluri, T. Sando, H. Haule, E. Kidando, and portation Research Part C: Emerging Technologies, vol. 70, R. Lentz, “Likelihood estimation of secondary crashes using pp. 69–85, 2016. Bayesian complementary log-log model,” Accident Analysis [13] P. Li and M. Abdel-Aty, “A hybrid machine learning model and Prevention, vol. 119, pp. 58–67, 2018. for predicting Real-Time secondary crash likelihood,” Acci- [30] J. Tang, F. Liu, W. Zhang, R. Ke, and Y. Zou, “Lane-changes dent Analysis and Prevention, vol. 165, Article ID 106504, prediction based on adaptive fuzzy neural network,” Expert Systems with Applications, vol. 91, pp. 452–463, 2018. [14] H. B. Zhang and A. Khattak, “Spatiotemporal patterns of [31] X. M. Chen, S. Zhang, and L. Li, “Multi-model ensemble for primary and secondary incidents on urban freeways,” short-term trafc fow prediction under normal and abnormal Transportation Research Record: Journal of the Transportation conditions,” IET Intelligent Transport Systems, vol. 13, no. 2, Research Board, vol. 2229, no. 1, pp. 19–27, 2011. pp. 260–268, 2018. [15] D. Chimba and B. Kutela, “Scanning secondary derived [32] A. Jamal, M. Zahid, M. Tauhidur Rahman et al., “Injury se- crashes from disabled and abandoned vehicle incidents on verity prediction of trafc crashes with ensemble machine uninterrupted fow highways,” Journal of Safety Research, learning techniques: a comparative study,” International vol. 50, no. 5, pp. 109–116, 2014. Journal of Injury Control and Safety Promotion, vol. 28, no. 4, [16] J.Wang, B. Liu, T. Fu, S.Liu, and J. Stipancic, “Modelingwhen pp. 408–427, 2021. and where a secondary accident occurs,” Accident Analysis ´ ´ [33] G. Asencio-Cortes, E. Florido, A. Troncoso, and F. Martınez- and Prevention, vol. 130, pp. 160–166, 2019. Alvarez, “A novel methodology to predict urban trafc Journal of Advanced Transportation 19 congestion with ensemble learning,” Soft Computing, vol. 20, [51] L. Xiao, S. Lo, J. Liu, J. Zhou, and Q. Li, “Nonlinear and no. 11, pp. 4205–4216, 2016. synergistic efects of TOD on urban vibrancy: applying local [34] SWITRS, “Statewide Integrated Trafc Records System explanations for gradient boosting decision tree,” Sustainable (SWITRS),” 2022, https://iswitrs.chp.ca.gov/Reports/jsp/ Cities and Society, vol. 72, Article ID 103063, 2021. [52] S. M. Lundberg and S.-I. Lee, “A unifed approach to inter- index.jsp. preting model predictions,” Advances in Neural Information [35] Pems, “Caltrans PeMS,” 2022, https://pems.dot.ca.gov/. Processing Systems, vol. 30, pp. 4765–4774, 2017. [36] H. T. Yang, G. C. Zhai, L. C. Yang, and K. Xie, “How does the [53] S. M. Lundberg, G. Erion, H. Chen et al., “From local ex- suspension of ride-sourcing afect the transportation system planations to global understanding with explainable AI for and environment?” Transportation Research Part D: Transport trees,” Nature Machine Intelligence, vol. 2, no. 1, pp. 56–67, and Environment, vol. 102, Article ID 103131, 2022. [37] H. T. Yang, J. H. Huo, R. B. Pan, K. Xie, W. J. Zhang, and X. J. Luo, “Exploring built environment factors that infuence the market share of ridesourcing service,” Applied Geography, vol. 142, Article ID 102699, 2022. [38] A. A. Sarker, A. Naimi, S. Mishra, M. M. Golias, and P. B. Freeze, “Development of a secondary crash identifcation algorithm and occurrence pattern determination in large scale multi-facility transportation network,” Transportation Re- search Part C: Emerging Technologies, vol. 60, pp. 142–160, [39] S. Mishra, M. Golias, A. Sarker, and A. Naimi, Efect of Primary and Secondary Crashes: Identifcation, Visualization, and Prediction Research Report No. CFIRE 09-05, University of Wisconsin-Madison, Madison, WI, USA, 2016. [40] J. Wang, W. Xie, B. Liu, S. Fang, and D. R. Ragland, “Identifcation of freeway secondary accidents with trafc shock wave detected by loop detectors,” Safety Science, vol. 87, pp. 195–201, 2016. [41] D. Yao, J. Yang, and X. Zhan, “A novel method for disease prediction: hybrid of random forest and multivariate adaptive regression splines,” Journal of Computers, vol. 8, no. 1, pp. 73-74, 2013. [42] K. Miller, F. Huettmann, B. Norcross, and M. Lorenz, “Multivariate random forest models of estuarine-associated fsh and invertebrate communities,” Marine Ecology Progress Series, vol. 500, pp. 159–174, 2014. [43] H. Hong, H. R. Pourghasemi, and Z. S. Pourtaghi, “Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models,” Geo- morphology, vol. 259, pp. 105–118, 2016. [44] Y. Li, C. Zou, M. Berecibar et al., “Random forest regression for online capacity estimation of lithium-ion batteries,” Ap- plied Energy, vol. 232, pp. 197–210, 2018. [45] J. Tang, J. Liang, C. Han, Z. Li, and H. Huang, “Crash injury severity analysis using a two-layer stacking framework,” Accident Analysis and Prevention, vol.122, pp. 226–238, 2019. [46] G. Xu, M. Liu, Z. Jiang, D. Sofker, ¨ and W. Shen, “Bearing Fault diagnosis method based on deep convolutional neural network and random forest ensemble learning,” Sensors, vol. 19, no. 5, p. 1088, 2019. [47] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001. [48] G. De’ath, “Multivariate regression trees: a new technique for modeling species-environment relationships,” Ecology, vol. 83, no. 4, pp. 1105–1117, 2002. [49] M. Segal and Y. Xiao, “Multivariate random forests,” WIREs Data Mining and Knowledge Discovery, vol.1, pp. 80–87, 2011. [50] X. Wen, Y. Xie, L. Wu, and L. Jiang, “Quantifying and comparing the efects of key risk factors on various types of roadway segment crashes with LightGBM and SHAP,” Ac- cident Analysis and Prevention, vol. 159, Article ID 106261, http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Journal of Advanced Transportation Hindawi Publishing Corporation

Time and Distance Gaps of Primary-Secondary Crashes Prediction and Analysis Using Random Forests and SHAP Model

Loading next page...
 
/lp/hindawi-publishing-corporation/time-and-distance-gaps-of-primary-secondary-crashes-prediction-and-NU6KjcyudH

References (51)

Publisher
Hindawi Publishing Corporation
ISSN
0197-6729
eISSN
2042-3195
DOI
10.1155/2023/7833555
Publisher site
See Article on Publisher Site

Abstract

Hindawi Journal of Advanced Transportation Volume 2023, Article ID 7833555, 19 pages https://doi.org/10.1155/2023/7833555 Research Article Time and Distance Gaps of Primary-Secondary Crashes Prediction and Analysis Using Random Forests and SHAP Model 1 1 1,2 1 Xinyuan Liu , Jinjun Tang , Fan Gao , and Xizhi Ding Smart Transportation Key Laboratory of Hunan Province, School of Trafc and Transportation Engineering, Central South University, Changsha 410075, China Department of Geography and Resource Management, Te Chinese University of Hong Kong, Shatin, N.T., Hong Kong Correspondence should be addressed to Jinjun Tang; jinjuntang@csu.edu.cn Received 18 September 2022; Revised 12 December 2022; Accepted 18 March 2023; Published 14 April 2023 Academic Editor: Wen Liu Copyright © 2023 Xinyuan Liu et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Secondary crashes (SCs) are typically defned as the crash that occurs within the spatiotemporal boundaries of the impact area of the primary crashes (PCs), which will intensify trafc congestion and induce a series of road safety issues. Predicting and analyzing the time and distance gaps between the SCs and PCs will help to prevent the occurrence of SCs. In this paper, a combined data- driven method of static and dynamic approaches is applied to identify SCs. Ten, the random forests (RF) method is implemented to predict the two gaps using temporal, primary crash, roadway, and real-time trafc characteristics data collected from 2016 to 2019 at California interstate freeways. Subsequently, the SHapley Additive explanation (SHAP) approach is employed to interpret the RF outputs. Te results show that the trafc volume, speed, lighting, and population are considered the most signifcant factors in both gaps. Furthermore, the main and interaction efects of factors are also quantifed. High volume possibly promotes the time and distance gaps, while low volume inhibits them. And volume afects the distance gap inconsiderably when it falls between 300 and 400veh/5min. Trafc conditions with high speed and low volume are strongly associated with short-time and short-distance gaps. Darker surroundings probably accelerate the occurrence of SCs. Moreover, crashes involving the violation categories of improper turns or unsafe lane changes likely result in long time and distance gaps. Tese results have important implications for proposing trafc management and improving road safety. United States. In this context, SC prevention has become 1. Introduction a major consideration in the trafc safety feld. Road trafc crashes pose a threat to normal trafc operations In the past decades, a large body of literature has been and safety and can cause property damage or even serious devoted to investigating the identifcation of SCs and injuries. According to the world health organization [1], modeling the risk of SC occurrence [5–13]. Various statis- approximately 1.3 million people die each year as a result of tical and machine learning (ML) methods were applied to road trafc crashes. Between 20 and 50 million more people explore these two aspects of SCs [9–12]. However, the time sufer nonfatal injuries, with many incurring a disability. gap (i.e., the time diference) and distance gap (i.e., the Furthermore, road trafc crashes cost most countries 3% of spatial separation) between an SC and the corresponding PC their gross domestic product [1]. SCs, happening in the have received less attention, which might hinder a better spatiotemporal impact area of primary crashes (PCs), understanding of the possible time and location of SCs. commonly result in an additional impact on trafc and extra Among the few methods applied to study these two gaps, personal injury [2, 3]. According to [4], SCs can account for statistical approaches subjected themselves to the possibility 20% of all crashes and 18% of all fatalities on freeways in the of predicting infnitely large gaps [14, 15], while ML methods 2 Journal of Advanced Transportation failed to provide satisfactory prediction performance on the historical average speed to brighten the impact area. Like- distance gap [16]. Moreover, the black-box models need wise [11, 13], we applied this method to identify SCs and considered recurrent congestion. more explanation to discuss the efects of contributing factors in detail [16]. Terefore, some promising methods In summary, static methods are easy to implement and and data experiments are required. quickly obtain identifcation results, while dynamic methods To better capture the characteristics of SCs, we frst achieve better performance but consume a lot of compu- developed a hybrid method (i.e., static spatiotemporal tational time. Combining these two methods for SC iden- threshold-based and speed contour map-based methods) to tifcation can improve efciency and accuracy [16, 25]. Tis identify SCs and obtain the time and distance gaps. Sub- paper proposes a two-stage strategy to identify SCs by in- sequently, random forests (RFs) were used to predict the corporating the fxed spatiotemporal threshold-based and time and distance gaps, which have high prediction per- speed contour map-based methods. formance and diversity. And an interpretation technique, namely the SHapley Additive explanation (SHAP) approach, 2.2. Secondary Crash Risk Modeling and Predicting. was applied to examine the model outputs and estimate the global and local efects of the infuencing factors. Un- Several statistical and ML models have been applied to derstanding time and distance gaps and their infuencing explore the relationship between SC occurrences and con- factors can provide management strategies for trans- tributing factors [9–12]. For example, [10] proposed a logit portation agencies and improve trafc operations and road model to predict SC likelihood, and their results revealed safety. that rear-end crashes could increase the SC likelihood [11] developed a random efects logit model to link the proba- bility of SCs with real-time trafc volume conditions, pri- 2. Literature Review mary crash characteristics, environmental conditions, and geometric characteristics. Similarly, [29] used the Bayesian 2.1. Secondary Crash Identifcation. Overall, two types of methods, static and dynamic methods, were widely used to complementary log-log model to predict the likelihood of SCs and examine their relationship with several variables. identify SCs. Static methods identify SCs by setting the fxed spatiotemporal thresholds, which means crashes are iden- However, previous studies focused less on the time and distance gaps between the SCs and PCs. Several studies have tifed as SCs if they fall within the spatiotemporal thresholds of another crash [17]. First introduced this method and made attempts using regression approaches. For example, defned the thresholds equivalent to one mile upstream of [14] selected the ordinary least-squares (OLS) regression to a PC and 15minutes after clearance time. Following this model the two gaps separately. Teir results showed that study, further research associated with static methods has time and distance gaps were closely associated with collision been explored [5–7, 18]. For example, some studies pro- type and the duration of the primary crash. Likewise, [15] posed a spatial threshold of 2 miles and time thresholds of 2, applied OLS regression to evaluate the relationship between the time and distance gaps concerning individual crash 1, and 2hours, respectively, to identify California secondary crashes [19–21]. SCs can be selected quickly and efectively characteristics. Tey found that the number of lanes, total vehicles involved in the crash, morning time, and AADT from massive crashes according to spatiotemporal thresh- olds [2, 16]. However, static methods have the problem of were the most signifcant factors afecting time and distance subjective judgment: overestimation or underestimation of gaps. Although most independent variables had a high the thresholds [2, 22]. As an improvement [7], we in- signifcance, traditionalstatistical models usually made more troduced three sets of spatiotemporal thresholds to identify prior assumptions for input variables, and they were unable SCs on Florida interstates. Te spatial thresholds for all three to predict the possibility of massive gaps. Moreover, [14, 15] sets were 2 miles, and the time thresholds were 2h, built an independent regression model for the time and 15minutes, and 30minutes after the PCs’ clearance time. distance gaps, ignoring the potential correlation of the two Teir results confrmed that the identifcation ratio of SCs gaps because they happen at the same time. Terefore, it is necessary to consider an alternative model to investigate varied for diferent sets. With the support of various sensor technologies, dy- gaps simultaneously. By contrast, ML methods have become increasingly namic methods are becoming increasingly popular and used because of an improvement in the misclassifcation of SCs attractive and have gained more attention due to their high [22]. Tere are three main dynamic methods: (a) queuing prediction power and low limitation on data [30]. Multiple theory-based method [23, 24]; and (b) shockwave-based ML methods have been employed in trafc safety studies approaches [25, 26]; (c) speed contour map-based method [8, 13, 16, 29], such as neural network models, genetic al- [11, 13, 18]. In practical application, due to the data quality gorithms, random forests, XGBoost, etc. In a small number and quantity requirements of methods (a) and (b), the of studies on the time and distance gaps [16], the authors models are often simplifed and set assumptions, failing to utilized a linear regression model and two ML algorithms, refect the actual condition in the real world. Nevertheless, including a back-propagation neural network (BPNN) and the least-squares support vector machine (LSSVM), to build the speed contour map-based method has performed well without any simplifcation or assumptions since it can ac- three prediction models. Te results indicated that the BPNN and LSSVM models outperformed the linear re- curately capture the impact area of PCs [13, 27, 28]. For example, [18] compared the crash state speed with the gression model, but these two ML models also failed to Journal of Advanced Transportation 3 provide adequate performance on distance gap prediction. number. A two-step matching strategy is devised to obtain Regarding ML models, many other promising approaches, trafc volume and average speed for each crash. Te frst step matches the nearest detector upstream for every crash based such as ensemble algorithms, combine several base learners to enhance the prediction performance [31–33]. on the latitude and longitude of the crashes and the loop Besides, relatively fewer studies have focused on SC detectors. Te second step is extracting the volume and prevention. As [2] summarized, available data and high costs speed for 5minutes before the crashes. have limited relevant investigations, so continued endeavors Referring to the previous studies on SCs [14, 16], 17 are still needed. Te main objective of this study is to develop variables were selected from 4 dimensions. Specifcally, a reliable model to predict the time and distance gaps and temporal characteristics consist of 5 variables, namely, peak, analyze associated infuencing factors, which can help with weekend, weather, lighting, and population, which refect proactive prevention and improve safety. Several existing the environment’s state. Population density has a relation- research gaps and insufciencies were mitigated and sup- ship with vehicle trips [36, 37]. Primary crash factors include plemented in this study. 8 variables: collision severity, collision type, violation cat- egory, part count, etc. Tese variables demonstrate all the 3. Data Preparation information associated with a crash. Road condition and surface refect the roadway characteristics, including In this study, crash data were collected from the Statewide whether the pavement is a maintenance area or free from Integrated Trafc Records System (SWITRS), which records abnormal conditions or whether the pavement is dry/wet. detailed description of crash-related information, such as the Trafc volume (veh/5min) and speed (mile/h) report the unique case identifer, location (state route, postmile, lati- trafc characteristics. Detailed descriptions and statistical tude and longitude), collision year and time, collision se- information are expressed in Table 1. Additionally, the verity and type, lighting, weather, etc. A total of 24643 Pearson correlation coefcients (PCCs) were applied to crasheswere collectedfromfreeways I-10, I-5,US-101, I-210, examine the multicollinearity between the 17 variables. and I-110 in Los Angeles County of California over four Figure 1 demonstrates the computed results. As shown, all years, from June 2016 to December 2019 [34]. Trough the absolute values of PPC are less than 0.8, indicating a low a detailed examination, we removed the issues of redundant linear correlation between variables. attributes and missing values from the crash data. In order to combine real-time trafc data into the 4. Methodology analysis of crashes, volume, and speed were extracted from the caltrans performance measurement system [35]. In 4.1. SC Identifcation. Te identifcation of SCs is the basis PeMS, data were gathered from a set of loop detectors on the for conducting SC modeling and analysis. Te static spa- road and transmitted to the management center for storage. tiotemporal threshold-based estimation is the frst stage to And the confguration information of the detector was in- identify SCs roughly, and it can be defned in the following tegrated, including the location and unique identifcation equation: 1, if􏼂t ∈ t , t + t 􏼁􏼃 ∪ 􏼂S ∈ S , S + S 􏼁􏼃, B A A threshold B A A threshold SC � 􏼨 (1) 0, others, where (t , S ) denotes the location and occurrence time of where V and V denote the current and the reference A A (t,S) (t,S) the crash A, (t , S ) denotes the location and occurrence speed of one cell; V � 1 denotes that the cell is afected; B B (t,S) time of another crash B that needs to be examined, and V � 0 denotes that the cell is not afected. Te size of (t,S) (t , S ) denotes the defned time threshold and the impact area was determined by the reference speed V . threshold threshold (t,S) spatial threshold, and the value of 1 means that crash B is Te detailed procedures of the identifcation method are as identifed as a secondary crash corresponding to crash A and follows: 0 otherwise. (i) Apply the fxed spatiotemporal thresholds to Speed contour map-based method estimates the impact identify the candidate SCs. Referring to previous area of the PC based on the change in trafc speed, and a SC is studies on SC analysis in California [19–21], 2 miles identifed when it is discovered in this area. Te speed contour and 2hours were selected as the thresholds in this map comprises grid cells split by defned time intervals and the study. Te initial identifcation on 24,643 crashes milepost of sensor stations [2]. Te impact area can be has yielded 563 possible SCs. ascertained by checking the speed of each cell near the crash. In (ii) Extract the 5-min speed data to develop a speed general, it can be written as the following equation: contour map for a potential PC. More specifcally, 1, if V < V , (t,S) b (t,S) given the fxed spatiotemporal thresholds that have V � 􏼨 (2) (t,S) been determined, the time period for extracting 0, others, speed data is between 2hours before and 2hours 4 Journal of Advanced Transportation Table 1: Description of variables used in crash analysis. Variables Types Description Count Percent Mean Std Temporal characteristics 0 �no 261 70.9 Peak Binary — — 1 �yes (7:00–9:00 or 17:00–19:00) 107 29.1 0 �no 259 70.4 Weekend Binary — — 1 �yes 109 29.6 0 �clear 306 83.2 Weather Categorical 1 �cloudy 46 12.5 — — 2 �rainy 16 4.3 0 �daylight 217 59.0 1 �dusk-dawn 17 4.6 Lighting Categorical — — 2 �dark-streetlights 92 25.0 3 �dark-no streetlights 42 11.4 0 �incorporated (less than 25,000) 10 2.7 1 �incorporated (25,000–100,000) 93 25.3 Population Categorical 2 �incorporated (100,000–250,000) 67 18.2 — — 3 �incorporated (over 250,000) 188 51.1 4 �unincorporated (rural) 10 2.7 Primary crash characteristic 0 �fatal 3 0.8 Collision 1 �severe injury 98 26.6 Categorical — — severity 2 �other visible injury 15 4.1 3 �complaint of pain 252 68.5 0 �head on 2 0.5 1 �sideswipe 53 14.4 2 �rear-end 242 65.8 Collision type Categorical 3 �broadside 11 3.0 — — 4 �hit object 45 12.2 5 �overturned 12 3.3 6 �vehicle/pedestrian 3 0.8 0 �alcohol or drug 22 6.0 1 �unsafe speed 247 67.1 Violation 2 �following too closely 4 1.1 Categorical — — category 3 �unsafe lane change 44 12.0 4 �improper turning 38 10.3 5 �other 13 3.5 Counting total parties in the collision — — 0 �1 party 44 12.0 1 �2 parties 202 54.9 Party count Discrete 2 �3 parties 87 23.6 — — 3 �4 parties 29 7.9 4 �5 parties 4 1.1 5 �6 parties 2 0.5 0 �no 128 34.8 Tow away Binary — — 1 �yes 240 65.2 0 �no 343 93.2 Truck involved Binary — — 1 �yes 25 6.8 0 �felony 29 7.9 Hit and run Categorical 1 �misdemeanor 14 3.8 — — 2 �no hit and run 325 88.3 Alcohol 0 �no 333 90.5 Binary — — involved 1 �yes 35 9.5 Roadway characteristic 0 �construction or repair zone 23 6.2 Road condition Binary — — 1 �no unusual condition 345 93.8 0 �dry 335 91.0 Road surface Binary — — 1 �wet 33 9.0 Trafc characteristics Volume Continuous Vehicle counts over the 5minutes period preceding PCs — — 369.73 158.82 (veh/5min) Speed (mile/h) Continuous Vehicle speed over the 5minutes period preceding PCs — — 48.22 17.16 Journal of Advanced Transportation 5 1.0 Peak Weekend 0.8 Weather Lighting Population 0.6 Collision Severity Collision Type 0.4 Violation Category Party Count Tow Away 0.2 Truck Involved Hit and Run 0.0 Alcohol Involved Road Condition –0.2 Road Surface Volume (veh/5 min) Speed (mph) –0.4 Figure 1: Pearson correlation coefcients of variables. after the PC, and the spatial period is 2 miles up- others will not be selected once. After k rounds of extraction, stream and 2 miles downstream of the PC location. k new sample sets are obtained. (2) Decision trees genera- To eliminate the efects of recurrent congestion, the tion: training k decision trees using k sample sets of data. historical average speed was calculated by collecting During each round of generating trees, m variables from speed data from the PC-free days in a year [13, 18]. M(m < M) features are selected for training. Te ran- domness of the training data and variable combinations (iii) Estimate the impact area of a potential PC using improves the prediction performance of the model and equation (2). Te crashes that occur in the impact essentially prevents overftting. (3) Result combination. area of PC are identifed as SCs. Since the decision trees generated are independent, they Following the two-stage identifcation method, 368 SCs have the same contribution to the predicted result. Tere- are identifed in this study. Te ratio of the number of SCs to fore, the fnal result is obtained by averaging the k predicted the number of all crashes is 1.49%, which is consistent with results. For multioutput problems, the following changes are the fndings of the references in this area that this ratio is required in the decision trees: First is to store several output around 1–1.6% [11–13, 18, 25, 38–40]. values instead of 1. Ten use splitting criteria that calculate the average reduction across all outputs. 4.2. Random Forests. Tis study used RF to predict the time and distance gaps, which has been widely used in the 4.3. SHAP Method. ML methods commonly demonstrate an transportation feld [41–46]. RF uses a bootstrap sampling outstanding prediction performance, while their abilities are method to change the training set to build an integration of limited due to their low interpretability. Although the RF regression trees [47]. Such a mechanism expresses the fol- model can obtain global explanations (i.e., the relative im- lowing advantages: gaining higher performance. Further- portance), it cannot quantify local explanations for indi- more, RF can perform multiple output modeling [48, 49], vidual predictions. Nevertheless, local explanations provide more detailed information than global ones [50, 51]. Shapley which is suitable for simultaneously predicting the time and distance gaps. additive explanations (SHAP) technology is a representative Te input vectors for the RF model are represented as local interpretation method that can explain the main local 􏼈x � [x , x , . . . , x ],y � [y , y ]􏼉, i � 1,2, . . . , N. M efects and interaction efects of independent variables on i1 i2 iM i1 i2 and N are the number of features and samples, y and y dependent variables, as proposed by [52]. Furthermore, [53] i1 i2 indicate the time gap and the distance gap of sample i, improved SHAP to better and faster explain tree-based ML respectively. Figure 2 expresses the structural framework of models, such as random forests and gradient boosted trees. RF,which consists of thefollowing threeparts: (1)Sample set SHAP value is the core of the method which is computed selection: using the resampling method p times on the based on the game-theoretic approach, and it represents the original dataset to generate a sample set. In other words, average marginal contributions of one variable on a single some samples are likely to be chosen multiple times, while prediction. SHAP value is defned as the following equation: Peak Weekend Weather Lighting Population Collision Severity Collision Type Violation Category Party Count Tow Away Truck Involved Hit and Run Alcohol Involved Road Condition Road Surface Volume (veh/5 min) Speed (mph) 6 Journal of Advanced Transportation Table 2: Optimal values of parameters of the RF model. Original data set Parameters Values n_estimators 110 max_depth 10 max_features “auto” ... Sample set 1 Sample set 2 Sample set k min_samples_split 2 min_samples_leaf 1 Tree 1 Tree 2 ... Tree k ... Result 1 Result 2 Result k model and the multilayer perceptron regression (MPR) model. All the models were trained and validated by ap- plying the same dataset to guarantee the reliability of the comparison results. Specifcally, at a ratio of 7:3, the raw Final result samples were split into a training set and a testing set for training and testing model. Two classical regression evalu- Figure 2: Structural framework of RF. ation measures, namely, mean absolute error (MAE) and mean squared error (MSE), were used to assess model performance. Te fnal evaluation results are presented in R R ϕ (f, x) � 􏽘 􏽨f 􏼐P ∪ i􏼑 − f 􏼐P 􏼑􏽩, (3) i x i x i Table 3. As shown, the RF model mostly outperformed the M! R∈R other two models on both the training and testing sets in terms of predicting the time and distance gaps. where R indicates the set of all variable orderings, P represents the set of all variables that rank before the variable i in the ordering R, M is the number of variables, x means 5.2. Global Importance of Variables. Figure 3 visualizes the the values of explanatory variables, and f refers to the global importance of variables on the time gap. In the left single prediction, which can be written by the following part, variables are sorted in descending order according to equation: their global importance, computed by averaging their ab- solute SHAP values per variable. Te left x-axis indicates the f(x) � ϕ (f) + 􏽘 ϕ (f, x), (4) 0 i mean(|SHAPvalue|). As shown, lighting is the most dom- i�1 inant variable on the time gap, and its average efect on the predicted value is 0.11, followed closely by volume and where ϕ (f) means the base value, i.e., the average value of speed, which change the predicted value by 0.093 and 0.056, overall predictions. respectively, on average. It suggested that the trafc char- Additionally, the global importance of variables is the acteristics signifcantly afect the time gap. Tis fnding is not sum of the contribution of one variable on all predictions, surprising; Trafc characteristics are the direct response of which is calculated by averaging absolute SHAP values as the trafc state, which largely infuences the travel sur- shown in the following equation: roundings and driver status. As [11] indicated, more than 􏼌 􏼌 􏼌 􏼌 (j) geometric characteristics and primary crash characteristics, 􏼌 􏼌 I � 􏽘 􏼌 􏼌 ϕ , (5) i 􏼌 􏼌 trafc characteristics could signifcantly afect the SC like- j�1 lihood. Subsequently, population has a greater contribution (j) where I represents the importance of variable i, ϕ in- than party count and collision severity, indicating that the i i dicates the SHAP value for variable i in the single prediction temporal characteristic of population impacts the time gap j, and n is the number of all predictions. more than the primary crash characteristic. By contrast, the Te proposed RF model and SHAP method were mainly roadway characteristics of road surface and condition have implemented in Python (3.8.8) using scikit-learn (0.24.1) a substantially minor efect on the time gap, with the and shap (0.40.0). Te SHAP package contains three ap- mean(|SHAPvalue|) less than 0.005. plications: force plot, summary plot and dependence plot. In In the right part, the diagram consists of points repre- this study, we apply the summary plot to describe the im- senting the samples, and the color visually reveals the portance of each variable and the dependence plot to refect magnitude of variables (red means a high value, while blue the main efects and the interaction efects of all variables. means a low value). Te right x-axis indicates the SHAP value, which refers to the efects of all variables on a single model output (i.e., the local efect). Tis diagram roughly 5. Results and Discussion illustrates the variation of efects with the change of either 5.1. Results. In this study, the grid-search with 5-fold cross- variable. Taking lighting as an example, its left side of the validation techniques (i.e., GridSearchCV) was used to vertical axis is covered with red points (indicate dark) and its determine the core parameters of the RF model. Table 2 right side is stacked with blue points (refer to daylight). Tis reports the optimal values of the parameters. In the appli- demonstrates that night may decrease the time gap, while the daytime probably promotes the time gap. In addition, high cation, the proposed RF model is compared with two tra- ditional multivariate models: the K-nearest neighbor (KNN) volume (red points) mostly has a positive SHAP value and Journal of Advanced Transportation 7 Table 3: Results of several models. Time gap Distance gap MAE MSE MAE MSE Training set Testing set Training set Testing set Training set Testing set Training set Testing set RF 0.22 0.46 0.07 0.31 0.45 0.45 0.33 0.31 KNN 0.44 0.47 0.28 0.32 0.49 0.47 0.36 0.36 MPR 0.45 0.46 0.30 0.32 0.46 0.47 0.31 0.32 Bold values refer to the maximum prediction performance in each circumstance. High Lighting Volume (veh/5 min) Speed (mph) Population Party Count Collision Severity Weather Tow Away Truck Involved Violation Category Peak Collision Type Hit and Run Weekend Alcohol Involved Road Surface Road Condition Low 0.00 0.02 0.04 0.06 0.08 0.10 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 mean (|SHAP value|) SHAP value Figure 3: Global importance of independent variables and a summary of local explanations for the time gap. low volume (blue points) mainly has a negative one, re- were calculated for each variable. In addition, considering vealing that high volume promotes the time gap while low the nontrivial efects of trafc characteristics on the time and volume inhibits it. distance gaps (see Figures 3 and 4 in the previous section), their interaction efects with the rest of the variables were Figure 4 represents the global importance of variables on the distance gap. As shown in the left part, volume is the also estimated. In this section, we select variables with strong efects for analysis. most signifcant contributor and has an overwhelming efect on the distance gap, changing the predicted value by 0.136. Figure 5 shows the local dependence plots for volume on Defnitely, volume size directly infuences the length of the the time and distance gaps. Specifcally, the frst two plots vehicle queueand, thus, the distance gap between the PCand reveal the main efects of volume, and the last two refect the SC. Lighting, speed, and population also rank at the top of interaction efects between volume and speed. Moreover, the the importance list. Road surface and condition are in the left column is for the time gap, while the right column is for bottom third and second places. Generally, the importance the distance gap. In each plot, every point corresponds to ranking of variables for the two gaps is diferent, but there a sample. Te x-axis represents the volume value; the left y- are overall similarities. Trafc features are always the most axis indicates the SHAP value (i.e., the local efect); the right y-axis and the diferent colored points in the last two plots important. Crash and temporal characteristics are com- monly distributed throughout the importance list. And road describe the speed value. As shown in Figures 5(a) and 5(b), traits contribute relatively small to both time and distance plots for volume reveal an overall upward trend. When gaps. Regarding the right part, it shows that high volume, volume is around 100veh/5min in the two plots, its local daylight, enormous speed, and a dense population have efects remain at the negative highest level, suggesting that a positive SHAP value, possibly increasing the distance gap. low volume may lead to a sharp decline in the time and distance gaps. One possible explanation is that low volume allows for such long distances between vehicles that drivers 5.3. Local Efects of Variables. In previous studies, the local tend to relax their vigilance generally. When faced with efects of a particular variable on the predicted outcome are a sudden crash, they are likely to react slowly and are unable often observed assuming that other variables are constant. to stop timely at high speed (as shown in the lower-left corner of Figures 5(c) and 5(d), the corresponding speed is Te drawback is that this way does not consider the issue that the changes of specifc variable likely cause variations in around 65mph). Another reasonable interpretation is that low volume does not contribute to long queue length for- other variables (rather than assuming that all other variables are constant). Te local dependence plot obtained based on mation, thus creating a short-distance gap. As volume grows to 500veh/5min, its local efects remain at the positive the SHAP method can quantify the variables’ efects while avoiding theabovementioneddisadvantage. Temain efects highest level, indicating that high volume is likely to rapidly Feature value 8 Journal of Advanced Transportation High Volume (veh/5 min) Lighting Speed (mph) Population Violation Category Collision Severity Party Count Weather Truck Involved Weekend Tow Away Hit and Run Peak Collision Type Road Surface Road Condition Alcohol Involved Low mean (|SHAP value|) SHAP value Figure 4: Global importance of independent variables and a summary of local explanations for the distance gap. 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (a) (b) 0.3 0.4 70 70 0.3 0.2 60 60 0.2 0.1 50 50 0.1 40 40 0.0 0.0 30 –0.1 30 –0.1 –0.2 20 20 –0.2 –0.3 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (c) (d) Figure 5: SHAP local dependence plots of volume. (a) Main efects of volume on the time gap. (b) Main efects of volume on the distance gap. (c) Interaction efects between volume and speed on the time gap. (d) Interaction efects between volume and speed on the distance gap. increase the time gap and distance gap. Tis fnding is Figures 6(a) and 6(b) show the main local efects of speed consistent with existing works [15]. Te reason might be that on the time and distance gaps, respectively. Te trends in the high volume makes the trafc situation entirely stressful, and two plots are similar in general (down then up), but the drivers have developed a cautious driving style under this infection points correspond to diferent speed values. In circumstance. When a PC occurs, drivers in the immediate Figure 6(a), as speed ranges between 0 and 50mph, its local vicinity upstream will not feel large shock, so SC does not efects on the time gap decline to negative from positive as it occur as quickly. Moreover, high volume can prolong queue increases. When speed falls 50–75mph, its local efects show length and thus increase the distance gap. When volume is a steep upward trend. As for Figure 6(b), when speed in- around 500veh/5min, its corresponding speed falls in an creased from 0 to 30mph, its local efects decline from 0.05 extensive range of 24–76mph. to −0.22, indicating that this value range of speed inhibits the SHAP value SHAP value 0.00 0.02 0.04 0.06 0.08 0.10 Speed (mph) 0.12 SHAP value SHAP value 0.14 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 Feature value Speed (mph) Journal of Advanced Transportation 9 0.3 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (a) (b) 0.3 0.3 800 800 0.2 0.2 600 600 0.1 0.1 0.0 400 400 0.0 –0.1 –0.1 200 200 –0.2 –0.2 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (c) (d) Figure 6: SHAP local dependence plots of speed. (a) Main efects of speed on the time gap. (b) Main efects of speed on the distance gap. (c) Interaction efects between speed and volume on the time gap. (d) Interaction efects between speed and volume on the distance gap. distance gap. As the speed continues to increase, the local In other words, a bright environment has a larger volume efects grow to be positive. Moreover, we found that when and positive local efects, while a dark condition has a rel- the speed ranges between 60 and 75mph (the average atively smaller volume and negative local efects. It makes volume for this speed range is 281veh/5min), the corre- sense that the vehicle trips are more during the day than at sponding efects for both time and distance gaps are stable night. Likewise, it is reasonable to consider that high volume around value 0, as observed from Figures 6(c) and 6(d). Such likely prolongs queue length and therefore increases the a fnding demonstrates that this trafc state has minor distance gap. promotion/inhibition on both gaps. Figures 8(a) and 8(b) represent the main efects of vi- Figures 7(a) and 7(b) demonstrate the main efects of olation category on the two gaps. As observed, improper turns (i.e., violation category �4) have the maximum SHAP lighting on the time and distance gaps; the two plots reveal an approximate concave trend. As shown in Figure 7(a), the value. Specifcally, its local efects on the time and instance local efects of daylight and dawn (i.e., lighting �0 and 1) on gaps roughly fall between 0–0.10 and 0–0.15, respectively; the time gap fall in the range of 0–0.20, while streetlights and such ranges indicate that this violation category promotes no streetlights (i.e., lighting �2 and 3) have the most neg- the time and distance gaps to a varying degree. Te reason ative efects. Such variations in local efects indicate that might be that the crashes in which the violation category is a darker environment will accelerate the occurrence of SCs. improper turns probably block turn lanes (usually on a one- Probably because the driver’s sight distance in dark situa- way road), thus afecting the vehicles behind and causing tions depends on the space illuminated by the streetlights a long queue length. Followed by another violation category and headlights, leading to a lack of timely and clear per- of unsafe lane changes (i.e., violation category �3), which ception of the current road condition, resulting in in- shows positive correlations with both gaps. Likewise, crashes caused by unsafelane changes likely block multiple lanesand sufcient avoidance of PCs and thus reducing the time gap. Figures 7(c) and 7(d) display the interaction efects between involve several vehicles, thus decreasing the road capacity lighting and volume. As observed, all points are approxi- signifcantly and extending the queue length. Besides, this mately divided by their color into upper-right and lower-left type of crash is more visible. Tat means drivers behind can parts, with most of the pale and dark blue points (i.e., catch the crash information at a distance and drive more representing daylight and dawn) being above the horizontal carefully, increasing the time and distance gaps. By contrast, axis where the local efect is −0.1, the red and orange points the other four violation categories have more negligible local (i.e., denoting streetlights and no streetlights) being below it. efects. As shown in Figures 8(c) and 8(d), it is the SHAP value SHAP value Volume (veh/5 min) SHAP value SHAP value Volume (veh/5 min) 10 Journal of Advanced Transportation 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 12 3 03 12 Lighting Lighting (a) (b) 3 3 0.3 0.4 0.3 0.2 0.2 2 2 0.1 0.1 0.0 0.0 1 1 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (c) (d) Figure 7: SHAP local dependence plots of lighting. (a) Main efects of lighting on the time gap. (b) Main efects of lighting on the distance gap. (c) Interaction efects between volume and lighting on the time gap. (d) Interaction efects between volume and lighting on the distance gap. fatal crashes) occur in the speed range of 60–70mph, sug- interaction efects between violation category and speed. We fnd a strong association between crashes involving alcohol gesting that serious crashes frequently occur at high speeds. (i.e., violation category �0) and high speed, because points Te main and interaction efects of other variables are are red on the frst vertical column. Another interesting presented in Figures 10 and 11. As shown, plots of pop- fnding is that the red points in the ffth vertical column (i.e., ulation reveal a broadly upward trend, varying from negative violation category �4) are concentrated at the bottom, il- to positive. A dense population (i.e., Population �3 and 4, lustrating that those crashes, which occurred due to unsafe indicating the population is more than 250000) promotes lane changes at high speeds, reduce the time and the time gap and distance gap. One possible explanation is distance gaps. that car ownership and travel trips may be relatively high in Figures 9(a) and 9(b) represent the main efects of these densely populated areas, leading to long queuing times collision severity. Te fatal crashes, severe injury crashes, and length. Te local efects of most weekdays (i.e., week- end �0) and peak periods (i.e., peak �1) on the distance gap and light injury crashes (i.e., collision severity �0, 1, and 2) have a promotion on the time and distance gaps, while only are greater than the value 0. It makes sense that weekdays complaining crashes (i.e., collision severity �3) mainly have and peak periods have many commuter trips, resulting in inhibition on the two gaps. One possible reason is that high volume on the road. Te plot for collision type shows serious crashes attract more attention, such as rapid rescue a v trend on the time gap while a downward trend on the and intervention by trafc police, so that SCs do not occur at distance gap. Te efects of clear days are around the value 0, a close time and distance. Figures 9(c) and 9(d) show the while the efects of most cloudy days are less than the value 0. interaction efects between collision severity and speed. As Such a comparison indicates that cloudy days will inhibit observed, most of the blue points (represent the sample of both gaps, i.e., SCs will occur sooner and closer on cloudy SHAP value SHAP value Lighting SHAP value SHAP value Lighting Journal of Advanced Transportation 11 0.10 0.150 0.08 0.125 0.06 0.100 0.04 0.075 0.02 0.050 0.00 0.025 –0.02 0.000 –0.04 –0.025 –0.06 –0.050 Violation Category Violation Category (a) (b) 0.150 0.10 70 70 0.125 0.08 60 60 0.100 0.06 0.075 0.04 50 50 0.050 0.02 0.025 0.00 0.000 –0.02 –0.04 20 –0.025 –0.050 –0.06 0 4 5 0 12345 Violation Category Violation Category (c) (d) Figure 8: SHAP local dependence plots of violation category. (a) Main efects of the violation category on the time gap. (b) Main efects of violation category on the distance gap. (c) Interaction efects between violation category and speed on the time gap. (d) Interaction efects between violation category and speed on the distance gap. 0.150 0.125 0.125 0.100 0.100 0.075 0.075 0.050 0.050 0.025 0.025 0.000 0.000 –0.025 –0.025 –0.050 –0.050 0123 0123 Collision Severity Collision Severity (a) (b) Figure 9: Continued. SHAP value SHAP value SHAP value Speed (mph) SHAP value SHAP value SHAP value Speed (mph) 12 Journal of Advanced Transportation 3 3 0.3 0.3 0.2 0.2 2 2 0.1 0.1 0.0 0.0 1 1 –0.1 –0.1 –0.2 –0.2 0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (c) (d) Figure 9: SHAP local dependence plots of collision severity. (a) Main efects of collision severity on the time gap. (b) Main efects of collision severity on the distance gap. (c) Interaction efects between collision severity and speed on the time gap. (d) Interaction efects between collision severity and speed on the distance gap. 0.10 0.10 0.05 0.05 0.00 0.00 –0.05 –0.05 –0.10 –0.10 Population Population (a) (b) 0.06 0.04 0.02 0.04 0.00 0.02 –0.02 –0.04 0.00 –0.06 –0.02 –0.08 01 0 1 Weekend Weekend (c) (d) Figure 10: Continued. SHAP value SHAP value SHAP value Collision Severity SHAP value SHAP value SHAP value Collision Severity Journal of Advanced Transportation 13 0.05 0.04 0.04 0.03 0.02 0.02 0.00 0.01 0.00 –0.02 –0.01 –0.02 –0.04 0 1 Peak Peak (e) (f) 0.12 0.06 0.10 0.04 0.08 0.02 0.06 0.04 0.00 0.02 –0.02 0.00 –0.04 –0.02 –0.06 0123456 0123456 Collision Type Collision Type (g) (h) 0.05 0.05 0.00 0.00 –0.05 –0.05 –0.10 –0.10 –0.15 –0.15 –0.20 012 012 Weather Weather (i) (j) 0.03 0.01 0.02 0.00 0.01 –0.01 0.00 –0.02 –0.01 –0.03 –0.02 –0.04 –0.03 –0.05 –0.04 –0.06 0 1 0 1 Alcohol Involved Alcohol Involved (k) (l) Figure 10: Continued. SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value SHAP value 14 Journal of Advanced Transportation 0.01 0.01 0.00 0.00 –0.01 –0.01 –0.02 –0.02 –0.03 –0.03 –0.04 –0.04 –0.05 –0.05 –0.06 –0.06 0 1 0 1 Road Surface Road Surface (m) (n) 0.005 0.06 0.000 –0.005 0.04 –0.010 0.02 –0.015 –0.020 0.00 –0.025 –0.030 –0.02 0 1 Road Condition Road Condition (o) (p) Figure 10: SHAP main efects of variables on the time gap and the distance gap. (a) Population on the time gap. (b) Population on the distance gap. (c) Weekend on the time gap. (d) Weekend on the distance gap. (e) Peak on the time gap. (f) Peak on the distance gap. (g) Collision type on the time gap. (h) Collision type on the distance gap. (i) Weather on the time gap. (j) Weather on the distance gap. (k) Alcohol involved on the time gap. (l) Alcohol involved on the distance gap. (m) Road surface on the time gap. (n) Road surface on the distance gap. (o) Road condition on the time gap. (p) Road condition on the distance gap. 1.0 1 0.3 0.3 0.8 0.2 0.2 0.6 0.1 0.1 0.4 0.0 0.0 –0.1 –0.1 0.2 –0.2 –0.2 0.0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (a) (b) Figure 11: Continued. SHAP value SHAP value SHAP value Weekend SHAP value SHAP value SHAP value Weekend Journal of Advanced Transportation 15 0.12 0.06 0.10 800 800 0.04 0.08 600 0.02 600 0.06 0.04 0.00 400 400 0.02 –0.02 0.00 200 200 –0.04 –0.02 –0.06 01234 5 6 01234 5 6 Collision Type Collision Type (c) (d) 6 6 0.3 0.3 5 5 0.2 0.2 4 4 0.1 0.1 3 3 0.0 0.0 2 2 –0.1 –0.1 1 1 –0.2 –0.2 0 0 10 20 30 40 50 60 70 10 20 30 40 50 60 70 Speed (mph) Speed (mph) (e) (f) 2 2 0.3 0.4 0.3 0.2 0.2 0.1 0.1 1 1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (g) (h) Figure 11: Continued. SHAP value SHAP value SHAP value Weather Collision Type Volume (veh/5 min) SHAP value SHAP value SHAP value Weather Collision Type Volume (veh/5 min) 16 Journal of Advanced Transportation 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (i) (j) 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (k) (l) 1 1 0.3 0.4 0.3 0.2 0.2 0.1 0.1 0.0 0.0 –0.1 –0.1 –0.2 –0.2 –0.3 0 0 0 200 400 600 800 1000 0 200 400 600 800 1000 Volume (veh/5 min) Volume (veh/5 min) (m) (n) Figure 11: SHAP interaction efect plots among variables on the time gap and the distance gap. (a) Speed and weekend on the time gap. (b) Speed andweekend onthe distancegap.(c) Collisiontype and volumeon the timegap.(d)Collision type andvolume onthe distancegap. (e) Speed and collision type on the time gap. (f) Speed and collision type on the distance gap. (g) Volume and weather on the time gap. (h) Volume and weather on the distance gap. (i) Volume and alcohol involved on the time gap. (j) Volume and alcohol involved on the distance gap. (k) Volume and road surface on the time gap. (l) Volume and road surface on the distance gap. (m) Volume and road condition on the time gap. (n) Volume and road condition on the distance gap. SHAP value SHAP value SHAP value Road Condition Road Surface Alcohol Involved SHAP value SHAP value SHAP value Road Condition Road Surface Alcohol Involved Journal of Advanced Transportation 17 intensity of the lighting. Moreover, crashes involving the days. Drinking (i.e., alcohol involved �1) mostly has neg- ative local efects, meaning that drinking will reduce the time violation categories of improper turns or unsafe lane changes possibly cause long time and large distance gaps. and distance gap. Tis is consistent with reality. Te wet surface (i.e., road surface �1) inhibits both gaps, which is Te contributions of this study can be summarized in consistent with existing knowledge [16]. It makes sense that the following three aspects: (1) proposing a two-stage SC a wet road surface harms the vehicle’s stability, such as identifcation method, which combined the static and a brake failure, thus accelerating the occurrence of SCs. dynamic approaches. And the identifcation results on the test data are consistent with existing works, providing a reliable basis for SC analysis. (2) Applying random forest 6. Conclusions to simultaneously predict the time and distance gaps, Tis study aimed at predicting the time and distance gaps which facilitated understanding the relationship between the dependent and independent variables.17 independent between SCs and PCs on highways and to analyze how the infuencing factors contribute to the gaps comprehensively. variables selected from temporal, primary crash, roadway, and trafc characteristics and two dependent variables, First, a data-driven identifcation method combining the fxed spatiotemporal thresholds-based method and the namely time gap and distance gap, were used as inputs to speed contour map-based method was developed to identify train and test the random forest model. Te results SCs. A total of 368 SCs were sought out from the total achieved better performance compared with other number of 24643 crashes. Ten, the RF model was applied to models. (3) Using a brand-new interpretation technique predict the two gaps. Te data samples were split into SHAP to explain the RF model from global and local ways. training and testing sets at a ratio of 7:3. Te results showed We made several signifcant fndings which will be def- that the RF model performed better than KNN and MPR. nitely helpful for trafc decision makers to formulate Additionally, the SHAP method was conducted to explain strategies. Tis research also raises issues in need of further ex- the outputs of the RF model. Based on this local in- terpretation method, we revealed variables’ global impor- plorations in the future. First, 368 crashes were used in the model training. Although we applied ML models that are tance and main and interaction efects on the time and distance gaps. advantageous for handling sparse data, small sample sizes We found that trafc volume and speed are the im- may reduce the performance of the models. More data are portant contributors to the time and distance gaps; moni- expected to be required to improve the model performance. toring trafc conditions helps implement timely and Second, 17 variables were used, and future work will cover efective management to prevent SCs. Several temporal more types of factors. Tis study focused on temporal characteristics, such as lighting and population, contribute characteristics, primary crash factors, roadway conditions, and real-time trafc parameters. Other factors, such as more to both gaps than primary crash features and road factors. Compared with road factors, the primary crash shoulder width and truck proportion which have shown correlations with the time gap and distance gap of SCs, will characteristics of violation category, party count, and col- lision severity demonstrate more signifcant efects. With be considered in future research. Te SC factors are also worthy of being discussed. In the future, it is a potential idea these fndings about factor priorities, trafc managers and policymakers can develop prevention plans and allocate to combine the PC and SC characteristics to explore the time resources more efciently. and distance gaps. Te local dependence plots quantify the efects of var- iables. Plots for the continuous variables, i.e., volume and Data Availability speed, reveal developing trends and several infection points. For example, the local efects of volume increase mono- Te data used to support the fndings of this study are tonically from −0.3 to 0.4 as the volume grows. Such var- available from the corresponding author upon reasonable iation indicates that low volume sharply inhibits the time request. and distance gaps, while high volume boosts them signif- cantly. Additionally, the local efects on the distance gap are Conflicts of Interest around value 0 when volume falls between 300 and 400veh/ 5min, suggesting that the trafc state in this volume afects Te authors declare that there are no conficts of interest. the gap inconsiderably. Te plot for the main efects of speed on the distance gap shows an obvious infection point. Such Acknowledgments critical information above is considerable for trafc safety managers. As for plots about the discrete variables, dem- Tis research was funded in part by the National Natural onstrate the local efects and corresponding characteristics Science Foundation of China (grant no. 52172310), Hu- of diferent categories of variables. Take lighting as an ex- manities and Social Sciences Foundation of the Ministry of ample: the efects of daylight and dawn are positive, while Education (grant no. 21YJCZH147), Innovation-Driven those of streetlights and no streetlights are mostly having Project of Central South University (grant no. negative efects. Tat is to say, a darker environment 2020CX041), and the Fundamental Research Funds for the probably accelerates the occurrence of SCs. Where the Central Universities of Central South University (grant no. economic condition allows, it is advantageous to increase the 2022ZZTS0717). 18 Journal of Advanced Transportation [17] R. Raub, “Occurrence of secondary crashes on urban arterial References roadways,” Transportation Research Record: Journal of the [1] World health organization, “Road trafc injuries,” 2023, Transportation Research Board, vol. 1581, no. 1, pp. 53–58, https://www.who.int/news-room/fact-sheets/detail/road- trafc-injuries. [18] B. Yang, Y. Guo, and C. Xu, “Analysis of freeway secondary [2] H. Yang, Z. Wang, K. Xie, K. Ozbay, and M. Imprialou, crashes with a two-step method by loop detector data,” IEEE “Methodological evolution and frontiers of identifying, Access, vol. 7, pp. 22884–22890, 2019. modeling and preventing secondary crashes on highways,” [19] J. E. Moore, G. Giuliano, and S. Cho, “Secondary accident Accident Analysis and Prevention, vol. 117, pp. 40–54, 2018. rates on Los Angeles freeways,” Journal of Transportation [3] S. A. Tedesco, V. Alexiadis, W. R. Loudon, R. Margiotta, and Engineering, vol. 130, no. 3, pp. 280–285, 2004. D. Skinner, “Development of a 40 model to assess the safety [20] W. Hirunyanitiwattana and S. P. Mattingly, “Identifying Secondary Crash Characteristics for California Highway impacts of implementing IVHS user services, moving toward System,” in Proceedings of the Transportation Research Board deployment,” in Proceedings of the IVHS America Annual Meeting, Washington, DC, USA, January 2006. Meeting, pp. 343–352, Atlanta GA, USA, March 1994. [21] L. Kopitch and J.-D. M. Saphores, “Assessing efectiveness of [4] N. Owens, A. Armstrong, P. Sullivan et al., Trafc Incident changeable message signs on secondary crashes,” in Pro- Management Handbook, Federal Highway Administration, ceedings of the Transportation Research Board 90th Annual Washington, DC, USA, 2010. Meeting, Washington, DC, USA, January 2011. [5] J. G. Pigman, J. R. Walton, and E. R. Green, “Identifcation of [22] H. Yang, Z. Wang, K. Xie, and D. Dai, “Use of ubiquitous secondary crashes and recommended countermeasures,” probe vehicle data for identifying secondary crashes,” Crash Severity, Transport Research International Documen- Transportation Research Part C: Emerging Technologies, tation, Washington, DC, USA, 2011. vol. 82, pp. 138–160, 2017. [6] M. Jalayer, F. Baratian-Ghorghi, and H. Zhou, “Identifying [23] E. I. Vlahogianni, M. G. Karlaftis, and F. P. Orfanou, and characterizing secondary crashes on the Alabama state “Modeling the efects of weather and trafc on the risk of highway systems,” Advances in Transportation Studies, vol. 37, secondary incidents,” Journal of Intelligent Transportation pp. 129–140, 2015. Systems, vol. 16, no. 3, pp. 109–117, 2012. [7] Y. Tian, H. Chen, and D. Truong, “A case study to identify [24] M.-I. M. Imprialou, F. P. Orfanou, E. I. Vlahogianni, and secondary crashes on interstate highways in Florida by using M. G. Karlaftis, “Methods for defning spatiotemporal in- geographic information systems (gis),” Advances in Trans- fuence areas and secondary incident detection in freeways,” portation Studies, vol. 2, pp. 103–112, 2016. Journal of Transportation Engineering, vol. 140, no. 1, [8] H. Yang, Z. Wang, and K. Xie, “Impact of connected vehicles pp. 70–80, 2014. on mitigating secondary crash risk,” International Journal of [25] W. Junhua, L.Boya, Z. Lanfang, and D.R. Ragland, “Modeling Transportation Science and Technology, vol. 6, no. 3, secondary accidents identifed by trafc shock waves,” Acci- pp. 196–207, 2017a. dent Analysis and Prevention, vol. 87, pp. 141–147, 2016. [9] C. Zhan, L. Shen, M. A. Hadi, and A. Gan, “Understanding the [26] A. A. Sarker, R. Paleti, S. Mishra, M. M. Golias, and characteristics of secondary crashes on freeways,” in Pro- P. B. Freeze, “Prediction of secondary crash frequency on ceedings of the Transportation Research Board 87th Annual highway networks,” Accident Analysis and Prevention, vol. 98, Meeting, Washington, DC, USA, January 2008. pp. 108–117, 2017. [10] H. Yang, K. Ozbay, and K. Xie, “Assessing the risk of sec- [27] N. J. Goodall, “Probability of secondary crash occurrence on ondary crashes on highways,” Journal of Safety Research, freeways with the use of private-sector speed data,” Trans- vol. 49, no. 143, pp. 143.e1–149, 2014. portation Research Record: Journal of the Transportation Re- [11] C. Xu, P. Liu, B. Yang, and W. Wang, “Real-time estimation of search Board, vol. 2635, no. 1, pp. 11–18, 2017. secondary crash likelihood on freeways using high-resolution [28] H. Park, S. Gao, and A. Haghani, “Sequential interpretation loop detector data,” Transportation Research Part C: Emerging and prediction of secondary incident probability in real time,” Technologies, vol. 71, pp. 406–418, 2016. in Proceedings of the Transportation Research Board 96th [12] H. Park and A. Haghani, “Real-time prediction of secondary Annual Meeting, Washington, DC, USA, January 2017. incident occurrences using vehicle probe data,” Trans- [29] A. E. Kitali, P. Alluri, T. Sando, H. Haule, E. Kidando, and portation Research Part C: Emerging Technologies, vol. 70, R. Lentz, “Likelihood estimation of secondary crashes using pp. 69–85, 2016. Bayesian complementary log-log model,” Accident Analysis [13] P. Li and M. Abdel-Aty, “A hybrid machine learning model and Prevention, vol. 119, pp. 58–67, 2018. for predicting Real-Time secondary crash likelihood,” Acci- [30] J. Tang, F. Liu, W. Zhang, R. Ke, and Y. Zou, “Lane-changes dent Analysis and Prevention, vol. 165, Article ID 106504, prediction based on adaptive fuzzy neural network,” Expert Systems with Applications, vol. 91, pp. 452–463, 2018. [14] H. B. Zhang and A. Khattak, “Spatiotemporal patterns of [31] X. M. Chen, S. Zhang, and L. Li, “Multi-model ensemble for primary and secondary incidents on urban freeways,” short-term trafc fow prediction under normal and abnormal Transportation Research Record: Journal of the Transportation conditions,” IET Intelligent Transport Systems, vol. 13, no. 2, Research Board, vol. 2229, no. 1, pp. 19–27, 2011. pp. 260–268, 2018. [15] D. Chimba and B. Kutela, “Scanning secondary derived [32] A. Jamal, M. Zahid, M. Tauhidur Rahman et al., “Injury se- crashes from disabled and abandoned vehicle incidents on verity prediction of trafc crashes with ensemble machine uninterrupted fow highways,” Journal of Safety Research, learning techniques: a comparative study,” International vol. 50, no. 5, pp. 109–116, 2014. Journal of Injury Control and Safety Promotion, vol. 28, no. 4, [16] J.Wang, B. Liu, T. Fu, S.Liu, and J. Stipancic, “Modelingwhen pp. 408–427, 2021. and where a secondary accident occurs,” Accident Analysis ´ ´ [33] G. Asencio-Cortes, E. Florido, A. Troncoso, and F. Martınez- and Prevention, vol. 130, pp. 160–166, 2019. Alvarez, “A novel methodology to predict urban trafc Journal of Advanced Transportation 19 congestion with ensemble learning,” Soft Computing, vol. 20, [51] L. Xiao, S. Lo, J. Liu, J. Zhou, and Q. Li, “Nonlinear and no. 11, pp. 4205–4216, 2016. synergistic efects of TOD on urban vibrancy: applying local [34] SWITRS, “Statewide Integrated Trafc Records System explanations for gradient boosting decision tree,” Sustainable (SWITRS),” 2022, https://iswitrs.chp.ca.gov/Reports/jsp/ Cities and Society, vol. 72, Article ID 103063, 2021. [52] S. M. Lundberg and S.-I. Lee, “A unifed approach to inter- index.jsp. preting model predictions,” Advances in Neural Information [35] Pems, “Caltrans PeMS,” 2022, https://pems.dot.ca.gov/. Processing Systems, vol. 30, pp. 4765–4774, 2017. [36] H. T. Yang, G. C. Zhai, L. C. Yang, and K. Xie, “How does the [53] S. M. Lundberg, G. Erion, H. Chen et al., “From local ex- suspension of ride-sourcing afect the transportation system planations to global understanding with explainable AI for and environment?” Transportation Research Part D: Transport trees,” Nature Machine Intelligence, vol. 2, no. 1, pp. 56–67, and Environment, vol. 102, Article ID 103131, 2022. [37] H. T. Yang, J. H. Huo, R. B. Pan, K. Xie, W. J. Zhang, and X. J. Luo, “Exploring built environment factors that infuence the market share of ridesourcing service,” Applied Geography, vol. 142, Article ID 102699, 2022. [38] A. A. Sarker, A. Naimi, S. Mishra, M. M. Golias, and P. B. Freeze, “Development of a secondary crash identifcation algorithm and occurrence pattern determination in large scale multi-facility transportation network,” Transportation Re- search Part C: Emerging Technologies, vol. 60, pp. 142–160, [39] S. Mishra, M. Golias, A. Sarker, and A. Naimi, Efect of Primary and Secondary Crashes: Identifcation, Visualization, and Prediction Research Report No. CFIRE 09-05, University of Wisconsin-Madison, Madison, WI, USA, 2016. [40] J. Wang, W. Xie, B. Liu, S. Fang, and D. R. Ragland, “Identifcation of freeway secondary accidents with trafc shock wave detected by loop detectors,” Safety Science, vol. 87, pp. 195–201, 2016. [41] D. Yao, J. Yang, and X. Zhan, “A novel method for disease prediction: hybrid of random forest and multivariate adaptive regression splines,” Journal of Computers, vol. 8, no. 1, pp. 73-74, 2013. [42] K. Miller, F. Huettmann, B. Norcross, and M. Lorenz, “Multivariate random forest models of estuarine-associated fsh and invertebrate communities,” Marine Ecology Progress Series, vol. 500, pp. 159–174, 2014. [43] H. Hong, H. R. Pourghasemi, and Z. S. Pourtaghi, “Landslide susceptibility assessment in Lianhua County (China): a comparison between a random forest data mining technique and bivariate and multivariate statistical models,” Geo- morphology, vol. 259, pp. 105–118, 2016. [44] Y. Li, C. Zou, M. Berecibar et al., “Random forest regression for online capacity estimation of lithium-ion batteries,” Ap- plied Energy, vol. 232, pp. 197–210, 2018. [45] J. Tang, J. Liang, C. Han, Z. Li, and H. Huang, “Crash injury severity analysis using a two-layer stacking framework,” Accident Analysis and Prevention, vol.122, pp. 226–238, 2019. [46] G. Xu, M. Liu, Z. Jiang, D. Sofker, ¨ and W. Shen, “Bearing Fault diagnosis method based on deep convolutional neural network and random forest ensemble learning,” Sensors, vol. 19, no. 5, p. 1088, 2019. [47] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001. [48] G. De’ath, “Multivariate regression trees: a new technique for modeling species-environment relationships,” Ecology, vol. 83, no. 4, pp. 1105–1117, 2002. [49] M. Segal and Y. Xiao, “Multivariate random forests,” WIREs Data Mining and Knowledge Discovery, vol.1, pp. 80–87, 2011. [50] X. Wen, Y. Xie, L. Wu, and L. Jiang, “Quantifying and comparing the efects of key risk factors on various types of roadway segment crashes with LightGBM and SHAP,” Ac- cident Analysis and Prevention, vol. 159, Article ID 106261,

Journal

Journal of Advanced TransportationHindawi Publishing Corporation

Published: Apr 14, 2023

There are no references for this article.