Using Reinforcement Learning to Handle the Unintended Lateral Attack in the Intelligent Connected Vehicle Environment

Hindawi Journal of Advanced Transportation, Volume 2023, Article ID 3187944, 10 pages, https://doi.org/10.1155/2023/3187944

Research Article

Luoyi Huang (1,2), Wanjing Ma (1), Ling Wang (1), and Kun An (1)

1 The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China
2 Bosch Automotive Products (Suzhou) Co. Ltd., Suzhou 215025, China

Correspondence should be addressed to Wanjing Ma; mawanjing@tongji.edu.cn

Received 16 August 2022; Revised 15 October 2022; Accepted 17 March 2023; Published 21 April 2023

Academic Editor: Wenxiang Li

Copyright © 2023 Luoyi Huang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

It is widely accepted that an unintended lateral attack is inevitable in the intelligent connected vehicle environment. This paper explores the feasibility of a reinforcement learning method, PPO (Proximal Policy Optimization), to handle the unintended lateral attack and keep the vehicle in the ego lane. Based on the China highway design guide, the discrete speed variants of 120 km/h, 100 km/h, and 80 km/h were selected, along with curvatures ranging from 250 m to 1200 m in steps of 50 m, as speed-curvature test combinations. The tests were implemented in the OpenAI CarRacing-v0 simulation environment with an external racing wheel attached to simulate the unintended lateral attack. The simulation results show that PPO can handle the unintended lateral attack on the standard-designed highway in China. The results can be applied to intelligent connected vehicles to be mass-produced in the future.

1. Introduction

Intelligent connected vehicles are shaping the automotive industry. They allow the vehicle to communicate with other traffic participants. In a connected environment, attack is inevitable. There are many possible ways to handle an identified attack from a security perspective [1]. However, only a few studies address the unintended attack from a functional safety perspective.

Along with the rapid development of the intelligent connected vehicle, safety is becoming an important issue to consider. Automotive functional safety (ISO 26262, Road vehicles - Functional safety [2]) has become the de facto practice for intelligent connected vehicles produced in the market. ISO 26262 generally gives a system credit for a human driver ultimately being responsible for safety, which consists of three evaluation factors: severity, probability of exposure, and controllability [3]. ISO 26262 is explicitly targeted at automotive safety, providing a safety lifecycle that includes development, production, operation, service, and decommissioning. ISO 26262 defines the ASIL (Automotive Safety Integrity Level), which is calculated from severity, probability of exposure, and controllability. As of today, there is no fully autonomous vehicle that end-users can buy in the market, and one reason is that absolute safety cannot be proved in a commonly accepted way.

Functional safety is concerned with the electric/electronic malfunction behavior of the vehicle. Its practice has evolved for many years and has already been applied in mass-produced vehicles. SOTIF, which stands for Safety of the Intended Functionality, is a logical supplement to the established functional safety standard ISO 26262. SOTIF deals with the functional limitation of the vehicle concerning the absence of unreasonable risk due to hazards resulting from functional insufficiency of the intended functionality together with reasonably foreseeable misuse by persons [4].
SOTIF is currently difficult to quantify. Take lines of source code as an example. The Air Force F-22 has around 1.7 million lines of source code, the Boeing 787 has 6.5 million lines, and the F-35 has 24 million lines. Compared with these examples, a luxury vehicle already mass-produced in the market has 100 million lines [5]. Based on experts' assumptions, the lines of source code of an autonomous vehicle will increase exponentially considering the complexity of the functionality. With this tremendous amount of source code, there is a higher risk of an autonomous vehicle failing in some corner cases. In addition, with the introduction of deep learning technology, more unknowns can be brought in compared with traditional Vee-model development. To address this, García and Fernández [6] introduced Safe Reinforcement Learning, defined as the process of learning policies that maximize the expectation of the return in problems in which it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or deployment processes.

Besides safety, security is also an essential factor to consider, since intelligent connected vehicles connect with other vehicles, infrastructure, and the cloud. A vehicle is no longer an isolated object in an intelligent connected vehicle environment. Along with the enriched functionality enabled by connectivity, the vehicle opens an attack interface to external resources. Dibaei et al. [7] summarized common attack methods in the intelligent connected vehicle environment, mainly including DoS (Denial of Service), DDoS (Distributed Denial of Service), black-hole attack, replay attack, Sybil attack, impersonation attack, malware, falsified information attack, and timing attack. Even with modern encryption technology, security vulnerabilities can still be found in the automotive industry. Chattopadhyay et al. [8] found that the security-by-design principle for autonomous vehicles is poorly understood and rarely practiced. The intelligent connected vehicle is prone to attack, and unintended attack is inevitable. The assumption made in this paper is that attack exists and cannot be eliminated. In addition, the feet-free longitudinal function, such as adaptive cruise control, has been widely studied and released in the market for years, so this paper focuses on the unintended lateral attack.

The unintended lateral attack is not only an automotive issue; it can also cause environmental problems and become a barrier to achieving low-carbon transportation [9]. Therefore, research on how to handle the unintended lateral attack is necessary to reach future intelligent transportation systems. One possible way to handle the unintended lateral attack is reinforcement learning. Reinforcement learning is used by an agent to learn behavior through trial-and-error interactions with the environment [10]. A standard reinforcement learning model is shown in Figure 1.

[Figure 1: Reinforcement learning model. The vehicle interacts with the environment via a trial-and-error method.]

An agent is connected to the environment through perception and action. At each step of the interaction, the agent receives an input i with an indication of the current state s, selects an action a, and generates an output. The action changes the state of the environment, and the value of this state transition is communicated to the agent via a scalar reward r. The agent's behavior B aims to select actions that increase the long-run sum of rewards, and the agent can learn to do this over time through trial-and-error interactions. In recent years, reinforcement learning has been applied to the game of Go [11], highly automated driving [12-14], and traffic signal control [15-17], and has proven its effectiveness. Meanwhile, multi-agent reinforcement learning is becoming a hot research area [18-20]. Whether single-agent or multi-agent, there are some basic and commonly used reinforcement learning methods such as DQN (Deep Q Networks), PG (Policy Gradient), DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed DDPG), SAC (Soft Actor Critic), and A2C (Advantage Actor Critic). In 2017, OpenAI published PPO (Proximal Policy Optimization) [21], a family of policy optimization methods with a novel objective function that enables multiple epochs of minibatch updates and achieves a favorable balance between sample complexity, simplicity, and wall-clock time. The PPO algorithm is illustrated as follows:

for iteration = 1, 2, ... do
    for actor = 1, 2, ..., N do
        Run policy π_old in environment for T timesteps
        Compute advantage estimates Â_1, ..., Â_T
    end for
    Optimize surrogate L wrt θ, with K epochs and minibatch size M ≤ NT
    θ_old ← θ
end for

PPO uses two neural networks: the policy π(s) and the value function V(s). The policy π(s) maps an observation s_t to an action a_t, while the value function V(s) maps an observation s_t to a scalar value showing how advantageous it is to be in that state. The value network estimates the value of each state by minimizing the error between the predicted value and the actual value. The policy network uses the value function estimate to select actions that lead to higher rewards.
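To make the update step concrete, the following PyTorch sketch shows one PPO minibatch update built around the clipped surrogate objective described above. It is a minimal illustration rather than the authors' implementation; the tensor names (obs, actions, old_log_probs, advantages, returns) and the clipping value are assumptions, and the value and entropy coefficients mirror the training parameters listed later in Section 2.3.

    import torch
    import torch.nn.functional as F

    def ppo_update(policy, value_fn, optimizer, obs, actions,
                   old_log_probs, advantages, returns,
                   clip_eps=0.1, value_coef=0.5, entropy_coef=0.01):
        # One PPO minibatch update: clipped surrogate + value loss + entropy bonus.
        # Assumes policy(obs) returns a torch.distributions object and
        # value_fn(obs) returns state-value estimates shaped like `returns`.
        dist = policy(obs)
        log_probs = dist.log_prob(actions)
        ratio = torch.exp(log_probs - old_log_probs)    # pi_theta / pi_theta_old

        # Clipped surrogate objective (maximized, hence the minus sign).
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()

        # Value network regression toward the observed returns (mean squared error).
        value_loss = F.mse_loss(value_fn(obs), returns)

        # Entropy bonus encourages exploration.
        entropy = dist.entropy().mean()

        loss = policy_loss + value_coef * value_loss - entropy_coef * entropy
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()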
Resulting from these considerations, the remainder of this paper is organized as follows. Section 2 introduces the methods used in this study, including test scenario design, simulation environment construction, training procedure, and attack injection logic. Section 3 presents the simulation results and illustrates the effectiveness of our method. Section 4 concludes the findings and identifies open areas of research for future work.

2. Methods

The test scenarios were defined based on the highway standards in China. A modified CarRacing-v0 simulation environment was used to generate the test scenarios and provide a secondary development interface for the reinforcement learning implementation. The PPO algorithm was then applied to the selected scenarios for training; afterwards, the trained models were used to infer the rest of the test scenarios. The unintended lateral attack was simulated by attaching an external driving force racing wheel.

2.1. Test Scenario Design. The test scenario is a combination of speed and curvature on the highway, based on the "Technical Standard of Highway Engineering" [22] and the "Design Specification for Highway Alignment" [23]; see Table 1.

Table 1: Design of curvature and lane width in "Design Specification for Highway Alignment."
Design speed (km/h) | Limit curvature value (m), with I = 8%, where I denotes the maximum superelevation value | Lane width (m)
120 | 650 | 3.75
100 | 400 | 3.75
80 | 250 | 3.75
60 | 125 | 3.5
40 | 60 | 3.5

The most common speed limits on the highway in China are 120 km/h on the standard-designed highway, 100 km/h on the class-1 highway, and 80 km/h on the class-2 highway. A design speed of less than 80 km/h does not correspond to a standard highway in China; therefore, the minimum speed considered in this paper is 80 km/h. For curvature, four specific values were identified: 250 m, 400 m, 650 m, and 1200 m. The reasons for selecting these values are as follows:

(i) 250 m: the minimum curvature on the standard highway in China, which usually appears at on-ramps and off-ramps
(ii) 400 m: the minimum curvature of a highway with a speed limit of 100 km/h
(iii) 650 m: the minimum curvature of a highway with a speed limit of 120 km/h
(iv) 1200 m: the curvature threshold that covers 95% of highways in the Yangtze River area in China

The lane width was set to 3.75 m in our test scenarios, considering that the speed variants were 120 km/h, 100 km/h, and 80 km/h. The test matrix is defined in Table 2.

Table 2: Test matrix between speed and curvature.
Speed (km/h) | Curvature (m)
120 | Range [250, 1200] : 50, i.e., select curvature from 250 m to 1200 m in 50 m steps
100 | Range [250, 1200] : 50, i.e., select curvature from 250 m to 1200 m in 50 m steps
80 | Range [250, 1200] : 50, i.e., select curvature from 250 m to 1200 m in 50 m steps

Take the 100 km/h case as an example: curvatures were selected from the range [250, 1200] in 50 m steps, so the following curvature list can be derived: (250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200).
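The speed-curvature combinations in Table 2 can be enumerated with a few lines of Python. The sketch below simply reproduces the matrix described above; it is an illustrative helper, not part of the original test tooling.

    # Enumerate the speed-curvature test matrix described in Table 2:
    # curvatures from 250 m to 1200 m in 50 m steps for each speed variant.
    SPEEDS_KMH = (120, 100, 80)
    CURVATURES_M = list(range(250, 1201, 50))      # 250, 300, ..., 1200 (20 values)

    test_matrix = [(v, r) for v in SPEEDS_KMH for r in CURVATURES_M]

    assert len(CURVATURES_M) == 20
    assert len(test_matrix) == 60                  # 3 speeds x 20 curvatures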
2.2. Simulation Environment Construction. Simulation is commonly used as an environmental tool to train reinforcement learning algorithms, and the results have the potential to be transferred to solve real-world problems [24]. In this paper, CarRacing-v0 was selected as the simulation environment. CarRacing-v0 is a reinforcement learning environment developed by OpenAI [25] to support continuous control. It provides a bird's-eye-view racing environment that fits the future infrastructure-supported automated driving environment well, which differs from the in-vehicle sensing perspective. CarRacing-v0 provides a state that consists of 96 × 96 pixels.

The CarRacing-v0 reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. According to the example on the official website, if the agent finishes in 732 frames, the reward is 1000 - 0.1 × 732 = 926.8 points. The episode finishes when all tiles are visited.

Due to the limitation of the reward calculation in CarRacing-v0, the vehicle's exact position and the distance between the vehicle and the lane markers are unknown to us. Therefore, the following criteria were defined in this paper to decide whether reinforcement learning can handle the unintended attack (a sketch of this check follows the list):

(i) PASS: the vehicle moves back into the ego lane after the unintended attack; 10 out of 10 trials succeed
(ii) FAIL: the vehicle cannot move back into the ego lane and ultimately leaves the lane; more than 1 out of 10 trials fail
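The PASS/FAIL rule above reduces to a simple count over ten attack trials. The following sketch illustrates it; run_attack_trial is a hypothetical callable (not named in the paper) that injects one unintended lateral attack and reports whether the vehicle returned to the ego lane.

    def evaluate_combination(run_attack_trial, trials=10):
        # Apply the PASS/FAIL criterion to one speed-curvature combination.
        # run_attack_trial() is assumed to return True when the vehicle
        # moves back into the ego lane after the injected attack.
        successes = sum(bool(run_attack_trial()) for _ in range(trials))
        return "PASS" if successes == trials else "FAIL"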
The version of CarRacing-v0 used is 0.18.3, released in May 2021. The course shape, lane width, and traveling speed were modified to meet our test requirements. The course in the original CarRacing-v0 is randomly generated for reinforcement learning training and testing; in this paper, the source code was modified and recompiled to generate a fixed course shape. Figure 2 shows randomly generated courses, and Figure 3 shows fixed shape generation after the code modification.

[Figure 2: Randomly generated courses in CarRacing-v0. (a) Random course generation example 1. (b) Random course generation example 2.]
[Figure 3: Generation of fixed shapes after code modification. (a) Fixed curvature of 500 m. (b) Fixed curvature of 1700 m.]

In addition to the course shape, the lane width was changed to 3.75 m according to the needs of our study, as shown in Figure 4.

[Figure 4: Lane width modification. (a) Original lane width. (b) Modified lane width of 3.75 m.]

CarRacing-v0 provides the observation and action control interfaces in a Box manner. Box represents the Cartesian product of n closed intervals; it is a specific type defined by OpenAI Gym. The reward range of CarRacing-v0 is (-inf, inf). The action elements are listed in Table 3.

Table 3: Action element.
Action element | Action type | Data range | Meaning of vector
1 | Steering | [-1, 0, 1] | [Steer left, no action, steer right]
2 | Braking | [0, 0.5] | [No action, brake]
3 | Acceleration | [0, 1] | [No action, accelerate]

With the different combinations of the action elements, the action spaces in Table 4 can be derived.

Table 4: Action space.
Action space | Action combination
1 | [-1, 0, 0] *
2 | [-1, 0, 1]
3 | [-1, 0.5, 0]
4 | [-1, 0.5, 1]
5 | [0, 0, 0] *
6 | [0, 0, 1] *
7 | [0, 0.5, 0] *
8 | [0, 0.5, 1]
9 | [1, 0, 0] *
10 | [1, 0, 1]
11 | [1, 0.5, 0]
12 | [1, 0.5, 1]

To simplify the action set and avoid causing the vehicle to drift, only a single action from each action group was selected, marked with an asterisk in Table 4: steer left [-1, 0, 0], no action [0, 0, 0], accelerate [0, 0, 1], brake [0, 0.5, 0], and steer right [1, 0, 0]. In addition, to better monitor the parameters in real time, label texts were added in the following order, from left to right: reward, ABS sensor, speed, wheel angle, and angular velocity, as shown in Figure 5.

[Figure 5: Modified display of parameters in real time, showing reward, ABS sensor, speed, wheel angle, and angular velocity information.]
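The five retained actions can be kept in a small lookup table. The sketch below uses the [steering, braking, acceleration] ordering of Tables 3 and 4; the reordering helper is an assumption about how the mapping would be wired to env.step(), since the native CarRacing-v0 action layout is [steering, gas, brake].

    import numpy as np

    # The five retained actions, in the [steering, braking, acceleration]
    # order used in Tables 3 and 4 above.
    DISCRETE_ACTIONS = {
        "steer_left":  np.array([-1.0, 0.0, 0.0]),
        "no_action":   np.array([ 0.0, 0.0, 0.0]),
        "accelerate":  np.array([ 0.0, 0.0, 1.0]),
        "brake":       np.array([ 0.0, 0.5, 0.0]),
        "steer_right": np.array([ 1.0, 0.0, 0.0]),
    }

    def to_env_action(paper_action):
        # Reorder [steer, brake, gas] -> [steer, gas, brake] for env.step().
        steer, brake, gas = paper_action
        return np.array([steer, gas, brake], dtype=np.float32)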
2.3. Training Logic and Parameter Setting. The combination space of speed versus curvature is too large to train exhaustively; thus, the following combinations were selected for initial training: (120 km/h, 650 m), (100 km/h, 400 m), and (80 km/h, 250 m). The training and inference logic is illustrated in Figure 6.

[Figure 6: Training and inference logic. Train the reference model as a base and apply the reference model to different scenarios.]

In the training session, shown on the left-hand side of Figure 6, three separate models for (120 km/h, 650 m), (100 km/h, 400 m), and (80 km/h, 250 m) were trained as the base. According to the definition-of-done from the CarRacing-v0 leaderboard, "solving" is defined as achieving an average reward of 900+ over 100 consecutive episodes, which indicates that reinforcement-learning-based in-lane driving has been achieved. After that, the training session finished. In the inference session, the trained model from (120 km/h, 650 m) was used to test the variants (120 km/h, curvature_y), where curvature_y denotes a value in the curvature list [250, 1200] : 50. The same logic applies to the speed variants of 100 km/h and 80 km/h.

The parameters used in the training session were:
gamma (discount factor) = 0.99
gae_lambda = 0.95
image stack = 4
max_grad_norm = 0.5
epoch = 10
batch_size = 128
learning rate = 1e-3
value_coef = 0.5
entropy_coef = 0.01

The training infrastructure used an AMD Ryzen R9-4900HS CPU, 16 GB of DRAM, and an Nvidia RTX 2060 Max-Q GPU with cuDNN 10.2. PyTorch was used to build the neural network models.

The architecture of the convolutional neural network is illustrated in Figure 7. It consists of 6 convolutional layers. From the left, the input RGB image is 96 × 96 pixels. The grass in the picture was removed to reduce complexity, since the grass is not crucial in our case, and the RGB image was converted to a single gray channel to further reduce the input dimension from three channels to one. Every four frames were used to generate actions; therefore, the input to the neural network was 96 × 96 × 4, where 4 denotes the four consecutive frames. This is followed by convolutional layers of 47 × 47 × 8, 23 × 23 × 16, 11 × 11 × 32, 5 × 5 × 64, 3 × 3 × 128, and 1 × 1 × 256 [26]. ReLU was used as the activation function, and Adam [27] was used as the optimizer. Mean squared error loss was used to minimize the difference between the predicted value and the actual value of each state, and the clipped loss function was used to limit the probability change that may occur in a single step.

[Figure 7: Neural network architecture used in training, from image color handling to neural network design.]
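A PyTorch sketch of the preprocessing and the six-layer backbone described above follows. The kernel sizes and strides are assumptions chosen so that the stated feature-map sizes are reproduced (they are not necessarily the authors' exact settings), and the grass-removal step is omitted for brevity.

    import numpy as np
    import torch
    import torch.nn as nn

    def preprocess(rgb_frame):
        # Convert one 96x96 8-bit RGB frame to a single gray channel in [-1, 1].
        # Grass removal (masking the green background) is omitted here.
        gray = np.dot(rgb_frame[..., :3], [0.299, 0.587, 0.114])
        return gray / 128.0 - 1.0

    class Backbone(nn.Module):
        # Six convolutional layers mapping a 4-frame stack (96x96x4) to a
        # 256-dimensional feature, matching the shapes quoted in the text:
        # 96x96x4 -> 47x47x8 -> 23x23x16 -> 11x11x32 -> 5x5x64 -> 3x3x128 -> 1x1x256.
        def __init__(self):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(4,   8, kernel_size=4, stride=2), nn.ReLU(),    # 47x47x8
                nn.Conv2d(8,  16, kernel_size=3, stride=2), nn.ReLU(),    # 23x23x16
                nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),    # 11x11x32
                nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),    # 5x5x64
                nn.Conv2d(64, 128, kernel_size=3, stride=1), nn.ReLU(),   # 3x3x128
                nn.Conv2d(128, 256, kernel_size=3, stride=1), nn.ReLU(),  # 1x1x256
            )

        def forward(self, x):                           # x: (batch, 4, 96, 96)
            return self.cnn(x).flatten(start_dim=1)     # (batch, 256)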
2.4. Unintended Lateral Attack Injection. To better simulate the unintended lateral attack, a Logitech G29 driving force racing wheel was used as the attack input instead of a pure software button simulation. The G29 provides a 900-degree steering angle; in a real-world scenario, especially in highly automated driving, the steering torque or steering angle is limited to a bounded value due to functional safety requirements. The most significant steering value was implemented for the unintended lateral attack, that is, -1.0 for the left and +1.0 for the right. The attack injection lasted for 100 milliseconds. The short period implies a sudden action and corresponds to the most common calculation cycle from perception to vehicle motion control.

Figure 8 shows the connection between the simulation environment and the attack injection input. In Step 1, the test vehicle runs in CarRacing-v0 using the trained model from Figure 6 and keeps itself driving in the ego lane. In Step 2, a test driver triggers a sudden steering force using the G29, and the simulated unintended lateral attack is passed to the vehicle via the Python SDK in the CarRacing-v0 environment. In Step 3, the vehicle's movement is observed to check whether the PPO algorithm can bring the vehicle back into the lane.

[Figure 8: Simulated unintended lateral attack. An external driving force wheel is used to simulate the attack: the vehicle runs the PPO algorithm to keep itself in the lane, encounters the sudden steering force, and then moves back into the lane.]

The injection was unintended for the vehicle in the CarRacing-v0 environment, which means the vehicle did not know when the injection would occur. Ten test drivers were invited to trigger the unintended lateral attack injection using the G29, and data and video were recorded for analysis.
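Inside the simulation loop, the attack can be modeled as a brief override of the policy's steering command. The sketch below is schematic: read_wheel_steering stands in for whatever racing-wheel SDK call returns the current G29 steering input (an assumption, since the paper does not name the SDK), and the 100 ms window is handled with a wall-clock timer.

    import time

    ATTACK_DURATION_S = 0.1   # 100 ms injection window, as described above

    def step_with_possible_attack(env, policy_action, attack_start,
                                  read_wheel_steering):
        # Apply the policy action, but override the steering channel with the
        # saturated wheel value (-1.0 or +1.0) while the attack window is open.
        # read_wheel_steering() is a placeholder for the racing-wheel SDK call
        # and is assumed to return a value in [-1.0, 1.0].
        action = policy_action.copy()
        attacking = (attack_start is not None and
                     time.monotonic() - attack_start < ATTACK_DURATION_S)
        if attacking:
            action[0] = 1.0 if read_wheel_steering() >= 0.0 else -1.0
        return env.step(action)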
3. Results and Discussion

As outlined in Figure 6, in-lane driving must first be achieved by training, and the trained models are then applied to handle the unintended lateral attack. The training lasted around 2 hours for each scenario: (120 km/h, 650 m), (100 km/h, 400 m), and (80 km/h, 250 m). The training episode versus reward curves and the fitted curves obtained with the logistic fitting method are illustrated in Figure 9. After 200 training episodes, the agent achieved a mean score of 900+ over the next 100 episodes in all three scenarios and reached the "solving" state defined by the CarRacing-v0 leaderboard.

[Figure 9: Training result: episode versus reward. (a) Episode versus reward with turning points. (b) Fitted episode versus reward.]

From Figure 9, three turning points were identified in the training results: (197, 901) in the curve of (120 km/h, 650 m), (159, 1000) in the curve of (100 km/h, 400 m), and (148, 918) in the curve of (80 km/h, 250 m). After these turning points, the rewards stay relatively stable above 900 and achieve a mean value of 900+ over the next 100 episodes. The mean value of these 100 episodes is shown in Figure 10: 927.19 in the scenario of (120 km/h, 650 m), 955.32 in the scenario of (100 km/h, 400 m), and 929.35 in the scenario of (80 km/h, 250 m). All mean values are greater than 900.

[Figure 10: Reward statistics of the 3 scenarios. The mean value of the 100 episodes before the "solving" state.]

The "solving" state was achieved in only 200 training episodes in this paper, whereas the typical number of episodes needed to reach the "solving" state on the CarRacing-v0 leaderboard is 5000, a 25-fold difference. It can be inferred that randomly generated courses increase the complexity of training. To further explore the behavior after 200 training episodes, the result of the (100 km/h, 400 m) scenario over 5000 consecutive training episodes is drawn in Figure 11.

[Figure 11: 5000 consecutive training episodes of the scenario (100 km/h, 400 m). The fluctuating curve is shown in 3 phases.]

The curve was expected to become stable after 200 training episodes; however, it dropped sharply from episode 340 and then started rising again. This recovery ramp cost 1415 episodes, 7 times longer than the first ramp-up in Phase 1. When the reward reached 900+ again in Phase 2, it stayed high for around 1100 episodes and started to drop again at episode 2931. In Phase 3, it cost 502 episodes to reach the "solving" state again at episode 3433, after which the reward stayed high and relatively stable for the remaining episodes. This curve indicates that the reward can still fluctuate over time even after the "solving" state has been reached.

Based on the logic described in Figure 6, the trained models were applied to the other scenarios for the unintended lateral attack tests. The test results are shown in Figure 12. The 120 km/h case is shown in the upper part of Figure 12: the trained model of (120 km/h, 650 m) was applied as the reference model to the curvature variants of 250, 300, 350, 400, 450, 500, 550, 600, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, and 1200 m. All combinations passed the tests except (120 km/h, 450 m), (120 km/h, 400 m), (120 km/h, 350 m), (120 km/h, 300 m), and (120 km/h, 250 m). The same logic was applied to the 100 km/h and 80 km/h scenarios; there, all combinations passed the tests except (100 km/h, 300 m) and (100 km/h, 250 m). The results show that our trained reference models can cover 88% of the unintended lateral attack cases listed in this paper.

[Figure 12: Simulated unintended lateral attack in 3 scenarios: 120 km/h, 100 km/h, and 80 km/h.]

Taking into account the design standard in China listed in Table 1, the trained models represent the worst cases of the different speed variants on the highway. If the model can handle the case of (120 km/h, 650 m), it can theoretically also handle the cases of (120 km/h, curvature_y) where curvature_y is larger than 650 m. According to the design guide in China, the failed combinations (120 km/h, 450 m), (120 km/h, 400 m), (120 km/h, 350 m), (120 km/h, 300 m), (120 km/h, 250 m), (100 km/h, 300 m), and (100 km/h, 250 m) do not exist in the real world. It seems, therefore, that the results can be applied to the standard-designed highways in China.
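For reference, the leaderboard "solving" criterion used throughout this section reduces to a rolling-average check over the recorded episode rewards; a minimal sketch:

    def is_solved(episode_rewards, window=100, threshold=900.0):
        # CarRacing-v0 leaderboard criterion used above: an average reward
        # of 900+ over 100 consecutive episodes.
        if len(episode_rewards) < window:
            return False
        recent = episode_rewards[-window:]
        return sum(recent) / window >= threshold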
4. Conclusions

This paper demonstrated the feasibility of PPO reinforcement learning for keeping the vehicle driving in the lane on the standard-designed highway in China. In addition, PPO can handle the unintended lateral attack and bring the vehicle back into the ego lane in the scenarios of (120 km/h, 500 m to 1200 m), (100 km/h, 350 m to 1200 m), and (80 km/h, 250 m to 1200 m). The results were achieved using the modified CarRacing-v0 simulation environment.

However, this paper trains different models for three different scenarios. This is not the best practice in the real world and may bring an overfitting problem. In the future, the feasibility of using a single model to cover all scenarios on the real-world highway will be studied.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (no. 52131204) and Bosch Automotive Products (Suzhou) Co., Ltd.

References

[1] P. Goyal, S. Batra, and A. Singh, "A literature review of security attack in mobile ad-hoc networks," International Journal of Computer Applications, vol. 9, no. 11, pp. 11-15.
[2] ISO, "ISO 26262-1:2018 Road vehicles - Functional safety - Part 1: Vocabulary," 2018, https://www.iso.org/obp/ui/#iso:std:iso:26262:-1:ed-2:v1:en.
[3] P. Koopman and M. Wagner, "Autonomous vehicle safety: an interdisciplinary challenge," IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 90-96, 2017.
[4] M. Khatun, M. Glaß, and R. Jung, "Scenario-based extended HARA incorporating functional safety & SOTIF for autonomous driving," in Proceedings of the 30th European Safety and Reliability Conference and 15th Probabilistic Safety Assessment and Management Conference, pp. 53-59, Singapore, January 2020.
[5] D. McCandless, "Codebases: millions of lines of code," 2022, https://www.informationisbeautiful.net/visualizations/million-lines-of-code/.
[6] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437-1480, 2015.
[7] M. Dibaei, X. Zheng, K. Jiang et al., "An overview of attacks and defences on intelligent connected vehicles," 2019, https://arxiv.org/abs/1907.07455.
[8] A. Chattopadhyay, K. Lam, and Y. Tavva, "Autonomous vehicle: security by design," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 11, pp. 7015-7029, 2021.
[9] W. Li, L. Bao, Y. Li, H. Si, and Y. Li, "Assessing the transition to low-carbon urban transport: a global comparison," Resources, Conservation and Recycling, vol. 180, Article ID 106179, 2022.
[10] M. Wiering and M. van Otterlo, Reinforcement Learning, Springer, Berlin, Germany, 2012.
[11] D. Silver, A. Huang, C. J. Maddison et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[12] J. Duan, S. Eben Li, Y. Guan, Q. Sun, and B. Cheng, "Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data," IET Intelligent Transport Systems, vol. 14, no. 5, pp. 297-305, 2020.
[13] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, "Decision making of autonomous vehicles in lane change scenarios: deep reinforcement learning approaches with risk awareness," Transportation Research Part C: Emerging Technologies, vol. 134, 2022.
[14] L. Wang, W. Ma, L. Wang, Y. Ren, and C. Yu, "Enabling in-depot automated routing and recharging scheduling for automated electric bus transit systems," Journal of Advanced Transportation, vol. 2021, Article ID 5531063, 15 pages, 2021.
[15] M. Cheng, C. Zhang, H. Jin, Z. Wang, and X. Yang, "Adaptive coordinated variable speed limit between highway mainline and on-ramp with deep reinforcement learning," Journal of Advanced Transportation, vol. 2022, Article ID 2435643, 16 pages, 2022.
[16] Z. Ma, T. Cui, W. Deng, F. Jiang, and L. Zhang, "Adaptive optimization of traffic signal timing via deep reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 6616702, 14 pages, 2021.
[17] L. Zheng, B. Wu, and P. J. Jin, "A reinforcement learning based traffic control strategy in a macroscopic fundamental diagram region," Journal of Advanced Transportation, vol. 2022, Article ID 5681234, 12 pages, 2022.
[18] L. Elmoiz Alatabani, E. Sayed Ali, R. A. Mokhtar, R. A. Saeed, H. Alhumyani, and M. Kamrul Hasan, "Deep and reinforcement learning technologies on internet of vehicle (IoV) applications: current issues and future trends," Journal of Advanced Transportation, vol. 2022, Article ID 1947886, 16 pages, 2022.
[19] S. Wang, S. K. J. Chang, and S. Fallah, "Autonomous bus fleet control using multiagent reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 6654254, 14 pages, 2021.
[20] T. Zhu, X. Li, W. Fan, C. Wang, H. Liu, and R. Zhao, "Trajectory optimization of CAVs in freeway work zone considering car-following behaviors using online multiagent reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 9805560, 17 pages, 2021.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, https://arxiv.org/abs/1707.06347.
[22] Ministry of Transport of the People's Republic of China, Technical Standard of Highway Engineering, China Communications Press Co. Ltd., Beijing, China, 2014.
[23] Ministry of Transport of the People's Republic of China, Design Specification for Highway Alignment, China Communications Press Co. Ltd., Beijing, China, 2018.
[24] M. Kaspar, J. D. M. Osorio, and J. Bock, "Sim2Real transfer for reinforcement learning without dynamics randomization," in Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4383-4388, IEEE, Las Vegas, NV, USA, January 2020.
[25] OpenAI, "CarRacing-v0," 2022, https://github.com/AGiannoutsos/car_racer_gym.
[26] X. Ma, "Reinforcement learning for gym CarRacing-v0 with PyTorch," 2022, https://github.com/xtma/pytorch_car_caring.
[27] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 2015.

Using Reinforcement Learning to Handle the Unintended Lateral Attack in the Intelligent Connected Vehicle Environment

Loading next page...
 
/lp/hindawi-publishing-corporation/using-reinforcement-learning-to-handle-the-unintended-lateral-attack-cf83ZNSXH0

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Hindawi Publishing Corporation
ISSN
0197-6729
eISSN
2042-3195
DOI
10.1155/2023/3187944
Publisher site
See Article on Publisher Site

Abstract

Hindawi Journal of Advanced Transportation Volume 2023, Article ID 3187944, 10 pages https://doi.org/10.1155/2023/3187944 Research Article UsingReinforcementLearningtoHandletheUnintendedLateral Attack in the Intelligent Connected Vehicle Environment 1,2 1 1 1 Luoyi Huang , Wanjing Ma , Ling Wang , and Kun An Te Key Laboratory of Road and Trafc Engineering, Ministry of Education, Tongji University, Shanghai 201804, China Bosch Automotive Products (Suzhou) Co. Ltd., Suzhou 215025, China Correspondence should be addressed to Wanjing Ma; mawanjing@tongji.edu.cn Received 16 August 2022; Revised 15 October 2022; Accepted 17 March 2023; Published 21 April 2023 Academic Editor: Wenxiang Li Copyright © 2023 Luoyi Huang et al. Tis is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. It is widely accepted that an unintended lateral attack is inevitable in the intelligent connected vehicle environment. Tis paper explores the feasibility of a reinforcement learning method named PPO (Proximal Policy Optimization) to handle the unintended lateral attack and keep the vehicle stay in the ego lane. Based on the China highway design guide, the discrete speed variants of 120 km/h, 100 km/h, and 80 km/h were selected, along with diferent curvatures ranging from 250 m to 1200 m in every 50 m as combinations of speed-curvature test. Te tests were implemented in the Open.ai CarRacing-v0 simulation environment with an external racing wheel attached to simulate the unintended lateral attack. Te simulation results show that the PPO can handle the unintended lateral attack on the standard-designed highway in China. Te results can be applied to the intelligent connected vehicle to be mass-produced in the future. severity, probability of exposure, and controllability. As of 1.Introduction today, there is no fully autonomous vehicle that end-users can buy in the market, and one reason is that absolute safety Intelligent connected vehicle is shaping the automotive industry. It allows the vehicle to communicate with other cannot be proved in a commonly accepted way. trafc participants. In a connected environment, the attack is Functional safety cares about the Electric/Electronic inevitable. It has many possible ways to handle the identifed malfunction behavior of the vehicle. Its practice has evolved attack from a security perspective [1]. However, it has been for many years and has already been applied in mass- found to be only a few studies on the unintended attack from produced vehicles. SOTIF, which stands for Safety of the a functional safety perspective. Intended Functionality, is a logical supplement to the Along with the rapid development of the intelligent established functional safety standard ISO 26262. SOTIF connected vehicle, safety is becoming an important issue to deals with the functional limitation of the vehicle concerning consider. Automotive Functional Safety (ISO 26262, Road the absence of unreasonable risk due to hazards resulting vehicles-Functional safety [2]) has become the de facto from functional insufciency of the intended functionality practice for intelligent connected vehicle to be produced in together with the reasonably foreseeable misuse by persons the market. ISO 26262 generally gives a system credit for [4]. SOTIF is currently difcult to quantify. 
Take the example a human driver ultimately being responsible for safety, of lines of source code. Air force F-22 has around 1.7 million which consists of three evaluation factors: severity, the lines of source code, while Boeing 787 has 6.5 million lines probability of exposure, and controllability [3]. ISO 26262 is and air force F-36 has 24 million lines. Compared with these explicitly targeted for automotive safety, providing a safety examples, the luxury vehicle already mass-produced in the lifecycle that includes development, production, operation, market has 100 million lines [5]. Based on the experts’ as- service, and decommissioning. ISO 26262 defnes the ASIL sumption, the lines of source code of autonomous vehicle (Automotive Safety Integrity Level). ASIL is calculated from will increase exponentially considering the complexity of the 2 Journal of Advanced Transportation functionality. With this tremendous amount of source code T increased, there is a higher risk of an autonomous vehicle to be failed in some corner cases. In addition, with the in- troduction of deep learning technology, it can be more challenging to bring unknown compared to the traditional I B Vee model development. To address this, Garc´ıa and Fernandez ´ [6] introduced the Safe Reinforcement Learning, which was defned as the process of learning policies that Figure 1: Reinforcement learning model. Te vehicle interacts with maximize the expectation of the return in problems in which the environment via trial-and-error method. it is crucial to ensure reasonable system performance and/or respect safety constraints during the learning and/or de- a hot research area [18–20]. Whatever single-agent or ployment processes. multiagent reinforcement learning, there are some basic and Besides safety, security is also an essential factor to commonly used reinforcement learning methods like DQN consider in the intelligent connected vehicle since intelligent (Deep Q Networks), PG (Policy Gradient), DDPG (Deep connected vehicle connect with other vehicles, in- frastructure, and the cloud. A vehicle is no longer an isolated Deterministic Policy Gradient), TD3 (Twin Delayed DDPG), SAC (Soft Actor Critic), and A2C (Advantage Actor Critic). object in an intelligent connected vehicle environment. Along with the enriched functionality enabled by connec- In 2017, OpenAI published a novel objective function that enables multiple epochs of minibatch updates named PPO tivity, the vehicle opens the attack interface for external resources. Dibaei et al. [7] summarized common attack (Proximal Policy Optimization) [21], a family of policy optimization methods, which achieved a favorable balance methods in the intelligent connected vehicle environment, mainly including DoS (Denial of Service), DDoS (Distrib- between sample complexity, simplicity, and wall-time. Te PPO algorithm is illustrated as follows: uted Denial-of-service), black-hole attack, replay attack, Sybil attack, impersonation attack, malware, falsifed in- for iteration � 1, 2, . . . , do formation attack, and timing attack. Even with modern for actor � 1, 2, . . . , N do encryption technology, security vulnerabilities can still be found in the automotive industry. Chattopadhyay et al. [8] Run policy π in environment for T timesteps old found that the security-by-design principle for autonomous 􏽢 􏽢 Compute advantage estimates A , . . . , A 1 T vehicle is poorly understood and rarely practiced. 
Te in- end for telligent connected vehicle is prone to attack, and un- Optimize surrogate L wrt θ, with K epochs and intended attack is inevitable. Te assumption was made in minibatch size M≤ NT this paper that attack exists and cannot be eliminated. In addition, the feet-free longitudinal function has been widely θ ←θ old studied and released in the market for years like adaptive end for cruise control, so the unintended lateral attack was focused PPO uses two neural networks: the policy π(s) and the on in this paper. value function V(s). Te policy π(s) maps an observation s Te unintended lateral attack is not only an automotive t to an action a , while the value function V(s) maps an issue, but it can also cause an environmental problem and observation s to a scalar value showing how advantageous it become a barrier to achieving low-carbon transportation [9]. t is to be in that state. Te value network estimates the value of Terefore, research on “how to handle the unintended lateral each state by minimizing the error between the predicted attack” is a must to reach the future intelligent trans- value and the actual value. Te policy network uses the portation systems. One possible way to handle unintended estimate of value function to select actions that lead to higher lateral attack is reinforcement learning. Reinforcement rewards. learning is being used by an agent to learn behavior through Resulting from these considerations, the remainder of trial-and-error interactions with the environment [10]. A this paper is organized as follows. Section 2 introduces the standard reinforcement learning model is shown in Figure 1. methods used in this study, including test scenario design, An agent is connected to the environment through simulation environment construction, training procedure, perception and action. At each step of the interaction, the and attack injection logic. Section 3 presents the positive agent receives an input: i, with the indication of the current simulation results and illustrates the efectiveness of our state: s, selects an action: a, generates an output. Te action method. Section 4 concludes the fndings and identifes open changes the state of the environment, and the value of this areas of research for future work. state transition is communicated to the agent via a scalar: r. Te agent’s behavior B, is targeting to select actions that can increase the long-run sum of rewards. Te agent can learn to 2.Methods do this over time through trial-and-error interactions. In recent years, reinforcement learning has been applied in the Te test scenarios were defned based on the standard of the game of Go [11], highly automated driving [12–14], trafc highway in China. A modifed CarRacing-v0 simulation signal control [15–17], and has proven its efectiveness. environment was used to generate the test scenarios and Meanwhile, multi-agent reinforcement learning is becoming provide a secondary development interface for Journal of Advanced Transportation 3 Table 1: Design of curvature and lane width in “Design Specif- reinforcement learning implementation. Te PPO algo- cation for Highway Alignment.” rithm was then applied to the selected scenarios for training; afterwards, the trained models were used to infer Limit curvature value the rest of the test scenarios. Te unintended lateral attack (m), with I � 8%, Design speed Lane width was simulated by attaching an external driving force where I donates (km/h) (m) racing wheel. maximum superelevation value 120 650 3.75 2.1. 
Test Scenario Design. Te test scenario is the combi- 100 400 3.75 nations of speed and curvature on the highway. Based on the 80 250 3.75 “Technical Standard of Highway Engineering” [22] and 60 125 3.5 “Design Specifcation for Highway Alignment” [23], see 40 60 3.5 Table 1. Te most common speed limits on the highway in China are 120 km/h on the standard-designed highway, 100 km/h on the class-1 highway, and 80 km/h on the class-2 Due to the limitation of the reward calculation in highway. A design speed of less than 80 km/h is not CarRacing-v0, the vehicle’s exact position and the distance a standard highway in China. Terefore, the minimum speed between the vehicle and the lane markers are unknown to us. considered in this paper is 80 km/h. For curvature, four Terefore, in this paper, the following criteria were defned specifc numbers were identifed as follows: 250 m, 400 m, to decide whether reinforcement learning can handle the 650 m, and 1200 m. Te reasons for selecting these numbers unintended attack or not. are as follows: (i) PASS: the vehicle can move back into the ego lane (i) 250 m: the minimum curvature on the standard after the unintended attack; 10 out of 10 succeed highway in China. Tis usually appears at on-ramp (ii) FAIL: the vehicle cannot move back into the ego lane and of-ramp and leaves the lane ultimately; >1 out of 10 failed (ii) 400 m: the minimum curvature of highway, which Te version of CarRacing-v0 is 0.18.3, which was re- has a speed limit of 100 km/h leased in May 2021. Te course shape, lane width, and (iii) 650 m: the minimum curvature of highway, which traveling speed were modifed accordingly to meet our test has a speed limit of 120 km/h requirements. Te course in original CarRacing-v0 was (iv) 1200 m: the threshold of curvature, which can cover randomly generated for reinforcement training and testing. 95% of highway in the Yangtze River area in China In this paper, the source code was modifed and recompiled to generate the fxed shape of the course. Figure 2 shows the Te lane width was set to 3.75 m in our test scenario randomly generated courses and Figure 3 shows the fxed considering the speed variants were 120 km/h, 100 km/h, shape generation after code modifcation. and 80 km/h. Te test matrix was defned in Table 2. In addition to the course shape, the lane width was Take the 100 km/h case for example, curvatures were changed to 3.75 m according to the needs of our study, which selected from the range [250, 1200] in every 50 m, thus the is shown in Figure 4. following list of curvature can be derived: (250, 300, 350, 400, CarRacing-v0 provides the observation and action 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, control interfaces in a Box manner. Box represents the 1050, 1100, 1150, 1200). Cartesian product of n closed intervals, it is the specifc type defned by OpenAI gym. Te reward range of CarRacing-v0 2.2. Simulation Environment Construction. Simulation is is (−inf, inf). Te actions are in a discrete vector shown in commonly used as an environmental tool to train re- Table 3. inforcement learning algorithms. Te results have the po- With the diferent combinations of the action elements, tential to be transferred to solve real-world problems [24]. In the following action spaces can be derived in Table 4. this paper, CarRacing-v0 was selected as the simulation To simplify the action and avoid causing the vehicle to environment. 
CarRacing-v0 is a reinforcement learning drift, only the single action from each action space was environment developed by OpenAI [25] to support con- selected, marked in bold in Table 4. Tey are steer left: [− 1, 0, tinuous control. It provides a bird-view racing environment 0], no action: [0, 0, 0], accelerate: [0, 0, 1], brake: [0, 0.5, 0], that fts well for the future infrastructure-supported auto- and steer right: [1, 0, 0]. In addition, to better monitor the mated driving environment, which difers from the in- parameters in real time, the label texts were added in the vehicle sensing perspective. CarRacing-v0 provides a state, following order, from left to right: reward, ABS sensor, which consists of 96 × 96 pixels. speed, wheel angle, and angular velocity, which is shown in Te CarRacing-v0 reward is −0.1 every frame and Figure 5. + 1000/N for every track tile visited, where N is the total number of tiles in track. According to the example on the ofcial website, if we have fnished in 732 frames, the reward 2.3. Training Logic and Parameter Setting. Te combination is 1000 − 0.1 × 732 � 926.8 points. Te episode fnishes space of speed versus curvature is too large; thus, the fol- when all tiles are visited. lowing combinations were selected for initial training: 4 Journal of Advanced Transportation Table 2: Test matrix between speed and curvature. Speed (km/h) Curvature (m) 120 Range [250, 1200] : 50, i.e., select curvature from 250 m–1200 m in every 50 m 100 Range [250, 1200] : 50, i.e., select curvature from 250 m–1200 m in every 50 m 80 Range [250, 1200] : 50, i.e., select curvature from 250 m–1200 m in every 50 m (a) (b) Figure 2: Randomly generated courses in CarRacing-v0. (a) Random course generation example 1. (b) Random course generation example 2. (a) (b) Figure 3: Generation of fxed shapes after code modifcation. (a) Fixed curvature of 500 m. (b) Fixed curvature of 1700 m. (a) (b) Figure 4: Lane width modifcation. (a) Original lane width. (b) Modifed lane width to 3.75 m. Table 3: Action element. Action element Action type Data range Meaning of vector 1 Steering [−1, 0, 1] [Steer left, no action, steer right] 2 Braking [0, 0.5] [No action, brake] 3 Acceleration [0, 1] [No action, accelerate] (120 km/h, 650 m), (100 km/h, 400 m), and (80 km/h, 250 m). CarRacing-v0 leaderboard, the “solving” is defned as getting Te training and inference logic is illustrated in Figure 6. the average reward of 900+ over 100 consecutive episodes, In the training session shown on the left-hand side in which indicates that the reinforcement learning based in- Figure 6, the three separated models for (120 km/h, 650 m), lane driving has been achieved. After that, the training (100 km/h, 400 m), and (80 km/h, 250 m) were trained as the session fnished. In the inference session, the trained model base. According to the defnition-of-done from the from (120 km/h, 650 m) was used to test the variant of Journal of Advanced Transportation 5 Table 4: Action space. Inference for following combinations Action space Action combination (120 km/h, 250 m) 1 [−1, 0, 0] (120 km/h, 300 m) 2 [–1, 0, 1] (120 km/h, ... m) 3 [–1, 0.5, 0] Trained model of (120 km/h, 1,200 m) following combinations 4 [–1, 0.5, 1] 5 [0, 0, 0] (120 km/h, 650 m) (100 km/h, 250 m) 6 [0, 0, 1] (100 km/h, 300 m) 7 [0, 0.5, 0] (100 km/h, 400 m) (100 km/h, ... m) 8 [0, 0.5, 1] (100 km/h, 1,200 m) (80 km/h, 250 m) 9 [1, 0, 0] 10 [1, 0, 1] 11 [1, 0.5, 0] (80 km/h, 250 m) (80 km/h, 300 m) 12 [1, 0.5, 1] (80 km/h, ... 
m) (80 km/h, 1,200 m) Figure 6: Training and inference logic. Train the reference model as a base and apply the reference model to diferent scenarios. Te parameters used in the training session: gamma (discount factor) � 0.99 gae_lambda � 0.95 image stack � 4 max_grad_norm � 0.5 epoch � 10 batch_size � 128 learning rate � 1e − 3 0152 0079 4.20 2.57 value_coef � 0.5 ABS Wheel Angular Reward Speed sensor angle velocity entropy_coef � 0.01 Figure 5: Modifed display of parameters in real time. Showing Te training infrastructure was using AMD Ryzen R9- reward/ABS sensor/speed/wheel angle/angular velocity 4900HS, 16G DRAM, and Nvidia RTX 2060 Max-Q with information. cudnn 10.2. PyTorch was used to generate the neural net- work models. (120 km/h, curvature ), where curvature donates the y y number in the curvature list [250, 1200] : 50. Te same logic 2.4. Unintended Lateral Attack Injection. To better simulate applies to the speed variants of 100 km/h and 80 km/h. the unintended lateral attack, a Logitech G29 driving force Te architecture of the convolutional neural network is racing wheel was used as the attack input instead of using illustrated in Figure 7. Te architecture consists of 6 con- pure software button simulation. G29 provides a 900-degree volutional layers. From the left, the input of RGB image is of steering angle, in a real-world scenario, especially in highly 96 × 96 pixels. Te grass in the picture was removed to re- automated driving, the steering torque or steering angle is duce the complexity since the grass is not crucial in our case. limited to a value due to functional safety requirements. Te Te RGB image was converted to a single gray channel to most signifcant steering value was implemented for un- further reduce the input dimension from three to one. intended lateral attack, that is, −1.0 for the left and +1.0 for Every four frames were used to generate actions. the right. Te attack injection lasted for 100 milliseconds. Terefore, the input to the neural network was 96 × 96 × 4, Te short period implies a sudden action and indicates the where 4 denotes the four continuous frames. Followed by the most-common calculation cycle from perception to vehicle convolutional layers of 47 × 47 × 8, 23 × 23 × 16, motion control. 11 × 11 × 32, 5 × 5 × 64, 3 × 3 × 128, and 1 × 1 × 256 [26], Figure 8 shows the connection between the simulation ReLU was used as the activation function. Kingma and Ba environment and the attack injection input. In Step 1, the [27] was used as the optimizer. Mean squared error loss was test vehicle was running in the CarRacing-v0 using the used to optimize the diference between the predicted value trained model in Figure 6. Te vehicle can keep itself to drive and the actual value of each state. Te clipped loss function in the ego lane. In Step 2, a test driver triggered a sudden was used to limit the probability change that may occur in steering force using the G29, the simulated unintended a single step. lateral attack was passed to the vehicle via python SDK in the 6 Journal of Advanced Transportation Remove green grass Covert to gray 96x96x4 47x47x8 23x23x16 11x11x32 5x5x64 3x3x128 1x1x256 Figure 7: Neural network architecture used in training from image color handling to neural network design. 
Vehicle is running PPO algorithm to keep the vehicle in the lane Sudden steering force 0131 0079 0.00 0.00 Simulated unintended lateral attack Vehicle encounters unintended lateral 0152 0079 4.20 2.57 attack Vehicle is moving back in the lane 0174 0080 0.00 0.01 Figure 8: Simulated unintended lateral attack. Use external driving force to simulate the unintended lateral attack. CarRacing-v0 environment. In Step 3, the vehicle’s move- 400 m), and (148, 918) in the curve of (80 km/h, 250 m). After ment was observed to check whether the PPO algorithm these turning points, the rewards can reach a relatively stable could bring the vehicle back in the lane. number above 900 and achieve a mean value of 900+ over Te injection was unintended for the vehicle in the the next 100 episodes. Te mean value of these 100 episodes is calculated in Figure 10. CarRacing-v0 environment, which means the vehicle did not know when the injection would occur. Ten test drivers were In the scenario of (120 km/h, 650 m), the mean value invited to trigger the unintended lateral attack injection reaches 927.19. In the scenario of (100 km/h, 400 m), the using G29, data and video were recorded for analysis. mean value reaches 955.32, and in the scenario of (80 km/h, 250 m), the mean value reaches 929.35. All mean values are greater than 900. 3.Results and Discussion Te “solving” state was achieved in only 200 training As outlined in Figure 6, in-lane driving must frst be episodes in this paper. Te typical number is compared to achieved by training and models must be applied to handle reach the “solving” state from the CarRacing-v0 leaderboard: 5000, which has 25 times diference. It could be inferred that unintended lateral attack. Te training lasted for around 2 hours for each scenario: (120 km/h, 650 m), (100 km/h, randomly generated courses increased the complexity of the training. To further explore the situation after 200 training 400 m), and (80 km/h, 250 m). Te training episode versus reward and the ftted curve using the logistic ftting method episodes, the result of the (100 km/h, 400 m) scenario in are illustrated in Figure 9. After 200 training episodes, the a consecutive 5000 training episodes was drawn in Figure 11. agent achieved a mean score of 900+ over the next 100 Te curve was expected to become stable after 200 episodes in the three scenarios, reached the “solving” state training episodes; however, the curve sharply dropped from defned by the CarRacing-v0 leaderboard. episode 340 and started rising again. Tis recovered training From Figure 9, three turning points were identifed from ramp cost 1415 episodes, increased 7 times compared with the training results; they are (197, 901) in the curve of the frst ramp-up in Phase 1. When the reward reached 900+ (120 km/h, 650 m), (159, 1000) in the curve of (100 km/h, again in Phase 2, the reward stayed high for around 1100 Journal of Advanced Transportation 7 (159, 1000) (148, 918) 800 (197, 901) 0 50 100 150 200 250 Episode Reward (80 km/h, 250 m) Reward (100 km/h, 400 m) Reward (120 km/h, 650 m) (a) 0 50 100 150 200 250 Episode Reward (80 km/h, 250 m) Poly. (Reward (100 km/h, 400 m)) Poly. (Reward (80 km/h, 250 m)) Reward (120 km/h, 650 m) Reward (100 km/h, 400 m) Poly. (Reward (120 km/h, 650 m)) (b) Figure 9: Training result: episode versus reward. (a) Episode versus reward with turning points. (b) Fitted episode versus reward. episodes and started to drop again at episode 2931. In Phase 300 m), and (120 km/h, 250 m). 
The curve was expected to remain stable after 200 training episodes; however, it dropped sharply from episode 340 and then started rising again. This recovery ramp took 1,415 episodes, about 7 times longer than the first ramp-up in Phase 1. When the reward reached 900+ again in Phase 2, it stayed high for around 1,100 episodes and started to drop again at episode 2,931. In Phase 3, it took 502 episodes to reach the "solving" state again at episode 3,433, after which the reward stayed high and relatively stable for the rest of the run. This curve indicates that the reward can still fluctuate over time even after the "solving" state is reached.

Figure 11: Consecutive 5,000 training episodes of the scenario (100 km/h, 400 m). The fluctuating curves are shown in three phases.

Based on the logic described in Figure 6, the trained models were applied to other scenarios for the unintended lateral attack tests, and the test results are shown in Figure 12. The 120 km/h case is shown in the upper part of Figure 12: the trained model of (120 km/h, 650 m) was applied as the reference model to the curvature variants of 250, 300, 350, 400, 450, 500, 550, 600, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, and 1200 m for the unintended lateral attack tests. All combinations passed the tests except (120 km/h, 450 m), (120 km/h, 400 m), (120 km/h, 350 m), (120 km/h, 300 m), and (120 km/h, 250 m). The same logic was applied in the scenarios of 100 km/h and 80 km/h, where all combinations passed the tests except (100 km/h, 300 m) and (100 km/h, 250 m). The results show that our trained reference models can cover 88% of the unintended lateral attacks listed in this paper.

Figure 12: Simulated unintended lateral attack in three scenarios: 120 km/h, 100 km/h, and 80 km/h.

Taking into account the design standard in China listed in Table 1, the trained models represent the worst cases of the different speed variants on the highway, i.e., the smallest curve radii permitted for each design speed. If the model can handle the case of (120 km/h, 650 m), it can theoretically also handle the cases of (120 km/h, curvature_y) where curvature_y is larger than 650 m. According to the design guide in China, the combinations of the failed cases (120 km/h, 450 m), (120 km/h, 400 m), (120 km/h, 350 m), (120 km/h, 300 m), (120 km/h, 250 m), (100 km/h, 300 m), and (100 km/h, 250 m) do not exist in the real world. It seems, therefore, that the results can be applied to the standard-designed highways in China.
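The transfer tests just described amount to sweeping each trained reference model across the curvature list. The sketch below is purely illustrative of that loop: load_reference_model, run_attack_test, and the stub pass criterion are hypothetical placeholders for the simulation harness, not an interface published with this paper.

CURVATURES_M = list(range(250, 1201, 50))            # 250 m .. 1,200 m in 50 m steps
REFERENCE_CURVATURE = {120: 650, 100: 400, 80: 250}  # trained reference model per speed (km/h)

def sweep_curvatures(speed_kmh, load_reference_model, run_attack_test):
    """Return {(speed, curvature): passed} for one speed variant."""
    model = load_reference_model(speed_kmh, REFERENCE_CURVATURE[speed_kmh])
    return {
        (speed_kmh, curvature): run_attack_test(model, speed_kmh, curvature)
        for curvature in CURVATURES_M
    }

if __name__ == "__main__":
    # Stub callables so the sketch runs standalone; replace them with real test hooks.
    results = sweep_curvatures(
        120,
        load_reference_model=lambda speed, curvature: f"model_{speed}_{curvature}",
        run_attack_test=lambda model, speed, curvature: curvature >= 500,  # placeholder criterion
    )
    print(sum(results.values()), "of", len(results), "combinations passed")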
4. Conclusions

This paper demonstrated the feasibility of PPO reinforcement learning for keeping the vehicle driving in its lane on the standard-designed highway in China. In addition, PPO can handle the unintended lateral attack and bring the vehicle back into the ego lane in the scenarios of (120 km/h, 500 m to 1,200 m), (100 km/h, 350 m to 1,200 m), and (80 km/h, 250 m to 1,200 m). The results were achieved using the modified CarRacing-v0 simulation environment.

However, this paper trains a different model for each of the three scenarios. This is not the best practice for the real world and may introduce an overfitting problem. In the future, the feasibility of using a single model to cover all scenarios on real-world highways will be studied.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (no. 52131204) and Bosch Automotive Products (Suzhou) Co., Ltd.

References

[1] P. Goyal, S. Batra, and A. Singh, "A literature review of security attack in mobile ad-hoc networks," International Journal of Computer Application, vol. 9, no. 11, pp. 11–15.
[2] ISO, "ISO 26262-1:2018 Road vehicles - functional safety - Part 1: vocabulary," 2018, https://www.iso.org/obp/ui/#iso:std:iso:26262:-1:ed-2:v1:en.
[3] P. Koopman and M. Wagner, "Autonomous vehicle safety: an interdisciplinary challenge," IEEE Intelligent Transportation Systems Magazine, vol. 9, no. 1, pp. 90–96, 2017.
[4] M. Khatun, M. Glaß, and R. Jung, "Scenario-based extended HARA incorporating functional safety & SOTIF for autonomous driving," in Proceedings of the 30th European Safety and Reliability Conference and 15th Probabilistic Safety Assessment and Management Conference, pp. 53–59, Singapore, January 2020.
[5] D. McCandless, "Codebases: millions of lines of code," 2022, https://www.informationisbeautiful.net/visualizations/million-lines-of-code/.
[6] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
[7] M. Dibaei, X. Zheng, K. Jiang et al., "An overview of attacks and defences on intelligent connected vehicles," 2019, https://arxiv.org/abs/1907.07455.
[8] A. Chattopadhyay, K. Lam, and Y. Tavva, "Autonomous vehicle: security by design," IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 11, pp. 7015–7029, 2021.
[9] W. Li, L. Bao, Y. Li, H. Si, and Y. Li, "Assessing the transition to low-carbon urban transport: a global comparison," Resources, Conservation and Recycling, vol. 180, Article ID 106179, 2022.
[10] M. Wiering and M. van Otterlo, Reinforcement Learning, Springer, Berlin, Germany, 2012.
[11] D. Silver, A. Huang, C. J. Maddison et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[12] J. Duan, S. Eben Li, Y. Guan, Q. Sun, and B. Cheng, "Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data," IET Intelligent Transport Systems, vol. 14, no. 5, pp. 297–305, 2020.
[13] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, "Decision making of autonomous vehicles in lane change scenarios: deep reinforcement learning approaches with risk awareness," Transportation Research Part C: Emerging Technologies, vol. 134, 2022.
[14] L. Wang, W. Ma, L. Wang, Y. Ren, and C. Yu, "Enabling in-depot automated routing and recharging scheduling for automated electric bus transit systems," Journal of Advanced Transportation, vol. 2021, Article ID 5531063, 15 pages, 2021.
[15] M. Cheng, C. Zhang, H. Jin, Z. Wang, and X. Yang, "Adaptive coordinated variable speed limit between highway mainline and on-ramp with deep reinforcement learning," Journal of Advanced Transportation, vol. 2022, Article ID 2435643, 16 pages, 2022.
[16] Z. Ma, T. Cui, W. Deng, F. Jiang, and L. Zhang, "Adaptive optimization of traffic signal timing via deep reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 6616702, 14 pages, 2021.
[17] L. Zheng, B. Wu, and P. J. Jin, "A reinforcement learning based traffic control strategy in a macroscopic fundamental diagram region," Journal of Advanced Transportation, vol. 2022, Article ID 5681234, 12 pages, 2022.
[18] L. Elmoiz Alatabani, E. Sayed Ali, R. A. Mokhtar, R. A. Saeed, H. Alhumyani, and M. Kamrul Hasan, "Deep and reinforcement learning technologies on internet of vehicle (IoV) applications: current issues and future trends," Journal of Advanced Transportation, vol. 2022, Article ID 1947886, 16 pages, 2022.
[19] S. Wang, S. K. J. Chang, and S. Fallah, "Autonomous bus fleet control using multiagent reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 6654254, 14 pages, 2021.
[20] T. Zhu, X. Li, W. Fan, C. Wang, H. Liu, and R. Zhao, "Trajectory optimization of CAVs in freeway work zone considering car-following behaviors using online multiagent reinforcement learning," Journal of Advanced Transportation, vol. 2021, Article ID 9805560, 17 pages, 2021.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, https://arxiv.org/abs/1707.06347.
[22] Ministry of Transport of the People's Republic of China, Technical Standard of Highway Engineering, China Communications Press Co. Ltd, Beijing, China, 2014.
[23] Ministry of Transport of the People's Republic of China, Design Specification for Highway Alignment, China Communications Press Co. Ltd, Beijing, China, 2018.
[24] M. Kaspar, J. D. M. Osorio, and J. Bock, "Sim2Real transfer for reinforcement learning without dynamics randomization," in Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4383–4388, IEEE, Las Vegas, NV, USA, January 2020.
[25] OpenAI, "CarRacing-v0," 2022, https://github.com/AGiannoutsos/car_racer_gym.
[26] X. Ma, "Reinforcement learning for gym CarRacing-v0 with PyTorch," 2022, https://github.com/xtma/pytorch_car_caring.
[27] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, December.
