We study the robustness of deep reinforcement learning agents when their state observations (e.g., measurements of environment states) contain noise or even adversarial perturbations. For example, a self-driving car can observe its location through GPS, but the GPS signal contains uncertainty, and the driving policy must take this uncertainty into account to plan routes safely. To guarantee performance even under worst-case uncertainty, we study the fundamental state-adversarial Markov decision process (SA-MDP) and propose theoretically principled robustness regularizers for PPO, DDPG and DQN. In addition, we propose two strong adversarial attacks for PPO and DDPG: the maximal action difference (MAD) attack and the Robust Sarsa (RS) attack (a sketch of the MAD attack appears below the reference). More details can be found in our paper:
Huan Zhang*, Hongge Chen*, Chaowei Xiao, Mingyan Liu, Bo Li, Duane Boning, and Cho-Jui Hsieh, "Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations", NeurIPS 2020 (Spotlight). (*Equal contribution)
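To give a flavor of how the MAD attack works, below is a minimal PGD-style sketch for a deterministic policy (the DDPG setting), written in PyTorch. The function name `mad_attack_det`, its defaults, and the step-size schedule are our own illustrative choices; the stochastic-policy variant in the paper maximizes a divergence between the clean and perturbed action distributions rather than a squared action difference. See the repositories linked below for the reference implementations.

```python
import torch

def mad_attack_det(policy, s, eps, steps=10):
    """Illustrative Maximal Action Difference (MAD) attack for a
    deterministic policy pi(s) -> a: search the L_inf ball of radius
    eps around s for a perturbed state whose action differs most from
    the action taken at the clean state."""
    step_size = 2.0 * eps / steps
    with torch.no_grad():
        a_clean = policy(s)                  # action at the unperturbed state
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        # objective to ascend: squared action difference
        diff = ((policy(s_adv) - a_clean) ** 2).sum()
        grad, = torch.autograd.grad(diff, s_adv)
        with torch.no_grad():
            s_adv = s_adv + step_size * grad.sign()                # PGD step
            s_adv = torch.min(torch.max(s_adv, s - eps), s + eps)  # project to eps-ball
        s_adv = s_adv.detach()
    return s_adv
```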
Please also check out our new work on the optimal adversary and alternating training of a learned adversary and agent (ATLA) for robust RL agents:
Huan Zhang*, Hongge Chen*, Duane Boning, and Cho-Jui Hsieh, "Robust Reinforcement Learning on State Observations with Learned Optimal Adversary", ICLR 2021. (*Equal contribution) (Code)
In our paper, we conduct experiments on deep Q-network (DQN) for discrete action space environments (e.g., Atari games), deep deterministic policy gradient (DDPG) for continuous action space environments with deterministic actions, and proximal policy optimization (PPO) for continuous action space environments with stochastic actions. Our proposed algorithms (SA-DQN, SA-DDPG and SA-PPO) are evaluated on 11 environments (a sketch of the robustness-regularizer idea appears after the repository links below).
Reference implementation for SA-DQN can be found at https://github.com/chenhongge/SA_DQN.
Reference implementation for SA-PPO can be found at https://github.com/huanzhang12/SA_PPO.
Reference implementation for SA-DDPG can be found at https://github.com/huanzhang12/SA_DDPG.
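To illustrate the idea behind the robustness regularizer, here is a hedged SA-DQN-style sketch in PyTorch. For simplicity, the inner maximization over the ε-ball is approximated with a few PGD steps; the SA-DQN results reported below instead bound this maximization soundly with convex relaxation. The function `sa_dqn_regularizer` and its defaults are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def sa_dqn_regularizer(q_net, s, eps, steps=5):
    """Illustrative SA-DQN-style hinge regularizer: encourage the greedy
    action at state s to remain greedy for every perturbed state in the
    eps-ball. The inner maximization is approximated by PGD here (the
    paper's convex relaxation gives an upper bound instead)."""
    with torch.no_grad():
        a_star = q_net(s).argmax(dim=1, keepdim=True)   # greedy actions, shape (B, 1)
    step_size = 2.0 * eps / steps
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        q = q_net(s_adv)
        # margin by which the best *other* action beats the greedy one
        q_star = q.gather(1, a_star)
        q_other = q.scatter(1, a_star, float('-inf')).max(dim=1, keepdim=True).values
        margin = (q_other - q_star).sum()
        grad, = torch.autograd.grad(margin, s_adv)       # ascend the margin
        with torch.no_grad():
            s_adv = s_adv + step_size * grad.sign()
            s_adv = torch.min(torch.max(s_adv, s - eps), s + eps)
        s_adv = s_adv.detach()
    q = q_net(s_adv)                                     # Q values at the worst found state
    q_star = q.gather(1, a_star)
    q_other = q.scatter(1, a_star, float('-inf')).max(dim=1, keepdim=True).values
    return F.relu(q_other - q_star).mean()               # hinge penalty
```

During training, a penalty like this would be added to the usual TD loss with a regularization coefficient, e.g. `loss = td_loss + kappa * sa_dqn_regularizer(q_net, s, eps)` (variable names illustrative).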
| Environment | Evaluation | Vanilla DQN | SA-DQN (convex relaxation) |
|---|---|---|---|
| Pong | No attack | 21.0±0.0 | 21.0±0.0 |
| | PGD 10-step attack | -21.0±0.0 | 21.0±0.0 |
| | PGD 50-step attack | -21.0±0.0 | 21.0±0.0 |
| RoadRunner | No attack | 45534.0±7066.0 | 44638.0±7367.0 |
| | PGD 10-step attack | 0.0±0.0 | 44732.0±8059.5 |
| | PGD 50-step attack | 0.0±0.0 | 44678.0±6954.0 |
| BankHeist | No attack | 1308.4±24.1 | 1235.4±9.8 |
| | PGD 10-step attack | 56.4±21.2 | 1232.4±16.2 |
| | PGD 50-step attack | 31.0±32.6 | 1234.6±16.6 |
| Freeway | No attack | 34.0±0.2 | 30.0±0.0 |
| | PGD 10-step attack | 0.0±0.0 | 30.0±0.0 |
| | PGD 50-step attack | 0.0±0.0 | 30.0±0.0 |
See our SA-DQN repository for more details.
We train each agent configuration repeatedly, at least 15 times, and rank the runs by their average cumulative reward over 50 episodes under the strongest of the 5 attacks used. We report the performance of the agent with median robustness (we do not cherry-pick the best agents).
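A minimal sketch of this selection protocol, assuming the evaluation rewards have already been collected into an array of shape `(n_agents, n_attacks, n_episodes)` (the file path and variable names are illustrative):

```python
import numpy as np

# rewards[i, j, k]: cumulative reward of agent i under attack j in episode k
rewards = np.load("eval_rewards.npy")       # hypothetical pre-collected results
per_attack = rewards.mean(axis=2)           # average over the 50 episodes
strongest = per_attack.min(axis=1)          # worst case across the 5 attacks
order = np.argsort(strongest)               # rank agents by robustness
median_agent = order[len(order) // 2]       # agent with median robustness
print(f"report agent {median_agent}: reward {strongest[median_agent]:.1f}")
```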
| Environment | Evaluation | Vanilla PPO | SA-PPO (convex) | SA-PPO (SGLD) |
|---|---|---|---|---|
| Humanoid-v2 | No attack | 5270.6 | 6400.6 | 6624.0 |
| | Strongest attack | 884.1 | 4690.3 | 6073.8 |
| Walker2d-v2 | No attack | 4619.5 | 4486.6 | 4911.8 |
| | Strongest attack | 913.7 | 2076.1 | 2468.4 |
| Hopper-v2 | No attack | 3167.6 | 3704.1 | 3523.1 |
| | Strongest attack | 733.0 | 1224.2 | 1403.3 |
See our SA-PPO repository for more details.
We attack each agent with 5 different attacks (random attack, critic attack, MAD attack, RS attack, and RS+MAD attack); the "Strongest attack" rows report the lowest reward among all 5 attacks. Additionally, we train each setting 11 times and report the agent with median robustness (we do not cherry-pick the best results). This matters because training variance in RL can be large. A sketch of the RS attack is shown below.
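As a hedged sketch of the RS attack: assume a critic `q_robust(s, a)` has already been fit to the fixed policy with Sarsa (the robust value-function training from the paper is not shown). The attack then searches the ε-ball for a perturbed observation whose induced action has the lowest estimated value at the *true* state. All names and hyperparameters here are illustrative.

```python
import torch

def rs_attack(policy, q_robust, s, eps, steps=10):
    """Illustrative Robust Sarsa (RS) attack: perturb the observation so
    that the action the policy takes has low value q_robust(s, a) at the
    true state s. Fitting q_robust (via Sarsa) is assumed done already."""
    step_size = 2.0 * eps / steps
    s_adv = s.clone().detach()
    for _ in range(steps):
        s_adv.requires_grad_(True)
        value = q_robust(s, policy(s_adv)).sum()   # value of the misled action
        grad, = torch.autograd.grad(value, s_adv)
        with torch.no_grad():
            s_adv = s_adv - step_size * grad.sign()                # descend the value
            s_adv = torch.min(torch.max(s_adv, s - eps), s + eps)  # project to eps-ball
        s_adv = s_adv.detach()
    return s_adv
```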
| Environment | Evaluation | Vanilla DDPG | SA-DDPG (SGLD) | SA-DDPG (convex) |
|---|---|---|---|---|
| Ant-v2 | No attack | 1487 | 2186 | 2254 |
| | Strongest attack | 142 | 2007 | 1820 |
| Walker2d-v2 | No attack | 1870 | 3318 | 4540 |
| | Strongest attack | 790 | 1210 | 1986 |
| Hopper-v2 | No attack | 3302 | 3068 | 3128 |
| | Strongest attack | 606 | 1609 | 1202 |
| Reacher-v2 | No attack | -4.37 | -5.00 | -5.24 |
| | Strongest attack | -27.87 | -12.10 | -12.44 |
| InvertedPendulum-v2 | No attack | 1000 | 1000 | 1000 |
| | Strongest attack | 92 | 423 | 1000 |
See our SA-DDPG repository for more details.