A list of deep neural network architectures for reinforcement learning tasks.
Paper | Domain | Model | Architecture | Source code |
---|---|---|---|---|
Mnih et al., 2013 | Atari | DQN (NIPS version) | The first conv layer: 16 filters of 8×8 with stride 4. The second layer: 32 filters of 4×4 with stride 2. The final hidden layer is fc and consists of 256 units. All hidden layers were followed by ReLU. | |
Mnih et al., 2015 | Atari | DQN (Nature version) | The first conv layer: 32 filters of 8×8 with stride 4. The second layer: 64 filters of 4×4 with stride 2. The third layer: 64 filters of 3×3 with stride 1. The final hidden layer is fc and consists of 512 units. All hidden layers were followed by ReLU (see the first sketch below the table). | Torch, deepmind |
Mnih et al., 2016 | Atari, MuJoCo, Labyrinth, TORCS | A3C (Asynchronous Advantage Actor-Critic) | Atari: the agents used the network architecture from (Mnih et al., 2013), as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. MuJoCo: in the low-dimensional physical-state case, the inputs are mapped to a hidden state using 1 hidden layer with 200 ReLU units; in the pixels case, the input was passed through 2 conv layers without any non-linearity or pooling. In either case, the output of the encoder layers was fed to a single layer of 128 LSTM cells. Labyrinth: an A3C LSTM agent was trained on this task using only 84×84 RGB images as input. | |
Hausknecht et al., 2015 | Atari | DRQN (Recurrent DQN) | For input, the recurrent network takes a single 84×84 preprocessed image. The convolutional outputs of the DQN are fed to an LSTM layer with 512 cells. | Caffe, mhauskn |
Sorokin et al., 2015 | Atari | DARQN (Attention Recurrent DQN) | The input is an 84×84×1 tensor, and the output of the last (third) conv layer contains 256 feature maps of 7×7. The attention network takes 49 vectors as input, each of dimension 256. The number of hidden units in the attention network is chosen to be equal to 256. The LSTM network also has 256 units, which is consistent with the number of attention network outputs. | Torch, 5vision |
Lillicrap et al., 2016 | MuJoCo, TORCS | Deep DPG (Deterministic Policy Gradient) | The low-dimensional networks had 2 hidden layers with 400 and 300 units respectively (≈ 130,000 parameters); actions were not included until the 2nd hidden layer of Q. In the pixels case, there were 3 conv layers (no pooling) with 32 filters at each layer, followed by two fc layers with 200 units (≈ 430,000 parameters). In the low-dimensional case, batch normalization is used on the state input and on all layers of the μ network and of the Q network prior to the action input. The final output layer of the actor was a tanh layer, to bound the actions. All hidden layers were followed by ReLU (see the second sketch below the table). | Torch, iassael |
Gu et al., 2016 | MuJoCo | NAF (Normalized Advantage Functions) | For both this method and the prior DDPG algorithm (Lillicrap et al., 2016) in the comparisons, the networks have 2 layers of 200 ReLU units to produce each of the output parameters – the Q-function and policy in DDPG, and the value function V, the advantage matrix L, and the mean μ for NAF. | |
Schulman et al., 2015 | MuJoCo, Atari | TRPO (Trust Region Policy Optimization) | Locomotion tasks: Swimmer - 30, Hopper - 50 and Walker - 50 hidden units. Atari: two conv layers with 16 channels and stride 2, followed by one fc layer with 20 units, yielding 33,500 parameters. | Theano, joschu |
Duan et al., 2016 | Box2D, MuJoCo | Benchmarking: REINFORCE, TNPG, RWR, REPS, TRPO, CEM, CMA-ES, DDPG | For basic, locomotion, and hierarchical tasks and for batch algorithms, the policy network has 3 hidden layers, consisting of 100, 50, and 25 hidden units with tanh nonlinearity at the first two hidden layers, which map each state to the mean of a Gaussian distribution. For all partially observable tasks, an LSTM with 32 hidden units is used. | Theano, rllab |
Mohamed and Rezende, 2015 | Room environment (lava-filled maze, key/predator scenarios) | Stochastic variational information maximisation | The first conv layer: 10 filters of 4×4 with stride 1; the second: 10 filters of 3×3 with stride 2. The output of the convolution is passed through a fc layer with 100 hidden units. All hidden layers were followed by ReLU. | |
Blundell et al., 2016 | Atari, Labyrinth | Model-Free Episodic Control | In all experiments the encoder has four conv layers using {32, 32, 64, 64} kernels respectively, kernel sizes {4, 5, 5, 4}, kernel strides {2, 2, 2, 2}, no padding, and ReLU non-linearity. The conv layers are followed by a fc layer of 512 ReLU units, from which a linear layer outputs the means and log-standard-deviations of the approximate posterior q(z|x), where z is a 32-dimensional vector and x is 7056-dimensional (84×84). The decoder is set up mirroring the encoder. | |
Houthooft et al., 2016 | MuJoCo | VIME (Variational Information Maximizing Exploration) | Bayesian NN: for the classic tasks, it has one hidden layer of 32 units; for the locomotion tasks, it has two hidden layers of 64 units each. All hidden layers were followed by ReLU. NN policy: the classic tasks make use of a network with one layer of 32 tanh units, while the locomotion tasks make use of a two-layer network of 64 and 32 tanh units. The classic tasks make use of a baseline network with one layer of 32 ReLU units, while the locomotion tasks make use of a linear baseline function. | Theano, openai |
Ho and Ermon, 2016 | MuJoCo | Generative Adversarial Imitation Learning | The same neural network architecture is used for all tasks: two hidden layers of 100 units each, with tanh nonlinearities in between. | Theano, openai |
Levine et al., 2015 | PR2 robot | Visuomotor Policy | The images were downsampled to 240×240×3. The network contains 3 conv layers (one with 64 filters of 7×7 with stride 2 and two layers with 32 filters of 5×5), followed by a spatial softmax and an expected-position layer that converts pixel-wise features to 64 feature points. The points are concatenated with the robot's configuration (39 values), then passed through 3 fc layers (40, 40 and 7 units) to produce the torques. The network has 7 layers and around 92,000 parameters. | |
Watter et al., 2015 | Visual version of the classic tasks | Embed to Control (E2C) | Plane: Encoder: 150 - 150 - 150 - 4 Linear (2 for AE). Decoder: 200 - 200 - 1600 Linear (Sigmoid for AE). Dynamics: 100 - 100 + Output layer. Pendulum swing-up: Encoder: 800 - 800 - 6 Linear (3 for AE). Decoder: 800 - 800 - 4608 Linear (Sigmoid for AE). Dynamics: 100 - 100 + Output layer. Cart-Pole balancing: Encoder: 32×5×5 - 32×5×5 - 32×5×5 - 512 - 512. Decoder: 512 - 512 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5. Dynamics: 200 - 200 + 32 Linear. Three-link arm: Encoder: 64×5×5 - 2×2 max-pooling - 32×5×5 - 2×2 max-pooling - 32×5×5 - 2×2 max-pooling - 512 - 512. Decoder: 512 - 512 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 32×5×5 - 2×2 up-sampling - 64×5×5. Dynamics: 200 - 200 + 48 Linear. All hidden layers were followed by ReLU. | |
Assael et al., 2015 | Pendulum (pixels-to-torques) | DDM (Deep Dynamical Models) | Planar pendulum: the 40×40 = 1600-pixel screenshots are reduced to 100 dimensions using PCA. f_enc: 100×50 – 50×50 – 50×2, f_pred: 5×100 – 100×100 – 100×2, f_dec: 2×50 – 50×50 – 50×100. Planar double pendulum: the 48×48 = 2304-pixel screenshots are reduced to 512 dimensions using PCA. f_enc and f_dec: 512×256 – 256×256 – 256×4, f_pred: 10×200 – 200×200 – 200×4. All hidden layers were followed by ReLU. | |
Mordatch et al., 2015 | MuJoCo | Interactive control policy | All experiments use neural networks with 3 hidden layers of 250 units each and a tanh activation function. | |
Peng et al., 2016 | BulletPhysics | MACE (Mixture of Actor-Critic Experts) | The first conv layer: 16 filters of 8×1. The second layer: 32 filters of 4×1. The third layer: 32 filters of 4×1. A stride of 1 is used for all conv layers. The output of the final conv layer is processed by 64 fc units, and the resulting features are then concatenated with the character features. The combined features are processed by a fc layer composed of 256 units. The network then branches into critic and actor subnetworks, each with a fc layer of 128 units followed by a linear output layer. The sizes of the output layers vary depending on the subnetwork, ranging from 3 output units for the critics to 29 units for each actor. The combined network has approximately 570k parameters. All hidden layers were followed by ReLU. | Caffe, xbpeng |
Parisotto et al., 2015 | Atari | Actor-Mimic | The network used for transfer consisted of the following architecture: 8×8×4×256-4 → 4×4×256×512-2 → 3×3×512×512-1 → 3×3×512×512-1 → 2048 fc units → 1024 fc units → 18 actions. All hidden layers were followed by ReLU. | Torch, eparisotto |
Rusu et al., 2016; Raia Hadsell slides | Atari, Labyrinth, MuJoCo, Jaco arm | Progressive nets | Atari: a model with 3 conv layers followed by a fc layer from which the policy and value function are predicted. The conv layers have 12 feature maps. The first layer has a kernel of size 8×8 and a stride of 4×4. The second layer has a kernel of size 4 and a stride of 2. The third layer has a kernel of size 3×4 with a stride of 1. The fc layer has 256 hidden units. | |
Oh et al., 2015 | Atari | Action-Conditional | The encoding layers consist of 4 conv layers and 1 fc layer with 2048 hidden units. The conv layers use 64 (8×8), 128 (6×6), 128 (6×6), and 128 (4×4) filters with a stride of 2. Every layer is followed by ReLU. In the recurrent encoding network, an LSTM layer with 2048 hidden units is added on top of the fc layer. The number of factors in the transformation layer is 2048. The decoding layers consist of one fc layer with 11264 (= 128×11×8) hidden units followed by 4 deconv layers with 128 (4×4), 128 (6×6), 128 (6×6), and 3 (8×8) filters with a stride of 2. | Caffe, junhyukoh |
Stadie et al., 2015 | Atari | Incentivizing exploration | The autoencoder has 8 hidden layers (1000-500-250-128-250-500-1000-7056 units), followed by a Euclidean loss layer. | |
Sukhbaatar et al., 2015 | MazeBase, StarCraft | Memory network | ConvNet: 4 conv layers (the first layer has a 1×1 kernel, which essentially makes it an embedding of words). Items without spatial location (e.g. “Info” items) are each represented as a bag of words, passed through a fc layer, and combined with the outputs of the conv layers; these are then passed through 2 fc layers to output the actions (and a baseline for reinforcement). MemNN: the architecture from (Sukhbaatar et al., 2015) is used with 3 hops and tanh non-linearities. | Torch, facebook |
Kulkarni et al., 2016 | MazeBase, ViZDoom | DSR (Deep Successor Representation) | The feature branch is a CNN with four layers: 32 (8×8), 64 (4×4), 64 (3×3), and 512 units, plus an additional fifth layer with 512 tanh units (equal to the SR dimension). Intrinsic reward decoder: 512 (4×4), 256 (4×4), 128 (4×4), 64 (4×4), 3 (4×4). Successor branch: 512, 256, 512. All hidden layers were followed by ReLU. | Torch, Ardavans |
Sunehag et al., 2015 | Recommendation system | Slate-MDP (high-dimensional control) | For all agents' Q-functions, the neural networks have 2 hidden layers, each with 100 units. The policies are feed-forward neural networks with 2 hidden layers with 25 hidden units each. | |
... |
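
To make the column entries concrete, here is a minimal sketch of the DQN (Nature version) row above. It is written in PyTorch as an assumption (the linked source code is Lua Torch), and the helper names are hypothetical; it only mirrors the sizes listed in the table: 32 filters of 8×8 with stride 4, 64 of 4×4 with stride 2, 64 of 3×3 with stride 1, a 512-unit fc layer, ReLU after every hidden layer, and a linear output with one Q-value per action.

```python
# Minimal sketch of the DQN (Nature version) architecture, assuming PyTorch
# and the standard 4x84x84 stacked-frame input described in the paper.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions: int, in_channels: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # With 84x84 inputs the conv stack yields 64 feature maps of 7x7.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # linear output: one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(x))

if __name__ == "__main__":
    q = DQN(num_actions=18)
    print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 18])
```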
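
Likewise, a minimal sketch (same PyTorch assumption, hypothetical class names) of the low-dimensional Deep DPG networks from the Lillicrap et al., 2016 row: 400- and 300-unit ReLU hidden layers, batch normalization on the state path, the action entering the critic only at the second hidden layer, and a tanh output layer bounding the actor's actions.

```python
# Minimal sketch of the low-dimensional DDPG actor and critic, assuming PyTorch.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(state_dim),
            nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
            nn.Linear(400, 300), nn.BatchNorm1d(300), nn.ReLU(),
            nn.Linear(300, action_dim), nn.Tanh(),  # tanh bounds the actions
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        # Batch norm only on the state path, prior to where the action is injected.
        self.state_path = nn.Sequential(
            nn.BatchNorm1d(state_dim),
            nn.Linear(state_dim, 400), nn.BatchNorm1d(400), nn.ReLU(),
        )
        # The action joins at the second hidden layer, as described in the table.
        self.joint = nn.Sequential(
            nn.Linear(400 + action_dim, 300), nn.ReLU(),
            nn.Linear(300, 1),  # scalar Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        h = self.state_path(state)
        return self.joint(torch.cat([h, action], dim=1))
```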