In this project, we will mainly use deep reinforcement learning or imitation learning to tackle the task of generalizable robotic manipulation. We therefore first review relevant deep-learning-based robotic manipulation works, organized by the challenges they aim to address.
Sample efficiency is a central topic in RL, and when RL is applied to robotic tasks it becomes a critical issue that may prevent large-scale real-life deployment of current robotics algorithms. In general, model-free RL algorithms such as standard SAC, TRPO, and PPO are less sample efficient and are hence less likely to be employed by future large-scale intelligent robot systems. Many methods have therefore been proposed that leverage model-based RL to address this problem in robotic manipulation.
In the early days, model-based RL works like PILCO exhibited relatively high sample efficiency, needing only about 4 minutes of interaction to learn a complex task such as block stacking. PILCO learns a whitebox model, i.e., it specifies the analytic form of the model, so that value functions can be optimized directly via gradient descent without additional data, real or synthetic. Later, interest shifted toward using a learned (blackbox) model to generate additional synthetic data for model-free training; related works include Dyna-style methods, Model Predictive Control (MPC), and Model-Based Policy Optimization (MBPO). This is probably because model-free RL enjoys better asymptotic performance thanks to accurate transition samples, and using blackbox models to generate more data helps balance sample accuracy against sample efficiency.
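To make the Dyna/MBPO-style idea concrete, below is a minimal sketch of interleaving real environment steps with short imagined rollouts from a learned blackbox model. The `env`, `model`, `policy`, and `update` objects are placeholder interfaces assumed for illustration, not the API of any specific library or paper.

```python
import random

def dyna_style_training(env, model, policy, update, num_iters=1000,
                        real_steps=1, synthetic_rollouts=10, horizon=5):
    """Sketch of Dyna/MBPO-style training: interleave real environment
    interaction with short synthetic rollouts from a learned dynamics model."""
    real_buffer, synthetic_buffer = [], []
    state = env.reset()
    for _ in range(num_iters):
        # 1. Collect a few real transitions and refit the dynamics model.
        for _ in range(real_steps):
            action = policy.act(state)
            next_state, reward, done = env.step(action)  # simplified env interface
            real_buffer.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state
        model.fit(real_buffer)

        # 2. Branch short rollouts from states in the real buffer
        #    using the learned (blackbox) model.
        for _ in range(synthetic_rollouts):
            s = random.choice(real_buffer)[0]
            for _ in range(horizon):
                a = policy.act(s)
                s_next, r = model.predict(s, a)   # imagined transition
                synthetic_buffer.append((s, a, r, s_next, False))
                s = s_next

        # 3. Model-free update (e.g. SAC) on real plus synthetic data.
        update(policy, real_buffer + synthetic_buffer)
```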
Recently, people have become interested in enlisting the powerful knowledge embedded in large language and foundation models. RoboAgent uses world priors from foundation models to semantically augment trajectories from a relatively small dataset; the robot is then able to learn a diverse set of non-trivial skills by training on the original dataset together with the hallucinated data.
Exploration versus exploitation is also a classic, extensively studied topic in RL. Exploration is of great importance in robotic manipulation because manually engineering a tailored reward function for each complex task can be costly, so the agent must learn to navigate the state space properly even under sparse reward. Simple RL algorithms enable exploration by injecting randomness into action selection. For example, DQN uses an ε-greedy strategy that takes a uniformly random action with probability ε, while continuous-control methods such as DDPG add Gaussian or Ornstein-Uhlenbeck noise to the deterministic action output.
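As a minimal illustration of this kind of undirected exploration, an ε-greedy action selector for a discrete action space might look like the following sketch.

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1, rng=None):
    """ε-greedy exploration: with probability ε take a uniformly random
    action, otherwise take the action with the highest Q-value."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: Q-values for four discrete actions.
print(epsilon_greedy_action(np.array([0.1, 0.5, 0.2, 0.0]), epsilon=0.2))
```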
There have been several works dedicated to exploration in robotic manipulation. For instance, Schneider et al. proposed a method for exploration in a model-based RL setting that leverages an information-gain objective estimated from an ensemble of learned models.
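The exact information-gain estimator varies across papers, but a common and simple proxy is the disagreement among the ensemble's predictions, used as an intrinsic reward. The sketch below illustrates this generic idea, assuming each model in the ensemble exposes a hypothetical `predict(state, action)` method; it is not necessarily the objective of the cited work.

```python
import numpy as np

def disagreement_bonus(ensemble, state, action):
    """Intrinsic reward from model-ensemble disagreement: the variance of
    next-state predictions serves as a cheap proxy for the expected
    information gain of visiting (state, action)."""
    preds = np.stack([m.predict(state, action) for m in ensemble])  # (K, state_dim)
    return float(preds.var(axis=0).mean())

# The agent then maximizes r_total = r_extrinsic + beta * disagreement_bonus(...).
```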
Another solution, rather than exploration driven by intrinsic reward, is to automatically generate a denser reward signal. Recent works try to exploit the prior knowledge in foundation models to avoid reward engineering and directly generate dense rewards. VoxPoser uses a VLM and an LLM to generate code that defines value and constraint maps in 3D space, over which the robot directly plans to finish the task. RoboGen also uses an LLM to design rewards, though its main contribution is a complete pipeline for automatic skill learning.
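As a purely conceptual illustration (not VoxPoser's actual planner), planning over such maps can be pictured as greedily climbing a voxelized value map while avoiding voxels flagged by a constraint map; all names below are ours.

```python
import numpy as np

def greedy_plan_on_value_map(value_map, constraint_map, start, max_steps=100):
    """Conceptual sketch: move through a 3D voxel grid by greedily stepping
    to the highest-value neighboring voxel, skipping voxels marked as
    forbidden by the constraint map."""
    pos = np.array(start)
    path = [tuple(pos)]
    offsets = [np.array(o) for o in
               [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
    for _ in range(max_steps):
        candidates = []
        for off in offsets:
            nxt = pos + off
            if np.all(nxt >= 0) and np.all(nxt < value_map.shape) \
                    and not constraint_map[tuple(nxt)]:
                candidates.append((value_map[tuple(nxt)], tuple(nxt)))
        if not candidates:
            break
        best_value, best_pos = max(candidates)
        if best_value <= value_map[tuple(pos)]:
            break                       # reached a local optimum of the value map
        pos = np.array(best_pos)
        path.append(best_pos)
    return path
```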
Generalization is another ability people expect future robotic systems to have. Standard RL algorithms usually produce a policy that is highly task-specific and cannot transfer to a different environment. Currently, many works on robotic manipulation focus on one or a few types of objects and tasks, but in real life a fully functional robot may be asked to perform a wide range of tasks with varying objects, some of which it may never have seen.
A line of work attacks the generalization problem through meta RL. MAML and its follow-up works aim to find a set of initial parameters such that the agent can achieve good performance after training on the meta-testing tasks for only a few steps (see the sketch below). PEARL is another meta-RL method that uses a latent context variable for fast task inference at meta-test time. Another idea is to train a robust policy that performs well across different environment parameters. Robust Adversarial RL (RARL) exploits this idea using robust adversarial training, where the agent plays against an antagonist agent in a zero-sum Markov game.
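The sketch below shows the first-order approximation of the MAML update (FOMAML), assuming a hypothetical `loss_grad(params, task)` helper that returns per-parameter gradients; full MAML additionally differentiates through the inner-loop updates.

```python
def fomaml_step(params, tasks, loss_grad, inner_lr=0.01, outer_lr=0.001, inner_steps=1):
    """First-order MAML sketch: adapt a copy of the parameters to each task
    with a few gradient steps, then move the meta-parameters along the
    average of the post-adaptation gradients."""
    meta_grad = [0.0 * p for p in params]
    for task in tasks:
        adapted = list(params)
        for _ in range(inner_steps):                        # inner-loop adaptation
            grads = loss_grad(adapted, task)
            adapted = [p - inner_lr * g for p, g in zip(adapted, grads)]
        outer_grads = loss_grad(adapted, task)               # evaluate adapted params
        meta_grad = [mg + g / len(tasks) for mg, g in zip(meta_grad, outer_grads)]
    return [p - outer_lr * g for p, g in zip(params, meta_grad)]  # meta-update
```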
Many recent works on generalist robots simply train the agent on a variety of objects to attain the ability to generalize to unseen ones, which aligns with the spirit of meta RL but is more straightforward. RoboAgent, mentioned above, is capable of learning 12 distinct skills that succeed on unseen objects and layouts, thanks to the data augmented with foundation models. Indeed, leveraging large language and/or vision models is also a promising route toward generalizable robotic manipulation.
Another challenge faced by research on robotic manipulation is the lack of a universal benchmark covering the aspects of robot learning mentioned above. Fortunately, the ManiSkill2 benchmark provides a framework for comparing robot learning algorithms in terms of sample efficiency, generalizability, etc.
Finally, we briefly introduce the methods of two ManiSkill challenge winners.
The work "Learning Category-Level Generalizable Object Manipulation Policy via Generative Adversarial Self-Imitation Learning from Demonstrations" tackles imitation learning for generalizable object manipulation. The authors identified some issues when using Generative Adversarial Imitation Learning (GAIL) for the task and proposed their solutions. GAIL optimizes a policy meant to generate trajectories from the same distribution as the expert demonstrations, along with a discriminator meant to distinguish between policy generated data and expert data. A problem similar to what might happen in GAN is that the discriminator quickly learns to classify the trajectories, and the policy is not receiving any reward for tricking the discriminator. Hence, the authors proposes to progressively grow the architecture of the discriminator from a simple ensemble of PointNet to a complex PointNet emsemble + Transformer architecture. This is done by progressively interpolate between the pooled output of the two architectures.
Still, the discriminator is observed to win over the policy as training goes on. The authors argue that this may be due to the non-uniformity of the expert data distribution, which makes it hard for a single policy to imitate the expert trajectory distribution. To overcome this, they incorporate self-imitation, gradually replacing the expert trajectories in the expert buffer with successful trajectories generated by the policy. To facilitate generalization to novel objects, they also introduce a category-level instance-balancing (CLIB) expert buffer, in which the number of demonstrations per object instance is kept equal. With these three improvements, their method achieves an 18% improvement over the GAIL+SAC baseline on the validation set.
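A rough sketch of how such an instance-balanced expert buffer with self-imitation replacement could be organized is given below; the class and method names are ours, not the authors'.

```python
import random
from collections import defaultdict

class CLIBExpertBuffer:
    """Sketch of a category-level instance-balancing expert buffer with
    self-imitation: successful policy trajectories gradually replace
    demonstrations, while every object instance keeps the same number of
    trajectories so that no instance dominates the expert distribution."""

    def __init__(self, demos_per_instance):
        # demos_per_instance: dict mapping object instance id -> list of demos
        self.per_instance = defaultdict(list, demos_per_instance)
        self.capacity = max(len(v) for v in demos_per_instance.values())

    def add_success(self, instance_id, trajectory):
        """Insert a successful policy trajectory, evicting a random old one
        if the instance's slot is already full (self-imitation replacement)."""
        bucket = self.per_instance[instance_id]
        if len(bucket) >= self.capacity:
            bucket.pop(random.randrange(len(bucket)))
        bucket.append(trajectory)

    def sample(self, batch_size):
        """Sample uniformly over instances first, then within each instance,
        keeping the batch balanced across object instances."""
        instances = random.choices(list(self.per_instance.keys()), k=batch_size)
        return [random.choice(self.per_instance[i]) for i in instances]
```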
The work "A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of Embodied AI" won the ManiSkill 2023 challenge. The authors used PointNet + PPO for rigid body tasks and PointNet + Behavior Cloning for soft body tasks. They observed that at some point during training the sucess rate starts to go down, signaling overfitting to the training set. Therefore, they proposed a two-stage fintuning strategy, taking the best checkpoint from the first round of training and continue to train the model after slightly reducing the batch size and the number of samples in each steps. It turns out that this alleviates the overfitting problem and the success rate is able to continue rising.