A committee of experts from the reinforcement learning field organized a competition on sample-efficient reinforcement learning in the Minecraft environment. The competition is held as part of NeurIPS 2019, and the winners will demonstrate their results during the conference.
Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples. Many of these systems cannot be applied to real-world problems, where environment samples are expensive. Resolution of these limitations requires new, sample-efficient methods.
This competition is designed to foster the development of algorithms which can drastically reduce the number of samples needed to solve complex, hierarchical, and sparse environments using human demonstrations. Participants compete to develop systems which solve a hard task in Minecraft, obtaining a diamond, with a limited number of samples.
This repository attempts to solve the problem by using the Deep Deterministic Policy Gradient (DDPG) method to estimate the agent's policy, combined with recurrent layers to estimate the agent's state.
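As a rough illustration of that idea, the sketch below shows a hypothetical recurrent actor network in PyTorch: a small CNN encodes the "pov" image, a GRU carries state across timesteps, and a linear head produces a continuous action vector. The layer sizes and the ACTION_DIM constant are illustrative assumptions, not the exact architecture used in this repository.

```python
import torch
import torch.nn as nn

ACTION_DIM = 11  # illustrative: size of the flattened continuous action vector

class RecurrentActor(nn.Module):
    """Sketch of a recurrent actor: CNN image encoder -> GRU -> action head."""
    def __init__(self, action_dim=ACTION_DIM, hidden_size=256):
        super().__init__()
        # Encode the 64x64x3 "pov" observation.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.gru = nn.GRU(input_size=64 * 4 * 4, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, action_dim)

    def forward(self, pov_seq, hidden=None):
        # pov_seq: (batch, seq_len, 3, 64, 64), values scaled to [0, 1]
        b, t = pov_seq.shape[:2]
        features = self.encoder(pov_seq.reshape(b * t, 3, 64, 64)).reshape(b, t, -1)
        out, hidden = self.gru(features, hidden)
        actions = torch.tanh(self.head(out))  # bounded continuous actions
        return actions, hidden
```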
Minecraft is a rich environment in which to perform learning: it is an open-world environment, has sparse rewards, and has many innate task hierarchies and subgoals. Furthermore, it encompasses many of the problems that we must solve as we move towards more general AI (for example, what is the reward structure of “building a house”?). Besides all this, Minecraft has more than 90 million monthly active users, making it a good environment on which to collect a large-scale dataset.
Some of the stages of obtaining a diamond: obtaining wood, a stone pickaxe, iron, and diamond
During an episode the agent is rewarded only once per item, the first time it obtains that item in the requisite item hierarchy for obtaining a diamond. The rewards for each item are:
reward="1" type="log"
reward="2" type="planks"
reward="4" type="stick" reward="4" type="crafting_table"
reward="8" type="wooden_pickaxe" reward="16" type="cobblestone"
reward="32" type="furnace" reward="32" type="stone_pickaxe"
reward="64" type="iron_ore" reward="128" type="iron_ingot"
reward="256" type="iron_pickaxe" reward="1024" type="diamond"
The agent's observation consists of "equipped_items", the items currently in the agent's hands; "inventory", the items the agent has acquired during the game; and "pov", an RGB image of the agent's first-person perspective.
Dict({
    "equipped_items": {
        "mainhand": {
            "damage": "Box()",
            "maxDamage": "Box()",
            "type": "Enum(none,air,wooden_axe,wooden_pickaxe,stone_axe,stone_pickaxe,iron_axe,iron_pickaxe,other)"
        }
    },
    "inventory": {
        "coal": "Box()",
        "cobblestone": "Box()",
        "crafting_table": "Box()",
        "dirt": "Box()",
        "furnace": "Box()",
        "iron_axe": "Box()",
        "iron_ingot": "Box()",
        "iron_ore": "Box()",
        "iron_pickaxe": "Box()",
        "log": "Box()",
        "planks": "Box()",
        "stick": "Box()",
        "stone": "Box()",
        "stone_axe": "Box()",
        "stone_pickaxe": "Box()",
        "torch": "Box()",
        "wooden_axe": "Box()",
        "wooden_pickaxe": "Box()"
    },
    "pov": "Box(64, 64, 3)"
})
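To give a concrete sense of how these observations are consumed, here is a minimal sketch (assuming the standard MineRL/Gym dict observation) of pulling out the "pov" image and a few inventory counts and converting them to PyTorch tensors; the normalization and the selected inventory keys are illustrative choices.

```python
import numpy as np
import torch

def preprocess_observation(obs):
    """Convert a MineRL observation dict into tensors for the networks (sketch)."""
    # "pov" is a (64, 64, 3) uint8 image; scale to [0, 1] and move channels first.
    pov = torch.from_numpy(obs["pov"].astype(np.float32) / 255.0).permute(2, 0, 1)

    # A few illustrative scalar features from the inventory counts.
    inventory_keys = ["log", "planks", "stick", "cobblestone", "iron_ore"]
    inventory = torch.tensor(
        [float(obs["inventory"][k]) for k in inventory_keys], dtype=torch.float32
    )
    return pov, inventory
```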
The agent has the following actions in its action space:
Dict({ "attack": "Discrete(2)",
"back": "Discrete(2)",
"camera": "Box(2,)",
"craft": "Enum(none,torch,stick,planks,crafting_table)",
"equip": "Enum(none,air,wooden_axe,wooden_pickaxe,stone_axe,stone_pickaxe,iron_axe,iron_pickaxe)",
"forward": "Discrete(2)",
"jump": "Discrete(2)",
"left": "Discrete(2)",
"nearbyCraft": "Enum(none,wooden_axe,wooden_pickaxe,stone_axe,stone_pickaxe,iron_axe,iron_pickaxe,furnace)",
"nearbySmelt": "Enum(none,iron_ingot,coal)",
"place": "Enum(none,dirt,stone,cobblestone,crafting_table,furnace,torch)",
"right": "Discrete(2)",
"sneak": "Discrete(2)",
"sprint": "Discrete(2)"
})
For more details on the environment, observation space, and action space, see the MineRL documentation.
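The snippet below is a minimal example of interacting with this action space through the MineRL Gym interface (the environment name MineRLObtainDiamond-v0 and the noop() helper come from the MineRL API; the specific actions set here are arbitrary, chosen only for illustration).

```python
import gym
import minerl  # registers the MineRL environments with gym

env = gym.make("MineRLObtainDiamond-v0")
obs = env.reset()

done = False
total_reward = 0.0
while not done:
    # Start from the all-zero (no-op) action dict and set a few fields.
    action = env.action_space.noop()
    action["forward"] = 1
    action["jump"] = 1
    action["camera"] = [0.0, 3.0]  # pitch and yaw deltas in degrees

    obs, reward, done, info = env.step(action)
    total_reward += reward

print("Episode return:", total_reward)
env.close()
```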
The agent's state in the Minecraft environment depends on the sequence of actions it takes, while each action depends on the agent's policy, which the agent must learn through trial and error. One recent algorithm for learning the policy is Deep Deterministic Policy Gradient (DDPG). DDPG is an actor-critic method in which deep neural networks estimate the action to be taken (the actor) and the state-action value of the agent (the critic).
Through an iterative process of predicting state-action values, predicting actions, taking the predicted actions in the environment, and collecting the resulting rewards, the agent learns the actor and critic networks. This process provides the agent with a growing inventory of {State, Action, Reward, Next_State} quadruples, from which learning happens off-policy while the agent keeps interacting with the environment. Off-policy means that, during learning, the agent uses past quadruples that were obtained under earlier policies while acting in the environment with its most recent policy.
This inventory of {State, Action, Reward, Next_State} quadruples is called the replay buffer, one of the tricks DDPG employs to converge. The agent stores all past experiences, samples them randomly to remove correlations in the training data, and applies the DDPG updates to its actor-critic network parameters. Another trick DDPG employs for stability is keeping two copies of the actor-critic networks, called the target and local networks. While estimating the optimal actor-critic networks, new estimates are based on previous estimates of the same networks, which creates a moving-target problem. To mitigate it, the target networks are updated only slowly and periodically, while the local networks update their parameters against the target network estimates.
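The sketch below illustrates these two tricks under some simplifying assumptions (flat state tensors, plain uniform sampling): a minimal replay buffer and a soft target-network update in PyTorch. Names such as ReplayBuffer and soft_update are illustrative, not necessarily those used in this repository.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Minimal uniform replay buffer of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states), torch.stack(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.memory)

def soft_update(local_net, target_net, tau=1e-3):
    """Slowly track the local network: theta_target <- tau*theta_local + (1-tau)*theta_target."""
    for target_param, local_param in zip(target_net.parameters(), local_net.parameters()):
        target_param.data.copy_(tau * local_param.data + (1.0 - tau) * target_param.data)
```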
Learning the critic network is closely related to Q-learning, an algorithm for learning state-action values in discrete state-action spaces. Both the critic network and Q-learning make use of the Bellman optimality equation:

Q*(s,a) = E_{s' ~ P}[ r(s,a) + γ max_{a'} Q*(s',a') ]

where s' ~ P is shorthand for saying that the next state, s', is sampled by the environment from a distribution P(.|s,a). This Bellman equation is the starting point for learning an approximator to Q*(s,a). Suppose the approximator is a neural network QΦ(s,a), with parameters Φ, and that we have collected a set of transitions (s,a,r,s',d) (where d indicates whether state s' is terminal). We can set up a mean-squared Bellman error (MSBE) function, which tells us roughly how closely QΦ(s,a) comes to satisfying the Bellman equation:

L(Φ, D) = E_{(s,a,r,s',d) ~ D}[ ( QΦ(s,a) - ( r + γ (1 - d) max_{a'} QΦ(s',a') ) )² ]
Here, in evaluating (1-d), we’ve used a Python convention of evaluating True to one and False to zero. Thus, when d==True — which is to say, when s' is a terminal state — the Q-function should show that the agent gets no additional rewards after the current state. (This choice of notation corresponds to what we later implement in code.)
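As a rough PyTorch illustration of this loss with target networks (the variable names, the GAMMA constant, and the use of mse_loss are illustrative, not lifted from this repository's code):

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # illustrative discount factor

def critic_loss(local_critic, target_critic, target_actor,
                states, actions, rewards, next_states, dones):
    """Mean-squared Bellman error using target networks (sketch)."""
    with torch.no_grad():
        # In DDPG the max over a' is replaced by the target actor's action.
        next_actions = target_actor(next_states)
        q_targets = rewards + GAMMA * (1.0 - dones) * \
            target_critic(next_states, next_actions).squeeze(-1)
    q_expected = local_critic(states, actions).squeeze(-1)
    return F.mse_loss(q_expected, q_targets)
```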
Policy learning in DDPG is fairly simple. We want to learn a deterministic policy μθ(s) which gives the action that maximizes QΦ(s,a). Because the action space is continuous, and we assume the Q-function is differentiable with respect to the action, we can just perform gradient ascent (with respect to the policy parameters θ only) to solve

max_θ E_{s ~ D}[ QΦ(s, μθ(s)) ]

In other words, the performance objective for the policy is chosen as the value function of the target policy, averaged over the state distribution of the behaviour policy.
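A corresponding PyTorch sketch of the actor update, with gradient ascent implemented as minimizing the negative Q-value (the names and the commented optimizer steps are illustrative):

```python
def actor_loss(local_critic, local_actor, states):
    """Deterministic policy gradient objective: maximize Q(s, mu(s)) (sketch)."""
    predicted_actions = local_actor(states)
    # Minimizing the negative mean Q-value performs gradient ascent on the objective.
    return -local_critic(states, predicted_actions).mean()

# Illustrative update step:
# actor_optimizer.zero_grad()
# actor_loss(local_critic, local_actor, states).backward()
# actor_optimizer.step()
```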
I used a machine with a Quadro M3000M GPU, CUDA 10.1, and PyTorch 1.2.0 to implement and run the DDPG agent. For installation of MineRL and the required packages, refer to [5]. I set up both docker and conda environments. Docker is my preferred environment, but since it is headless and the MineRL GUI is not set up for headless environments, you cannot see the agent in action there. I used docker during training, but switched to the conda environment to see the agent in action.
The competition organizers provided data from expert players to train the agent for sample-efficient learning. You can download all the data as follows:
import minerl
minerl.data.download(directory="PATH_TO_MINERL_DIRECTORY/data")
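Once downloaded, the demonstrations can be streamed with the minerl.data API. The sketch below assumes the batch_iter interface; exact iterator names and signatures vary between minerl versions, so treat it as illustrative.

```python
import minerl

# Load the demonstration dataset for the ObtainDiamond task.
data = minerl.data.make("MineRLObtainDiamond-v0",
                        data_dir="PATH_TO_MINERL_DIRECTORY/data")

# Iterate over short sequences of (state, action, reward, next_state, done).
for states, actions, rewards, next_states, dones in data.batch_iter(
        batch_size=4, seq_len=32, num_epochs=1):
    pov_batch = states["pov"]          # shape: (batch, seq_len, 64, 64, 3)
    camera_batch = actions["camera"]   # shape: (batch, seq_len, 2)
    # ... feed these into the replay buffer or a behaviour-cloning loss ...
    break
```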
Once you have a docker image (here named minerl), do the following:
docker run --gpus all -v PATH_TO_MINERL_DIRECTORY:/workspace -it minerl
For the above docker command to run natively with GPU support, you need docker version 19.03 or higher.
When you are inside the minerl docker container, you can run the training as follows:
xvfb-run python3 minerl_train_sequention.py
During training, the network parameters are saved periodically as checkpoint_actor.pth and checkpoint_critic.pth. The saved parameters are read by minerl_run.py to observe the agent in action. minerl_run.py runs in the conda environment so that it can display the GUI.
After activating the conda environment, run the following:
python3 minerl_run.py