The problem to be solved in this project consists of training two agents to bounce a ball over a net using rackets. The agents need to keep the ball in play: an agent receives a reward of +0.1 each time it hits the ball over the net, and a reward of -0.01 when it lets the ball hit the ground or hits it out of bounds.
The state space consists of 8 variables describing the position and velocity of the ball and of each agent's racket. Note that each agent receives its own local observation. Furthermore, there are 2 continuous actions, corresponding to horizontal movement of the racket (toward or away from the net) and jumping.
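Once the environment is set up (see the steps below), these dimensions can be verified directly from the Unity build. The following is a minimal sketch, assuming the `unityagents` package used by the notebook; the file name is an assumption and should point at the build you unzipped inside `solution/`:

```python
from unityagents import UnityEnvironment

# Minimal sketch: inspect the Tennis environment.
# The file name below is an assumption -- use the build matching your OS.
env = UnityEnvironment(file_name="solution/Tennis_Windows_x86_64/Tennis.exe")

brain_name = env.brain_names[0]          # the environment exposes a single brain
brain = env.brains[brain_name]

env_info = env.reset(train_mode=True)[brain_name]
states = env_info.vector_observations    # one row of observations per agent

print("Number of agents:", len(env_info.agents))
print("Observation size per agent:", states.shape[1])
print("Action size per agent:", brain.vector_action_space_size)

env.close()
```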
The whole training procedure is implemented in the `CollaborationAndCompetition.ipynb` notebook. You can either view the output of its last execution or run it yourself on a Jupyter server. To do so, follow these steps to set up the requirements:
- Create (and activate) a new environment with Python 3.6.

  - Linux or Mac:

    ```
    conda create --name drlnd python=3.6
    source activate drlnd
    ```

  - Windows:

    ```
    conda create --name drlnd python=3.6
    activate drlnd
    ```

- Follow the instructions in this repository to perform a minimal install of OpenAI gym.

- Install the dependencies in the `python/` folder:

  ```
  cd ./python/
  pip install .
  ```

- Create an IPython kernel for the `drlnd` environment:

  ```
  python -m ipykernel install --user --name drlnd --display-name "drlnd"
  ```

- Download the Unity Environment and unzip it inside the `solution/` directory. If you are using 64-bit Windows, you can skip this step, since the repository already contains the environment files. Otherwise, delete those files and replace them with the ones matching your OS.

- Before running the notebook, change the kernel to the `drlnd` environment using the Kernel drop-down menu. Once these steps are done, you can smoke-test the setup as shown in the sketch below.
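After completing the steps above, a quick way to confirm that everything is wired up correctly is to play one episode with random actions. This is only a sketch under the same assumptions as the inspection snippet above (the `unityagents` API and the build file name are assumptions):

```python
import numpy as np
from unityagents import UnityEnvironment

# Sketch of a smoke test: play one episode with random actions.
# The file name is an assumption -- use the build matching your OS.
env = UnityEnvironment(file_name="solution/Tennis_Windows_x86_64/Tennis.exe")
brain_name = env.brain_names[0]
action_size = env.brains[brain_name].vector_action_space_size

env_info = env.reset(train_mode=False)[brain_name]   # train_mode=False to watch the episode
num_agents = len(env_info.agents)
scores = np.zeros(num_agents)

while True:
    # Random continuous actions in [-1, 1] for both agents.
    actions = np.clip(np.random.randn(num_agents, action_size), -1, 1)
    env_info = env.step(actions)[brain_name]
    scores += env_info.rewards
    if np.any(env_info.local_done):
        break

print("Episode score (max over agents):", np.max(scores))
env.close()
```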
The agents are able to solve the problem in 3395 episodes (the average score over episodes 3395 to 3495 reaches +0.5). The weights of the agents' actor networks are stored in `actor1.pth` and `actor2.pth`, while the weights of the critic networks are stored in `critic1.pth` and `critic2.pth`. A detailed description of the implemented algorithm can be found in `report.pdf`.
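To evaluate the trained agents without retraining, the stored weights can be loaded back into the actor networks. This is only a sketch: it assumes the `Actor` class defined in `CollaborationAndCompetition.ipynb` and a constructor of the form `Actor(state_size, action_size, seed)`, which may differ from the actual implementation described in `report.pdf`.

```python
import torch

# Hypothetical sketch of restoring the trained actors for evaluation.
# `Actor` and its constructor signature are assumptions -- reuse the class
# defined in the notebook and the state/action sizes reported by the
# environment (see the inspection snippet above).
actor1 = Actor(state_size, action_size, seed=0)
actor2 = Actor(state_size, action_size, seed=0)

actor1.load_state_dict(torch.load("actor1.pth", map_location="cpu"))
actor2.load_state_dict(torch.load("actor2.pth", map_location="cpu"))
actor1.eval()   # switch to evaluation mode before running the agents
actor2.eval()
```

Here's the multi-agent's score history: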