diff --git a/docs/source/_static/img/world_batchboth.png b/docs/source/_static/img/world_batchboth.png
deleted file mode 100644
index 3a3c0353520..00000000000
Binary files a/docs/source/_static/img/world_batchboth.png and /dev/null differ
diff --git a/docs/source/_static/img/world_hogwild.png b/docs/source/_static/img/world_hogwild.png
deleted file mode 100644
index 927f7df8044..00000000000
Binary files a/docs/source/_static/img/world_hogwild.png and /dev/null differ
diff --git a/docs/source/tutorial_basic.md b/docs/source/tutorial_basic.md
index 1a5511a449d..570339a30a5 100644
--- a/docs/source/tutorial_basic.md
+++ b/docs/source/tutorial_basic.md
@@ -171,15 +171,13 @@ Another simple world we include is
 MultiAgentDialogWorld, which is similar but generalizes this to cycle
 between any number of agents in a round robin fashion.
 
-### Advanced Worlds
-
-We also include a few more advanced "container" worlds: in particular,
-we include both a BatchWorld and a HogwildWorld. These worlds are
-automatically used when either the `numthreads` parameter or the
-`batchsize` parameter are set to greater than one. Some extra
-functionality is needed to get these to work on the side of both the
-teacher and the learner, but we'll cover that in a different tutorial
-(see: tutorial\_worlds).
+:::{note} Advanced Worlds
+We also include a few more advanced "container" worlds: in particular, we
+include both a BatchWorld and a DynamicBatchWorld. These worlds may be used when
+certain options are set. See the [Worlds](tutorial_worlds) tutorial to
+understand how these work.
+:::
+
 Using ParlAI
 ------------
diff --git a/docs/source/tutorial_worlds.md b/docs/source/tutorial_worlds.md
index 87885ced0b5..04f695cfdc9 100644
--- a/docs/source/tutorial_worlds.md
+++ b/docs/source/tutorial_worlds.md
@@ -1,9 +1,9 @@
-Data Handling and Batching
+Worlds, Sharing & Batching
 ==========================
 
-__Authors__: Alexander Holden Miller, Kurt Shuster
+__Authors__: Alexander Holden Miller, Kurt Shuster, Stephen Roller
 
-:::{tip} Before you begin
+:::{important} Before you begin
 If you are unfamiliar with the basics of displaying data or calling train or
 evaluate on a model, please first see the [getting started](tutorial_basic)
 section. If you are interested in creating a task, please see
@@ -13,120 +13,363 @@ section. If you are interested in creating a task, please see
 Introduction
 ------------
 
-This tutorial will cover the details of batched data, and why we use
-Shared Worlds.
+This document gives an overview of [__Worlds__](parlai.core.worlds.World), a
+core concept within ParlAI. It is a high-level look at _how_ some concepts are
+implemented in ParlAI, so that you understand what's happening behind the
+scenes; you are unlikely to write any custom worlds yourself.
+
+Worlds are where agents live, and the world defines the communication flow
+between agents. If you are familiar with
+[Reinforcement Learning](https://en.wikipedia.org/wiki/Reinforcement_learning),
+Worlds are roughly analogous to environments.
+
+At the most basic level, worlds simply house agents and pass messages between
+them. Each message pass happens as a __Parley__. For example, the
+[DialogPartnerWorld](parlai.core.worlds.DialogPartnerWorld) contains this rough
+pseudo-code as its parley:
+
+:::{note}
+This is only pseudo-code, but the
+[real code](https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/worlds.py)
+is reasonably easy to read, even for newcomers.
+:::
+
+```python
+class SimpleWorld(World):
+    def __init__(self, opt, agents):
+        self.teacher, self.model = agents
+
+    def parley(self):
+        # produce an example dataset item
+        teacher_act = self.teacher.act()
+        # perform preprocessing, vectorizing, etc.
+        self.model.observe(teacher_act)
+        # produce a model response
+        model_act = self.model.act()
+        # compute any metrics of the response
+        self.teacher.observe(model_act)
+```
+
-With relatively small modifications to a basic agent, it will be able to
-support multithreading and batching. If you need extra speed or are
-using a very large dataset which does not fit in memory, we can use a
-multiprocessed pytorch dataloader for improved performance.
+This flow is represented by the following image:
 
-First, let's consider a diagram of the basic flow of DialogPartnerWorld,
-a simple world with two conversing agents.
+*(figure: a simple parley)*
+
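+
+To make the message passing concrete, here is a rough illustration of the kind
+of message a teacher's `act()` returns and the model then observes. The field
+values here are invented, but `text`, `labels`, and `episode_done` are the
+standard fields:
+
+```python
+teacher_act = {
+    'text': 'What is the capital of France?',  # the prompt shown to the model
+    'labels': ['Paris'],                       # gold answer(s), when available
+    'episode_done': False,                     # whether this conversation is over
+}
+```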
-![image](_static/img/world_basic.png) -The teacher generates a message, which is shown to the agent. The agent -generates a reply, which is seen by the teacher. +The parley method is usually run in a loop, until we run out of examples: + +```python +while not world.epoch_done(): + world.parley() +``` -### Expanding to batching using share() +This simple loop structure is powerful, and allows us to present a unified view +of all agents, whether they are [datasets](tutorial_task), [models](core/agents), +or humans connected via a [chat service](tutorial\_chat_service): +there is always a __world__ and it facilitates agents passing messages back +and forth. -For all tasks one might make, there's one function we need to support -for batching: `share()`. This function should provide whatever is needed -to set up a "copy" of the original instance of the agent for either each -row of a batch. +However, this loop may be inefficient. Modern hardware like GPUs and learning +algorithms like Stochastic Gradient Descent benefit from _batching_, or +processing multiple items at the same time. -We create shared agents by instantiating them in the following way: +In the remainder of this document, we will cover how Batching works, and eventually +discuss the implementation of Dynamic Batching. + +## Agent Sharing + +_Agent Sharing_ is the primary mechanism by which we implement batching, model +serving, and many other ParlAI features. Agent sharing works by creating +_clones_ of an Agent. Each clone of an Agent has _shared state_ and +_independent state_. Independent state includes things like the current +dialogue context (conversation history, which is different for every batch +element). Shared state includes things like the model weights. ```python -Agent0 = Agent(opt) -... -Agent1 = Agent(opt, Agent0.share()) -Agent2 = Agent(opt, Agent0.share()) -Agent3 = Agent(opt, Agent0.share()) +agent_original = Agent(opt) +agent_copy1 = agent_original.clone() +agent_copy2 = agent_original.clone() +agent_copy3 = agent_original.clone() ``` -![image](_static/img/world_share.png) - -That is, the executed are: - -1. We set up a starting instance of the world: that is, we use - `create_task` to set up a base world with the appropriate agents and - tasks. -2. We pass this world to a `BatchWorld`. -3. We create `batchsize` worlds, each initialized from a `share()`'d - version of the world and the agents therein. - -Now, every time we call `parley` on this BatchWorld, we will complete -`batchsize` examples. - -![image](_static/img/world_batchbasic.png) - -There's a few more complex steps to actually completing a parley in this -world. - -1. Call `parley_init` on each shared world, if the world has it - implemented. Most classes don't need this--we currently only use it - for our `MultiWorld`, which handles the case when one specifies - multiple separate tasks to launch (e.g. `-t babi,squad`). This does - any pre-parley setup, here choosing which sub-tasks to use in the - next parley. -2. Then, iterate over the number of agents involved in the task. For - most tasks, this is just two agents: the teacher (task) and the - student (model). For each agent, we do two steps: - - a. Call `BatchWorld.batch_act()`. This method first checks if the - __original__ instance of the agent (not the copies) has a - function named `batch_act` implemented and does not have an - attribute `use_batch_act` set to `False`. This function is - described more below. 
If condition is not met, the BatchWorld's - `batch_act` method iterates through each agent copy in the batch - and calls the `act()` method for that instance. This is the - default behavior in most circumstances, and allows agents to - immediately work for batching without any extra work--the - batch\_act method is merely for improved efficiency. - b. Call `BatchWorld.batch_observe()`. This method goes through - every __other__ agent, and tries to call the `observe()` method - on those agents. This gives other agents (usually, just the one - other agent) the chance to see the action of the agent whose - turn it is to act currently. - -Next, we'll look at how teachers and models can take advantage of the -setup above to improve performance. - -### Batched Models - -Finally, models need to be able to handle observations arriving in -batches. - -The first key concept to remember is that, if the model agent implements -`batch_act()`, __act will not be called__ as long as `batchsize` > 1. - -However, copies of the agent will still be created, and the `observe` -method of each one will be called. This allows each copy to maintain a -state related to a single row in the batch. Remember, since each row in -the batch is represented by a separate world, they are completely -unrelated. This means that the model only needs to be set up in the -original instance, and need not be shared with its copies. - -The `observe()` method returns a (possibly modified) version of the -observation it sees, which are collected into a list for the agent's -`batch_act()` method. - -![image](_static/img/world_batchagent.png) - -Now, the agent can process the entire batch at once. This is especially -helpful for GPU-based models, which prefer to process more examples at a -time. +Clones are why most agents have this rough structure in their initialization +procedures. -:::{tip} Implementing batch\_act and act +```python +class PseudoAgent(Agent): + def __init__(self, opt: Opt, shared: dict = None): + super().__init__(opt, shared) + if shared is None: + # When shared is None, we are the original, and we need to + # initialize any state + self.model = self.build_model() + else: + # When shared is NOT None, we are a clone. + # Use the shared dict to set any variables. this is where we + # incorporate any shared state + self.model = shared['model'] + + def share(self): + # This is used to create the "shared" dictionary that clones will + # receive in initialization. Make sure you enumerate everything + # that needs to be seen in all clones! + + # You will probably never call this method yourself. Rather, it is + # called automatically by `clone()` + shared = super().share() + shared['model'] = self.model + return shared +``` + +:::{note} Real implementations +The above is really just a very simple example. You may find it illuminating to +read the implementation of +[UnigramAgent](https://github.com/facebookresearch/ParlAI/blob/master/parlai/agents/unigram/unigram.py) +or even [TorchAgent](parlai.core.torch_agent.TorchAgent). +::: + + +:::{note} Usefulness +The above pseudo-code is likely to be used frequently as a user of ParlAI. +Some things are automatically shared by our abstractions (like models and +optimizers), but some users may have to include extra sharing. +::: + +Each clone of a model is relatively cheap to create: the only new memory used +is the dialogue context. The share object enables us to reuse memory for +expensive objects, like neural network weights. 
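+
+As a self-contained toy example (not real ParlAI code), the split between
+shared and independent state looks like this: the clone reuses the expensive
+object, but keeps its own conversation state.
+
+```python
+class ToyAgent:
+    def __init__(self, opt, shared=None):
+        self.history = []  # independent state: this clone's conversation
+        # shared state: reuse the expensive "model" if we are a clone
+        self.model = shared['model'] if shared else {'weights': '...large...'}
+
+    def share(self):
+        return {'model': self.model}  # enumerate everything clones should reuse
+
+    def clone(self):
+        return ToyAgent({}, shared=self.share())
+
+original = ToyAgent({})
+copy = original.clone()
+assert copy.model is original.model          # the weights are reused
+assert copy.history is not original.history  # conversations stay separate
+```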
+ +At a glance, creating these clones may seem strange. Why do we create all these +copies? Differentiating between shared and independent state lets us write our +agents with _independence_ in mind. Agents only need to __focus on a single +conversation__, maintaining and manipulating the state for just that conversation. +Every clone will be maintaining _separate conversations_, so when we implement +a new dataset or model, we only focus on one conversation at a time. + +This is what makes up the fundamental backend behind batching or chat services: +each clone of the agent only has to focus on one conversation at a time, with +only specific spots for synchronization. In the next section, we'll +take a look at how this is used in Batching. + +:::{note} When do I clone? +Note that while cloning is fundamental to the inner workings of ParlAI, it is +rare that you will need to call clone yourself, unless you are creating +custom worlds or backends. +::: + +## Batching + +:::{warning} Already implemented in `TorchAgent`. +If you're implementing your own custom Neural Network model, you don't need to +worry about this, except at a high level. This is all handled for you by +`TorchAgent`. +::: + +Equipped with the ability to create clones, we can use this to implement +Batching. For a batchsize of 3, we will create 3 clones of the Teacher, and 3 +clones of the Agent. Each of these clones will maintain their own separate, +independent conversations. Naively, this could be implemented with a simple +for loop: + +```python +class NaiveBatchWorld(World): + def __init__(self, opt, agents): + # store the originals + self.teacher, self.model = agents + # make batchsize copies of all the models + self.teacher_copies = [] + self.model_copies = [] + self.batchsize = opt['batchsize'] + for i in range(self.batchsize): + self.teacher_copies.append(self.teacher.clone()) + self.model_copies.append(self.model.clone()) +``` +This initialization code is then represented by this graphic: + +
+*(figure: Sharing the teacher and agents)*
+
+ +We continue with the implementation of parley: + +```python + def parley(self): + for i in range(self.batchsize): + # produce an example dataset item + teacher_act = self.teacher_copies[i].act() + # perform preprocessing, vectorizing, etc. + self.model_copies[i].observe(teacher_act) + # produce a model response + model_act = self.model_copies[i].act() + # compute any metrics of the response + self.teacher_copies[i].observe(model_act) +``` + +
+*(figure: Independent acts)*
+
+However, this is inefficient, and prevents us from utilizing the amazing
+vectorization capabilities of modern GPUs. Instead, we'll implement a special
+`batch_act` method. This method handles all the acts at once:
+
+```python
+class BatchWorld(World):
+    def __init__(self, opt, agents):
+        # store the originals
+        self.teacher, self.model = agents
+        # make batchsize copies of all the models
+        self.teacher_copies = []
+        self.model_copies = []
+        self.batchsize = opt['batchsize']
+        for i in range(self.batchsize):
+            self.teacher_copies.append(self.teacher.clone())
+            self.model_copies.append(self.model.clone())
+
+    def parley(self):
+        observations = []
+        for i in range(self.batchsize):
+            # produce an example dataset item
+            teacher_act = self.teacher_copies[i].act()
+            # perform preprocessing, vectorizing, etc.
+            observations.append(self.model_copies[i].observe(teacher_act))
+
+        # now batch_act can efficiently do everything on the GPU
+        model_acts = self.model.batch_act(observations)
+
+        # return the results of the batch_act back to individual conversations
+        for i in range(self.batchsize):
+            # self_observe is how we tell each copy what their individual
+            # actions are
+            self.model_copies[i].self_observe(model_acts[i])
+            # compute any metrics of the response
+            self.teacher_copies[i].observe(model_acts[i])
+```
+
+This logic is more complicated, but enables us to efficiently implement batched
+operations. The new logic is summarized in the following diagram:
+*(figure: Parley with batch_act)*
+
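+
+One convenient consequence of this design is that a single conversation is just
+a batch of size one. As a minimal toy sketch (our own illustration, not
+ParlAI's actual agent code):
+
+```python
+class EchoAgent:
+    """Toy agent whose act() simply delegates to batch_act()."""
+
+    def observe(self, observation):
+        # independent state: remember the last message in *this* conversation
+        self.observation = observation
+
+    def batch_act(self, observations):
+        # the one place where the expensive, vectorized work happens for a batch
+        return [{'text': 'echo: ' + obs.get('text', '')} for obs in observations]
+
+    def act(self):
+        # a single conversation is handled as a batch of size 1
+        return self.batch_act([self.observation])[0]
+```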
+
+
 :::{tip} Implementing batch\_act and act
 Tip: if you implement `batch_act()`, your `act()` method can just call
-`batchact()` and pass the observation it is supposed to process in a
+`batch_act()` and pass the observation it is supposed to process in a
 list of length 1.
 :::
 
+## Dynamic Batching
+
+:::{note}
+This is only a sketch. The real code is more complicated, but advanced users
+may be interested in
+[reading it](https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/worlds.py).
+:::
+
+As a final diversion, we'll discuss at a high level how Dynamic Batching
+(also known as adaptive batching) is implemented. Dynamic Batching is supported
+by all TorchAgents by default and can give you a
+[2-3x speedup in training](tutorial_fast).
+
+Dynamic batching is used to _maximize the usage of GPUs_ by grouping
+similar-length examples so that they are processed at the same time.
+Intuitively, to maximize usage of GPU memory, we can either process _a few very
+long conversations_ or we can process _many short conversations_. If we can do
+this artfully, we will be able to maximize throughput and minimize waste from
+[padding tokens](https://d2l.ai/chapter_recurrent-neural-networks/text-preprocessing.html).
+Padding tokens occur whenever one conversation is much longer than another,
+so we must _pad_ the batch with empty tokens to make our tensors full rectangles.
+
+A simple algorithm for minimizing padding is to "group" similar-length
+examples so they are processed at the same time. To do this, we will
+maintain a buffer of many in-progress conversations.
+
+Let's imagine we have a buffer of 12 conversations going on at once, and a
+batchsize of 3. Each of the conversations is in the following state:
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+0 | 0 | Frankly, my dear, I don't give a damn. | 8
+1 | 0 | I'm going to make him an offer he can't refuse. | 10
+2 | 0 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+3 | 1 | Here's Johnny! | 2
+4 | 1 | Expecto patronum! | 2
+5 | 1 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+6 | 2 | I'm walking here! I'm walking here! | 6
+7 | 2 | You are without a doubt the worst pirate I've ever heard of. | 12
+8 | 2 | May the Force be with you. | 6
+9 | 3 | A million dollars isn't cool. You know what's cool? A billion dollars. | 12
+10 | 3 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+11 | 3 | I'm king of the world! | 5
+
+Naively processing each of these conversations in order will result in very
+uneven batches: the very long LOTR quote is combined with our shortest
+utterances. Instead, we can organize our conversations so that like-lengthed
+ones are next to each other. This will save processing time and minimize
+padding.
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+3 | 0 | Here's Johnny! | 2
+4 | 0 | Expecto patronum! | 2
+11 | 0 | I'm king of the world! | 5
+6 | 1 | I'm walking here! I'm walking here! | 6
+8 | 1 | May the Force be with you. | 6
+0 | 1 | Frankly, my dear, I don't give a damn. | 8
+1 | 2 | I'm going to make him an offer he can't refuse. | 10
+7 | 2 | You are without a doubt the worst pirate I've ever heard of. | 12
+9 | 2 | A million dollars isn't cool. You know what's cool?
A billion dollars. | 12
+10 | 3 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+2 | 3 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+5 | 3 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+
+This is called the __batchsort__ algorithm. Note that our batch size remains
+fixed at 3, but we've grouped the longest examples so that they occur in the
+same batch.
+
+:::{tip}
+You can use this in your training or evaluation runs with `-dynb batchsort` or
+`--dynamic-batching batchsort`.
+:::
+
+:::{note}
+Notice how the conversations no longer play in order. ParlAI handles all of this
+extra complexity for you. Remember, you only need to implement state tracking
+for _individual_ conversations, along with `batch_act`.
+:::
+
+But we can take this one step further, and implement fully dynamic batching.
+In dynamic batching, we grow our batches so that the number of _words_ per batch
+stays relatively fixed. This means we can process many short conversations at
+the same time. Let's imagine we set the total number of words per batch to
+be 80.
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+5 | 0 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+2 | 0 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+10 | 0 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+9 | 0 | A million dollars isn't cool. You know what's cool? A billion dollars. | 12
+7 | 1 | You are without a doubt the worst pirate I've ever heard of. | 12
+1 | 1 | I'm going to make him an offer he can't refuse. | 10
+0 | 1 | Frankly, my dear, I don't give a damn. | 8
+6 | 1 | I'm walking here! I'm walking here! | 6
+8 | 1 | May the Force be with you. | 6
+11 | 1 | I'm king of the world! | 5
+3 | 1 | Here's Johnny! | 2
+4 | 1 | Expecto patronum! | 2
+
+With this algorithm, we try to get as close to 80 words as possible without
+going over. The first batch ends up with exactly 80 words; the second ends up
+with only 51. But we've now reduced the number of batches we need to process
+from 4 to only 2! This is how dynamic batching can provide
+[massive speed ups](tutorial_fast).
+
+:::{tip}
+You can use this mode with `-dynb full` or `--dynamic-batching full`.
+:::
-![image](_static/img/world_batchboth.png)
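+
+For intuition only, here is a rough sketch (our own illustration, not ParlAI's
+actual implementation) of the greedy word-budget grouping described above. With
+a budget of 80 words and the quotes from the table, it produces the same two
+batches: one with exactly 80 words and one with 51.
+
+```python
+def pack_by_word_budget(messages, budget=80):
+    """Greedily group messages, longest first, staying under a word budget."""
+    remaining = sorted(messages, key=lambda m: len(m.split()), reverse=True)
+    batches, current, used = [], [], 0
+    for msg in remaining:
+        n = len(msg.split())
+        if current and used + n > budget:
+            batches.append(current)  # budget reached: start a new batch
+            current, used = [], 0
+        current.append(msg)
+        used += n
+    if current:
+        batches.append(current)
+    return batches
+```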