diff --git a/docs/source/_static/img/world_batchboth.png b/docs/source/_static/img/world_batchboth.png
deleted file mode 100644
index 3a3c0353520..00000000000
Binary files a/docs/source/_static/img/world_batchboth.png and /dev/null differ
diff --git a/docs/source/_static/img/world_hogwild.png b/docs/source/_static/img/world_hogwild.png
deleted file mode 100644
index 927f7df8044..00000000000
Binary files a/docs/source/_static/img/world_hogwild.png and /dev/null differ
diff --git a/docs/source/tutorial_basic.md b/docs/source/tutorial_basic.md
index 1a5511a449d..570339a30a5 100644
--- a/docs/source/tutorial_basic.md
+++ b/docs/source/tutorial_basic.md
@@ -171,15 +171,13 @@ Another simple world we include is MultiAgentDialogWorld, which is
similar but generalizes this to cycle between any number of agents in a
round robin fashion.
-### Advanced Worlds
-
-We also include a few more advanced "container" worlds: in particular,
-we include both a BatchWorld and a HogwildWorld. These worlds are
-automatically used when either the `numthreads` parameter or the
-`batchsize` parameter are set to greater than one. Some extra
-functionality is needed to get these to work on the side of both the
-teacher and the learner, but we'll cover that in a different tutorial
-(see: tutorial\_worlds).
+:::{note} Advanced Worlds
+We also include a few more advanced "container" worlds: in particular, we
+include both a BatchWorld and a DynamicBatchWorld. These worlds may be used when
+certain options are set. See the [Worlds](tutorial_worlds) tutorial to
+understand how these work.
+:::
+
Using ParlAI
------------
diff --git a/docs/source/tutorial_worlds.md b/docs/source/tutorial_worlds.md
index 87885ced0b5..04f695cfdc9 100644
--- a/docs/source/tutorial_worlds.md
+++ b/docs/source/tutorial_worlds.md
@@ -1,9 +1,9 @@
-Data Handling and Batching
+Worlds, Sharing & Batching
==========================
-__Authors__: Alexander Holden Miller, Kurt Shuster
+__Authors__: Alexander Holden Miller, Kurt Shuster, Stephen Roller
-:::{tip} Before you begin
+:::{important} Before you begin
If you are unfamiliar with the basics of displaying data or calling train or
evaluate on a model, please first see the [getting started](tutorial_basic)
section. If you are interested in creating a task, please see
@@ -13,120 +13,363 @@ section. If you are interested in creating a task, please see
Introduction
------------
-This tutorial will cover the details of batched data, and why we use
-Shared Worlds.
+This document provides an overview of [__Worlds__](parlai.core.worlds.World), a
+core concept within ParlAI. It is a high-level description of _how_ some
+concepts are implemented in ParlAI, so that you know what's happening behind
+the scenes; you are unlikely to ever write a custom world yourself.
+
+Worlds are where agents live, and the world defines the communication flow
+between agents. If you are familiar with
+[Reinforcement Learning](https://en.wikipedia.org/wiki/Reinforcement_learning),
+Worlds are roughly analogous to environments.
+
+At the most basic level, worlds simply house agents and pass messages between
+each of them. Each message pass happens as a __Parley__. For example, the
+[DialogPartnerWorld](parlai.core.worlds.DialogPartnerWorld) contains this rough
+pseudo-code as its parley:
+
+:::{note}
+This is only pseudo-code, but the
+[real code](https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/worlds.py)
+is reasonably easy to read, even for newcomers.
+:::
+
+```python
+class SimpleWorld(World):
+    def __init__(self, opt, agents):
+        self.teacher, self.model = agents
+
+    def parley(self):
+        # produce an example dataset item
+        teacher_act = self.teacher.act()
+        # perform preprocessing, vectorizing, etc.
+        self.model.observe(teacher_act)
+        # produce a model response
+        model_act = self.model.act()
+        # compute any metrics of the response
+        self.teacher.observe(model_act)
+```
-With relatively small modifications to a basic agent, it will be able to
-support multithreading and batching. If you need extra speed or are
-using a very large dataset which does not fit in memory, we can use a
-multiprocessed pytorch dataloader for improved performance.
+This is represented by the following image:
-First, let's consider a diagram of the basic flow of DialogPartnerWorld,
-a simple world with two conversing agents.
+
+
+
-![image](_static/img/world_basic.png)
-The teacher generates a message, which is shown to the agent. The agent
-generates a reply, which is seen by the teacher.
+The parley method is usually run in a loop, until we run out of examples:
+
+```python
+while not world.epoch_done():
+    world.parley()
+```
-### Expanding to batching using share()
+This simple loop structure is powerful, and allows us to present a unified view
+of all agents, whether they are [datasets](tutorial_task), [models](core/agents),
+or humans connected via a [chat service](tutorial_chat_service):
+there is always a __world__ and it facilitates agents passing messages back
+and forth.
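+
+For concreteness, a rough end-to-end sketch of driving such a loop might look
+like the following. This is only a sketch: the specific task and model flags
+are illustrative, but `ParlaiParser`, `create_agent`, and `create_task` are
+part of ParlAI's core API.
+
+```python
+from parlai.core.params import ParlaiParser
+from parlai.core.agents import create_agent
+from parlai.core.worlds import create_task
+
+# build an options dict; the task and model below are only examples
+parser = ParlaiParser(add_parlai_args=True, add_model_args=True)
+opt = parser.parse_args(['--task', 'babi:task10k:1', '--model', 'repeat_label'])
+
+agent = create_agent(opt)        # the "model" agent
+world = create_task(opt, agent)  # wraps the task's teacher and the agent in a world
+
+while not world.epoch_done():
+    world.parley()               # one message exchange between teacher and agent
+```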
-For all tasks one might make, there's one function we need to support
-for batching: `share()`. This function should provide whatever is needed
-to set up a "copy" of the original instance of the agent for either each
-row of a batch.
+However, this loop may be inefficient. Modern hardware like GPUs and learning
+algorithms like Stochastic Gradient Descent benefit from _batching_, or
+processing multiple items at the same time.
-We create shared agents by instantiating them in the following way:
+In the remainder of this document, we will cover how Batching works, and then
+discuss the implementation of Dynamic Batching.
+
+## Agent Sharing
+
+_Agent Sharing_ is the primary mechanism by which we implement batching, model
+serving, and many other ParlAI features. Agent sharing works by creating
+_clones_ of an Agent. Each clone of an Agent has _shared state_ and
+_independent state_. Independent state includes things like the current
+dialogue context (conversation history, which is different for every batch
+element). Shared state includes things like the model weights.
```python
-Agent0 = Agent(opt)
-...
-Agent1 = Agent(opt, Agent0.share())
-Agent2 = Agent(opt, Agent0.share())
-Agent3 = Agent(opt, Agent0.share())
+agent_original = Agent(opt)
+agent_copy1 = agent_original.clone()
+agent_copy2 = agent_original.clone()
+agent_copy3 = agent_original.clone()
```
-![image](_static/img/world_share.png)
-
-That is, the executed are:
-
-1. We set up a starting instance of the world: that is, we use
- `create_task` to set up a base world with the appropriate agents and
- tasks.
-2. We pass this world to a `BatchWorld`.
-3. We create `batchsize` worlds, each initialized from a `share()`'d
- version of the world and the agents therein.
-
-Now, every time we call `parley` on this BatchWorld, we will complete
-`batchsize` examples.
-
-![image](_static/img/world_batchbasic.png)
-
-There's a few more complex steps to actually completing a parley in this
-world.
-
-1. Call `parley_init` on each shared world, if the world has it
- implemented. Most classes don't need this--we currently only use it
- for our `MultiWorld`, which handles the case when one specifies
- multiple separate tasks to launch (e.g. `-t babi,squad`). This does
- any pre-parley setup, here choosing which sub-tasks to use in the
- next parley.
-2. Then, iterate over the number of agents involved in the task. For
- most tasks, this is just two agents: the teacher (task) and the
- student (model). For each agent, we do two steps:
-
- a. Call `BatchWorld.batch_act()`. This method first checks if the
- __original__ instance of the agent (not the copies) has a
- function named `batch_act` implemented and does not have an
- attribute `use_batch_act` set to `False`. This function is
- described more below. If condition is not met, the BatchWorld's
- `batch_act` method iterates through each agent copy in the batch
- and calls the `act()` method for that instance. This is the
- default behavior in most circumstances, and allows agents to
- immediately work for batching without any extra work--the
- batch\_act method is merely for improved efficiency.
- b. Call `BatchWorld.batch_observe()`. This method goes through
- every __other__ agent, and tries to call the `observe()` method
- on those agents. This gives other agents (usually, just the one
- other agent) the chance to see the action of the agent whose
- turn it is to act currently.
-
-Next, we'll look at how teachers and models can take advantage of the
-setup above to improve performance.
-
-### Batched Models
-
-Finally, models need to be able to handle observations arriving in
-batches.
-
-The first key concept to remember is that, if the model agent implements
-`batch_act()`, __act will not be called__ as long as `batchsize` > 1.
-
-However, copies of the agent will still be created, and the `observe`
-method of each one will be called. This allows each copy to maintain a
-state related to a single row in the batch. Remember, since each row in
-the batch is represented by a separate world, they are completely
-unrelated. This means that the model only needs to be set up in the
-original instance, and need not be shared with its copies.
-
-The `observe()` method returns a (possibly modified) version of the
-observation it sees, which are collected into a list for the agent's
-`batch_act()` method.
-
-![image](_static/img/world_batchagent.png)
-
-Now, the agent can process the entire batch at once. This is especially
-helpful for GPU-based models, which prefer to process more examples at a
-time.
+Cloning is why most agents have this rough structure in their initialization
+procedures:
-:::{tip} Implementing batch\_act and act
+```python
+class PseudoAgent(Agent):
+    def __init__(self, opt: Opt, shared: dict = None):
+        super().__init__(opt, shared)
+        if shared is None:
+            # When shared is None, we are the original, and we need to
+            # initialize any state.
+            self.model = self.build_model()
+        else:
+            # When shared is NOT None, we are a clone. Use the shared dict
+            # to set any variables; this is where we incorporate any
+            # shared state.
+            self.model = shared['model']
+
+    def share(self):
+        # This is used to create the "shared" dictionary that clones will
+        # receive in initialization. Make sure you enumerate everything
+        # that needs to be seen by all clones!
+
+        # You will probably never call this method yourself. Rather, it is
+        # called automatically by `clone()`.
+        shared = super().share()
+        shared['model'] = self.model
+        return shared
+```
+
+:::{note} Real implementations
+The above is really just a very simple example. You may find it illuminating to
+read the implementation of
+[UnigramAgent](https://github.com/facebookresearch/ParlAI/blob/master/parlai/agents/unigram/unigram.py)
+or even [TorchAgent](parlai.core.torch_agent.TorchAgent).
+:::
+
+
+:::{note} Usefulness
+As a user of ParlAI, you are likely to use the above pattern frequently.
+Some things are automatically shared by our abstractions (like models and
+optimizers), but you may sometimes need to share extra state yourself.
+:::
+
+Each clone of a model is relatively cheap to create: the only new memory used
+is the dialogue context. The dictionary returned by `share()` lets us reuse
+memory for expensive objects, like neural network weights.
+
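+For instance, cloning reuses the expensive objects instead of rebuilding them.
+This small sketch reuses the hypothetical `PseudoAgent` from above:
+
+```python
+# clone() is roughly equivalent to PseudoAgent(opt, shared=original.share())
+original = PseudoAgent(opt)
+copy = original.clone()
+
+# both agents refer to the *same* model object, so no extra model memory is used
+assert copy.model is original.model
+```
+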
+At a glance, creating these clones may seem strange. Why do we create all these
+copies? Differentiating between shared and independent state lets us write our
+agents with _independence_ in mind. Agents only need to __focus on a single
+conversation__, maintaining and manipulating the state for just that conversation.
+Every clone will be maintaining _separate conversations_, so when we implement
+a new dataset or model, we only focus on one conversation at a time.
+
+This is the fundamental mechanism behind batching and chat services:
+each clone of the agent only has to focus on one conversation at a time, with
+only a few specific spots for synchronization. In the next section, we'll
+take a look at how this is used in batching.
+
+:::{note} When do I clone?
+Note that while cloning is fundamental to the inner workings of ParlAI, it is
+rare that you will need to call clone yourself, unless you are creating
+custom worlds or backends.
+:::
+
+## Batching
+
+:::{warning} Already implemented in `TorchAgent`.
+If you're implementing your own custom Neural Network model, you don't need to
+worry about this, except at a high level. This is all handled for you by
+`TorchAgent`.
+:::
+
+Equipped with the ability to create clones, we can implement batching. For a
+batchsize of 3, we will create 3 clones of the Teacher, and 3
+clones of the Agent. Each of these clones will maintain their own separate,
+independent conversations. Naively, this could be implemented with a simple
+for loop:
+
+```python
+class NaiveBatchWorld(World):
+    def __init__(self, opt, agents):
+        # store the originals
+        self.teacher, self.model = agents
+        # make batchsize copies of the teacher and the model
+        self.teacher_copies = []
+        self.model_copies = []
+        self.batchsize = opt['batchsize']
+        for _ in range(self.batchsize):
+            self.teacher_copies.append(self.teacher.clone())
+            self.model_copies.append(self.model.clone())
+```
+This initialization code is then represented by this graphic:
+
+
+
+
+
+We continue with the implementation of parley:
+
+```python
+    def parley(self):
+        for i in range(self.batchsize):
+            # produce an example dataset item
+            teacher_act = self.teacher_copies[i].act()
+            # perform preprocessing, vectorizing, etc.
+            self.model_copies[i].observe(teacher_act)
+            # produce a model response
+            model_act = self.model_copies[i].act()
+            # compute any metrics of the response
+            self.teacher_copies[i].observe(model_act)
+```
+
+
+
+
+
+However, this is inefficient, and prevents us from utilizing the
+vectorization capabilities of modern GPUs. Instead, we'll implement a special
+`batch_act` method, which handles all the acts at once:
+
+```python
+class BatchWorld(World):
+    def __init__(self, opt, agents):
+        # store the originals
+        self.teacher, self.model = agents
+        # make batchsize copies of the teacher and the model
+        self.teacher_copies = []
+        self.model_copies = []
+        self.batchsize = opt['batchsize']
+        for _ in range(self.batchsize):
+            self.teacher_copies.append(self.teacher.clone())
+            self.model_copies.append(self.model.clone())
+
+    def parley(self):
+        observations = []
+        for i in range(self.batchsize):
+            # produce an example dataset item
+            teacher_act = self.teacher_copies[i].act()
+            # perform preprocessing, vectorizing, etc.
+            observations.append(self.model_copies[i].observe(teacher_act))
+
+        # now batch_act can efficiently do everything on the GPU at once
+        model_acts = self.model.batch_act(observations)
+
+        # distribute the results of batch_act back to the individual conversations
+        for i in range(self.batchsize):
+            # self_observe is how we tell each copy what its individual
+            # action was
+            self.model_copies[i].self_observe(model_acts[i])
+            # compute any metrics of the response
+            self.teacher_copies[i].observe(model_acts[i])
+```
+
+This logic is more complicated, but it enables us to efficiently implement
+batched operations. The new flow is illustrated in this graphic:
+
+
+
+
+
+
+:::{tip} Implementing batch\_act and act
Tip: if you implement `batch_act()`, your `act()` method can just call
-`batchact()` and pass the observation it is supposed to process in a
+`batch_act()` and pass the observation it is supposed to process in a
list of length 1.
:::
-Of course, this also means that we can use `batch_act` in both the task
-and the agent code:
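+
+As a sketch of that tip (the `MyAgent` class and its `compute_reply` helper are
+hypothetical; we only assume that `observe()` has stored the latest observation
+on `self.observation`):
+
+```python
+class MyAgent(Agent):
+    def batch_act(self, observations):
+        # process every observation in the list and return one reply per item
+        return [self.compute_reply(obs) for obs in observations]
+
+    def act(self):
+        # a single act is just a batch of size 1
+        return self.batch_act([self.observation])[0]
+```
+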
+## Dynamic Batching
+
+:::{note}
+This is only a sketch. The real code is more complicated, but advanced users
+may be interested in
+[reading it](https://github.com/facebookresearch/ParlAI/blob/master/parlai/core/worlds.py).
+:::
+
+As a final diversion, we'll discuss at a high level how Dynamic Batching
+(also known as adaptive batching) is implemented. Dynamic Batching is supported
+by all TorchAgents by default and can give you a
+[2-3x speedup in training](tutorial_fast).
+
+Dynamic batching is used to _maximize the usage of GPUs_ by grouping
+similar-length examples so they are processed at the same time. Intuitively, to
+maximize usage of GPU memory, we can either process _a few very long
+conversations_ or _many short conversations_. If we can do this artfully, we will be able
+to maximize throughput and minimize waste from
+[padding tokens](https://d2l.ai/chapter_recurrent-neural-networks/text-preprocessing.html).
+Padding tokens occur whenever one conversation is much longer than another,
+so we must _pad_ the batch with empty tokens to make our tensors full rectangles.
+
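+To make the cost of padding concrete, here is a toy sketch of padding a batch
+of token sequences into a rectangle (the tokens are purely illustrative):
+
+```python
+batch = [
+    ["here's", "johnny", "!"],
+    ["may", "the", "force", "be", "with", "you", "."],
+]
+max_len = max(len(seq) for seq in batch)
+padded = [seq + ["<pad>"] * (max_len - len(seq)) for seq in batch]
+# the short sequence now carries 4 <pad> tokens that the GPU still has to process
+```
+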
+As a simple algorithm for minimizing padding, we can "group" similar-length
+examples so they are processed at the same time. To do this, we maintain a
+buffer of many in-progress conversations.
+
+Let's imagine we have a buffer of 12 conversations going on at once, and a
+batchsize of 3. Each of the conversations is in the following state:
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+0 | 0 | Frankly, my dear, I don't give a damn. | 8
+1 | 0 | I'm going to make him an offer he can't refuse. | 10
+2 | 0 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+3 | 1 | Here's Johnny! | 2
+4 | 1 | Expecto patronum! | 2
+5 | 1 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+6 | 2 | I'm walking here! I'm walking here! | 6
+7 | 2 | You are without a doubt the worst pirate I've ever heard of. | 12
+8 | 2 | May the Force be with you. | 6
+9 | 3 | A million dollars isn't cool. You know what's cool? A billion dollars. | 12
+10 | 3 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+11 | 3 | I'm king of the world! | 5
+
+Naively processing each of these conversations in order will result in very
+uneven batches: the very long LOTR quote is combined with our shortest
+utterances. Instead, we can organize our conversations so that similar-length
+utterances are next to each other. This will save processing time and minimize padding.
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+3 | 0 | Here's Johnny! | 2
+4 | 0 | Expecto patronum! | 2
+11 | 0 | I'm king of the world! | 5
+6 | 1 | I'm walking here! I'm walking here! | 6
+8 | 1 | May the Force be with you. | 6
+0 | 1 | Frankly, my dear, I don't give a damn. | 8
+1 | 2 | I'm going to make him an offer he can't refuse. | 10
+7 | 2 | You are without a doubt the worst pirate I've ever heard of. | 12
+9 | 2 | A million dollars isn't cool. You know what's cool? A billion dollars. | 12
+10 | 3 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+2 | 3 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+5 | 3 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+
+This is called the __batchsort__ algorithm. Note that our batch size remains
+fixed at 3, but we've grouped our longest examples so that they are processed
+at the same time.
+
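+A rough sketch of the idea (not ParlAI's actual implementation) is to sort the
+buffer by length and chunk it into fixed-size batches:
+
+```python
+def batchsort(buffer, batchsize):
+    # buffer is a list of (conversation_id, num_words) pairs
+    ordered = sorted(buffer, key=lambda item: item[1])
+    return [ordered[i:i + batchsize] for i in range(0, len(ordered), batchsize)]
+
+buffer = [(0, 8), (1, 10), (2, 17), (3, 2), (4, 2), (5, 38),
+          (6, 6), (7, 12), (8, 6), (9, 12), (10, 13), (11, 5)]
+batches = batchsort(buffer, batchsize=3)
+# first batch: [(3, 2), (4, 2), (11, 5)] -- the three shortest messages together
+```
+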
+:::{tip}
+You can use this in your training or evaluation runs with `-dynb batchsort` or
+`--dynamic-batching batchsort`.
+:::
+
+:::{note}
+Notice how the conversations no longer play in order. ParlAI handles all of this
+extra complexity for you. Remember, you only need to implement state tracking
+for _individual_ conversations, along with `batch_act`.
+:::
+
+But we can take this even one step further, and implement fully dynamic batching.
+In dynamic batching, we grow our batches so that the number of _words_ per batch
+stays relatively fixed. This means we can process many short conversations at
+the same time. Let's imagine we set the total number of words per batch
+to be 80.
+
+ID | Batch #| Message | # Words
+---|--------|----------|-------
+5 | 0 | Do you wish me a good morning, or mean that it is a good morning whether I want it or not; or that you feel good this morning; or that it is a morning to be good on? | 38
+2 | 0 | Mama always said life was like a box of chocolates. You never know what you're gonna get. | 17
+10 | 0 | I'm as mad as hell, and I'm not going to take this anymore! | 13
+9 | 0 | A million dollars isn't cool. You know what's cool? A billion dollars. | 12
+7 | 1 | You are without a doubt the worst pirate I've ever heard of. | 12
+1 | 1 | I'm going to make him an offer he can't refuse. | 10
+0 | 1 | Frankly, my dear, I don't give a damn. | 8
+6 | 1 | I'm walking here! I'm walking here! | 6
+8 | 1 | May the Force be with you. | 6
+11 | 1 | I'm king of the world! | 5
+3 | 1 | Here's Johnny! | 2
+4 | 1 | Expecto patronum! | 2
+
+With this algorithm, we try to get as close to 80 words as possible without
+going over. The first batch ends up with exactly 80 words; the second batch
+ends up with only 51. But we've now reduced the number of batches we need to
+process from 4 to only 2! This is how dynamic batching can provide
+[massive speed ups](tutorial_fast).
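+
+A rough sketch of this greedy packing (again, just the idea rather than
+ParlAI's real implementation) might look like:
+
+```python
+def dynamic_batches(buffer, max_words=80):
+    # buffer is a list of (conversation_id, num_words) pairs, packed greedily
+    # from longest to shortest without exceeding the word budget
+    ordered = sorted(buffer, key=lambda item: item[1], reverse=True)
+    batches, current, current_words = [], [], 0
+    for conv_id, num_words in ordered:
+        if current and current_words + num_words > max_words:
+            batches.append(current)
+            current, current_words = [], 0
+        current.append((conv_id, num_words))
+        current_words += num_words
+    if current:
+        batches.append(current)
+    return batches
+
+buffer = [(0, 8), (1, 10), (2, 17), (3, 2), (4, 2), (5, 38),
+          (6, 6), (7, 12), (8, 6), (9, 12), (10, 13), (11, 5)]
+print([sum(words for _, words in batch) for batch in dynamic_batches(buffer)])
+# -> [80, 51]
+```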
+
+:::{tip}
+You can use this mode with `-dynb full` or `--dynamic-batching full`.
+:::
-![image](_static/img/world_batchboth.png)