Cool! Makes sense to me - will defer to Kurt to approve because I think he has more context on this, but happy to do so if desired. What value of `--num-workers` would you recommend for different cases? How much additional memory does this typically use?
parlai/core/teachers.py (Outdated)
-    def receive_data(self, future):
+    def receive_data(self, future, direct_result=False):
Hmm, maybe it'd be worth adding a line of documentation on `future`? I understand from the line above that it's a chunk, but offhand I don't even know what a chunk is...
-            self.re = re
-        except ImportError:
+        if regex is None:
Yeah, this was always weird
Number of workers gives diminishing returns, but I have still found that 8 is faster than 4. 10 is the "max" we can do on our cluster based on CPUs-per-GPU.
As far as memory, it's loading the dataset separately per worker. My chunk teachers haven't had issues tho.
@@ -2277,12 +2304,23 @@ def __init__(self, opt, shared=None):
             c for c in self.fold_chunks if c % self.dws == self.rank
         ]
 
+        # deal with --num-workers
+        self.threading = not (opt.get('num_workers', 0) > 0 and self.is_train)
so we still thread during validation?
Added some clarifying comments
thanks for the clarification
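For readers following the thread, the condition can be restated as a tiny standalone helper (illustrative names only, not ParlAI's API); it spells out which cases keep the original threaded path:

```python
def uses_threading(num_workers: int, is_train: bool) -> bool:
    # Mirrors the flag above: background *processes* are used only when
    # --num-workers > 0 AND we are training; everything else keeps the
    # original threaded behavior.
    return not (num_workers > 0 and is_train)

# The cases discussed in this thread:
assert uses_threading(num_workers=8, is_train=True) is False   # training: multiprocessing
assert uses_threading(num_workers=8, is_train=False) is True   # validation: still threaded
assert uses_threading(num_workers=0, is_train=True) is True    # no workers requested: threaded
```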
parlai/core/teachers.py (Outdated)
""" | ||
Loads data. | ||
|
||
Load data into self.samples until buffersize is reached. | ||
""" | ||
chunk_output, chunk_reset_cnt = future.result() | ||
output = future if direct_result else future.result() |
nit: `direct_result` is a bit ambiguous --> if true, we return the `future`, not the `future.result()`, which is somewhat counterintuitive?
Okay, refactored to be more clear I hope
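For context, one hypothetical way to avoid an ambiguous boolean like this (a sketch with made-up names, not necessarily the refactor that actually landed) is to keep `receive_data()` future-only and give the already-computed path its own entry point:

```python
from concurrent.futures import Future

class ChunkReceiver:
    """Hypothetical sketch, not ParlAI's actual teacher class."""

    def __init__(self):
        self.samples = []

    def _enqueue_chunk(self, chunk_output, chunk_reset_cnt):
        # Shared logic: push a loaded chunk's examples into the buffer.
        # (chunk_reset_cnt bookkeeping omitted in this sketch.)
        self.samples.extend(chunk_output)

    def receive_data(self, future: Future):
        """Callback for a chunk that was loaded asynchronously (a Future)."""
        self._enqueue_chunk(*future.result())

    def receive_direct_data(self, chunk_output, chunk_reset_cnt):
        """Entry point for a chunk that was already loaded synchronously."""
        self._enqueue_chunk(chunk_output, chunk_reset_cnt)
```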
@@ -162,20 +165,26 @@ def to(self, dev):
         """
         Move all tensors in the batch to a device.
 
+        NOT in place.
just curious, why not?
Turns out, attrdict doesn't link `batch.value` to `batch['value']` after initialization, so it was like real weird lol
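For anyone hitting the same surprise, here is a generic illustration of the pitfall described (ParlAI's actual Batch/attrdict may be implemented differently): if keys are only copied to attributes at construction time, later attribute writes silently diverge from item access, which is why returning a new object from `to()` is the safer choice.

```python
class NaiveAttrDict(dict):
    # A common attrdict pattern: keys are copied to attributes once, at init.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for key, value in kwargs.items():
            setattr(self, key, value)

b = NaiveAttrDict(value=1)
b.value = 2          # updates the attribute only...
print(b['value'])    # ...still prints 1: attribute and item access have diverged
```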
import torch.multiprocessing as mp

self._num_workers = self.opt['num_workers']
self._process_queue = mp.Queue(maxsize=4 * self._num_workers)
arbitrary?
ya seemed good to me lol
left a comment
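The exact multiplier is a tuning knob, but the bound itself matters; a short sketch of the same sizing rule (the worker count here is an illustrative value):

```python
import torch.multiprocessing as mp

num_workers = 4                            # illustrative value
queue = mp.Queue(maxsize=4 * num_workers)  # same sizing rule as the diff above

# Because the queue is bounded, queue.put() blocks once the consumer falls
# 4 * num_workers items behind, so workers cannot preprocess arbitrarily far
# ahead of the GPU and hold extra batches in memory.
```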
def launch_process(cls, index, opt, model_agent, process_queue):
    import torch

    torch.set_num_threads(1)  # prevent threads from spawning in this worker
is this local in scope?
ya i think so but not 100% sure
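For what it's worth, torch's intra-op thread-pool size should be a per-process setting, so the call only affects the worker it runs in; a quick standalone check under that assumption (not ParlAI code):

```python
import torch
import torch.multiprocessing as mp

def worker():
    torch.set_num_threads(1)  # should only affect this worker process
    print("worker threads:", torch.get_num_threads())

if __name__ == '__main__':
    before = torch.get_num_threads()
    p = mp.Process(target=worker)
    p.start()
    p.join()
    print("parent unchanged:", torch.get_num_threads() == before)
```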
really appreciate the clarifications, sorry for the super delayed re-review
docs/source/tutorial_fast.md (Outdated)
these tricks, or simply use these options in the first place.

This tutorial is only for illustration purposes of how to speed up training,
and may not get you the best _performing_ model.
assuming this is just a standard disclaimer? or do you actually see worse performance?
The LR isn't optimized for this, and many of them don't train for remotely long enough.
docs/source/tutorial_fast.md (Outdated)
This tutorial is only for illustration purposes of how to speed up training,
and may not get you the best _performing_ model.
:::

A summary of the speedups is in this table:
maybe indicate here, summary of speedups "for 90m blenderbot model" or something like that
| FP16 | 212s | 11s | 223s | 2.60x |
| Larger batchsize (FP16) | 168s | 10s | 178s | 3.59x |
| Background preprocessing | 144s | 10s | 154s | 4.15x |
| Using 4 GPUs | 63s | 4s | 67s | 8.64x |
gimme 8x all day
 --dynamic-batching full \
---fp16 true --batchsize 128
+--fp16 true --fp16-impl mem_efficient --batchsize 128 --eval-batchsize 256
do we need to explicitly specify which `--fp16-impl` for max efficiency?
docs/source/tutorial_fast.md (Outdated)
tokenization and dialogue state tracking, in the background thread. This can be
enabled by setting `--num-workers` to a value greater than 0. A good rule of
thumb is to set `--num-workers` to the number of CPU cores you have PER GPU.
Sometimes you can go a bit over, but you will need to play. On my server, there
nit: "you will need to play..." "around with it"?
--skip-generation true \
--dynamic-batching full \
--fp16 true --fp16-impl mem_efficient --batchsize 128 --eval-batchsize 256 \
--num-workers 8
nit: you set it to 8, tho you specify your setting as 4
docs/source/tutorial_fast.md (Outdated)
training runs like in this document. If we train for 4 epochs instead, our FP16
large-batch run takes 614 seconds and the multi-GPU training takes 222 seconds.
Also, if you have access to a SLURM cluster, distributed_train is sometimes
faster than multiprocessin_train. With SLURM, multi-GPU training takes 167 seconds
nit: multiprocessin
Hi
Sorry, we don't support Windows at this time. You can try without bg preprocessing.
Thanks for the speedy reply @stephenroller
I don't think bg preprocessing will do anything to save you GPU memory. It just does tokenization etc. in a background thread so the GPU is always being fed examples. Unfortunately, our implementation heavily depends on the semantics of fork, and won't work with others.
Ok, thanks! I've tried all the other tricks I can find to reduce GPU memory, such as blocklength set to 1, fp16 using mem_efficient, and adafactor optimization (I didn't want to reduce the various truncation settings from the BlenderBot2 defaults since I am not sure whether they would affect BlenderBot2's results), so I guess I'll have to live with the training speed I've got.
If you have multiple GPUs, --ddp-backend zero2 can help. I've got an intern who is also lamenting how slow training similar models is. I'll let you know how we speed it up.
Thanks, any help is appreciated. And unfortunately I only have 1 GPU …
Patch description
The fabled background preprocessing is here. You can enable it by setting `--num-workers` to a value greater than 0! Some rough estimates of performance increases:
Testing steps
Sooooo much manual testing. There is also new CI.