Make BufferShuffledExamplesIterable resumable #7056
Conversation
Oh cool! The time it takes to resume depends on the expected maximum distance in this case, right? Do you know its relationship with the buffer size? In your test it is already as high as 15k. Maybe we could just add a warning message on resuming to tell the user that it might take some time to recover the shuffle buffer (with a progress bar maybe?), and have the option to stop + re-run with an env variable to disable shuffle buffer recovery? WDYT?
Maybe there's a middle ground between rebuilding the buffer from scratch and storing the entire buffer, but the logic is a bit complicated and would take time to implement. At least for now, we have a way to make shuffled iterable datasets resumable.
@lhoestq I'm not sure it's ok to use a progress bar when there are multiple workers.
I feel like the default behavior should ideally be fast and perfect resuming. Loading from disk is a good option for this (although it's not always possible to serialize the content of the buffer; in that case the buffer would restart empty and we can show a warning). The state_dict() would be part of the training state_dict that is saved to disk along with the model and optimizer anyway. Cc @muellerzr who worked on storing training state_dicts. I also feel like it is simpler and more intuitive for users. It doesn't require explaining why we need to stream a lot of data just to recover a buffer.
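For illustration only, bundling the dataloader state with the rest of the training state could look like the sketch below; the model, optimizer, and `dataloader_state` are toy placeholders, not an existing API.

```python
# Hedged sketch: the dataloader/dataset state is just another entry in the training checkpoint.
# The model and optimizer are toy placeholders; `dataloader_state` stands in for
# StatefulDataLoader.state_dict().
import torch
from torch import nn

model = nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader_state = {"shuffle_buffer": "minimal state"}  # placeholder for the real state dict

torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "dataloader": dataloader_state,
    },
    "checkpoint.pt",
)

# on resume, restore everything from the same file
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
# dataloader.load_state_dict(checkpoint["dataloader"])  # restore the dataloader the same way
```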
Definitely, and it would also make things even harder for users to understand.
Yeah, I agree with you. But here's the thing: saving the buffers as a state dict can get pretty tricky. For tokenized text data, a multi-worker shuffle buffer can take hundreds of GB of memory in my case. That's just not feasible for most machine environments out there, and it can be even more severe for audio/video data. Also, serializing the buffer takes a major toll on performance; in my experience, I've had to lean heavily on numpy/torch tensor operations to manage tokenized text data efficiently, which isn't easily transferable to other scenarios. It's kind of a custom fix that works for now, not a one-size-fits-all solution. So for me, directly serializing the buffer content isn't ideal given those limitations.
It's kinda close to the size of a model + optimizer, no? Anyway, that makes sense, and so does adding the feature to recover the shuffle buffer (at least as an opt-in for now; we can decide on the default later based on user feedback and experience). Are you ok with adding it?
Of course, I appreciate your feedback.
This PR aims to implement a resumable `BufferShuffledExamplesIterable`. Instead of saving the entire buffer content, which is very memory-intensive, the newly implemented `BufferShuffledExamplesIterable` saves only the minimal state necessary for recovery, e.g., the random generator states and the state of the first example in the buffer dict.

The idea is that, since the buffer size is limited, even if the entire buffer is discarded, we can rebuild it as long as the state of the oldest example is recorded. For buffer size $B$, the distance between when an example is pushed and when it is yielded follows a geometric distribution: once the buffer is full, each incoming example evicts a uniformly chosen slot, so a buffered example is yielded at each step with probability $1/B$, giving an expected distance of
$d = \sum_{k=1}^{\infty} k \cdot \frac{1}{B}\left(1 - \frac{1}{B}\right)^{k-1} = B.$
Simulation experiments support these claims:
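As a rough illustration (this is not the original benchmark script; the stream, buffer size, and counts below are made up), one way to measure the push-to-yield distance is:

```python
# Rough illustration: measure the average distance between when an example enters
# a size-B shuffle buffer and when it is yielded (i.e. evicted by a newly pushed example).
import random

def average_distance(num_examples=1_000_000, buffer_size=1024, seed=0):
    rng = random.Random(seed)
    buffer = []        # stores the push index of each buffered example
    distances = []
    for i in range(num_examples):
        if len(buffer) < buffer_size:
            buffer.append(i)
            continue
        slot = rng.randrange(buffer_size)     # the slot whose example gets yielded
        distances.append(i - buffer[slot])    # how long that example waited in the buffer
        buffer[slot] = i
    return sum(distances) / len(distances)

print(average_distance())  # empirically close to buffer_size (1024 here)
```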
The average distance observed in these experiments is close to $B$, so the overall time for rebuilding the buffer during recovery should not be too long.
The following code mimics resuming online tokenization with `datasets` and `StatefulDataLoader` under a distributed setup, and the outputs confirm that iteration resumes as expected.
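As a simplified, single-process sketch of that pattern (the generator, the tokenize stand-in, and all sizes below are made up; it assumes `IterableDataset.state_dict()`/`load_state_dict()` and `torchdata`'s `StatefulDataLoader`), resuming could look like:

```python
# Simplified sketch (not the original benchmark): online tokenization on a shuffled
# streaming dataset, checkpointing mid-stream, then resuming from the saved state.
from datasets import IterableDataset
from torchdata.stateful_dataloader import StatefulDataLoader

def generate_examples():
    for i in range(100_000):
        yield {"text": f"example {i}"}

def tokenize(example):
    # stand-in for a real tokenizer
    return {"input_ids": [ord(c) for c in example["text"]]}

ds = (
    IterableDataset.from_generator(generate_examples)
    .map(tokenize)
    .shuffle(seed=42, buffer_size=1024)
)
loader = StatefulDataLoader(ds, batch_size=8, collate_fn=list, num_workers=0)

it = iter(loader)
for _ in range(100):
    next(it)                       # consume the first 100 batches
state = loader.state_dict()        # checkpoint mid-stream, including the dataset's shuffle state

# later / after a crash: rebuild everything and restore the saved state
resumed_loader = StatefulDataLoader(ds, batch_size=8, collate_fn=list, num_workers=0)
resumed_loader.load_state_dict(state)
next(iter(resumed_loader))         # picks up where the first loader stopped
```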
Not sure if this PR complies with the `datasets` code style. Looking for your help @lhoestq; I'm also very willing to further improve the code if you have any suggestions.