training_data_fraction is slow. #180
Comments
Can we pre-generate the hashes during indexing for tasks where we want to do this?
That might be the cleanest solution, yes.
Another solution is sharded training files - @pappagari is working on this for Reddit I believe. For these experiments we could generate, say, 10 files and use only some of them.
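A minimal sketch of the sharding idea: assign each example to one of N shards via a stable hash of its id, so that training on, say, 30% of the data means reading only 3 of 10 shard files. The function names (`shard_of`, `write_shards`) are hypothetical, not from the actual codebase, and a real implementation would stream each bucket to its own file rather than hold them in memory.

```python
import hashlib

def shard_of(example_id: str, num_shards: int = 10) -> int:
    # Deterministically assign an example to one of `num_shards` shards
    # based on a stable hash of its id (md5 is stable across runs,
    # unlike Python's built-in hash()).
    digest = hashlib.md5(example_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def write_shards(examples, num_shards: int = 10):
    # Group (id, example) pairs into shard buckets; in practice each
    # bucket would be written to its own file (e.g. train.00-of-10).
    shards = {i: [] for i in range(num_shards)}
    for ex_id, ex in examples:
        shards[shard_of(ex_id, num_shards)].append(ex)
    return shards
```

Because the shard assignment is deterministic, the same examples land in the same files on every run, which partly addresses the reproducibility concern (the remaining risk is that a fixed fraction always sees the same fixed subset, never a fresh random one).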
Relevant to @Jan21
The shard-based solution sounds more efficient, but is probably messier/less reproducible. May be wrong, though.
If you're using a small fraction (10% or less), you spend most of your time hashing examples, GPU usage drops, and samples per second drops dramatically. The lazy workaround for now is to maintain two cumbersome setups: training_data_fraction when training on more than 1% of the training data, and a custom data file plus a new task when training on less.
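To make the cost concrete, here is a generic sketch of hash-based fraction selection (not the project's actual implementation, and `keep_example` / `precompute_mask` are hypothetical names): each example is hashed and kept iff its hash falls in the first `fraction` of the hash space. Doing this once at indexing time and caching the boolean mask, as suggested above, avoids re-hashing every example on each run.

```python
import hashlib

def keep_example(text: str, fraction: float) -> bool:
    # Hash the example text and keep it iff the hash lands in the
    # first `fraction` of the hash space; deterministic across runs.
    h = int(hashlib.sha1(text.encode("utf-8")).hexdigest(), 16)
    return (h % 10_000) < int(fraction * 10_000)

def precompute_mask(texts, fraction: float):
    # The expensive part: one hash per example. Computing this once
    # during indexing and storing the mask makes later runs cheap.
    return [keep_example(t, fraction) for t in texts]
```

The hashing cost is the same regardless of `fraction`, which is why small fractions are the worst case: the per-example hashing dominates while only a handful of examples actually reach the GPU.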