[Question] Duplicate shard config names? #46
Comments
The first one is fixed via #58. Maybe the second works as intended?
The first is for loading a single shard of a single dataset. It is more of a hack to load a subset of one dataset and scale that particular dataset down, whereas the second simply scales everything down at once. The second is also useful if you want to experiment with a subset of the data so that it runs quickly; if it seems okay, you change the shard index and run the rest of the shards. The first hack could be improved by using a percentage or decimal representation, since shards don't have good granularity (50%, 33%, 25%, etc.).
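To make the difference concrete, here is a rough sketch of the two behaviours in plain Python. It is not the code in data.py: `take_shard` is a hypothetical helper written for this illustration, the toy lists stand in for real datasets, and the contiguous split is an assumption about how a shard is cut.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take_shard(rows: Sequence[T], num_shards: int, index: int) -> List[T]:
    """Return contiguous shard `index` out of `num_shards` (hypothetical helper)."""
    if not 0 <= index < num_shards:
        raise ValueError("index must be in [0, num_shards)")
    # Spread any remainder over the first `mod` shards so sizes differ by at most 1.
    div, mod = divmod(len(rows), num_shards)
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return list(rows[start:end])


# Two toy datasets standing in for real ones (names are made up).
alpaca = [f"alpaca-{i}" for i in range(10)]
oasst = [f"oasst-{i}" for i in range(6)]

# Case 1: per-dataset shard. Only `alpaca` is scaled down (to its first half);
# `oasst` is used in full.
per_dataset = take_shard(alpaca, num_shards=2, index=0) + oasst

# Case 2: global shard. The merged set is cut into 4 pieces; a quick experiment
# runs on one index, and the remaining indices can be run afterwards.
merged = alpaca + oasst
global_shards = [take_shard(merged, num_shards=4, index=i) for i in range(4)]
```

With only an integer shard count, the per-dataset case can express fractions like 1/2, 1/3, or 1/4 but nothing in between, which is the granularity problem mentioned above.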
Ah! I see now. It's single dataset vs. total. I guess this needs to be added to the README.
I noticed two different pieces of sharding code using different configs, in load_tokenized_prepared_datasets and load_prepare_datasets:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L114-L115
https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L345-L351
Not sure if these two parts should be combined and called from elsewhere, but I think the config should be unified to use the same name.