[Question] Duplicate shard config names? #46

Closed
NanoCode012 opened this issue May 25, 2023 · 3 comments · Fixed by #85

Comments

@NanoCode012
Collaborator

I noticed two different pieces of shard code using different config names in load_tokenized_prepared_datasets and load_prepare_datasets:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L114-L115

https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L345-L351

I'm not sure whether these two parts should be combined and called from a single place, but I think the config should be unified to use the same name.
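
For illustration, here is a minimal runnable sketch of the two shapes being discussed. The key names (d.shards vs. dataset_shard_num / dataset_shard_idx) are hypothetical stand-ins; the actual names are at the linked lines in src/axolotl/utils/data.py.

```python
from types import SimpleNamespace
from datasets import Dataset

# Toy dataset standing in for a tokenized dataset.
dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# Shape 1: per-dataset shard count, as read in load_tokenized_prepared_datasets.
d = SimpleNamespace(shards=4)  # hypothetical per-dataset config key
if d.shards:
    ds = dataset.shard(num_shards=d.shards, index=0)  # keep 1/4 of this one dataset

# Shape 2: global shard count, as read in load_prepare_datasets.
cfg = SimpleNamespace(dataset_shard_num=10, dataset_shard_idx=0)  # hypothetical top-level keys
if cfg.dataset_shard_num:
    dataset = dataset.shard(
        num_shards=cfg.dataset_shard_num,
        index=cfg.dataset_shard_idx or 0,
    )  # keep 1/10 of the combined data
```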

@NanoCode012
Collaborator Author

The first one was fixed via #58. Maybe the second works as intended?

@winglian
Collaborator

One is for loading a single shard of a single dataset. This is more of a hack to load a subset of data from one dataset in order to scale that particular dataset down, whereas the second one is for simply scaling everything down at once. The second one is also useful if you want to experiment with a subset of data so that it runs quickly, and, if it seems okay, change the shard index and run the rest of the shards.

The first hack could be improved by using a percentage or decimal representation, since shards don't give good granularity (50%, 33%, 25%, etc.).
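
A hedged sketch of the granularity point, using the Hugging Face datasets API; the fraction-based approach here is only an illustration of the idea, not something axolotl currently implements:

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(1000)]})

# Shard-based subsetting: only fractions of the form 1/num_shards are reachable.
quarter = dataset.shard(num_shards=4, index=0)            # 250 rows (25%)

# Fraction-based subsetting: any percentage works, e.g. 10%.
frac = 0.10
tenth = dataset.select(range(int(len(dataset) * frac)))   # 100 rows

print(len(quarter), len(tenth))
```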

@NanoCode012
Collaborator Author

Ah! I see now. It's a single dataset vs. the total. I guess this needs to be added to the README.
