[Question] Duplicate shard config names? #46

Closed
NanoCode012 opened this issue May 25, 2023 · 3 comments · Fixed by #85

Comments

@NanoCode012
Collaborator

I noticed two different pieces of shard code using different config names in load_tokenized_prepared_datasets and load_prepare_datasets:

https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L114-L115

https://github.com/OpenAccess-AI-Collective/axolotl/blob/a617f1b65eb3d986ab7844630944fe4c979158fe/src/axolotl/utils/data.py#L345-L351

I'm not sure whether these two parts should be combined and called from a single place, but I think the config should be unified to use the same name.
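
For illustration, here is a minimal runnable sketch of the two shapes being discussed. The key names (d.shards vs. dataset_shard_num / dataset_shard_idx) are hypothetical stand-ins; the actual names are at the linked lines in src/axolotl/utils/data.py.

```python
from types import SimpleNamespace
from datasets import Dataset

# Toy dataset standing in for a tokenized dataset.
dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# Shape 1: per-dataset shard count, as read in load_tokenized_prepared_datasets.
d = SimpleNamespace(shards=4)  # hypothetical per-dataset config key
if d.shards:
    ds = dataset.shard(num_shards=d.shards, index=0)  # keep 1/4 of this one dataset

# Shape 2: global shard count, as read in load_prepare_datasets.
cfg = SimpleNamespace(dataset_shard_num=10, dataset_shard_idx=0)  # hypothetical top-level keys
if cfg.dataset_shard_num:
    dataset = dataset.shard(
        num_shards=cfg.dataset_shard_num,
        index=cfg.dataset_shard_idx or 0,
    )  # keep 1/10 of the combined data
```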

@NanoCode012
Collaborator Author

The first one was fixed via #58. Maybe the second works as intended?

@winglian
Collaborator

One is for loading a single shard of a single dataset. This is more of a hack to load a subset of data from one dataset in order to scale that particular dataset down, whereas the second one is for simply scaling everything down at once. The second one is also useful if you want to experiment with a subset of data so that it runs quickly, and, if it seems okay, change the shard index and run the rest of the shards.

The first hack could be improved by using a percentage or decimal representation, since shards don't give good granularity (50%, 33%, 25%, etc.).
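
A hedged sketch of the granularity point, using the Hugging Face datasets API; the fraction-based approach here is only an illustration of the idea, not something axolotl currently implements:

```python
from datasets import Dataset

dataset = Dataset.from_dict({"text": [f"example {i}" for i in range(1000)]})

# Shard-based subsetting: only fractions of the form 1/num_shards are reachable.
quarter = dataset.shard(num_shards=4, index=0)            # 250 rows (25%)

# Fraction-based subsetting: any percentage works, e.g. 10%.
frac = 0.10
tenth = dataset.select(range(int(len(dataset) * frac)))   # 100 rows

print(len(quarter), len(tenth))
```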

@NanoCode012
Collaborator Author

Ah! I see now. It's a single dataset vs. the total. I guess this needs to be added to the README.
