Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PartialState().local_main_process_first() when map in examples #1926

Merged
merged 2 commits into from
Aug 14, 2024

Conversation

qgallouedec
Copy link
Member

No description provided.

@qgallouedec qgallouedec requested review from kashif and lewtun and removed request for kashif August 13, 2024 15:28
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec
Copy link
Member Author

Special attention here: do we need load_from_cache=False? I've removed them, otherwise it makes the loading in the main process first useless.

Copy link
Member

@lewtun lewtun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement @qgallouedec !

If I recall correctly, we set load_from_cache_file=False to avoid a subtle side effect where a training script is run twice with a different number of samples in the dataset (e.g. imagine you train once on 50% of the data, followed by 100% of the data). What happens in that case is that datasets caches the results from map() in the first run and reuses them in the second one because the fingerprint hasn't changed (even though the number of samples has).

Having said that, I think it's an advanced use case and for sample scripts we can keep them simple as possible. (At some point I/we should see how to resolve the root cause on the datasets side)

@qgallouedec qgallouedec merged commit f05f63c into main Aug 14, 2024
10 checks passed
@qgallouedec qgallouedec deleted the map_in_main_process_first branch August 14, 2024 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants