Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch exception raised in hf prep properly #749

Merged
merged 18 commits into from
Nov 21, 2023
Merged

Catch exception raised in hf prep properly #749

merged 18 commits into from
Nov 21, 2023

Conversation

j316chuck
Copy link
Contributor

@j316chuck j316chuck commented Nov 20, 2023

Description

Issue: There is a barrier deadlock in build_from_hf if an Exception is raised which causes errored runs to not fail properly and burn compute.

Concretely, the non local rank 0 processes depend on a signal file to be written but if a rank 0 process raises an exception then we never reach the dist.barrier() state resulting in a deadlock.

Solution: We add a try catch around the dataset preparation code to ensure that local rank 0 gets to writing/removing the signal file enabling all processes to reach the dist.barrier() state.

Tests:

python3   -m composer.cli.launcher -n 2 --master_port 26002   -m pytest -v --durations=20 -m 'gpu' -o tmp_path_retention_policy=none  -k "test_finetuning_dataloader_small"

=============================================================================================================== slowest 20 durations ================================================================================================================
3.69s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-True-2-4]
3.09s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-4-8]
3.05s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-False-2-8]
2.95s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-2-8]
2.74s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-False-2-8]
2.73s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-False-4-8]
2.63s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-True-4-8]
2.63s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-True-2-8]
2.56s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-2-4]
2.55s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-False-2-4]
2.50s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-4-4]
2.49s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-False-4-8]
2.43s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-False-4-4]
2.36s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-False-2-4]
2.32s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-True-4-4]
2.27s call     tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-False-4-4]
1.97s setup    tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-True-2-4]
0.24s setup    tests/test_dataloader.py::test_finetuning_dataloader_small_data[True-False-2-4]
0.24s setup    tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-2-4]
0.24s setup    tests/test_dataloader.py::test_finetuning_dataloader_small_data[False-True-4-8]
================================================================================================= 16 passed, 1030 deselected, 24 warnings in 48.55s =================================================================================================

Also ported the test_finetuning_dataloader_small to GPU tests since it wasn't being ran

j316chuck and others added 3 commits November 20, 2023 23:38
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, lets add a test for this. world size 2 with a broken jsonl file should reproduce for the test I think?

@dakinggg dakinggg enabled auto-merge (squash) November 21, 2023 20:19
@dakinggg dakinggg merged commit 7f5d70c into main Nov 21, 2023
10 checks passed
@dakinggg dakinggg deleted the chuck/test_raise branch February 3, 2024 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants