[WIP] : Fix resume issues with combined streaming dataset in dataloader #362
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #362 +/- ##
===================================
Coverage 78% 78%
===================================
Files 34 34
Lines 5016 5020 +4
===================================
+ Hits 3929 3934 +5
+ Misses 1087 1086 -1
for more information, see https://pre-commit.ci
Combined Dataset (no weights): resuming after the last epoch has fully completed now works.
Separated because the states were somehow accumulating from the previous test, leading to unexpected numbers of samples yielded:
tests/streaming/test_combined.py:974: AssertionError
----------------------------- Captured stdout call -----------------------------
{'dataset': {'0': {'num_samples_yielded': 3, 'num_workers': 4, 'batch_size': 4, 'current_epoch': 1, 'input_dir_path': '/tmp/pytest-of-runner/pytest-0/test_combined_dataset_dataload0/dataset_0', 'input_dir_url': None, 'item_loader': None, 'drop_last': False, 'seed': 42, 'world_size': 1, 'shuffle': True, 'subsampled_files': ['chunk-0-0.bin'], 'region_of_interest': [(0, 50)]}, '1': {'num_samples_yielded': 1, 'num_workers': 4, 'batch_size': 4, 'current_epoch': 1, 'input_dir_path': '/tmp/pytest-of-runner/pytest-0/test_combined_dataset_dataload0/dataset_1', 'input_dir_url': None, 'item_loader': None, 'drop_last': False, 'seed': 42, 'world_size': 1, 'shuffle': True, 'subsampled_files': ['chunk-0-0.bin'], 'region_of_interest': [(0, 50)]}}, 'current_epoch': 1, 'latest_worker_idx': 2, 'num_samples_yielded': {0: [15, 25], 1: [16, 20], 2: [16, 20], 3: [16, 16]}}
=========================== short test summary info ============================
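To make the "weird numbers" in the captured state easier to read, here is a small sketch that totals the top-level `num_samples_yielded` counters per dataset. It assumes each key is a worker index and each list holds one counter per dataset, in dataset order; that reading is inferred from the dump, not confirmed by the PR. The helper name `total_per_dataset` is hypothetical.

```python
# Per-worker counters copied from the captured stdout above.
# Assumption: keys are worker indices, list positions are dataset indices.
num_samples_yielded = {0: [15, 25], 1: [16, 20], 2: [16, 20], 3: [16, 16]}


def total_per_dataset(per_worker):
    """Sum the per-worker counters column-wise, giving one total per dataset."""
    return [sum(column) for column in zip(*per_worker.values())]


totals = total_per_dataset(num_samples_yielded)
print(totals)  # [63, 81]
```

If each dataset really holds 50 samples, as `region_of_interest: [(0, 50)]` seems to suggest, totals above 50 would be consistent with counters carried over from an earlier test rather than from a single epoch.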
…razy/litdata into fix/combined-dataset-loading-states
Hi @bhimrazy
Hi @deependujha, I'm still facing some issues with an
I haven't had much time lately, but I plan to continue working on this from this weekend.
GitGuardian id | GitGuardian status | Secret | Commit | Filename | |
---|---|---|---|---|---|
5685611 | Triggered | Generic High Entropy Secret | 3762b11 | tests/streaming/test_resolver.py | View secret |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn here the best practices.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future consider
- following these best practices for managing and storing secrets including API keys and other credentials
- install secret detection on pre-commit to catch secret before it leaves your machine and ease remediation.
@bhimrazy Is this an active issue? Last week I tried restarting training midway through an epoch while using a CombinedStreamingDataset, and I was able to continue training.
Thank you, @schopra8, for bringing this to my attention.
Before submitting
How does this PR impact the user?
Currently, users run into issues when resuming a combined streaming dataset with the streaming dataloader, because saving and restoring checkpoints doesn't work as expected. This PR addresses the root cause of the error so that the dataloader can resume from a checkpoint, making training workflows more reliable.
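As an illustration of what "resuming" is expected to mean here, the following toy sketch interleaves two in-memory lists and restores its position from a saved state. It is not litdata's implementation: `ToyCombined` is a hypothetical class, and the `state_dict` / `load_state_dict` names only follow the common checkpointing convention. The property it demonstrates is that the samples seen before the interruption plus the samples seen after the restore equal one uninterrupted pass.

```python
from itertools import islice


class ToyCombined:
    """Toy stand-in for a combined dataset: interleaves several lists and
    tracks how many samples each one has yielded (hypothetical, not litdata)."""

    def __init__(self, datasets):
        self.datasets = datasets
        self.yielded = [0] * len(datasets)

    def __iter__(self):
        while True:
            # Only datasets that still have samples left are candidates.
            remaining = [i for i, d in enumerate(self.datasets)
                         if self.yielded[i] < len(d)]
            if not remaining:
                return
            # The next source depends only on the counters, so a restored
            # state continues in the same order as an uninterrupted run.
            i = min(remaining, key=lambda j: self.yielded[j])
            item = self.datasets[i][self.yielded[i]]
            self.yielded[i] += 1
            yield item

    def state_dict(self):
        return {"num_samples_yielded": list(self.yielded)}

    def load_state_dict(self, state):
        self.yielded = list(state["num_samples_yielded"])


data = [["a0", "a1", "a2"], ["b0", "b1"]]

full = list(ToyCombined(data))            # one uninterrupted pass

interrupted = ToyCombined(data)
first = list(islice(interrupted, 3))      # stop midway through the epoch
state = interrupted.state_dict()          # what a checkpoint would store

resumed = ToyCombined(data)
resumed.load_state_dict(state)
rest = list(resumed)                      # continue from the saved counters

print(first + rest == full)
```

The design choice worth noting is that the next source is derived purely from the saved counters; if it depended on hidden iterator state, a restore could replay or skip samples.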
What does this PR do?
Fixes #331.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃