[WIP] fix: updates the training sampling strategy to complete the last batch #538

Draft
wants to merge 1 commit into main

Conversation

@wiitt (Collaborator) commented Aug 30, 2024

Fixes #438

PR Goal?

Updates the sampling strategy in training to complete the last batch with random samples from other batches, instead of dropping it.
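
For illustration, a minimal sketch of the idea behind such a sampler is shown below. The class name `CompletingBatchSampler` and its interface are hypothetical; the actual implementation in `everyvoice/dataloader/oversampler.py` may differ.

```python
import random

from torch.utils.data import Sampler


class CompletingBatchSampler(Sampler):
    """Yields batches of indices; the incomplete last batch is topped up with
    random samples from the preceding batches instead of being dropped."""

    def __init__(self, dataset_size: int, batch_size: int, shuffle: bool = True):
        # Assumes dataset_size >= batch_size, so there are earlier samples to draw fillers from.
        self.dataset_size = dataset_size
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(self.dataset_size))
        if self.shuffle:
            random.shuffle(indices)
        for start in range(0, self.dataset_size, self.batch_size):
            batch = indices[start : start + self.batch_size]
            if len(batch) < self.batch_size:
                # Complete the last batch with random samples from earlier
                # batches rather than dropping it (drop_last behaviour).
                fillers = random.sample(indices[:start], self.batch_size - len(batch))
                batch.extend(fillers)
            yield batch

    def __len__(self):
        # Every yielded batch is full, so round the batch count up, not down.
        return -(-self.dataset_size // self.batch_size)
```

Such a sampler would be passed to the `DataLoader` via its `batch_sampler` argument, e.g. `DataLoader(dataset, batch_sampler=CompletingBatchSampler(len(dataset), batch_size=16))`.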

Fixes?

Fixes #438

Feedback sought?

Whether a model works with this new sampler, and whether it produces results that are better, or at least not worse, after training.

Priority?

Low

Tests added?

No tests added, but it would be good to have some testing of this sampler.

How to test?

Place a breakpoint and inspect the composition of the last batch in an epoch. Check that the number of batches during training corresponds to expectations. Train a model in a scenario where the difference between dropping and keeping the last batch is noticeable (e.g. a very small dataset, or a dataset where the samples in the last batch contain unique phonemes). A sketch of a unit test for these checks is given below.
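
As a starting point, a unit test covering these checks might look like the following sketch. It assumes the hypothetical `CompletingBatchSampler` interface from the sketch above; the real sampler in `everyvoice/dataloader/oversampler.py` may expose a different interface.

```python
def test_last_batch_is_completed():
    dataset_size, batch_size = 10, 4  # 10 samples -> 2 full batches + 2 leftover samples
    sampler = CompletingBatchSampler(dataset_size, batch_size)

    batches = list(sampler)

    # The incomplete last batch is kept and topped up: ceil(10 / 4) = 3 batches.
    assert len(batches) == len(sampler) == 3
    # Every batch, including the last one, is full.
    assert all(len(batch) == batch_size for batch in batches)
    # No sample was dropped: every index still appears at least once.
    assert {i for batch in batches for i in batch} == set(range(dataset_size))
```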

Confidence?

Low. This code wasn't properly tested.

Version change?

No. Can be a part of a larger update.

Related PRs?

No.


semanticdiff-com bot commented Aug 30, 2024

Review changes with SemanticDiff.

Analyzed 1 of 2 files.

Overall, the semantic diff is 60% smaller than the GitHub diff.

Filename                                  Status
✔️ everyvoice/dataloader/__init__.py      59.3% smaller
everyvoice/dataloader/oversampler.py      Unsupported file format

@wiitt wiitt marked this pull request as draft August 30, 2024 21:52
@wiitt wiitt requested a review from roedoejet August 30, 2024 21:52

CLI load time: 0:00.23
Pull Request HEAD: ad8cd7f4850f6c316605a546d155c1c0ec65eb98
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package


codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 19.51220% with 33 lines in your changes missing coverage. Please review.

Project coverage is 75.60%. Comparing base (8fc4099) to head (ad8cd7f).
Report is 8 commits behind head on main.

Files with missing lines                 Patch %   Lines
everyvoice/dataloader/oversampler.py     20.00%    28 Missing ⚠️
everyvoice/dataloader/__init__.py        16.66%     5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #538      +/-   ##
==========================================
+ Coverage   74.48%   75.60%   +1.12%     
==========================================
  Files          45       46       +1     
  Lines        3029     3283     +254     
  Branches      491      580      +89     
==========================================
+ Hits         2256     2482     +226     
- Misses        679      704      +25     
- Partials       94       97       +3     


Successfully merging this pull request may close these issues:

- Synthesize can only process even multiples of the batch size