[dataset] support bucket by seq length #2333

Merged: 3 commits, Feb 4, 2024

Mddct (Collaborator) commented Feb 1, 2024

Follow-up to PR #2316.

TODO:

  • unit test
  • aishell training

Mddct (Collaborator, Author) commented Feb 1, 2024

It works:

    batch_conf:
        batch_type: 'bucket' # static or bucket or dynamic
        bucket_boundaries:  [500, 1000, 1500]
        bucket_batch_sizes: [82, 64, 32, 16]

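[500, 1000, 1500] means the following buckets:
[0, 500), [500, 1000), [1000, 1500), [1500, sys.maxsize)
-> one bucket for utterances shorter than 5 s (yields batches with batch size = 82)
-> one bucket for 5-10 s
-> one bucket for 10-15 s
-> one bucket for everything longer than 15 s
Each bucket yields batches of its corresponding batch size.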
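The mapping from a sequence length to its bucket and batch size can be sketched as below. This is a minimal illustration, not the code merged in this PR: the helper names `bucket_index` and `batch_size_for` are hypothetical, and the sketch assumes lengths and boundaries are in the same unit.

```python
import bisect


def bucket_index(length, boundaries):
    """Return the index of the bucket that `length` falls into.

    With boundaries [500, 1000, 1500] the buckets are
    [0, 500), [500, 1000), [1000, 1500) and [1500, inf).
    """
    # bisect_right sends a length equal to a boundary to the next bucket,
    # which matches the half-open intervals above.
    return bisect.bisect_right(boundaries, length)


def batch_size_for(length, boundaries, batch_sizes):
    """Look up the batch size of the bucket that `length` belongs to."""
    # There must be exactly one batch size per bucket.
    if len(batch_sizes) != len(boundaries) + 1:
        raise ValueError("expected len(boundaries) + 1 batch sizes")
    return batch_sizes[bucket_index(length, boundaries)]


boundaries = [500, 1000, 1500]
batch_sizes = [82, 64, 32, 16]
for length in (120, 500, 1499, 3000):
    print(length,
          bucket_index(length, boundaries),
          batch_size_for(length, boundaries, batch_sizes))
```

Three boundaries define four buckets, which is why `bucket_batch_sizes` has one more entry than `bucket_boundaries`.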

train_transformer.yaml
Screenshot 2024-02-02 at 00 57 11

Previously, one epoch took 1100 batches:
Screenshot 2024-02-02 at 01 27 21

Now:
Screenshot 2024-02-02 at 01 22 05

Training time in raw mode: 14h13min

The results are consistent with those reported here: #2316 (comment)
Screenshot 2024-02-02 at 18 56 30

Mddct (Collaborator, Author) commented Feb 4, 2024

Tests with a larger bucket batch size (first bucket raised from 82 to 128):

    batch_conf:
        batch_type: 'bucket' # static or bucket or dynamic
        bucket_boundaries:  [500, 1000, 1500]
        bucket_batch_sizes: [128, 64, 32, 16]
  • raw mode, training time 14h11min
Screenshot 2024-02-04 at 09 45 21
  • shard mode, training time 13h10min
Screenshot 2024-02-04 at 09 43 26
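To see how the per-bucket batch sizes affect the number of batches, here is a hedged sketch of bucketed batching on synthetic lengths. It is not WeNet's implementation: `bucket_batches` and `count_batches` are hypothetical helpers, and the lengths are made up for illustration.

```python
import bisect


def bucket_batches(lengths, boundaries, batch_sizes):
    """Yield (bucket_id, batch) pairs; each batch is a list of lengths."""
    if len(batch_sizes) != len(boundaries) + 1:
        raise ValueError("expected len(boundaries) + 1 batch sizes")
    pending = [[] for _ in batch_sizes]  # one buffer per bucket
    for length in lengths:
        b = bisect.bisect_right(boundaries, length)
        pending[b].append(length)
        # Emit a batch as soon as its bucket reaches its own batch size.
        if len(pending[b]) == batch_sizes[b]:
            yield b, pending[b]
            pending[b] = []
    # Flush whatever is left as smaller, partial batches.
    for b, batch in enumerate(pending):
        if batch:
            yield b, batch


def count_batches(lengths, boundaries, batch_sizes):
    return sum(1 for _ in bucket_batches(lengths, boundaries, batch_sizes))


# Synthetic data: 256 short utterances and 64 medium ones.
lengths = [100] * 256 + [700] * 64
boundaries = [500, 1000, 1500]
print(count_batches(lengths, boundaries, [82, 64, 32, 16]))
print(count_batches(lengths, boundaries, [128, 64, 32, 16]))
```

On this toy input the larger first-bucket batch size yields fewer batches, since short utterances are packed more densely. Whether that shortens real training depends on the actual length distribution and on memory limits.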

Mddct (Collaborator, Author) commented Feb 4, 2024

Here are some of the metrics from the two tables above, listed together for easier comparison:

| batch size | data type | training time | WER (att / rescore / ctc greedy / ctc beam) |
|---|---|---|---|
| old io, static 26 | raw | 20h5min | 5.64 / 5.31 / 5.88 / 5.88 |
| static 26 | raw | 17h20min | 5.56 / 5.25 / 5.89 / 5.89 |
| static 26 | shard | 14h25min | 5.53 / 5.27 / 5.86 / 5.86 |
| bucket_boundaries: [500, 1000, 1500], bucket_batch_sizes: [82, 64, 32, 16] | raw | 14h13min | 5.58 / 5.27 / 5.82 / 5.82 |
| bucket_boundaries: [500, 1000, 1500], bucket_batch_sizes: [128, 64, 32, 16] | raw | 14h11min | 5.53 / 5.24 / 5.85 / 5.85 |
| bucket_boundaries: [500, 1000, 1500], bucket_batch_sizes: [128, 64, 32, 16] | shard | 13h10min | 5.59 / 5.20 / 5.73 / 5.72 |
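For a quick comparison of the training times listed above, the following sketch converts them to minutes and prints the reduction relative to the static 26 raw run (17h20min). The helper `to_minutes` and the row labels are hypothetical; the times themselves are the ones reported in this thread.

```python
import re


def to_minutes(text):
    """Parse a duration such as '14h13min' or '20h 5min' into minutes."""
    match = re.fullmatch(r"\s*(\d+)\s*h\s*(\d+)\s*min\s*", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


# Training times as reported in the thread.
runs = {
    "old io, static 26, raw": "20h 5min",
    "static 26, raw": "17h20min",
    "static 26, shard": "14h25min",
    "bucket [82, 64, 32, 16], raw": "14h13min",
    "bucket [128, 64, 32, 16], raw": "14h11min",
    "bucket [128, 64, 32, 16], shard": "13h 10min",
}

baseline = to_minutes(runs["static 26, raw"])
for name, duration in runs.items():
    minutes = to_minutes(duration)
    change = 100.0 * (1.0 - minutes / baseline)
    print(f"{name}: {minutes} min ({change:+.1f}% vs static 26 raw)")
```

A positive percentage means the run was faster than the baseline; the old io run comes out negative because it was slower.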

@robin1001 robin1001 merged commit b115913 into main Feb 4, 2024
6 checks passed
robin1001 (Collaborator) commented

Great job!
