
[dataset] new io for code reuse for many speech tasks #2316

Merged: 35 commits into main, Jan 31, 2024

Conversation

Mddct (Collaborator) commented Jan 22, 2024

#2152

  • TODO

    • chain call
    • support raw dataset source
    • support tar shard dataset source
    • prefetch in a synchronous way
    • mapper: ignore errors
    • sort data pipes
    • check that shuffle is deterministic
    • batching
      • static batch
      • dynamic batch
    • padding
    • unit tests
      • raw dataset
      • tar dataset
      • unit tests for all related functions in processor.py
    • reproduce experiments
      • raw
      • shard
  • next PR TODO
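The "chain call" item above refers to a fluent `.map(...).batch(...)` style over iterable datasets. A minimal sketch of that idea, in plain Python classes (the PR's actual datapipes subclass torch's IterableDataset; the class names here are illustrative, not WeNet's):

```python
# Minimal sketch of the "chain call" datapipe idea: every transform returns a
# new pipe, so calls can be chained. Class names are hypothetical.


class ChainablePipe:
    def map(self, fn):
        return MapperPipe(self, fn)

    def batch(self, batch_size, wrapper_class=None):
        return BatchPipe(self, batch_size, wrapper_class)


class ListSource(ChainablePipe):
    """Source pipe that yields items from an in-memory iterable."""

    def __init__(self, data):
        self.data = list(data)

    def __iter__(self):
        yield from self.data


class MapperPipe(ChainablePipe):
    """Applies fn to every sample from the upstream pipe."""

    def __init__(self, source, fn):
        self.source, self.fn = source, fn

    def __iter__(self):
        for sample in self.source:
            yield self.fn(sample)


class BatchPipe(ChainablePipe):
    """Groups samples into fixed-size batches, optionally wrapped (e.g. padded)."""

    def __init__(self, source, batch_size, wrapper_class=None):
        self.source, self.batch_size = source, batch_size
        self.wrapper_class = wrapper_class or (lambda b: b)

    def __iter__(self):
        buf = []
        for sample in self.source:
            buf.append(sample)
            if len(buf) == self.batch_size:
                yield self.wrapper_class(buf)
                buf = []
        if buf:  # flush the last, possibly short, batch
            yield self.wrapper_class(buf)


if __name__ == '__main__':
    pipe = ListSource(range(5)).map(lambda x: x * 2).batch(batch_size=2)
    print(list(pipe))  # [[0, 2], [4, 6], [8]]
```

Returning a new pipe from each transform is what makes the chained style in the examples below (`dataset.map(decode_wav).map(compute_fbank)`) work.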

Mddct (Collaborator, Author) commented Jan 23, 2024

Chaining works:

from torch.utils.data import DataLoader
from wenet.dataset.datapipes import (WenetRawDatasetSource,
                                     WenetTarShardDatasetSource)
from wenet.dataset.processor_v2 import (compute_fbank,
                                        compute_log_mel_spectrogram,
                                        decode_wav, parse_json)

if __name__ == '__main__':

    dataset = WenetTarShardDatasetSource(
        'test/resources/dataset/data.shards.list')
    # dataset = dataset.map(decode_wav)
    # dataset = dataset.map(compute_fbank)
    # dataset = dataset.map(compute_log_mel_spectrogram)
    dataloader = DataLoader(dataset,
                            num_workers=1,
                            persistent_workers=True,
                            batch_size=None)
    for d in dataloader:
        print(d['file_name'], d['tar_file_name'], d['txt'], len(d['wav']))
    for d in dataloader:
        print(d['file_name'], d['tar_file_name'], d['txt'], len(d['wav']))
    for d in dataloader:
        print(d['file_name'], d['tar_file_name'], d['txt'], len(d['wav']))

    dataset = dataset.map(decode_wav).map(compute_fbank)
    dataloader = DataLoader(dataset,
                            num_workers=1,
                            persistent_workers=True,
                            batch_size=None)
    for d in dataloader:
        print(d['file_name'], d['tar_file_name'], d['txt'], len(d['wav']),
              d['feat'].size())

    dataset = WenetRawDatasetSource("test/resources/dataset/data.list",
                                    prefetch=10)
    dataset = dataset.map(parse_json)
    dataset = dataset.map(decode_wav)
    dataset = dataset.map(compute_log_mel_spectrogram)
    dataloader = DataLoader(dataset,
                            num_workers=2,
                            persistent_workers=True,
                            batch_size=None)

    for d in dataloader:
        print(d.keys())

    for d in dataloader:
        print(d.keys())
[Screenshot: console output of the runs above]

Mddct (Collaborator, Author) commented Jan 23, 2024

Batching works:

    dataset = WenetRawDatasetSource("test/resources/dataset/data.list",
                                    prefetch=10)
    dataset = dataset.map(parse_json)
    dataset = dataset.map(decode_wav)
    dataset = dataset.map(compute_log_mel_spectrogram)

    def fake_labels(sample):
        sample['label'] = [1, 2, 3, 4]
        return sample

    dataset = dataset.map(fake_labels)
    dataset = dataset.batch(batch_size=2, wrapper_class=padding)

    for d in dataset:
        print(d)
[Screenshot: console output of the batched samples]
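The `padding` passed as `wrapper_class` above collates a list of samples into one padded batch. A hypothetical sketch of what such a wrapper might do for the variable-length `label` field (this is not WeNet's actual `padding` implementation, which also pads `feat` tensors):

```python
# Hypothetical padding wrapper: pad each sample's 'label' list to the longest
# one in the batch and record the original lengths.
def padding(batch, pad_value=0):
    max_len = max(len(s['label']) for s in batch)
    labels, lengths = [], []
    for s in batch:
        lengths.append(len(s['label']))
        labels.append(s['label'] + [pad_value] * (max_len - len(s['label'])))
    return {'labels': labels, 'label_lengths': lengths}


batch = [{'label': [1, 2, 3]}, {'label': [4]}]
print(padding(batch))
# {'labels': [[1, 2, 3], [4, 0, 0]], 'label_lengths': [3, 1]}
```

Keeping the original lengths alongside the padded tensor is what lets downstream loss functions mask out the padding.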

@Mddct Mddct force-pushed the Mddct-dataset-datapipes branch from 2c359da to 32721da Compare January 23, 2024 14:29
Mddct (Collaborator, Author) commented Jan 23, 2024

Dynamic batching works:

[Screenshot: console output of the dynamically batched samples]
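Dynamic batching groups utterances until the padded batch would exceed a frame budget (the `max_frames_in_batch` parameter mentioned in the benchmark below). A simplified sketch of the idea, not the PR's actual implementation:

```python
# Sketch of dynamic batching: accumulate samples until padding the batch to
# its longest member would exceed max_frames_in_batch, then emit the batch.
def dynamic_batch(samples, max_frames_in_batch):
    """samples: iterable of (utt_id, num_frames) pairs."""
    batch, longest = [], 0
    for sample in samples:
        frames = sample[1]
        new_longest = max(longest, frames)
        # Cost of the padded batch if this sample were added:
        if batch and new_longest * (len(batch) + 1) > max_frames_in_batch:
            yield batch
            batch, longest = [], 0
            new_longest = frames
        batch.append(sample)
        longest = new_longest
    if batch:
        yield batch


batches = list(dynamic_batch([('a', 100), ('b', 120), ('c', 300)], 400))
print(batches)  # [[('a', 100), ('b', 120)], [('c', 300)]]
```

Compared to a static batch size, this keeps the number of padded frames per batch roughly constant, so short utterances pack into larger batches and long ones into smaller ones.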

Review threads on test/wenet/dataset/test_datapipes.py and test/wenet/dataset/test_dataset.py (outdated, resolved).
@Mddct Mddct force-pushed the Mddct-dataset-datapipes branch from 451f66a to ca33aa5 Compare January 24, 2024 03:47
@Mddct Mddct changed the title [WIP][dataset] new io for code reuse for many speech tasks [dataset] new io for code reuse for many speech tasks Jan 27, 2024
Mddct (Collaborator, Author) commented Jan 28, 2024

In shard mode, the loss curves of the new IO and the old IO are nearly identical.
[Screenshot: loss curves of new IO vs. old IO]

Mddct (Collaborator, Author) commented Jan 28, 2024

Results reproduced in shard mode.

[Screenshot: reproduced experiment results]

@Mddct Mddct requested review from xingchensong, kobenaxie and robin1001 and removed request for kobenaxie January 28, 2024 10:34
xingchensong (Member)

great job!

@robin1001 robin1001 merged commit 4c81459 into main Jan 31, 2024
6 checks passed
@robin1001 robin1001 deleted the Mddct-dataset-datapipes branch January 31, 2024 02:59
robin1001 (Collaborator)

Big job!

kobenaxie (Contributor) commented Mar 28, 2024

Compared the speed of Lhotse and the new IO mentioned in #2152 (comment):

  • IO: minutes/epoch
    • new-IO-dynamic: 18
    • new-IO-bucket: 21
    • Lhotse-IO: 13

Configuration:

  • dynamic: max_frames_in_batch = 18000
  • bucket_boundaries: [500, 750, 1000]
  • bucket_batch_sizes: [36, 24, 18, 9]
  • num_workers = 8
  • prefetch = 20
  • Dataset: TAL (好未来) Chinese-English code-switching dataset
  • GPU: 8× A100
  • torch 2.2.1 + cu118
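The bucket configuration above assigns each utterance to a length bucket (three boundaries give four buckets, hence four batch sizes) and emits a batch when its bucket fills. A simplified sketch of that scheme, not the benchmarked implementation:

```python
import bisect

# Sketch of bucket batching: route each sample to a length bucket via its
# frame count; emit a bucket as a batch once it reaches that bucket's size.
def bucket_batch(samples, boundaries, batch_sizes):
    """samples: (utt_id, num_frames) pairs; len(batch_sizes) == len(boundaries) + 1."""
    buckets = [[] for _ in batch_sizes]
    for sample in samples:
        idx = bisect.bisect_right(boundaries, sample[1])
        buckets[idx].append(sample)
        if len(buckets[idx]) == batch_sizes[idx]:
            yield buckets[idx]
            buckets[idx] = []
    for bucket in buckets:  # flush partially filled buckets at the end
        if bucket:
            yield bucket


out = list(bucket_batch([('a', 5), ('b', 15), ('c', 7), ('d', 25)],
                        boundaries=[10, 20], batch_sizes=[2, 2, 1]))
print(out)  # [[('a', 5), ('c', 7)], [('d', 25)], [('b', 15)]]
```

Buckets of similar-length utterances waste fewer frames on padding than a single static batch size, at the cost of holding per-bucket buffers; this is the trade-off the minutes/epoch numbers above are measuring against dynamic batching.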
