National Speech Corpus data prep, optimizations in Cut, limited export to Kaldi data dir #149

pzelasko · 2020-11-23T22:04:00Z

I'm adding a data prep script for the National Speech Corpus (thousands of hours of Singaporean English). I want to build a Kaldi system for it, so I'm also adding a utility that exports a RecordingSet + SupervisionSet into a Kaldi data dir (it is limited, but may be a good starting point to extend for anybody who needs more). Since the recordings in this corpus are 2h long, there are a lot of SupervisionSegments, which revealed quadratic complexity in some CutSet operations. I am optimizing them by building an IntervalTree of supervisions that supports efficient queries for overlapping and over-spanning segments.
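To illustrate why indexing the supervisions helps, here is a minimal sketch of the idea. It is not the code from this PR: it swaps the IntervalTree for a simpler sorted-start index built on the standard-library `bisect` module, and `Segment` and `SegmentIndex` are hypothetical names used only for this example. Without an index, each overlap query scans every segment in the recording, so running one query per segment is quadratic. Building the index once lets each query look only at a narrow band of candidates.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)  # frozen makes instances hashable
class Segment:
    """Hypothetical stand-in for a supervision segment (times in seconds)."""
    id: str
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


class SegmentIndex:
    """Sorted-start index: built once, then queried many times."""

    def __init__(self, segments: List[Segment]) -> None:
        self.segments = sorted(segments, key=lambda s: s.start)
        self.starts = [s.start for s in self.segments]
        self.max_duration = max((s.duration for s in self.segments), default=0.0)

    def overlapping(self, begin: float, end: float) -> List[Segment]:
        """Return segments that intersect the half-open window [begin, end)."""
        # An overlapping segment must start later than begin - max_duration ...
        lo = bisect_right(self.starts, begin - self.max_duration)
        # ... and strictly before the end of the window.
        hi = bisect_left(self.starts, end)
        # Only the candidates in [lo, hi) need the exact end-time check.
        return [s for s in self.segments[lo:hi] if s.end > begin]


# A 2-hour recording with one 4-second segment every 5 seconds.
segments = [Segment(f"seg{i}", start=i * 5.0, duration=4.0) for i in range(1440)]
index = SegmentIndex(segments)
hits = index.overlapping(12.0, 21.0)
print([s.id for s in hits])  # ['seg2', 'seg3', 'seg4']
```

This stand-in degrades to a linear scan when a single segment is very long, because `max_duration` then widens the candidate band for every query. A real interval tree does not have that weakness, and that case matters for finding over-spanning segments.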

pzelasko · 2020-11-24T01:23:27Z

I think this is okay to merge; I might eventually extend the NSC data prep to other parts of the corpus.

janvainer · 2020-11-24T11:05:40Z

Nice, the export to a Kaldi data dir will be very handy! Thanks :)
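For readers unfamiliar with the layout, a Kaldi data dir is a set of plain-text files keyed by recording or utterance ID: `wav.scp`, `segments`, `text`, and `utt2spk`. The sketch below shows roughly what a limited export has to produce. It is not the utility added in this PR: `write_kaldi_data_dir` and its plain dict/tuple inputs are hypothetical stand-ins for the RecordingSet and SupervisionSet manifests.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple

# Hypothetical input row:
# (utterance id, recording id, speaker id, start sec, end sec, transcript)
Utterance = Tuple[str, str, str, float, float, str]


def write_kaldi_data_dir(
    recordings: Dict[str, str], utterances: List[Utterance], output_dir: Path
) -> None:
    """Write wav.scp, segments, text and utt2spk, each sorted by its ID field."""
    output_dir.mkdir(parents=True, exist_ok=True)
    recs = sorted(recordings.items())  # sort by recording id
    utts = sorted(utterances)          # sort by utterance id (first tuple field)
    files = {
        # <recording-id> <path-to-audio>
        "wav.scp": [f"{rec_id} {path}" for rec_id, path in recs],
        # <utterance-id> <recording-id> <start> <end>
        "segments": [f"{u} {r} {s:.2f} {e:.2f}" for u, r, _, s, e, _ in utts],
        # <utterance-id> <transcript>
        "text": [f"{u} {t}" for u, _, _, _, _, t in utts],
        # <utterance-id> <speaker-id>
        "utt2spk": [f"{u} {spk}" for u, _, spk, _, _, _ in utts],
    }
    for name, lines in files.items():
        (output_dir / name).write_text("\n".join(lines) + "\n", encoding="utf-8")


# Example with made-up IDs and paths, written to a temporary directory.
out = Path(tempfile.mkdtemp()) / "data" / "nsc_demo"
write_kaldi_data_dir(
    {"rec1": "/path/to/rec1.wav"},
    [
        ("spk1-rec1-0002", "rec1", "spk1", 3.5, 6.0, "second utterance"),
        ("spk1-rec1-0001", "rec1", "spk1", 0.0, 3.2, "first utterance"),
    ],
    out,
)
print((out / "segments").read_text())
```

The utterance IDs in the example are prefixed with the speaker ID, which follows Kaldi's data preparation guidance and keeps `utt2spk` sorted consistently by both fields. Other files such as `spk2utt` can be derived from `utt2spk` with Kaldi's own scripts.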

pzelasko added 17 commits November 19, 2020 15:09

Partial NSC data prep

8724303

First take at exporting Lhotse manifests to Kaldi data dir

e4d039d

Add a CLI for NSC data prep and Kaldi export

b0574f0

Fix nsc data prep function name

55fb11f

Fix superfluous append

1f2c483

Add a missing "trim_to_supervisions"

e923362

Fix empty segments in NSC

f01f13b

Make supervisions frozen and hashable

a3d15ed

Test faster? truncation

a7fd1a1

Fix the offsets in truncate

0bb3e2c

Try pre-computing the supervision interval tree

ff17c0e

Fix

c3743dd

Filter <Z> segments in NSC

82ad4d9

Make all truncate() methods work with the supervisions index

7578614

Merge branch 'master' into feature/nsc

9607ec2

Add new dependency - intervaltree

919ef99

A bit of refactoring and extra comments/documentation

4b38d1c

pzelasko changed the title ~~[WIP] National Speech Corpus data prep, optimizations in Cut, limited export to Kaldi data dir~~ National Speech Corpus data prep, optimizations in Cut, limited export to Kaldi data dir Nov 24, 2020

pzelasko merged commit 21ee379 into master Nov 24, 2020

pzelasko added this to the v0.3 milestone Nov 24, 2020

sw005320 mentioned this pull request Dec 31, 2020

espnet example? #171

Closed

pzelasko deleted the feature/nsc branch July 1, 2021 01:22
