feat: change the flow of data preprocess and avoid bug in remove columns #26516

pphuc25 · 2023-10-01T12:34:01Z

What does this PR do?

I changed the data flow of the prepare_dataset function and added a case that avoids removing the speech column.

While examining the 'wav2vec2' workflow, I noticed that the prepare_dataset function typically takes the path of audio files and converts them into audio arrays. However, I believe this approach may not be ideal for several reasons:

Not all data entries contain a path column, or the path column may not always be correctly populated (e.g., in the case of 'vivos' data). When attempting to use this code with such data, errors can occur.
This process is somewhat redundant, especially in cases like 'common voice' datasets, where we already have the audio data stored in the audio column. In these instances, it would be more efficient to directly pass the audio array to the speech column.

To address these issues, I've adjusted the data flow to accept the audio file path as an input column, ensuring that the sampling rate matches the feature extractor's requirements. Additionally, I've created a list of columns to exclude during data processing to prevent inadvertently removing the 'speech' column.
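The data flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `audio_column` parameter, the `TARGET_SAMPLING_RATE` constant, and the choice to raise an error on a mismatch (rather than resample) are all assumptions made for the example. It assumes each entry already carries a decoded audio dictionary with `array` and `sampling_rate` keys, as in 'common voice'.

```python
from typing import Any, Dict

# Assumed value for illustration; the real script would take this from the feature extractor.
TARGET_SAMPLING_RATE = 16_000


def prepare_dataset(example: Dict[str, Any], audio_column: str = "audio") -> Dict[str, Any]:
    """Copy an already-decoded audio array into the 'speech' column.

    Hypothetical sketch: reads the array from `audio_column` instead of
    re-loading the file from a path column.
    """
    audio = example[audio_column]
    # Check that the audio matches the rate the feature extractor expects.
    if audio["sampling_rate"] != TARGET_SAMPLING_RATE:
        raise ValueError(
            f"Expected {TARGET_SAMPLING_RATE} Hz audio, got {audio['sampling_rate']} Hz"
        )
    return {"speech": audio["array"], "sampling_rate": audio["sampling_rate"]}


# Usage with a hand-made entry (no path column is needed).
example = {"audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16_000}, "sentence": "hello"}
processed = prepare_dataset(example)
print(len(processed["speech"]))  # 3
```

Raising on a mismatch keeps the sketch short; a real script might resample the audio instead.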

I would like to cc @sanchit-gandhi to review my PR.

sanchit-gandhi

Hey @pphuc25 - the Flax Wav2Vec2 pre-training script is under 'research projects' because it's still very much a WIP, and is not yet correct (see #19588). Thus, it's not really actively maintained, and so is not a fruitful place to make new contributions. I would encourage you to either develop on top of the existing script and publish it standalone, or pick up the work started in #19588 to try and get equivalence with PyTorch before adding new functionality!

examples/research_projects/jax-projects/wav2vec2/run_wav2vec2_pretrain_flax.py

sanchit-gandhi · 2023-10-02T16:38:19Z

examples/research_projects/jax-projects/wav2vec2/run_wav2vec2_pretrain_flax.py

- prepare_dataset, num_proc=data_args.preprocessing_num_workers, remove_columns=datasets["train"].column_names
+ prepare_dataset, num_proc=data_args.preprocessing_num_workers, remove_columns=remove_columns_values


Why change this? It was fine before, no?

pphuc25 · Oct 3, 2023

In the preprocessing function, the audio is assigned to a column named speech. I think a bug can occur here: if the data already has a column named speech, that column will be removed automatically. This is not a rare case, since I sometimes name my own audio column speech.
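The exclusion list mentioned in this reply might be built along the following lines. This is a hedged sketch: `build_remove_columns` and its `keep` parameter are hypothetical names, and only `remove_columns_values` appears in the diff above. The column names in the usage example are made up.

```python
from typing import Iterable, List


def build_remove_columns(column_names: Iterable[str], keep: Iterable[str] = ("speech",)) -> List[str]:
    """Return every column name except those listed in `keep`, preserving order."""
    keep_set = set(keep)
    return [name for name in column_names if name not in keep_set]


# Illustrative column names for a dataset whose audio column is already called "speech".
column_names = ["path", "audio", "sentence", "speech"]
remove_columns_values = build_remove_columns(column_names)
print(remove_columns_values)  # ['path', 'audio', 'sentence']
```

Whether this guard is needed depends on how `Dataset.map` treats a returned column that is also listed in `remove_columns`; the `datasets` documentation describes the removal as happening before the function's output is merged in.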

src/transformers/models/persimmon/modeling_persimmon.py

sanchit-gandhi · 2023-10-04T14:24:47Z

Hey @pphuc25! Thanks for your enthusiasm here! As mentioned previously, this examples script is not the best place to make performance optimisations, since it's currently a WIP script (or more truthfully, a 'broken' script). If you're interested in making a contribution for Flax Wav2Vec2 pre-training, I would encourage you to take a look at the issue #19588, which endeavours to correct this script by obtaining equivalence with PyTorch. We should fix this script first before making performance optimisations like the ones proposed in this PR. Thanks for your understanding.

pphuc25 added 3 commits October 1, 2023 19:21

feat: change the flow of data preprocess and enhance avoid bug in rem…

755960c

…ove columns

chorse: change layout of import

e07f2f9

fix: fix bug

ca05218

sanchit-gandhi reviewed Oct 2, 2023

docs: revert argument name and document install libraries

014b9d9

pphuc25 requested a review from sanchit-gandhi October 3, 2023 05:45

pphuc25 closed this Oct 4, 2023
