Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: change the flow of data preprocess and avoid bug in remove columns #26516

Closed
wants to merge 4 commits into from
Closed

feat: change the flow of data preprocess and avoid bug in remove columns #26516

wants to merge 4 commits into from

Conversation

pphuc25
Copy link
Contributor

@pphuc25 pphuc25 commented Oct 1, 2023

What does this PR do?

I change the data flow of prepare_dataset function, make a case to avoid remove speech columns

While examining the 'wav2vec2' workflow, I noticed that the prepare_dataset function typically takes the path of audio files and converts them into audio arrays. However, I believe this approach may not be ideal for several reasons:

  • Not all data entries contain a path column, or the path column may not always be correctly populated (e.g., in the case of 'vivos' data). When attempting to use this code with such data, errors can occur.
  • This process is somewhat redundant, especially in cases like 'common voice' datasets, where we already have the audio data stored in the audio column. In these instances, it would be more efficient to directly pass the audio array to the speech column.

To address these issues, I've adjusted the data flow to accept the audio file path as an input column, ensuring that the sampling rate matches the feature extractor's requirements. Additionally, I've created a list of columns to exclude during data processing to prevent inadvertently removing the 'speech' column."

I would like cc @sanchit-gandhi to review my PR.

Copy link
Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pphuc25 - the Flax Wav2Vec2 pre-training script is under 'research projects' because it's still very much a WIP, and currently does not have correctness (see #19588). Thus, it's not really actively maintained, and so is not a fruitful place to make new contributions. I would encourage you to either develop on-top of the existing script and publish it standalone, or pick-up the work started in #19588 to try and get equivalence with PyTorch before adding new functionality!

Comment on lines -339 to +344
prepare_dataset, num_proc=data_args.preprocessing_num_workers, remove_columns=datasets["train"].column_names
prepare_dataset, num_proc=data_args.preprocessing_num_workers, remove_columns=remove_columns_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this? It was fine before no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in function preprocess, the column will be assigned to colum name speech, I think maybe the bug can occur seen some data have the column speech will automatic remove, this is not a rare case seen something I name my audio column name as speech

@sanchit-gandhi
Copy link
Contributor

sanchit-gandhi commented Oct 4, 2023

Hey @pphuc25! Thanks for your enthusiasm here! As mentioned previously, this examples script is not the best place to make performance optimisations, since it's currently a WIP script (or more truthfully, a 'broken' script). If you're interested in making a contribution for Flax Wav2Vec2 pre-training, I would encourage you to take a look at the issue #19588, which endeavours to correct this script by obtaining equivalence with PyTorch. We should fix this script first before making performance optimisations like the ones proposed in this PR. Thanks for your understanding.

@pphuc25 pphuc25 closed this Oct 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants