feat: change the flow of data preprocess and avoid bug in remove columns #26516
What does this PR do?
This PR changes the data flow of the `prepare_dataset` function and adds a guard so that the `speech` column is not removed.

While examining the 'wav2vec2' workflow, I noticed that the `prepare_dataset` function typically takes the path of each audio file and converts it into an audio array. However, I believe this approach may not be ideal for several reasons:

- Some datasets may not have a `path` column, or the `path` column may not always be correctly populated (e.g., in the case of the 'vivos' data). When attempting to use this code with such data, errors can occur.
- Some datasets already provide the audio array in the `audio` column. In these instances, it would be more efficient to pass the audio array directly to the `speech` column.

To address these issues, I've adjusted the data flow to accept the audio file path as an input column, ensuring that the sampling rate matches the feature extractor's requirements. Additionally, I've created a list of columns to exclude from removal during data processing, so that the `speech` column is not inadvertently dropped.
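For reviewers, here is a minimal sketch of the intended flow, not the exact code in this PR. The names `prepare_example`, `columns_to_remove`, the injected `load_audio` callable, and the 16 kHz default are all hypothetical and chosen for illustration. It also assumes the `audio` column holds a dict with `array` and `sampling_rate` keys.

```python
from typing import Any, Callable, Dict, List, Sequence

# Assumed default; in practice this would be read from the feature extractor.
TARGET_SAMPLING_RATE = 16_000


def prepare_example(
    example: Dict[str, Any],
    load_audio: Callable[[str, int], Any],
    target_sr: int = TARGET_SAMPLING_RATE,
) -> Dict[str, Any]:
    """Fill the 'speech' column from 'audio' if it is decoded, else from 'path'."""
    audio = example.get("audio")
    if isinstance(audio, dict) and audio.get("array") is not None:
        # The dataset already carries the decoded array: reuse it directly.
        if audio.get("sampling_rate") != target_sr:
            raise ValueError(
                f"expected sampling rate {target_sr}, got {audio.get('sampling_rate')}"
            )
        speech = audio["array"]
    elif example.get("path"):
        # Fall back to loading from disk; load_audio is a hypothetical helper
        # expected to return samples at target_sr.
        speech = load_audio(example["path"], target_sr)
    else:
        raise ValueError("example has neither a decoded 'audio' array nor a usable 'path'")

    out = dict(example)
    out["speech"] = speech
    out["sampling_rate"] = target_sr
    return out


def columns_to_remove(
    column_names: Sequence[str], keep: Sequence[str] = ("speech",)
) -> List[str]:
    """Return every column except those listed in keep, preserving order."""
    return [name for name in column_names if name not in keep]
```

With this shape, a dataset that has an empty `path` but a decoded `audio` entry no longer fails, and the list passed as the columns to drop can never contain `speech`.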
I would like to cc @sanchit-gandhi to review this PR.