Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix the merge audio dirs job for fairseq training #454

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

Judyxujj
Copy link
Contributor

@Judyxujj Judyxujj commented Oct 5, 2023

With the current version, if file name from two directories are the same, there will be file already exists error when doing the symlink os.symlink(file, dst) since the dst are the same

a.wav -> dir_path_1/a.wav
a.wav -> dir_path_2/a.wav

This PR fixes this issue by renaming dst for the second file

a.wav -> dir_path_1/a.wav
a_2.wav -> dir_path_2/a.wav

fairseq/manifest.py Outdated Show resolved Hide resolved
Judyxujj and others added 3 commits January 3, 2024 15:06
Co-authored-by: michelwi <michelwi@users.noreply.github.com>
@Judyxujj Judyxujj requested a review from michelwi January 3, 2024 14:56
fairseq/manifest.py Show resolved Hide resolved
fairseq/manifest.py Show resolved Hide resolved
os.symlink(file, dst)
creation_complete = True
except OSError as err:
if err.errno != errno.EEXIST:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would there be an OSError with errno.EEXIST (i.e. FileExistsError)? You have just completed the while loop and this implies that it did not exist.

Oh, you are running several tasks in parallel and there could be race conditions. if folder1/a.wav is in self.audio_dir_paths[0] and folder2/a.wav is in self.audio_dir_paths[1]. But then the correct solution is not to ignore the error but make a thread save implementation.... which I now realize is the purpose of the double while loop..

Ok. nevermind my previous comment then. (But do we really need parallelism for symlinking a bunch of files?)

Except for the issue with the ever expanding underscores. That should probably be fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the issue for the ever expanding underscores, sorry I couldn't find the relevant comments, could you please refer me to that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while os.path.exists(dst):
    dst = f"{os.path.splitext(dst)[0]}_{i}.{self.file_extension}"

this will keep appending underscores to the filename:

a.wav
a_2.wav
a_2_3.wav
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants