Fix race-condition issue when downloading from multiple threads #2534
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM :) thanks @Wauplin
I can confirm that this issue doesn't happen anymore in my CI with this PR 🎉
Thanks a lot for fixing it!
Ok
Thanks for the reviews :) And thanks @regisss for confirming! This is the best test we can do 😄
* Fix race-condition issue when downloading from multiple threads
* correct fix
This ensures we have this fix: huggingface/huggingface_hub#2534
This PR fixes an issue that was introduced in #2223.
It has been reported at least in huggingface/transformers#27421 and #2223, and also on Slack (private threads).
Current process
The issue comes from the process to download a file to the cache directory. The process is:
* […] ./blobs/ folder
* […] ./blobs/ destination
* […] ./blobs/ folder and the correct destination in the ./snapshot/ folder
Issue
The problem is that when the download process is triggered by multiple processes or threads at once, we can have:
* […] ./blobs/ file
The problem is that if thread A tries to read the file just when thread B is (re-)creating the symlink, we hit a race condition, and more specifically a
RuntimeError: unable to open file (<path/to/cached/file>) in read-only mode: No such file or directory (2).
Fix
This PR fixes the issue by adding a simple check before creating the symlink. If the pointer file already exists, no need to recreate it.
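To make the idea concrete, here is a minimal sketch of the guarded symlink creation. It is not the actual huggingface_hub code: the function name `cache_file`, the single in-process `threading.Lock`, the `.incomplete` suffix for the temporary file, and writing bytes instead of performing a real download are all assumptions made for illustration. It also assumes a platform where creating symlinks is permitted.

```python
import os
import tempfile
import threading
from pathlib import Path

# Hypothetical stand-in for the per-file lock used during a download.
_lock = threading.Lock()


def cache_file(cache_dir: Path, blob_name: str, filename: str, content: bytes) -> Path:
    """Store `content` under ./blobs/ and expose it via a symlink in ./snapshot/."""
    blobs = cache_dir / "blobs"
    snapshot = cache_dir / "snapshot"
    blobs.mkdir(parents=True, exist_ok=True)
    snapshot.mkdir(parents=True, exist_ok=True)

    blob_path = blobs / blob_name
    pointer_path = snapshot / filename

    with _lock:
        if not blob_path.exists():
            # Write to a temporary file first, then move it into place atomically.
            tmp_path = blobs / (blob_name + ".incomplete")  # assumed suffix
            tmp_path.write_bytes(content)  # stands in for the real download
            os.replace(tmp_path, blob_path)

        # The check described above: if the pointer file already exists,
        # leave it alone so concurrent readers never see it disappear.
        if not pointer_path.exists():
            target = os.path.relpath(blob_path, start=pointer_path.parent)
            os.symlink(target, pointer_path)

    return pointer_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        def worker() -> None:
            path = cache_file(Path(tmp), "abc123", "model.bin", b"weights")
            # Reading right after the call returns is where the race used to bite.
            assert path.read_bytes() == b"weights"

        threads = [threading.Thread(target=worker) for _ in range(8)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
```

Without the `exists()` check, a later caller would replace a symlink that an earlier caller may already be reading, which is the window in which the "No such file or directory" error can appear.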
Many many thanks to @regisss who helped me reproduce the error in a reliable way by providing me with the correct setup. The issue has not been easy to track down and having a consistent way to reproduce it solved everything! 🙏
(For info, I was able to reproduce the error with a script once every 3-4 runs at most (thanks to @regisss!). With this PR I ran it 50 times in a row without a single incident!)