Fix race-condition issue when downloading from multiple threads #2534

Merged on Sep 12, 2024 (3 commits)

@Wauplin Wauplin commented Sep 11, 2024

This PR fixes an issue that was introduced in #2223.
It has been reported at least in huggingface/transformers#27421 and #2223, and also on Slack (private threads).

Current process

The issue comes from the process used to download a file to the cache directory. The process is:

  1. fetch the metadata
  2. check if the file is already in the cache (if yes, return)
  3. acquire a lock
    1. check if the file has already been downloaded to the ./blobs/ folder
      1. if not, download the file to a temporary file
      2. then move it to its ./blobs/ destination
    2. create a symlink from the correct destination in the ./snapshots/ folder to the file in the ./blobs/ folder
    3. release the lock and return
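
For illustration, the steps above can be sketched with the standard library only. This is not the library's implementation: `fetch_to_cache`, its parameters, the `.incomplete` suffix and the single module-level lock are all hypothetical, and the sketch assumes that creating the symlink means removing any existing one first (the pre-fix behavior this PR is about).

```python
import os
import threading

# Stand-in for the per-file lock (assumption: one lock is enough for a sketch).
_lock = threading.Lock()


def fetch_to_cache(cache_dir, blob_name, pointer_rel, download):
    """Hypothetical sketch of the download flow described above.

    blob_name   -- file identifier, assumed to come from the metadata (step 1)
    pointer_rel -- relative path of the pointer file inside the snapshot folder
    download    -- callable returning the file content as bytes
    """
    blob_path = os.path.join(cache_dir, "blobs", blob_name)
    pointer_path = os.path.join(cache_dir, "snapshots", pointer_rel)

    # Step 2: return early if the file is already in the cache.
    if os.path.exists(pointer_path):
        return pointer_path

    # Step 3: everything below happens while holding the lock.
    with _lock:
        # Step 3.1: download only if the blob is missing.
        if not os.path.exists(blob_path):
            os.makedirs(os.path.dirname(blob_path), exist_ok=True)
            tmp_path = blob_path + ".incomplete"  # illustrative temp name
            with open(tmp_path, "wb") as f:
                f.write(download())
            os.replace(tmp_path, blob_path)  # move to the blob destination

        # Step 3.2: (re-)create the symlink unconditionally (pre-fix behavior).
        os.makedirs(os.path.dirname(pointer_path), exist_ok=True)
        if os.path.lexists(pointer_path):
            os.remove(pointer_path)
        os.symlink(blob_path, pointer_path)

    # Step 3.3: the lock is released when leaving the `with` block.
    return pointer_path
```

Note that the early return in step 2 happens before the lock is taken, so two callers can both pass it and then queue on the lock.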

Issue

The problem is that when the download process is triggered by multiple processes or threads at once, we can have:

  1. thread A checks for the file => doesn't exist
  2. thread B checks for the file => doesn't exist
  3. thread A acquires the lock, thread B waits
  4. thread A downloads the file, moves it, creates the symlink and releases the lock
  5. thread B acquires the lock and does not re-download the ./blobs/ file, since it now exists
  6. thread B creates a symlink to the correct destination without checking if the pointer file has been created while waiting for the lock

If thread A tries to read the file just when thread B is (re-)creating the symlink, we hit a race condition, more specifically a RuntimeError: unable to open file (<path/to/cached/file>) in read-only mode: No such file or directory (2).
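
To make the failure window concrete, here is a small single-threaded replay of that interleaving. It is only a sketch: `replay_interleaving` is a hypothetical helper, and it assumes that re-creating the symlink is a remove followed by a create. A plain `open()` raises `FileNotFoundError` in the gap, whereas the `RuntimeError` quoted above comes from whichever downstream reader opens the cached file.

```python
import os
import tempfile


def replay_interleaving():
    """Replay the thread A / thread B steps in a fixed order (hypothetical)."""
    with tempfile.TemporaryDirectory() as cache_dir:
        blob_path = os.path.join(cache_dir, "blob")
        pointer_path = os.path.join(cache_dir, "pointer")

        # Thread A: download the blob, create the symlink, release the lock.
        with open(blob_path, "wb") as f:
            f.write(b"weights")
        os.symlink(blob_path, pointer_path)

        # Thread B: first half of the re-creation removes the existing link.
        os.remove(pointer_path)

        # Thread A: tries to read the file during that gap.
        try:
            with open(pointer_path, "rb") as f:
                f.read()
            failed_in_gap = False
        except FileNotFoundError:
            failed_in_gap = True

        # Thread B: second half of the re-creation puts the link back.
        os.symlink(blob_path, pointer_path)
        with open(pointer_path, "rb") as f:
            readable_after = f.read() == b"weights"

        return failed_in_gap, readable_after


if __name__ == "__main__":
    print(replay_interleaving())
```

With real threads the gap is very short, which would be consistent with the error showing up only intermittently.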

Fix

This PR fixes the issue by adding a simple check before creating the symlink: if the pointer file already exists, there is no need to recreate it.
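
A minimal sketch of such a check, using only the standard library. `create_pointer_if_missing` is a hypothetical name and not the function touched by this PR; the handling of a dangling link is an extra assumption made so the sketch cannot fail on a stale entry.

```python
import os


def create_pointer_if_missing(blob_path, pointer_path):
    """Create the symlink only when the pointer file does not exist yet.

    Hypothetical helper. Returns True if a link was created, False if an
    existing, readable pointer was left untouched.
    """
    # The check: a pointer created by another thread while we were waiting
    # for the lock is kept as-is, so readers never see it disappear.
    if os.path.exists(pointer_path):
        return False

    # Assumption: a dangling link (target gone) is safe to replace.
    if os.path.lexists(pointer_path):
        os.remove(pointer_path)

    os.makedirs(os.path.dirname(os.path.abspath(pointer_path)), exist_ok=True)
    os.symlink(blob_path, pointer_path)
    return True
```

Called while holding the lock, the second caller finds the pointer already in place and returns without touching it.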


Many many thanks to @regisss who helped me reproduce the error in a reliable way by providing me with the correct setup. The issue has not been easy to track down and having a consistent way to reproduce it solved everything! 🙏

(For info, I was able to reproduce the error with a script once every 3-4 runs maximum (thanks to @regisss!). With this PR I ran it 50 times in a row without a single incident!)

Wauplin commented Sep 11, 2024

cc @alozowski @ArthurZucker @lewtun

@hanouticelina hanouticelina left a comment

LGTM :) thanks @Wauplin

@regisss regisss left a comment

I can confirm that this issue doesn't happen anymore in my CI with this PR 🎉
Thanks a lot for fixing it!

@julien-c julien-c left a comment

Ok

Wauplin commented Sep 12, 2024

Thanks for the reviews :) And thanks @regisss for confirming! This is the best test we can do 😄

@Wauplin Wauplin merged commit 2f27522 into main Sep 12, 2024
19 checks passed
@Wauplin Wauplin deleted the fix-download-race-condition branch September 12, 2024 08:43
hanouticelina pushed a commit that referenced this pull request Sep 12, 2024
* Fix race-condition issue when downloading from multiple threads

* correct fix
regisss added a commit to huggingface/optimum-habana that referenced this pull request Oct 4, 2024