
OverwriteNewerLocalError when reading same resource in parallel #283

Open
Gilthans opened this issue Oct 30, 2022 · 5 comments

@Gilthans
Contributor

This appears to be similar to #128, with two caveats:

  1. This happens despite using the workaround mentioned by @jayqi in that issue
  2. There is no writing happening, only reading

The relevant code is:

    # self.image_path is a cloud path
    with self.image_path.open("rb") as file:
        image_contents = typing.cast(np.ndarray, imageio.imread(file))

This line raises the following exception:

cloudpathlib.exceptions.OverwriteNewerLocalError: Local file (...) for cloud path (...) is newer on disk, but is being requested for download from cloud. Either (1) push your changes to the cloud, (2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to overwrite.

This is despite the fact that the program never writes to the file or even opens it in 'w' mode.

What may be relevant is that this code is part of an HTTP server that receives several requests in parallel and often needs to read the same files for different requests.
Since reading the image file can be a lengthy operation (which probably spends most of its time in C code), I think it's possible that `open` gets called from another thread before the `with` block is exited, which causes the issue.
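
For concreteness, here is a minimal sketch of the access pattern described above (not a confirmed reproduction; the bucket path, worker count, and request loop are placeholder assumptions):

    # Sketch of the concurrent-read pattern described above; several threads
    # read the same cloud file at once, mimicking parallel HTTP requests.
    from concurrent.futures import ThreadPoolExecutor

    from cloudpathlib import CloudPath

    image_path = CloudPath("s3://example-bucket/images/example.png")  # hypothetical path

    def handle_request(request_id: int) -> int:
        # Each simulated request opens the same cloud path for reading; open()
        # pulls the file into the local cache if it is missing or looks stale.
        with image_path.open("rb") as f:
            return len(f.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        sizes = list(pool.map(handle_request, range(32)))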

Is this a known issue? Does the explanation make sense, or am I missing something else?

@pjbull
Member

pjbull commented Nov 2, 2022

@Gilthans It would be helpful to have a minimal reproducible example here so we can dig in.

To me, it seems most likely similar to #49, not #128. We can potentially address things like this by not using time as our check (for example, like #12) or by turning off cache checks entirely. There may be other workarounds as well (e.g., a flag to never re-download from the cloud, adding sleeps to your code on the first download, manually making sure the mtime matches the cloud version, or explicitly managing the download/caching with `download_to` and `exists` checks and passing around local paths on your server).
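
As an illustration of that last workaround (explicitly managing the download yourself), a sketch might look like the following; the local directory and the helper function are assumptions, not cloudpathlib API:

    # Sketch: fetch the blob once to a directory you control, then pass plain
    # local paths around the rest of the server.
    from pathlib import Path

    from cloudpathlib import CloudPath

    LOCAL_DIR = Path("/tmp/image-downloads")  # hypothetical local directory
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)

    def ensure_local_copy(cloud_uri: str) -> Path:
        cloud_path = CloudPath(cloud_uri)
        local_path = LOCAL_DIR / cloud_path.name
        if not local_path.exists():
            # download_to copies the blob to the given destination; reading the
            # returned Path afterwards never touches cloudpathlib's cache.
            cloud_path.download_to(local_path)
        return local_path

Note that this alone does not prevent two workers from performing the very first download at the same time; a lock around the download step would cover that.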

@bdc34

bdc34 commented May 16, 2023

I'm encountering this too in 0.13.0 as part of an HTTP Flask web app.

@pjbull
Member

pjbull commented May 25, 2023

Thanks for the mention, @bdc34.

I think this will continue to be an issue until we implement a parallel-safe caching layer that's independent of the file system (related issues #9, #11, #12, #128).

Here are a few mitigation strategies that might be helpful:

  • Architect your parallelism so that you create a new `Client` in each thread/process and make sure each is passed a different `local_cache_dir` (see the sketch after this list). This means independent caches per worker, but the disk-space tradeoff may be worth the simplicity of the implementation.
  • If your application is just passing the file on to the end user, use a presigned URL instead of passing the file through your backend. It would be good to get #236 (Add `as_url` method to each CloudPath implementation with the ability to generate presigned URLs) in to support this generally, but in the meantime you can do it with something like `S3Client.client.generate_presigned_url`, as shown in that PR (and sketched after this list).
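
A sketch of the first suggestion (one client and one cache directory per worker); the cache-directory naming and the worker-id plumbing are assumptions:

    # Give every worker its own client and its own cache directory so that
    # workers never share, or fight over, cache entries.
    import os
    import tempfile

    from cloudpathlib import S3Client

    def make_worker_client(worker_id: int) -> S3Client:
        cache_dir = os.path.join(tempfile.gettempdir(), f"cloudpathlib-cache-{worker_id}")
        return S3Client(local_cache_dir=cache_dir)

    # Inside each thread/process:
    client = make_worker_client(worker_id=0)
    image_path = client.CloudPath("s3://example-bucket/images/example.png")  # hypothetical
    with image_path.open("rb") as f:
        data = f.read()

And a sketch of the presigned-URL route via the underlying boto3 client exposed as `S3Client.client`; the bucket, key, and expiry are placeholders:

    # Hand the end user a presigned URL instead of streaming the file through
    # the backend (and through cloudpathlib's cache).
    from cloudpathlib import S3Client

    client = S3Client()
    url = client.client.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-bucket", "Key": "images/example.png"},
        ExpiresIn=3600,  # link lifetime in seconds
    )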

Finally, if someone can provide a minimal code snippet that reproduces this problem consistently, there may be additional mitigations we can build into the cloudpathlib library for this use case.

@Gilthans
Contributor Author

I forgot to mention this here, but I tried to create a reproducible example with no luck. I might give it another shot later on.

@pjbull
Member

pjbull commented Dec 11, 2024

Similar (likely the same) report in #492.

What we've been looking for there is a reproducible minimal example that we can use for debugging.

Some other things to try:

  • Write your code to be file-parallel (each worker works on a single file). This should prevent collisions from downloading the same file into the same place in the cache.
  • If your workers all depend on the same file, use a lock (e.g., a mutex) to ensure the file gets downloaded only once, and then just read it from the cache in the different workers (or explicitly download the file); see the sketch after this list.
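
As a sketch of the lock-based option (the lock registry, download directory, and helper function are assumptions, not cloudpathlib API):

    # Ensure each distinct cloud file is downloaded exactly once per process;
    # later workers just read the local copy.
    import threading
    from collections import defaultdict
    from pathlib import Path

    from cloudpathlib import CloudPath

    _locks = defaultdict(threading.Lock)  # one lock per cloud URI
    _locks_guard = threading.Lock()       # protects the registry itself
    DOWNLOAD_DIR = Path("/tmp/shared-downloads")  # hypothetical shared location
    DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)

    def read_once(cloud_uri: str) -> bytes:
        with _locks_guard:
            lock = _locks[cloud_uri]
        local_path = DOWNLOAD_DIR / CloudPath(cloud_uri).name
        with lock:
            if not local_path.exists():
                CloudPath(cloud_uri).download_to(local_path)
        return local_path.read_bytes()

This guards against duplicate downloads within a single process; across processes you would need a file lock or a per-worker cache directory as suggested earlier in the thread.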
