open(cloud_path) acts on cached local file instead of cloud file #128
The gist of the problem here is that our `__fspath__` method returns the path to the local cache file, so the builtin `open` ends up acting on the cached copy rather than the cloud object. Because we control our own methods, we're able to refresh the cache within them before handing anything off, so reads can stay correct. However, the problem happens when the caller writes: the write lands only on the local cached copy and is never uploaded. This may be impossible to fix. Once `__fspath__` has handed out a plain string path, we have no hook left to detect that a write happened through it. If it's the case that this is impossible to fix, then we have two options: 1. Keep our current `__fspath__` behavior and document the caveat prominently; 2. …
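The mechanism described here can be demonstrated without any cloud access: anything that implements `__fspath__` is accepted by the builtin `open`, which then reads and writes whatever local path `__fspath__` returns. Below is a minimal sketch using a hypothetical toy class (`FakeCloudPath` is illustrative only, not cloudpathlib's actual implementation):

```python
import os
import tempfile

class FakeCloudPath:
    """Toy stand-in: any object with __fspath__ is accepted by open()."""
    def __init__(self, cache_file):
        self._cache_file = cache_file

    def __fspath__(self):
        # The builtin open() only ever sees this local cache path;
        # nothing here can observe (or upload) a write made through it.
        return self._cache_file

cache_file = os.path.join(tempfile.mkdtemp(), "cached.txt")
p = FakeCloudPath(cache_file)

with open(p, "w") as f:      # builtin open, not a cloudpathlib method
    f.write("hello")         # modifies only the local cache file

with open(cache_file) as f:
    print(f.read())          # prints "hello": the write went to the cache
```

The write succeeds silently, which is exactly why the behavior is surprising: nothing signals that the cloud object was never touched.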
I think (1) is probably worth doing, and we may also add a section to the docs on writing to the cloud so that we can be very explicit about what we recommend, when you might see errors, and how they could be fixed, probably with a library-level setting for making this a warning or an error. Here are two other ideas while we're brainstorming: 3. Patch …
Hi! I'm encountering this issue in a context similar to #240, with pandas. This issue causes a problem with `DataFrame.to_parquet`. Example:

```python
import pandas as pd
import cloudpathlib

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
cloud_path = cloudpathlib.GSPath('gs://bucket/df.parquet')
df.to_parquet(cloud_path)  # works fine
df.to_parquet(cloud_path)  # raises OverwriteNewerLocalError
```
Hi @grisaitis. This is indeed another case of the problem documented by this issue—cloud paths do not support being passed directly to third-party functions (like pandas' `to_parquet`) that call the builtin `open` on them.
@jayqi thank you so much for the reply. Should I be calling …
@grisaitis there are a couple of workarounds. Update: See best option (0) in the following comment.

(1): Introduce explicit local file paths and then use cloudpathlib to write/upload to the cloud. Some example pseudo-code:

```python
cloud_path = CloudPath(...)
local_path = Path(...)
df.to_parquet(local_path)
cloud_path.upload_from(local_path)
```

This introduces local paths that you need to define and manage. It is also not the most efficient approach, because it creates another copy of the data that uses up disk space, along with the associated extra reads and writes.

(2): Use …
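To make the write-then-upload pattern above concrete without cloud credentials, here is a runnable sketch in which a temporary directory stands in for the bucket and a file copy stands in for `cloud_path.upload_from(local_path)` (all names and the copy step are illustrative stand-ins, not cloudpathlib calls):

```python
import shutil
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
local_path = tmp / "data.csv"
fake_bucket = tmp / "bucket"     # stand-in for the cloud bucket
fake_bucket.mkdir()

# Stand-in for df.to_parquet(local_path): write the data to an explicit local path.
local_path.write_text("col1,col2\n1,3\n2,4\n")

# Stand-in for cloud_path.upload_from(local_path): push the local file "up".
shutil.copy(local_path, fake_bucket / "data.csv")

print((fake_bucket / "data.csv").read_text())  # contents reached the stand-in bucket
```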
Good news! @pjbull pointed out to me another option, which is probably the best for cases that support it.

(0): For write functions that support taking file objects/handles as inputs, you can use the cloud path's `open` method and pass the resulting file object in directly. See example:

```python
from cloudpathlib import CloudPath
import pandas as pd

cloud_path = CloudPath("s3://cloudpathlib-test-bucket/test.csv")
cloud_path.exists()
#> False

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
with cloud_path.open("w") as fp:
    df.to_csv(fp)

cloud_path.exists()
#> True
cloud_path.read_text()
#> ',x,y\n0,1,4\n1,2,5\n2,3,6\n'
```

Created at 2022-09-15 16:53:55 EDT by reprexlite v0.5.0
Thanks! Since I want to avoid having any extra clutter around just calling …
You could start an asynchronous file watcher (e.g., in a thread) that detects changes whenever …
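For what it's worth, the watcher idea can be sketched with only the standard library: a background thread polls the cached file's modification time and fires a callback on change. Everything here is a hypothetical illustration (a real integration might use a filesystem-event library instead of polling, and the callback might re-upload the file to the cloud):

```python
import os
import tempfile
import threading
import time

def watch(path, on_change, stop, interval=0.05):
    """Poll `path` and call on_change(path) whenever its mtime changes."""
    last = os.path.getmtime(path)
    while not stop.is_set():
        mtime = os.path.getmtime(path)
        if mtime != last:
            last = mtime
            on_change(path)   # e.g., re-upload the cache file
        time.sleep(interval)

changed = []
fd, path = tempfile.mkstemp()
os.close(fd)

stop = threading.Event()
t = threading.Thread(target=watch, args=(path, changed.append, stop))
t.start()
time.sleep(0.1)  # let the watcher record the initial mtime first

with open(path, "w") as f:   # simulate a write made through __fspath__
    f.write("new data")
# Force an mtime tick so coarse filesystem timestamps can't hide the write.
os.utime(path, (time.time() + 10, time.time() + 10))

time.sleep(0.3)
stop.set()
t.join()
# `changed` now holds at least one detected change for the cached file.
```

The obvious drawbacks are the ones hinted at in the thread: the watcher is asynchronous, so the upload lags the write and there is no natural point to surface errors to the caller.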
Related issue in the standard library: python/cpython#99818
Another workaround I've found is to use the context manager to open the CloudPath and do the write operation as normal. I'm working with …
Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that that works, given that we didn't (I don't think, anyways) do it intentionally. I think it makes sense, given that exiting the `open` context manager is what triggers the upload of the cache file.

@pjbull I'm wondering if we do something like: on calls to `__fspath__`, require usage like

```python
with cloud_path.open():
    whatever_write(cloud_path)
```

and error otherwise. I guess it may still be undesirably cumbersome for read cases though to require a context manager, and this doesn't help with detecting the difference between read and write usage.
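One way to read the proposal above is to make `__fspath__` refuse to hand out the local path unless the caller is inside an `open()` context. A toy sketch of that gating idea follows (a hypothetical API, not cloudpathlib's actual behavior; a real implementation would also upload the cache file when the context exits):

```python
import os
from contextlib import contextmanager

class GuardedPath:
    """Toy sketch: only expose the cached local path inside open()."""
    def __init__(self, local_cache_path):
        self._local = local_cache_path
        self._in_context = False

    @contextmanager
    def open(self):
        self._in_context = True
        try:
            yield
        finally:
            self._in_context = False
            # a real implementation would upload the cache file here

    def __fspath__(self):
        if not self._in_context:
            raise RuntimeError(
                "wrap writes in `with cloud_path.open():` so they get uploaded"
            )
        return self._local

p = GuardedPath("/tmp/cached-file")

with p.open():
    print(os.fspath(p))   # allowed inside the context

try:
    os.fspath(p)          # rejected outside the context
except RuntimeError as e:
    print(e)
```

This captures both halves of the trade-off noted above: writes are forced through a context that can upload on exit, but every read through `__fspath__` now needs the same ceremony.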
@jayqi it sounds then like there isn't much difference between (continuing the previous example for objects with write methods that don't accept a file handle but only a file name):

```python
with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)
```

and something like

```python
import tempfile

with tempfile.NamedTemporaryFile() as fp:
    parquet.write_table(table, fp.name)
    fp.flush()
    target_filepath.upload_from(fp.name)
```

... would there be any technical reasons to prefer one over the other? Thanks
There aren't any huge differences in this single operation. However, if you later need to read the file you just uploaded, you may be able to save a download using the first approach. This is because the first approach will create a cloudpathlib-managed local cache file, whereas your second approach explicitly manages its own temporary file. A later read operation would compare the cloudpathlib local cache file to the cloud file and would skip the download if it determines the cache is up to date. The second approach wouldn't leave a local cache file behind, so a later read would always download to create one.
Unfortunately the following code does not behave as expected:

```python
open(cloud_path, "w")
```

(as opposed to using `cloud_path.open("w")`, which does work as expected). What will happen is that this will only modify the local cached copy of the file without uploading it to the cloud. Then further attempts to interact with the cloud path will correctly raise `OverwriteNewerLocalError` exceptions. Because cloud paths have an `__fspath__` method (#72) that returns the local file path, the builtin `open` function will only interact with that local file path.

Users can still get the correct behavior by using the `open` method. However, I've seen a lot of people using regular pathlib, especially those who are still new to it, call `open(path, "w")` on their `Path` objects, so I expect this may be a common pattern for users of cloudpathlib as well.