
open(cloud_path) acts on cached local file instead of cloud file #128

Open · jayqi opened this issue Mar 5, 2021 · 14 comments
Labels: bug (Something isn't working)

jayqi (Member) commented Mar 5, 2021

Unfortunately the following code does not behave as expected:

with open(cloud_path, "w") as fp:
    fp.write("foo")

(as opposed to using cloud_path.open("w"), which does work as expected).

What happens is that this only modifies the local cached copy of the file without uploading it to the cloud. Further attempts to interact with the cloud path will then correctly raise OverwriteNewerLocal exceptions. Because cloud paths have an __fspath__ method (#72) that returns the local file path, the built-in open function only ever interacts with that local file.

Users can still get the correct behavior by using the open method. However, I've seen a lot of people who use regular pathlib, especially those who are new to it, call open(path, "w") on their Path objects, so I expect this may be a common pattern for cloudpathlib users as well.

jayqi added the bug label on Mar 5, 2021
jayqi (Member, Author) commented Mar 7, 2021

The gist of the problem here is that our os.PathLike support fails in write situations.

Because CloudPath implements __fspath__ and therefore the os.PathLike protocol, it can be accepted by anything that supports an os.PathLike input, which will call __fspath__ and do whatever it wants with the returned filepath. This is fine and dandy for read operations: the local cache file's path is returned, the caller reads that file, and we have the ability to refresh the cache before that happens.

However, the problem happens when the caller writes. We're able to refresh the cache within __fspath__ before returning the local cache's path, but after we return, we're out of that method's scope and can't do anything else. The caller will write to the local cache, but we won't be able to sync the cache back up to the cloud.

This may be impossible to fix. Once __fspath__ returns, we can't do anything further. We'd really like __fspath__ to behave like a context manager so we could run some code afterward, but we can't actually do that because the os.PathLike protocol requires __fspath__ to return str or bytes. We'd need the Python standard library to implement a protocol that supports a context manager, and then convince all of the other libraries to support that new interface. os.PathLike has been around since Python 3.6, and there are still lots of places that don't support it and only accept filepaths as strings.
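To make the limitation concrete, here is a minimal sketch of the mechanics (hypothetical code, not cloudpathlib internals; _local stands in for the cache path):

import os


class SketchCloudPath:
    def __init__(self, local_cache_path: str):
        self._local = local_cache_path  # path to the local cache file

    def __fspath__(self) -> str:
        # We can do work *before* handing out the path (e.g. refresh the
        # cache), but execution never comes back here afterward. If the
        # caller writes to the returned path, nothing triggers an upload.
        return str(self._local)


os.fspath(SketchCloudPath("/tmp/cache-file"))  # returns "/tmp/cache-file"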


If it's the case that this is impossible to fix, then we have two options:

1. Keep our os.PathLike support but try to detect when it's being used in a write case so we can raise an exception.

@pjbull suggested that we could maybe use the inspect module to trace the call stack. We could potentially detect whether __fspath__ is being called by the built-in open function and then check whether mode is set to one of the writable ones (see the sketch after this list).

It's not clear to me that we can come up with a solution guaranteed to catch all cases, though. If a user calls __fspath__, stores the result as a string, and writes to that file later, that would be much harder to reliably detect. I'm also not sure whether the built-in open is the only way to write to files in Python.

2. Remove __fspath__ and don't implement the os.PathLike interface.

That means users can't pass cloud paths into functions that expect os.PathLike inputs, even for reading.
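For option (1), a rough sketch of the inspect-based detection might look like the following. This is hypothetical code, not a cloudpathlib API; note that the built-in open is implemented in C and leaves no Python frame, so this could only catch Python-level wrappers that pass a mode variable through:

import inspect

WRITE_MODE_CHARS = set("wax+")


def appears_to_be_write(max_depth: int = 10) -> bool:
    # Walk up the Python call stack looking for a frame with a local
    # variable named "mode" that contains a writable mode character.
    for frame_info in inspect.stack()[1:max_depth]:
        mode = frame_info.frame.f_locals.get("mode")
        if isinstance(mode, str) and WRITE_MODE_CHARS & set(mode):
            return True
    return False

__fspath__ could call this and raise instead of returning the cache path, but as noted above, it would miss cases where the path string is stashed and written to later.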

pjbull (Member) commented Mar 13, 2021

I think (1) is probably worth doing, and we should also add a section to the docs on writing to the cloud so that we can be very explicit about what we recommend, when you might see errors, and how they can be fixed. Probably with a library-level setting for making this a warning or an error.

Here are two other ideas while we're brainstorming:

3. Patch open

I would not do this by default, but I could see having a CLOUDPATHLIB_OVERRIDE_OPEN env var that is documented and could be set. If it is set at import time, we override the open builtin (and maybe also os.open?) with our own version that does the write-mode check (it only adds one check for read scenarios, so it shouldn't be a huge perf hit). If we are in write mode, we supply our own object that uploads on __exit__/.close().

I guess the primary use case here is if you have large/complex code bases that you want to pass CloudPath objects into, but don't want to change all of the open calls in the existing code.
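A minimal sketch of what this patch could look like (hypothetical, assuming the proposed CLOUDPATHLIB_OVERRIDE_OPEN env var; it delegates write modes to CloudPath.open, whose returned handle already uploads to the cloud on close):

import builtins
import os

from cloudpathlib import CloudPath

_original_open = builtins.open


def _cloud_aware_open(file, mode="r", *args, **kwargs):
    # Send CloudPath writes through CloudPath.open so the cache file gets
    # uploaded when the handle is closed; everything else passes straight
    # through to the real builtin.
    if isinstance(file, CloudPath) and set(mode) & set("wax+"):
        return file.open(mode, *args, **kwargs)
    return _original_open(file, mode, *args, **kwargs)


if os.environ.get("CLOUDPATHLIB_OVERRIDE_OPEN"):
    builtins.open = _cloud_aware_open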

4. Warn/raise in __del__ if the cache file has changed.

We track modified-time of the cache file at the object level. In __del__, if the on-disk modified time is greater than the one on the object, we can warn or raise.

I could also see an env var here that would enable uploading to the cloud in __del__ if the local cache is newer, but that seems high risk for foot-shooting.
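A sketch of idea (4) (again hypothetical; _local stands in for the object's local cache path):

import os
import warnings


class SketchCloudPath:
    # self._local is assumed to be set to the local cache path.

    def __fspath__(self) -> str:
        path = str(self._local)
        # Record the cache file's modified time when we hand the path out.
        self._handed_out_mtime = os.path.getmtime(path)
        return path

    def __del__(self):
        handed_out = getattr(self, "_handed_out_mtime", None)
        if handed_out is not None and os.path.getmtime(str(self._local)) > handed_out:
            warnings.warn(
                f"Cache file for {self!r} was modified but never uploaded to the cloud."
            )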

grisaitis commented

Hi! I'm encountering this issue in a context similar to #240, with pandas.

This issue causes a problem with pandas.DataFrame.to_*(path), where you can't write to the same file more than once :/

Example:

import pandas as pd
import cloudpathlib

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
cloud_path = cloudpathlib.GSPath('gs://bucket/df.parquet')
df.to_parquet(cloud_path)  # works fine
df.to_parquet(cloud_path)  # raises OverwriteNewerLocalError
---------------------------------------------------------------------------
OverwriteNewerLocalError                  Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 df.to_parquet(cloud_path)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/core/frame.py:2835, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2749 """
   2750 Write a DataFrame to the binary parquet format.
   2751 
   (...)
   2831 >>> content = f.read()
   2832 """
   2833 from pandas.io.parquet import to_parquet
-> 2835 return to_parquet(
   2836     self,
   2837     path,
   2838     engine,
   2839     compression=compression,
   2840     index=index,
   2841     partition_cols=partition_cols,
   2842     storage_options=storage_options,
   2843     **kwargs,
   2844 )

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:420, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    416 impl = get_engine(engine)
    418 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 420 impl.write(
    421     df,
    422     path_or_buf,
    423     compression=compression,
    424     index=index,
    425     partition_cols=partition_cols,
    426     storage_options=storage_options,
    427     **kwargs,
    428 )
    430 if path is None:
    431     assert isinstance(path_or_buf, io.BytesIO)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:176, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    172     from_pandas_kwargs["preserve_index"] = index
    174 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
--> 176 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    177     path,
    178     kwargs.pop("filesystem", None),
    179     storage_options=storage_options,
    180     mode="wb",
    181     is_dir=partition_cols is not None,
    182 )
    183 try:
    184     if partition_cols is not None:
    185         # writes to multiple files under the given path

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:80, in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
     70 def _get_path_or_handle(
     71     path: FilePath | ReadBuffer[bytes] | WriteBuffer[bytes],
     72     fs: Any,
   (...)
     77     FilePath | ReadBuffer[bytes] | WriteBuffer[bytes], IOHandles[bytes] | None, Any
     78 ]:
     79     """File handling for PyArrow."""
---> 80     path_or_handle = stringify_path(path)
     81     if is_fsspec_url(path_or_handle) and fs is None:
     82         fsspec = import_optional_dependency("fsspec")

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/common.py:225, in stringify_path(filepath_or_buffer, convert_file_like)
    222     return cast(BaseBufferT, filepath_or_buffer)
    224 if isinstance(filepath_or_buffer, os.PathLike):
--> 225     filepath_or_buffer = filepath_or_buffer.__fspath__()
    226 return _expand_user(filepath_or_buffer)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:233, in CloudPath.__fspath__(self)
    231 def __fspath__(self):
    232     if self.is_file():
--> 233         self._refresh_cache(force_overwrite_from_cloud=False)
    234     return str(self._local)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:844, in CloudPath._refresh_cache(self, force_overwrite_from_cloud)
    841 # if local newer but not dirty, it was updated
    842 # by a separate process; do not overwrite unless forced to
    843 if self._local.stat().st_mtime > stats.st_mtime:
--> 844     raise OverwriteNewerLocalError(
    845         f"Local file ({self._local}) for cloud path ({self}) is newer on disk, but "
    846         f"is being requested for download from cloud. Either (1) push your changes to the cloud, "
    847         f"(2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to "
    848         f"overwrite."
    849     )

OverwriteNewerLocalError: Local file (/tmp/tmp1bto4kma/<hidden>/df.parquet) for cloud path (gs://<hidden>/df.parquet) is newer on disk, but is being requested for download from cloud. Either (1) push your changes to the cloud, (2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to overwrite.

jayqi (Member, Author) commented Sep 15, 2022

Hi @grisaitis. This is indeed another case of the problem documented by this issue: cloud paths do not support being passed directly to third-party functions (like pd.to_parquet(cloud_path)) that perform writes generically. Unfortunately, we are not close to being able to support it. Most of the discussion on this issue is about giving users a clearer error to indicate that this is not supported.

grisaitis commented

@jayqi thank you so much for the reply.

Should I be calling str(cloud_path) or cloud_path.as_uri()?

jayqi (Member, Author) commented Sep 15, 2022

@grisaitis there are three directions of workarounds you can take:

Update: See best option (0) in the following comment

(1): Introduce explicit local file paths and then use cloudpathlib to write/upload to the cloud. Some example pseudo-code:

cloud_path = CloudPath(...)
local_path = Path(...)
df.to_parquet(local_path)
cloud_path.upload_from(local_path)

This introduces local paths that you need to define and manage. It is also not the most efficient option, because it creates another copy of the data, using up disk space and adding extra reads and writes.

(2): Use as_uri() to leverage other cloud filesystem libraries, like fsspec. This only works if the function you're passing cloud_path.as_uri() into has third-party support for URIs and you have those dependencies installed. It works for pandas because pandas supports fsspec. In this case, you're using cloudpathlib to manipulate paths (with a pathlib-like syntax) but an fsspec backend (s3fs, gcsfs, etc.), which is an entirely independent library, for the actual write to the cloud.
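For the pandas example above, workaround (2) is a one-liner (assuming the relevant fsspec backend, e.g. gcsfs for gs:// URIs, is installed):

# cloudpathlib builds the path; pandas plus fsspec/gcsfs perform the write.
df.to_parquet(cloud_path.as_uri())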

jayqi (Member, Author) commented Sep 15, 2022

Good news! @pjbull pointed out to me another option, which is probably the best for cases that support it.

(0): For write functions that support getting file objects/handles as inputs, you can use cloud_path.open("w"). pandas does indeed support this. The syntax here is a little clunky, but it has the advantage over the other two options of not creating an extra copy of the file, and it relies purely on cloudpathlib on the backend.

See example:

from cloudpathlib import CloudPath
import pandas as pd


cloud_path = CloudPath("s3://cloudpathlib-test-bucket/test.csv")
cloud_path.exists()
#> False

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

with cloud_path.open("w") as fp:
    df.to_csv(fp)
    
cloud_path.exists()
#> True
cloud_path.read_text()
#> ',x,y\n0,1,4\n1,2,5\n2,3,6\n'

Created at 2022-09-15 16:53:55 EDT by reprexlite v0.5.0

grisaitis commented

Thanks!

Since I want to avoid having any extra clutter around just calling df.to_*(path), I might try out https://github.com/fsspec/universal_pathlib as well

EpicWink commented

You could start an asynchronous file watcher (e.g. in a thread) that detects changes whenever __fspath__ is called. It would have to know whether the changes come from cloudpathlib or from an external process.
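A minimal sketch of such a watcher (hypothetical; as noted above, a real version would also need to distinguish cloudpathlib's own writes from external ones):

import os
import threading
import time


def watch_cache_file(path: str, on_change, interval: float = 0.5) -> None:
    # Poll the file's modified time in a daemon thread and invoke
    # on_change(path) whenever it moves forward.
    def _poll():
        last = os.path.getmtime(path)
        while True:
            time.sleep(interval)
            current = os.path.getmtime(path)
            if current > last:
                last = current
                on_change(path)

    threading.Thread(target=_poll, daemon=True).start()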

barneygale commented

Related issue in the standard library: python/cpython#99818

FullMetalMeowchemist commented Nov 3, 2023

Another workaround I've found is to use the context manager to open the CloudPath and then do the write operation as normal.

I'm working with pyarrow.parquet, and its write_table method only accepts a str or pyarrow.NativeFile. This is the example I've ended up with:

from cloudpathlib import CloudPath
from pyarrow import parquet

# `table` is an existing pyarrow.Table
target_filepath = CloudPath("gs://some_bucket/target_file.parquet")
target_filepath.exists()  # False

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)

target_filepath.exists()  # True

jayqi (Member, Author) commented Nov 5, 2023

Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that it works, given that we didn't (I don't think, anyway) do it intentionally. I think it makes sense, though, given that CloudPath.open patches the returned context manager to upload the local file to the cloud on exit.

@pjbull I'm wondering if we could do something like this: on calls to __fspath__, check whether we have a buffer context manager active, and error if not. Then we could potentially enforce the pattern of

with cloud_path.open():
    whatever_write(cloud_path)

and error if __fspath__ is called outside the context manager.

I guess it may still be undesirably cumbersome to require a context manager for read cases, though, and this doesn't help with distinguishing read usage from write usage.
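A rough sketch of the check proposed above (hypothetical, not library code; the counter and placeholder cache path are illustrative only):

import contextlib


class SketchCloudPath:
    def __init__(self):
        self._local = "/tmp/example-cache-file"  # placeholder cache path
        self._open_depth = 0  # number of active open() contexts

    @contextlib.contextmanager
    def open(self, mode="r"):
        self._open_depth += 1
        try:
            yield  # a real implementation yields the wrapped buffer here
        finally:
            self._open_depth -= 1
            # a real implementation would upload the cache for write modes

    def __fspath__(self) -> str:
        if self._open_depth == 0:
            raise RuntimeError("__fspath__ called outside an open() context manager")
        return self._local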

ollie-bell commented Jul 1, 2024

Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that it works, given that we didn't (I don't think, anyway) do it intentionally. I think it makes sense, though, given that CloudPath.open patches the returned context manager to upload the local file to the cloud on exit.

@jayqi it sounds then like there isn't much difference (continuing the previous example, for objects whose write methods accept only a file name and not a file handle) between:

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)    

and something like

import tempfile

with tempfile.NamedTemporaryFile() as fp:
    parquet.write_table(table, fp.name)
    fp.flush()
    target_filepath.upload_from(fp.name)

... would there be any technical reasons to prefer one over the other? Thanks

@jayqi
Copy link
Member Author

jayqi commented Jul 1, 2024

@jayqi it sounds then like there isn't much difference (continuing the previous example, for objects whose write methods accept only a file name and not a file handle) between:

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)    

and something like

import tempfile

with tempfile.NamedTemporaryFile() as fp:
    parquet.write_table(table, fp.name)
    fp.flush()
    target_filepath.upload_from(fp.name)

... would there be any technical reasons to prefer one over the other? Thanks

There aren't any huge differences in this single operation.

However, if you later need to read the file you just uploaded, the first approach may save you a download. The first approach creates a cloudpathlib-managed local cache file, whereas the second approach explicitly manages its own temporary file. A later read operation compares the cloudpathlib cache file to the cloud file and skips the download if it determines the cache is up to date. The second approach leaves no local cache file, so a later read will always download to create one.
