
I/O operation on closed file #1032

Closed
jontwo opened this issue Aug 31, 2022 · 10 comments · Fixed by #1035
Comments

@jontwo

jontwo commented Aug 31, 2022

The latest version has broken our CI, but it might be that you've just exposed an issue in dask and I actually need to create a ticket there instead. This test worked fine at fsspec==2022.7.1.

Callstack

_______________________ TestSortLargeCSV.test_one_column _______________________
farmlib/core/helpers/tests/test_io.py:657: in test_one_column
    sort_large_csv(self.input_filename,
farmlib/core/helpers/io.py:742: in sort_large_csv
    events = dd.read_csv(input_file, blocksize=blocksize)
venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py:744: in read
    return read_pandas(
venv/lib/python3.8/site-packages/dask/dataframe/io/csv.py:548: in read_pandas
    b_out = read_bytes(
venv/lib/python3.8/site-packages/dask/bytes/core.py:149: in read_bytes
    values = [
venv/lib/python3.8/site-packages/dask/bytes/core.py:150: in <listcomp>
    delayed_read(
venv/lib/python3.8/site-packages/dask/delayed.py:695: in __call__
    return call_function(
venv/lib/python3.8/site-packages/dask/delayed.py:662: in call_function
    args2, collections = unzip(map(unpack_collections, args), 2)
venv/lib/python3.8/site-packages/dask/delayed.py:38: in unzip
    out = list(zip(*ls))
venv/lib/python3.8/site-packages/dask/delayed.py:93: in unpack_collections
    if is_dask_collection(expr):
venv/lib/python3.8/site-packages/dask/base.py:187: in is_dask_collection
    return x.__dask_graph__() is not None
venv/lib/python3.8/site-packages/fsspec/core.py:212: in __getattr__
    return getattr(self.f, item)
venv/lib/python3.8/site-packages/fsspec/core.py:149: in f
    raise ValueError(
E   ValueError: I/O operation on closed file. Please call open() or use a with context
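
The "use a with context" hint in that error refers to how fsspec file handles are meant to be used: an object from fsspec.open is only live inside a with block. A minimal sketch of the intended pattern (not from this issue; it uses fsspec's in-memory filesystem so it runs without S3):

```python
import fsspec

# Write, then read, via fsspec URLs. The underlying file object is only
# valid inside the "with" block; touching it after the block exits is
# what raises "I/O operation on closed file".
with fsspec.open("memory://demo.csv", "w") as f:
    f.write("col1\na\nb\n")

with fsspec.open("memory://demo.csv", "r") as f:
    print(f.read())
```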

Repro case

# Imports and the output_filename/blocksize attributes were omitted from the
# original snippet; the values below are placeholders to make it runnable.
import os
from tempfile import TemporaryDirectory

import pandas as pd
import pytest

from farmlib.core.helpers.io import sort_large_csv


class TestSortLargeCSV:
    @pytest.fixture(autouse=True)
    def setup_method(self):
        self.temp_dir = TemporaryDirectory()
        self.input_filename = os.path.join(self.temp_dir.name, "input.csv")
        self.output_filename = os.path.join(self.temp_dir.name, "output.csv")
        self.blocksize = "1MB"  # placeholder; original value not shown

    def test_one_column(self):
        df = pd.DataFrame(columns=["col1"],
                          data=[["a"], ["b"], ["z"], ["x"]])
        df.to_csv(self.input_filename, index=False)

        sort_large_csv(self.input_filename,
                       self.output_filename,
                       index_column="col1",
                       blocksize=self.blocksize)

@martindurant
Member

Yes, there is a known regression that I should be able to clean up this morning.

@tommyjcarpenter

Is it possible to yank the bad version? Our builds also failed due to this (we are now pinning it back).

@martindurant
Member

The fixed version is now out. Do you still need the yank?

@jasonwdon

Hi! I'm still having this issue on version 2022.8.1. I get this error when I use pandas.read_csv("s3://file", compression='gzip', header=0). Pinning to 2022.7.1 resolves the issue.

@cperriard

Hi, I get the same error when writing to S3 with pandas_df.to_json("s3://bucket/file.jsonlines", orient="records", lines=True).
It worked with version 2022.8.0.
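
That pandas-to-fsspec write path can be exercised locally; a hedged sketch, using fsspec's memory:// filesystem as a stand-in for the real s3:// bucket (paths are hypothetical):

```python
import pandas as pd

# Round-trip a frame through an fsspec URL. pandas hands any non-local
# "scheme://" path to fsspec, so "memory://" needs no credentials.
df = pd.DataFrame({"a": [1, 2]})
df.to_json("memory://file.jsonlines", orient="records", lines=True)

out = pd.read_json("memory://file.jsonlines", lines=True)
assert out.equals(df)
```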

@tommyjcarpenter

We didn't need the yank since we pinned back, but this thread seems to have continued.

@martindurant
Member

Sorry for the mess. I yanked and made 2022.8.2 which should work for everyone.

@tommyjcarpenter

Thanks for your quick resolution. This package has huge indirect exposure since pandas depends on it, so errors propagate quickly :)

@martindurant
Member

...which is both good and bad. Created #1036 to try to do a better job at this.

@tommyjcarpenter

IMO a very simple test is simply:

import pandas as pd
pd.read_csv("s3://some_csv_in_s3_somewhere")
pd.read_parquet("s3://some_parquet_in_s3_somewhere")

We use this extensively; fsspec is even mentioned in the pandas documentation for read_csv (see "storage_options": https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), which is how I tracked down what broke yesterday.
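
A locally runnable variant of that smoke test, with fsspec's in-memory filesystem standing in for S3 (paths are hypothetical, not from the thread):

```python
import pandas as pd

# Write a small frame through an fsspec URL and read it back; a break in
# fsspec's file handling would surface here as the ValueError above.
df = pd.DataFrame({"col1": ["a", "b", "z", "x"]})
df.to_csv("memory://smoke.csv", index=False)

out = pd.read_csv("memory://smoke.csv")
assert out.equals(df)
```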
