-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fsspec: support fsspec>=2023.12.0 glob changes #6687
Conversation
c79dba2
to
ceb7a04
Compare
setup.py
Outdated
@@ -131,7 +131,7 @@ | |||
"multiprocess", | |||
# to save datasets locally or on any filesystem | |||
# minimum 2023.1.0 to support protocol=kwargs in fsspec's `open`, `get_fs_token_paths`, etc.: see https://github.com/fsspec/filesystem_spec/pull/1143 | |||
"fsspec[http]>=2023.1.0,<=2023.10.0", | |||
"fsspec[http]>=2023.1.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was unsure if the preference here is to remove the upper bound or to increase the range to the current fsspec release (<=2024.2.0
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer using <=2024.2.0
, since a breaking change in fsspec
can happen in any monthly version
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
src/datasets/data_files.py
Outdated
@@ -134,6 +144,23 @@ def sanitize_patterns(patterns: Union[Dict, List, str]) -> Dict[str, Union[List[ | |||
return sanitize_patterns(list(patterns)) | |||
|
|||
|
|||
def _expand_globstar(pattern: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fully align with the new fsspec logic, so ideally it would be great to remove the extra _expand_globstar
logic.
Therefore it's fine to simply remove the test examples using unsupported patterns like EDIT: actually only remove data/**
, **train*
or **test*
, and and update the remaining patterns that still use **
like DEFAULT_PATTERNS_ALL
and METADATA_PATTERNS
(and some patterns from the new KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
you added)**train*
and **test*
in the tests
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine with me, I was just going off of the original comment from #6644 which said the intention was to continue supporting the old glob behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @albertvillanova IMO we can drop the <prefix>**<suffix>
support, since it's been a behavior that only existed in fsspec
and no other well known globbing tool afaik
I double checked and actually we just need to remove the test involving **train*
and **test*
.
No need to modify data/**
, DEFAULT_PATTERNS_ALL
, METADATA_PATTERNS
or KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
as I said in my previous message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think KEYWORDS_IN_PATH_NAME_BASE_PATTERNS
does still need to be modified, I still see test failures with the existing base patterns
FAILED tests/test_data_files.py::test_get_data_files_patterns[data_file_per_split0-] - ValueError: Invalid pattern: '**' can only be an entire path component
It looks to me like {keyword}[{sep}/]**
still generates {keyword}**
which is invalid in fsspec>=2023.12.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lhoestq I've pushed an update that removes the unsupported **train*
/**test*
tests, removes the _expand_globstar
behavior, and simplifies the updated keyword patterns to
KEYWORDS_IN_PATH_NAME_BASE_PATTERNS = [
"**/{keyword}[{sep}]*",
"**/{keyword}/**",
"**/*[{sep}]{keyword}[{sep}]*",
"**/*[{sep}]{keyword}[{sep}]*/**",
]
Looking into the CI failure, this PR is incompatible with a bisect indicates that it's related to huggingface/huggingface_hub#1815 |
It looks like huggingface-hub's I did a quick test with huggingface-hub self.assertEqual(
sorted(self.hffs.glob(self.hf_path + "/.gitattributes")),
sorted([self.hf_path + "/.gitattributes"]),
) the
(and with the compatible/passing older fsspec versions the glob call returns the single exact file match as expected) So it looks like the CI failure here isn't directly related to this PR. The failing patterns that don't contain any |
I just opened huggingface/huggingface_hub#2056 to fix this. Do you mind if I continue this PR to run the CI against EDIT: the fix has been released in |
f8059cf
to
c5b4803
Compare
I just added two additional patterns to cover cases like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright it looks all good now, thanks for the fix ! also cc @mariosasko
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
Please note that people using
CC: @clefourrier |
fs.glob
changes introduced infsspec==2023.12.0
and unpins the current upper boundShould close #6644
Should close #6645
The
test_data_files
glob/pattern tests pass for me in:fsspec==2023.10.0
(the pinned max version in datasetsmain
)fsspec==2023.12.0
(Support fsspec 2023.12 #6644)fsspec==2024.2.0
(Support fsspec 2024.2 #6645)