allow repeated extra arguments #1673

Merged
merged 6 commits into from
Oct 16, 2024

karasikov
Contributor

This fixes the error raised when the user passes extra parameters that are also inferred automatically. For example, it occurs in the datasets library:

File ./python-default/3.10.14/lib/python3.10/site-packages/datasets/load.py:2692, in load_from_disk(dataset_path, fs, keep_in_memory, storage_options)
   2689     storage_options = fs.storage_options
   2691 fs: fsspec.AbstractFileSystem
-> 2692 fs, *_ = url_to_fs(dataset_path, **(storage_options or {}))
   2693 if not fs.exists(dataset_path):
   2694     raise FileNotFoundError(f"Directory {dataset_path} not found")

File ./3.10.14/lib/python3.10/site-packages/fsspec/core.py:396, in url_to_fs(url, **kwargs)
    385 known_kwargs = {
    386     "compression",
    387     "encoding",
   (...)
    393     "num",
    394 }
    395 kwargs = {k: v for k, v in kwargs.items() if k not in known_kwargs}
--> 396 chain = _un_chain(url, kwargs)
    397 inkwargs = {}
    398 # Reverse iterate the chain, creating a nested target_* structure

File ./3.10.14/lib/python3.10/site-packages/fsspec/core.py:349, in _un_chain(path, kwargs)
    347 if bit is bits[0]:
    348     kws.update(kwargs)
--> 349 kw = dict(**extra_kwargs, **kws)
    350 bit = cls._strip_protocol(bit)
    351 if (
    352     protocol in {"blockcache", "filecache", "simplecache"}
    353     and "target_protocol" not in kw
    354 ):

TypeError: dict() got multiple values for keyword argument 'account_name'

A minimal example affected by this fix:

from fsspec.core import url_to_fs
url_to_fs('az://DIR@ACCOUNT.blob.core.windows.net/DATA', **{'anon': False, 'account_name': 'ACCOUNT'})

Out before:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 2
      1 from fsspec.core import url_to_fs
----> 2 url_to_fs('az://DIR@ACCOUNT.blob.core.windows.net/DATA', **{'anon': False, 'account_name': 'ACCOUNT'})

...
...
    347 if bit is bits[0]:
    348     kws.update(kwargs)
--> 349 kw = dict(**extra_kwargs, **kws)
    350 bit = cls._strip_protocol(bit)
    351 if (
    352     protocol in {"blockcache", "filecache", "simplecache"}
    353     and "target_protocol" not in kw
    354 ):

TypeError: dict() got multiple values for keyword argument 'account_name'

Out after:

(<adlfs.spec.AzureBlobFileSystem at 0x10708f460>, 'DIR/DATA')

fsspec/core.py Outdated
@@ -346,7 +346,7 @@ def _un_chain(path, kwargs):
         kws = kwargs.pop(protocol, {})
         if bit is bits[0]:
             kws.update(kwargs)
-        kw = dict(**extra_kwargs, **kws)
+        kw = dict(**{k: v for k, v in extra_kwargs.items() if k not in kws or v != kws[k]}, **kws)
Member

Given the update on the line above, this could be written as repeated updates, which would make the order of precedence a little more explicit. In your model, should user-supplied arguments always win, overriding inferred ones?
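
For reference, the repeated-updates alternative raised here might look like the following minimal sketch. The helper name merge_with_precedence and the sample values are hypothetical, and the sketch assumes that user-supplied arguments take priority over inferred ones; it is not the code that was merged.

```python
def merge_with_precedence(inferred, user_supplied):
    """Hypothetical helper: merge by repeated updates, so later updates win."""
    kw = {}
    kw.update(inferred)       # lowest priority: values inferred from the URL
    kw.update(user_supplied)  # highest priority: explicit user arguments
    return kw


# Illustrative values only; the key names echo the traceback above.
inferred = {"account_name": "FROM_URL", "anon": True}
user_supplied = {"account_name": "FROM_USER"}

merged = merge_with_precedence(inferred, user_supplied)
print(merged)  # {'account_name': 'FROM_USER', 'anon': True}
```

With this shape the order of the update calls documents the precedence, and a conflicting key never raises: the later value silently replaces the earlier one.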

Contributor Author

Exactly. To avoid having to decide on priorities at all, here we simply deduplicate identical key-value pairs (so the priority is irrelevant). If the same key has two different values, both are passed as before, which raises the same error, e.g., TypeError: dict() got multiple values for keyword argument 'account_name'

extra_kwargs = {'x': 5, 'y': 4}, kws = {'z': 4} becomes {'x': 5, 'y': 4, 'z': 4}
extra_kwargs = {'x': 5, 'y': 4}, kws = {'x': 5} becomes {'x': 5, 'y': 4}
extra_kwargs = {'x': 5, 'y': 4}, kws = {'x': 4} becomes dict(**{'x': 5, 'y': 4}, **{'x': 4}) and raises TypeError
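
The three cases above can be checked in isolation with a small sketch. The function name merge_kwargs is hypothetical; its body copies the expression from the changed line in the diff so that it can run without fsspec.

```python
def merge_kwargs(extra_kwargs, kws):
    """Hypothetical wrapper around the expression from the changed line."""
    # Drop inferred pairs that are identical to a user-supplied pair.
    deduplicated = {
        k: v for k, v in extra_kwargs.items() if k not in kws or v != kws[k]
    }
    # Any key still present in both mappings makes dict() raise TypeError.
    return dict(**deduplicated, **kws)


# Case 1: disjoint keys are simply combined.
case_disjoint = merge_kwargs({"x": 5, "y": 4}, {"z": 4})

# Case 2: an identical key-value pair is deduplicated instead of raising.
case_same = merge_kwargs({"x": 5, "y": 4}, {"x": 5})

# Case 3: the same key with different values still raises, as before.
try:
    merge_kwargs({"x": 5, "y": 4}, {"x": 4})
    conflict_error = None
except TypeError as exc:
    conflict_error = exc

print(case_disjoint)
print(case_same)
print(type(conflict_error).__name__)
```

In the second case the resulting dict compares equal to {'x': 5, 'y': 4}, although the key order differs because the deduplicated key is re-added from kws.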

Member

> so the priority is irrelevant

Checking the values isn't always straightforward; they might not be simple ints and strs. Could we catch the error and pass a useful message to the user?
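
One way to picture this suggestion is the sketch below. It assumes a hypothetical helper merge_or_explain and a hypothetical Opaque value type that relies on Python's default identity-based equality; neither exists in fsspec, and the wording of the error message is invented for illustration.

```python
class Opaque:
    """Hypothetical value type that keeps the default identity-based equality."""

    def __init__(self, token):
        self.token = token


def merge_or_explain(extra_kwargs, kws):
    """Hypothetical helper: deduplicate, then turn a clash into a clearer error."""
    deduplicated = {
        k: v for k, v in extra_kwargs.items() if k not in kws or v != kws[k]
    }
    try:
        return dict(**deduplicated, **kws)
    except TypeError as exc:
        clashing = sorted(set(deduplicated) & set(kws))
        if not clashing:
            raise  # some other TypeError; do not mask it
        raise TypeError(
            f"Arguments {clashing} were passed explicitly and also inferred "
            "with different values"
        ) from exc


# Two Opaque objects with the same token still compare unequal, so the
# deduplication step cannot tell that they are logically the same value.
try:
    merge_or_explain({"cred": Opaque("a")}, {"cred": Opaque("a")})
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

The example shows why comparing values is not always enough: objects without a meaningful equality check fall through to the conflict path, where a message naming the clashing keys is more helpful than the bare dict() error.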

Member

Can we please include the two passing examples in some sort of test?

Contributor Author

Apologies for the delay.
Yes, I can add tests.

@martindurant
Member

Merging in recent PRs will make CI pass

@martindurant
Member

I made it green here, but I would still like a test or two.

@martindurant martindurant merged commit 9d56f92 into fsspec:master Oct 16, 2024
11 checks passed
2 participants