BUG: s3 reads from public buckets not working #34626

Closed
2 of 3 tasks
ayushdg opened this issue Jun 7, 2020 · 9 comments · Fixed by #34877
Labels: Blocker (Blocking issue or pull request for an upcoming release), IO Parquet (parquet, feather), Regression (Functionality that used to work in a prior pandas version)

Comments

@ayushdg

ayushdg commented Jun 7, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

# Read a CSV from a public S3 bucket with no AWS credentials configured
import pandas as pd
df = pd.read_csv("s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv")
Error stack trace
Traceback (most recent call last):
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/s3.py", line 33, in get_file_and_filesystem
    file = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/fsspec/spec.py", line 775, in open
    **kwargs
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 378, in _open
    autocommit=autocommit, requester_pays=requester_pays)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 1097, in __init__
    cache_type=cache_type)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/fsspec/spec.py", line 1065, in __init__
    self.details = fs.info(path)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 530, in info
    Key=key, **version_id_kw(version_id), **self.req_kw)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 200, in _call_s3
    return method(**additional_kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 622, in _make_api_call
    operation_model, request_dict, request_context)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 641, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
    auth.add_auth(request)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError
botocore.exceptions.NoCredentialsError: Unable to locate credentials

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/parsers.py", line 431, in _read
    filepath_or_buffer, encoding, compression
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/common.py", line 212, in get_filepath_or_buffer
    filepath_or_buffer, encoding=encoding, compression=compression, mode=mode
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/s3.py", line 52, in get_filepath_or_buffer
    file, _fs = get_file_and_filesystem(filepath_or_buffer, mode=mode)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/pandas/io/s3.py", line 42, in get_file_and_filesystem
    file = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/fsspec/spec.py", line 775, in open
    **kwargs
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 378, in _open
    autocommit=autocommit, requester_pays=requester_pays)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 1097, in __init__
    cache_type=cache_type)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/fsspec/spec.py", line 1065, in __init__
    self.details = fs.info(path)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 530, in info
    Key=key, **version_id_kw(version_id), **self.req_kw)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/s3fs/core.py", line 200, in _call_s3
    return method(**additional_kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 622, in _make_api_call
    operation_model, request_dict, request_context)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/client.py", line 641, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 102, in make_request
    return self._send_request(request_dict, operation_model)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 132, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/endpoint.py", line 116, in create_request
    operation_name=operation_model.name)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 356, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 228, in emit
    return self._emit(event_name, kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/hooks.py", line 211, in _emit
    response = handler(**kwargs)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/signers.py", line 90, in handler
    return self.sign(operation_name, request)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/signers.py", line 160, in sign
    auth.add_auth(request)
  File "/home/conda/envs/pandas-test/lib/python3.7/site-packages/botocore/auth.py", line 357, in add_auth
    raise NoCredentialsError

Problem description

Reading directly from s3 public buckets (without manually configuring the anon parameter via s3fs) is broken with pandas 1.0.4 (worked with 1.0.3).

It looks like reading from public buckets requires anon=True when creating the filesystem. Commit 22cf0f5 seems to have introduced the issue: anon=False is passed when the NoCredentialsError is encountered.
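As a possible workaround on an affected version, the filesystem can be created explicitly with anon=True and the opened file handed to pandas. This is only a minimal sketch: the helper name read_public_csv is hypothetical (not part of pandas or s3fs), and it assumes s3fs is installed and the bucket really allows anonymous reads.

```python
def read_public_csv(path, **read_csv_kwargs):
    """Read a CSV from a public S3 bucket using unsigned (anonymous) requests.

    `read_public_csv` is a hypothetical helper name used for illustration.
    """
    # Imported lazily so the helper can be defined without the optional deps.
    import pandas as pd
    import s3fs

    # anon=True tells s3fs not to look up or sign with AWS credentials.
    fs = s3fs.S3FileSystem(anon=True)
    with fs.open(path, "rb") as f:
        # pandas accepts an open binary file object in place of a URL.
        return pd.read_csv(f, **read_csv_kwargs)
```

Usage would look like read_public_csv("s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv"), using the path from the report above.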

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-55-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 47.1.1.post20200604
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@ayushdg ayushdg added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 7, 2020
@jorisvandenbossche jorisvandenbossche added IO Parquet parquet, feather Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 7, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.5 milestone Jun 7, 2020
@jorisvandenbossche
Member

@ayushdg thanks for the report!

cc @simonjayhawkins @alimcmaster1 for 1.0.5, it might be safer to revert #33632, and then target the fixes (like #34500) to master

@alimcmaster1
Member

Agree @jorisvandenbossche - do you want me to open a PR to revert #33632 on 1.0.x branch? Apologies for this change it didn’t go as planned. I’ll check why our test cases didn’t catch the above!

@jorisvandenbossche
Member

do you want me to open a PR to revert #33632 on 1.0.x branch?

Yes, that sounds good

Apologies for this change it didn’t go as planned.

No, no, none of us had foreseen the breakages ;)

@alimcmaster1
Member

alimcmaster1 commented Jun 7, 2020

Can't seem to reproduce this using moto... Potentially related: https://github.com/dask/s3fs/blob/master/s3fs/tests/test_s3fs.py#L1089

(I can repro locally using the s3 URL above if I remove AWS creds from my environment)

@alimcmaster1
Member

The fix for this to target 1.1 is to set ‘anon=True’ in S3FileSystem https://github.com/pandas-dev/pandas/pull/33632/files#diff-a37b395bed03f0404dec864a4529c97dR41

I’ll wait, as we are moving to fsspec, which gets rid of this logic (#34266), but we should definitely try using moto to test this.

@TomAugspurger
Contributor

Can anyone summarize the status here?

1.0.3: worked
1.0.4: broken
master: broken?
master+#34266: broken?

Do we have a plan in place to restore this? IIUC the old way was to

  1. try with the default (which I think looks up keys based on env vars)
  2. If we get an error, retry with anon=True
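The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pandas implementation: StubS3FileSystem and CredentialsMissing are hypothetical stand-ins for s3fs.S3FileSystem and botocore's NoCredentialsError so that the example runs without network access, and open_with_anon_fallback is a made-up helper name.

```python
import io


class CredentialsMissing(Exception):
    """Hypothetical stand-in for botocore.exceptions.NoCredentialsError."""


class StubS3FileSystem:
    """Hypothetical stand-in for s3fs.S3FileSystem; models only the anon flag."""

    def __init__(self, anon=False):
        self.anon = anon

    def open(self, path, mode="rb"):
        # Simulate a public bucket read from a machine with no credentials:
        # signed requests fail, anonymous requests succeed.
        if not self.anon:
            raise CredentialsMissing("Unable to locate credentials")
        return io.BytesIO(b"col\n1\n")


def open_with_anon_fallback(path, mode="rb", fs_factory=StubS3FileSystem,
                            errors=(CredentialsMissing,)):
    """Try the default (credentialed) filesystem, then retry anonymously."""
    fs = fs_factory()
    try:
        # Step 1: default behaviour, credentials looked up from the environment.
        return fs.open(path, mode), fs
    except errors:
        # Step 2: retry with anon=True so the request is not signed.
        fs = fs_factory(anon=True)
        return fs.open(path, mode), fs


file, fs = open_with_anon_fallback("bucket/key.csv")
print(fs.anon)
```

With real libraries, fs_factory and errors would presumably be swapped for the s3fs filesystem class and the credential errors it raises. The key point is that the retry passes anon=True rather than anon=False.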

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 12, 2020

Yep, it broke in 1.0.4, and will be fixed in 1.0.5 by reverting the patch that broke it.
That means that master is still broken, and thus we first need to write a test for it, and check whether #34266 actually fixes it already, or otherwise still fix it differently.

The old way was indeed to try with anon=True if it first failed. I suppose we can "simply" restore that logic? (in case it's not automatically fixed with fsspec)

@TomAugspurger
Contributor

Thanks

in case it's not automatically fixed with fsspec

It's not. So we'll need to do that explicitly. Long-term we might want to get away from this logic by asking users to do read_csv(..., storage_options={"requester_pays": False}). But for 1.1 we'll want to restore the old implicit retry behavior if possible.

@jorisvandenbossche
Member

Long-term we might want to get away from this logic

On the other hand, it seems nice that reading from a public bucket just works out of the box without needing to pass any option?
