BUG: read_parquet no longer supports file-like objects #34467
This is almost certainly a regression in pyarrow itself; please report it there.

I mean no offense, but this sounds like the responsibility of the pandas maintainers, since pandas directly consumes pyarrow.

@claytonlemons you are missing the point: it's calling pyarrow code, see the traceback.

@claytonlemons I am encountering the same issue. If I downgrade from 1.0.4 to 1.0.3 (while keeping the pyarrow version the same), I can again read from BytesIO buffers without issue. Since upgrading pandas from 1.0.3 to 1.0.4 seems both necessary and sufficient to cause the file-like object reading issues, it may indeed be correct to consider this an issue with pandas, not pyarrow. @jreback Would you consider reopening this issue?

OK, sure, something must have gone wrong in the backport. It would be helpful to know exactly where.
Thanks, @jreback. For starters, it looks like the implementation changed significantly between 1.0.3 and 1.0.4.

1.0.3:

```python
def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    kwargs["use_pandas_metadata"] = True
    result = self.api.parquet.read_table(
        path, columns=columns, **kwargs
    ).to_pandas()
    if should_close:
        path.close()
    return result
```

1.0.4:

```python
def read(self, path, columns=None, **kwargs):
    parquet_ds = self.api.parquet.ParquetDataset(
        path, filesystem=get_fs_for_path(path), **kwargs
    )
    kwargs["columns"] = columns
    result = parquet_ds.read_pandas(**kwargs).to_pandas()
    return result
```
@jreback I understood your point, but I was referring to reporting the issue to pyarrow, not the fact that pyarrow is raising the traceback. That said, it's still an assumption that pyarrow caused the regression, which is why I'm reluctant to report anything to pyarrow in the first place. Let's dig into the issue some more.
So there are three possibilities:
Please see the referenced merge request above.
@claytonlemons Thanks for the report! cc @simonjayhawkins @alimcmaster1. So after the fact, we apparently should not have backported this (#34173). Anyway, that can happen. What do we do with this one?
On the actual issue, as far as I know, I think using the ParquetDataset was needed to get @alimcmaster1 do you remember why you switched to ParquetDataset? I think that |
This also fails on master, so we can either
From previous discussion #33970 (comment), probably best to create a 1.0.5 tag and set up the whatsnew. I'll open a PR for that.
|
Yes, agree this is related to #33632. There is actually also an open PR to properly document the reading/writing of file-like object behaviour: #33709. @jorisvandenbossche we switched to ParquetDataset because it allows a directory path, and hence we can read a partitioned dataset. I can submit a PR to fix and properly test the file-like behaviour that's broken here.
|
Understood, thanks @jorisvandenbossche! I think I missed that fact. Fix for this here: #34500
…able docstring. Use the same docstring as ParquetDataset: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html. Looks like the arg currently has no docs: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html. cc @jorisvandenbossche, our discussion here: pandas-dev/pandas#34467 (comment). Closes #7332 from alimcmaster1/patch-1. Authored-by: alimcmaster1 <alimcmaster1@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
I also have a similar issue since version 1.0.4 and had to downgrade to 1.0.3. I was able to read files from an Azure blob storage if I provided the https URL and the SAS token directly as a parameter; since 1.0.4 this is completely broken. Example on 1.0.4:

Raises:

This works perfectly on 1.0.3 and forced us to roll back to pandas 1.0.3.
We are reverting the original change for 1.0.5 (#34632), and then will need to make sure this is fixed for 1.1.0 in master.
Another consequence of using
But since
|
@kepler I wonder if explicitly separating the kwargs into two parameters might be a solution. Something like:

```python
KW = Optional[Dict[str, Any]]

def read_parquet(.., init_kw: KW = None, read_kw: KW = None) -> pd.DataFrame:
    ..
```

Though I imagine it would be more "pandas-like" to just implicitly construct
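One way the implicit split could work is sketched below. This is hypothetical, not pandas code; `split_kwargs` and `make_dataset` are invented names for illustration, routing each keyword to whichever callable's signature accepts it:

```python
# Hypothetical sketch of splitting one **kwargs dict between a
# constructor and a read method by inspecting the constructor's
# signature. None of these names exist in pandas.
import inspect
from typing import Any, Callable, Dict, Tuple

def split_kwargs(
    init: Callable, kwargs: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Send kwargs accepted by `init` to it; pass the rest to the reader."""
    init_names = set(inspect.signature(init).parameters)
    init_kw = {k: v for k, v in kwargs.items() if k in init_names}
    read_kw = {k: v for k, v in kwargs.items() if k not in init_names}
    return init_kw, read_kw

# Stand-in for a ParquetDataset-like constructor signature:
def make_dataset(path, filesystem=None, filters=None):
    pass

init_kw, read_kw = split_kwargs(
    make_dataset, {"filters": [("a", "=", 1)], "columns": ["a"]}
)
assert init_kw == {"filters": [("a", "=", 1)]}
assert read_kw == {"columns": ["a"]}
```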
Yes, I'd agree it's better to keep consistency with <=1.0.3 and do the magic inside the
So you see I don't have a strong opinion about how it's designed. :) Except it would have been great if a patch version hadn't broken compatibility.
Thanks for highlighting. As Joris mentioned, we are reverting for 1.0.5. Could you confirm whether this fix targeting master fixes your issue? #34500
The fix for master (pandas 1.1) is https://github.com/pandas-dev/pandas/pull/34500/files#diff-cbd427661c53f1dcde6ec5fb9ab0effaR134. We can potentially add tests that cover a few more of the kwargs, since we clearly don't have coverage here currently.
There was a bug introduced in pandas 1.0.4 that caused pd.read_parquet to no longer be able to handle file-like objects. They're fixing it in 1.0.5. This change will skip 1.0.4 and use the newer version when available. pandas-dev/pandas#34467
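The "skip 1.0.4" pin described in that commit message could be expressed in a requirements file along these lines (a sketch of one common way to write it, not the project's actual file):

```
pandas>=1.0.3,!=1.0.4
```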
Hi, I built the branch from the issue you mentioned and ran the code that uses Azure Data Lake/Blob Storage, and my data is being loaded again using direct URLs in the format that I mentioned before.
|
I'm in the same situation as @DuncanCasteleyn, using parquet files stored in an Azure Data Lake (blob). When I upgraded to pandas 1.0.4 without changing any code, I'm getting

Everything works as expected with version 1.0.3.
Ahh, I was bitten by this but ended up using pq.read_table() with pyarrow and removed the pandas.read_parquet() portions of the code. Upgrading to 1.0.5 now.
Code Sample, a copy-pastable example

Problem description

The current behavior of read_parquet(buffer) is that it raises the following exception:

Expected Output

Instead, read_parquet(buffer) should return a new DataFrame with the same contents as the serialized DataFrame stored in buffer.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.5.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-99-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.4
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 9.0.1
setuptools : 39.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None