Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Parquet does not support saving fp16 GH#44846 #44847

Closed
wants to merge 7 commits into from
Closed

BUG: Parquet does not support saving fp16 GH#44846 #44847

wants to merge 7 commits into from

Conversation

Anirudhsekar96
Copy link
Contributor

@pep8speaks
Copy link

pep8speaks commented Dec 10, 2021

Hello @Anirudhsekar96! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-14 04:01:15 UTC

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would need a test and release note

pandas/io/parquet.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Is this explicit check needed? It seems pyarrow itself already gives a decent error message?

In [4]: import pandas as pd
   ...: import numpy as np
   ...: 
   ...: data = np.arange(2, 10, dtype=np.float16)
   ...: df = pd.DataFrame(data=data, columns=['fp16'])
   ...: df.to_parquet('./fp16.parquet')
...
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: halffloat

(adding an explicit check also means we wouldn't automatically get the improvement if this is fixed on pyarrow's side)

@Anirudhsekar96
Copy link
Contributor Author

In the given example, PyArrow creates a file called 'fp16.parquet' before throwing the error resulting in an empty file. An explicit check before passing to PyArrow would prevent the creation of the empty file.

Additionally, it seems that there is little interest in adding support for float16 in PyArrow at the moment (see ref: apache/arrow#2691). Parquet format also does not seem to have adding extensions for supporting float16 in the road map (ref: https://issues.apache.org/jira/browse/ARROW-7242).

Another option would be to coerce all the columns with float16 coerce to float32 and adding a warning indicating the conversion. Currently fastparquet handles fp16 using the conversion method.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am ok with this, a partial write is not a great end result here.

could you add a note in I/O section for 1.4.0

pandas/io/parquet.py Outdated Show resolved Hide resolved
pandas/tests/io/test_parquet.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.4 milestone Dec 14, 2021
@jreback jreback added the IO Parquet parquet, feather label Dec 14, 2021
@jorisvandenbossche
Copy link
Member

In the given example, PyArrow creates a file called 'fp16.parquet' before throwing the error resulting in an empty file. An explicit check before passing to PyArrow would prevent the creation of the empty file.

That might actually be a bug in the pandas code, as I can't reproduce it with just pyarrow (I do reproduce it with the pandas version above):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': np.array([0, 1, 2], dtype="float16")})
pq.write_table(table, "test_halffloat.parquet")

@jorisvandenbossche
Copy link
Member

So this "creates emtpy file" issue also happens for other data types that aren't supported with Parquet (such as timedelta64, or for any other reason that pyarrow raises an error while writing). I think that is something we should fix in general instead of doing a special case check for float16.

@jreback
Copy link
Contributor

jreback commented Dec 14, 2021

good point @jorisvandenbossche

ok let's close this issue and open one for the writing partial files

@Anirudhsekar96
Copy link
Contributor Author

Opened a new bug report for the broader issue:

#44914

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Parquet format does not support saving float16 columns
4 participants