-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Parquet does not support saving fp16 GH#44846 #44847
BUG: Parquet does not support saving fp16 GH#44846 #44847
Conversation
Anirudhsekar96
commented
Dec 10, 2021
- closes BUG: Parquet format does not support saving float16 columns #44846
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- Adds validation check for float16 column in to_parquet function by raising a ValueError
Hello @Anirudhsekar96! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-12-14 04:01:15 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would need a test and release note
Is this explicit check needed? It seems pyarrow itself already gives a decent error message?
(adding an explicit check also means we wouldn't automatically get the improvement if this is fixed on pyarrow's side) |
In the given example, PyArrow creates a file called 'fp16.parquet' before throwing the error resulting in an empty file. An explicit check before passing to PyArrow would prevent the creation of the empty file. Additionally, it seems that there is little interest in adding support for float16 in PyArrow at the moment (see ref: apache/arrow#2691). Parquet format also does not seem to have adding extensions for supporting float16 in the road map (ref: https://issues.apache.org/jira/browse/ARROW-7242). Another option would be to coerce all the columns with float16 coerce to float32 and adding a warning indicating the conversion. Currently fastparquet handles fp16 using the conversion method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am ok with this, a partial write is not a great end result here.
could you add a note in I/O section for 1.4.0
That might actually be a bug in the pandas code, as I can't reproduce it with just pyarrow (I do reproduce it with the pandas version above): import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'a': np.array([0, 1, 2], dtype="float16")})
pq.write_table(table, "test_halffloat.parquet") |
So this "creates emtpy file" issue also happens for other data types that aren't supported with Parquet (such as timedelta64, or for any other reason that pyarrow raises an error while writing). I think that is something we should fix in general instead of doing a special case check for float16. |
good point @jorisvandenbossche ok let's close this issue and open one for the writing partial files |
Opened a new bug report for the broader issue: |