Something wrong with putting bools in parquets #836

Closed
HenryDayHall opened this issue Apr 15, 2021 · 1 comment · Fixed by #837
Labels
bug The problem described is something that must be fixed

Comments

@HenryDayHall
Contributor

HenryDayHall commented Apr 15, 2021

While trying to save an array with fields of various types to Parquet, I ran into a problem. If all my arrays contain ints and floats, it works, but if one array contains bools I get a ValueError. This happens in both version 1.2.1 and 1.3.0rc1.

In [1]: import awkward as ak
In [2]: import numpy as np
In [3]: gob = ak.Array([3.4, 5.6])[np.newaxis]
   ...: clob = ak.Array([2, 3, 7])[np.newaxis]
   ...: log = ak.zip({"c": clob, "g": gob}, depth_limit=1)
   ...: ak.to_parquet(log, "test.parquet")

In [4]: gob = ak.Array([True, True])[np.newaxis]
   ...: clob = ak.Array([2, 3, 7])[np.newaxis]
   ...: log = ak.zip({"c": clob, "g": gob}, depth_limit=1)
   ...: ak.to_parquet(log, "test.parquet")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-4eaa8e961b03> in <module>
      2 clob = ak.Array([2, 3, 7])[np.newaxis]
      3 log = ak.zip({"c": clob, "g": gob}, depth_limit=1)
----> 4 ak.to_parquet(log, "test.parquet")

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
   2958     layout = to_layout(array, allow_record=False, allow_other=False)
   2959     iterator = batch_iterator(layout)
-> 2960     first = next(iterator)
   2961
   2962     if "schema" not in options:

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
   2942                         string_to32=string_to32,
   2943                         bytestring_to32=bytestring_to32,
-> 2944                         allow_tensor=False,
   2945                     )
   2946                 )

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_arrow(array, list_to32, string_to32, bytestring_to32, allow_tensor)
   2464             )
   2465
-> 2466     return recurse(layout, None, False)
   2467
   2468

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in recurse(layout, mask, is_option)
   1997                         int(numpy.ceil(len(numpy_arr) / 8.0)) * 8, dtype=numpy_arr.dtype
   1998                     )
-> 1999                     ready_to_pack[: len(numpy_arr)] = numpy_arr
   2000                     ready_to_pack[len(numpy_arr) :] = 0
   2001                 numpy_arr = numpy.packbits(

ValueError: could not broadcast input array from shape (2) into shape (1)

If this is actually a bug and not some mistake on my part, I'd be happy to attempt a fix.

@HenryDayHall HenryDayHall added the bug (unverified) The problem described would be a bug, but needs to be triaged label Apr 15, 2021
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Apr 15, 2021
@jpivarski
Member

It's not your mistake: booleans have to be treated specially for Arrow/Parquet because they're stored as bits, rather than bytes, and that code path didn't check for fixed-size dimensions != 1. That's why there's a length disagreement: the outer dimension is 1 and the inner dimension (what it's trying to copy into the buffer whose bytes will be packed into bits) is 2. I'm committing a fix now.
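The length disagreement described above can be reproduced with NumPy alone. The following is a minimal sketch, not the actual change made in #837: `pack_bools` is a hypothetical helper, and flattening with `reshape(-1)` before padding is only one assumed way to handle fixed-size dimensions other than 1. The first half mirrors the failing lines from the traceback; the second half pads the flattened values to a multiple of 8 and packs them into bits.

```python
import numpy as np

# Same shape as the "g" field in the report: outer length 1, inner fixed size 2.
numpy_arr = np.array([True, True])[np.newaxis]  # shape (1, 2)

# Mirror of the failing lines: the buffer and the slice are sized from len(),
# which only counts the outer dimension (1), not the 2 values to be copied.
ready_to_pack = np.empty(
    int(np.ceil(len(numpy_arr) / 8.0)) * 8, dtype=numpy_arr.dtype
)
try:
    ready_to_pack[: len(numpy_arr)] = numpy_arr  # 2 values into a 1-element slice
    failed = False
except ValueError:
    failed = True


def pack_bools(values):
    """Hypothetical helper: flatten, pad to a multiple of 8, pack into bits."""
    # Flatten first so that every boolean is counted, whatever the inner size is.
    flat = np.asarray(values, dtype=np.bool_).reshape(-1)
    padded = np.zeros(int(np.ceil(len(flat) / 8.0)) * 8, dtype=np.uint8)
    padded[: len(flat)] = flat
    # Arrow numbers bits starting from the least significant bit of each byte.
    return np.packbits(padded, bitorder="little")


packed = pack_bools(numpy_arr)
```

Here `failed` ends up `True` because NumPy cannot broadcast the two inner values into a one-element slice, while `pack_bools` sizes its buffer from the flattened length and so packs both values into a single byte.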
