Description
Suppose that you have a DictionaryArray with repeated values in the dictionary:
>>> import pyarrow as pa
>>> pa_array = pa.DictionaryArray.from_arrays(
... pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
... pa.array(["one", "two", "three", "one", "two", "three"])
... )
>>> pa_array
<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>

-- dictionary:
[
  "one",
  "two",
  "three",
  "one",
  "two",
  "three"
]
-- indices:
[
  0,
  1,
  2,
  3,
  4,
  5,
  0,
  1,
  2,
  3,
  4,
  5
]
According to the documentation (https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout),
"Dictionary encoding is a data representation technique to represent values by integers referencing a dictionary usually consisting of unique values."
so a DictionaryArray like the one above is arguably invalid. But if it is invalid, I'd expect an error message, rather than corrupt data, when I try to write it to a Parquet file.
>>> pa_table = pa.Table.from_batches(
... [pa.RecordBatch.from_arrays([pa_array], ["column"])]
... )
>>> pa_table
pyarrow.Table
column: dictionary<values=string, indices=int64, ordered=0>
>>> import pyarrow.parquet
>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")
No errors so far. So we try to read it back and view it:
>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")
>>> pa_loaded
pyarrow.Table
column: dictionary<values=string, indices=int32, ordered=0>
>>> pa_loaded.to_pydict()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long
Looking more closely at this, we see that the dictionary has been minimized to include only unique values, but the indices haven't been correctly translated:
>>> pa_loaded["column"]
<pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>
[

  -- dictionary:
  [
    "one",
    "two",
    "three"
  ]
  -- indices:
  [
    0,
    1,
    2,
    3,
    0,
    1,
    1,
    1,
    2,
    3,
    0,
    1
  ]
]
It looks like an attempt was made to minimize it, but the indices ought to be
[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
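As a minimal sketch of the translation I'd expect, here is the remapping done on plain Python lists standing in for the dictionary and the indices. The `dedupe_dictionary` helper is hypothetical (it is not a pyarrow function), it ignores nulls, and it is only meant to illustrate the intended result, not how the Parquet writer is implemented:

```python
def dedupe_dictionary(dictionary, indices):
    """Return (unique_values, remapped_indices), keeping first-occurrence order.

    Hypothetical helper for illustration only; assumes no null indices.
    """
    position = {}       # value -> its index in the deduplicated dictionary
    unique_values = []  # deduplicated dictionary
    old_to_new = []     # old dictionary index -> new dictionary index
    for value in dictionary:
        if value not in position:
            position[value] = len(unique_values)
            unique_values.append(value)
        old_to_new.append(position[value])
    # Every index must be translated through the old-to-new mapping;
    # shrinking the dictionary without this step leaves dangling indices.
    remapped = [old_to_new[i] for i in indices]
    return unique_values, remapped


dictionary = ["one", "two", "three", "one", "two", "three"]
indices = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
unique_values, remapped = dedupe_dictionary(dictionary, indices)
print(unique_values)  # ['one', 'two', 'three']
print(remapped)       # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

With this mapping, every remapped index is smaller than the length of the deduplicated dictionary, so no lookup can run past the end the way index 3 does in the loaded table.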
I don't know what your preferred course of action is (adding an error message or fixing the attempted conversion), but the current behavior is wrong. On my side, I'm adding code to prevent the creation of non-unique values in DictionaryArrays.
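A guard of that kind could look roughly like the following. This is a sketch under assumptions, not the code I'm actually adding: `find_duplicate_values` and `check_unique_dictionary` are hypothetical names, and they operate on a plain list of dictionary values (which, for a DictionaryArray, could presumably be obtained with `pa_array.dictionary.to_pylist()`):

```python
from collections import Counter


def find_duplicate_values(dictionary_values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(dictionary_values)
    return [value for value, n in counts.items() if n > 1]


def check_unique_dictionary(dictionary_values):
    """Raise ValueError if the dictionary contains repeated values."""
    duplicates = find_duplicate_values(dictionary_values)
    if duplicates:
        raise ValueError(
            f"dictionary contains repeated values: {duplicates!r}"
        )


# The dictionary from the example above is rejected before it can be written.
try:
    check_unique_dictionary(["one", "two", "three", "one", "two", "three"])
except ValueError as err:
    print(err)
```

Running the check before `pyarrow.parquet.write_table` would turn the silent corruption into an immediate error on the writing side.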
Environment: pyarrow 1.0.0 installed from conda-forge.
Reporter: Jim Pivarski / @jpivarski
Related issues:
- [C++] Duplicate values in a dictionary result in corrupted parquet (is duplicated by)
Note: This issue was originally created as ARROW-9801. Please see the migration documentation for further details.