DictionaryArray with non-unique values are silently corrupted when written to a Parquet file #25845

@asfimport

Description

Suppose that you have a DictionaryArray with repeated values in the dictionary:

>>> import pyarrow as pa
>>> pa_array = pa.DictionaryArray.from_arrays(
...     pa.array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]),
...     pa.array(["one", "two", "three", "one", "two", "three"])
... )
>>> pa_array
<pyarrow.lib.DictionaryArray object at 0x7f271befa4a0>

-- dictionary:
[
    "one",
    "two",
    "three",
    "one",
    "two",
    "three"
]
-- indices:
[
    0,
    1,
    2,
    3,
    4,
    5,
    0,
    1,
    2,
    3,
    4,
    5
]

According to the documentation (https://arrow.apache.org/docs/format/Columnar.html#dictionary-encoded-layout),

"Dictionary encoding is a data representation technique to represent values by integers referencing a dictionary usually consisting of unique values."

so a DictionaryArray like the one above is arguably invalid. But if it is invalid, I'd expect an error message, rather than corrupt data, when I try to write it to a Parquet file.

>>> pa_table = pa.Table.from_batches(
...     [pa.RecordBatch.from_arrays([pa_array], ["column"])]
... )
>>> pa_table
pyarrow.Table
column: dictionary<values=string, indices=int64, ordered=0>
>>> import pyarrow.parquet
>>> pyarrow.parquet.write_table(pa_table, "tmp2.parquet")

No errors so far. So we try to read it back and view it:

>>> pa_loaded = pyarrow.parquet.read_table("tmp2.parquet")
>>> pa_loaded
pyarrow.Table
column: dictionary<values=string, indices=int32, ordered=0>
>>> pa_loaded.to_pydict()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 1587, in pyarrow.lib.Table.to_pydict
  File "pyarrow/table.pxi", line 405, in pyarrow.lib.ChunkedArray.to_pylist
  File "pyarrow/array.pxi", line 1144, in pyarrow.lib.Array.to_pylist
  File "pyarrow/scalar.pxi", line 712, in pyarrow.lib.DictionaryScalar.as_py
  File "pyarrow/scalar.pxi", line 701, in pyarrow.lib.DictionaryScalar.value.__get__
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 111, in pyarrow.lib.check_status
pyarrow.lib.ArrowIndexError: tried to refer to element 3 but array is only 3 long

Looking more closely at this, we see that the dictionary has been minimized to include only unique values, but the indices haven't been correctly translated:

>>> pa_loaded["column"]
<pyarrow.lib.ChunkedArray object at 0x7f0a8fb16a90>
[
    -- dictionary:
    [
        "one",
        "two",
        "three"
    ]
    -- indices:
    [
        0,
        1,
        2,
        3,
        0,
        1,
        1,
        1,
        2,
        3,
        0,
        1
    ]
]

It looks like an attempt was made to minimize it, but the indices ought to be

[0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
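To make the expected result concrete, here is a minimal sketch of the remapping I have in mind. It works on plain Python lists rather than on Arrow buffers, and the helper name dedupe_dictionary is made up for illustration; it is not how the Parquet writer is implemented.

```python
def dedupe_dictionary(dictionary, indices):
    """Return (unique_values, remapped_indices) for plain Python lists.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    unique_values = []
    position = {}    # value -> its index in unique_values
    old_to_new = []  # old dictionary index -> new dictionary index
    for value in dictionary:
        if value not in position:
            position[value] = len(unique_values)
            unique_values.append(value)
        old_to_new.append(position[value])
    # None stands for a null index and is passed through unchanged.
    remapped = [None if i is None else old_to_new[i] for i in indices]
    return unique_values, remapped


dictionary = ["one", "two", "three", "one", "two", "three"]
indices = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]
unique_values, remapped = dedupe_dictionary(dictionary, indices)
print(unique_values)  # ['one', 'two', 'three']
print(remapped)       # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
```

With a real DictionaryArray, the inputs could presumably be taken from pa_array.dictionary.to_pylist() and pa_array.indices.to_pylist().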

I don't know which course of action you prefer (raising an error or fixing the index conversion), but the current behavior is wrong. On my side, I'm adding code to prevent the creation of DictionaryArrays with non-unique dictionary values.
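For the error-message option, a check along these lines would be enough to reject the array before writing. This is only a sketch on a plain Python list of hashable values; the function name check_unique_dictionary and the error wording are assumptions, not an existing pyarrow API.

```python
from collections import Counter


def check_unique_dictionary(dictionary):
    """Raise ValueError if any dictionary value occurs more than once.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    # Counter keeps first-seen order, so duplicates are reported in that order.
    duplicates = [value for value, count in Counter(dictionary).items() if count > 1]
    if duplicates:
        raise ValueError(f"dictionary contains repeated values: {duplicates}")


check_unique_dictionary(["one", "two", "three"])  # unique values: no error

try:
    check_unique_dictionary(["one", "two", "three", "one", "two", "three"])
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early like this trades convenience for safety: the caller has to deduplicate the dictionary explicitly, but a corrupt file is never written.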

Environment: pyarrow 1.0.0 installed from conda-forge.
Reporter: Jim Pivarski / @jpivarski

Note: This issue was originally created as ARROW-9801. Please see the migration documentation for further details.
