DictionaryValuesWriter dictionaries are corrupted by user changes. #1506

@asfimport


DictionaryValuesWriter passes incoming Binary objects directly to an Object2IntMap to accumulate dictionary values. If the caller later reuses the arrays backing those Binary objects, the dictionary values already stored are corrupted, yet they are still written without any error.

Because Hadoop reuses objects passed to mappers and reducers, this can happen easily. For example, Avro reuses the byte arrays backing Utf8 objects, which parquet-avro passes wrapped in a Binary object to writeBytes.
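To make the failure mode concrete, here is a minimal sketch that does not use Parquet itself. `BytesKey` is a hypothetical stand-in for a Binary-like wrapper that keeps a reference to the caller's array, and a plain `LinkedHashMap` stands in for the Object2IntMap; both are assumptions chosen so the example runs with only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Binary-like wrapper (not the Parquet class):
// it keeps a reference to the caller's array and never copies it.
final class BytesKey {
    final byte[] bytes;

    BytesKey(byte[] bytes) {
        this.bytes = bytes; // shares the caller's array
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    String asString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

final class ReusedBufferDemo {
    // Adds "aaa" and then "bbb" to a dictionary-like map, reusing one buffer
    // for both values the way an object-reusing caller would.
    static Map<BytesKey, Integer> buildDictionary() {
        Map<BytesKey, Integer> dict = new LinkedHashMap<>();
        byte[] buffer = "aaa".getBytes(StandardCharsets.UTF_8);
        dict.put(new BytesKey(buffer), dict.size()); // id 0, meant to be "aaa"

        // The caller overwrites the same buffer with the next value.
        byte[] next = "bbb".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(next, 0, buffer, 0, next.length);
        dict.put(new BytesKey(buffer), dict.size()); // id 1, "bbb"
        return dict;
    }

    public static void main(String[] args) {
        // Both keys now read back as the latest buffer contents.
        buildDictionary().forEach((k, id) -> System.out.println(id + " -> " + k.asString()));
    }
}
```

In this sketch the map ends up with two entries whose keys both read back as the latest buffer contents, so the first value is lost and no exception is ever raised.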

The fix is to make defensive copies of the values passed to the dictionary writer code. I think this affects only the Binary dictionary classes, because Strings, floats, longs, etc. are immutable or passed by value.
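A defensive copy along these lines could look like the sketch below. It is not the actual patch: `BytesValue`, its `copy()` method, and `CopyingDictionary` are hypothetical names, and a `HashMap` again stands in for the Object2IntMap. The idea is to copy the bytes only when a value is first added to the dictionary, so lookups of already-known values do not allocate.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical Binary-like wrapper (not the Parquet class).
final class BytesValue {
    private final byte[] bytes;

    private BytesValue(byte[] bytes) {
        this.bytes = bytes;
    }

    // Wraps the caller's array without copying it.
    static BytesValue wrap(byte[] bytes) {
        return new BytesValue(bytes);
    }

    // Returns a value backed by a private copy of the bytes.
    BytesValue copy() {
        return new BytesValue(Arrays.copyOf(bytes, bytes.length));
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BytesValue && Arrays.equals(bytes, ((BytesValue) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

// Hypothetical dictionary that owns its keys.
final class CopyingDictionary {
    private final Map<BytesValue, Integer> ids = new HashMap<>();

    // Returns the dictionary id for the value, assigning a new id if needed.
    int idFor(BytesValue value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();
            ids.put(value.copy(), id); // defensive copy, only for new entries
        }
        return id;
    }

    int size() {
        return ids.size();
    }

    boolean contains(String s) {
        return ids.containsKey(BytesValue.wrap(s.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Because the map stores its own copy of each key, the caller can overwrite or reuse its buffer after `idFor` returns without affecting entries already in the dictionary.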

Reporter: Ryan Blue / @rdblue
Assignee: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-62. Please see the migration documentation for further details.
