Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix validation for offsets of StructArrays #942

Merged
merged 2 commits into from
Nov 12, 2021

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Nov 11, 2021

Which issue does this PR close?

Resolves #940

Rationale for this change

It appears the arrow-rs slice logic for structs introduced in #389 by @nevi-me effectively slices the child data when a struct array is sliced.

The validation code from validate.cc (and perhaps the C++ code) assumes that the children are not sliced, so when I ported that logic over it is not correct for sliced structs. You can see by the comment I was somewhat confused about the need for offset even when it was originally introduced

What changes are included in this PR?

  1. Update the struct array validation logic to follow the Rust semantics where offsets are applied to both parent and child
  2. Test case from @bjchambers in https://github.com/bjchambers/arrow-rs/blob/reproduce-validation-error/arrow/src/array/data.rs#L1648

In pictures

In pictures, here is what the testcase looks like:


     0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
       │     │                       │     │  │     │          │     │  │     │
       ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
    5  │     │                    5  │     │  │     │       5  │     │  │     │
       └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

        Valid                         Valid    i32[]            Valid    bool[]


       StructArray                     Child Array #1           Child Array #2
                                        (Int32Array)            (BooleanArray)

In rust, when we do slice(1,3) the offset is applied to both the ArrayData and its children:



       0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─ ─ ─ ─
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─├─────┤─ ┼─────┼ ─ ─ ─ ─ ─ ─ ─ ─
      5  │     │                    5  │     │  │     │       5  │     │  │     │
         └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

          Valid                         Valid    i32[]            Valid    bool[]


         StructArray                     Child Array #1           Child Array #2
                                          (Int32Array)            (BooleanArray)

However, in the C++ validation logic, the assumption is that the children have no offsets (and the offset of the parent is applied):


       0 ┌─────┐                     0 ┌─────┐  ┌─────┐        0 ┌─────┐  ┌─────┐
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─            ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
         ├─────┤                       ├─────┤  ├─────┤          ├─────┤  ├─────┤
         │     │                       │     │  │     │          │     │  │     │
─ ─ ─ ─ ─├─────┤─ ─ ─ ─ ─ ─            ├─────┤  ├─────┤          ├─────┤  ├─────┤
      5  │     │                    5  │     │  │     │       5  │     │  │     │
         └─────┘                       └─────┘  └─────┘          └─────┘  └─────┘

          Valid                         Valid    i32[]            Valid    bool[]


         StructArray                     Child Array #1           Child Array #2
                                          (Int32Array)            (BooleanArray)

Are there any user-facing changes?

Sliced StructArrays can be created without validation errors (or using unsafe)

@bjchambers
Copy link
Contributor

Looks good. The explanation (and pictures) were helpful for my understanding -- thanks for explaining the problem so clearly!

@alamb
Copy link
Contributor Author

alamb commented Nov 11, 2021

FYI @jhorstmann

@alamb alamb merged commit 078bc10 into apache:master Nov 12, 2021
@alamb alamb deleted the alamb/validation_error branch November 12, 2021 11:16
alamb added a commit that referenced this pull request Nov 12, 2021
* reproduce validation error

* Fix validation bug

Co-authored-by: Ben Chambers <bjchambers@gmail.com>
alamb added a commit that referenced this pull request Nov 12, 2021
* reproduce validation error

* Fix validation bug

Co-authored-by: Ben Chambers <bjchambers@gmail.com>

Co-authored-by: Ben Chambers <bjchambers@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validation error when manually copying array data for a slice array
2 participants