Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP - Fix bug JSON input barfs on {"emptylist":[]} #1063

Closed
wants to merge 2 commits into from

Conversation

novemberkilo
Copy link
Contributor

Which issue does this PR close?

Closes #1036

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@novemberkilo novemberkilo marked this pull request as draft December 20, 2021 01:01
@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 20, 2021
@@ -1546,6 +1546,38 @@ mod tests {
one_column_roundtrip("list_single_column", values, true, Some(SMALL_SIZE / 2));
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for test driven development 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks :)

Also pls #1036 (comment)

let is_nullable = match self.level_type {
LevelType::Primitive(is_nullable) => is_nullable,
LevelType::List(is_nullable) => is_nullable,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only change to the code path -- it addresses the error message we were seeing where the List leveltype was being picked up by the catchall (thus the panic).

@@ -1574,6 +1570,9 @@ mod tests {
.unwrap();

let a = ListArray::from(a_list_data);
// let values = Arc::new(a);
// one_column_roundtrip("null_list_single_column", values, true, Some(SMALL_SIZE / 2));

let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a)]).unwrap();
roundtrip("test_null_list_single_column.parquet", batch, None);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error I am now seeing comes from the reader being unable to read the parquet that is written out.

thread 'arrow::arrow_writer::tests::null_list_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError("Unable to convert parquet INT32 logical type Some(UNKNOWN(NullType)) or converted type NONE")', parquet/src/arrow/array_reader.rs:1502:14

I have peppered the code with dbg! to see if I can make sense of what's going on but so far all I have is a hunch that we've also got a potential bug in the reader. Detailed output from this test appears below:

---- arrow::arrow_writer::tests::null_list_single_column stdout ----
[parquet/src/arrow/levels.rs:754] &self = LevelInfo {
    definition: [
        1,
    ],
    repetition: Some(
        [
            0,
        ],
    ),
    array_offsets: [
        0,
        0,
    ],
    array_mask: [
        true,
    ],
    max_definition: 2,
    level_type: List(
        true,
    ),
    offset: 0,
    length: 0,
}
[parquet/src/arrow/levels.rs:788] &filtered = []
[parquet/src/arrow/array_reader.rs:1844] &context = ArrayReaderBuilderContext {
    def_level: 0,
    rep_level: 0,
    path: ColumnPath {
        parts: [
            "arrow_schema",
        ],
    },
}
[parquet/src/arrow/array_reader.rs:1845] &child = GroupType {
    basic_info: BasicTypeInfo {
        name: "emptylist",
        repetition: Some(
            OPTIONAL,
        ),
        converted_type: LIST,
        logical_type: Some(
            LIST(
                ListType,
            ),
        ),
        id: None,
    },
    fields: [
        GroupType {
            basic_info: BasicTypeInfo {
                name: "list",
                repetition: Some(
                    REPEATED,
                ),
                converted_type: NONE,
                logical_type: None,
                id: None,
            },
            fields: [
                PrimitiveType {
                    basic_info: BasicTypeInfo {
                        name: "item",
                        repetition: Some(
                            OPTIONAL,
                        ),
                        converted_type: NONE,
                        logical_type: Some(
                            UNKNOWN(
                                NullType,
                            ),
                        ),
                        id: None,
                    },
                    physical_type: INT32,
                    type_length: -1,
                    scale: -1,
                    precision: -1,
                },
            ],
        },
    ],
}
[parquet/src/arrow/array_reader.rs:1495] &list_type = GroupType {
    basic_info: BasicTypeInfo {
        name: "emptylist",
        repetition: Some(
            OPTIONAL,
        ),
        converted_type: LIST,
        logical_type: Some(
            LIST(
                ListType,
            ),
        ),
        id: None,
    },
    fields: [
        GroupType {
            basic_info: BasicTypeInfo {
                name: "list",
                repetition: Some(
                    REPEATED,
                ),
                converted_type: NONE,
                logical_type: None,
                id: None,
            },
            fields: [
                PrimitiveType {
                    basic_info: BasicTypeInfo {
                        name: "item",
                        repetition: Some(
                            OPTIONAL,
                        ),
                        converted_type: NONE,
                        logical_type: Some(
                            UNKNOWN(
                                NullType,
                            ),
                        ),
                        id: None,
                    },
                    physical_type: INT32,
                    type_length: -1,
                    scale: -1,
                    precision: -1,
                },
            ],
        },
    ],
}
[parquet/src/arrow/array_reader.rs:1496] &context = ArrayReaderBuilderContext {
    def_level: 0,
    rep_level: 0,
    path: ColumnPath {
        parts: [
            "arrow_schema",
        ],
    },
}
[parquet/src/arrow/array_reader.rs:1497] &item_type = PrimitiveType {
    basic_info: BasicTypeInfo {
        name: "item",
        repetition: Some(
            OPTIONAL,
        ),
        converted_type: NONE,
        logical_type: Some(
            UNKNOWN(
                NullType,
            ),
        ),
        id: None,
    },
    physical_type: INT32,
    type_length: -1,
    scale: -1,
    precision: -1,
}
[parquet/src/arrow/array_reader.rs:1498] &new_context = ArrayReaderBuilderContext {
    def_level: 2,
    rep_level: 1,
    path: ColumnPath {
        parts: [
            "arrow_schema",
            "emptylist",
        ],
    },
}
thread 'arrow::arrow_writer::tests::null_list_single_column' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError("Unable to convert parquet INT32 logical type Some(UNKNOWN(NullType)) or converted type NONE")', parquet/src/arrow/array_reader.rs:1502:14

@novemberkilo
Copy link
Contributor Author

@alamb FYI I am afk until early Jan now so won't be making any progress on this until then.

@novemberkilo
Copy link
Contributor Author

Still here and interested. Sorry for the hiatus - will try and spin back up shortly. Confirming that this is still relevant pls @alamb

@alamb
Copy link
Contributor

alamb commented Feb 7, 2022

Still here and interested. Sorry for the hiatus - will try and spin back up shortly. Confirming that this is still relevant pls @alamb

👋 @novemberkilo #1036 is still open and I assume it is still relevant, but we will have to confirm with @chadbrewbaker I think

@novemberkilo
Copy link
Contributor Author

@alamb I have an update.

  1. I have confirmed that the issue persists as of current master
  2. I have re-examined and confirmed the approach -- I think that the test here is correct
  3. I've built json2parquet with the fix that appears in this PR (see this comment specifically) and this results in a parquet file.
  4. I can use pyarrow to read the file written out in step 3.

I am not sure whether I have uncovered a bug with array_reader or if I am writing out a dud parquet file. I want to investigate further and if I am reasonably sure that I have uncovered a bug, I will create an issue with a minimal example to reproduce.

I am going to close this PR for now because it is stale and a bit noisy. I will recreate a clean PR when I am clearer about the outcome of the code that appears here.

Copying @tustvold because I wonder if this is now close to the work you've done recently in #1246

@alamb
Copy link
Contributor

alamb commented Feb 28, 2022

Thanks @novemberkilo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON input barfs on {"emptylist":[]}
2 participants