Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset with nested field metadata could be written but cannot be read #2947

Closed
niyue opened this issue Sep 28, 2024 · 2 comments · Fixed by #2950
Closed

dataset with nested field metadata could be written but cannot be read #2947

niyue opened this issue Sep 28, 2024 · 2 comments · Fixed by #2950

Comments

@niyue
Copy link
Contributor

niyue commented Sep 28, 2024

If a dataset contains a nested field with associated metadata, writing the dataset is successful, but reading it can lead to errors like the one shown below:

called Result::unwrap() on an Err value: IO { source: "External error: Encountered internal error. Please file a bug report at https://github.com/lancedb/lance/issues. Error decoding batch: LanceError(Arrow): Invalid argument error: Incorrect datatype for StructArray field "names", expected List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"lance-encoding:compression": "zstd"} }) got List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), rust/lance-encoding/src/encodings/logical/struct.rs:530:21, rust/lance-encoding/src/decoder.rs:1273:31", location: Location { file: "rust/lance/src/dataset/scanner.rs", line: 1925, column: 83 } }
stack backtrace:
libunwind: stepWithCompactEncoding - invalid compact unwind encoding

This error occurs when decoding a struct, as Arrow's StructArray::try_new method is invoked:

 fn decode(self: Box<Self>) -> Result<ArrayRef> {
        // ...
        Ok(Arc::new(StructArray::try_new(
            self.child_fields,
            child_arrays,
            None,
        )?))
    }

In the current implementation of arrow-rs, StructArray::try_new internally compares the data types of each nested field, including their metadata. If the data type is the same but the metadata differs, an error is triggered:

if f.data_type() != a.data_type() {
                return Err(ArrowError::InvalidArgumentError(format!(
                    "Incorrect datatype for StructArray field {:?}, expected {} got {}",
                    f.name(),
                    f.data_type(),
                    a.data_type()
                )));
            }

I think this error might be related to arrow-rs, as it could potentially use the equals_datatype API instead of != to compare data types, though I’m not entirely certain. This is how arrow-rs currently handles data type comparisons, but it's unclear if Lance could modify its handling of field metadata to avoid this issue.

@niyue
Copy link
Contributor Author

niyue commented Sep 28, 2024

I submitted a PR (#2949) that includes a test case demonstrating this issue and a quick fix. However, I believe the current solution is not ideal, as it allows the dataset to be read but loses the nested field metadata afterward. It would be great if we could find a better solution for this case.

@westonpace
Copy link
Contributor

However, I believe the current solution is not ideal, as it allows the dataset to be read but loses the nested field metadata afterward. It would be great if we could find a better solution for this case.

I've added a PR that should keep the nested field metadata. I moved your test to python as it's a bit easier to maintain there but the rust test was passing before I did.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants