This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Panic in v0.12: NotYetImplemented("Decoding FixedLenByteArray(4) \"Plain\"-encoded, dictionary-encoded optional parquet pages") #1191

Closed
hsuyuanyuan opened this issue Jul 28, 2022 · 2 comments · Fixed by #1192
Labels
bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog

Comments

@hsuyuanyuan

Version: 0.12.

This is related to issue #1055. I followed the sample code there and tried to iterate over the pages in some column chunks, but it panicked when processing the 3rd page:

e=NotYetImplemented(
    "Decoding FixedLenByteArray(4) \"Plain\"-encoded, dictionary-encoded optional parquet pages",
  )
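To make the error text easier to read, here is a minimal sketch of how a message of this shape can be assembled from page metadata. It is modeled on the fragments copied from `not_implemented` in the code further down, but the types, the function name `not_implemented_message`, the order of the two optional suffixes, and the word "required" for non-optional columns are assumptions for illustration, not arrow2's actual implementation.

```rust
/// Hypothetical stand-ins for the page metadata the message is built from.
#[derive(Debug)]
enum PhysicalType {
    FixedLenByteArray(usize),
}

#[derive(Debug)]
enum Encoding {
    Plain,
    PlainDictionary,
}

/// Builds a "not yet implemented" message from a page's properties.
/// Each optional property contributes a suffix only when it applies.
fn not_implemented_message(
    physical_type: &PhysicalType,
    encoding: &Encoding,
    has_dict: bool,
    is_filtered: bool,
    is_optional: bool,
) -> String {
    let dict = if has_dict { ", dictionary-encoded" } else { "" };
    let filtered = if is_filtered { ", index-filtered" } else { "" };
    // "required" is an assumption; the report only shows the "optional" case.
    let required = if is_optional { "optional" } else { "required" };
    format!(
        "Decoding {:?} \"{:?}\"-encoded{}{} {} parquet pages",
        physical_type, encoding, dict, filtered, required
    )
}
```

Read this way, the message describes a page whose own encoding is `Plain`, in a column chunk that also carries a dictionary page, for an optional column.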

(Note that we don't see the same issue with 0.9, the last version we used; it happily loads all the pages in all the columns.)

From the debug logging, the encoding of the 3rd page is "Plain", while the first two use "PlainDictionary". All three pages report a dictionary page (the "dict" field is ", dictionary-encoded").

{"timestamp":"2022-07-27T07:02:51.994573Z","level":"INFO","shard_index":0,"row_group_index":0,"pages_len":4,"cur_page":0,"page_primitive_type":"PrimitiveType { field_info: FieldInfo { name: \"redacted\", repetition: Optional, id: None }, logical_type: Some(Decimal(7, 2)), converted_type: Some(Decimal(7, 2)), physical_type: FixedLenByteArray(4) }","page_encoding":"PlainDictionary","is_filtered":"\"\"","dict":"\", dictionary-encoded\""}

{"timestamp":"2022-07-27T07:02:52.018924Z","level":"INFO","shard_index":0,"row_group_index":0,"pages_len":4,"cur_page":1,"page_primitive_type":"PrimitiveType { field_info: FieldInfo { name: \"redacted\", repetition: Optional, id: None }, logical_type: Some(Decimal(7, 2)), converted_type: Some(Decimal(7, 2)), physical_type: FixedLenByteArray(4) }","page_encoding":"PlainDictionary","is_filtered":"\"\"","dict":"\", dictionary-encoded\""}

{"timestamp":"2022-07-27T07:02:52.026705Z","level":"INFO","shard_index":0,"row_group_index":0,"pages_len":4,"cur_page":2,"page_primitive_type":"PrimitiveType { field_info: FieldInfo { name: \"redacted\", repetition: Optional, id: None }, logical_type: Some(Decimal(7, 2)), converted_type: Some(Decimal(7, 2)), physical_type: FixedLenByteArray(4) }","page_encoding":"Plain","is_filtered":"\"\"","dict":"\", dictionary-encoded\""}
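The logs show a `Plain` page inside a column chunk that also has a dictionary page, which is consistent with a writer switching from dictionary to plain encoding partway through the chunk. The sketch below illustrates one way a decoder can handle that mix: it picks the decoding path from the page's own encoding and ignores the dictionary for plain pages. Everything here (`Page`, `decode`, `be_to_i128`, one index byte per value) is a simplified, hypothetical model, not arrow2's or parquet2's API; the only format detail relied on is that Parquet stores decimals in fixed-length byte arrays as big-endian two's-complement integers.

```rust
/// Hypothetical, simplified page model (not arrow2's or parquet2's types).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Encoding {
    Plain,
    PlainDictionary,
}

struct Page<'a> {
    encoding: Encoding,
    /// Dictionary of the column chunk, if any: concatenated `size`-byte entries.
    dict: Option<&'a [u8]>,
    /// Plain pages: concatenated `size`-byte values for the valid rows.
    /// Dictionary pages: one index byte per valid row (a simplification;
    /// real pages store RLE/bit-packed indices).
    values: &'a [u8],
    /// Per-row validity, standing in for decoded definition levels.
    validity: &'a [bool],
}

/// Interprets big-endian two's-complement bytes as an unscaled decimal value.
fn be_to_i128(bytes: &[u8]) -> i128 {
    let negative = bytes.first().map_or(false, |b| b & 0x80 != 0);
    let mut acc: i128 = if negative { -1 } else { 0 };
    for &b in bytes {
        acc = (acc << 8) | b as i128;
    }
    acc
}

/// Decodes an optional FixedLenByteArray(`size`) page into unscaled decimals.
fn decode(page: &Page, size: usize) -> Result<Vec<Option<i128>>, String> {
    if size == 0 {
        return Err("size must be non-zero".to_string());
    }
    let mut out = Vec::with_capacity(page.validity.len());
    match (page.encoding, page.dict) {
        // Dictionary-encoded page: values are indices into the dictionary.
        (Encoding::PlainDictionary, Some(dict)) => {
            let mut indices = page.values.iter();
            for &valid in page.validity {
                out.push(if valid {
                    let i = *indices.next().ok_or("too few indices")? as usize;
                    let entry = dict
                        .get(i * size..(i + 1) * size)
                        .ok_or("dictionary index out of range")?;
                    Some(be_to_i128(entry))
                } else {
                    None
                });
            }
        }
        // Plain page: decode values directly, whether or not the chunk
        // also carries a dictionary page.
        (Encoding::Plain, _) => {
            let mut chunks = page.values.chunks_exact(size);
            for &valid in page.validity {
                out.push(if valid {
                    Some(be_to_i128(chunks.next().ok_or("too few values")?))
                } else {
                    None
                });
            }
        }
        (Encoding::PlainDictionary, None) => {
            return Err("dictionary page missing".to_string());
        }
    }
    Ok(out)
}
```

The point of matching on `(Encoding::Plain, _)` is that the presence of a dictionary does not by itself decide how a page is decoded; a decoder that branches on "has dictionary" first would reject the third page in the logs above.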

This is the code I used.

        let pages: Vec<Result<CompressedDataPage, ParquetError>> = _get_page_stream(
            column,
            &mut reader,
            page_buffer.clone(),
            Arc::new(|_, _| true),
        )
        .await
        .map_err(Error::internal)?
        .collect()
        .await;
        
        let pages_len = pages.len();
        let _ = tokio::task::block_in_place(|| -> Result<_> {
            let type_ = column.descriptor().descriptor.primitive_type.clone();
            let mut cur_page = 0;
            for maybe_page in pages {
                let encoded_page = decompress(maybe_page.unwrap(), &mut vec![]).unwrap();

                // Copied from `fn not_implemented(page: &DataPage)` so the log shows the same details as the error message.
                let page_encoding = encoded_page.encoding();
                let is_filtered = encoded_page.selected_rows().is_some();
                let is_filtered = if is_filtered { ", index-filtered" } else { "" };
                let dict = if encoded_page.dictionary_page().is_some() {
                    ", dictionary-encoded"
                } else {
                    ""
                };
                let page_primitive_type = encoded_page.descriptor.primitive_type.clone();

                let iter = fallible_streaming_iterator::convert(std::iter::once(Ok(&encoded_page)));
                let mut iter = column_iter_to_arrays(
                    vec![iter],
                    vec![&type_],
                    field.clone(),
                    row_group.num_rows(),
                )
                .unwrap();

                let array = iter.next().unwrap().map_err(|e| {
                    grit!(
                        FailedPrecondition,
                        "shard_index={}, row_group_index={}, pages_len={}, cur_page={}, \
                        field= {:#?}, type={:#?},  \
                        e={:#?}",
                        shard_index,
                        row_group_index,
                        pages_len,
                        cur_page,
                        field.clone(),
                        type_,
                        e
                    )
                })?;
@hsuyuanyuan hsuyuanyuan changed the title on Jul 28, 2022
from: Panic in v0.12: NotYetImplemented("Decoding FixedLenByteArray(4) \"Plain\"-encoded, dictionary-encoded optional parquet pages",
to: Panic in v0.12: NotYetImplemented("Decoding FixedLenByteArray(4) \"Plain\"-encoded, dictionary-encoded optional parquet pages")
@jorgecarleitao
Owner

Hey @hsuyuanyuan, thanks a lot for the issue, and sorry about it. I just opened a PR with a fix and hope to release a new version this weekend.

Would you like it backported to 0.12, or would you be able to migrate to a new version / main?

@hsuyuanyuan
Author

Thanks for the lightning-speed fix, Jorge! I'd appreciate it if it could be backported to 0.12. We can also move to a newer version if backporting is too much hassle.

@jorgecarleitao jorgecarleitao added bug Something isn't working no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog labels Jul 31, 2022