
Support take on MapArray #3875

Closed
rtyler opened this issue Mar 16, 2023 · 4 comments · Fixed by #3925
Labels: arrow (Changes to the arrow crate), question (Further information is requested)

Comments

@rtyler
Contributor

rtyler commented Mar 16, 2023

Which part is this question about

I have been struggling to create RecordBatch objects containing what I believe are maps. I cannot seem to find the right incantation to write a record which includes a map.

Describe your question

I have tried variants of MapBuilder and hand-crafted MapArrays, but today I broke down and just tried to write a record decoded from JSON to parquet, for example:

    use std::sync::Arc;
    use arrow::datatypes::Schema as ArrowSchema;
    use arrow::json::reader::{Decoder, DecoderOptions};

    let json = r#"{"ds" : "1", "timestamp" : 1, "status" : 200, "url" : "https://",
    "method" : "GET", "response" : "", "headers" : {"this" : "is a map"}}"#;
    // don't mind this, this is just some Delta Lake schema translation
    let schema: ArrowSchema = <ArrowSchema as TryFrom<&Schema>>::try_from(&HttpRecord::schema()).unwrap();
    let schema_ref = Arc::new(schema);
    let value: serde_json::Value = serde_json::from_str(json).unwrap();

    let vit = vec![value];
    let mut vit = vit.iter().map(|v| Ok(v.to_owned()));

    let options = DecoderOptions::new();
    let d = Decoder::new(schema_ref, options);
    let batch = d.next_batch(&mut vit).unwrap();

The batch then looks like:

RecordBatch { schema: Schema { fields: [Field { name: "ds", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadat
a: {} }, Field { name: "timestamp", data_type: Timestamp(Microsecond, None), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "status", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "url", data_type: Utf
8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "method", data_type: Utf8, nullable: false, dict_id: 0, dict
_is_ordered: false, metadata: {} }, Field { name: "response", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }
, Field { name: "headers", data_type: Map(Field { name: "key_value", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dic
t_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metad
ata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false), nullable: true, dict_id: 0, dict_is_ordered: false, meta
data: {} }], metadata: {} }, columns: [StringArray
[
  "1",
], PrimitiveArray<Timestamp(Microsecond, None)>
[
  1970-01-01T00:00:00.000001,
], PrimitiveArray<Int32>
[
  200,
], StringArray
[
  "https://",
], StringArray
[
  "GET",
], StringArray
[
  "",
], MapArray
[
  StructArray
[
-- child 0: "key" (Utf8)
StringArray
[
  "this",
]
-- child 1: "value" (Utf8)
StringArray
]], row_count: 1 }

Which, when I attempt to write using RecordBatchWriter, results in a panic:

thread 'model::tests::zip_batches' panicked at 'not implemented: Take not supported for data type Map(Field { name: "key_value", data_type: Struct
([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Utf
8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, fals
e)', /home/tyler/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-select-33.0.0/src/take.rs:234:14

I would appreciate any examples, whether using JSON or directly constructed RecordBatch objects, of getting records with maps written properly to parquet 😦

Additional context

This is with arrow 33 for what it's worth.

@rtyler rtyler added the question Further information is requested label Mar 16, 2023
@tustvold
Contributor

I would appreciate any examples

There is an example using MapArrayBuilder - https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/array_reader/map_array.rs#L146

And an example using the JSON reader - https://github.com/apache/arrow-rs/blob/master/arrow-json/src/raw/mod.rs#L607

The documentation could definitely be improved for handling this particular type

Take not supported for data type

This looks like a failure in model::tests::zip_batches, which is calling the take kernel; the take kernel doesn't currently support MapArray. It is not anything to do with parsing the MapArray itself.

@rtyler
Contributor Author

rtyler commented Mar 16, 2023

@tustvold thank you for the suggestions. Investigating the take is a good suggestion; there's something fishy between my code, delta-rs, and arrow that I'm following up on now

@rtyler
Contributor Author

rtyler commented Mar 16, 2023

@tustvold as I am getting deeper into this problem, I'm curious if there's a reason why take() isn't implemented for MapArray? In the context of a RecordBatch, it seems to me a reasonable way to take rows out of the batch 😕

@tustvold
Contributor

tustvold commented Mar 17, 2023

No reason that I'm aware of; MapArray support simply hasn't been added to many of the kernels yet.

@wjones127 wjones127 changed the title Is it possible to write a map? Support take on MapArray Mar 24, 2023
@wjones127 wjones127 self-assigned this Mar 24, 2023
@tustvold tustvold added the arrow Changes to the arrow crate label Mar 25, 2023