
Conversation

@debugmiller commented Dec 15, 2025

Which issue does this PR close?

What changes are included in this PR?

With this change you can provide a Variant-annotated Field to the arrow-json reader, and it will deserialize the JSON for that column into a Variant.

For example:

use std::io::Cursor;
use std::sync::Arc;

use arrow_json::{ReaderBuilder, StructMode};
use arrow_schema::{DataType, Field, Schema};
use parquet_variant_compute::VariantArrayBuilder;

let variant_array = VariantArrayBuilder::new(0).build();

let schema = Schema::new(vec![
    Field::new("id", DataType::Int32, false),
    // call VariantArray::field to get the correct Field
    variant_array.field("var"),
]);

let builder = ReaderBuilder::new(Arc::new(schema));
let result = builder
    .with_struct_mode(StructMode::ObjectOnly)
    .build(Cursor::new(b"{\"id\": 1, \"var\": [\"mixed data\", 1]}"));
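
A rough sketch of consuming the result (assuming VariantArray::try_new and VariantArray::value work as in parquet-variant-compute today):

use parquet_variant_compute::VariantArray;

let mut reader = result.unwrap();
let batch = reader.next().unwrap().unwrap();
// Re-wrap the decoded "var" column as a typed VariantArray.
let var_column = batch.column_by_name("var").unwrap();
let variant_array = VariantArray::try_new(var_column).unwrap();
// Row 0 now holds the JSON array ["mixed data", 1] as a variant list.
let value = variant_array.value(0);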

This is my first PR to this project, so I wanted to get this up for feedback. Some things to point out:

  • I added this under an arrow-json feature named variant_experimental, which matches the feature name in other crates. Does this align with expectations for how this feature would be exposed?
  • arrow-json must depend on parquet-variant-compute, so to avoid a circular dependency I had to modify parquet-variant-compute to depend on arrow sub-crates (e.g. arrow-schema) directly instead of through arrow. That makes up a decent chunk of the changes in this PR and could be pulled out as a standalone change, but I was not sure whether it was intentional that parquet-variant-compute depended on arrow directly.
  • It does not attempt to deserialize to Decimal (which aligns with the existing code in parquet-variant-compute); numbers are deserialized to either i64 or f64.

Are these changes tested?

A unit test was added. I am open to feedback on whether that test is adequate or more coverage is expected.

Are there any user-facing changes?

Yes, users can now provide Variant extension fields to the arrow-json reader.

@github-actions bot added the parquet, arrow, and parquet-variant labels Dec 15, 2025
@alamb (Contributor) commented Dec 15, 2025

FYI @scovich and @harshmotw-db

@alamb (Contributor) commented Dec 15, 2025

Thank you for this PR @debugmiller

but I was not sure whether it was intentional that parquet-variant-compute depended on arrow directly.

I think this was a convenience rather than anything deliberate.

I agree that using the existing json reader to read variant is a great(!!) idea and I suspect it will be much faster than the json_to_variant kernel as well. I actually think long term we would like to change the implementation of json_to_variant to use the arrow JSON reader.

Another alternative to rejiggering the dependencies could be to allow users to provide their own decoders for certain fields. This is probably overkill for just variant, but it would also make a nice API for other potential extension types (like some of the geospatial types from @kylebarron and @paleolimbot).

Similar to how with_encoder_factory works for overriding the writing of JSON fields, we could add a with_decoder_factory / DecoderFactory for customizing decoding, and then provide a decoder factory implementation in parquet-variant-compute.
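
Decoding could then be overridden roughly like this (a sketch only: DecoderFactory, make_decoder, with_decoder_factory, and VariantDecoderFactory are hypothetical names loosely mirroring the writer's EncoderFactory, and ArrayDecoder is currently private to arrow-json):

use std::fmt::Debug;

use arrow_schema::{ArrowError, Field};

// Hypothetical extension point, by analogy with the writer's EncoderFactory.
// Assumes arrow-json's (currently private) ArrayDecoder trait is made public.
pub trait DecoderFactory: Debug + Send + Sync {
    /// Return Ok(Some(decoder)) to take over decoding for `field`,
    /// or Ok(None) to fall back to the built-in decoder.
    fn make_decoder(
        &self,
        field: &Field,
    ) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError>;
}

// Registration might then look like:
// let reader = ReaderBuilder::new(schema)
//     .with_decoder_factory(Arc::new(VariantDecoderFactory::default()))
//     .build(input)?;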

Would you be willing to consider this approach?

arrow-cast = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
parquet-variant-compute = { workspace = true }
Review comment (Contributor):

yeah, this is kind of wacky -- that arrow now depends on some sub-part of parquet

#[derive(Default)]
pub struct VariantArrayDecoder {}

impl ArrayDecoder for VariantArrayDecoder {
Review comment (Contributor):

this is so great

@debugmiller (Author) commented:

Another alternative to rejiggering the dependencies could be to allow users to provide their own decoders for certain fields. This is probably overkill for just variant, but it would also make a nice API for other potential extension types (like some of the geospatial types from @kylebarron and @paleolimbot).

Similar to how with_encoder_factory works for overriding the writing of JSON fields, we could add a with_decoder_factory / DecoderFactory for customizing decoding, and then provide a decoder factory implementation in parquet-variant-compute.

That seems like a nice compromise. I can look into pulling the tape decoder into parquet-variant-compute and adding an extension mechanism to the reader.

Long(er) term, I think it's worth considering a more seamless integration for the canonical extension types. As a casual user, it would be nice to not have to treat them any differently than standard types.
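
For reference, a reader could plausibly detect such a field from its metadata, along these lines (a sketch: EXTENSION_TYPE_NAME_KEY is arrow-schema's standard "ARROW:extension:name" key, while the "arrow.variant" extension name is assumed here for illustration):

use arrow_schema::extension::EXTENSION_TYPE_NAME_KEY;
use arrow_schema::Field;

// Returns true if `field` is annotated as a Variant extension type.
// "arrow.variant" is an assumed name; check the canonical extension registry.
fn is_variant_field(field: &Field) -> bool {
    field
        .metadata()
        .get(EXTENSION_TYPE_NAME_KEY)
        .is_some_and(|name| name.as_str() == "arrow.variant")
}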

@debugmiller (Author) commented:

I just discovered #7442 which is exactly what we are discussing

@debugmiller (Author) commented:

I just discovered #7442 which is exactly what we are discussing

In looking at that PR more closely, it would have to be tweaked to provide the Field to make_default_decoder (similar to how the encoder works), since that is the only place where the extension information is available.

The main sticking point on that PR seems to have been concern over making Tape public. Ironically, that PR seems to have died out with the expectation that variant parsing would fix the issue.

@paleolimbot (Member) commented:

This is probably overkill for just variant, but it would also make a nice API for other potential extension types (like some of the geospatial types from @kylebarron and @paleolimbot).

This is 100% Kyle's domain, but the place where geoarrow-rs does the encoding is here: https://github.com/geoarrow/geoarrow-rs/blob/main/rust/geoarrow-geojson/src/encoder/factory.rs#L19-L23 ... it's slightly different because the GeoJSON standard keeps geometry at the center of the universe (everything else is "properties"). Given the opportunity, I think we'd be able to put pluggable encoders and decoders to good use.

Long(er) term, I think it's worth considering a more seamless integration for the canonical extension types. As a casual user, it would be nice to not have to treat them any differently than standard types.

In DataFusion we've been discussing a registry as a centralized place to customize how these types are handled, so that built-in components can handle them ergonomically (apache/datafusion#18223). I don't know as much about the expectations of arrow-rs users, but most of the things we're discussing there at the moment are just packaging customizations of arrow-rs components.

@scovich (Contributor) left a comment:

Overall approach looks good. A few suggestions.

NOTE: I would have strongly preferred to review a pair of pull requests (one that did the reworking of deps, followed by the actual feature). That way, the first noisy but mechanical PR can be reviewed mechanically, while the second meaningful PR can be scrutinized more easily. "The sum is more than the parts" is absolutely true, but that is not a good thing when it comes to PR review overhead!

Comment on lines +697 to 701
field: Option<FieldRef>,
data_type: DataType,
coerce_primitive: bool,
strict_mode: bool,
is_nullable: bool,
Review comment (Contributor):

This seems very awkward: taking a Field but still requiring both data_type and is_nullable? Should we just admit that the decoder officially uses fields now, and rely on Field::is_nullable and Field::data_type?

Especially when ReaderBuilder::build_decoder (L280/305 above) is anyway deriving those two values from fields it was already working with?

Follow-up review comment (Contributor):

But it looks like nullability in particular is not always taken directly from a field (nested struct case below), and the data type is not always from a field either (decoder builder above). So maybe we should consider passing the field's metadata instead? But I don't know if we have extension type support directly on metadata (or if it requires a Field instance to work with)?

}
object_builder.finish();
}
TapeElement::EndObject(_u32) => unreachable!(),
Review comment (Contributor):

Is it truly unreachable, even for invalid input JSON? Does the tape decoder have enough invariant checking to prove that this can never arise? What about bugs in the decoding process or tape navigation?

(Perhaps it's better to just return an Err here instead of panicking.)
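
For instance, a minimal sketch of that (assuming the surrounding function returns Result<_, ArrowError>):

// Report malformed or unexpectedly navigated tape input as an error
// instead of panicking.
TapeElement::EndObject(_) => {
    return Err(ArrowError::JsonError(
        "unexpected EndObject while decoding a variant value".to_string(),
    ))
}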

Comment on lines +90 to +98
match lexical_core::parse::<i64>(s.as_bytes()) {
Ok(v) => Ok(Variant::from(v)),
Err(_) => {
match lexical_core::parse::<f64>(s.as_bytes()) {
Ok(v) => Ok(Variant::from(v)),
Err(_) => Err(ArrowError::JsonError(format!("failed to parse {s} as number"))),
}
}
}
Review comment (Contributor):

nit: a few ideas to simplify the code

Suggested change (replacing the nested match above):

if let Ok(v) = lexical_core::parse(s.as_bytes()) {
    return Ok(Variant::Int64(v));
}
match lexical_core::parse(s.as_bytes()) {
    Ok(v) => Ok(Variant::Double(v)),
    Err(_) => Err(ArrowError::JsonError(format!(
        "failed to parse {s} as number"
    ))),
}
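
(A nice side effect of the early-return form: lexical_core::parse is generic over its output type, so the Variant::Int64 and Variant::Double constructors pin down the parsed type and no ::<i64> / ::<f64> turbofish is needed.)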

}
}
}
}
\ No newline at end of file
Review comment (Contributor):

Do we care about missing newline at EOF?

use arrow_buffer::{NullBuffer, NullBufferBuilder};
use arrow_cast::CastOptions;
use arrow_schema::{ArrowError, DataType, Field, FieldRef, Fields, TimeUnit};
use std::result::Result;
Review comment (Contributor):

Isn't Result a prelude auto-import?

Follow-up review comment (Contributor):

Also, there seems to be at least some precedent in existing arrow packages to just define the arrow result manually if desired:
pub type Result<T> = std::result::Result<T, ArrowError>;
(see arrow-array/ffi, arrow-integration-testing, parquet/errors, etc.)

Given the amount of churn it causes across multiple files in the variant package, a type alias seems helpful?

Follow-up review comment (Contributor):

Honestly, I've wondered why arrow-schema doesn't just provide that type alias for everyone to use, so we can simplify a lot of code across all of arrow-rs. Basically everyone has to depend on arrow-schema AFAIK.

Comment on lines +21 to 22
use arrow_array::types::{
self, ArrowPrimitiveType, ArrowTimestampType, Decimal32Type, Decimal64Type, Decimal128Type,
Review comment (Contributor):

If you wanted, you could import self as datatypes to reduce churn?

(again below)

@debugmiller (Author) commented Dec 16, 2025

@scovich Thanks for the review. Big picture, what are your thoughts on the alternative approach that @alamb proposed above: adding the ability to register custom decoders on the reader, and then exposing this as an optional decoder that could be registered? This was essentially already discussed in #7442, and at the time the main concern seemed to be around making Tape public.

@scovich (Contributor) commented Dec 16, 2025

Big picture, what are your thoughts on the alternative approach that @alamb proposed above: adding the ability to register custom decoders on the reader, and then exposing this as an optional decoder that could be registered? This was essentially already discussed in #7442, and at the time the main concern seemed to be around making Tape public.

TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow which problem of this PR the proposed extension stuff would help solve. There are a bunch of issues and pull requests around customizable JSON parsing, especially when it comes to error handling, but in theory variant shouldn't have that problem, because every valid JSON value can be represented as some variant subtype. Some pull requests of my own also got caught up (and died) due to interactions with the tape decoder, so my immediate reaction is to get something working here and worry about expanding the overall API as a separate effort. But again, low-confidence answer.

@alamb (Contributor) commented Dec 17, 2025

TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow which problem of this PR the proposed extension stuff would help solve.

The extension would avoid adding a dependency on parquet-variant in arrow-json.

Extensions would allow for other use cases as well, so I was thinking that spending the time to work out a proper API might make sense (we have a use case in-repo -- variant -- and we have external use cases as well).

@alamb (Contributor) commented Dec 17, 2025

This was essentially already discussed in #7442, and at the time the main concern seemed to be around making Tape public.

The rationale / justification on that ticket is also a bit sparse, as it talked in hypothetical terms about making a Spark JSON decoder, without any additional details that I could gather.

@scovich (Contributor) commented Dec 17, 2025

TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow which problem of this PR the proposed extension stuff would help solve.

The extension would avoid adding a dependency on parquet-variant in arrow-json.

Extensions would allow for other use cases as well, so I was thinking that spending the time to work out a proper API might make sense (we have a use case in-repo -- variant -- and we have external use cases as well).

Gotcha, thanks. What might the extension look like? More than making ArrayDecoder public so that the VariantArrayDecoder can move to a better place?

@debugmiller (Author) commented:

TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow which problem of this PR the proposed extension stuff would help solve.

The extension would avoid adding a dependency on parquet-variant in arrow-json.
Extensions would allow for other use cases as well, so I was thinking that spending the time to work out a proper API might make sense (we have a use case in-repo -- variant -- and we have external use cases as well).

Gotcha, thanks. What might the extension look like? More than making ArrayDecoder public so that the VariantArrayDecoder can move to a better place?

I was envisioning very minor tweaks to #7442

@scovich (Contributor) commented Dec 17, 2025

TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow which problem of this PR the proposed extension stuff would help solve.

The extension would avoid adding a dependency on parquet-variant in arrow-json.
Extensions would allow for other use cases as well, so I was thinking that spending the time to work out a proper API might make sense (we have a use case in-repo -- variant -- and we have external use cases as well).

Gotcha, thanks. What might the extension look like? More than making ArrayDecoder public so that the VariantArrayDecoder can move to a better place?

I was envisioning very minor tweaks to #7442

The use case seems compelling.

My past bad experiences with Java and FactoryProviderRegistryFactory-type classes make it hard to like factory/registry-based approaches at first glance... but I have no idea what a better approach might even look like, so I should probably just bite my tongue 😛. Especially given that the companion PR (for JSON encoding) already merged.

@debugmiller (Author) commented:

I have incorporated the discussion from this PR into #9021

@debugmiller (Author) commented:

closing for now in favor of #9021


Development

Successfully merging this pull request may close these issues.

[arrow-json] deserialize Variant fields
