[arrow-json] support deserializing JSON to variant #8998
Conversation
FYI @scovich and @harshmotw-db
Thank you for this PR @debugmiller!

I think this was a convenience rather than anything deliberate. I agree that using the existing JSON reader to read variant is a great(!!) idea, and I suspect it will be much faster than the …

Another alternative to rejiggering the dependencies could be to allow users to provide their own decoders for certain fields. This is probably overkill for just variant, but it would also make a nice API for other potential extension types (like some of the geospatial types from @kylebarron and @paleolimbot), similar to how … Would you be willing to consider this approach?
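To make the idea concrete, a rough sketch of what such a pluggable-decoder API might look like (all names here are hypothetical -- `DecoderFactory` and `with_decoder_factory` do not exist in arrow-json today, and `ArrayDecoder` is currently a private trait):

```rust
use arrow_schema::{ArrowError, Field};

// Stand-in for arrow-json's (currently private) per-column decoder trait.
pub trait ArrayDecoder: Send {}

// Hypothetical extension point: a user-supplied factory that inspects each
// Field (e.g. its extension-type metadata) and may return a custom decoder
// for that column; returning None falls back to the built-in decoders.
pub trait DecoderFactory: Send + Sync {
    fn make_decoder(&self, field: &Field) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError>;
}

// Registration could then hang off ReaderBuilder, roughly:
//   ReaderBuilder::new(schema)
//       .with_decoder_factory(Arc::new(VariantDecoderFactory))
//       .build(reader)?
```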
```toml
arrow-cast = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
parquet-variant-compute = { workspace = true }
```
Yeah, this is kind of wacky -- that arrow now depends on some sub-part of parquet.
```rust
#[derive(Default)]
pub struct VariantArrayDecoder {}

impl ArrayDecoder for VariantArrayDecoder {
```
this is so great
That seems like a nice compromise. I can look into pulling the tape decoder into …

Long(er) term I think it's worth considering a more seamless integration for the canonical extension types. As a casual user it would be nice to not have to treat them any differently than standard types.
I just discovered #7442, which is exactly what we are discussing.
In looking at that PR more closely, it would have to be tweaked to provide the …

The main sticking point on that PR seems to have been concern over making `Tape` public. Ironically, that PR seems to have died out with the expectation that variant parsing would fix the issue.
This is 100% Kyle, but the place where geoarrow-rs does the encoding is here: https://github.com/geoarrow/geoarrow-rs/blob/main/rust/geoarrow-geojson/src/encoder/factory.rs#L19-L23. It's slightly different because the GeoJSON standard keeps geometry at the center of the universe (everything else is "properties"). Given the opportunity, I think we'd be able to put pluggable encoders and decoders to good use.
In DataFusion we've been discussing a registry as a centralized place to customize how these types are handled, so that built-in components can handle them ergonomically (apache/datafusion#18223). I don't know as much about the expectations of arrow-rs users, but most of the things we're discussing there at the moment are just packaging customizations of arrow-rs components.
scovich left a comment:
Overall approach looks good. A few suggestions.
NOTE: I would have strongly preferred to review a pair of pull requests (one that did the reworking of deps, followed by the actual feature). That way, the first noisy but mechanical PR can be reviewed mechanically, while the second meaningful PR can be scrutinized more easily. "The sum is more than the parts" is absolutely true, but not a good thing when it comes to PR review overhead!
```rust
field: Option<FieldRef>,
data_type: DataType,
coerce_primitive: bool,
strict_mode: bool,
is_nullable: bool,
```
This seems very awkward: to take a `Field` but still require both `data_type` and `is_nullable`?

Should we just admit that the decoder officially uses fields now, and rely on `Field::is_nullable` and `Field::data_type`?

Especially when `ReaderBuilder::build_decoder` (L280/305 above) is anyway deriving those two values from fields it was already working with?
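A minimal sketch of the simplification being floated, assuming the constructor keeps the `Field` around:

```rust
use arrow_schema::{DataType, FieldRef};

// Sketch: derive both values from the Field instead of threading them
// through as separate constructor parameters.
fn params_from_field(field: &FieldRef) -> (DataType, bool) {
    (field.data_type().clone(), field.is_nullable())
}
```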
But it looks like nullability in particular is not always taken directly from a field (nested struct case below), and the data type is not always from a field either (decoder builder above). So maybe we should consider passing the field's metadata instead? But I don't know if we have extension type support directly on metadata (or if it requires a `Field` instance to work with)?
```rust
        }
        object_builder.finish();
    }
    TapeElement::EndObject(_u32) => unreachable!(),
```
Is it truly unreachable, even for invalid input JSON? Does the tape decoder have enough invariant checking to prove that this can never arise? What about bugs in the decoding process or tape navigation?
(perhaps it's better to just return an Err here instead of panicking)
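For instance (a sketch; the error message is a placeholder and assumes the surrounding function returns a `Result`):

```rust
// Sketch: surface a decode error instead of panicking if an unmatched
// EndObject ever appears (malformed tape or a navigation bug).
TapeElement::EndObject(_) => {
    return Err(ArrowError::JsonError(
        "unexpected EndObject while decoding variant".to_string(),
    ))
}
```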
```rust
match lexical_core::parse::<i64>(s.as_bytes()) {
    Ok(v) => Ok(Variant::from(v)),
    Err(_) => {
        match lexical_core::parse::<f64>(s.as_bytes()) {
            Ok(v) => Ok(Variant::from(v)),
            Err(_) => Err(ArrowError::JsonError(format!("failed to parse {s} as number"))),
        }
    }
}
```
nit: a few ideas to simplify the code
Suggested replacement:

```rust
if let Ok(v) = lexical_core::parse(s.as_bytes()) {
    return Ok(Variant::Int64(v));
}
match lexical_core::parse(s.as_bytes()) {
    Ok(v) => Ok(Variant::Double(v)),
    Err(_) => Err(ArrowError::JsonError(format!("failed to parse {s} as number"))),
}
```
```rust
            }
        }
    }
}
```

(no newline at end of file)
Do we care about missing newline at EOF?
```rust
use arrow_buffer::{NullBuffer, NullBufferBuilder};
use arrow_cast::CastOptions;
use arrow_schema::{ArrowError, DataType, Field, FieldRef, Fields, TimeUnit};
use std::result::Result;
```
Isn't Result a prelude auto-import?
Also, there seems to be at least some precedent in existing arrow packages to just define the arrow result manually if desired:

```rust
pub type Result<T> = std::result::Result<T, ArrowError>;
```

(see arrow-array/ffi, arrow-integration-testing, parquet/errors, etc.)

Given the amount of churn it causes across multiple files in the variant package, a type alias seems helpful?
Honestly, I've wondered why arrow-schema doesn't just provide that type alias for everyone to use, so we can simplify a lot of code across all of arrow-rs. Basically everyone has to depend on arrow-schema AFAIK.
```rust
use arrow_array::types::{
    self, ArrowPrimitiveType, ArrowTimestampType, Decimal32Type, Decimal64Type, Decimal128Type,
```
If you wanted, you could import `self as datatypes` to reduce churn? (Again below.)
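i.e. something like (sketch; the alias name is arbitrary):

```rust
// Aliasing the module keeps existing `datatypes::...` call sites unchanged.
use arrow_array::types::{self as datatypes, ArrowPrimitiveType, ArrowTimestampType};
```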
@scovich Thanks for the review. Big picture, what are your thoughts on the alternative approach that @alamb proposed above: adding the ability to register custom decoders on the reader and then exposing this as an optional decoder that could be registered? This was essentially already discussed in #7442, and at the time the main concern seemed to be around making `Tape` public.
TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow how the proposed extension machinery would help solve the problem this PR addresses. There are a bunch of issues and pull requests around customizable JSON parsing, especially when it comes to error handling. But in theory variant shouldn't have that problem, because every valid JSON value can be represented as some variant subtype. Some pull requests of my own also got caught up (and died) due to interactions with the tape decoder, so my immediate reaction is to get something working here and worry about expanding the overall API as a separate effort. But again, low-confidence answer.
The extension would avoid adding a dependency in arrow-json on parquet-variant. Extensions would also allow for other use cases, so I was thinking spending the time to work out a proper API might make sense (we have a use case in the repo -- variant -- and we have external use cases as well).
The rationale / justification on that ticket is also a bit sparse, as it talked in hypothetical terms of making a Spark JSON decoder without any additional details that I could gather.
Gotcha, thanks. What might the extension look like? More than making `Tape` public?
I was envisioning very minor tweaks to #7442.
The use case seems compelling. My past bad experiences with Java and …
I have incorporated the discussion from this PR into #9021.
Closing for now in favor of #9021.
Which issue does this PR close?
What changes are included in this PR?
With this change you can provide a Variant-annotated `Field` to the arrow-json reader, and it will deserialize JSON into that column as a Variant.
For example
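A rough sketch of the intended usage (`variant_annotated_field` is a hypothetical helper standing in for however the PR constructs the Variant-annotated Field; assume this runs inside a function returning `Result`):

```rust
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{Field, Schema};

// Hypothetical helper: builds a Field carrying the Variant extension metadata.
let variant_field: Field = variant_annotated_field("data");

let schema = Arc::new(Schema::new(vec![variant_field]));
let json = r#"{"data": {"a": 1, "b": [true, "x"]}}"#;

// &[u8] implements BufRead, so the raw JSON bytes can be fed directly.
let mut reader = ReaderBuilder::new(schema).build(json.as_bytes())?;
let batch = reader.next().unwrap()?; // a RecordBatch with a Variant column
```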
This is my first PR to this project, so I wanted to get this up for feedback. Some things to point out:
- The feature is gated behind `variant_experimental`, which matches the feature in other crates. Does this align with expectations for how this feature would be exposed?
- `arrow-json` must include `parquet-variant-compute`, so to avoid a circular dependency I had to modify `parquet-variant-compute` to include `arrow` sub-crates (e.g. `arrow-schema`) instead of including through `arrow`. That makes up a decent chunk of the changes in this PR and could be pulled out as a standalone change, but I was not sure if it was intentional that `parquet-variant-compute` included from `arrow` directly.
- (… `parquet-variant-compute`), numbers are deserialized to either `i64` or `f64`.

Are these changes tested?
Unit test was added. Open to feedback on whether that test is adequate or more is expected.
Are there any user-facing changes?
Yes, users can now provide Variant extension fields to the arrow-json reader.