[arrow-json] support deserializing JSON to variant #8998
Conversation
FYI @scovich and @harshmotw-db
Thank you for this PR @debugmiller!

I think this was a convenience rather than anything deliberate. I agree that using the existing JSON reader to read variant is a great(!!) idea, and I suspect it will be much faster than the …

Another alternative to rejiggering the dependencies could be to allow users to provide their own decoders for certain fields. This is probably overkill for just variant, but it would also make a nice API for other potential extension types (like some of the geospatial types from @kylebarron and @paleolimbot), similar to how … Would you be willing to consider this approach?
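To make the idea concrete, a rough sketch of what such a pluggable-decoder API might look like (all names here are hypothetical -- `DecoderFactory` and `with_decoder_factory` do not exist in arrow-json today, and `ArrayDecoder` is currently a private trait):

```rust
use arrow_schema::{ArrowError, Field};

// Stand-in for arrow-json's (currently private) per-column decoder trait.
pub trait ArrayDecoder: Send {}

// Hypothetical extension point: a user-supplied factory that inspects each
// Field (e.g. its extension-type metadata) and may return a custom decoder
// for that column; returning None falls back to the built-in decoders.
pub trait DecoderFactory: Send + Sync {
    fn make_decoder(&self, field: &Field) -> Result<Option<Box<dyn ArrayDecoder>>, ArrowError>;
}

// Registration could then hang off ReaderBuilder, roughly:
//   ReaderBuilder::new(schema)
//       .with_decoder_factory(Arc::new(VariantDecoderFactory))
//       .build(reader)?
```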
```toml
arrow-cast = { workspace = true }
arrow-data = { workspace = true }
arrow-schema = { workspace = true }
parquet-variant-compute = { workspace = true }
```
Yeah, this is kind of wacky -- that arrow now depends on some sub-part of parquet.
```rust
#[derive(Default)]
pub struct VariantArrayDecoder {}

impl ArrayDecoder for VariantArrayDecoder {
```
this is so great
That seems like a nice compromise. I can look into pulling the tape decoder into …

Long(er) term I think it's worth considering a more seamless integration for the canonical extension types. As a casual user it would be nice to not have to treat them any differently than standard types.
I just discovered #7442, which is exactly what we are discussing.
In looking at that PR more closely, it would have to be tweaked to provide the …

The main sticking point on that PR seems to have been concern over making `Tape` public. Ironically, that PR seems to have died out with the expectation that variant parsing would fix the issue.
This is 100% Kyle, but the place where geoarrow-rs does the encoding is here: https://github.com/geoarrow/geoarrow-rs/blob/main/rust/geoarrow-geojson/src/encoder/factory.rs#L19-L23. It's slightly different because the GeoJSON standard keeps geometry at the center of the universe (everything else is "properties"). Given the opportunity, I think we'd be able to put pluggable encoders and decoders to good use.
In DataFusion we've been discussing a registry as a centralized place to customize how these types are handled, so that built-in components can handle them ergonomically (apache/datafusion#18223). I don't know as much about the expectations of arrow-rs users, but most of the things we're discussing there at the moment are just packaging customizations of arrow-rs components.
scovich left a comment:
Overall approach looks good. A few suggestions.
NOTE: I would have strongly preferred to review a pair of pull requests (one that did the reworking of deps, followed by the actual feature). That way, the first noisy but mechanical PR can be reviewed mechanically, while the second meaningful PR can be scrutinized more easily. "The sum is more than the parts" is absolutely true, but not a good thing when it comes to PR review overhead!
```rust
field: Option<FieldRef>,
data_type: DataType,
coerce_primitive: bool,
strict_mode: bool,
is_nullable: bool,
```
This seems very awkward: to take a `Field` but still require both `data_type` and `is_nullable`?

Should we just admit that the decoder officially uses fields now, and rely on `Field::is_nullable` and `Field::data_type`?

Especially when `ReaderBuilder::build_decoder` (L280/305 above) is anyway deriving those two values from fields it was already working with?
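A minimal sketch of the simplification being floated, assuming the constructor keeps the `Field` around:

```rust
use arrow_schema::{DataType, FieldRef};

// Sketch: derive both values from the Field instead of threading them
// through as separate constructor parameters.
fn params_from_field(field: &FieldRef) -> (DataType, bool) {
    (field.data_type().clone(), field.is_nullable())
}
```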
But it looks like nullability in particular is not always taken directly from a field (nested struct case below), and the data type is not always from a field either (decoder builder above). So maybe we should consider passing the field's metadata instead? But I don't know if we have extension type support directly on metadata (or if it requires a `Field` instance to work with)?
```rust
        }
        object_builder.finish();
    }
    TapeElement::EndObject(_u32) => unreachable!(),
```
Is it truly unreachable, even for invalid input JSON? Does the tape decoder have enough invariant checking to prove that this can never arise? What about bugs in the decoding process or tape navigation?
(perhaps it's better to just return an Err here instead of panicking)
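For instance (a sketch; the error message is a placeholder and assumes the surrounding function returns a `Result`):

```rust
// Sketch: surface a decode error instead of panicking if an unmatched
// EndObject ever appears (malformed tape or a navigation bug).
TapeElement::EndObject(_) => {
    return Err(ArrowError::JsonError(
        "unexpected EndObject while decoding variant".to_string(),
    ))
}
```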
```rust
match lexical_core::parse::<i64>(s.as_bytes()) {
    Ok(v) => Ok(Variant::from(v)),
    Err(_) => {
        match lexical_core::parse::<f64>(s.as_bytes()) {
            Ok(v) => Ok(Variant::from(v)),
            Err(_) => Err(ArrowError::JsonError(format!("failed to parse {s} as number"))),
        }
    }
}
```
nit: a few ideas to simplify the code
Suggested replacement:

```rust
if let Ok(v) = lexical_core::parse(s.as_bytes()) {
    return Ok(Variant::Int64(v));
}
match lexical_core::parse(s.as_bytes()) {
    Ok(v) => Ok(Variant::Double(v)),
    Err(_) => Err(ArrowError::JsonError(format!("failed to parse {s} as number"))),
}
```
```rust
            }
        }
    }
}
```

(no newline at end of file)
Do we care about missing newline at EOF?
```rust
use arrow_buffer::{NullBuffer, NullBufferBuilder};
use arrow_cast::CastOptions;
use arrow_schema::{ArrowError, DataType, Field, FieldRef, Fields, TimeUnit};
use std::result::Result;
```
Isn't Result a prelude auto-import?
Also, there seems to be at least some precedent in existing arrow packages to just define the arrow result manually if desired:

```rust
pub type Result<T> = std::result::Result<T, ArrowError>;
```

(see arrow-array/ffi, arrow-integration-testing, parquet/errors, etc.)

Given the amount of churn it causes across multiple files in the variant package, a type alias seems helpful?
Honestly, I've wondered why arrow-schema doesn't just provide that type alias for everyone to use, so we can simplify a lot of code across all of arrow-rs. Basically everyone has to depend on arrow-schema AFAIK.
```rust
use arrow_array::types::{
    self, ArrowPrimitiveType, ArrowTimestampType, Decimal32Type, Decimal64Type, Decimal128Type,
```
If you wanted, you could import `self as datatypes` to reduce churn? (Again below.)
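i.e. something like (sketch; the alias name is arbitrary):

```rust
// Aliasing the module keeps existing `datatypes::...` call sites unchanged.
use arrow_array::types::{self as datatypes, ArrowPrimitiveType, ArrowTimestampType};
```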
@scovich Thanks for the review. Big picture, what are your thoughts on the alternative approach that @alamb proposed above: adding the ability to register custom decoders on the reader and then exposing this as an optional decoder that could be registered? This was essentially already discussed in #7442, and at the time the main concern seemed to be around making `Tape` public.
TBH, I don't think I understand it well enough to have a strong opinion. In particular, I didn't follow how the proposed extension machinery would help solve the problem this PR addresses. There are a bunch of issues and pull requests around customizable JSON parsing, especially when it comes to error handling. But in theory variant shouldn't have that problem, because every valid JSON value can be represented as some variant subtype. Some pull requests of my own also got caught up (and died) due to interactions with the tape decoder, so my immediate reaction is to get something working here and worry about expanding the overall API as a separate effort. But again, low-confidence answer.
The extension would avoid adding a dependency in arrow-json on parquet-variant. Extensions would also allow for other use cases, so I was thinking spending the time to work out a proper API might make sense (we have a use case in the repo -- variant -- and we have external use cases as well).
The rationale / justification on that ticket is also a bit sparse, as it talked in hypothetical terms of making a Spark JSON decoder without any additional details that I could gather.
Gotcha, thanks. What might the extension look like? More than making `Tape` public?
I was envisioning very minor tweaks to #7442.
The use case seems compelling. My past bad experiences with Java and …
I have incorporated the discussion from this PR into #9021.
Closing for now in favor of #9021.
Which issue does this PR close?
What changes are included in this PR?
With this change you can provide a Variant-annotated `Field` to the arrow-json reader, and it will deserialize JSON into that column as a Variant.
For example
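A rough sketch of the intended usage (`variant_annotated_field` is a hypothetical helper standing in for however the PR constructs the Variant-annotated Field; assume this runs inside a function returning `Result`):

```rust
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{Field, Schema};

// Hypothetical helper: builds a Field carrying the Variant extension metadata.
let variant_field: Field = variant_annotated_field("data");

let schema = Arc::new(Schema::new(vec![variant_field]));
let json = r#"{"data": {"a": 1, "b": [true, "x"]}}"#;

// &[u8] implements BufRead, so the raw JSON bytes can be fed directly.
let mut reader = ReaderBuilder::new(schema).build(json.as_bytes())?;
let batch = reader.next().unwrap()?; // a RecordBatch with a Variant column
```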
This is my first PR to this project, so I wanted to get this up for feedback. Some things to point out:
- The feature is gated behind `variant_experimental`, which matches the feature in other crates. Does this align with expectations for how this feature would be exposed?
- `arrow-json` must include `parquet-variant-compute`, so to avoid a circular dependency I had to modify `parquet-variant-compute` to include `arrow` sub-crates (e.g. `arrow-schema`) instead of including through `arrow`. That makes up a decent chunk of the changes in this PR and could be pulled out as a standalone change, but I was not sure if it was intentional that `parquet-variant-compute` included from `arrow` directly.
- (… `parquet-variant-compute`), numbers are deserialized to either `i64` or `f64`.

Are these changes tested?
Unit test was added. Open to feedback on whether that test is adequate or more is expected.
Are there any user-facing changes?
Yes, users can now provide Variant extension fields to the arrow-json reader.