Added support for deserializing JSON from iterator #989
Conversation
I've fixed the previous tests by adding the correct validity. I also moved the tests to be more consistent with the rest of the package.
@jorgecarleitao, @ritchie46, what do you think?
I am off for the weekend but can review after. I like the feature, so thanks for the PR @cjermain.
Codecov Report
@@ Coverage Diff @@
## main #989 +/- ##
==========================================
+ Coverage 71.37% 71.39% +0.02%
==========================================
Files 357 357
Lines 19781 19795 +14
==========================================
+ Hits 14118 14133 +15
+ Misses 5663 5662 -1
Continue to review full report at Codecov.
Hey, really sorry for the delay, have been sick for the past 3 days and I am catching up.
This looks very good. Great work!
3 suggestions:
- The rest of the crate declares external formats under `io/` (`json`, `csv`, etc.), and keeps `array` for the arrow array declarations. Would it be possible to move this implementation to under `io/json/read` like the rest?
- Could we move it outside the namespace of `Utf8Array`? `json` is quite a specific format, and maybe we could just keep it under `io::json::read`?
- Reading through both implementations, it seems to me that the only thing that we need is an `I: Iterator<Item = Option<&str>>`. Thus, could it make sense to refactor our implementation of `json` to have two new public functions (infer and deserialize of `I`), and have our currently public API depend on that? Users would then use

  ```rust
  let array: Utf8Array = ...;
  let dt = infer(array.iter())?;
  let array = deserialize(array.iter())?;
  ```

  This way we do not need to commit to a specific implementation of the iterator (`Utf8Array.iter()` in this case); one possible shape for those two entry points is sketched after this list.
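To make that third suggestion concrete, the two iterator-based entry points could look roughly like this. This is only a sketch: the names `infer`/`deserialize`, the exact bounds, and the `todo!()` bodies are placeholders, not an agreed API.

```rust
// Sketch of the two iterator-based entry points suggested above; bodies elided.
use std::sync::Arc;

use arrow2::array::Array;
use arrow2::datatypes::DataType;
use arrow2::error::ArrowError;

/// Infers the [`DataType`] that can hold all JSON values yielded by `rows`
/// (a `None` item is treated as a JSON `null`).
pub fn infer<'a, I>(rows: I) -> Result<DataType, ArrowError>
where
    I: Iterator<Item = Option<&'a str>>,
{
    // parse each row with serde_json and coerce the per-row types into one DataType
    todo!()
}

/// Deserializes the JSON values yielded by `rows` into an [`Array`] of `data_type`.
pub fn deserialize<'a, I>(rows: I, data_type: DataType) -> Result<Arc<dyn Array>, ArrowError>
where
    I: Iterator<Item = Option<&'a str>>,
{
    // parse each row into a serde_json::Value and build the array for `data_type`
    todo!()
}
```

With such entry points, the existing slice-based readers and the `Utf8Array` case would both become thin wrappers that construct the appropriate iterator.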
Thanks @jorgecarleitao! Hope you are feeling better. I agree with moving it into `io/json/read`.
So, this is what I was able to come up with on the deserialization path (we need something similar for the inference path; a rough sketch of that counterpart follows below):

```rust
use std::sync::Arc;

use serde_json::Value;

use crate::array::Array;
use crate::datatypes::DataType;
use crate::error::ArrowError;

use super::super::super::json::read::_deserialize;

/// Deserializes rows into an [`Array`] of [`DataType`].
/// # Implementation
/// This function is CPU-bounded.
/// This function is guaranteed to return an array of length equal to `rows.len()`.
/// # Errors
/// This function errors iff any of the rows is not valid JSON (i.e. the format is not valid NDJSON).
pub fn deserialize(rows: &[String], data_type: DataType) -> Result<Arc<dyn Array>, ArrowError> {
    deserialize_iter(rows.iter(), data_type)
}

/// Deserializes rows into an [`Array`] of [`DataType`].
/// # Implementation
/// This function is CPU-bounded.
/// This function is guaranteed to return an array of length equal to the length of the iterator.
/// # Errors
/// This function errors iff any of the rows is not valid JSON (i.e. the format is not valid NDJSON).
pub fn deserialize_iter<A: AsRef<str>>(
    rows: impl Iterator<Item = A>,
    data_type: DataType,
) -> Result<Arc<dyn Array>, ArrowError> {
    // deserialize strings to `Value`s
    let rows = rows
        .map(|row| serde_json::from_str(row.as_ref()).map_err(ArrowError::from))
        .collect::<Result<Vec<Value>, ArrowError>>()?;

    // deserialize &[Value] to Array
    Ok(_deserialize(&rows, data_type))
}

// ----------------------------
// test that it compiles for a concrete Utf8Array ^_^; no need to expose it in the public API imo
// ----------------------------
fn des_array(
    array: &crate::array::Utf8Array<i32>,
    data_type: DataType,
) -> Result<Arc<dyn Array>, ArrowError> {
    deserialize_iter(array.iter().map(|x| x.unwrap_or("null")), data_type)
}
```

In other words, instead of requiring a `&[String]`, `deserialize_iter` accepts any iterator of items that can be viewed as `&str`.

On a separate note: ... We can avoid multiple allocations of ... There are some aux methods to convert ...
Thanks, this gives me a clear idea of what to change. The relevant existing code is here:

arrow2/src/io/ndjson/read/file.rs, lines 45 to 58 (b9aa8e8)
arrow2/src/io/ndjson/read/file.rs, lines 114 to 118 (b9aa8e8)
I've made the updates.
Thanks for this 🙇, it is super.
This PR adds support for deserializing JSON for each row in a `Utf8Array`. It mimics the interface of `arrow2::io::json::read` and `arrow2::io::ndjson::read`, allowing you to apply the transformation on the full array. Currently `polars` supports JsonPath traversal of an underlying `Utf8Array` in `DataFrame.str.json_path_match`, but is not able to return a properly typed `Series` -- it can only handle the first element as a string type. This change opens support for getting the correct type. It also raises the question as to whether the reverse operations would be valuable. For now, the scope of this PR is to support `Utf8Array => ArrayRef` of the appropriate underlying type based on the inferred or supplied schema.

In 7f00de1, I also added validity support for `StructArray` in the JSON deserializer. This is important to keep consistency on the null count.

I'm still getting familiar with Rust, so happy to hear any suggestions for improvements.
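For context, a small self-contained illustration of the situation this description refers to: a `Utf8Array` whose rows are JSON documents. The snippet only parses the rows with `serde_json` to show the input side; turning those parsed values into a properly typed arrow array (with validity, as for the `StructArray` case above) is what the functions added in this PR do. The column contents and field names are made up for illustration.

```rust
use arrow2::array::Utf8Array;
use serde_json::Value;

fn main() -> serde_json::Result<()> {
    // a string column where each row holds one JSON document
    let json_col = Utf8Array::<i32>::from_slice([
        r#"{"a": 1, "b": "x"}"#,
        r#"{"a": 2, "b": null}"#,
    ]);

    // parse each row into a serde_json::Value; the deserializer added by this PR
    // walks such values to build an array of the inferred (or supplied) DataType
    let rows: Vec<Value> = json_col
        .values_iter()
        .map(|row| serde_json::from_str::<Value>(row))
        .collect::<serde_json::Result<_>>()?;

    assert_eq!(rows.len(), json_col.len());
    assert_eq!(rows[0]["a"], Value::from(1));
    Ok(())
}
```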