-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds optional serde support to datafusion-proto #2892
Adds optional serde support to datafusion-proto #2892
Conversation
macro_rules! roundtrip_expr_test { | ||
($initial_struct:ident, $ctx:ident) => { | ||
let proto: protobuf::LogicalExprNode = (&$initial_struct).try_into().unwrap(); | ||
fn roundtrip_expr_test<T, E>(initial_struct: T, ctx: SessionContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by refactor to eliminate an unnecessary macro, as they complicate development, refactoring, debugging, etc...
Codecov Report
@@ Coverage Diff @@
## master #2892 +/- ##
==========================================
- Coverage 85.34% 85.32% -0.03%
==========================================
Files 276 273 -3
Lines 49290 49351 +61
==========================================
+ Hits 42069 42110 +41
- Misses 7221 7241 +20
Continue to review full report at Codecov.
|
E: Debug, | ||
{ | ||
let proto: protobuf::LogicalExprNode = (&initial_struct).try_into().unwrap(); | ||
let round_trip: Expr = parse_expr(&proto, &ctx).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this PR! I'm sure it's a step in the correct direction, but after cloning it, I'm having trouble figuring out how to extend it to include LogicalPlan
s as well. There appears to be no equivalent parse_plan()
?
If I'm reading this correctly, it looks like the protobuf generator generated an entirely separate copy of all the Expr & LogicalPlan enums, and we created some fairly large functions to convert back and forth between them like parse_expr
? If so the next step would be to write parse_plan
?
It does look from other tests like we are able to roundtrip plans to protobuf, so maybe I am missing something. When I tried:
let protobuf =
protobuf::LogicalPlanNode::try_from_logical_plan(&topk_plan, &extension_codec)?;
let string = serde_json::to_string(&protobuf).unwrap();
I get
| let string = serde_json::to_string(&protobuf).unwrap();
| --------------------- ^^^^^^^^^ the trait `serde::ser::Serialize` is not implemented for `LogicalPlanNode`
| |
| required by a bound introduced by this call
Sorry, I'm a bit lost. Any help is appreciated :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was running my tests without the "serde" feature 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I got all that sorted, and was able to do what I was trying to use this PR for: tustvold#63
I was hoping (naively) to see something like:
{Projection: { field: "#employee_csv.id",
Filter: {expr: {left: "#employee_csv.state", op: "IN", right: {
Subquery: {
TableScan: {name: "employee_csv", projection=["state"]},
TableScan: {name: "employee_csv" projection=["id", "state"]
}}
Unfortunately (though serializable and deserializable) the result is not as human readable as I hoped, and probably not something I'd want to check into a test assertion: https://gist.github.com/avantgardnerio/35a04950ca768fdfe6579aea08126b74
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above being said, I can see other uses for JSON serialization that extend beyond my narrow use-case that could make this PR valuable enough to merge on its own merits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running that raw JSON through this formatter produces some encouraging output:
{
"projection": {
"input": {
"sort": {
"input": {
"projection": {
"input": {
"listingScan": {
"tableName": "alltypes_plain",
"paths": ["file:///home/bgardner/workspace/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet"],
"fileExtension": ".parquet",
"schema": {
"columns": [
{"name": "id", "arrowType": {"INT32": {}}, "nullable": true},
{"name": "bool_col", "arrowType": {"BOOL": {}}, "nullable": true},
{"name": "tinyint_col", "arrowType": {"INT32": {}}, "nullable": true},
{"name": "smallint_col", "arrowType": {"INT32": {}}, "nullable": true},
{"name": "int_col", "arrowType": {"INT32": {}}, "nullable": true},
{"name": "bigint_col", "arrowType": {"INT64": {}}, "nullable": true},
{"name": "float_col", "arrowType": {"FLOAT32": {}}, "nullable": true},
{"name": "double_col", "arrowType": {"FLOAT64": {}}, "nullable": true},
{"name": "date_string_col", "arrowType": {"BINARY": {}}, "nullable": true},
{"name": "string_col", "arrowType": {"BINARY": {}}, "nullable": true},
{"name": "timestamp_col", "arrowType": {"TIMESTAMP": {"timeUnit": "Nanosecond"}}, "nullable": true}
]
},
"collectStat": true,
"targetPartitions": 16,
"parquet": {"enablePruning": true}
}
},
"expr": [
{"column": {"name": "id", "relation": {"relation": "alltypes_plain"}}},
{"column": {"name": "int_col", "relation": {"relation": "alltypes_plain"}}},
{"column": {"name": "double_col", "relation": {"relation": "alltypes_plain"}}}
]
}
},
"expr": [
{"sort": {"expr": {"column": {"name": "int_col", "relation": {"relation": "alltypes_plain"}}}, "asc": true}},
{"sort": {"expr": {"column": {"name": "double_col", "relation": {"relation": "alltypes_plain"}}}, "asc": true}}
]
}
},
"expr": [{"column": {"name": "id", "relation": {"relation": "alltypes_plain"}}}]
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After messing around a bit, I'm convinced this is just what we need. It does look like the DefaultExtensionCodec
is not exposed in other packages? So maybe it makes sense to expose methods like logical_plan_to_json()
for similar reasons as logical_plan_to_bytes()
- making it easier to use from outside the proto
crate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look at tustvold#64 before merging into Apache. I think those couple tweaks are important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* Misc suggestions * Update datafusion/proto/Cargo.toml Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
datafusion/proto/Cargo.toml
Outdated
serde = { version = "1.0", optional = true } | ||
serde_json = { version = "1.0", optional = true } | ||
pbjson = { version = "0.3", optional = true } | ||
pbjson-types = { version = "0.3", optional = true } | ||
|
||
[dev-dependencies] | ||
doc-comment = "0.3" | ||
tokio = "1.18" | ||
|
||
[build-dependencies] | ||
tonic-build = { version = "0.7" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tonic-build = { version = "0.7" } | |
pbjson-build = { version = "0.3", optional = true } |
|
||
[dev-dependencies] | ||
doc-comment = "0.3" | ||
tokio = "1.18" | ||
|
||
[build-dependencies] | ||
tonic-build = { version = "0.7" } | ||
pbjson-build = { version = "0.3", optional = true } | ||
prost-build = { version = "0.7" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was actually no reason for this to depend on tonic-build, and not just prost-build. This is likely an orphan from when ballista was extracted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cargo fix sugestions
@@ -33,18 +33,24 @@ name = "datafusion_proto" | |||
path = "src/lib.rs" | |||
|
|||
[features] | |||
default = [] | |||
json = ["pbjson", "pbjson-build", "serde", "serde_json"] | |||
|
|||
[dependencies] | |||
arrow = { version = "18.0.0" } | |||
datafusion = { path = "../core", version = "10.0.0" } | |||
datafusion-common = { path = "../common", version = "10.0.0" } | |||
datafusion-expr = { path = "../expr", version = "10.0.0" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
datafusion-expr = { path = "../expr", version = "10.0.0" } | |
datafusion-expr = { path = "../expr", version = "10.0.0" } | |
pbjson = { version = "0.3", optional = true } | |
pbjson-types = { version = "0.3", optional = true } |
datafusion/proto/Cargo.toml
Outdated
serde = { version = "1.0", optional = true } | ||
serde_json = { version = "1.0", optional = true } | ||
pbjson = { version = "0.3", optional = true } | ||
pbjson-types = { version = "0.3", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pbjson-types = { version = "0.3", optional = true } |
datafusion/proto/Cargo.toml
Outdated
|
||
serde = { version = "1.0", optional = true } | ||
serde_json = { version = "1.0", optional = true } | ||
pbjson = { version = "0.3", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pbjson = { version = "0.3", optional = true } |
Benchmark runs are scheduled for baseline = c528986 and contender = c67161b. c67161b is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2889
Rationale for this change
This provides a way to serialize the datafusion-proto messages to any serde serializer, e.g. JSON, based on the protobuf JSON specification.
What changes are included in this PR?
Uses pbjson to auto-generate the serde implementations for the protobuf messages. Full disclosure: I am the primary author and maintainer of this crate.
Are there any user-facing changes?
No