-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Reduce the number of distinct binaries to reduce CI / compile time (#… #18289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Reduce the number of distinct binaries to reduce CI / compile time (#… #18289
Conversation
|
Would you be able to provide a high level overview of the changes made here? The PR seems pretty big and it's hard to review without some description of the changes made |
|
High-Level Overview This PR consolidates multiple standalone example binaries into a single example binary that uses subcommands to run individual examples. The main goal is to reduce the number of distinct binaries and make the examples easier to organize and maintain. Key Changes
Example Usage To run a specific example, use: cargo run --example datafusion -- datafusion |
|
I’ve noticed that the CI cargo audit check is failing due to the following advisories:
My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project. Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)? |
bc2f0a6 to
0a9cbe2
Compare
It was fixed: #18288 |
…e-number-of-distinct-binaries
|
The merge conflict in encrypted.rs has been resolved, and all checks have passed. Ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up.
I think rust_example.sh needs to be fixed to run the examples in this new format.
At least from the compilation artifacts it seems a nice win:
# off main
datafusion (main)$ du -s -h ~/.cargo_target_cache/ci/examples/
5.0G /Users/jeffrey/.cargo_target_cache/ci/examples/
# off this PR
datafusion (pr_18289)$ du -s -h ~/.cargo_target_cache/ci/examples/
1.4G /Users/jeffrey/.cargo_target_cache/ci/examples/|
|
||
| //! # Advanced UDF/UDAF/UDWF/Asynchronous UDF Examples | ||
| //! | ||
| //! This example demonstrates advanced user-defined functions in DataFusion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| //! This example demonstrates advanced user-defined functions in DataFusion. | |
| //! These examples demonstrates advanced user-defined functions in DataFusion. |
| #[tokio::main] | ||
| async fn main() -> Result<()> { | ||
| let arg = std::env::args().nth(1).ok_or_else(|| { | ||
| eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf|udwf|async_udf]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's better to construct the options from ExampleKind so if we add examples in future there's one less place to forget to update 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point
| - [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser` | ||
| - [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files) | ||
| - [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries. | ||
| - [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might need to uplift this doc to be a table instead, so it's easier at a glance to see which command to run for which example, e.g.
| Group | Example |
|---|---|
| advanced_udaf | udf |
| advanced_udaf | udwf |
| query_planning | analyzer_rule |
Or would this cause too much duplication of information and be prone to drift?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes a lot of sense to create a table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?
| AsyncUdf, | ||
| } | ||
|
|
||
| impl FromStr for ExampleKind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do wonder if there's a way to templatize/abstract this whole file so they don't drift in structure from the others 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - indeed, the example entry files follow a similar pattern. It could make sense to refactor this into a shared helper or macro in the future to keep them in sync. For now, I kept this consistent with the existing example structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cj-zhukov -- this is an exciting idea but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contingiuous review time, which is quite hard to find)
Is there any way to break it up into smaller pieces?
For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?
Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples
| - [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser` | ||
| - [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files) | ||
| - [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries. | ||
| - [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?
…18142)
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?