Reduce the number of distinct binaries to reduce CI / compile time (#18142) #18289

cj-zhukov · 2025-10-26T13:22:06Z

Which issue does this PR close?

Closes #18142 (Reduce the number of distinct binaries to reduce CI / compile time).

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Jefffrey · 2025-10-27T06:22:39Z

Would you be able to provide a high level overview of the changes made here? The PR seems pretty big and it's hard to review without some description of the changes made

cj-zhukov · 2025-10-27T08:19:17Z

High-Level Overview

This PR consolidates multiple standalone example binaries into a single example binary that uses subcommands to run individual examples. The main goal is to reduce the number of distinct binaries and make the examples easier to organize and maintain.

Key Changes

Grouped similar examples under common folders (e.g. sql_ops, datafusion).
Each example can now be launched via a subcommand.

Example Usage

To run a specific example, use:

cargo run --example datafusion -- datafusion
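
As a rough illustration of the subcommand pattern described above, here is a minimal, std-only sketch. It is not the code from this PR: `run_udf`, `run_udaf`, and `dispatch` are hypothetical names, and the real entry points are async and call into DataFusion.

```rust
use std::env;

// Hypothetical stand-ins for the real example entry points.
fn run_udf() -> Result<(), String> {
    println!("running the udf example");
    Ok(())
}

fn run_udaf() -> Result<(), String> {
    println!("running the udaf example");
    Ok(())
}

/// Maps a subcommand name to the example it should run.
fn dispatch(subcommand: &str) -> Result<(), String> {
    match subcommand {
        "udf" => run_udf(),
        "udaf" => run_udaf(),
        other => Err(format!("unknown example: {other}")),
    }
}

fn main() {
    // The first argument after `--` selects the example.
    match env::args().nth(1) {
        Some(arg) => {
            if let Err(e) = dispatch(&arg) {
                eprintln!("{e}");
            }
        }
        None => eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf]"),
    }
}
```

With this layout, one binary per group is compiled and linked instead of one per example file, which is where the compile-time and disk savings would come from.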

cj-zhukov · 2025-10-27T12:15:33Z

I’ve noticed that the CI cargo audit check is failing due to the following advisories:

RUSTSEC-2025-0111 (tokio-tar)
RUSTSEC-2024-0014 (generational-arena)
RUSTSEC-2024-0436 (paste)

My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project.

Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)?
I just wanted to make sure I don’t make unnecessary config changes in this PR.

cj-zhukov · 2025-10-28T05:42:34Z

Update on the failing cargo audit check I mentioned above (RUSTSEC-2025-0111, RUSTSEC-2024-0014, RUSTSEC-2024-0436): it was fixed in #18288.

cj-zhukov · 2025-10-29T12:10:34Z

The merge conflict in encrypted.rs has been resolved, and all checks have passed. Ready for review.

Jefffrey

Thanks for picking this up.

I think rust_example.sh needs to be fixed to run the examples in this new format.

At least from the compilation artifacts it seems a nice win:

# off main
datafusion (main)$ du -s -h ~/.cargo_target_cache/ci/examples/
5.0G    /Users/jeffrey/.cargo_target_cache/ci/examples/
# off this PR
datafusion (pr_18289)$ du -s -h ~/.cargo_target_cache/ci/examples/
1.4G    /Users/jeffrey/.cargo_target_cache/ci/examples/

Jefffrey · 2025-10-30T07:36:10Z

datafusion-examples/examples/advanced_udf/main.rs

+
+//! # Advanced UDF/UDAF/UDWF/Asynchronous UDF Examples
+//!
+//! This example demonstrates advanced user-defined functions in DataFusion.


Suggested change

-//! This example demonstrates advanced user-defined functions in DataFusion.
+//! These examples demonstrate advanced user-defined functions in DataFusion.

Jefffrey · 2025-10-30T07:37:11Z

datafusion-examples/examples/advanced_udf/main.rs

+#[tokio::main]
+async fn main() -> Result<()> {
+    let arg = std::env::args().nth(1).ok_or_else(|| {
+        eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf|udwf|async_udf]");


I wonder if it's better to construct the options from ExampleKind so if we add examples in future there's one less place to forget to update 🤔

It's a good point
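
One possible way to act on this suggestion, sketched with std only. The variant names come from the usage string in the diff; the `ALL` constant, the `name` method, and the `usage` function are assumptions made for illustration and are not part of the PR.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
enum ExampleKind {
    Udf,
    Udaf,
    Udwf,
    AsyncUdf,
}

impl ExampleKind {
    // Hypothetical: the single list that both parsing and usage text derive from.
    const ALL: [ExampleKind; 4] = [Self::Udf, Self::Udaf, Self::Udwf, Self::AsyncUdf];

    /// Subcommand name for this example (exhaustive match, so a new
    /// variant without a name fails to compile).
    fn name(self) -> &'static str {
        match self {
            Self::Udf => "udf",
            Self::Udaf => "udaf",
            Self::Udwf => "udwf",
            Self::AsyncUdf => "async_udf",
        }
    }
}

impl FromStr for ExampleKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::ALL
            .iter()
            .copied()
            .find(|kind| kind.name() == s)
            .ok_or_else(|| format!("unknown example: {s}"))
    }
}

/// Builds the usage line from `ExampleKind::ALL` instead of a hard-coded string.
fn usage() -> String {
    let names: Vec<&str> = ExampleKind::ALL.iter().map(|kind| kind.name()).collect();
    format!("Usage: cargo run --example advanced_udf -- [{}]", names.join("|"))
}

fn main() {
    println!("{}", usage());
}
```

The `ALL` array still has to be extended by hand when a variant is added, but the usage message and the parser can then no longer disagree with each other.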

Jefffrey · 2025-10-30T07:40:37Z

datafusion-examples/README.md

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
+- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)


We might need to uplift this doc to be a table instead, so it's easier at a glance to see which command to run for which example, e.g.

| Group | Example |
| --- | --- |
| advanced_udaf | udf |
| advanced_udaf | udwf |
| query_planning | analyzer_rule |

Or would this cause too much duplication of information and be prone to drift?

I think it makes a lot of sense to create a table
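
On the drift concern, one option would be to generate the README table from a single list of examples. The sketch below only illustrates that idea: the `EXAMPLES` registry and `readme_table` function are hypothetical, the rows are placeholders, and the command format is taken from the usage string quoted elsewhere in this review.

```rust
// Hypothetical registry of (group, example) pairs; the rows are placeholders.
const EXAMPLES: &[(&str, &str)] = &[
    ("advanced_udf", "udf"),
    ("advanced_udf", "udwf"),
    ("query_planning", "analyzer_rule"),
];

/// Renders a Markdown table with one row per example and the command to run it.
fn readme_table(rows: &[(&str, &str)]) -> String {
    let mut out = String::from("| Group | Example | Command |\n| --- | --- | --- |\n");
    for (group, example) in rows {
        out.push_str(&format!(
            "| {group} | {example} | `cargo run --example {group} -- {example}` |\n"
        ));
    }
    out
}

fn main() {
    print!("{}", readme_table(EXAMPLES));
}
```

A CI step could compare this output with the checked-in README and fail when the two differ, so the table would not need to be maintained by hand.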

Jefffrey · 2025-10-30T07:41:22Z

datafusion-examples/examples/advanced_udf/main.rs

+    AsyncUdf,
+}
+
+impl FromStr for ExampleKind {


I do wonder if there's a way to templatize/abstract this whole file so they don't drift in structure from the others 🤔

Good point - indeed, the example entry files follow a similar pattern. It could make sense to refactor this into a shared helper or macro in the future to keep them in sync. For now, I kept this consistent with the existing example structure.
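
A sketch of what such a shared macro might look like, assuming each entry file only needs an enum plus a `FromStr` impl. The macro name `example_kinds!` and the `NAMES` constant are invented for illustration; nothing like this exists in the PR.

```rust
// Hypothetical macro: generates the enum, the list of subcommand names,
// and the FromStr impl from one declaration.
macro_rules! example_kinds {
    ($enum_name:ident { $($variant:ident => $name:literal),+ $(,)? }) => {
        #[derive(Debug, Clone, Copy, PartialEq)]
        enum $enum_name {
            $($variant),+
        }

        impl $enum_name {
            /// All accepted subcommand names, in declaration order.
            const NAMES: &'static [&'static str] = &[$($name),+];
        }

        impl ::std::str::FromStr for $enum_name {
            type Err = String;

            fn from_str(s: &str) -> Result<Self, Self::Err> {
                match s {
                    $($name => Ok(Self::$variant),)+
                    other => Err(format!(
                        "unknown example '{other}', expected one of: {}",
                        Self::NAMES.join(", ")
                    )),
                }
            }
        }
    };
}

// Each entry file would then shrink to a single declaration like this one.
example_kinds!(ExampleKind {
    Udf => "udf",
    Udaf => "udaf",
    Udwf => "udwf",
    AsyncUdf => "async_udf",
});

fn main() {
    println!("{:?}", "udaf".parse::<ExampleKind>());
    println!("{}", ExampleKind::NAMES.join("|"));
}
```

Because the variant and its name are declared together, adding an example touches one line, and every group's entry file keeps the same structure by construction.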

alamb

Thank you @cj-zhukov -- this is an exciting idea, but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contiguous review time, which is quite hard to find).

Is there any way to break it up into smaller pieces?

For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?

Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples

alamb · 2025-10-31T21:04:10Z

datafusion-examples/README.md

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
+- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)


If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?

Sergey Zhukov added 3 commits October 26, 2025 16:19

Reduce the number of distinct binaries to reduce CI / compile time (apache#18142) acf6778
Fix issues causing GitHub checks to fail 4ffd508
Fix issues causing GitHub checks to fail 33454ea
Fix issues causing GitHub checks to fail 5912178
trigger GitHub Actions rerun (security advisory fix upstream) 0a9cbe2

cj-zhukov force-pushed the cj-zhukov/Reduce-the-number-of-distinct-binaries branch from bc2f0a6 to 0a9cbe2 on October 28, 2025 05:39

Sergey Zhukov added 2 commits October 28, 2025 13:52

Merge remote-tracking branch 'upstream/main' into cj-zhukov/Reduce-the-number-of-distinct-binaries 978ec0d
fix merging conflict in datafusion-examples/examples/parquet/encrypted.rs ebe213d

Jefffrey reviewed Oct 30, 2025

alamb reviewed Oct 31, 2025