Skip to content

Conversation

@cj-zhukov
Copy link
Contributor

…18142)

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Jefffrey
Copy link
Contributor

Would you be able to provide a high level overview of the changes made here? The PR seems pretty big and it's hard to review without some description of the changes made

@cj-zhukov
Copy link
Contributor Author

High-Level Overview

This PR consolidates multiple standalone example binaries into a single example binary that uses subcommands to run individual examples. The main goal is to reduce the number of distinct binaries and make the examples easier to organize and maintain.

Key Changes

  • Grouped similar examples under common folders (e.g. sql_ops, datafusion, etc.).

  • Each example can now be launched via a subcommand.

Example Usage

To run a specific example, use:

cargo run --example datafusion -- datafusion

@cj-zhukov
Copy link
Contributor Author

I’ve noticed that the CI cargo audit check is failing due to the following advisories:

  • RUSTSEC-2025-0111 (tokio-tar)
  • RUSTSEC-2024-0014 (generational-arena)
  • RUSTSEC-2024-0436 (paste)

My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project.

Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)?
I just wanted to make sure I don’t make unnecessary config changes in this PR.

@cj-zhukov cj-zhukov force-pushed the cj-zhukov/Reduce-the-number-of-distinct-binaries branch from bc2f0a6 to 0a9cbe2 Compare October 28, 2025 05:39
@cj-zhukov
Copy link
Contributor Author

I’ve noticed that the CI cargo audit check is failing due to the following advisories:

* RUSTSEC-2025-0111 (tokio-tar)

* RUSTSEC-2024-0014 (generational-arena)

* RUSTSEC-2024-0436 (paste)

My PR doesn’t introduce or modify any of these dependencies - they appear to be transitive dependencies already present in the project.

Could someone from the PMC confirm whether these advisories should be handled separately (e.g., ignored in CI or tracked in a dedicated issue)? I just wanted to make sure I don’t make unnecessary config changes in this PR.

It was fixed: #18288

@cj-zhukov
Copy link
Contributor Author

The merge conflict in encrypted.rs has been resolved, and all checks have passed. Ready for review.

Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up.

I think rust_example.sh needs to be fixed to run the examples in this new format.

At least from the compilation artifacts it seems a nice win:

# off main
datafusion (main)$ du -s -h ~/.cargo_target_cache/ci/examples/
5.0G    /Users/jeffrey/.cargo_target_cache/ci/examples/
# off this PR
datafusion (pr_18289)$ du -s -h ~/.cargo_target_cache/ci/examples/
1.4G    /Users/jeffrey/.cargo_target_cache/ci/examples/


//! # Advanced UDF/UDAF/UDWF/Asynchronous UDF Examples
//!
//! This example demonstrates advanced user-defined functions in DataFusion.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! This example demonstrates advanced user-defined functions in DataFusion.
//! These examples demonstrates advanced user-defined functions in DataFusion.

#[tokio::main]
async fn main() -> Result<()> {
let arg = std::env::args().nth(1).ok_or_else(|| {
eprintln!("Usage: cargo run --example advanced_udf -- [udf|udaf|udwf|async_udf]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's better to construct the options from ExampleKind so if we add examples in future there's one less place to forget to update 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to uplift this doc to be a table instead, so it's easier at a glance to see which command to run for which example, e.g.

Group Example
advanced_udaf udf
advanced_udaf udwf
query_planning analyzer_rule

Or would this cause too much duplication of information and be prone to drift?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes a lot of sense to create a table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?

AsyncUdf,
}

impl FromStr for ExampleKind {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there's a way to templatize/abstract this whole file so they don't drift in structure from the others 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - indeed, the example entry files follow a similar pattern. It could make sense to refactor this into a shared helper or macro in the future to keep them in sync. For now, I kept this consistent with the existing example structure.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @cj-zhukov -- this is an exciting idea but I think it is going to be very hard to review a PR of this size (it will require a substantial amount of contingiuous review time, which is quite hard to find)

Is there any way to break it up into smaller pieces?

For example, perhaps you could consolidate all the flight examples into a single example binary as a single PR?

Then we can make sure we are agreed on the pattern and then we can apply it to the remaining examples

- [`sql_dialect.rs`](examples/sql_dialect.rs): Example of implementing a custom SQL dialect on top of `DFParser`
- [`sql_query.rs`](examples/memtable.rs): Query data using SQL (in memory `RecordBatches`, local Parquet files)
- [`date_time_function.rs`](examples/date_time_function.rs): Examples of date-time related functions and queries.
- [`advanced_udaf.rs`](examples/advanced_udaf/udaf.rs): Define and invoke a more complicated User Defined Aggregate Function (UDAF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to consolidate the examples anyways, I wonder if it makes sense to consolidate them all into a single (or even fewer) binaries?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reduce the number of distinct binaries to reduce CI / compile time

3 participants