
Add csv loading benchmarks. #13544

Merged: 3 commits into apache:main on Dec 5, 2024

Conversation

dhegberg (Contributor)

Which issue does this PR close?

Related to #12904

Rationale for this change

Requested in comments for https://github.com/apache/datafusion/pull/13228

Direct testing of loading CSV files was identified as a gap in the benchmarking suite.

What changes are included in this PR?

Basic benchmarks related to loading csv files.

Are these changes tested?

Tested via ./bench.sh run csv

Logged output:

Running csv load benchmarks.
Generated test dataset with 10240283 rows
Executing 'CSV Load Speed Test.'
Iteration 0 finished in 7.079167 ms.
Iteration 1 finished in 3.3643750000000003 ms.
Iteration 2 finished in 3.2645 ms.
Iteration 3 finished in 3.311208 ms.
Iteration 4 finished in 3.319 ms.
Done

results file:
csv.json

A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to remove any caching during loading?

Are there any user-facing changes?

No

github-actions bot added the core (Core DataFusion crate) label on Nov 24, 2024
berkaysynnada (Contributor)

Hi @dhegberg, thank you for your contribution—it’s well-written and formatted nicely. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of CsvExec. Do you think it would be better to include these benchmarks there?

dhegberg (Contributor, Author)

I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.

For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

berkaysynnada (Contributor)

> I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.
>
> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

I don't want to misguide you. Perhaps @alamb can direct you better for that.

alamb (Contributor) left a comment:

Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"

> A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to remove any caching during loading?

My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
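For illustration, one way to take that first-run I/O out of the measurement is to read the input file once before the timed iterations so it is already resident in the page cache. A minimal sketch (the helper and its use are assumptions, not the PR's actual code):

use std::fs;
use std::hint::black_box;

// Read the whole input once so the OS page cache holds it; subsequent
// benchmark iterations then measure parsing, not disk I/O.
fn warm_page_cache(path: &str) {
    let bytes = fs::read(path).expect("failed to read benchmark input");
    black_box(bytes.len()); // keep the read from being optimized away
}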

> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

> I don't want to misguide you. Perhaps @alamb can direct you better for that.

In terms of what is in benchmarks, I think it was meant to be "end to end" benchmarks in the style of tpch or clickbench: a known dataset, some queries, and then we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python).

I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches

perhaps csv.rs or datasource.rs

impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
Review comment (Contributor):
One thing I would like to request is that we split the data generation from the query.

Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
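In a criterion-style benchmark (the style this PR later adopted in core/benches), one way to keep generation out of the measured work is to build the file once, outside the benchmark closure. A sketch under that assumption, where generate_test_csv and load_csv are hypothetical helpers:

use criterion::{criterion_group, criterion_main, Criterion};

fn csv_load_benchmark(c: &mut Criterion) {
    // Generate the CSV input once, outside the measured closure, so that
    // regenerating the data does not dominate every benchmark run.
    let path = generate_test_csv(); // hypothetical data-generation helper
    c.bench_function("load csv", |b| {
        // Only the load itself is timed.
        b.iter(|| load_csv(&path)); // hypothetical load helper
    });
}

criterion_group!(benches, csv_load_benchmark);
criterion_main!(benches);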

alamb marked this pull request as draft on November 27, 2024 19:10
alamb (Contributor) commented on Nov 27, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

dhegberg (Contributor, Author) commented on Dec 3, 2024

Updated benchmark output:

☁  core [csv_benchmark] cargo bench --jobs 2 --bench csv_load -- --verbose
    Finished `bench` profile [optimized] target(s) in 0.19s
     Running benches/csv_load.rs (/Users/dhegberg/workplace/datafusion/target/release/deps/csv_load-bd58bc3aea1aaed0)
Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 20.197 s (1100 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [20.094 ms 20.263 ms 20.426 ms]
                        change: [-0.5031% +0.9097% +2.1820%] (p = 0.20 > 0.05)
                        No change in performance detected.
mean   [20.094 ms 20.426 ms] std. dev.      [729.56 µs 941.81 µs]
median [20.314 ms 20.622 ms] med. abs. dev. [440.35 µs 1.0435 ms]

dhegberg marked this pull request as ready for review on December 3, 2024 15:56
dhegberg (Contributor, Author) commented on Dec 3, 2024

@berkaysynnada @alamb

I've moved the benchmarks and they're ready for review.

berkaysynnada (Contributor) left a comment:

LGTM, thank you @dhegberg

use tokio::runtime::Runtime;

fn load_csv(ctx: Arc<Mutex<SessionContext>>, path: &str, options: CsvReadOptions) {
    let rt = Runtime::new().unwrap();
Review comment (Contributor):
Is it better to pass this in as a parameter and create it outside of the measurement?

dhegberg (Contributor, Author) replied:
It seems that this is consistent with the other benchmarks in this package: the Runtime is initialized within the bench function iterator.

I can change it here, though, if you think it should be initialized beforehand.

alamb (Contributor) replied:
I think keeping consistent with the other benchmarks is preferable
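For reference, the alternative discussed above (creating the Runtime once outside the measured closure and passing it in) might look roughly like the sketch below; the function body is an illustration only, not the PR's actual code:

use std::sync::{Arc, Mutex};
use datafusion::prelude::{CsvReadOptions, SessionContext};
use tokio::runtime::Runtime;

// The Runtime is constructed by the caller and reused across iterations,
// so its creation cost stays outside the measured work.
fn load_csv_with_rt(
    rt: &Runtime,
    ctx: Arc<Mutex<SessionContext>>,
    path: &str,
    options: CsvReadOptions,
) {
    rt.block_on(async {
        let ctx = ctx.lock().unwrap();
        // Hypothetical body: the real benchmark may register or collect results differently.
        ctx.read_csv(path, options).await.unwrap();
    });
}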

alamb (Contributor) left a comment:

This is pretty cool -- thank you @dhegberg and @berkaysynnada for the review

I ran it locally and got a flamegraph, and it is pretty neat:

[flamegraph image attached to the PR]
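If you want to reproduce a flamegraph like this locally, one option (an assumption about tooling, not necessarily how this one was produced) is cargo-flamegraph run against the criterion bench:

cargo flamegraph --bench csv_load -- --bench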

alamb merged commit a4dd1e2 into apache:main on Dec 5, 2024
27 checks passed
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 6, 2024
* Add csv loading benchmarks.

* Fix fmt.

* Fix clippy.
dhegberg deleted the csv_benchmark branch on December 14, 2024 22:50