
Add csv loading benchmarks. #13544

Merged: 3 commits into apache:main on Dec 5, 2024

Conversation

dhegberg (Contributor)

Which issue does this PR close?

Related to #12904

Rationale for this change

Requested in comments for https://github.com/apache/datafusion/pull/13228

Direct testing of loading CSV files was identified as a gap in the benchmarking suite.

What changes are included in this PR?

Basic benchmarks related to loading csv files.

Are these changes tested?

Tested via ./bench.sh run csv

Logged output:

Running csv load benchmarks.
Generated test dataset with 10240283 rows
Executing 'CSV Load Speed Test.'
Iteration 0 finished in 7.079167 ms.
Iteration 1 finished in 3.3643750000000003 ms.
Iteration 2 finished in 3.2645 ms.
Iteration 3 finished in 3.311208 ms.
Iteration 4 finished in 3.319 ms.
Done

results file:
csv.json

A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to remove any caching during loading?

Are there any user-facing changes?

No

github-actions bot added the core (Core DataFusion crate) label on Nov 24, 2024
berkaysynnada (Contributor)

Hi @dhegberg, thank you for your contribution—it’s well-written and formatted nicely. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of CsvExec. Do you think it would be better to include these benchmarks there?

dhegberg (Contributor, Author)

I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.

For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

berkaysynnada (Contributor)

> I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations.
>
> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

I don't want to misguide you. Perhaps @alamb can direct you better for that.

alamb (Contributor) left a comment:

Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"

> A curious result: the first iteration is consistently 6-7 ms vs. ~3 ms on subsequent iterations. Is a new SessionContext not sufficient to remove any caching during loading?

My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
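For illustration, one way to take that first-run I/O out of the measurement is to read the input file once before the timed iterations so it is already resident in the page cache. A minimal sketch (the helper and its use are assumptions, not the PR's actual code):

use std::fs;
use std::hint::black_box;

// Read the whole input once so the OS page cache holds it; subsequent
// benchmark iterations then measure parsing, not disk I/O.
fn warm_page_cache(path: &str) {
    let bytes = fs::read(path).expect("failed to read benchmark input");
    black_box(bytes.len()); // keep the read from being optimized away
}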

> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?

> I don't want to misguide you. Perhaps @alamb can direct you better for that.

In terms of what is in benchmarks, I think it was meant to be "end to end" benchmarks in the style of tpch or clickbench: a known dataset, some queries, and then we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python).

I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches

perhaps csv.rs or datasource.rs

impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
Review comment (Contributor):
One thing I would like to request is that we split the data generation from the query.

Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
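In a criterion-style benchmark (the style this PR later adopted in core/benches), one way to keep generation out of the measured work is to build the file once, outside the benchmark closure. A sketch under that assumption, where generate_test_csv and load_csv are hypothetical helpers:

use criterion::{criterion_group, criterion_main, Criterion};

fn csv_load_benchmark(c: &mut Criterion) {
    // Generate the CSV input once, outside the measured closure, so that
    // regenerating the data does not dominate every benchmark run.
    let path = generate_test_csv(); // hypothetical data-generation helper
    c.bench_function("load csv", |b| {
        // Only the load itself is timed.
        b.iter(|| load_csv(&path)); // hypothetical load helper
    });
}

criterion_group!(benches, csv_load_benchmark);
criterion_main!(benches);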

alamb marked this pull request as draft on November 27, 2024 19:10
alamb (Contributor) commented on Nov 27, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

dhegberg (Contributor, Author) commented on Dec 3, 2024

Updated benchmark output:

☁  core [csv_benchmark] cargo bench --jobs 2 --bench csv_load -- --verbose
    Finished `bench` profile [optimized] target(s) in 0.19s
     Running benches/csv_load.rs (/Users/dhegberg/workplace/datafusion/target/release/deps/csv_load-bd58bc3aea1aaed0)
Gnuplot not found, using plotters backend
Generated test dataset with 69642 rows
Benchmarking load csv testing/default csv read options
Benchmarking load csv testing/default csv read options: Warming up for 3.0000 s
Benchmarking load csv testing/default csv read options: Collecting 100 samples in estimated 20.197 s (1100 iterations)
Benchmarking load csv testing/default csv read options: Analyzing
load csv testing/default csv read options
                        time:   [20.094 ms 20.263 ms 20.426 ms]
                        change: [-0.5031% +0.9097% +2.1820%] (p = 0.20 > 0.05)
                        No change in performance detected.
mean   [20.094 ms 20.426 ms] std. dev.      [729.56 µs 941.81 µs]
median [20.314 ms 20.622 ms] med. abs. dev. [440.35 µs 1.0435 ms]

dhegberg marked this pull request as ready for review on December 3, 2024 15:56
dhegberg (Contributor, Author) commented on Dec 3, 2024

@berkaysynnada @alamb

I've moved the benchmarks and they're ready for review.

berkaysynnada (Contributor) left a comment:

LGTM, thank you @dhegberg

use tokio::runtime::Runtime;

fn load_csv(ctx: Arc<Mutex<SessionContext>>, path: &str, options: CsvReadOptions) {
    let rt = Runtime::new().unwrap();
Review comment (Contributor):
Is it better to pass this in as a parameter and create it outside of the measurement?

dhegberg (Contributor, Author) replied:
It seems that this is consistent with the other benchmarks in this package: the Runtime is initialized within the bench function iterator.

I can change it here, though, if you think it should be initialized beforehand.

alamb (Contributor) replied:
I think keeping consistent with the other benchmarks is preferable
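For reference, the alternative discussed above (creating the Runtime once outside the measured closure and passing it in) might look roughly like the sketch below; the function body is an illustration only, not the PR's actual code:

use std::sync::{Arc, Mutex};
use datafusion::prelude::{CsvReadOptions, SessionContext};
use tokio::runtime::Runtime;

// The Runtime is constructed by the caller and reused across iterations,
// so its creation cost stays outside the measured work.
fn load_csv_with_rt(
    rt: &Runtime,
    ctx: Arc<Mutex<SessionContext>>,
    path: &str,
    options: CsvReadOptions,
) {
    rt.block_on(async {
        let ctx = ctx.lock().unwrap();
        // Hypothetical body: the real benchmark may register or collect results differently.
        ctx.read_csv(path, options).await.unwrap();
    });
}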

alamb (Contributor) left a comment:

This is pretty cool -- thank you @dhegberg and @berkaysynnada for the review

I ran it locally and got a flamegraph, and it is pretty neat:

[flamegraph image attached to the PR]
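If you want to reproduce a flamegraph like this locally, one option (an assumption about tooling, not necessarily how this one was produced) is cargo-flamegraph run against the criterion bench:

cargo flamegraph --bench csv_load -- --bench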

alamb merged commit a4dd1e2 into apache:main on Dec 5, 2024
27 checks passed
zhuliquan pushed a commit to zhuliquan/datafusion that referenced this pull request Dec 6, 2024
* Add csv loading benchmarks.

* Fix fmt.

* Fix clippy.
dhegberg deleted the csv_benchmark branch on December 14, 2024 22:50