Add csv loading benchmarks. #13544
Conversation
Hi @dhegberg, thank you for your contribution; it's well written and nicely formatted. We have a dedicated path for operator-specific benchmarks: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/benches. It seems to me that the measured functionality falls under the scope of that directory.
I don't have a strong opinion on the location of the benchmarks, so I'm happy to follow recommendations. For my future reference, how do you differentiate this functionality vs. the parquet-related testing in the top-level benchmarks?
I don't want to misguide you. Perhaps @alamb can direct you better for that.
Thank you very much @dhegberg -- I agree with @berkaysynnada that this PR is nicely coded and well commented 🙏 and that it might be better to add it as a more focused "unit test"
> Curious result is the first iteration is consistently 6-7 ms vs ~3ms on future iterations. Is a new SessionContext not sufficient to remove any cache in loading?
My suspicion is that the first run pulls the data from storage (e.g. SSD) into the kernel page cache, and then subsequent runs are all in memory (no I/O).
> For my future reference, how do you differentiate this functionality vs. the parquet related testing in the top level benchmarks?
>
> I don't want to misguide you. Perhaps @alamb can direct you better for that.
In terms of what is in `benchmarks`: I think it was meant to hold "end to end" benchmarks in the style of tpch or clickbench, that is, a known dataset plus some queries, so we can use the benchmarking framework to drive those queries faster and faster (as well as run the queries independently using datafusion-cli or datafusion-python).
I would recommend moving this benchmark to https://github.com/apache/datafusion/tree/main/datafusion/core/benches, perhaps as csv.rs or datasource.rs.
benchmarks/src/csv/load.rs (Outdated)
```rust
impl RunOpt {
    pub async fn run(self) -> Result<()> {
        let test_file = self.data.build()?;
        let mut rundata = BenchmarkRun::new();
```
One thing I would like to request is that we split the data generation from the query.
Given this setup, rerunning the benchmarks will likely be dominated by the time it takes to regenerate the input, which will be quite slow.
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.
Updated benchmarks output:
Force-pushed from 83dc65c to 2c28de3.
Force-pushed from 2c28de3 to cf8f726.
I've moved the benchmarks and they're ready for review.
LGTM, thank you @dhegberg
```rust
use tokio::runtime::Runtime;

fn load_csv(ctx: Arc<Mutex<SessionContext>>, path: &str, options: CsvReadOptions) {
    let rt = Runtime::new().unwrap();
```
Would it be better to pass this in as a parameter and create it outside of the measurement?
It seems this is consistent with the other benchmarks in this package: the Runtime is initialized within the bench function iterator. I can change it here, though, if you think it should be initialized beforehand.
I think keeping consistent with the other benchmarks is preferable
This is pretty cool -- thank you @dhegberg and @berkaysynnada for the review
I ran it locally and got a flamegraph, and it is pretty neat:
(download flamegraph)
* Add csv loading benchmarks.
* Fix fmt.
* Fix clippy.
Which issue does this PR close?
Related to #12904
Rationale for this change
Requested in the comments of https://github.com/apache/datafusion/pull/13228.
Direct benchmarking of loading CSV files was identified as a gap in the benchmarking suite.
What changes are included in this PR?
Basic benchmarks for loading CSV files.
Are these changes tested?
Tested via `./bench.sh run csv`.
Logged output:
results file:
csv.json
A curious result is that the first iteration is consistently 6-7 ms vs ~3 ms on subsequent iterations. Is a new `SessionContext` not sufficient to remove any caching during loading?
Are there any user-facing changes?
No