Store TPCH results in separate table #1506
Conversation
client.register_plugin(TurnOnPandasCOW(), name="enable-cow")
with get_cluster_info(cluster), performance_report, benchmark_time:
    with benchmark_all(client):
        yield client
@hendrikmakait this is the place where we kick off the timing. In this PR I am replacing it with a benchmark_all that is similar to what we're using in the rest of the suite.
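For readers outside the repo, a minimal sketch of a timing context manager along these lines; the actual benchmark_all fixture composes several recorders (memory, task durations, etc.) and stores results rather than printing them:

import contextlib
import time


@contextlib.contextmanager
def benchmark_all(client):
    # Sketch only: time the enclosed block. The real fixture presumably
    # also records memory and task-level metrics for the given client.
    start = time.monotonic()
    try:
        yield
    finally:
        print(f"elapsed: {time.monotonic() - start:.2f}s")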
@@ -62,3 +62,59 @@ class TestRun(Base):
    # Artifacts
    performance_report_url = Column(String, nullable=True)  # Not yet collected
    cluster_dump_url = Column(String, nullable=True)


class TPCHRun(Base):
I found the information that is actually stored lacking for TPCH run comparisons, particularly:
- polars/spark/duckdb versions
- cluster spec information
- query
- scale
- local

and possibly more. Instead of squeezing this into the ordinary table, I introduced a new one (see the sketch below). This collides a little with our CI runs, the dashboard, the combine-db script, etc., but should be easy to fix. I haven't done so yet.
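A rough sketch of what such a dedicated table could look like as a SQLAlchemy model; the column names below are illustrative and not necessarily the exact schema of this PR:

from sqlalchemy import Boolean, Column, DateTime, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class TPCHRun(Base):
    # Illustrative sketch of the dedicated TPCH results table.
    __tablename__ = "tpch_run"

    id = Column(Integer, primary_key=True)
    query = Column(Integer)   # TPC-H query number
    scale = Column(Integer)   # TPC-H scale factor
    local = Column(Boolean)   # local run vs. cloud cluster

    # Library versions, for cross-engine comparisons
    dask_version = Column(String, nullable=True)
    duckdb_version = Column(String, nullable=True)
    pyspark_version = Column(String, nullable=True)
    polars_version = Column(String, nullable=True)

    # Cluster spec information
    n_workers = Column(Integer, nullable=True)
    worker_vm_type = Column(String, nullable=True)

    start = Column(DateTime)
    end = Column(DateTime)
    duration = Column(Float)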
From what I see, this has been taken care of.
This reverts commit 03e87e4.
There is no particular reason for having two migrations other than me not wanting to go back and invalidate everything I have stored locally.
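For context, each such migration is a short Alembic script; a hedged sketch of what creating the new table could look like (revision identifiers and columns are placeholders, not the actual migrations from this PR):

import sqlalchemy as sa
from alembic import op

# Placeholder revision identifiers.
revision = "abc123"
down_revision = "def456"


def upgrade():
    # Create the dedicated TPCH results table (columns abbreviated).
    op.create_table(
        "tpch_run",
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("query", sa.Integer),
        sa.Column("scale", sa.Integer),
        sa.Column("duration", sa.Float),
    )


def downgrade():
    op.drop_table("tpch_run")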
yield client


@pytest.fixture(scope="module")
def spark_setup(cluster, local):
This stuff just moved to test_pyspark since it isn't used anywhere else.
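As a rough illustration of what a module-scoped Spark fixture along these lines might do; how the remote session is really obtained in the suite is not shown, and cluster.get_spark() below is an assumption:

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="module")
def spark_setup(cluster, local):
    # Sketch: build a SparkSession either locally or against the cluster.
    if local:
        spark = SparkSession.builder.master("local[*]").getOrCreate()
    else:
        # Assumed helper; the actual API may differ.
        spark = cluster.get_spark()
    yield spark
    spark.stop()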
tests/tpch/generate_plot.py
I think this entire file is very subjective. I haven't cleaned it up, but I think it can be used later on if one wants to generate other plots.
@hendrikmakait in a follow-up it may make sense to sync this with the colors, etc. that we ended up using.
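As a hedged illustration of the kind of plot such a script could produce from the new table (table and column names follow the model sketch above; the actual generate_plot.py may use a different plotting library and database path):

import matplotlib.pyplot as plt
import pandas as pd
import sqlalchemy

# Hypothetical: read the TPCH results table and plot mean duration per
# query; "benchmark.db" and the column names are assumptions.
engine = sqlalchemy.create_engine("sqlite:///benchmark.db")
df = pd.read_sql_table("tpch_run", engine)
df.groupby("query")["duration"].mean().plot.bar()
plt.xlabel("TPC-H query")
plt.ylabel("mean duration [s]")
plt.tight_layout()
plt.savefig("tpch_durations.png")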
Thanks, @fjetter
This cleans up a bit of our fixture infrastructure. A lot of this is likely subjective.
Most importantly, it additionally stores the TPCH results in a separate table that has a different schema and remembers more information about the test run. Most of this information is unnecessary for ordinary test runs.