Improve left/inner join performance by rerange right table by keys #60341

Merged
merged 11 commits into from
Sep 11, 2024

KevinyhZou
Contributor

@KevinyhZou KevinyhZou commented Feb 23, 2024

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improve join performance by reranging the right table by key when the table keys are dense in a left or inner hash join.
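To illustrate what "reranging by key" means here, the following is a minimal standalone sketch, not the PR's implementation. The `Row` struct and the `rerangeByKey` function are hypothetical names; the idea is only that rows sharing a key become contiguous, so a join can emit each key's matches as one range instead of one row at a time.

```cpp
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical row: a join key plus one payload value.
struct Row
{
    int key;
    int value;
};

// Rebuild `rows` so that all rows sharing a key are contiguous, and return,
// for each key, the [begin, end) range it occupies in the rebuilt vector.
std::unordered_map<int, std::pair<size_t, size_t>> rerangeByKey(std::vector<Row> & rows)
{
    // Group row indices by key, remembering the order in which keys first appear.
    std::unordered_map<int, std::vector<size_t>> groups;
    std::vector<int> key_order;
    for (size_t i = 0; i < rows.size(); ++i)
    {
        auto [it, inserted] = groups.try_emplace(rows[i].key);
        if (inserted)
            key_order.push_back(rows[i].key);
        it->second.push_back(i);
    }

    // Copy the rows out key by key, recording each key's range.
    std::vector<Row> reranged;
    reranged.reserve(rows.size());
    std::unordered_map<int, std::pair<size_t, size_t>> ranges;
    for (int key : key_order)
    {
        size_t begin = reranged.size();
        for (size_t idx : groups[key])
            reranged.push_back(rows[idx]);
        ranges[key] = {begin, reranged.size()};
    }

    rows = std::move(reranged);
    return ranges;
}
```

The extra pass over the right table is the cost the thresholds discussed later in this thread are meant to bound.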

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@KevinyhZou KevinyhZou marked this pull request as draft February 23, 2024 03:21
@nikitamikhaylov
Member

Note that we may have sparse columns, and working with them through the IColumn interface is critical.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Feb 23, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-performance Pull request with some performance improvements label Feb 23, 2024
@robot-clickhouse-ci-2
Contributor

robot-clickhouse-ci-2 commented Feb 23, 2024

This is an automated comment for commit 597181c with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
|---|---|---|
| Upgrade check | Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts | ❌ failure |

Successful checks

| Check name | Description | Status |
|---|---|---|
| AST fuzzer | Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help | ✅ success |
| Builds | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success |
| ClickBench | Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table | ✅ success |
| Compatibility check | Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help | ✅ success |
| Docker keeper image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docker server image | The check to build and optionally push the mentioned image to docker hub | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Install packages | Checks that the built packages are installable in a clear environment | ✅ success |
| Integration tests | The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests | ✅ success |
| Performance Comparison | Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Stress test | Runs stateless functional tests concurrently from several clients to detect concurrency-related errors | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

@KevinyhZou KevinyhZou marked this pull request as ready for review February 26, 2024 12:42
@jkartseva jkartseva self-assigned this Feb 28, 2024
Contributor

@jkartseva jkartseva left a comment

Some initial comments.

{
auto * batch = pool.alloc<Batch>();
*batch = Batch(this);
batch->insert(std::move(row_ref), pool);
return batch;
}

row_nums[size] = row_ref.row_num;
Contributor

Why is row_nums required? Isn't row_nums[i] essentially the same as row_refs[i].row_num?

@@ -46,6 +46,7 @@ struct RowRefList : RowRef
SizeT size = 0; /// It's smaller than size_t but keeps align in Arena.
Batch * next;
RowRef row_refs[MAX_SIZE];
UInt64 row_nums[MAX_SIZE];
Contributor

The type should be ColumnIndex

Comment on lines 110 to 132
void ColumnString::insertIndicesFrom(const IColumn & src, const IColumn::ColumnIndex * selector, const size_t & size)
{
for (size_t i = 0; i < size; ++i)
insertFrom(src, *(selector + i));
}
Contributor

This overload is not different from the base class method.


void nextBatch()
{
batch = batch->next;
Contributor

@jkartseva jkartseva Mar 2, 2024

This invalidates the existing position, using operator ++ and nextBatch() together is not viable.

@@ -55,14 +56,14 @@ struct RowRefList : RowRef

Batch * insert(RowRef && row_ref, Arena & pool)
{
- if (full())
+ if (full() || (size > 0 && row_ref.block != row_refs[0].block))
Contributor

The new condition should be in the form a function with a human-readable name.
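One way the suggestion could look, as a minimal standalone sketch: the types below are simplified stand-ins for those in the diff, and the names `holdsDifferentBlock`, `needsNewBatch`, and the `MAX_SIZE` value are hypothetical, not taken from the PR.

```cpp
#include <cstddef>

// Simplified stand-ins for the types in the diff above.
struct Block {};

struct RowRef
{
    const Block * block = nullptr;
    size_t row_num = 0;
};

struct Batch
{
    static constexpr size_t MAX_SIZE = 7; // assumed capacity, for illustration only
    size_t size = 0;
    RowRef row_refs[MAX_SIZE];

    bool full() const { return size == MAX_SIZE; }

    // True when the batch already holds rows and they come from another block.
    bool holdsDifferentBlock(const RowRef & row_ref) const
    {
        return size > 0 && row_ref.block != row_refs[0].block;
    }

    // The condition from the diff, expressed through named predicates.
    bool needsNewBatch(const RowRef & row_ref) const
    {
        return full() || holdsDifferentBlock(row_ref);
    }
};
```

With this shape, the call site reads `if (needsNewBatch(row_ref))`, which states the intent without requiring the reader to decode the comparison.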

@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from 2e2e1d1 to 2cf4dbf Compare March 2, 2024 07:01
@KevinyhZou
Contributor Author

KevinyhZou commented Mar 2, 2024

We have optimized inner join to use batch insert, based on PR #58278, and in our Gluten case it performs better. Please review it, thanks. @jkartseva

@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from b16f0da to 4a40b4b Compare March 3, 2024 16:34
@KevinyhZou KevinyhZou marked this pull request as draft March 4, 2024 01:50
@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from 4a40b4b to 202b571 Compare March 12, 2024 10:17
@KevinyhZou KevinyhZou marked this pull request as ready for review March 14, 2024 01:59
@KevinyhZou
Contributor Author

cc @jkartseva

@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from 8b38c9c to f266807 Compare March 15, 2024 02:20
@jkartseva
Contributor

Is the most recent iteration ready for review @KevinyhZou ?

@KevinyhZou
Contributor Author

KevinyhZou commented Mar 15, 2024

> Is the most recent iteration ready for review @KevinyhZou ?

yes @jkartseva

@vdimir
Member

vdimir commented Mar 15, 2024

I assume some CI failures can be related, for example

Stateful tests (tsan) — Tests are not finished, fail: 1, passed: 7, skipped: 1 Details

Contains thread sanitizer report around HashJoin::joinBlock, see ThreadSanitizer: data race in stderr.log file:
https://s3.amazonaws.com/clickhouse-test-reports/60341/f266807e0cc64ad8ddcff741594bcc85508bd6ab/stateful_tests__tsan_/stderr.log

Could you please check it?

@KevinyhZou
Contributor Author

OK

@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch 3 times, most recently from a85bf39 to 9976aeb Compare March 27, 2024 11:21
@KevinyhZou
Contributor Author

cc @vdimir

@KevinyhZou KevinyhZou changed the title Improve inner join performance by decrease insertFrom call Improve left/inner join performance by decrease insertFrom call Apr 2, 2024
@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from e475372 to 4200820 Compare April 26, 2024 10:53

woolenwolfbot bot commented Jun 25, 2024

Dear @jkartseva, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@KevinyhZou KevinyhZou force-pushed the improve_join_insert_from branch from c483bb0 to add486b Compare August 20, 2024 09:33
@KevinyhZou
Contributor Author

Any comments on this PR? @jkartseva

@jkartseva
Contributor

I'll take a look by the end of the week @KevinyhZou

Contributor

@jkartseva jkartseva left a comment

I think we should make this feature experimental. I can help with rolling it out to the staging tier in our cloud. If there are no regressions, we may deprecate the experimental flag.

{
for (size_t i = 0; i < block.columns(); ++i)
{
auto & col = *(block.getByPosition(i).column->assumeMutable());
Contributor

assumeMutableRef()

Comment on lines 925 to 926
M(Int32, join_to_sort_perkey_rows_threshold, 40, "The lower limit of per-key average rows in the right table to determine whether to sort it in hash join.", 0) \
M(Int32, join_to_sort_table_rows_threshold, 10000, "The upper limit of rows in the right table to determine whether to sort it in hash join.", 0) \
Contributor

How were the 40 and 10000 thresholds selected?

Contributor

Let's make this feature experimental (e.g., allow_experimental_inner_join_right_table_sorting) and provide a functional test with this setting SET to 1.

Member

The meaning of the thresholds is unclear without reading the code. We could consider one of the following options: updating the description to clarify how changing the setting affects user-experience (for example, using a special join method that improves performance for wide tables but increases memory consumption) or, even better, removing the thresholds and choosing the best value automatically.

Contributor Author

@KevinyhZou KevinyhZou Sep 4, 2024

I tested locally and found that if the right table has many rows but very few of them match, sorting degrades performance. For this scenario, my testing showed that 10000 is a reasonable default for join_to_sort_table_rows_threshold: the right table is small enough that sorting does not cause significant degradation. Conversely, if there are many matching rows to output, the threshold can be raised so that a larger right table is sorted, which can still give a significant performance improvement.

The other threshold defaults to 40. In my tests, when the table keys are not dense enough, sorting can also degrade performance; with the threshold set to 40, I saw no slowdown. @jkartseva
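The way the two thresholds combine can be sketched as follows. This is an illustration under assumptions, not the PR's code: the function name `shouldRerangeRightTable` is hypothetical, and whether the real comparisons are strict or inclusive is not stated in this thread, so the `>=` and `>` below are guesses.

```cpp
#include <cstdint>

// Decide whether reranging the right table is likely to pay off.
// total_rows / total_keys approximates the per-key average row count.
bool shouldRerangeRightTable(
    uint64_t total_rows,
    uint64_t total_keys,
    uint64_t min_perkey_rows, // lower limit on density (default 40 in the thread)
    uint64_t max_table_rows)  // upper limit on table size (default 10000 in the thread)
{
    if (total_rows == 0 || total_keys == 0)
        return false;

    // Too large a table: the sort itself would cost more than it saves.
    if (total_rows > max_table_rows)
        return false;

    // Keys are dense enough only if each key maps to many rows on average.
    return total_rows / total_keys >= min_perkey_rows;
}
```

For example, 8000 rows over 100 keys (80 rows per key) qualifies with the defaults, while 8000 rows over 1000 keys (8 rows per key) does not.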

Contributor Author

I have updated the descriptions of the threshold settings. Could you check whether they are OK? @vdimir

@@ -115,6 +115,7 @@ class AddedColumns
}
join_data_avg_perkey_rows = join.getJoinedData()->avgPerKeyRows();
output_by_row_list_threshold = join.getTableJoin().outputByRowListPerkeyRowsThreshold();
join_data_sorted = join.getJoinedData()->sorted;
Contributor

Let's move this to the initialization list.


void HashJoin::tryRerangeRightTableData()
{
if ((kind != JoinKind::Inner && kind != JoinKind::Left) || strictness != JoinStrictness::All || table_join->getMixedJoinExpression())
Contributor

!isInnerOrLeft(kind)
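A standalone sketch of how the guard reads with the suggested helper. The enums here are simplified stand-ins, `isInnerOrLeft` is defined locally so the example compiles on its own, and `canRerange` is a hypothetical name for the inverse of the early-return condition in the diff.

```cpp
// Simplified stand-ins for the join enums; defined here only for illustration.
enum class JoinKind { Inner, Left, Right, Full, Cross };
enum class JoinStrictness { Any, All, Semi, Anti };

// Local definition so the sketch is self-contained.
inline bool isInnerOrLeft(JoinKind kind)
{
    return kind == JoinKind::Inner || kind == JoinKind::Left;
}

// Reranging applies only to ALL-strictness inner or left joins
// that have no mixed join expression.
bool canRerange(JoinKind kind, JoinStrictness strictness, bool has_mixed_join_expression)
{
    return isInnerOrLeft(kind) && strictness == JoinStrictness::All && !has_mixed_join_expression;
}
```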

void HashJoin::tryRerangeRightTableDataImpl(Map & map [[maybe_unused]])
{
constexpr JoinFeatures<KIND, STRICTNESS, Map> join_features;
if constexpr (join_features.is_all_join && (join_features.left || join_features.inner))
Contributor

@jkartseva jkartseva Sep 1, 2024

The external function tryRerangeRightTableData already checks these conditions.
Let's throw a LOGICAL_ERROR if they are not satisfied here.
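The pattern being requested can be sketched in isolation as below. This is an assumption-laden illustration: `std::logic_error` stands in for the project's LOGICAL_ERROR exception, and `rerangeImpl` with its two boolean template parameters is a hypothetical reduction of the templated function in the diff.

```cpp
#include <stdexcept>

// The caller is expected to have validated the join kind and strictness already,
// so reaching the else-branch means an invariant was broken.
template <bool is_all_join, bool is_inner_or_left>
int rerangeImpl()
{
    if constexpr (is_all_join && is_inner_or_left)
    {
        // The real work would happen here; return a marker value for the sketch.
        return 1;
    }
    else
    {
        // Fail loudly instead of silently doing nothing.
        throw std::logic_error("Only left or inner join with ALL strictness can be reranged");
    }
}
```

Throwing here turns a silently skipped optimization into a visible error if the outer check and the inner check ever drift apart.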

auto it = rows_ref.begin();
if (it.ok())
{
if (blocks.empty() || blocks.back().rows() > DEFAULT_BLOCK_SIZE)
Contributor

Shouldn't the condition be blocks.back().rows() >= DEFAULT_BLOCK_SIZE?
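The difference between the two comparisons can be seen in a small simulation. This is a simplified sketch: it appends one row at a time, whereas the code under review appends rows per key, and `fillBlocks` and `Compare` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

enum class Compare { Greater, GreaterOrEqual };

// Append rows one at a time, opening a new block whenever the last block is
// considered full under the chosen comparison. Returns the resulting block sizes.
std::vector<size_t> fillBlocks(size_t total_rows, size_t block_size, Compare cmp)
{
    std::vector<size_t> blocks;
    for (size_t i = 0; i < total_rows; ++i)
    {
        bool back_is_full = !blocks.empty()
            && (cmp == Compare::Greater ? blocks.back() > block_size
                                        : blocks.back() >= block_size);
        if (blocks.empty() || back_is_full)
            blocks.push_back(0);
        ++blocks.back();
    }
    return blocks;
}
```

With 10 rows and a block size of 4, the `>` variant yields blocks of 5 and 5 rows (each one over the limit), while `>=` yields 4, 4, and 2.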

kind,
strictness,
data->maps.front(),
false,
Contributor

nit: /*prefer_use_maps_all*/ false

src/Interpreters/HashJoin/HashJoin.cpp (outdated)
Comment on lines 1430 to 1435
if (sample_block_with_columns_to_add.columns() == 0)
{
LOG_DEBUG(log, "The joined right table total rows :{}, total keys :{}, columns added:{}",
data->rows_to_join, data->keys_to_join, sample_block_with_columns_to_add.columns());
return;
}
Contributor

Please elaborate on this condition.
Also, why log sample_block_with_columns_to_add.columns()?

data->rows_to_join, data->keys_to_join, sample_block_with_columns_to_add.columns());
return;
}
joinDispatch(
Contributor

nit:

[[maybe_unused]] bool result = joinDispatch(...);
chassert(result);

@KevinyhZou
Contributor Author

cc @jkartseva

Contributor

@jkartseva jkartseva left a comment

I think this looks good, let's adjust the setting names (see comment) and provide cleaner descriptions, and I'll approve & merge.

@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_perkey_rows_threshold, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance.", 0) \
Contributor

In general, a description should be more focused on the particular setting it's describing.

I think it should be reworded, e.g.:
The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys...

Also, the setting name should contain "lower" or "min".

@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_perkey_rows_threshold, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance.", 0) \
M(Int32, join_to_sort_table_rows_threshold, 10000, "Rerange the right table by key in left or inner hash join when its number of rows not exceed this value and the table keys is dense (controlled by `join_to_sort_perkey_rows_threshold`), to make the join performance improve as output by the data batch of key, but not cost too much on the table reranging.", 0) \
Contributor

Similarly:
The upper threshold of the number of rows in the right table to determine whether to rerange the right table by key in left or inner join.

Or:
The maximum number of rows in the right table...

"upper" or "max" should be in the setting name.

{"input_format_try_infer_datetimes_only_datetime64", true, false, "Allow to infer DateTime instead of DateTime64 in data formats"},
{"join_to_sort_perkey_rows_threshold", 0, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance."},
{"join_to_sort_table_rows_threshold", 0, 10000, "Rerange the right table by key in left or inner hash join when its number of rows not exceed this value and the table keys is dense (controlled by `join_to_sort_perkey_rows_threshold`), to make the join performance improve as output by the data batch of key, but not cost too much on the table reranging."},
{"allow_experimental_join_right_table_sorting", false, false, "If it is set to true, and the conditions of `join_to_sort_perkey_rows_threshold` and `join_to_sort_perkey_rows_threshold` are met, then we will try to rerange the right table by key to improve the performance in left or inner hash join."},
Contributor

Let's remove "try" from the description:

"...are met, rerange the right table by key..."

Contributor

@jkartseva jkartseva left a comment

Looks good, thank you for working on this.

@jkartseva
Contributor

Please update the Changelog entry with a more generalized summary. The present description is too verbose and focused on the specific case.

@KevinyhZou
Contributor Author

done

@KevinyhZou KevinyhZou changed the title Improve left/inner join performance by decrease insertFrom call Improve left/inner join performance by rerange right table by keys Sep 11, 2024
@jkartseva jkartseva added this pull request to the merge queue Sep 11, 2024
Merged via the queue into ClickHouse:master with commit b5289c1 Sep 11, 2024
215 of 218 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 11, 2024
@zvonand
Contributor

zvonand commented Sep 12, 2024

Upgrade check is failing after this one in other PRs:

https://s3.amazonaws.com/clickhouse-test-reports/65488/e76e6d56e17148d34c95cbd91b024dd3b042e4e6/upgrade_check__debug_.html

New settings are not reflected in settings changes history (see new_settings.txt) 	FAIL

   ┌─name────────────────────────────────────────┐
1. │ join_to_sort_minimum_perkey_rows            │
2. │ join_to_sort_maximum_table_rows             │
3. │ allow_experimental_join_right_table_sorting │
   └─────────────────────────────────────────────┘

https://s3.amazonaws.com/clickhouse-test-reports/60341/597181c45e2395991cbb032c7eb2dc3542124e6c/upgrade_check__debug_.html

@KevinyhZou
Contributor Author

It seems these settings should be added to 24.9; I will try to fix this. @zvonand

@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_minimum_perkey_rows, 40, "The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys", 0) \
Member

Same as https://github.com/ClickHouse/ClickHouse/pull/63677/files#r1771844385. Why is this an Int32 if it's treated as unsigned and shouldn't be negative?

Contributor Author

@KevinyhZou KevinyhZou Sep 24, 2024

Yes, the type is wrong. I have opened a PR to change the type to UInt64: #69886
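Why a signed setting type is risky here can be shown with a short sketch. It assumes the value is converted to an unsigned type at the comparison site, which is what "treated as unsigned" suggests; the function names are hypothetical.

```cpp
#include <cstdint>

// Signed setting compared against an unsigned row count: a negative value
// wraps around to a huge unsigned number when converted.
bool isDenseEnough(uint64_t avg_perkey_rows, int32_t min_perkey_rows)
{
    return avg_perkey_rows >= static_cast<uint64_t>(min_perkey_rows);
}

// With an unsigned setting type, a negative value cannot be represented at all,
// so the comparison always means what it says.
bool isDenseEnoughFixed(uint64_t avg_perkey_rows, uint64_t min_perkey_rows)
{
    return avg_perkey_rows >= min_perkey_rows;
}
```

With the signed version, a setting of -1 becomes 2^64 - 1 after conversion, so the check fails for every realistic table and the optimization is silently disabled.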

Labels
can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo
9 participants