Improve left/inner join performance by rerange right table by keys #60341
Conversation
Note that we may have sparse columns, and working with them through the IColumn interface is critical.
This is an automated comment for commit 597181c with a description of existing statuses. It is updated for the latest CI run ❌.
Some initial comments.
src/Interpreters/RowRefs.h
Outdated
{
    auto * batch = pool.alloc<Batch>();
    *batch = Batch(this);
    batch->insert(std::move(row_ref), pool);
    return batch;
}
...
row_nums[size] = row_ref.row_num;
Why is `row_nums` required? Isn't `row_nums[i]` essentially the same as `row_refs[i].row_num`?
src/Interpreters/RowRefs.h
Outdated
@@ -46,6 +46,7 @@ struct RowRefList : RowRef
    SizeT size = 0; /// It's smaller than size_t but keeps align in Arena.
    Batch * next;
    RowRef row_refs[MAX_SIZE];
    UInt64 row_nums[MAX_SIZE];
The type should be `ColumnIndex`.
src/Columns/ColumnString.cpp
Outdated
void ColumnString::insertIndicesFrom(const IColumn & src, const IColumn::ColumnIndex * selector, const size_t & size)
{
    for (size_t i = 0; i < size; ++i)
        insertFrom(src, *(selector + i));
}
This overload is not different from the base class method.
src/Interpreters/RowRefs.h
Outdated
void nextBatch()
{
    batch = batch->next;
This invalidates the existing position; using `operator++` and `nextBatch()` together is not viable.
src/Interpreters/RowRefs.h
Outdated
@@ -55,14 +56,14 @@ struct RowRefList : RowRef

    Batch * insert(RowRef && row_ref, Arena & pool)
    {
-       if (full())
+       if (full() || (size > 0 && row_ref.block != row_refs[0].block))
The new condition should be in the form of a function with a human-readable name.
Force-pushed from 2e2e1d1 to 2cf4dbf
We have optimized the inner join to use batch insert, based on PR #58278, and in our Gluten case it performs better. Please review it, thanks. @jkartseva
Force-pushed from b16f0da to 4a40b4b
Force-pushed from 4a40b4b to 202b571
cc @jkartseva
Force-pushed from 8b38c9c to f266807
Is the most recent iteration ready for review, @KevinyhZou?
Yes @jkartseva
I assume some CI failures can be related; for example, one contains a thread sanitizer report. Could you please check it?
OK
Force-pushed from a85bf39 to 9976aeb
cc @vdimir
Force-pushed from e475372 to 4200820
Dear @jkartseva, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
Force-pushed from c483bb0 to add486b
Any comments about this PR? @jkartseva
I'll take a look by the end of the week, @KevinyhZou
I think we should make this feature experimental. I can help with rolling it out to the staging tier in our cloud. If there are no regressions, we may deprecate the experimental flag.
{
    for (size_t i = 0; i < block.columns(); ++i)
    {
        auto & col = *(block.getByPosition(i).column->assumeMutable());
assumeMutableRef()
src/Core/Settings.h
Outdated
M(Int32, join_to_sort_perkey_rows_threshold, 40, "The lower limit of per-key average rows in the right table to determine whether to sort it in hash join.", 0) \
M(Int32, join_to_sort_table_rows_threshold, 10000, "The upper limit of rows in the right table to determine whether to sort it in hash join.", 0) \
How were the 40 and 10000 thresholds selected?
Let's make this feature experimental (e.g., `allow_experimental_inner_join_right_table_sorting`) and provide a functional test with this setting `SET` to `1`.
The meaning of the thresholds is unclear without reading the code. We could consider one of the following options: updating the description to clarify how changing the setting affects user-experience (for example, using a special join method that improves performance for wide tables but increases memory consumption) or, even better, removing the thresholds and choosing the best value automatically.
I tested locally and found that if there are many rows on the right but very few matching rows, sorting leads to performance degradation. In that scenario, testing showed that 10000 is a reasonable value for `join_to_sort_table_rows_threshold`: it means the right table is not very big, so sorting it will not cause significant degradation. Conversely, if there are lots of matching rows to output, the threshold can be increased to allow a larger right table to be sorted, which still yields a significant performance improvement.
The other threshold defaults to 40 because my tests showed that when the table keys are not dense enough, sorting can also degrade performance; with the threshold set to 40, there is no slowdown. @jkartseva
I have updated the descriptions of the threshold settings; please take a look and see whether they are OK. @vdimir
@@ -115,6 +115,7 @@ class AddedColumns
    }
    join_data_avg_perkey_rows = join.getJoinedData()->avgPerKeyRows();
    output_by_row_list_threshold = join.getTableJoin().outputByRowListPerkeyRowsThreshold();
    join_data_sorted = join.getJoinedData()->sorted;
Let's move this to the initialization list.
void HashJoin::tryRerangeRightTableData()
{
    if ((kind != JoinKind::Inner && kind != JoinKind::Left) || strictness != JoinStrictness::All || table_join->getMixedJoinExpression())
!isInnerOrLeft(kind)
void HashJoin::tryRerangeRightTableDataImpl(Map & map [[maybe_unused]])
{
    constexpr JoinFeatures<KIND, STRICTNESS, Map> join_features;
    if constexpr (join_features.is_all_join && (join_features.left || join_features.inner))
The external function `tryRerangeRightTableData` already checks these conditions. Let's throw a `LOGICAL_ERROR` if they are not satisfied here.
auto it = rows_ref.begin();
if (it.ok())
{
    if (blocks.empty() || blocks.back().rows() > DEFAULT_BLOCK_SIZE)
Shouldn't the condition be `blocks.back().rows() >= DEFAULT_BLOCK_SIZE`?
kind,
strictness,
data->maps.front(),
false,
nit: /*prefer_use_maps_all*/ false
if (sample_block_with_columns_to_add.columns() == 0)
{
    LOG_DEBUG(log, "The joined right table total rows :{}, total keys :{}, columns added:{}",
        data->rows_to_join, data->keys_to_join, sample_block_with_columns_to_add.columns());
    return;
}
Please elaborate on this condition. Also, why log `sample_block_with_columns_to_add.columns()`?
        data->rows_to_join, data->keys_to_join, sample_block_with_columns_to_add.columns());
    return;
}
joinDispatch(
nit:
[[maybe_unused]] bool result = joinDispatch(...);
chassert(result);
cc @jkartseva
I think this looks good, let's adjust the setting names (see comment) and provide cleaner descriptions, and I'll approve & merge.
src/Core/Settings.h
Outdated
@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_perkey_rows_threshold, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance.", 0) \
In general, a description should be more focused on the particular setting it's describing.
I think it should be reworded, e.g.:
The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys...
Also, the setting name should contain "lower" or "min".
src/Core/Settings.h
Outdated
@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_perkey_rows_threshold, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance.", 0) \
M(Int32, join_to_sort_table_rows_threshold, 10000, "Rerange the right table by key in left or inner hash join when its number of rows not exceed this value and the table keys is dense (controlled by `join_to_sort_perkey_rows_threshold`), to make the join performance improve as output by the data batch of key, but not cost too much on the table reranging.", 0) \
Similarly:
The upper threshold of the number of rows in the right table to determine whether to rerange the right table by key in left or inner join.
Or:
The maximum number of rows in the right table...
"upper" or "max" should be in the setting name.
src/Core/SettingsChangesHistory.cpp
Outdated
{"input_format_try_infer_datetimes_only_datetime64", true, false, "Allow to infer DateTime instead of DateTime64 in data formats"},
{"join_to_sort_perkey_rows_threshold", 0, 40, "Rerange the right table by key in left or inner hash join when the per-key average rows of it exceed this value (means the table keys is dense) and its number of rows is not too many(controlled by `join_to_sort_table_rows_threshold`), to make the join output by the data batch of key, which would improve performance."},
{"join_to_sort_table_rows_threshold", 0, 10000, "Rerange the right table by key in left or inner hash join when its number of rows not exceed this value and the table keys is dense (controlled by `join_to_sort_perkey_rows_threshold`), to make the join performance improve as output by the data batch of key, but not cost too much on the table reranging."},
{"allow_experimental_join_right_table_sorting", false, false, "If it is set to true, and the conditions of `join_to_sort_perkey_rows_threshold` and `join_to_sort_perkey_rows_threshold` are met, then we will try to rerange the right table by key to improve the performance in left or inner hash join."},
Let's remove "try" from the description:
"...are met, rerange the right table by key..."
Looks good, thank you for working on this.
Please update the Changelog entry with a more generalized summary. The present description is too verbose and focused on the specific case.
Done
Upgrade check is failing after this one in other PRs:
It seems these settings should be added to 24.9, and I will try to fix this. @zvonand
@@ -922,6 +922,9 @@ class IColumn;
M(Bool, implicit_transaction, false, "If enabled and not already inside a transaction, wraps the query inside a full transaction (begin + commit or rollback)", 0) \
M(UInt64, grace_hash_join_initial_buckets, 1, "Initial number of grace hash join buckets", 0) \
M(UInt64, grace_hash_join_max_buckets, 1024, "Limit on the number of grace hash join buckets", 0) \
M(Int32, join_to_sort_minimum_perkey_rows, 40, "The lower limit of per-key average rows in the right table to determine whether to rerange the right table by key in left or inner join. This setting ensures that the optimization is not applied for sparse table keys", 0) \
Same as https://github.com/ClickHouse/ClickHouse/pull/63677/files#r1771844385. Why is this an Int32 if it's treated as unsigned and shouldn't be negative?
Yes, the type is wrong. I have made a PR to change the type to UInt64: #69886
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improve join performance by reranging the right table by keys when the table keys are dense, for left or inner hash joins.
Documentation entry for user-facing changes