Inefficient hash join degree updating #3254

Closed
yuhao-su opened this issue Jun 15, 2022 · 3 comments

Comments

@yuhao-su
Contributor

We solved #2794 in #3241 by updating all matched rows. This is inefficient.

One quick improvement would be to upload only the incremental changes of Update((Row, Row)) from the Memtable to S3.
But this cannot avoid updating unnecessary datums in a row (the only datum expected to change is the degree datum). We would need other dirty hacks to achieve that.
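For illustration, here's a minimal sketch of what such an incremental write op could look like; RowOp, Row, and Datum are hypothetical stand-ins here, not necessarily the actual Memtable/StateTable types:

```rust
/// Hypothetical minimal types, for illustration only.
type Datum = Option<i64>;
type Row = Vec<Datum>;

/// A write op recorded in the memtable. A dedicated `Update` variant that
/// carries both the old and the new row image lets the flush path diff the
/// two and upload only the datums that actually changed (here, just the
/// degree datum) instead of rewriting the whole row to S3.
enum RowOp {
    Insert(Row),
    Delete(Row),
    /// (old_row, new_row)
    Update((Row, Row)),
}
```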

Another possible solution is to investigate the approach in Materialize, which seems to be able to transform outer joins into inner joins, so that we have a chance to remove the degree entirely.
This approach could also simplify the shared-arrangement implementation for join types other than inner.

@jon-chuang
Contributor

jon-chuang commented Jun 20, 2022

Indeed, there is write amplification due to the row degree. It is bad for joins with many matches.

The specific scenario that is bad for us is when one side is very large and the other side is updated frequently.
I guess this scenario could be quite plausible: we may have a user_id table joined with a table of friend users, and the user_id table may also be joined with interactions. The interactions could be quite frequent, resulting in write amplification across all the rows representing different friends. For this type of workload with frequent updates, the solution described in #2794 may be advantageous.


> Another possible solution is to investigate the approach in Materialize

If I’m not wrong, we already have an optimization that reduces an outer join to an inner join whenever a downstream predicate includes a column from the non-outer side (such a predicate is null-rejecting, so the NULL-padded rows that the outer join would add are filtered out anyway).

We would still need to handle situations where we cannot optimise the outer join away.


Basically, I think as the first step, we can improve the situation by:

  1. Optimising away the degree write when it is not necessary, i.e. not writing a matched-row degree for a join side whose opposite side is not an outer side. (Meaning, we should change the state schema for inner sides and modify the executor logic not to handle the degree for inner sides.) So e.g. JoinRow can become an enum with variants MatchesInnerRow and MatchesOuterRow, where MatchesInnerRow is just a Row and MatchesOuterRow is the current implementation with a degree; both points are sketched after this list.
  2. The other optimization: if the predicate is only an eq condition, we can store a single degree value for the entire join-key state. That would at least reduce the write amplification. I think this is the minimal amount of work we would have to do if we do not want to recompute the state for a match with an outer update.
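As a rough sketch of both points (all names here are hypothetical, not the actual executor types):

```rust
/// Hypothetical minimal types, for illustration only.
type Datum = Option<i64>;
type Row = Vec<Datum>;
type Degree = u64;

/// Point 1: only store a degree when the opposite side is an outer side.
enum JoinRow {
    /// Opposite side is inner: a plain row, no degree to write or update.
    MatchesInnerRow(Row),
    /// Opposite side is outer: keep the per-row degree, as today.
    MatchesOuterRow { row: Row, degree: Degree },
}

/// Point 2: with a pure equality condition, every row under a join key has
/// the same number of matches, so one counter per key can replace one per row.
struct JoinKeyState {
    rows: Vec<Row>,
    /// A single degree for the whole key, cutting the degree writes per
    /// update from one per matched row down to one.
    key_degree: Degree,
}
```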

Basically, both of the above ideas trade the generality of the current implementation for many specific cases, and they complicate the state representation.

My sense is that we should not pursue this optimization unless we have a specific user workload we want to optimise and we are seeing a bottleneck due to the write amplification.

However, point 1 seems more likely to yield benefits for us, since joins tend to be inner joins, so we can at least optimise for the majority case.

We should probably have a category for such last-mile optimizations.

@yuhao-su
Contributor Author

yuhao-su commented Jun 20, 2022

> Indeed, there is write amplification due to the row degree. It is bad for joins with many matches.
> ...
> We should probably have a category for such last-mile optimizations.

Both optimizations sound feasible to me!

The optimization with incremental changes is a more general way to reduce the write amplification (no need to modify the join executor). It might be implemented later in StateTable.

@BugenZhao
Member

[profiling screenshot]

Seems ~50% of the time is spent on increasing and decreasing the degree. (nexmark q5, risedev p 3 node)
