Add motion hazard to the outer side of parallel aware join (fix flaky incorrect results of agg) #284
Conversation
Force-pushed from c7865ca to e4b313c
Results check of the case in #259, rewritten to SQL:

```sql
select count(i3::numeric), sum(i3::numeric), avg(i3::numeric) from
(SELECT t1.* FROM test_hj_spill AS t1 right JOIN test_hj_spill AS t2 ON
t1.i1=t2.i2) foo;
```

Correct result:

```
 count |   sum    |         avg
-------+----------+----------------------
 45000 | 22477500 | 499.5000000000000000
(1 row)
```

Without this patch: 500 loops, 470 succeeded to run, 30 wrong results.

```
gpadmin@i-v3roa8c2:/var/tmp$ cat res.log | grep -i count | wc -l
470
gpadmin@i-v3roa8c2:/var/tmp$ cat res.log | grep -i 45000 | wc -l
440
```

It's possible that some loops failed to run the SQL file at all, because I limited the resources and CBDB may report errors like:

```
psql:p.sql:3: ERROR:  failed to acquire resources on one or more segments
DETAIL:  connection to server at "127.0.1.1", port 9004 failed: FATAL:  reader could not find writer proc entry
DETAIL:  lock [0,1260] AccessShareLock 0. Probably because writer gang is gone somehow. Maybe try rerunning.
(seg2 127.0.1.1:9004)
```

That's expected; those loops are simply skipped.

With this patch: 500 loops, 474 succeeded to run, 0 wrong results.

```
gpadmin@i-v3roa8c2:/var/tmp$ cat res.log | grep -i count | wc -l
474
gpadmin@i-v3roa8c2:/var/tmp$ cat res.log | grep -i 45000 | wc -l
474
```

Comparison over the 500-loop test: 470 runs with 440 correct results without the patch, versus 474 runs with 474 correct results with it.
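The counting above can be reproduced with a small loop; a sketch is below. The file names `p.sql` and `res.log` are taken from the log output; the `psql` loop itself is shown only as a comment since it needs a running cluster, and the `printf` lines substitute a synthetic two-run log so the grep counting is demonstrable on its own.

```shell
# Actual loop (needs a running cluster):
#   for i in $(seq 1 500); do psql -f p.sql >> res.log 2>&1; done
# Synthetic two-run log standing in for real output:
printf ' count |   sum    | avg\n 45000 | 22477500 | 499.5\n'  > res.log
printf ' count |   sum    | avg\n 44000 | 22000000 | 500.0\n' >> res.log
grep -ci count res.log   # runs that produced a result -> 2
grep -ci 45000 res.log   # runs with the correct count  -> 1
```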
Force-pushed from e4b313c to 2a994b5
For a parallel aware join, we may partition outer data into batches if there is not enough memory. If a worker has consumed all the data from its subnode, it arrives at the barrier and waits for the others before beginning the next phase: PHJ_BUILD_DONE. PHJ_BUILD_DONE means that everything from the PHJ_BUILD_HASHING_OUTER phase has been partitioned and we are ready to probe. This holds for Postgres, because if a worker pulls no data from its subnode and leaves the PHJ_BUILD_HASHING_OUTER phase, all of the subnode's data must already have been processed (e.g. a Parallel Seq Scan). Once all participants leave PHJ_BUILD_HASHING_OUTER, the whole data set has been pulled completely by these processes; otherwise there would be sibling workers that never participated in the PHJ_BUILD_HASHING_OUTER phase at all. But this does not hold for CBDB: a plan node may have Motions beneath it. That some parallel workers pull no data from their subnode does not mean the other workers have no data left to process. Before moving to the next phase, we need to wait for all parallel workers, not just for the build barrier's participants that have arrived at the current phase. Add an outer_motion_barrier to ensure all parallel workers finish their work of partitioning outer batches. Authored-by: Zhang Mingli avamingli@gmail.com
Force-pushed from 2a994b5 to 12bb939
LGTM
closes: #ISSUE_Number
Change logs
Describe your change clearly, including what problem is being solved or what feature is being added.
If it breaks backward or forward compatibility in any way, please clarify.
Why are the changes needed?
Describe why the changes are necessary.
Does this PR introduce any user-facing change?
If yes, please clarify the previous behavior and the change this PR proposes.
How was this patch tested?
Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.
Contributor's Checklist
Here are some reminders and checklists before/when submitting your pull request, please check them:
make installcheck
make -C src/test installcheck-cbdb-parallel