executor: refine hash join v2 for spill #55790

Closed
wants to merge 776 commits into from

Conversation

xzhangxian1008
Contributor

What problem does this PR solve?

Issue Number: ref #55153

Problem Summary:

What changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

windtalker and others added 30 commits June 13, 2024 18:01
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
Signed-off-by: xufei <xufei@pingcap.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 2, 2024

ti-chi-bot bot commented Sep 2, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign myonkeminta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xzhangxian1008 xzhangxian1008 changed the title executor: refine hashv2 join for spill executor: refine hash join v2 for spill Sep 2, 2024

tiprow bot commented Sep 2, 2024

Hi @xzhangxian1008. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

codecov bot commented Sep 2, 2024

Codecov Report

Attention: Patch coverage is 43.01733% with 559 lines in your changes missing coverage. Please review.

Project coverage is 73.4534%. Comparing base (3419bde) to head (75127c7).
Report is 62 commits behind head on master.

Current head 75127c7 differs from pull request most recent head ab22f9b

Please upload reports for the commit ab22f9b to get more accurate results.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #55790        +/-   ##
================================================
+ Coverage   73.0085%   73.4534%   +0.4449%     
================================================
  Files          1584       1600        +16     
  Lines        443036     459190     +16154     
================================================
+ Hits         323454     337291     +13837     
- Misses        99870     101902      +2032     
- Partials      19712      19997       +285     
Flag Coverage Δ
integration 44.0584% <43.0173%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9567% <ø> (ø)
parser ∅ <ø> (∅)
br 45.0436% <ø> (-0.4146%) ⬇️

@xzhangxian1008
Contributor Author

/cc @windtalker @XuHuaiyu

@ti-chi-bot ti-chi-bot bot requested review from windtalker and XuHuaiyu September 3, 2024 01:47
pkg/executor/join/hash_join_base.go (two outdated threads, resolved)
select {
case <-doneCh:
syncerDone(fetcherAndWorkerSyncer)
Contributor

Why not call syncerDone in a deferred function?
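
For readers unfamiliar with the suggestion, here is a minimal sketch of the defer pattern the reviewer is asking about. It does not reproduce the PR's `syncerDone` helper, whose signature is not shown in this thread; `runWorker`, `runAll`, and the use of `sync.WaitGroup` are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorker is a hypothetical worker loop. The deferred wg.Done() fires on
// every return path, so no exit branch can forget to signal completion.
func runWorker(wg *sync.WaitGroup, doneCh <-chan struct{}, tasks <-chan int, processed *int) {
	defer wg.Done()
	for {
		select {
		case <-doneCh:
			return // early exit: Done still runs
		case _, ok := <-tasks:
			if !ok {
				return // input exhausted: Done still runs
			}
			*processed++
		}
	}
}

// runAll feeds n tasks to a single worker and waits for it to exit.
func runAll(n int) int {
	var wg sync.WaitGroup
	doneCh := make(chan struct{}) // never closed in this demo
	tasks := make(chan int, n)
	for i := 0; i < n; i++ {
		tasks <- i
	}
	close(tasks)

	processed := 0
	wg.Add(1)
	go runWorker(&wg, doneCh, tasks, &processed)
	wg.Wait() // safe to read processed after this returns
	return processed
}

func main() {
	fmt.Println(runAll(3))
}
```

Whether a defer fits the real code depends on ordering requirements that are not visible here; if the done signal must be sent before other cleanup on some paths, an explicit call may still be needed.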

return nil
} else if joinResult.err != nil || (joinResult.chk != nil && joinResult.chk.NumRows() > 0) {
w.HashJoinCtx.joinResultCh <- joinResult
} else if joinResult.chk != nil && joinResult.chk.NumRows() == 0 {
Contributor

when will this happen?

Contributor Author

when will this happen?

This logic comes from the original hash join. It may happen when no rows are successfully joined.

builder.rowNumberInCurrentRowTableSeg[partitionID] = 0
failpoint.Inject("finalizeCurrentSegPanic", nil)
seg.finalized = true
htc.memoryTracker.Consume(seg.totalUsedBytes())
if needConsume {
Contributor

when will needConsume be false?

Contributor Author

when will needConsume be false?

When spill is triggered, needConsume will be false.

htc.rowTables = nil
return totalSegmentCnt

htc.clearAllSegmentsInRowTable()
Contributor

why keep htc.rowTables

Contributor Author

why keep htc.rowTables

Because htc.rowTables may be used in the next restore round.

@@ -301,6 +299,7 @@ func (e *HashJoinV2Exec) Open(ctx context.Context) error {
e.prepared = false
return err
}
e.isMemoryClearedForTest = true
Contributor

why set a test variable to true by default?

Contributor Author

why set a test variable to true by default?

Each time a restore round finishes, we check whether all memory has been cleared. The variable is set to false only when some memory has not been cleared, so it needs to be set to true at the beginning.
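
The "optimistic default" described in this reply can be sketched as follows. This is an illustration only: `leakChecker`, `open`, `finishRound`, and `check` are hypothetical names, and the real executor's memory accounting is not shown in this thread.

```go
package main

import "fmt"

// leakChecker is a hypothetical stand-in for the executor's test-only flag.
type leakChecker struct {
	memoryCleared bool
}

// open resets the flag to the optimistic default before any round runs.
func (c *leakChecker) open() {
	c.memoryCleared = true
}

// finishRound flips the flag to false if any bytes are still tracked after a
// round. It never sets the flag back to true, so one leaky round is remembered.
func (c *leakChecker) finishRound(remainingBytes int64) {
	if remainingBytes != 0 {
		c.memoryCleared = false
	}
}

// check replays a sequence of per-round leftover byte counts and reports
// whether every round ended with memory fully released.
func check(rounds []int64) bool {
	c := &leakChecker{}
	c.open()
	for _, remaining := range rounds {
		c.finishRound(remaining)
	}
	return c.memoryCleared
}

func main() {
	fmt.Println(check([]int64{0, 0}), check([]int64{0, 8, 0}))
}
```

Starting from false instead would require some round to prove cleanliness, and a query with no restore rounds at all would then be reported as leaking.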

func (e *HashJoinV2Exec) waitJoinWorkersAndCloseResultChan() {
e.workerWg.Wait()
// finalWorker is responsible for scanning the row table after probe done and wake up the build fetcher
func (e *HashJoinV2Exec) finalWorker(syncer chan struct{}) {
Contributor

I think if the join doesn't need to scan the hash table after probe, there is no need to start the final worker?

Contributor Author

I think if the join doesn't need to scan the hash table after probe, there is no need to start the final worker?

Yes, but the final worker will not create more goroutines when there is no need to scan the hash table.

@@ -588,40 +524,31 @@ func (w *ProbeWorkerV2) getNewJoinResult() (bool, *hashjoinWorkerResult) {
func (e *HashJoinV2Exec) Next(ctx context.Context, req *chunk.Chunk) (err error) {
if !e.prepared {
e.initHashTableContext()
e.initializeForProbe()
Contributor

why move e.initializeForProbe() out from e.fetchAndProbeHashTable()?

Contributor Author

why move e.initializeForProbe() out from e.fetchAndProbeHashTable()?

Because joinResultCh, which is created in e.initializeForProbe(), must be initialized before the build fetcher starts, as joinResultCh is also used for transmitting errors.
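
The ordering constraint in this reply follows from how Go channels behave: a send on a nil channel blocks forever and closing a nil channel panics. A minimal sketch, assuming a simplified executor (the `executor`, `result`, `initForProbe`, `startBuildFetcher`, and `run` names are hypothetical, not the PR's types):

```go
package main

import (
	"errors"
	"fmt"
)

// result carries either data or an error to the consumer.
type result struct{ err error }

// executor is a hypothetical, stripped-down executor.
type executor struct {
	resultCh chan result // nil until initForProbe runs
}

// initForProbe creates the channel shared by the fetcher and the consumer.
func (e *executor) initForProbe() {
	e.resultCh = make(chan result, 1)
}

// startBuildFetcher reports build-side failures through resultCh. If the
// channel were still nil here, the send would block forever and the close
// would panic.
func (e *executor) startBuildFetcher(fail bool) {
	go func() {
		if fail {
			e.resultCh <- result{err: errors.New("build failed")}
		}
		close(e.resultCh)
	}()
}

// run initializes the channel first, then starts the fetcher, then drains.
func run(fail bool) error {
	e := &executor{}
	e.initForProbe() // must precede startBuildFetcher
	e.startBuildFetcher(fail)
	for r := range e.resultCh {
		if r.err != nil {
			return r.err
		}
	}
	return nil
}

func main() {
	fmt.Println(run(true), run(false))
}
```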

req.Reset()

result, ok := <-e.joinResultCh
e.recycleChunk(req, result)
Contributor

why move this recycle code before error check?

Contributor Author

why move this recycle code before error check?

If an error happens, we return directly and fail to recycle the chunk, which makes the workers hang.
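
A small sketch of why the order matters, assuming each worker owns a bounded pool of chunks that the consumer must hand back. The `chunk`, `workerResult`, `next`, and `demo` names are hypothetical and the PR's `recycleChunk` is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// chunk is a hypothetical reusable row buffer.
type chunk struct{ rows []int }

// workerResult is what a worker sends to the consumer.
type workerResult struct {
	chk *chunk
	err error
	src chan<- *chunk // where the consumer must return chk
}

// next mirrors the ordering under discussion: recycle first, then check err.
func next(resultCh <-chan workerResult) ([]int, error) {
	res, ok := <-resultCh
	if !ok {
		return nil, nil
	}
	var rows []int
	if res.chk != nil {
		rows = append(rows, res.chk.rows...)
		res.chk.rows = res.chk.rows[:0]
		res.src <- res.chk // hand the buffer back even if res.err != nil
	}
	if res.err != nil {
		return nil, res.err
	}
	return rows, nil
}

// demo simulates a worker whose only buffer travels with an error result and
// reports how many buffers are back in the pool afterwards.
func demo() (int, error) {
	pool := make(chan *chunk, 1)
	pool <- &chunk{}
	resultCh := make(chan workerResult, 1)

	chk := <-pool // the worker takes its only buffer
	chk.rows = append(chk.rows, 1, 2)
	resultCh <- workerResult{chk: chk, err: errors.New("probe failed"), src: pool}

	_, err := next(resultCh)
	return len(pool), err
}

func main() {
	fmt.Println(demo())
}
```

If `next` returned on the error before the send to `res.src`, the pool would stay empty and a worker waiting on it for its next buffer would block indefinitely.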

e.buildFinished <- err
return false
for ifContinue {
ifContinue, err = e.dispatchBuildTasksImpl(syncer)
Contributor

dispatchBuildTasksImpl can be called multiple times here?

Contributor Author

dispatchBuildTasksImpl can be called multiple times here?

Yes, after spill is added, this function will be called multiple times.
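
The `for ifContinue` loop in the hunk above can be read as "one build round for the original input, plus one round per spilled partition that has to be restored". A minimal sketch under that assumption; the `dispatcher` type and its fields are hypothetical and the PR's `dispatchBuildTasksImpl` is not reproduced.

```go
package main

import "fmt"

// dispatcher is a hypothetical stand-in for the build fetcher.
type dispatcher struct {
	pending []string // spilled partitions still waiting to be restored
	rounds  int      // number of build rounds executed so far
}

// dispatchOnce runs one build round on the current input and reports whether
// another round is needed because a spilled partition remains.
func (d *dispatcher) dispatchOnce() (bool, error) {
	d.rounds++
	if len(d.pending) == 0 {
		return false, nil
	}
	d.pending = d.pending[1:] // the next round restores this partition
	return true, nil
}

// dispatchAll repeats dispatchOnce until no spilled data is left.
func (d *dispatcher) dispatchAll() error {
	ifContinue := true
	var err error
	for ifContinue {
		ifContinue, err = d.dispatchOnce()
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	d := &dispatcher{pending: []string{"p0", "p1"}}
	err := d.dispatchAll()
	fmt.Println(d.rounds, err)
}
```

Without spill the loop body runs exactly once, which matches the pre-spill behavior of a single dispatch call.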

@@ -114,12 +114,13 @@ func (s *parallelSortSpillAction) actionImpl(t *memory.Tracker) bool {
s.spillHelper.cond.Wait()
}

// we may be in `needSpill` status, so it's necessary to check if we are in `notSpilled` status
Contributor

unrelated changes?

Contributor Author

unrelated changes?

Yes, it's unrelated to the hash join refinement. But I found that the comment in sort spill could be improved, so I included it in this PR.

j.appendBuildRowToCachedBuildRowsAndConstructBuildRowsIfNeeded(createMatchRowInfo(0, currentRow), joinResult.chk, 0, false)
insertedRows++
}
j.appendBuildRow(meta, joinResult.chk, j.rowIter.getValue(), &insertedRows)
Contributor

why change this

Contributor Author

why change this

I think it's more readable to put this code into a function, so that readers can tell its purpose from the function name.

// We must ensure that all prebuild workers have exited before
// set `finished` flag and close buildFetcherAndDispatcherSyncChan
fetcherAndWorkerSyncer.Wait()
e.finished.Store(true)
Contributor

Does this mean this function needs to wait until all probes finish?

Contributor Author

Does this mean this function needs to wait until all probes finish?

No, the build fetcher can exit directly without waiting for all probes to finish.

ti-chi-bot bot commented Sep 4, 2024

@xzhangxian1008: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-integration-ddl-test ab22f9b link true /test pull-integration-ddl-test
idc-jenkins-ci-tidb/build ab22f9b link true /test build
idc-jenkins-ci-tidb/check_dev ab22f9b link true /test check-dev
pull-mysql-client-test ab22f9b link true /test pull-mysql-client-test
idc-jenkins-ci-tidb/unit-test ab22f9b link true /test unit-test
idc-jenkins-ci-tidb/mysql-test ab22f9b link true /test mysql-test
idc-jenkins-ci-tidb/check_dev_2 ab22f9b link true /test check-dev2

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 6, 2024

ti-chi-bot bot commented Sep 6, 2024

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@xzhangxian1008 xzhangxian1008 deleted the refine-hash-join branch December 4, 2024 08:17
Labels
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
4 participants