executor: fix hang in hash agg when exceeding memory limit leads to panic #57641
Hi @xzhangxian1008. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## master #57641 +/- ##
================================================
+ Coverage 72.8292% 73.4559% +0.6266%
================================================
Files 1676 1677 +1
Lines 463753 466002 +2249
================================================
+ Hits 337748 342306 +4558
+ Misses 105139 102946 -2193
+ Partials 20866 20750 -116
Flags with carried forward coverage won't be shown. Click here to find out more.
/cc @windtalker
/hold
/cc @windtalker @XuHuaiyu
lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: windtalker, XuHuaiyu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@xzhangxian1008: The following test failed, say
/unhold
/retest
@xzhangxian1008: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: close #57546
Problem Summary:
A partial worker gets its input by calling the `getChildInput` function. In this function, we call `Consume` to track the memory occupied by the input chunk. When the SQL statement's memory usage is high, `Consume` may panic. According to the synchronization rule, each time we get a chunk from `getChildInput` we need to call `Done` on the `inflightChunkSync` variable. However, when a panic happens, `Done` is not called, so the counter in `inflightChunkSync` is never decremented to 0, which leads to a hang.
Solution: Each time `getChildInput` is woken up by the channel, we set a variable named `needDone`. When a panic happens in `getChildInput`, we catch the panic in that function, check whether `needDone` is true to decide whether `Done` must be called on `inflightChunkSync`, and finally rethrow the panic.
What changed and how does it work?
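The fix described above can be sketched in isolation. This is a minimal illustration of the pattern, not the code in this PR: `inflightChunkSync` is assumed to behave like a `sync.WaitGroup`, `consume` is a hypothetical stand-in for the memory tracker's `Consume`, chunks are reduced to a size in bytes, and `runWorker`, `errMemExceeded`, and the `limit` parameter are invented for the demo. It also assumes that on the normal path the caller calls `Done` after processing the chunk.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errMemExceeded stands in for the panic raised when the memory limit is exceeded.
var errMemExceeded = errors.New("memory quota exceeded")

// consume is a hypothetical stand-in for the memory tracker's Consume call.
func consume(bytes, limit int64) {
	if bytes > limit {
		panic(errMemExceeded)
	}
}

// getChildInput receives one chunk (here just its size) and tracks its memory.
// The producer is assumed to have called inflight.Add(1) before sending.
func getChildInput(inputCh <-chan int64, inflight *sync.WaitGroup, limit int64) bool {
	needDone := false
	defer func() {
		if r := recover(); r != nil {
			// A chunk was already received, so its Done must not be skipped.
			if needDone {
				inflight.Done()
			}
			panic(r) // rethrow so the caller still sees the failure
		}
	}()

	size, open := <-inputCh
	if !open {
		return false
	}
	needDone = true
	consume(size, limit) // may panic
	return true
}

// runWorker sends a single chunk, runs the worker loop, and reports the
// recovered panic value (if any) and whether the in-flight counter reached 0.
func runWorker(size, limit int64) (recovered any, drained bool) {
	inflight := &sync.WaitGroup{}
	inputCh := make(chan int64, 1)
	inflight.Add(1)
	inputCh <- size
	close(inputCh)

	func() {
		defer func() { recovered = recover() }()
		for getChildInput(inputCh, inflight, limit) {
			inflight.Done() // normal path: chunk processed by the caller
		}
	}()

	// Wait with a timeout so a missing Done shows up as drained == false
	// instead of hanging the demo.
	done := make(chan struct{})
	go func() { inflight.Wait(); close(done) }()
	select {
	case <-done:
		drained = true
	case <-time.After(time.Second):
	}
	return recovered, drained
}

func main() {
	fmt.Println(runWorker(10, 100))   // chunk fits within the limit
	fmt.Println(runWorker(1000, 100)) // chunk exceeds the limit and panics
}
```

In both calls the counter should drain. Removing the `needDone` branch from the deferred function makes the second call time out, which mirrors the hang reported in the issue.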
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.