[FLINK-12146][network] Remove unregister task from NetworkEnvironment to simplify the interface of ShuffleService #8133
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The bot tracks the review progress through labels, which are applied in the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
|
|
@azagrebin I once considered unifying the closing of partitions/gates with the canceler process, but the exception handling differs slightly, so I kept them independent for now. In addition, I am not sure why the previous implementation only catches exceptions for the gate close, since the partition close might also throw. So I just refactored the processes and kept the previous behavior unchanged. |
|
@flinkbot approve all |
8587a06 to
a4e3c7e
Compare
azagrebin
left a comment
Thanks for opening the PR @zhijiangW ! I have left some comments.
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/ResultPartition.java
for (ResultPartition partition : producedPartitions) {
    taskEventDispatcher.unregisterPartition(partition.getPartitionId());
    partition.close();
I guess:
if (isCanceledOrFailed()) {
    partition.fail(getFailureCause());
} else {
    partition.close();
}
But I think it would be fine to add one more method, closeNetworkResources, move the partition/gate closing from TaskCanceler.run into it, and call it here in releaseNetworkResources after the taskEventDispatcher.unregisterPartition and conditional partition.fail loops. That would eliminate code duplication and improve the logging and exception handling of the former unregisterTask.
If you also think so, we could integrate the network release with the process in task canceler. :)
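A minimal sketch of the shared close-or-fail loop discussed in this thread. PartitionSketch, GateSketch and NetworkResourceCloser are hypothetical stand-ins, not Flink's ResultPartition, InputGate or Task, and collecting the errors in a list instead of logging them is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for ResultPartition and InputGate; not Flink's real types.
interface PartitionSketch {
    void close() throws Exception;

    void fail(Throwable cause) throws Exception;
}

interface GateSketch {
    void close() throws Exception;
}

final class NetworkResourceCloser {

    /**
     * Fails or closes every partition, then closes every gate. An exception from one
     * resource is collected instead of rethrown, so the remaining resources are still handled.
     */
    static List<Throwable> closeOrFail(
            List<PartitionSketch> partitions,
            List<GateSketch> gates,
            boolean isFailed,
            Throwable cause) {
        List<Throwable> errors = new ArrayList<>();
        for (PartitionSketch partition : partitions) {
            try {
                if (isFailed) {
                    partition.fail(cause);
                } else {
                    partition.close();
                }
            } catch (Exception e) {
                errors.add(e);
            }
        }
        for (GateSketch gate : gates) {
            try {
                gate.close();
            } catch (Exception e) {
                errors.add(e);
            }
        }
        return errors;
    }
}
```

Wrapping each resource in its own try/catch also covers the partition close, which the earlier comment noted was not guarded in the previous implementation.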
|
@azagrebin thanks for the reviews and the good inline suggestions! I intended to submit a separate fixup commit addressing your comments. But when I amended the first commit message to add more description, the new code modifications were squashed into the previous commit automatically. Since the overall changes are small and should not cause trouble for your further review, I am keeping the current state. :) |
azagrebin
left a comment
Thanks @zhijiangW ! I left one suggestion.
 * There are two scenarios to release the network resources. One is from {@link TaskCanceler} to early
 * release partitions and gates. Another is from task thread during task exiting.
 */
private void closeNetworkResources(boolean isCanceling) {
I would break this method down:
private void releaseNetworkResources() {
    // for all partitions: taskEventDispatcher.unregisterPartition(...)
    Throwable cause = isCanceledOrFailed() ? getFailureCause() : null;
    closeOrFailNetworkResources(producedPartitions, inputGates, cause);
}
static void closeOrFailNetworkResources(producedPartitions, inputGates, cause) {
    // closing loops, or fail for each partition if cause is not null
}
TaskCanceler.run() {
    closeOrFailNetworkResources(producedPartitions, inputGates, null); // to preserve what we have
}
I would also keep TaskCanceler class static to simplify refactoring in future.
I agree with keeping the previous static TaskCanceler class.
But the static closeNetworkResources might need two more parameters: taskNameWithSubtask, which is used for logging, and the boolean isCanceledOrFailed, because in the previous behavior the result of isCanceledOrFailed is not always equal to cause != null. In the case of Task#cancelExecution, the cause is null but isCanceledOrFailed returns true. What do you think?
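To illustrate why isCanceledOrFailed is not the same as cause != null, here is a simplified model of the state handling. StateSketch and TaskStateSketch are hypothetical names with a reduced set of states; this is a sketch of the argument, not Flink's Task class.

```java
// Simplified, hypothetical model of a task's execution state; the real Task has more states.
enum StateSketch { RUNNING, CANCELING, CANCELED, FAILED }

final class TaskStateSketch {
    private StateSketch state = StateSketch.RUNNING;
    private Throwable failureCause; // stays null on a plain cancellation

    /** Models a cancel request: the state changes, but no failure cause is recorded. */
    void cancelExecution() {
        if (state == StateSketch.RUNNING) {
            state = StateSketch.CANCELING;
        }
    }

    /** Models a failure: the state changes and the cause is recorded. */
    void failExternally(Throwable cause) {
        if (state == StateSketch.RUNNING) {
            state = StateSketch.FAILED;
            failureCause = cause;
        }
    }

    boolean isCanceledOrFailed() {
        return state == StateSketch.CANCELING
                || state == StateSketch.CANCELED
                || state == StateSketch.FAILED;
    }

    Throwable getFailureCause() {
        return failureCause;
    }
}
```

After cancelExecution() in this model, isCanceledOrFailed() returns true while getFailureCause() is still null, which is the case that a single cause parameter cannot express.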
private static void closeNetworkResources(
        ResultPartition[] producedPartitions,
        InputGate[] inputGates,
        boolean isCanceledOrFailed,
Could we rename it to just isFailed? Also, rename the method to closeOrFailNetworkResources.
Or we could actually have Task in TaskCanceler constructor and make this method non-static.
Yes, we prefer this way.
    closeNetworkResources(producedPartitions, inputGates, isCanceledOrFailed(), getFailureCause(), taskNameWithSubtask);
}

private static void closeNetworkResources(
This seems to be used only in one place, I would remove it.
|
@azagrebin thanks for the review again! I submitted a new commit addressing the comments above. |
azagrebin
left a comment
Thanks @zhijiangW ! LGTM 👍
8b43ec3 to
cc1173c
Compare
pnowojski
left a comment
I have left one question and one suggestion.
        ResultPartition[] producedPartitions,
        InputGate[] inputGates) {

TaskCanceler(Logger logger, Task task, AbstractInvokable invokable, Thread executer) {
Why are you introducing a circular dependency here between Task and TaskCanceler? There are various reasons why this is bad, including: is it necessary to expose 27 public methods (including things like startTaskThread() or run()) to the TaskCanceler?
In various places we are trying to get away from this pattern of passing StreamingTask everywhere.
The motivation is to reuse closeOrFailNetworkResources and to avoid a static method that has to be passed the arrays of ResultPartition and InputGate explicitly. The previous AbstractInvokable parameter could also be dropped and obtained from the new Task parameter.
I think the previously introduced AbstractInvokable parameter is also not a good approach, since it exposes more public methods besides AbstractInvokable#cancel(). With a Task parameter, though, we might expose even more public methods. I agree with reverting this change and passing the previous three specific parameters here, even though closeOrFailNetworkResources might look ugly.
 * @param isFailed true if the task has failed.
 * @param cause the exception that caused the task to fail, or null, if the task has not failed.
 */
private void closeOrFailNetworkResources(boolean isFailed, @Nullable Throwable cause) {
Thanks for already de-duplicating this logic ( :) ), but you could go one step further and extract it into something like a TaskCloser class, and de-duplicate/re-use it in TaskCanceler as well?
The suggested TaskCloser seems better; I agree with it.
I would then suggest passing a Runnable closeNetworkResources to the TaskCanceler constructor, because managing partitions/gates is still the task's concern and we can avoid bloating methods with parameters. TaskCanceler does not really close the whole task; it just interrupts the network resources as well.
Also, it just occurred to me that TaskCanceler does not actually need the isFailed case.
@zhijiangW maybe we could simplify closeOrFailNetworkResources and do only the closing there. If we remove close from partition.fail, we could move the conditional partition.fail into the loop in releaseNetworkResources.
Oh... ok, sorry, in my quick review I missed that you had already de-duplicated the TaskCanceler code 😳 I thought I was commenting on two separate issues, but clearly this is just one thing, so I am responding to both threads here in a single comment:
Yes, passing invokable is also not the prettiest thing, but that issue seems to be outside the scope of this PR, right?
I haven't thought it through, but passing a Runnable closeNetworkResources seems fine to me as well. Its drawback is that it is more or less the same thing with a vaguer name/type, in exchange for less boilerplate code. With a simple TaskCloser class we could better specify concurrency contracts (@ThreadSafe) etc., but I would be fine either way.
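A small sketch of the Runnable variant discussed above, in which the canceler only receives callbacks and never sees the task itself. TaskCancelerSketch and both parameter names are hypothetical, and the order of the two steps is an assumption of this example rather than a statement about the real TaskCanceler.

```java
import java.util.Objects;

// Hypothetical canceler that depends on two callbacks instead of on Task or its arrays.
final class TaskCancelerSketch implements Runnable {
    private final Runnable cancelInvokable;
    private final Runnable closeNetworkResources;

    TaskCancelerSketch(Runnable cancelInvokable, Runnable closeNetworkResources) {
        this.cancelInvokable = Objects.requireNonNull(cancelInvokable);
        this.closeNetworkResources = Objects.requireNonNull(closeNetworkResources);
    }

    @Override
    public void run() {
        try {
            // User-defined cancel code may throw; report it and keep going.
            cancelInvokable.run();
        } catch (RuntimeException e) {
            System.err.println("Error while canceling: " + e);
        }
        // The task owns the partitions and gates; the canceler only triggers the closing.
        closeNetworkResources.run();
    }
}
```

A task could construct it with something like new TaskCancelerSketch(invokable::cancel, this::closeNetworkResources), which avoids the circular dependency at the cost of the vaguer Runnable type mentioned above.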
…to simplify the interface of ShuffleService
NetworkEnvironment#unregisterTask is used for closing partitions/gates and releasing partitions from ResultPartitionManager. Partitions/gates can be closed in the task, which already maintains arrays of them. Furthermore, a partition can be released from ResultPartitionManager inside ResultPartition by introducing ResultPartition#fail(Throwable). With that, NetworkEnvironment#unregisterTask is fully replaced and can be removed. The benefit is a simpler interface for NetworkEnvironment, which would be regarded as the default ShuffleService implementation.
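The idea of a partition releasing itself from its manager on failure can be sketched as follows. PartitionManagerSketch and ManagedPartitionSketch are hypothetical classes written for illustration; they only assume what the commit message states, namely that fail(Throwable) triggers the release through the manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry standing in for ResultPartitionManager.
final class PartitionManagerSketch {
    private final Map<String, ManagedPartitionSketch> registered = new HashMap<>();

    void register(ManagedPartitionSketch partition) {
        registered.put(partition.getId(), partition);
    }

    /** Removes the partition from the registry and releases it with the given cause. */
    void release(String id, Throwable cause) {
        ManagedPartitionSketch partition = registered.remove(id);
        if (partition != null) {
            partition.release(cause);
        }
    }

    boolean isRegistered(String id) {
        return registered.containsKey(id);
    }
}

// Hypothetical partition that knows its manager, so the task can fail it directly.
final class ManagedPartitionSketch {
    private final String id;
    private final PartitionManagerSketch manager;
    private boolean released;
    private Throwable releaseCause;

    ManagedPartitionSketch(String id, PartitionManagerSketch manager) {
        this.id = id;
        this.manager = manager;
    }

    String getId() {
        return id;
    }

    /** Entry point used by the task: delegates the release to the manager. */
    void fail(Throwable cause) {
        manager.release(id, cause);
    }

    void release(Throwable cause) {
        released = true;
        releaseCause = cause;
    }

    boolean isReleased() {
        return released;
    }

    Throwable getReleaseCause() {
        return releaseCause;
    }
}
```

With this shape the task only calls fail or close on the partitions it already holds, so no separate unregisterTask call on the environment is needed.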
|
Thanks for further reviews again @azagrebin @pnowojski |
pnowojski
left a comment
LGTM, I will merge it once it's green and @azagrebin has no further comments.
|
LGTM 👍 |
What is the purpose of the change
NetworkEnvironment#unregisterTask is used for closing partitions/gates and releasing partitions from ResultPartitionManager. Partitions/gates can be closed in the task, which already maintains arrays of them. Furthermore, a partition can be released from ResultPartitionManager inside ResultPartition by introducing ResultPartition#fail(Throwable). With that, NetworkEnvironment#unregisterTask is fully replaced and can be removed. The benefit is a simpler interface for NetworkEnvironment, which would be regarded as the default ShuffleService implementation.
Brief change log
unregisterTask from NetworkEnvironment
close(Throwable) in ResultPartition for releasing
Verifying this change
This change is already covered by existing tests.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation