YARN-11736. Enhance MultiNodeLookupPolicy to allow configuration of extended comparators for better usability. #7121

Open
wants to merge 4 commits into trunk

Conversation

TaoYang526
Contributor

Description of PR

Please refer to JIRA: YARN-11736

How was this patch tested?

UT

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ docker 1m 8s Docker failed to build run-specific yetus/hadoop:tp-12082}.
Subsystem Report/Notes
GITHUB PR #7121
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/1/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@TaoYang526
Contributor Author

@yangwwei @sunilgovind @shameersss1 Could you please help review this PR? Thanks!

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ docker 1m 18s Docker failed to build run-specific yetus/hadoop:tp-9856}.
Subsystem Report/Notes
GITHUB PR #7121
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/2/console
versions git=2.34.1
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@TaoYang526
Contributor Author

TaoYang526 commented Oct 17, 2024

Hi @yangjiandan, it seems that you have been using and contributing to the multi-node mechanism recently. Could you please help review this PR? Thanks.

@shameersss1
Contributor

Sure, will review this week.

Contributor

@shameersss1 left a comment

Please document the new configs and the new class here: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html (the source for this page can be found in the Hadoop code base itself) so that end users know they exist.

private String policyClassName;
private long sortingInterval;

public MultiNodePolicySpec(String policyClassName, long timeout) {
public MultiNodePolicySpec(String policyName, String policyClassName,
Contributor

Why are both policyName and policyClassName required here?

Contributor Author

In MultiNodeSorter#initPolicy, the policy instance is created based on policyClassName, and policyName is used in MultiComparatorPolicy#setConf to look up the configuration that belongs to this policy instance. This is the only way for every policy instance to know which configuration belongs to it; the alternative would be to update the policy interface, which I prefer not to do. If there are better approaches, feel free to propose them.
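
A minimal sketch of the policy-side flow described above, under stated assumptions: the prefix string and the "current-name" / ".comparators" key spellings below are illustrative placeholders, not the literal constants from the patch. It shows how a setConf-style method could use the policy name carried in the config to resolve its own comparator list.

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;

// Illustrative only: key spellings are placeholders for the real constants.
public class ComparatorConfReader {
  private static final String POLICY_PREFIX =
      "yarn.scheduler.capacity.multi-node-sorting.policy.";
  private static final String CURRENT_NAME_KEY = POLICY_PREFIX + "current-name";

  private List<String> comparatorNames;

  public void setConf(Configuration conf) {
    // The sorter stored this policy's name in the conf it handed us,
    // so we can resolve keys that belong only to this instance.
    String policyName = conf.get(CURRENT_NAME_KEY);
    if (policyName == null) {
      return;
    }
    // Read the per-policy comparator list, e.g. "<prefix><name>.comparators".
    String configured = conf.get(POLICY_PREFIX + policyName + ".comparators");
    if (configured != null && !configured.isEmpty()) {
      comparatorNames = Arrays.asList(configured.split(","));
    }
  }

  public List<String> getComparatorNames() {
    return comparatorNames;
  }
}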

}
Configuration policyConf = new Configuration(this.getConfig());
Contributor

Can we reuse the config object instead of creating a new one?

Contributor Author

In MultiNodeSortingManager#createAllPolicies, you can see that all the MultiNodeSorter instances share one config. policyName is set in policyConf, which is an instance-level configuration, so that each policy instance can get the configuration that belongs to it.
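
For illustration, a minimal sketch of that manager-side behavior, assuming a hypothetical spelling for the current-name key (it stands in for MULTI_NODE_SORTING_POLICY_CURRENT_NAME); this is not the literal patch code.

import org.apache.hadoop.conf.Configuration;

// Hedged sketch: derive an instance-level config for one policy
// from the shared scheduler config.
public final class PolicyConfFactory {
  // Hypothetical key; stands in for MULTI_NODE_SORTING_POLICY_CURRENT_NAME.
  private static final String CURRENT_NAME_KEY =
      "yarn.scheduler.capacity.multi-node-sorting.policy.current-name";

  public static Configuration forPolicy(Configuration sharedConf, String policyName) {
    // Copy the shared config so each policy instance gets its own view.
    Configuration policyConf = new Configuration(sharedConf);
    // Tag the copy with the policy's name so the policy can find its own settings.
    policyConf.set(CURRENT_NAME_KEY, policyName);
    return policyConf;
  }

  private PolicyConfFactory() {
  }
}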

// conf keys and default values
public static final String COMPARATORS_CONF_KEY = "comparators";
protected static final List<Comparator> DEFAULT_COMPARATORS = Collections
.unmodifiableList(Arrays.asList(
Contributor

So this means the default sorting policy is based on resource utilization: the node with less resource utilization will be given priority.

Contributor Author

Yes. The default comparators will only be used when no comparators are configured for this policy, or when they are configured incorrectly. I chose this default because I believe optimizing the workload distribution among nodes is the primary use case.
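
As a rough illustration of that default ordering (a sketch only; the class name and the actual default comparator list in the patch may differ), a node comparator that prefers less-utilized nodes could look like this:

import java.util.Comparator;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode;

// Sketch of a utilization-based ordering: nodes with fewer allocated resources
// come first, so new containers are spread onto under-utilized nodes.
public class AllocatedResourceComparator implements Comparator<SchedulerNode> {
  @Override
  public int compare(SchedulerNode n1, SchedulerNode n2) {
    Resource a1 = n1.getAllocatedResource();
    Resource a2 = n2.getAllocatedResource();
    int byMemory = Long.compare(a1.getMemorySize(), a2.getMemorySize());
    if (byMemory != 0) {
      return byMemory;
    }
    return Integer.compare(a1.getVirtualCores(), a2.getVirtualCores());
  }
}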

}
this.conf = conf;
String policyName = conf.get(
CapacitySchedulerConfiguration.MULTI_NODE_SORTING_POLICY_CURRENT_NAME);
Contributor

What is the difference between MULTI_NODE_SORTING_POLICY_CURRENT_NAME and MULTI_NODE_SORTING_POLICY?

Contributor Author

@TaoYang526 Oct 18, 2024

Currently, MULTI_NODE_SORTING_POLICY_NAME is the prefix of all multi-node policy configurations: yarn.scheduler.capacity.multi-node-sorting.policy. It's not a proper name, but it is used in many places, so I prefer not to change it. MULTI_NODE_SORTING_POLICY_CURRENT_NAME is used to pass the policyName to the policy instance.
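
To make the relationship concrete, here is a hedged example of how keys compose under that shared prefix. The policy name "my-policy", the comparator values, and the ".comparators" sub-key spelling are assumptions for illustration, not taken from the patch; the MultiComparatorPolicy class name and package are from the discussion above.

import org.apache.hadoop.conf.Configuration;

public class MultiNodePolicyConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    String prefix = "yarn.scheduler.capacity.multi-node-sorting.policy";

    // Register a policy name and its implementation class under the shared prefix.
    conf.set(prefix + ".names", "my-policy");
    conf.set(prefix + ".my-policy.class",
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler."
            + "placement.policy.MultiComparatorPolicy");

    // Per-policy comparator list introduced by this change (illustrative value).
    conf.set(prefix + ".my-policy.comparators", "ALLOCATED_RESOURCE,NODE_ID");

    // A MULTI_NODE_SORTING_POLICY_CURRENT_NAME-style key would then carry
    // "my-policy" into the policy instance so it can read the keys above.
    System.out.println(conf.get(prefix + ".my-policy.comparators"));
  }
}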

@TaoYang526
Contributor Author

Please document the new configs and the new class here: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html (the source for this page can be found in the Hadoop code base itself) so that end users know they exist.

@shameersss1 There is no introduction to the multi-node policy in that document yet, so there is nothing there for me to update or extend.
I would like to add documentation for the multi-node mechanism, but this PR isn't the right place for it; we may need to create another JIRA ticket for the documentation. Does that sound good?

@slfan1989
Contributor

slfan1989 commented Oct 18, 2024

@TaoYang526 @shameersss1 I am eager to help, even though I’m not very familiar with this part of YARN. I will do my best. If we can confirm together that the modified code is fine, we can proceed with merging it. I've reviewed @shameersss1's code, and the quality is quite good.

@TaoYang526
Contributor Author

@slfan1989 Thank you for joining us; it's great to have you helping out.

@TaoYang526
Contributor Author

@shameersss1 Thanks for the review. I have added javadoc for key fields in the last commit.
Please take another look and let me know if there’s anything else that needs attention.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 11m 56s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 50m 3s trunk passed
-1 ❌ compile 0m 36s /branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 compile 1m 11s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 1m 3s trunk passed
+1 💚 mvnsite 1m 13s trunk passed
+1 💚 javadoc 1m 11s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 1m 3s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 31s trunk passed
-1 ❌ shadedclient 8m 19s branch has errors when building and testing our client artifacts.
_ Patch Compile Tests _
-1 ❌ mvninstall 1m 2s /patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
-1 ❌ compile 0m 10s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 ❌ javac 0m 10s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 ❌ compile 0m 17s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05.
-1 ❌ javac 0m 17s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05.
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 57s /results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 36 new + 139 unchanged - 0 fixed = 175 total (was 139)
-1 ❌ mvnsite 0m 9s /patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
-1 ❌ javadoc 0m 55s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 ❌ javadoc 0m 51s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05.
-1 ❌ spotbugs 2m 36s /new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
-1 ❌ shadedclient 36m 41s patch has errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 0m 27s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-server-resourcemanager in the patch failed.
+0 🆗 asflicense 0m 29s ASF License check generated no output?
122m 30s
Reason Tests
SpotBugs module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.policy.CompositeComparator implements Comparator but not Serializable At MultiComparatorPolicy.java:Serializable At MultiComparatorPolicy.java:[lines 334-359]
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/3/artifact/out/Dockerfile
GITHUB PR #7121
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux aed8082e3e2d 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 95ef881
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/3/testReport/
Max. process+thread count 91 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 11m 44s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 1s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 3 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 43m 21s trunk passed
-1 ❌ compile 0m 33s /branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in trunk failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 compile 0m 55s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 checkstyle 0m 59s trunk passed
+1 💚 mvnsite 1m 1s trunk passed
+1 💚 javadoc 1m 0s trunk passed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04
+1 💚 javadoc 0m 50s trunk passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 spotbugs 2m 0s trunk passed
+1 💚 shadedclient 34m 36s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 48s the patch passed
-1 ❌ compile 0m 24s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 ❌ javac 0m 24s /patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
+1 💚 compile 0m 46s the patch passed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05
+1 💚 javac 0m 46s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 43s /results-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 36 new + 139 unchanged - 0 fixed = 175 total (was 139)
+1 💚 mvnsite 0m 50s the patch passed
-1 ❌ javadoc 0m 44s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04.
-1 ❌ javadoc 0m 43s /patch-javadoc-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_422-8u422-b05-1~20.04-b05.txt hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_422-8u422-b05-1~20.04-b05.
-1 ❌ spotbugs 2m 0s /new-spotbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.html hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0)
+1 💚 shadedclient 34m 41s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 105m 44s hadoop-yarn-server-resourcemanager in the patch passed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
244m 25s
Reason Tests
SpotBugs module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.policy.CompositeComparator implements Comparator but not Serializable At MultiComparatorPolicy.java:Serializable At MultiComparatorPolicy.java:[lines 333-358]
Subsystem Report/Notes
Docker ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/4/artifact/out/Dockerfile
GITHUB PR #7121
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux dd81502eb1c3 5.15.0-117-generic #127-Ubuntu SMP Fri Jul 5 20:13:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 938e0d6
Default Java Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.24+8-post-Ubuntu-1ubuntu320.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_422-8u422-b05-1~20.04-b05
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/4/testReport/
Max. process+thread count 954 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7121/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@TaoYang526
Contributor Author

@slfan1989 Could you please help review this PR? Thanks.

slfan1989 self-requested a review November 11, 2024 14:32
@zuston
Member

zuston commented Nov 26, 2024

Oh, we encountered similar problems with multi-node placement across different resource specs, and I found some bugs in this feature; please refer to: https://zuston.vercel.app/publish/hadoop-yarn/

@zuston
Member

zuston commented Nov 26, 2024

BTW, I think the node sorting policy could be extended by ourselves; there is no need to change the default policy.

@TaoYang526
Contributor Author

TaoYang526 commented Nov 29, 2024

@zuston Thanks for the feedback.

Oh, we encountered similar problems with multi-node placement across different resource specs, and I found some bugs in this feature; please refer to: https://zuston.vercel.app/publish/hadoop-yarn/

You are right. I proposed fixing those bugs in YARN-9598, but it remained in dispute after some discussion, and part of it was merged into the community in YARN-11573, which you mentioned in your article.
FYI, when the scheduler finds another node that can place the pending request, the reserved container for that request can be unreserved before assigning; you can see the details in RegularContainerAllocator#assignContainer.

BTW, I think the node sorting policy could be extended by ourselves; there is no need to change the default policy.

This PR doesn't change the default policy; it just adds a new policy that can be configured for use.

@zuston
Member

zuston commented Nov 29, 2024

FYI, when the scheduler finds another node that can place the pending request, the reserved container for that request can be unreserved before assigning

Thanks for pointing this out; let me take a deeper look.

Oh, I found the associated code, but some of the bugs described in my article may still exist.

@shameersss1
Contributor

@TaoYang526 - Isn't this the same problem as well? https://issues.apache.org/jira/browse/YARN-11728

@zuston
Member

zuston commented Nov 29, 2024

@TaoYang526 - Isn't this the same problem as well? https://issues.apache.org/jira/browse/YARN-11728

Yes, that issue was created by me.

@zuston
Member

zuston commented Nov 29, 2024

@zuston Thanks for the feedback.

Oh, we encountered similar problems with multi-node placement across different resource specs, and I found some bugs in this feature; please refer to: https://zuston.vercel.app/publish/hadoop-yarn/

You are right. I proposed fixing those bugs in YARN-9598, but it remained in dispute after some discussion, and part of it was merged into the community in YARN-11573, which you mentioned in your article. FYI, when the scheduler finds another node that can place the pending request, the reserved container for that request can be unreserved before assigning; you can see the details in RegularContainerAllocator#assignContainer.

BTW, I think the node sorting policy could be extended by ourselves; there is no need to change the default policy.

This PR doesn't change the default policy; it just adds a new policy that can be configured for use.

After reading RegularContainerAllocator#assignContainer, I think some bugs still exist, as described in my article and YARN-11728.

In the code referenced below, the scheduler will just pick up the reserved container from another node, but it obviously doesn't unreserve it in the commit phase; that's the root cause of this bug.

From my perspective, this bug also exists when multi-node placement is disabled; it is just covered up by the normal node round-robin strategy.

FYI, it looks like the mechanism of a reserved container being picked up by another node is not implemented well.

if (availableContainers > 0) {
  // Allocate...
  // We will only do continuous reservation when this is not allocated from
  // reserved container
  if (rmContainer == null && reservationsContinueLooking) {
    // when reservationsContinueLooking is set, we may need to unreserve
    // some containers to meet this queue, its parents', or the users'
    // resource limits.
    if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
      if (!needToUnreserve) {
        // If we shouldn't allocate/reserve new container then we should
        // unreserve one the same size we are asking for since the
        // currentResourceLimits.getAmountNeededUnreserve could be zero. If
        // the limit was hit then use the amount we need to unreserve to be
        // under the limit.
        resourceNeedToUnReserve = capability;
      }
      unreservedContainer = application.findNodeToUnreserve(node,
          schedulerKey, resourceNeedToUnReserve);
      // When (minimum-unreserved-resource > 0 OR we cannot allocate
      // new/reserved container (That means we *have to* unreserve some
      // resource to continue)). If we failed to unreserve some resource,
      // we can't continue.
      if (null == unreservedContainer) {
        // Skip the locality request
        ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
            activitiesManager, node, application, schedulerKey,
            ActivityDiagnosticConstant.
                NODE_CAN_NOT_FIND_CONTAINER_TO_BE_UNRESERVED_WHEN_NEEDED,
            ActivityLevel.NODE);
        return ContainerAllocation.LOCALITY_SKIPPED;
      }
    }
  }
  ContainerAllocation result = new ContainerAllocation(unreservedContainer,
      pendingAsk.getPerAllocationResource(), AllocationState.ALLOCATED);
  result.containerNodeType = type;
  result.setToKillContainers(toKillContainers);
  return result;

@TaoYang526
Contributor Author

TaoYang526 commented Nov 29, 2024

@zuston That FYI was just a friendly reminder, because from your article I thought you were trying to find the logic for how a reserved container's request gets allocated on another node. The bug is still there, and I also mentioned that YARN-9598 tried to fix it but only part of it was merged. It would be great if you could help solve that bug.
Could you please take the further discussion to YARN-11728? I'd like to keep this PR focused on the improvement of the sorting policy. Thanks so much for your understanding!
