fix volcano podgroup update issue #2079

Merged
google-oss-prow bot merged 5 commits into kubeflow:master from wyen/fix_pg_update
May 30, 2024

ckyuto
Contributor

@ckyuto ckyuto commented Apr 22, 2024

What this PR does / why we need it:
This fixes a regression caused by an earlier PR, which allows minMember to be updated when the number of replicas changes. However, that change also accidentally changes the queue value: it syncs the queue value in the PodGroup with the value in runPolicy.SchedulingPolicy.Queue, which is not applicable to all use cases.

In our use case, we inject the queue value according to which org the user belongs to. The earlier change overrides the queue value we set. The queue value should not be updated once it is set.
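
To make the intended behavior concrete, here is a minimal sketch in Go. It is not the code merged in this PR: `PodGroupSpec` and `syncPodGroupSpec` are hypothetical, simplified stand-ins for the Volcano PodGroup spec and the controller's sync logic. The assumption is that minMember should keep tracking the job, while a queue that is already set (for example, one injected per org) is left alone.

```go
package main

import "fmt"

// PodGroupSpec is a simplified, hypothetical stand-in for the Volcano
// PodGroup spec; only the two fields discussed in this PR are modeled.
type PodGroupSpec struct {
	MinMember int32
	Queue     string
}

// syncPodGroupSpec updates MinMember to the desired value but never
// overwrites a queue that has already been set. It reports whether the
// spec was modified.
func syncPodGroupSpec(current *PodGroupSpec, desiredMinMember int32, desiredQueue string) bool {
	changed := false
	if current.MinMember != desiredMinMember {
		current.MinMember = desiredMinMember
		changed = true
	}
	// Only fill in the queue when it has never been set.
	if current.Queue == "" && desiredQueue != "" {
		current.Queue = desiredQueue
		changed = true
	}
	return changed
}

func main() {
	// The queue "org-a" was injected earlier; the job spec asks for "default".
	pg := &PodGroupSpec{MinMember: 2, Queue: "org-a"}
	changed := syncPodGroupSpec(pg, 4, "default")
	fmt.Println(changed, pg.MinMember, pg.Queue) // true 4 org-a
}
```

With this shape, scaling the replicas still updates minMember, while the injected queue survives the sync.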

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Member

@andreyvelich andreyvelich left a comment

Thank you for this fix @ckyuto!
Could you please rebase it?

@coveralls

coveralls commented Apr 26, 2024

Pull Request Test Coverage Report for Build 9296657539

Details

  • 0 of 3 (0.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.008%) to 35.423%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/common/job.go | 0 | 3 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 9295074366: | -0.008% |
| Covered Lines: | 4380 |
| Relevant Lines: | 12365 |

💛 - Coveralls

@ckyuto
Contributor Author

ckyuto commented May 1, 2024

@andreyvelich Can you help review?

@tenzen-y
Member

tenzen-y commented May 2, 2024

@ckyuto Could you eliminate irrelevant commits?

@ckyuto
Contributor Author

ckyuto commented May 2, 2024

@tenzen-y Thanks for the comment. Removed.

@ckyuto ckyuto force-pushed the wyen/fix_pg_update branch 2 times, most recently from f3c56ef to 88347fa Compare May 2, 2024 09:40
@google-oss-prow google-oss-prow bot added size/XS and removed size/M labels May 3, 2024
@ckyuto ckyuto force-pushed the wyen/fix_pg_update branch 2 times, most recently from 0fd9120 to b885e4d Compare May 3, 2024 08:44
@ckyuto
Contributor Author

ckyuto commented May 6, 2024

@andreyvelich @tenzen-y I think there's a simple way to fix this. Can I get a review again?

Member

@Tomcli Tomcli left a comment

/lgtm

@ckyuto
Contributor Author

ckyuto commented May 28, 2024

@tenzen-y The failed workflow looks like a transient error. Could you rerun it?

61.59 Get:80 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 libcurl3-gnutls arm64 7.68.0-1ubuntu2.22 [213 kB]
62.04 Get:81 http://ports.ubuntu.com/ubuntu-ports focal/main arm64 liberror-perl all 0.17029-1 [26.5 kB]
62.11 Get:82 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 git-man all 1:2.25.1-1ubuntu3.11 [887 kB]
64.01 Get:83 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 git arm64 1:2.25.1-1ubuntu3.11 [4437 kB]
73.32 Get:84 http://ports.ubuntu.com/ubuntu-ports focal/universe arm64 libomp5-10 arm64 1:10.0.0-4ubuntu1 [233 kB]
73.74 Get:85 http://ports.ubuntu.com/ubuntu-ports focal/universe arm64 libomp-10-dev arm64 1:10.0.0-4ubuntu1 [44.5 kB]
73.79 Get:86 http://ports.ubuntu.com/ubuntu-ports focal/universe arm64 libomp-dev arm64 1:10.0-50~exp1 [2824 B]
73.82 Fetched 68.2 MB in 40s (1713 kB/s)
73.83 E: Failed to fetch http://ports.ubuntu.com/ubuntu-ports/pool/main/i/isl/libisl22_0.22.1-1_arm64.deb  Undetermined Error [IP: 185.125.190.39 80]
73.83 E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
------
WARNING: No output specified with docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
Dockerfile:9
--------------------
   8 |     
   9 | >>> RUN apt-get update -y && \
  10 | >>>     apt-get install -y --no-install-recommends \
  11 | >>>         ca-certificates \
  12 | >>>         cmake \
  13 | >>>         build-essential \
  14 | >>>         gcc \
  15 | >>>         g++ \
  16 | >>>         git \
  17 | >>>         libomp-dev && \
  18 | >>>     rm -rf /var/lib/apt/lists/*
  19 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c apt-get update -y &&     apt-get install -y --no-install-recommends         ca-certificates         cmake         build-essential         gcc         g++         git         libomp-dev &&     rm -rf /var/lib/apt/lists/*" did not complete successfully: exit code: 100

Member

@tenzen-y tenzen-y left a comment

Generally, lgtm

Could you extend the PyTorchJob integration test to verify the validations?

It("Should get the corresponding resources successfully", func() {
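
For context, the validation under test rejects changes to the queue on update. The sketch below only illustrates that idea and is not the operator's actual webhook code: `SchedulingPolicy`, `validateQueueUpdate`, and `queueOf` are hypothetical names, and treating a nil policy as an empty queue is an assumption. The error message is the one the integration test in this PR checks for.

```go
package main

import (
	"errors"
	"fmt"
)

// SchedulingPolicy is a simplified, hypothetical stand-in for the job's
// runPolicy.schedulingPolicy; only the queue field is modeled.
type SchedulingPolicy struct {
	Queue string
}

// queueOf returns the queue of a policy, treating a nil policy as an
// empty queue (an assumption made for this sketch).
func queueOf(p *SchedulingPolicy) string {
	if p == nil {
		return ""
	}
	return p.Queue
}

// validateQueueUpdate rejects any update that changes the queue value.
func validateQueueUpdate(oldPolicy, newPolicy *SchedulingPolicy) error {
	if queueOf(oldPolicy) != queueOf(newPolicy) {
		return errors.New("spec.runPolicy.schedulingPolicy.queue is immutable")
	}
	return nil
}

func main() {
	oldPolicy := &SchedulingPolicy{Queue: "org-a"}
	// Same queue: the update is accepted.
	fmt.Println(validateQueueUpdate(oldPolicy, &SchedulingPolicy{Queue: "org-a"}))
	// Different queue: the update is rejected.
	fmt.Println(validateQueueUpdate(oldPolicy, &SchedulingPolicy{Queue: "test"}))
}
```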

@ckyuto
Contributor Author

ckyuto commented May 29, 2024

> Generally, lgtm
>
> Could you extend the PyTorchJob integration test to verify the validations?
>
> It("Should get the corresponding resources successfully", func() {

Updated

Member

@tenzen-y tenzen-y left a comment

I'm not sure why the CI still shows a running state even though it succeeded.
So, could you try rebasing this PR?

Comment on lines 205 to 213
updatedJob := &kubeflowv1.PyTorchJob{}
Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")

updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
err := testK8sClient.Update(ctx, updatedJob)

By("Checking that the queue update fails")
Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
Member

Suggested change
- updatedJob := &kubeflowv1.PyTorchJob{}
- Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")
- updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
- err := testK8sClient.Update(ctx, updatedJob)
- By("Checking that the queue update fails")
- Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
- Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
+ Eventually(func(g Gomega) {
+     updatedJob := &kubeflowv1.PyTorchJob{}
+     g.Expect(testK8sClient.Get(ctx, client.ObjectKeyFromObject(job), updatedJob)).Should(Succeed(), "Failed to get PyTorchJob")
+     updatedJob.Spec.RunPolicy.SchedulingPolicy.Queue = "test"
+     err := testK8sClient.Update(ctx, updatedJob)
+     By("Checking that the queue update fails")
+     g.Expect(err).To(HaveOccurred(), "Expected an error when updating the queue, but update succeeded")
+     g.Expect(err.Error()).To(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message")
+ }, testutil.Timeout, testutil.Interval).Should(Succeed())

The update operation often fails for other reasons, so could you use a retry mechanism to avoid flaky tests?

Contributor Author

Updated

ckyuto and others added 5 commits May 29, 2024 20:44
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
Member

@tenzen-y tenzen-y left a comment

Thank you!
/lgtm
/approve

Comment on lines +216 to +218
Expect(err).To(MatchError(ContainSubstring("spec.runPolicy.schedulingPolicy.queue is immutable"), "The error message did not contain the expected message"))
return err != nil
}, testutil.Timeout, testutil.Interval).Should(BeTrue())
Member

Actually, I don't prefer this approach since it can hide the root cause. OK, let me refine this in another PR.

@google-oss-prow google-oss-prow bot added the lgtm label May 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y, Tomcli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 00f4d52 into kubeflow:master May 30, 2024
39 checks passed
tenzen-y pushed a commit to tenzen-y/training-operator that referenced this pull request Jun 5, 2024
* fix volcano podgroup update issue

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>

* queue value shouldn't be reset once it has been set

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>

* make queue immutable

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>

* add unit test

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>

* add retry for update operation

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>

---------

Signed-off-by: Weiyu Yen <ckyuto@gmail.com>
tenzen-y pushed a commit to tenzen-y/training-operator that referenced this pull request Jun 7, 2024
google-oss-prow bot pushed a commit that referenced this pull request Jun 10, 2024
#2130: Refine the integration tests for the immutable PyTorchJob (#2139)

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Co-authored-by: Weiyu Yen <wyen@linkedin.com>