-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(deps): add kubeflow-training to workbench images #826
feat(deps): add kubeflow-training to workbench images #826
Conversation
Skipping CI for Draft Pull Request. |
429e683
to
b2d3631
Compare
], | ||
"markers": "python_version >= '3.8'", | ||
"version": "==2.1.0" | ||
"version": "==1.2.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding this executing package - On my local system:
pipenv install
pipenv graph | less
Original:
ipykernel==6.29.5
├── comm [required: >=0.1.1, installed: 0.2.2]
│ └── traitlets [required: >=4, installed: 5.14.3]
├── debugpy [required: >=1.6.5, installed: 1.8.11]
├── ipython [required: >=7.23.1, installed: 8.31.0]
│ ├── decorator [required: Any, installed: 5.1.1]
│ ├── jedi [required: >=0.16, installed: 0.19.2]
│ │ └── parso [required: >=0.8.4,<0.9.0, installed: 0.8.4]
│ ├── matplotlib-inline [required: Any, installed: 0.1.7]
│ │ └── traitlets [required: Any, installed: 5.14.3]
│ ├── pexpect [required: >4.3, installed: 4.9.0]
│ │ └── ptyprocess [required: >=0.5, installed: 0.7.0]
│ ├── prompt_toolkit [required: >=3.0.41,<3.1.0, installed: 3.0.48]
│ │ └── wcwidth [required: Any, installed: 0.2.13]
│ ├── Pygments [required: >=2.4.0, installed: 2.18.0]
│ ├── stack-data [required: Any, installed: 0.6.3]
│ │ ├── asttokens [required: >=2.1.0, installed: 3.0.0]
│ │ ├── executing [required: >=1.2.0, installed: 2.1.0]
│ │ └── pure_eval [required: Any, installed: 0.2.3]
│ ├── traitlets [required: >=5.13.0, installed: 5.14.3]
│ └── typing_extensions [required: >=4.6, installed: 4.12.2]
Updated:
ipykernel==6.29.5
├── comm [required: >=0.1.1, installed: 0.2.2]
│ └── traitlets [required: >=4, installed: 5.14.3]
├── debugpy [required: >=1.6.5, installed: 1.8.11]
├── ipython [required: >=7.23.1, installed: 8.31.0]
│ ├── decorator [required: Any, installed: 5.1.1]
│ ├── jedi [required: >=0.16, installed: 0.19.2]
│ │ └── parso [required: >=0.8.4,<0.9.0, installed: 0.8.4]
│ ├── matplotlib-inline [required: Any, installed: 0.1.7]
│ │ └── traitlets [required: Any, installed: 5.14.3]
│ ├── pexpect [required: >4.3, installed: 4.9.0]
│ │ └── ptyprocess [required: >=0.5, installed: 0.7.0]
│ ├── prompt_toolkit [required: >=3.0.41,<3.1.0, installed: 3.0.48]
│ │ └── wcwidth [required: Any, installed: 0.2.13]
│ ├── Pygments [required: >=2.4.0, installed: 2.18.0]
│ ├── stack-data [required: Any, installed: 0.6.3]
│ │ ├── asttokens [required: >=2.1.0, installed: 3.0.0]
│ │ ├── executing [required: >=1.2.0, installed: 1.2.0]
│ │ └── pure_eval [required: Any, installed: 0.2.3]
│ ├── traitlets [required: >=5.13.0, installed: 5.14.3]
│ └── typing_extensions [required: >=4.6, installed: 4.12.2]
I don't get it neither. Here are the package changes. Only this change seem to be relevant from the functional POV.
I tried to trigger the generic GHA we have to check what is gonna be resolved there. Let's wait for the result, see here.
Edit: so, GHA put the version of the executing
package back to 2.1.0, see here. Thus this looks like something totally env related and as another proof we should rely on the common way how to update these pipfile lock files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rely on the common way how to update these pipfile lock files
can you articulate what you mean by "the common way" ?
i attempted to try and "control" env-related issues by writing this utility to add/remove dependencies:
i.e. i wasn't just "running this on MacOS"...
not claiming my methodology is correct - just providing insight into how I did the work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential discrepancy... i had seen (from work on create-release
GHA action) ubuntu 22 in use
so when I created notebook-utils
- I based it off the ubuntu:22
image... but I see in other GHAs we are using ubuntu-latest
which equates (as of end Sept) to ubuntu-24
for GHAs
I will update my util tomorrow and see how this executing
package resolution behaves...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you articulate what you mean by "the common way" ?
Yeah, I meant just the GHA as it can be considered as the only standard way/env we have now that is common to all of us.
I based it off the ubuntu:22 image... but I see in other GHAs we are using ubuntu-latest which equates (as of end Sept) to ubuntu-24 for GHAs
Yeah, it can be it. As I wrote above - even in my local run, I see the same results as you do. So I suspect something behaves differently when we run in GHA. It would be great if we could dig down what is causing these differences, so that we don't need to trigger this via GHA. Maybe we could update the GHA to extract necessary logic to some script that can be run locally (both using same container image as GHA does). But I didn't check more and if we're gonna switch from pipenv to poetry or something else, this may be irrelevant work in the end 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Darn it.. dead end.. I forgot I changed the "base image" I am using to add/remote dependencies to base/ubi9-python-3.11
so using that as a reference... I have the following images on my system...
$ podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
...
quay.io/rh-ee-astonebe/workbench-images base-ubi9-python-3.11-rhoaieng_16587_cache 86dff33bd9cc 5 days ago 1.51 GB
docker.io/library/ubuntu 24.10 e40b6e31bd8c 7 weeks ago 82.6 MB
ubuntu:24.10
(simulating - i think/hope - what GHA would be doing)
$ podman run --platform linux/amd64 e40b6e31bd8 uname -a
Linux d8d2206ee6fe 6.11.3-200.fc40.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:53:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
workbench-images:base-ubi9-python-3.11-rhoaieng_16587_cache
$ podman run --platform linux/amd64 86dff33bd9cc uname -a
Linux 7a999269de8e 6.11.3-200.fc40.aarch64 #1 SMP PREEMPT_DYNAMIC Thu Oct 10 22:53:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Based on the same underlying linux version.. so unlikely to be cause.. unless the baser base image of registry.access.redhat.com/ubi9/python-311
is doing something funky 🤔
Certainly I am biased (as I naturally like the tool I wrote) - but it feels more right to run the dependency resolution within a container that is actually used at runtime... Regardless, I need to rebuild stuff today... we can decide tomorrow AM if we wanna subsequently "zap" the Pipfile's with the GHA..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HA! check this out (from me re-running my script)
So, annoyingly, it seems the difference here is one of pipenv install
vs. pipenv lock
On prior runs of my script - as I was introducing net new packages - it was naturally doing a pipenv install
for this "re-run" - where the Pipfile
already had the dependencies added - I added new functionality to notebook-utils
to make it smart enough to do a pipenv lock
if no new dependencies were being introduced.. and now we see executing == 2.1.0
pulled back in.
The GHA of course only does pipenv lock
- so the discrepancy here appears (somewhat shockingly) to be related to the pipenv
command ran?!?!
But, hey.. the outcome on this PR now doesn't require discussion (although its been fun) - and we all agree we wanna ditch pipenv
ASAP - so we probably just shrug and move on 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly I am biased (as I naturally like the tool I wrote) - but it feels more right to run the dependency resolution within a container that is actually used at runtime...
+1, gha should do it the way you do, imo, pipenv lock
in ubi container
/test all |
/test notebook-jupyter-pytorch-ubi9-python-3-11-pr-image-mirror |
/test notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror |
/test notebook-cuda-jupyter-ds-ubi9-python-3-11-pr-image-mirror |
/test notebooks-ubi9-e2e-tests |
/test codeserver-notebook-e2e-tests |
b2d3631
to
70c151d
Compare
/test codeserver-notebook-e2e-tests |
70c151d
to
f9958e4
Compare
b8b16e3
to
8fd3a37
Compare
This commit adds `kubeflow-training[huggingface]` to the following workbench images: - `./jupyter/datascience/ubi9-python-3.11` - `./jupyter/pytorch/ubi9-python-3.11` - `./jupyter/rocm/pytorch/ubi9-python-3.11` - `./jupyter/trustyai/ubi9-python-3.11` - `./codeserver/ubi9-python-3.11` This outcome comes with a slew of caveats and disclaimers: - Due to a dependency conflict, `codeflare-sdk==0.24.3` was **also** pulled into the following workbench images. - `./jupyter/datascience/ubi9-python-3.11` - `./jupyter/pytorch/ubi9-python-3.11` - `./jupyter/rocm/pytorch/ubi9-python-3.11` - `./jupyter/trustyai/ubi9-python-3.11` -⚠️ In what may be a "controversial" decision, `codeflare-sdk` was **NOT** updated on other workbench images. Since `0.24.3` was a "one-off" release to unblock the `kubeflow-training` inclusion - the thought process here is that normal "sync" procedures on the next official release will standardize the `codeflare-sdk` dependency across all workbench images. This allows us to restrict the testing effort of this commit. - However, as of opendatahub-io@2bd35f7, `codeflare-sdk` `0.24.3` was being pulled into our `Pipfile.lock` file - which is why you won't see `Pipfile.lock` addition of `codeflare-sdk` on **this PR** - `jupyter/minmal/ubi9-python-3.11` was deliberately excluded from receiving `kubeflow-training` per discussions with team. - Due to dependency conflicts discovered `tensorflow`-based workbench images,`kubeflow-training` has not been added to those workbench images at this time. This decision was agreed to by affect stakeholders. Core blocking issue can be seen here: - See #2328 in https://github.com/onnx/tensorflow-onnx/issues - Due to a dependency conflict, `transformers = "==4.38.0"` was **also** added to the`./jupyter/trustyai/ubi9-python-3.11` workbench image after discussion with the developer that last worked on the `trustyai` image. While it certainly must be tested, there was no strict requirement that necessitated pinning the `transformers` dependency to `4.36.2` - and the `huggingface` `extras` now introduces a `4.38.0` constraint for `transformers`. Related-to: https://issues.redhat.com/browse/RHOAIENG-12822
8fd3a37
to
b84b6b5
Compare
Looks good to me. I've checked the images generated from this PR: https://quay.io/repository/rh-ee-astonebe/workbench-images?tab=tags |
/test notebook-rocm-jupyter-pyt-ubi9-python-3-11-pr-image-mirror |
/lgtm |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andyatmiami The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/override ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror ci/prow/codeserver-notebook-e2e-tests |
@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/codeserver-notebook-e2e-tests, ci/prow/notebook-rocm-jupyter-tf-ubi9-python-3-11-pr-image-mirror In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override ci/prow/rocm-notebooks-e2e-tests |
@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/rocm-notebooks-e2e-tests In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override ci/prow/notebooks-ubi9-e2e-tests |
@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/notebooks-ubi9-e2e-tests In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override ci/prow/images |
@andyatmiami: Overrode contexts on behalf of andyatmiami: ci/prow/images In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override "build (rocm-jupyter-tensorflow-ubi9-python-3.11) / build" |
@andyatmiami: Overrode contexts on behalf of andyatmiami: build (rocm-jupyter-tensorflow-ubi9-python-3.11) / build In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override "ci/prow/build (rocm-jupyter-tensorflow-ubi9-python-3.11) / build" |
@andyatmiami: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override "build (cuda-jupyter-tensorflow-ubi9-python-3.11) / build" |
@andyatmiami: Overrode contexts on behalf of andyatmiami: build (cuda-jupyter-tensorflow-ubi9-python-3.11) / build In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/override "ci/prow/force failure to debug" |
@andyatmiami: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
d8917bf
into
opendatahub-io:main
Description
This commit adds
kubeflow-training[huggingface]
to the following workbench images:./jupyter/datascience/ubi9-python-3.11
./jupyter/pytorch/ubi9-python-3.11
./jupyter/rocm/pytorch/ubi9-python-3.11
./jupyter/trustyai/ubi9-python-3.11
./codeserver/ubi9-python-3.11
This outcome comes with a slew of caveats and disclaimers:
codeflare-sdk~=0.24.3
was also added to the following workbench images../jupyter/datascience/ubi9-python-3.11
./jupyter/pytorch/ubi9-python-3.11
./jupyter/rocm/pytorch/ubi9-python-3.11
./jupyter/trustyai/ubi9-python-3.11
codeflare-sdk
was NOT updated on other workbench images. Since0.24.3
was a "one-off" release to unblock thekubeflow-training
inclusion - the thought process here is that normal "sync" procedures on the next official release will standardize thecodeflare-sdk
dependency across all workbench images. This allows us to restrict the testing effort of this commit.codeflare-sdk
0.24.3
was being pulled into ourPipfile.lock
file - which is why you won't seePipfile.lock
addition ofcodeflare-sdk
on this PRjupyter/minmal/ubi9-python-3.11
was deliberately excluded from receivingkubeflow-training
per discussions with team.tensorflow
-based workbench images,kubeflow-training
has not been added to those workbench images at this time. This decision was agreed to by affect stakeholders. Core blocking issue can be seen here:transformers = "==4.38.0"
was also added to the./jupyter/trustyai/ubi9-python-3.11
workbench image after discussion with the developer that last worked on thetrustyai
image. While it certainly must be tested, there was no strict requirement that necessitated pinning thetransformers
dependency to4.36.2
- and thehuggingface
extras
now introduces a4.38.0
constraint fortransformers
.Related-to: https://issues.redhat.com/browse/RHOAIENG-12822
How Has This Been Tested?
test-
targets of theMakefile
locally to ensure successful completion. This is handled automatically by the PR checks - but wanted the experience/knowledge of how to do this from my local machineephemeral-storage
availability on my OSIA cluster - I was unable to verify therocm
workbench image locallyMerge criteria: