Bump juju 3.1 -> 3.5 #859

Closed
DnPlas opened this issue Apr 2, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

DnPlas commented Apr 2, 2024

Context

According to the Juju roadmap & releases page, support for juju 3.1 stops at the end of April 2024. The next supported version is 3.4, for which bug-fix support ends in April 2024 and security-fix support ends in July 2024.
Because of this, and to better support features in CKF, the charms have to be tested with this version.

NOTE: while the juju 3.5 release is close, some features and user stories need this bump now, for instance canonical/istio-operators#398. After 3.5 is released, the team has to go through this process again.

What needs to get done

  1. Test that the CKF 1.8/stable bundle works well with juju 3.4 using the UATs. CKF 1.7 only supports juju 2.9, so it does not have to be tested.
  2. Bump the juju version in the CI (both controller and client)
  3. Bump charm and testing framework dependencies (ops, python-libjuju, etc.)
  4. Provide an upgrade path from 2.9 (supported in CKF 1.7) to 3.4
  5. (potential) Update any test that needs to be updated because of this change

Merge all of:

Definition of Done

  • The bundle is tested with juju 3.4
  • The CI uses juju 3.4
@DnPlas DnPlas added the enhancement New feature or request label Apr 2, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5503.

This message was autogenerated

DnPlas commented Apr 2, 2024

Initial tests show that:

  • CKF 1.8/stable deployed using a juju 3.4 controller and client goes into active and idle without any major incident. I tested in this environment:
$ juju controllers
Use --refresh option with this command to see the latest information.

Controller  Model     User   Access     Cloud/Region        Models  Nodes  HA  Version
uk8s*       kubeflow  admin  superuser  microk8s/localhost       2      1   -  3.4.1  

$ juju --version
3.4.1-genericlinux-amd64
  • Running the UATs was not an issue, though they do not provide much information about compatibility with the controller. Almost all of them ran successfully; the one exception ended with the following message:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kubeflow-katib 0.15.0 requires grpcio==1.41.1, but you have grpcio 1.51.3 which is incompatible.
kubeflow-katib 0.15.0 requires protobuf==3.19.5, but you have protobuf 3.20.3 which is incompatible.
kfp 2.4.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
------------------------------ Captured log call -------------------------------
INFO     test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
=========================== short test summary info ============================
FAILED test_notebooks.py::test_notebook[e2e-wine-kfp-mlflow-seldon] - Failed:...
FAILED test_notebooks.py::test_notebook[katib-integration] - Failed: AssertionError: Katib Experiment was not successful.
FAILED test_notebooks.py::test_notebook[mlflow-integration] - Failed: Noteboo...
FAILED test_notebooks.py::test_notebook[mlflow-kserve] - Failed: Notebook exe...
FAILED test_notebooks.py::test_notebook[mlflow-minio-integration] - Failed: N...
FAILED test_notebooks.py::test_notebook[training-integration] - Failed: Noteb...
...
File "/home/ubuntu/shared/charmed-kubeflow-uats/driver/test_kubeflow_workloads.py", line 130, in test_kubeflow_workloads
    pytest.fail(
  File "/home/ubuntu/shared/charmed-kubeflow-uats/.tox/uats-remote/lib/python3.10/site-packages/_pytest/outcomes.py", line 198, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Something went wrong while running Job test-kubeflow/test-kubeflow. Please inspect the attached logs for more info...
...
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0
E           RuntimeError: Failed to read logs for pod test-kubeflow/paddle-simple-cpu-worker-0

The message does not seem to be related to the controller, and though we should look into it, we can discard this error as a blocker for bumping the juju version. I will create an issue on canonical/charmed-kubeflow-uats to follow up.
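To make the follow-up issue easier to write, the pip conflict lines in the log above can be pulled out of the full output. This is only a rough sketch: it assumes the conflicts always follow the "X requires Y, but you have Z which is incompatible." wording seen in the log, and `parse_pip_conflicts` is a hypothetical helper name, not part of the UATs.

```python
import re

# Matches pip resolver lines such as:
#   kfp 2.4.0 requires kubernetes<27,>=8.0.0, but you have kubernetes 28.1.0 which is incompatible.
CONFLICT_RE = re.compile(
    r"^(?P<package>\S+) (?P<version>\S+) requires (?P<requirement>.+), "
    r"but you have (?P<installed>\S+) (?P<installed_version>\S+) which is incompatible\.$"
)


def parse_pip_conflicts(log_text):
    """Return one dict per dependency conflict found in the given log text."""
    conflicts = []
    for line in log_text.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            conflicts.append(match.groupdict())
    return conflicts
```

Each returned dict names the package, its requirement, and the version that is actually installed, which is the information needed to decide which pin to relax.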

DnPlas added a commit to canonical/istio-operators that referenced this issue Apr 2, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

This commit also skips some test cases that will be removed in a follow
up commit introduced by #401.

Part of canonical/bundle-kubeflow#859
Part of #398

Signed-off-by: Daniela Plascencia <daniela.plascencia@canonical.com>

DnPlas commented Apr 3, 2024

One limitation I have found is that juju 3.4 does not seem to handle the unit statuses of podspec charms correctly. While most CKF charms follow the sidecar pattern, some still use podspec (like oidc-gatekeeper and kubeflow-volumes). The behaviour I am observing is shown here (with kubeflow-volumes):

$ juju status
Model       Controller  Cloud/Region        Version  SLA          Timestamp
test-istio  uk8s        microk8s/localhost  3.4.1    unsupported  12:26:36Z

App                     Version                Status   Scale  Charm                   Channel      Rev  Address         Exposed  Message
istio-ingressgateway                           active       1  istio-gateway                          0  10.152.183.62   no
istio-pilot                                    active       1  istio-pilot                            0  10.152.183.246  no
kubeflow-volumes        res:oci-image@2261827  waiting      1  kubeflow-volumes        1.8/stable   260                  no       waiting for container
tensorboard-controller                         active       1  tensorboard-controller  latest/edge  266  10.152.183.108  no

Unit                       Workload  Agent  Address      Ports  Message
istio-ingressgateway/0*    active    idle   10.1.60.139
istio-pilot/0*             active    idle   10.1.60.138
kubeflow-volumes/0*        waiting   idle                       waiting for container # <--- waiting for container, but container is running
tensorboard-controller/0*  active    idle   10.1.60.143

$ kubectl get pods -A | grep volumes
test-istio        kubeflow-volumes-operator-0                      1/1     Running   0          23m

Interestingly enough, this is not the case for oidc-gatekeeper when deploying with juju deploy, but it is the case when deploying with model.deploy (e.g. from a test case).

This affects integration tests, as they time out waiting for all units to reach active status.
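To illustrate why a single stuck unit fails the whole test, here is a minimal sketch of a wait loop of the kind the integration tests rely on. It does not use python-libjuju; `get_unit_statuses` is a hypothetical callable that returns a mapping of unit name to workload status, and the timeout values are arbitrary.

```python
import time


def wait_for_active(get_unit_statuses, timeout=600.0, interval=5.0):
    """Poll until every unit reports "active", or raise naming the stuck units.

    get_unit_statuses is a hypothetical callable returning
    {unit_name: workload_status}; it stands in for whatever the test
    framework uses to read the model status.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = get_unit_statuses()
        pending = {unit: status for unit, status in statuses.items() if status != "active"}
        if not pending:
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"units not active after {timeout}s: {pending}")
        time.sleep(interval)
```

With the status shown above, kubeflow-volumes/0 stays in waiting, so a loop like this never returns and eventually raises, even though the pod itself is running.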

DnPlas added a commit to canonical/oidc-gatekeeper-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
DnPlas added a commit to canonical/kubeflow-volumes-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
DnPlas added a commit to canonical/kubeflow-volumes-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859

DnPlas commented Apr 4, 2024

I did a couple more tests and here are my findings:

2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "ingress-auth-relation-changed" hook (via hook dispatching script: dispatch)
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:38:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:11 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:42:12 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:20 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:47:21 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:51:43 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-04-04 09:56:52 INFO juju.worker.caasoperator.uniter.oidc-gatekeeper/0.actions resolver.go:61 actions are blocked=true; outdated remote charm=true - have pending actions: []

I am currently talking to the juju team about it.
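When sharing logs like the ones above, a short summary can be easier to read than the raw lines. The following is a sketch only: it assumes the log format shown above (hook runs reported as `ran "<hook>" hook`, and the `outdated remote charm=true` message), and `summarize_uniter_log` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the hook name in lines such as: ran "update-status" hook (via ...)
HOOK_RE = re.compile(r'ran "(?P<hook>[^"]+)" hook')


def summarize_uniter_log(log_text):
    """Count hook runs per hook name and lines reporting an outdated remote charm."""
    hooks = Counter()
    outdated_lines = 0
    for line in log_text.splitlines():
        match = HOOK_RE.search(line)
        if match:
            hooks[match.group("hook")] += 1
        if "outdated remote charm=true" in line:
            outdated_lines += 1
    return {"hooks": dict(hooks), "outdated_remote_charm_lines": outdated_lines}
```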

DnPlas added a commit to canonical/oidc-gatekeeper-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms, plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859

DnPlas commented Apr 4, 2024

Update about oidc-gatekeeper and kubeflow-volumes:

When bumping each charm's CI in both main and track/<version>, all the tests pass, which now makes it look like the istio-operators integration tests are the actual cause of the issue. To unblock that CI, I have tried swapping kubeflow-volumes with tensorboards-web-app as the ingress requirer application, just to see if there is a difference. In the long run, and to avoid having to deploy charms that change so often, we should have a generic ingress requirer charm that helps check the ingress relation but doesn't actually do anything else.

DnPlas commented Apr 4, 2024

After closer inspection of each of the failing CIs, it looks like the charms that were deployed in the istio-operators CI were really outdated and had an ops version < 2.x, causing some collisions with juju 3.4. canonical/istio-operators#405 should fix the issues in that repo's CI. For the rest of the repositories, I don't seem to be finding issues, but I'll keep an eye out for places where the charm version is outdated.

Related to: #857
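One way to spot outdated charms of this kind is to look at the ops pin in each charm's requirements. The sketch below assumes ops is pinned with `==` on its own line in a requirements file; that layout is an assumption, and both function names are hypothetical.

```python
import re

# Matches an exact pin such as "ops==1.5.4" at the start of a line.
OPS_PIN_RE = re.compile(r"^ops==(\d+)\.", re.MULTILINE)


def ops_major_version(requirements_text):
    """Return the pinned major version of ops, or None if no exact pin is found."""
    match = OPS_PIN_RE.search(requirements_text)
    return int(match.group(1)) if match else None


def is_outdated_ops(requirements_text):
    """True when ops is pinned to a version below 2.x."""
    major = ops_major_version(requirements_text)
    return major is not None and major < 2
```

A charm with no exact pin returns None rather than being flagged, so those cases still need a manual look.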

ca-scribner pushed a commit to canonical/oidc-gatekeeper-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms, plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
DnPlas added a commit to canonical/oidc-gatekeeper-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms, plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
DnPlas added a commit to canonical/kubeflow-volumes-operator that referenced this issue Apr 4, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859

Co-authored-by: Andrew Scribner <ca.scribner+1@gmail.com>

DnPlas commented Apr 5, 2024

I think this effort is big enough to be split into smaller tasks, and we should definitely involve more people from the team, as at the moment some changes have to happen manually. I think this task can be completed by doing the following:

  1. Bump all versions - Do a cannon run to bump all the instances of juju-channel in all .github/workflows/integrate.yaml across repositories. At the same time, bump the versions of ops, pytest-operator, and python-libjuju.

  2. Pin charm dependencies (kind of optional) - All the charms that get deployed as dependencies in integration tests must be pinned to their corresponding 1.8 stable channels in the gh:track/ branches. For instance, istio-operators deploys kubeflow-volumes in its integration tests, so we must ensure that the last supported stable version of kubeflow-volumes gets deployed alongside the last supported stable istio-operators.

  3. Promote to stable - once all of the necessary changes are merged, all of our 30+ charms have to be promoted from /edge to /stable.

    • This is a manual process: go to the repo → go to Actions → run the promote action for each charm in the repo. If we opt for the manual process because of customer timelines, we may need to assign ~5 charms to each engineer on the team.

    • We could add a workflow dispatch that promotes all charms, but this will add more work to the task. I suggest we do, as we will need this in the future.

  4. Manual testing - to ensure every charm can be deployed individually and as a bundle with juju 3.4.
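Step 1 above can be sketched as a small text rewrite applied to each .github/workflows/integrate.yaml file. This is a minimal sketch, not the tooling the team uses: it assumes the workflows contain plain `juju-channel: <value>` lines with no trailing comments, and `bump_juju_channel` is a hypothetical helper name.

```python
import re

# Matches a whole "juju-channel: <value>" line, keeping its indentation in group 1.
# Lines with a trailing comment are deliberately left untouched.
JUJU_CHANNEL_RE = re.compile(r"^([ \t]*juju-channel:[ \t]*)\S+[ \t]*$", re.MULTILINE)


def bump_juju_channel(workflow_text, new_channel):
    """Return workflow_text with every juju-channel value replaced by new_channel."""
    return JUJU_CHANNEL_RE.sub(lambda m: m.group(1) + new_channel, workflow_text)
```

For example, `bump_juju_channel(text, "3.4/stable")` rewrites only the juju-channel lines and leaves the rest of the workflow as it was, so the resulting diff stays easy to review.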

DnPlas commented Apr 5, 2024

After closer inspection of each of the failing CIs, it looks like the charms that were deployed in the istio-operators CI were really outdated and had an ops version < 2.x, causing some collisions with juju 3.4. canonical/istio-operators#405 should fix the issues in that repo's CI. For the rest of the repositories, I don't seem to be finding issues, but I'll keep an eye out for places where the charm version is outdated.

Related to: #857

While this change fixed the problem for some charms in the istio-operators CI, it did not solve the problem entirely. At first glance, it looks like podspec charms deployed from Charmhub are having some trouble; this is the case for kubeflow-volumes.
There is an ongoing conversation with the juju team here.

DnPlas added a commit to canonical/kubeflow-volumes-operator that referenced this issue Apr 11, 2024
…ts files

Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
Part of canonical/bundle-kubeflow#862
DnPlas added a commit to canonical/kubeflow-volumes-operator that referenced this issue Apr 11, 2024
Bumping juju and ops packages to use them in newer versions of the charms,
plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
Part of canonical/bundle-kubeflow#862
DnPlas added a commit to canonical/oidc-gatekeeper-operator that referenced this issue Apr 11, 2024
…arms, plus testing them in a CI with a more recent juju version.

Part of canonical/bundle-kubeflow#859
Part of canonical/bundle-kubeflow#862
ca-scribner added a commit to canonical/oidc-gatekeeper-operator that referenced this issue May 8, 2024
ca-scribner added a commit to canonical/oidc-gatekeeper-operator that referenced this issue May 8, 2024
NohaIhab pushed a commit to canonical/oidc-gatekeeper-operator that referenced this issue May 9, 2024
ca-scribner added a commit to canonical/oidc-gatekeeper-operator that referenced this issue May 9, 2024
NohaIhab pushed a commit to canonical/kfp-operators that referenced this issue May 9, 2024
@DnPlas DnPlas changed the title Bump juju 3.1 -> 3.4 Bump juju 3.1 -> 3.5 May 21, 2024

DnPlas commented May 21, 2024

Because juju 3.5 was available sooner than 3.4, the team has decided to go with that version instead. The work for this is not affected.

DnPlas commented Jun 5, 2024

Since all of the GitHub CIs are now running juju 3.5, we can close this issue.

@DnPlas DnPlas closed this as completed Jun 5, 2024
DnPlas added a commit to canonical/mlflow-operator that referenced this issue Jun 18, 2024
DnPlas added a commit to canonical/mlflow-operator that referenced this issue Jun 18, 2024
DnPlas added a commit to canonical/mlflow-operator that referenced this issue Jun 26, 2024
NohaIhab pushed a commit to canonical/kserve-operators that referenced this issue Jul 30, 2024