
IBM Cloud deployment improvements #447

Merged

36 commits merged into main from ibm-cloud-effort on May 11, 2023

Conversation

@Tansito (Member) commented Apr 18, 2023

Summary

Testing our current configuration in IBM Cloud. Fix #423

Details and comments

Because local and IBM Cloud deployments differ, I made our Ingress configuration more dynamic, with the goal of supporting AWS too without changing the template, only the values (as a normal user would do).

  • Removed nginxconfig in favor of adding those annotations in ingress.
  • Renamed ray-ingress to just ingress, now that we are going to expose different applications.
  • ingress is now more dynamic, so you can add the annotations and host configuration through the values file (see the sketch after this list).
  • Since only the gateway + repository are going to be exposed, I changed the serviceType of almost all the services to ClusterIP.
  • Fixed Terraform values that are now required by the project: machine_type, workers_per_zone...
  • Removed the Helm integration in Terraform, because the Helm release now requires values from the cluster after its creation (investigating whether we can obtain them).
  • Changed ray-cluster from a fixed worker to a dynamic set of workers.
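As a rough illustration of the values-driven ingress and service setup described above (key names and annotation values are indicative only; the real schema lives in the chart's values files):

```yaml
# Illustrative values excerpt only -- key names and annotation values are
# examples, not the chart's exact schema.
ingress:
  annotations:
    kubernetes.io/ingress.class: "public-iks-k8s-nginx"   # provider-specific, e.g. an IBM Cloud ALB class
  hosts:
    - host: serverless.example.com       # placeholder hostname
      paths:
        - path: /
          serviceName: gateway
        - path: /repository
          serviceName: repository

gateway:
  service:
    type: ClusterIP   # only reachable through the ingress
repository:
  service:
    type: ClusterIP
```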

Resources

@Tansito Tansito marked this pull request as draft April 18, 2023 12:15
@Tansito (Member Author) commented Apr 18, 2023

@psschwei I don't know if you can help me debug a problem that I found. When installing the Helm chart I'm not able to find kuberay/operator:

(Screenshot 2023-04-18 at 12:12:39: helm install output showing that the kuberay/operator image cannot be found)

@psschwei (Collaborator) commented:

> @psschwei I don't know if you can help me debug a problem that I found. When installing the Helm chart I'm not able to find kuberay/operator:

Looks like the tag should be v0.5.0
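For reference, the relevant values entry would look roughly like this (the key path is indicative; check the kuberay-operator chart for the exact schema):

```yaml
# Indicative only -- the real key path depends on the kuberay-operator chart.
image:
  repository: kuberay/operator
  tag: v0.5.0   # the missing leading "v" was the problem above
```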

@Tansito (Member Author) commented Apr 19, 2023

Totally right, @psschwei. Hours of fighting yesterday and it was just a missing v 🤦. Thank you!

@psschwei (Collaborator) commented:

Is this PR the same config we used to set up the cluster for the dev forum?

@Tansito (Member Author) commented Apr 28, 2023

Yes, but with some real-time changes 😂. For example, we improved the raycluster configuration, we had to modify the realm configuration because of some kind of random error, and there is a problem with the Docker entrypoints that we need to look at before opening it for review. But yeah, it's almost the same as these changes.

@Tansito (Member Author) commented Apr 28, 2023

I'm working on those points right now btw.

@pacomf (Member) commented May 8, 2023

@Tansito tests are failing because: #509 (comment)

@Tansito (Member Author) commented May 8, 2023

Yes, I was discussing it with @psschwei and @IceKhan13 in the issue, thank you @pacomf!

@IceKhan13 (Member) commented:

Either #508 or #512 will fix the failing test :)

@IceKhan13 (Member) commented:

@Tansito can you merge main here? tests are fixed now :)

@@ -33,19 +33,19 @@ spec:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
-          command: [ "/usr/src/app/entrypoint.sh", "gunicorn", "main.wsgi:application", "--bind", "0.0.0.0:8000", "--workers=3" ]
+          args: [ "gunicorn", "main.wsgi:application", "--bind", "0.0.0.0:{{ .Values.service.port }}", "--workers=4" ]
Review comment (Member):

ah, that was a problem of not having migrations :)

Review comment (Member Author):

Mostly, it seems that command overrides the entrypoint that you can have in your container (you always learn something new).
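A generic Kubernetes sketch of the difference (not this chart's exact template): command replaces the image's ENTRYPOINT, while args is handed to the existing ENTRYPOINT as its arguments.

```yaml
# Generic sketch; assumes the image defines ENTRYPOINT ["/usr/src/app/entrypoint.sh"].
containers:
  - name: gateway
    image: example/gateway:latest                      # placeholder image
    # command: ["gunicorn", "main.wsgi:application"]   # would bypass entrypoint.sh entirely
    args:                                              # passed to entrypoint.sh instead
      - "gunicorn"
      - "main.wsgi:application"
      - "--bind"
      - "0.0.0.0:8000"
```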

httpGet:
path: /metrics
port: http
# livenessProbe:
Review comment (Member):

My only comment is this: maybe we should uncomment this? :)
Or is /metrics not available? Or use /api/v1/?

Review comment (Member Author):

You are totally right. My two cents here, from previous experience, is that the security team will ask us for two different endpoints for liveness and readiness: a simple HTTP 200 and a DB access. For the DB access I think /api/v1 can work (I would need to confirm it), but for the HTTP 200 we will need a specific endpoint. That's why I commented them out. If you agree, @IceKhan13, I can create an issue for this.
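Something along these lines once a dedicated endpoint exists (a sketch only; the /health path is hypothetical at this point):

```yaml
# Hypothetical sketch -- /health does not exist yet; both paths are placeholders.
livenessProbe:
  httpGet:
    path: /health        # plain HTTP 200, no DB access
    port: http
readinessProbe:
  httpGet:
    path: /api/v1/       # goes through the application and, indirectly, the DB
    port: http
```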

Review comment (Member):

Makes sense :) Yeah, let's add something like /health in following PRs.

@psschwei (Collaborator) commented May 8, 2023

this failed to install for me:

$ helm -n quantum-serverless install quantum-serverless --create-namespace -f values-ibm.yaml .
coalesce.go:223: warning: destination for loki.gateway.affinity is a table. Ignoring non-table value (podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          {{- include "loki.gatewaySelectorLabels" . | nindent 10 }}
      topologyKey: kubernetes.io/hostname
)
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "GrafanaAgent" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first, resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "LogsInstance" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first, resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "PodLogs" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first]

(looks like I broke something in ce747be -- seems like we need the grafana-agent CRDs or maybe even the entire operator...)
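One possible, untested way to get those CRDs in place would be to pull the operator in as a chart dependency (the version constraint below is only a guess):

```yaml
# Untested sketch for Chart.yaml -- pulls in the Grafana Agent Operator so the
# GrafanaAgent / LogsInstance / PodLogs CRDs exist before the loki resources render.
dependencies:
  - name: grafana-agent-operator
    repository: https://grafana.github.io/helm-charts
    version: ">=0.2.0"   # assumed constraint; check the repo for the current version
```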

not for this PR, but we may want to look into something like kustomize for some of the provider-specific overlays: https://jfrog.com/blog/power-up-helm-charts-using-kustomize-to-manage-kubernetes-deployments/

also, one minor nit: can we update the readme for IBM-specific install instructions?

@Tansito (Member Author) commented May 9, 2023

> (looks like I broke something in ce747be -- seems like we need the grafana-agent CRDs or maybe even the entire operator...)

@psschwei I'm going to update this PR with master now, but my last deployment in IBM Cloud was with your changes and I think it was working. I can confirm in a few minutes when I try to deploy again.

> we may want to look into something like kustomize for some of the provider-specific overlays

I'm always open to improvements. Before introducing more things into the infrastructure, I would also like to work on the tests: #117 and #118

> also, one minor nit: can we update the readme for IBM-specific install instructions?

You are totally right. I will review it now.

@Tansito (Member Author) commented May 9, 2023

@psschwei I can confirm it. Regardless of the warning:

coalesce.go:223: warning: destination for loki.gateway.affinity is a table. Ignoring non-table value (podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          {{- include "loki.gatewaySelectorLabels" . | nindent 10 }}
      topologyKey: kubernetes.io/hostname
)

I could deploy it in IBM Cloud without a problem.

@psschwei (Collaborator) commented May 9, 2023

Yeah, the warning has been there for a while (I think I opened an issue for it sometime back)... the errors were new for me (and also weird, because I didn't get them before...).

On kustomize, I'm thinking of two things:

  • minimizing duplicates in the config files
  • making it easy for users to set their values (secrets, etc.: basically all the things in the readme we tell them to edit)

No hurry on that, it's more of a day 2 / nice-to-have at this point. Agreed, testing would be better for now 😄
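For example, a provider-specific overlay could look roughly like this (purely illustrative; nothing like it exists in this PR):

```yaml
# overlays/ibm/kustomization.yaml -- hypothetical layout, not part of this PR
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                  # shared manifests defined once
patches:
  - path: ingress-patch.yaml    # provider-specific ingress class / annotations
secretGenerator:
  - name: gateway-secrets       # users supply their own values here instead of editing templates by hand
    literals:
      - DJANGO_SECRET_KEY=changeme   # placeholder value
```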

@Tansito Tansito requested a review from IceKhan13 May 11, 2023 10:40
# Conflicts:
#	infrastructure/helm/quantumserverless/README.md
#	infrastructure/helm/quantumserverless/values.yaml
@psschwei (Collaborator) commented:

> kustomize

Someone this week recommended helmfile as a potentially better fit, so we have options 😄

@IceKhan13 (Member) left a comment

monumental work of setting this all up 👏

@pacomf approved these changes May 11, 2023

@pacomf (Member) left a comment

Here we go! 🚀

@Tansito Tansito merged commit 924e353 into main May 11, 2023
@Tansito Tansito deleted the ibm-cloud-effort branch May 11, 2023 13:19
akihikokuroda pushed a commit that referenced this pull request Aug 22, 2024
Configure providers in staging and prod
Development

Successfully merging this pull request may close these issues.

Deployment: update infrastructure to be deployed in IBM cloud
5 participants