
IBM Cloud deployment improvements #447

Merged

36 commits merged into main from ibm-cloud-effort on May 11, 2023

Conversation

@Tansito (Member) commented Apr 18, 2023

Summary

Testing our current configuration in IBM Cloud. Fix #423

Details and comments

Because local and IBM Cloud deployments differ, I made our Ingress configuration more dynamic, with the goal of supporting AWS too without changing the template, only the values (as a normal user would do).

  • Removed nginxconfig in favor of adding those annotations in ingress.
  • Renamed ray-ingress to just ingress, now that we are going to expose different applications.
  • ingress is now more dynamic, so you can add the annotations and host configuration through the values file (see the sketch after this list).
  • Since only the gateway + repository are going to be exposed, I changed the serviceType of almost all the services to ClusterIP.
  • Fixed Terraform values that are now required by the project: machine_type, workers_per_zone...
  • Removed the Helm integration in Terraform, because the Helm release now requires values from the cluster after its creation (investigating whether we can obtain them).
  • Changed ray-cluster from a fixed worker to a dynamic set of workers.
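As a rough illustration of the values-driven ingress and service setup described above (key names and annotation values are indicative only; the real schema lives in the chart's values files):

```yaml
# Illustrative values excerpt only -- key names and annotation values are
# examples, not the chart's exact schema.
ingress:
  annotations:
    kubernetes.io/ingress.class: "public-iks-k8s-nginx"   # provider-specific, e.g. an IBM Cloud ALB class
  hosts:
    - host: serverless.example.com       # placeholder hostname
      paths:
        - path: /
          serviceName: gateway
        - path: /repository
          serviceName: repository

gateway:
  service:
    type: ClusterIP   # only reachable through the ingress
repository:
  service:
    type: ClusterIP
```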

Resources

@Tansito Tansito marked this pull request as draft April 18, 2023 12:15
@Tansito (Member Author) commented Apr 18, 2023

@psschwei I don't know if you can help me debug a problem that I found. When installing the Helm chart I'm not able to find kuberay/operator:

(Screenshot 2023-04-18 at 12:12:39: helm install output showing that the kuberay/operator image cannot be found)

@psschwei (Collaborator) commented:

> @psschwei I don't know if you can help me debug a problem that I found. When installing the Helm chart I'm not able to find kuberay/operator:

Looks like the tag should be v0.5.0
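For reference, the relevant values entry would look roughly like this (the key path is indicative; check the kuberay-operator chart for the exact schema):

```yaml
# Indicative only -- the real key path depends on the kuberay-operator chart.
image:
  repository: kuberay/operator
  tag: v0.5.0   # the missing leading "v" was the problem above
```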

@Tansito (Member Author) commented Apr 19, 2023

Totally right, @psschwei. Hours of fighting yesterday and it was just a missing v 🤦. Thank you!

@psschwei (Collaborator) commented:

Is this PR the same config we used to set up the cluster for the dev forum?

@Tansito (Member Author) commented Apr 28, 2023

Yes, but with some real-time changes 😂. For example, we improved the raycluster configuration, we had to modify the realm configuration because of some kind of random error, and there is a problem with the Docker entrypoints that we need to look at before opening it for review. But yeah, it's almost the same as these changes.

@Tansito (Member Author) commented Apr 28, 2023

I'm working on those points right now btw.

@pacomf (Member) commented May 8, 2023

@Tansito tests are failing because: #509 (comment)

@Tansito (Member Author) commented May 8, 2023

Yes, I was discussing it with @psschwei and @IceKhan13 in the issue, thank you @pacomf!

@IceKhan13 (Member) commented:

Either #508 or #512 will fix the failing test :)

@IceKhan13 (Member) commented:

@Tansito can you merge main here? tests are fixed now :)

@@ -33,19 +33,19 @@ spec:
{{- toYaml .Values.securityContext | nindent 12 }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
-          command: [ "/usr/src/app/entrypoint.sh", "gunicorn", "main.wsgi:application", "--bind", "0.0.0.0:8000", "--workers=3" ]
+          args: [ "gunicorn", "main.wsgi:application", "--bind", "0.0.0.0:{{ .Values.service.port }}", "--workers=4" ]
Review comment (Member):

ah, that was a problem of not having migrations :)

Review comment (Member Author):

Mostly, it seems that command overrides the entrypoint that you can have in your container (you always learn something new).
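A generic Kubernetes sketch of the difference (not this chart's exact template): command replaces the image's ENTRYPOINT, while args is handed to the existing ENTRYPOINT as its arguments.

```yaml
# Generic sketch; assumes the image defines ENTRYPOINT ["/usr/src/app/entrypoint.sh"].
containers:
  - name: gateway
    image: example/gateway:latest                      # placeholder image
    # command: ["gunicorn", "main.wsgi:application"]   # would bypass entrypoint.sh entirely
    args:                                              # passed to entrypoint.sh instead
      - "gunicorn"
      - "main.wsgi:application"
      - "--bind"
      - "0.0.0.0:8000"
```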

httpGet:
path: /metrics
port: http
# livenessProbe:
Review comment (Member):

My only comment is this: maybe we should uncomment this? :)
Or is /metrics not available? Or use /api/v1/?

Review comment (Member Author):

You are totally right. My two cents here, from previous experience, is that the security team will ask us for two different endpoints for liveness and readiness: a simple HTTP 200 and a DB access. For the DB access I think /api/v1 can work (I would need to confirm it), but for the HTTP 200 we will need a specific endpoint. That's why I commented them out. If you agree, @IceKhan13, I can create an issue for this.
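Something along these lines once a dedicated endpoint exists (a sketch only; the /health path is hypothetical at this point):

```yaml
# Hypothetical sketch -- /health does not exist yet; both paths are placeholders.
livenessProbe:
  httpGet:
    path: /health        # plain HTTP 200, no DB access
    port: http
readinessProbe:
  httpGet:
    path: /api/v1/       # goes through the application and, indirectly, the DB
    port: http
```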

Review comment (Member):

Makes sense :) Yeah, let's add something like /health in following PRs.

@psschwei (Collaborator) commented May 8, 2023

this failed to install for me:

$ helm -n quantum-serverless install quantum-serverless --create-namespace -f values-ibm.yaml .
coalesce.go:223: warning: destination for loki.gateway.affinity is a table. Ignoring non-table value (podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          {{- include "loki.gatewaySelectorLabels" . | nindent 10 }}
      topologyKey: kubernetes.io/hostname
)
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "GrafanaAgent" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first, resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "LogsInstance" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first, resource mapping not found for name: "loki" namespace: "" from "": no matches for kind "PodLogs" in version "monitoring.grafana.com/v1alpha1"
ensure CRDs are installed first]

(looks like I broke something in ce747be -- seems like we need the grafana-agent CRDs or maybe even the entire operator...)
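One possible, untested way to get those CRDs in place would be to pull the operator in as a chart dependency (the version constraint below is only a guess):

```yaml
# Untested sketch for Chart.yaml -- pulls in the Grafana Agent Operator so the
# GrafanaAgent / LogsInstance / PodLogs CRDs exist before the loki resources render.
dependencies:
  - name: grafana-agent-operator
    repository: https://grafana.github.io/helm-charts
    version: ">=0.2.0"   # assumed constraint; check the repo for the current version
```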

not for this PR, but we may want to look into something like kustomize for some of the provider-specific overlays: https://jfrog.com/blog/power-up-helm-charts-using-kustomize-to-manage-kubernetes-deployments/

also, one minor nit: can we update the readme for IBM-specific install instructions?

@Tansito (Member Author) commented May 9, 2023

> (looks like I broke something in ce747be -- seems like we need the grafana-agent CRDs or maybe even the entire operator...)

@psschwei I'm going to update this PR with master now, but my last deployment in IBM Cloud was with your changes and I think it was working. I can confirm in a few minutes when I try to deploy again.

> we may want to look into something like kustomize for some of the provider-specific overlays

I'm always open to improvements. Before introducing more things into the infrastructure, I would also like to work on the tests: #117 and #118

> also, one minor nit: can we update the readme for IBM-specific install instructions?

You are totally right. I will review it now.

@Tansito (Member Author) commented May 9, 2023

@psschwei I can confirm it. Regardless of the warning:

coalesce.go:223: warning: destination for loki.gateway.affinity is a table. Ignoring non-table value (podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          {{- include "loki.gatewaySelectorLabels" . | nindent 10 }}
      topologyKey: kubernetes.io/hostname
)

I could deploy it in IBM Cloud without a problem.

@psschwei (Collaborator) commented May 9, 2023

Yeah, the warning has been there for a while (I think I opened an issue for it sometime back)... the errors were new for me (and also weird, because I didn't get them before...).

On kustomize, I'm thinking of two things:

  • minimizing duplicates in the config files
  • making it easy for users to set their values (secrets, etc.: basically all the things in the readme we tell them to edit)

No hurry on that, it's more of a day 2 / nice-to-have at this point. Agreed, testing would be better for now 😄
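For example, a provider-specific overlay could look roughly like this (purely illustrative; nothing like it exists in this PR):

```yaml
# overlays/ibm/kustomization.yaml -- hypothetical layout, not part of this PR
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                  # shared manifests defined once
patches:
  - path: ingress-patch.yaml    # provider-specific ingress class / annotations
secretGenerator:
  - name: gateway-secrets       # users supply their own values here instead of editing templates by hand
    literals:
      - DJANGO_SECRET_KEY=changeme   # placeholder value
```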

@Tansito Tansito requested a review from IceKhan13 May 11, 2023 10:40
# Conflicts:
#	infrastructure/helm/quantumserverless/README.md
#	infrastructure/helm/quantumserverless/values.yaml
@psschwei (Collaborator) commented:

> kustomize

Someone this week recommended helmfile as a potentially better fit, so we have options 😄

@IceKhan13 (Member) left a comment

monumental work of setting this all up 👏

@pacomf approved these changes May 11, 2023

@pacomf (Member) left a comment

Here we go! 🚀

@Tansito Tansito merged commit 924e353 into main May 11, 2023
@Tansito Tansito deleted the ibm-cloud-effort branch May 11, 2023 13:19
akihikokuroda pushed a commit that referenced this pull request Aug 22, 2024
Configure providers in staging and prod
Development

Successfully merging this pull request may close these issues.

Deployment: update infrastructure to be deployed in IBM cloud
5 participants