Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify service account for kaniko jobs #352

Merged
merged 19 commits into from
Feb 24, 2023
Merged

Specify service account for kaniko jobs #352

merged 19 commits into from
Feb 24, 2023

Conversation

shydefoo
Copy link
Contributor

@shydefoo shydefoo commented Feb 21, 2023

What this PR does / why we need it:
This PR adds the option to set the service account for kaniko jobs using the IMG_BUILDER_KANIKO_SERVICE_ACCOUNT environment variable. If this environment variable is empty, the existing behaviour will be used (ie mounting of K8s secret containing Google service account key into the pod).

The helm chart has been modified to create the Kaniko job K8s SA if needed. Else, the K8s SA can be created externally and specified in the helm chart values for merlin to use.

Previously, all kaniko jobs will have:

  1. K8s secret mounted into pod
  2. GOOGLE_APPLICATION_CREDENTIALS env var set
  3. Dockerfile will run gcloud auth activate-service-account command

When this environment variable is set, the K8s kaniko image builder jobs (prediction and batch) will use the specified K8s service account, and not mount the K8s secret as part of the job's PodSpec.

The dockerfile used by batch-prediction and pyfunc-server is modified to run gcloud auth activate-service-account command only if GOOGLE_APPLICATION_CREDENTIALS argument is set.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

NONE


Checklist

  • Added unit test, integration, and/or e2e tests
  • Tested locally
  • Updated documentation
  • Update Swagger spec if the PR introduce API changes
  • Regenerated Golang and Python client if the PR introduce API changes

@shydefoo shydefoo marked this pull request as ready for review February 21, 2023 03:30
* upstream/main:
  Refactor deployed model labels (#346)
  [Pyfunc] Add ability to push metrics to Prometheus push gateway (#311)
Copy link
Member

@pradithya pradithya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @shydefoo . Can you test it in dev? The e2e test in github doesn't cover pyfunc image build.

apiVersion: v1
kind: ServiceAccount
metadata:
name: {{ .Values.merlin.imageBuilder.serviceAccount.name }}-{{ .Release.Name }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there any reason not to use full name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on, i think this is wrong, will change to just use

{{ .Values.merlin.imageBuilder.serviceAccount.name }}

without the {{ .Release.Name }}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But without the release name, there could be conflict of service account when there is multiple merlin release in same namespace?

Copy link
Contributor Author

@shydefoo shydefoo Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that was my initial thought, but then when specifying service account specified externally it should respect the name provided. Pushing a fix for this.

@@ -152,6 +152,10 @@ spec:
- name: IMG_BUILDER_CONTEXT_SUB_PATH
value: "{{ .Values.merlin.imageBuilder.contextSubPath }}"
{{- end }}
{{- if .Values.merlin.imageBuilder.serviceAccount.name }}
- name: IMG_BUILDER_KANIKO_SERVICE_ACCOUNT
value: "{{ .Values.merlin.imageBuilder.serviceAccount.name }}-{{ .Release.Name }}"
Copy link
Member

@pradithya pradithya Feb 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Similar question to above

@shydefoo
Copy link
Contributor Author

shydefoo commented Feb 22, 2023

Can you test it in dev?

Yep i have tested it in dev environment manually, image building is successful for both model service and batch predictor.

@pradithya
Copy link
Member

Can you test it in dev?
Yep i have tested it in dev environment manually, image building is successful for both model service and batch predictor.

Perfect!

@@ -348,7 +348,7 @@ jobs:
LOCAL_REGISTRY: "dev.localhost"
INGRESS_HOST: "127.0.0.1.nip.io"
MLP_CHART_VERSION: 0.3.4
E2E_PYTHON_VERSION: "3.10"
E2E_PYTHON_VERSION: "3.10.6"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version is specified to as leaving it as 3.10 gives this error: see this run https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. @ariefrahmansyah I think it's got to do with how the CI use cache key https://github.com/gojek/merlin/blob/36c63d3775b5b08d518dde87d35385c5c8cb00e0/.github/workflows/merlin.yml#L365 , I am not sure why we use ${{ matrix.python-version }}, this task doesn't have matrix right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more on why specifying E2E_PYTHON_VERSION to 3.10.6 solve the issue?

-> I am not sure why we use ${{ matrix.python-version }}, this task doesn't have matrix right?
Yeah, we should change the cache key to not using matrix.python-version, but still have to be different from other python jobs cache key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure on this. This is the error that happens consistently when 3.10 is used:

Notice:  A new release of pip is available: 23.0 -> 23.0.1
Notice:  To update, run: pip install --upgrade pip
Installing dependencies from Pipfile...
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.10/x64/bin/pipenv", line 8, in <module>
    sys.exit(cli())
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/cli/options.py", line 56, in main
    return super().main(*args, **kwargs, windows_expand_args=False)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/decorators.py", line [84](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:85), in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/cli/command.py", line 233, in install
    do_install(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 2064, in do_install
    do_init(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 1333, in do_init
    do_install_dependencies(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line [86](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:87)1, in do_install_dependencies
    batch_install(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 732, in batch_install
    deps_to_install = [
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/core.py", line 733, in <listcomp>
    dep for dep in deps_to_install if not project.environment.is_satisfied(dep)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/project.py", line 304, in environment
    self._environment = self.get_environment(allow_global=allow_global)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/project.py", line 2[90](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:91), in get_environment
    environment = Environment(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/environment.py", line 79, in __init__
    self._base_paths = self.get_paths()
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/environment.py", line 390, in get_paths
    c = subprocess_run(command)
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/site-packages/pipenv/utils/processes.py", line 75, in subprocess_run
    return subprocess.run(
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line [97](https://github.com/gojek/merlin/actions/runs/4249958378/jobs/7390728519#step:14:98)1, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/hostedtoolcache/Python/3.10.10/x64/lib/python3.10/subprocess.py", line 1847, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/.local/share/virtualenvs/sdk-ywGzPJ0F/bin/python'

There was a similar issue raised here, pypa/pipenv#5075, but the solution is not clear to me. Since the python version here is 3.10.10, i thought of specifying a fixed version and it seems to be able to install the dependencies.

@@ -33,4 +33,4 @@ kubectl create namespace ${E2E_PROJECT_NAME} --dry-run=client -o yaml | kubectl
cd ../../python/sdk
pip install pipenv==2022.8.19
pipenv install --dev --skip-lock --python ${PYTHON_VERSION}
pipenv run pytest -n=8 -W=ignore --cov=merlin -m "not (feast or batch or pyfunc or local_server_test or cli or customtransformer)"
pipenv run pytest -n=6 -W=ignore --cov=merlin -m "not (feast or batch or pyfunc or local_server_test or cli or customtransformer)"
Copy link
Contributor Author

@shydefoo shydefoo Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reduced number of parallel processes to 6 so less inference services are created at the same time, so less chances of deployment create timing out?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have timeout issue now 😱
@ariefrahmansyah

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Our e2e test in GitHub is only deploying standard models though. Need more analysis on the root cause. I'll put it on our e2e test improvement doc.

Copy link
Contributor Author

@shydefoo shydefoo Feb 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry maybe timeout is the wrong word..I'm not 100% if it's a timeout issue, but i keep seeing this similar error in the debug log whenever the test fails:
merlin-log

{"level":"error","ts":1677142137.5520687,"caller":"cluster/controller.go:167","msg":"unable to create inference service logger-s-1 Internal error occurred: failed calling webhook \"inferenceservice.kserve-webhook-server.validator\": Post \"[https://kserve-webhook-server-service.kserve.svc:443/validate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s\](https://kserve-webhook-server-service.kserve.svc/validate-serving-kserve-io-v1beta1-inferenceservice?timeout=10s\)": EOF","stacktrace":"github.com/gojek/merlin/cluster.(*controller).Deploy\n\t/src/api/cluster/controller.go:167\ngithub.com/gojek/merlin/queue/work.(*ModelServiceDeployment).Deploy\n\t/src/api/queue/work/model_service_deployment.go:115\ngithub.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:81"}
  {"level":"error","ts":1677142137.5602424,"caller":"work/model_service_deployment.go:117","msg":"unable to deploy version endpoint for model: logger-s, version: 1, reason: error creating inference service","stacktrace":"github.com/gojek/merlin/queue/work.(*ModelServiceDeployment).Deploy\n\t/src/api/queue/work/model_service_deployment.go:117\ngithub.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:81"}
  {"level":"error","ts":1677142137.5852926,"caller":"queue/worker.go:82","msg":"Job 1 failed with error: error creating inference service","stacktrace":"github.com/gojek/merlin/queue.(*worker).processJob\n\t/src/api/queue/worker.go:82"}

model-log

  Error from server (BadRequest): container "kserve-container" in pod "multi-env-1-predictor-default-00001-deployment-64dc8f997c-s4hdl" is waiting to start: PodInitializing

@shydefoo
Copy link
Contributor Author

e2e-tests finally passing, @pradithya will merge this PR if that's ok

@pradithya
Copy link
Member

e2e-tests finally passing, @pradithya will merge this PR if that's ok

Please do. Thanks @shydefoo !

Copy link
Contributor

@ariefrahmansyah ariefrahmansyah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, @shydefoo!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants