Adding Kubernetes templates and deployment guide #1962

Merged on Mar 18, 2021 (8 commits)

Contributor

@Langhalsdino Langhalsdino commented Jul 29, 2020

Motivation and context

This topic has been raised several times in issues such as #1087. Since Kubernetes is widely used, an easy way to deploy into a Kubernetes environment would provide great value to the community and help bring CVAT to a wider audience.

Thanks especially to changes like #1641, it is now much easier to deploy CVAT in a k8s environment.

How has this been tested?

I deployed this in a couple of namespaces within our cluster (with and without an NVIDIA GPU).
Furthermore, I did not make any changes to the code, so the only real issue was networking. Since I followed the docker-compose.yml closely, there were no real challenges.

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT

@coveralls

coveralls commented Jul 29, 2020

Pull Request Test Coverage Report for Build 7348

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1092 unchanged lines in 43 files lost coverage.
  • Overall coverage increased (+1.7%) to 70.78%

Files with Coverage Reduction | New Missed Lines | %
cvat/apps/authentication/urls.py | 1 | 82.35%
datumaro/datumaro/components/launcher.py | 1 | 80.36%
datumaro/datumaro/plugins/accuracy_checker_plugin/details/ac.py | 1 | 4.17%
datumaro/datumaro/plugins/accuracy_checker_plugin/launcher.py | 1 | 25.0%
datumaro/datumaro/plugins/voc_format/format.py | 1 | 97.71%
datumaro/datumaro/util/init.py | 1 | 79.59%
cvat/apps/authentication/auth_basic.py | 3 | 75.0%
src/frames.js | 3 | 23.31%
datumaro/datumaro/plugins/openvino_launcher.py | 4 | 3.95%
src/labels.js | 4 | 82.86%

Totals
Change from base Build 6672: 1.7%
Covered Lines: 12498
Relevant Lines: 17262

💛 - Coveralls

@ActiveChooN
Contributor

Hi @Langhalsdino, thanks a lot for your contribution. In our team we think a lot about adding kubernetes or docker swarm support for deployment. I have a few questions and suggestions.

  1. We recently merged a PR (DL models as serverless functions #1767) that serves DL models as functions with Nuclio. Would you like to add this to your templates?
  2. There is a nice way to organize the Kubernetes structure with Helm charts and helmfile. This helps separate configuration values from the templates themselves and provides an easy way to deploy and manage the application. What about using these tools in your PR?

@Langhalsdino
Contributor Author

Langhalsdino commented Aug 3, 2020

> There is a nice way to organize the Kubernetes structure with Helm charts and helmfile. This helps separate configuration values from the templates themselves and provides an easy way to deploy and manage the application. What about using these tools in your PR?

Building a Helm chart is generally a good idea and would help to set all the variables, such as the Docker image registry, domains, ...
I will focus on this next weekend and try to add it in some new commits.
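
A minimal sketch of how such a split between values and templates could look, assuming a hypothetical `values.yaml` for a CVAT chart. None of these keys come from this PR; the key names, registry, tag, and domain are illustrative placeholders.

```yaml
# values.yaml (hypothetical; key names are illustrative, not taken from this PR)
image:
  registry: registry.example.com   # placeholder registry
  repository: cvat/server          # placeholder repository
  tag: "placeholder"               # placeholder tag
ingress:
  host: cvat.example.com           # placeholder domain

# A chart template would then reference these values with Helm's
# template syntax, for example:
#   image: "{{ .Values.image.registry }}/{{ .Values.image.repository }}:{{ .Values.image.tag }}"
#   host: "{{ .Values.ingress.host }}"
```

With this layout, a user could override a single setting at install time (for example with `--set ingress.host=...`) without touching the templates.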

> We recently merged a PR (#1767) that serves DL models as functions with Nuclio. Would you like to add this to your templates?

I could add it to the template. I could not find any examples of how to use Nuclio. Do you have example models, ... so that I can test the deployment?

Lastly, do you know the current status of #7? Referencing official images in the deployment templates would make things much clearer and could prepare the ground for publishing CVAT as an official Helm chart on helm.sh.

In general, I am not a big fan of ever-growing pull requests. I would therefore prefer that we agree on a specific scope for this issue, or maybe split it into multiple ones, so that we actually see them merged :)

@nmanovic
Contributor

nmanovic commented Aug 3, 2020

@Langhalsdino, you can find a couple of Nuclio examples here: https://github.com/opencv/cvat/blob/develop/serverless/deploy.sh

What is a reasonable scope for the PR from your point of view?

@Langhalsdino
Contributor Author

Langhalsdino commented Aug 3, 2020

> What is a reasonable scope for the PR from your point of view?

I am tempted to split this into two issues and PRs.
The first would add a Helm chart for the basic setup of CVAT, and the second would extend it with a Nuclio container (incl. GPU support).

The basic Helm chart would include a stateful set of:

  • PostgresDB
  • Redis
  • CVAT
  • CVAT-UI
  • Service

The other issue would extend the functionality by adding nuclio support and GPU support for CVAT and Nuclio.

What do you think about it, @nmanovic?

@rushtehrani
Contributor

rushtehrani commented Aug 3, 2020

If the plan is to use Helm, note that Nuclio supports and recommends the Helm chart deployment for production.

GPU support is dependent on the provider, for example GKE automatically installs the NVIDIA device plugin daemonset but AKS and EKS require you to install it manually.
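
For reference, once the NVIDIA device plugin is running on the nodes, a container usually requests a GPU through the `nvidia.com/gpu` extended resource. A minimal sketch, assuming a hypothetical pod name and a placeholder image that ships `nvidia-smi`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                 # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Placeholder image; substitute any CUDA-enabled image that ships nvidia-smi.
      image: registry.example.com/cuda-base:placeholder
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1            # resource advertised by the NVIDIA device plugin
```

On a cluster where the plugin is not installed, no node advertises `nvidia.com/gpu`, so a pod like this stays in the Pending state.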

Also, now that inference is in Nuclio, do we still need to have GPU support in the main CVAT container?

@Langhalsdino
Contributor Author

Langhalsdino commented Aug 3, 2020

> GPU support is dependent on the provider, for example GKE automatically installs the NVIDIA device plugin daemonset but AKS and EKS require you to install it manually.

I know about the plugin daemonset, and I would mention it in a README so that every EKS and AKS user will read it after their pods are not being scheduled :D

> If the plan is to use Helm, note that Nuclio supports and recommends the Helm chart deployment for production.

That looks awesome. Then we will not need to split this into two issues, and I will just link to their Helm repo and add configuration options for connecting to Nuclio.

> Also, now that inference is in Nuclio, do we still need to have GPU support in the main CVAT container?

This will probably make it a lot easier. If the CVAT roadmap is pursuing a deeper integration of Nuclio, I am happy to skip the GPU part, since it is different for every platform 🙈

@rushtehrani
Contributor

That all sounds great! At some point we may also want to support a Kustomize deployment too.

@Langhalsdino
Contributor Author

Langhalsdino commented Aug 3, 2020

> That all sounds great! At some point we may also want to support a Kustomize deployment too.

Why do you want to go for Kustomize if there is already a Helm chart? I cannot see the benefit of having both.
If Kustomize is the way to go, I could look into it and build it instead of Helm, even though Helm seems to be very popular and I use it on a daily basis.

@rushtehrani
Contributor

> Why do you want to go for Kustomize if there is already a Helm chart? I cannot see the benefit of having both.
> If Kustomize is the way to go, I could look into it and build it instead of Helm, even though Helm seems to be very popular and I use it on a daily basis.

To clarify, I think going with Helm for this phase makes perfect sense.

Ideally we'd want to support both. Kustomize is built into kubectl, and projects like ours and Kubeflow use it mainly for the reasons outlined in this overview.

We can contribute the Kustomize version at a later time.
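
A minimal sketch of what a Kustomize entry point for raw templates could look like. The file names and namespace below are hypothetical and are not taken from this PR.

```yaml
# kustomization.yaml (hypothetical layout; file names are placeholders)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: cvat                 # assumed namespace, applied to all resources below
resources:
  - database-deployment.yml
  - redis-deployment.yml
  - backend-deployment.yml
  - frontend-deployment.yml
```

Because Kustomize is built into kubectl, a directory containing such a file could be applied with `kubectl apply -k .` without installing any extra tooling.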

@nmanovic
Contributor

nmanovic commented Aug 4, 2020

> > What is a reasonable scope for the PR from your point of view?
>
> I am tempted to split this into two issues and PRs.
> The first would add a Helm chart for the basic setup of CVAT, and the second would extend it with a Nuclio container (incl. GPU support).
>
> The basic Helm chart would include a stateful set of:
>
>   • PostgresDB
>   • Redis
>   • CVAT
>   • CVAT-UI
>   • Service
>
> The other issue would extend the functionality by adding nuclio support and GPU support for CVAT and Nuclio.
>
> What do you think about it, @nmanovic?

Looks great! Thanks for the contribution.

@nmanovic
Contributor

@Langhalsdino, should we review the PR again, or is it still in progress? Could you please fix the Codacy issues?

@Langhalsdino
Contributor Author

Sorry, I got lost in over-engineering a Helm chart and was interrupted by upcoming exams and my thesis. I will fix the Codacy issues this weekend and open another PR with the Helm chart later on.

@Langhalsdino
Contributor Author

@nmanovic I fixed the issues mentioned in Codacy. Regarding the CHANGE.log file: should I add the changes in the following section, or is this done by the repo maintainers?

## [1.1.0-alpha] - 2020-06-30
### Added

@nmanovic
Contributor

nmanovic commented Oct 3, 2020

@azhavoro , could you please look at the PR? Do you have any comments?

@nmanovic
Contributor

nmanovic commented Oct 3, 2020

@Langhalsdino , sorry for the delay with our review and thanks for the contribution!

@nmanovic nmanovic requested a review from azhavoro October 3, 2020 04:01
@jizg

jizg commented Oct 12, 2020

Hi, guys. I tried deploying Nuclio on my Minikube and tested it with CVAT deployed from @Langhalsdino's YAML templates. Automatic annotation worked well this way. I made some changes to the YAML templates: I created a service for nuclio in the cvat namespace that points to the nuclio-dashboard service in the nuclio namespace, and added a no_proxy env entry for the nuclio service in cvat_backend_deployment.yml to bypass cvat_proxy.
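
One way the cross-namespace pointer described above could be written is an ExternalName Service. This is only a sketch and not necessarily how the actual changes were made: the Service name `nuclio` and the default cluster domain `cluster.local` are assumptions, and the second document is a fragment of a container spec rather than a complete manifest.

```yaml
# Service in the cvat namespace that resolves to the dashboard in the nuclio namespace
apiVersion: v1
kind: Service
metadata:
  name: nuclio                  # assumed Service name
  namespace: cvat
spec:
  type: ExternalName
  # Assumes the default cluster domain "cluster.local"
  externalName: nuclio-dashboard.nuclio.svc.cluster.local
---
# Fragment for the backend container in cvat_backend_deployment.yml (not a full manifest)
env:
  - name: no_proxy
    value: nuclio               # requests to this host bypass the configured proxy
```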

@Langhalsdino
Contributor Author

> Hi, guys. I tried deploying Nuclio on my Minikube and tested it with CVAT deployed from @Langhalsdino's YAML templates. Automatic annotation worked well this way. I made some changes to the YAML templates: I created a service for nuclio in the cvat namespace that points to the nuclio-dashboard service in the nuclio namespace, and added a no_proxy env entry for the nuclio service in cvat_backend_deployment.yml to bypass cvat_proxy.

Could you share your changes? I would like to include them in my template :)

@jizg

jizg commented Oct 13, 2020

Sure. Please check this PR: apic-ai#1
Nuclio can be deployed by following the official guide here (https://github.com/nuclio/nuclio/tree/development/docs/setup).

@nmanovic
Contributor

@azhavoro, could you please work together with @Langhalsdino to merge the PR? I see that CI has failed, but the functionality is really important for our internal infrastructure. Please take care of it.

@Langhalsdino
Contributor Author

> Sure. Please check this PR: apic-ai#1
> Nuclio can be deployed by following the official guide here (https://github.com/nuclio/nuclio/tree/development/docs/setup).

@jizg thank you for the PR, I merged it :)

@fg91

fg91 commented Nov 11, 2020

Hey @Langhalsdino ,

thanks for the Kubernetes templates, really looking forward to this functionality!

When trying them out, I found that the CVAT backend cannot find redis.

Backend logs:

2020-11-11 15:45:14,309 DEBG 'rqscheduler' stderr output:
15:45:14 Registering birth

2020-11-11 15:45:14,339 DEBG 'rqscheduler' stderr output:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 584, in _connect
    for res in socket.getaddrinfo(self.host, self.port, self.socket_type,
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/rqscheduler", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/rq_scheduler/scripts/rqscheduler.py", line 61, in main
    scheduler.run(burst=args.burst)
  File "/usr/local/lib/python3.8/dist-packages/rq_scheduler/scheduler.py", line 435, in run
    self.register_birth()
  File "/usr/local/lib/python3.8/dist-packages/rq_scheduler/scheduler.py", line 60, in register_birth
    if self.connection.exists(self.key) and \
  File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 1581, in exists
    return self.execute_command('EXISTS', *names)
  File "/usr/local/lib/python3.8/dist-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.8/dist-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error -2 connecting to redis:6379. Name or service not known.

2020-11-11 15:45:14,355 DEBG fd 10 closed, stopped monitoring <POutputDispatcher at 139846789984064 for <Subprocess at 139846790140640 with name rqscheduler in state STARTING> (stdout)>
2020-11-11 15:45:14,355 DEBG fd 16 closed, stopped monitoring <POutputDispatcher at 139846789737680 for <Subprocess at 139846790140640 with name rqscheduler in state STARTING> (stderr)>
2020-11-11 15:45:14,355 INFO exited: rqscheduler (exit status 1; not expected)
2020-11-11 15:45:14,355 DEBG received SIGCHLD indicating a child quit
2020-11-11 15:45:15,357 INFO gave up: rqscheduler entered FATAL state, too many start retries too quickly

Do you know why this could be happening?
Thanks!

Update:
The issue might be the following:

Here, --host redis is hardcoded.
https://github.com/openvinotoolkit/cvat/blob/b565721d81cd2357b82dd7cb9273e8eaf3b97222/supervisord.conf#L46

And in the docker-compose.yml a corresponding alias is specified:

  cvat_redis:
    container_name: cvat_redis
    image: redis:4.0-alpine
    networks:
      default:
        aliases:
          - redis

In Kubernetes, however, the service is not reachable under the name redis.
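
A minimal sketch of one possible workaround, assuming the Redis pods carry a label such as `app: cvat-redis` and run in a `cvat` namespace (both hypothetical; check the labels and namespace used in the actual templates). A Service named `redis` in the same namespace as the backend makes the short hostname `redis` resolvable through cluster DNS, which mirrors the docker-compose alias:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: redis              # yields the DNS name "redis" inside this namespace
  namespace: cvat          # assumed namespace; must match the backend's namespace
spec:
  selector:
    app: cvat-redis        # hypothetical label; must match the Redis pod labels
  ports:
    - port: 6379           # port from the error message "redis:6379"
      targetPort: 6379
```

The alternative would be to make the hostname in supervisord.conf configurable instead of hardcoded, which requires a change to the image rather than to the templates.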

@Langhalsdino
Contributor Author

@nmanovic I am unable to view the snyk security output and therefore am not sure what failed. Could you share the output with me?

@Langhalsdino
Contributor Author

Langhalsdino commented Nov 13, 2020

Hi @fg91 ,

since you sent me an email regarding a different image, I am assuming that you resolved the redis issue.
If so, could you share what resolved it? If not, I am happy to take a look into it.

In your email you raised the issue that hosting CVAT on a specific path of a domain, like my.cool.domain/cvat, is not working. I have not tried this yet, but I would assume that it is currently not possible with CVAT. Maybe a CVAT core developer can confirm or deny this.

If CVAT does support it, I would investigate further, since it should be easy to support by adjusting the templates in a few places.

@Langhalsdino
Contributor Author

I am really sorry for not responding to your requests. Based on your previous messages, I changed a couple of things:

  • Merged the current develop branch into the PR.
  • Fixed the Kubernetes API usage so that the latest Kubernetes version, v1.19.3, is supported, and tested it on v1.19.3 and 1.9.
  • Added a changelog entry.

This should cover the discussions above.
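
The comment does not say which API fields were changed. As a hedged illustration of the kind of change that keeps a template valid across that version range: the `apps/v1` group for Deployments has been available since Kubernetes 1.9, while the older beta groups stopped being served in 1.16. A minimal sketch with hypothetical names and a placeholder image:

```yaml
apiVersion: apps/v1              # instead of extensions/v1beta1 or apps/v1beta2
kind: Deployment
metadata:
  name: cvat-backend             # hypothetical name
spec:
  replicas: 1
  selector:                      # required in apps/v1; must match the pod template labels
    matchLabels:
      app: cvat-backend
  template:
    metadata:
      labels:
        app: cvat-backend
    spec:
      containers:
        - name: cvat-backend
          image: registry.example.com/cvat/server:placeholder   # placeholder image
```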

@coveralls

coveralls commented Feb 28, 2021

Coverage Status

Coverage remained the same at 72.22% when pulling c773419 on apic-ai:release-1.1.0 into ca97507 on openvinotoolkit:develop.

@Langhalsdino
Contributor Author

@azhavoro and @nmanovic could you do a new code review of this PR? I am really sorry I kept you waiting.

@nmanovic
Contributor

nmanovic commented Mar 1, 2021

@azhavoro , need to comment or merge ASAP.

@Langhalsdino
Contributor Author

@azhavoro If you need assistance setting up a local dev environment or have other questions, I am happy to help.

@azhavoro
Contributor

azhavoro commented Mar 3, 2021

@Langhalsdino Great job! I tested the PR and it works for me with minor changes (my comments above).
Thank you for the contribution!

@turowicz

turowicz commented Mar 3, 2021

Does nuclio work with Kubernetes >= 1.19.x?

@Langhalsdino's recent commit mentioned it should work with 1.19.x.

@Langhalsdino
Contributor Author

Langhalsdino commented Mar 3, 2021

> Does nuclio work with Kubernetes >= 1.19.x?
>
> @Langhalsdino's recent commit mentioned it should work with 1.19.x.

I tested the template with 1.9 and 1.19 and therefore deduce that everything between these versions should work. I will update the README.md to include the feedback by @azhavoro this evening.

Since nuclio works for v1.7 or later according to its documentation, there should not be any compatibility issues.

@turowicz

turowicz commented Mar 3, 2021

@Langhalsdino I wasn't able to get nuclio to work with Kubernetes >1.19 because it does not support Containerd. How did you host model containers without nuclio trying to build them?

@Langhalsdino
Contributor Author

> I wasn't able to get nuclio to work with Kubernetes >1.19 because it does not support Containerd. How did you host model containers without nuclio trying to build them?

Sorry, I did not phrase this correctly, so this misunderstanding is my fault.

I tested the deployment (part of this PR) with Kubernetes 1.19 and 1.9. Since CVAT works without Nuclio, this was fine for Kubernetes 1.19. Since Kubernetes' move to containerd happened around 1.18, I would suggest downgrading your cluster to 1.17 if you would like to use Nuclio, or waiting for Nuclio to fix this issue.

@azhavoro
Contributor

azhavoro commented Mar 9, 2021

LGTM
@ActiveChooN could you please review as well?

@tamademicheli

tamademicheli commented Mar 11, 2021 via email

@ActiveChooN
Contributor

@azhavoro, LGTM

Contributor

@nmanovic nmanovic left a comment

@Langhalsdino , thanks for the great contribution!

@nmanovic nmanovic merged commit 54ee8a1 into cvat-ai:develop Mar 18, 2021
@Langhalsdino Langhalsdino deleted the release-1.1.0 branch March 18, 2021 17:38
@ActiveChooN
Contributor

@Langhalsdino, hi, we've just merged the Helm charts in #3102, which can easily replace the raw k8s templates. Do the k8s templates serve any special purpose, or can they be removed?

@ActiveChooN ActiveChooN mentioned this pull request May 7, 2021
5 tasks
@jeffliao888

jeffliao888 commented Jul 13, 2022

I followed the document and deployed image: openvino/cvat_server:v1.7.0 on my Kubernetes v1.23.8, and it seems to work fine. But when I changed to image: openvino/cvat_server:v2.x.0, it shows "Could not check authorization on the server" in the browser.

Has anyone successfully upgraded the CVAT server image from v1.x.0 to v2.x.0 and gotten it to work in K8s? Any help would be appreciated.
Thank you !
