
Major image/chunks transfer slowdown on CVAT (>2.4.5), maybe related to uvicorn/nginx config? #6477

Closed
titarch opened this issue Jul 13, 2023 · 10 comments



titarch commented Jul 13, 2023


I maintain a fork of CVAT (deployed with Helm on a Kubernetes cluster) and regularly sync with upstream. I recently rebased from release-2.4.0 to hotfix-2.4.7 and noticed significant slowdowns on all endpoints that retrieve data from the data volume (the issue may not be limited to them, but it is most noticeable there since these endpoints account for the majority of the data transfers).

The most problematic slowdowns happen when fetching frame chunks, which are 1.5x to 2x slower than before, but I also noticed a significant slowdown on the /preview endpoint (up to 10x, around 3 s instead of 300 ms) when loading a task list view:
[screenshot: request timings when loading the task list view]

I saw that there were some major changes to the way data is served, switching from mod_wsgi to uvicorn (ASGI) with a new nginx configuration. This seemed to me the most probable culprit, which is why I reverted and rebased onto release-2.4.5, and the issue went away. This does not prove that this change alone is responsible, as it could be anything between version 2.4.5 and 2.4.7, but I still strongly suspect the new socket configuration. Do you think this is plausible, or is it something else entirely?

I also noted that performance does not seem to be much affected when running locally with docker-compose instead of Helm. Since the latter uses a data volume backed by an SMB file share on Azure, throughput is more limited there, so if the average network throughput required to run CVAT increased, that could explain the slowdown.

For now I will stick to version 2.4.5, but any help or ideas would be very welcome, as I am quite interested in some of the upcoming features in 2.5.x.
Thanks a lot in advance.

Steps to Reproduce (for bugs)

  1. Deploy CVAT >= 2.4.7 on a cluster and 2.4.5 on another identical cluster
  2. Create some tasks
  3. Measure the difference in performance between the preview or data endpoints (see the timing commands below)
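
For reference, something like the following can be used to time those endpoints on both deployments (the host, task id and token are placeholders, and the exact API paths may vary slightly between CVAT versions):

# time the /preview endpoint for a task
curl -s -o /dev/null -w "preview: %{time_total}s\n" \
  -H "Authorization: Token <api-token>" \
  "https://cvat.example.com/api/tasks/123/preview"

# time a compressed data chunk for the same task
curl -s -o /dev/null -w "chunk: %{time_total}s\n" \
  -H "Authorization: Token <api-token>" \
  "https://cvat.example.com/api/tasks/123/data?type=chunk&quality=compressed&number=0"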

Expected Behaviour

Similar or better performance

Current Behaviour

Significantly degraded performance: slow data transfers and lower overall throughput

Context

Running CVAT 2.4.7 on a Kubernetes cluster, deployed using a Helm chart, with a custom data volume backed by an Azure SMB file share.

Your Environment

  • Fork rebased on top of the hotfix-2.4.7 branch
  • Docker version 24.0.2
  • Are you using Docker Swarm or Kubernetes? Kubernetes
  • Operating System and version (e.g. Linux, Windows, MacOS): Linux

Zanz2 commented Jul 13, 2023

I had similar issues too, but it could be completely unrelated to yours. Out of curiosity, though: how many import pods are you running, and if you exec into the backend server pod and run python manage.py rqstats, how many import workers does it list? In my case, increasing the chunk size from 2 MB to 10 MB helped, and for some reason my import pod was starting with only 1 worker, so I increased that to 3. Not sure if the server change was the cause, but the workers were not set correctly and migrate.py was not called automatically after the new change. Just my two cents.
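
Something like this should show it (the namespace and pod name are just placeholders for whatever your release uses):

# find the backend server pod
kubectl get pods -n cvat

# check the RQ worker counts from inside it
kubectl exec -it -n cvat <cvat-backend-server-pod> -- python manage.py rqstats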


titarch commented Jul 17, 2023

Thanks for the suggestion. My import worker config does seem normal, so that does not appear to be the problem, and increasing the chunk size sounds like a hassle given that I would have to re-run the frame extraction on hundreds of tasks.
I would really like to get back the performance of version 2.4.5 on later versions.


Zanz2 commented Jul 17, 2023

Ahh, I understand, it's about the preview and annotation chunks, not the file upload chunks. I was thinking of the TUS resumable file upload protocol chunk size, but that wouldn't have helped anyway. I'm on the same version but had problems with file upload, not preview transfers.

Have you just changed the CVAT server image version to upgrade? If so, it is probably not Redis related. I wish I could help more. Maybe it's worth checking whether persistence is enabled on the Redis deployment (which it is by default); if so, maybe try deleting the Redis PVCs and recreating them?
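
If you do try that, something like the following should do it (the namespace and PVC name are placeholders that depend on how the chart names its Redis resources):

# list PVCs to find the Redis ones
kubectl get pvc -n cvat

# delete a Redis PVC so it gets recreated on the next rollout
kubectl delete pvc <redis-pvc-name> -n cvat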


titarch commented Jul 17, 2023

Yes, exactly. I also have issues downloading the frame chunks for the player, which is the biggest problem.
I am using a fork, and depending on which CVAT version I rebase onto I get this issue, so I don't think the issue comes from Redis. The PVCs are also deleted and recreated when I change the CVAT base version.


titarch commented Jul 18, 2023

I did a git bisect between v2.4.5 and v2.4.6, reducing my fork's changes to only the deployment config and stashing them at every step. The result confirmed my intuition, as it returned the following commit:

87dd7fff928a38a5275ebe4de3793016392e8c0e is the first bad commit
commit 87dd7fff928a38a5275ebe4de3793016392e8c0e
Author: Andrey Zhavoronkov <andrey@cvat.ai>
Date:   Tue May 30 15:24:58 2023 +0300

    Switch to uvicorn (#6195)

I will do more experimenting on #6195

titarch mentioned this issue Jul 18, 2023

titarch commented Jul 18, 2023

I have identified the core problem: it is due to the NUMPROCS environment variable.
The value seems to be 1 by default and is not overridden to anything higher in the helm config.
If I manually set numprocs to something higher, e.g. 4, in supervisord/server.conf:

[fcgi-program:uvicorn]
socket=tcp://localhost:8000
command=%(ENV_HOME)s/wait-for-it.sh %(ENV_CVAT_POSTGRES_HOST)s:5432 -t 0 -- python3 -m uvicorn
    --fd 0 --forwarded-allow-ips='*' cvat.asgi:application
environment=SSH_AUTH_SOCK="/tmp/ssh-agent.sock"
-numprocs=%(ENV_NUMPROCS)s
+numprocs=4
process_name=%(program_name)s-%(process_num)s
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

this seems to give performance closer to what it was before.

Is there anything I am missing that would have made the value of NUMPROCS higher?


Zanz2 commented Jul 18, 2023

Nice, yeah, that's what I meant in my first post: numprocs is the number of workers I mentioned, and it seems like they are at 1 by default now. You can even set it via the helm chart (additionalEnv: NUMPROCS: 4), as sketched below. You can verify that they changed using python manage.py rqstats.
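
Something like this in the values override should do it (the exact key path is an assumption and may differ between chart versions; 4 is just an example value):

cvat:
  backend:
    additionalEnv:
      - name: NUMPROCS
        value: "4"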


titarch commented Jul 18, 2023

Right, so it is actually the same cause behind two different problems. That would be a good way to set it too. I think this should be documented if it is going to remain like that, or a sane default should be added to values.yaml.


Zanz2 commented Jul 18, 2023

Yeah, agreed. It took me a lot of looking before I found that that was the cause; it is indeed not mentioned anywhere unless you go browsing through the supervisord .conf files for all the workers.

bsekachev (Member) commented:

@azhavoro Are we using only one process now? Do you think we need to document it?
