
Major image/chunks transfer slowdown on CVAT (>2.4.5), maybe related to uvicorn/nginx config? #6477

Closed
titarch opened this issue Jul 13, 2023 · 10 comments



titarch commented Jul 13, 2023


I maintain a fork of CVAT (deployed with Helm on a Kubernetes cluster) and regularly sync with upstream. I recently rebased from release-2.4.0 to hotfix-2.4.7 and noticed significant slowdowns on all endpoints that retrieve data from the data volume (the issue may not be limited to them, but it is most noticeable there since these endpoints account for the majority of the data transfers).

The most problematic slowdowns happen when fetching frame chunks, which are 1.5x to 2x slower than before, but I also noticed a significant slowdown on the /preview endpoint (up to 10x, around 3 s instead of 300 ms) when loading a task list view:
[screenshot: request timings when loading the task list view]

I saw that there were some major changes to the way data is served, switching from mod_wsgi to uvicorn (ASGI) with a new nginx configuration. This seemed to me the most probable culprit, which is why I reverted and rebased onto release-2.4.5, and the issue went away. This does not prove that this change alone is responsible, as it could be anything between version 2.4.5 and 2.4.7, but I still strongly suspect the new socket configuration. Do you think this is plausible, or is it something else entirely?

I also noted that performance does not seem to be much affected when running locally with docker-compose instead of Helm. Since the latter uses a data volume backed by an SMB file share on Azure, throughput is more limited there, so if the average network throughput required to run CVAT increased, that could explain the slowdown.

For now I will stick to version 2.4.5, but any help or ideas would be very welcome, as I am quite interested in some of the upcoming features in 2.5.x.
Thanks a lot in advance.

Steps to Reproduce (for bugs)

  1. Deploy CVAT >= 2.4.7 on a cluster and 2.4.5 on another identical cluster
  2. Create some tasks
  3. Measure the difference in performance between the preview or data endpoints (see the timing commands below)
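
For reference, something like the following can be used to time those endpoints on both deployments (the host, task id and token are placeholders, and the exact API paths may vary slightly between CVAT versions):

# time the /preview endpoint for a task
curl -s -o /dev/null -w "preview: %{time_total}s\n" \
  -H "Authorization: Token <api-token>" \
  "https://cvat.example.com/api/tasks/123/preview"

# time a compressed data chunk for the same task
curl -s -o /dev/null -w "chunk: %{time_total}s\n" \
  -H "Authorization: Token <api-token>" \
  "https://cvat.example.com/api/tasks/123/data?type=chunk&quality=compressed&number=0"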

Expected Behaviour

Similar or better performance

Current Behaviour

Significantly degraded performance: slow data transfers and lower overall throughput

Context

Running CVAT 2.4.7 on a Kubernetes cluster, deployed using a Helm chart, with a custom data volume backed by an Azure SMB file share.

Your Environment

  • Fork rebased on top of the hotfix-2.4.7 branch
  • Docker version 24.0.2
  • Are you using Docker Swarm or Kubernetes? Kubernetes
  • Operating System and version (e.g. Linux, Windows, MacOS): Linux

Zanz2 commented Jul 13, 2023

I had similar issues too, but it could be completely unrelated to yours. Out of curiosity, though: how many import pods are you running, and if you exec into the backend server pod and run python manage.py rqstats, how many import workers does it list? In my case, increasing the chunk size from 2 MB to 10 MB helped, and for some reason my import pod was starting with only 1 worker, so I increased that to 3. Not sure if the server change was the cause, but the workers were not set correctly and migrate.py was not called automatically after the new change. Just my two cents.
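
Something like this should show it (the namespace and pod name are just placeholders for whatever your release uses):

# find the backend server pod
kubectl get pods -n cvat

# check the RQ worker counts from inside it
kubectl exec -it -n cvat <cvat-backend-server-pod> -- python manage.py rqstats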


titarch commented Jul 17, 2023

Thanks for the suggestion. My import worker config does seem normal, so that does not appear to be the problem, and increasing the chunk size sounds like a hassle given that I would have to re-run the frame extraction on hundreds of tasks.
I would really like to get back the performance of version 2.4.5 on later versions.


Zanz2 commented Jul 17, 2023

Ahh, I understand, it's about the preview and annotation chunks, not the file upload chunks. I was thinking of the TUS resumable file upload protocol chunk size, but that wouldn't have helped anyway. I'm on the same version but had problems with file upload, not preview transfers.

Have you just changed the CVAT server image version to upgrade? If so, it is probably not Redis related. I wish I could help more. Maybe it's worth checking whether persistence is enabled on the Redis deployment (which it is by default); if so, maybe try deleting the Redis PVCs and recreating them?
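
If you do try that, something like the following should do it (the namespace and PVC name are placeholders that depend on how the chart names its Redis resources):

# list PVCs to find the Redis ones
kubectl get pvc -n cvat

# delete a Redis PVC so it gets recreated on the next rollout
kubectl delete pvc <redis-pvc-name> -n cvat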


titarch commented Jul 17, 2023

Yes, exactly. I also have issues downloading the frame chunks for the player, which is the biggest problem.
I am using a fork, and depending on which CVAT version I rebase onto I get this issue, so I don't think the issue comes from Redis. The PVCs are also deleted and recreated when I change the CVAT base version.


titarch commented Jul 18, 2023

I did a git bisect between v2.4.5 and v2.4.6, reducing my fork's changes to only the deployment config and stashing them at every step. The result confirmed my intuition, as it returned the following commit:

87dd7fff928a38a5275ebe4de3793016392e8c0e is the first bad commit
commit 87dd7fff928a38a5275ebe4de3793016392e8c0e
Author: Andrey Zhavoronkov <andrey@cvat.ai>
Date:   Tue May 30 15:24:58 2023 +0300

    Switch to uvicorn (#6195)

I will do more experimenting on #6195

titarch mentioned this issue Jul 18, 2023

titarch commented Jul 18, 2023

I have identified the core problem: it is due to the NUMPROCS environment variable.
The value seems to be 1 by default and is not overridden to anything higher in the helm config.
If I manually set numprocs to something higher, e.g. 4, in supervisord/server.conf:

[fcgi-program:uvicorn]
socket=tcp://localhost:8000
command=%(ENV_HOME)s/wait-for-it.sh %(ENV_CVAT_POSTGRES_HOST)s:5432 -t 0 -- python3 -m uvicorn
    --fd 0 --forwarded-allow-ips='*' cvat.asgi:application
environment=SSH_AUTH_SOCK="/tmp/ssh-agent.sock"
-numprocs=%(ENV_NUMPROCS)s
+numprocs=4
process_name=%(program_name)s-%(process_num)s
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

this seems to give performance closer to what it was before.

Is there anything I am missing that would have made the value of NUMPROCS higher?


Zanz2 commented Jul 18, 2023

Nice, yeah, that's what I meant in my first post: numprocs is the number of workers I mentioned, and it seems like they are at 1 by default now. You can even set it via the helm chart (additionalEnv: NUMPROCS: 4), as sketched below. You can verify that they changed using python manage.py rqstats.
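
Something like this in the values override should do it (the exact key path is an assumption and may differ between chart versions; 4 is just an example value):

cvat:
  backend:
    additionalEnv:
      - name: NUMPROCS
        value: "4"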


titarch commented Jul 18, 2023

Right, so it is actually the same cause behind two different problems. That would be a good way to set it too. I think this should be documented if it is going to remain like that, or a sane default should be added to values.yaml.


Zanz2 commented Jul 18, 2023

Yeah, agreed. It took me a lot of looking before I found that that was the cause; it is indeed not mentioned anywhere unless you go browsing through the supervisord .conf files for all the workers.

bsekachev (Member) commented:

@azhavoro Are we using only one process now? Do you think we need to document it?
