Major image/chunks transfer slowdown on CVAT (>2.4.5), maybe related to uvicorn/nginx config? #6477
Comments
I had similar issues too, but it could be that they are completely unrelated to yours. Out of curiosity though, how many import pods are you running, and what do you see if you exec into the backend server pod and run …
Thanks for the suggestion. My import worker config looks normal though, so changing it does not appear to do anything, and increasing the chunk size sounds like a hassle given that I would have to re-run the frame extraction on hundreds of tasks.
Ah, I understand, it's about the preview and annotation chunks, not the file upload chunks. I was thinking of the TUS resumable file upload protocol chunk size, but that wouldn't have helped anyway. I'm on the same version but had problems with file upload, not preview transfers. Have you just changed the CVAT server image version to upgrade? So it is probably not Redis related? I wish I could help; maybe it's worth trying this: if persistence is enabled on the Redis deployment (which it is by default), maybe try deleting the Redis PVCs and then recreating them?
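For reference, a minimal sketch of what that PVC recreation could look like on a typical Helm-based deployment; the namespace, StatefulSet, and PVC names below are assumptions and should be checked with `kubectl get pvc` first.

```bash
# Sketch only: all resource names are assumptions, verify them before deleting anything.
kubectl get pvc -n cvat | grep redis

# Scale Redis down, drop its PVC so a fresh volume gets provisioned, then scale back up.
kubectl scale statefulset cvat-redis -n cvat --replicas=0
kubectl delete pvc redis-data-cvat-redis-0 -n cvat
kubectl scale statefulset cvat-redis -n cvat --replicas=1
```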
Yes, exactly. I also have issues downloading the frame chunks for the player, which is the biggest problem.
I did a git bisect between v2.4.5 and v2.4.6: I reduced the changes in my fork to only the deployment config and stashed them at every step. The result confirmed my intuition, as it returned the following commit:
I will do more experimenting on #6195
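For context, a rough sketch of the bisect workflow described above; the tag names come from the comment, while the deploy-and-measure step is an assumption (it was done manually rather than scripted).

```bash
# Bisect between the known-good and known-bad releases.
git bisect start
git bisect bad v2.4.6    # slow chunk/preview transfers
git bisect good v2.4.5   # performance as expected

# At each step: deploy the checked-out revision, measure /preview and chunk
# download times, then mark it with `git bisect good` or `git bisect bad`.
# Once the offending commit is reported, clean up with:
git bisect reset
```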
I have identified the core problem; it comes down to the following supervisord config:

```diff
[fcgi-program:uvicorn]
socket=tcp://localhost:8000
command=%(ENV_HOME)s/wait-for-it.sh %(ENV_CVAT_POSTGRES_HOST)s:5432 -t 0 -- python3 -m uvicorn
    --fd 0 --forwarded-allow-ips='*' cvat.asgi:application
environment=SSH_AUTH_SOCK="/tmp/ssh-agent.sock"
-numprocs=%(ENV_NUMPROCS)s
+numprocs=4
process_name=%(program_name)s-%(process_num)s
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0
```

This seems to give performance closer to what it was before. Is there anything I am missing that would have made the value of NUMPROCS higher?
Nice, yeah that's what I meant in my first post. numprocs is the number of workers that I mentioned; it seems like they are at 1 now by default. You can even set it via the Helm chart (additionalEnv: NUMPROCS: 4). You can test that they are changed using `python manage.py rqstats`.
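A minimal sketch of that Helm override and the verification step, assuming the chart exposes an `additionalEnv` list for the backend (the exact values path, release name, and deployment name may differ between chart versions):

```bash
# Sketch only: values path, release name, namespace and deployment name are assumptions.
cat > numprocs-values.yaml <<'EOF'
cvat:
  backend:
    additionalEnv:
      - name: NUMPROCS
        value: "4"
EOF
helm upgrade cvat ./helm-chart -n cvat -f numprocs-values.yaml

# Check that the variable reached the backend pod and inspect the worker stats mentioned above.
kubectl exec -n cvat deploy/cvat-backend-server -- printenv NUMPROCS
kubectl exec -n cvat deploy/cvat-backend-server -- python3 manage.py rqstats
```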
Right, so it is actually the same root cause for two different problems. That would be a good way to set it too. I think this should be documented if it is going to remain like that, or a sane default should be added in the values.yaml.
Yeah, agreed, it took me a lot of looking before I found that this was the cause. It is indeed not mentioned anywhere unless you go browsing through the supervisord .conf files for all the workers.
@azhavoro Are we using only one process now? Do you think we need to document it? |
My actions before raising this issue
I am maintaining a fork of CVAT (deployed with Helm on a Kubernetes cluster) and I regularly sync with upstream. I recently rebased from release-2.4.0 to hotfix-2.4.7 and noticed significant slowdowns for all endpoints retrieving data from the data volume (the issue may not be limited to them, but it is probably more noticeable there since these endpoints account for the majority of the data transfers).
The most problematic slowdowns happen when fetching chunks of frames, which are 1.5x~2x slower than before, but I also noticed a significant slowdown on the /preview endpoint (up to 10x, around 3s instead of 300ms) when loading a task list view:
![image](https://private-user-images.githubusercontent.com/23527994/253310041-bbecd408-b288-4d27-a638-7e46fd498a59.png)
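For anyone wanting to reproduce the measurement, a hypothetical sketch of timing the preview request with curl; the host, task id, and token are placeholders, and the exact API path is an assumption based on the /preview endpoint mentioned above.

```bash
# Placeholders throughout: adjust the host, task id and token to your deployment.
curl -s -o /dev/null \
  -H "Authorization: Token <your-api-token>" \
  -w "preview fetched in %{time_total}s\n" \
  "https://cvat.example.com/api/tasks/123/preview"
```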
I saw that there were some major changes to the way the data is served, switching from mod_wsgi to uvicorn (ASGI) with some nginx config. This was to me the most probable culprit, which is why I reverted and rebased on release-2.4.5, and the issue went away. This does not confirm for sure that this change alone is responsible, as it could be anything between versions 2.4.5 and 2.4.7, but I still have strong suspicions about the new socket config. Do you think this is plausible, or is it something else entirely?
I also noted that performance does not seem to be much affected when running locally with docker-compose instead of Helm. Since the latter uses a data volume linked to an SMB file share on Azure, throughput is more limited, so if the average network throughput required to run CVAT has increased, that could be a reason for the slowdown.
For now I will stick to version 2.4.5, but any help or ideas would be very welcome, as I am quite interested in some upcoming features in 2.5.x.
Thanks a lot in advance.
Steps to Reproduce (for bugs)
Expected Behaviour
Similar or better performance
Current Behaviour
Significantly degraded performance, slow data transfers, slower global throughput
Context
Running CVAT 2.4.7 on a kubernetes cluster deployed using a helm chart and with a custom data volume using an Azure SMB file share.
Your Environment