Stress test 5000 concurrent users #46
I want to be able to simulate 5000 concurrent users, doing some amount of notebook activity (cell execution) and getting results back in some reasonable time frame.
This will require that we test & tune:
Ideally, we want Kubernetes itself to be our bottleneck, with our components introducing no additional delays.
This issue will document and track efforts to achieve this, as well as to define what 'this' is.
So far, we've:
A big problem we ran into was that we were constantly hit by redirect loops. @minrk and I dug through this (yay teamwork? I made some comments and went to sleep, and he had solved it when I woke up!), and jupyterhub/jupyterhub#1221 seems to get rid of them all! When I ran a 2000 user test, about 150 failed to start. Some of them were because of this curl error:
This could partially be fixed by tuning pycurl. However, it failed a 20s timeout, which is pretty generous. I think a k8s-ingress-based Proxy implementation would help here, and might be the next bottleneck to solve. The spawn times increased linearly with time, eventually going over the 300s timeout we have for image starts. We should take a profiler to the hub process (not sure how to profile tornado code, though) and figure out what is taking a linearly increasing amount of time.
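For the profiling question, here's a rough sketch of one way to poke at a running tornado process without stopping it - this assumes pyinstrument (a sampling profiler that copes fine with coroutine-heavy code); nothing like this is wired into the hub today, it's just an illustration:

```python
# Illustrative only: toggle a sampling profiler on a running hub process
# with SIGUSR1, assuming pyinstrument is installed. First signal starts
# profiling, second one stops it and prints the report.
import signal

from pyinstrument import Profiler

_profiler = None


def _toggle_profiler(signum, frame):
    global _profiler
    if _profiler is None:
        _profiler = Profiler()
        _profiler.start()
    else:
        _profiler.stop()
        print(_profiler.output_text(unicode=True, color=False))
        _profiler = None


def install_profiler_toggle():
    # Call once at startup, e.g. from jupyterhub_config.py.
    signal.signal(signal.SIGUSR1, _toggle_profiler)
```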
Ran another 5000, and am now convinced that the next step is a proxy implementation that does not need to make a network request for each get_all_routes().
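To make that concrete, here's the rough shape such a proxy class could take with JupyterHub's custom-proxy API (jupyterhub.proxy.Proxy) - the actual k8s-ingress plumbing is stubbed out, so treat this as a sketch rather than a working implementation:

```python
# Sketch: a Proxy whose get_all_routes() is answered from local state,
# so the hub's periodic route checks never hit the network. The ingress
# reconciliation is a hypothetical stub.
from jupyterhub.proxy import Proxy


class IngressProxy(Proxy):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._routes = {}  # routespec -> {'routespec', 'target', 'data'}

    async def add_route(self, routespec, target, data):
        self._routes[routespec] = {
            'routespec': routespec,
            'target': target,
            'data': data,
        }
        await self._reconcile()

    async def delete_route(self, routespec):
        self._routes.pop(routespec, None)
        await self._reconcile()

    async def get_all_routes(self):
        # No network request: just hand back a copy of the in-memory table.
        return dict(self._routes)

    async def _reconcile(self):
        # Hypothetical: push self._routes into a Kubernetes Ingress /
        # Endpoints object via the kubernetes client, off the hot path.
        pass
```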
I suspect the slowness isn't in the fetch itself. When I run get_all_routes on its own when things are otherwise idle, it takes 50ms to fetch 20k routes, so it's unlikely that the fetch is actually what's taking that time. My guess is that it's something like:
Under load, presumably CHP would take longer to respond to the request, but 20s seems like a lot compared to 50ms. I think it's important to separate what we're testing as concurrent:
In particular, JupyterHub is based on the idea that login/spawn is a relatively infrequent event compared to users actually talking to their servers, so while stress testing all users concurrently spawning is useful as a worst-case scenario, it doesn't answer the question "how many users can I support?", because it is not a realistic load of concurrent users. I'd love a load test that separated the number of people trying to spawn from the number of people concurrently trying to use their running notebook servers (opening & saving, running code, etc.), e.g. being able to ask the questions:
+1 on load testing a matrix of factors! We should also identify what 'performance' means. I think it's:
Anything else, @minrk? If we nail these down as things we wanna measure, then we can form the test matrix properly...
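Once we agree on the factors, enumerating the matrix is the easy part - something like this (factor names and values here are placeholders, not a proposal):

```python
# Placeholder sketch: expand a set of load-test factors into a test matrix.
from itertools import product

factors = {
    'total_users': [1000, 2000, 5000],
    'concurrent_spawns': [10, 100, 500],
    'active_fraction': [0.1, 0.5, 1.0],
}

test_matrix = [dict(zip(factors, combo)) for combo in product(*factors.values())]
# 27 combinations; each one gets run and its metrics recorded.
```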
This is awesome, @yuvipanda
I think the ideal way for us to stress test is to define a function that'll draw a 'current active users' graph - spikes will get more users logging in, slumps will have users dropping out, and steady states will do nothing. This will let us generate varying kinds of load by just tweaking the shape of this graph (which you can do interactively).
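A sketch of what driving load from such a graph could look like - target_active() is the tweakable shape, and start_user()/stop_user() are hypothetical hooks into whatever load-generation tool we end up with:

```python
# Sketch: follow a 'current active users' curve by starting/stopping
# simulated users. The shape function and the user hooks are made up.
import asyncio
import math
import random


def target_active(t):
    # Example shape: a baseline, one spike around t=600s, plus noise.
    return int(200
               + 150 * math.exp(-((t - 600) ** 2) / (2 * 120 ** 2))
               + random.randint(-5, 5))


async def drive_load(start_user, stop_user, duration=1800, step=10):
    users = []
    for t in range(0, duration, step):
        want = max(0, target_active(t))
        while len(users) < want:
            users.append(await start_user())   # spike: log more users in
        while len(users) > want:
            await stop_user(users.pop())       # slump: drop users out
        await asyncio.sleep(step)
```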
That looks like a good starting point. Some 'performance' metrics:
Hub:
Singleuser (spawned and ready to go):
So I did more stuff around this in the last couple of days.
I also filed jupyterhub/jupyterhub#1253 - that's the most CPU-intensive hotspot we've found. Plus there are lots of well-tested libraries for this that we should use.
Filed kubernetes-client/python-base#23 for issues with the kubernetes api-client. Update: It's most likely a PyPy bug: https://bitbucket.org/pypy/pypy/issues/2615/this-looks-like-a-thread-safety-bug
We can offload TLS to a ...
@minrk and I tracked down the SQLAlchemy issues to jupyterhub/jupyterhub#1257 - and it's improved performance considerably! hashlib is the next step!
More notes:
This is probably the hardest one, because we have to make sure that we bundle db operations within a single synchronous function that we pass to a thread, whereas right now we only have to ensure that we commit in between yields. |
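Roughly what I mean, as a sketch (tornado 5+'s run_in_executor, made-up helper names, and whatever SQLAlchemy session_factory we'd actually use):

```python
# Sketch: bundle all SQLAlchemy work for one logical operation into a
# single synchronous function, and run it on a thread so the event loop
# never blocks on the db. Names here are illustrative.
from concurrent.futures import ThreadPoolExecutor

from tornado.ioloop import IOLoop

db_executor = ThreadPoolExecutor(1)  # one thread keeps db access serial


def _bundled_db_work(session_factory, work):
    """Run work(session) entirely in one thread, one transaction."""
    session = session_factory()
    try:
        result = work(session)
        session.commit()
        return result
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


async def run_db(session_factory, work):
    # The coroutine only awaits the thread; no commits between yields.
    return await IOLoop.current().run_in_executor(
        db_executor, _bundled_db_work, session_factory, work)
```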
Additional point - make sure that each tornado tick isn't doing too much work. We want the work to be spread out as evenly as possible, so we don't get ticks that are too long followed by ticks that are too short. I'm implementing some jitter into the exponential backoff algorithm for this now...
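The jitter idea, as a tiny sketch (numbers are illustrative, not what we'd actually ship):

```python
# Sketch: exponential backoff with random jitter, so a few thousand
# pending spawns don't all poll on the same tornado tick.
import random


def backoff_delays(start=0.2, factor=2.0, max_delay=10.0, jitter=0.4):
    """Yield successive poll delays: exponential, capped, and jittered."""
    delay = start
    while True:
        # Spread each delay over [delay*(1-jitter), delay*(1+jitter)].
        yield delay * (1 + jitter * (2 * random.random() - 1))
        delay = min(delay * factor, max_delay)
```

Each retry loop would then pull its next sleep from a generator like this instead of using a fixed schedule.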
We fixed a ton of stuff doing this! YAY!
Sorry, but how do we run the stress test? I don't think I see any tutorial yet. Correct me if I'm wrong.