Stress test 5000 concurrent users #46

Closed
yuvipanda opened this issue Jul 9, 2017 · 18 comments

@yuvipanda
Collaborator

I want to be able to simulate 5000 concurrent users, doing some amount of notebook activity (cell execution) and getting results back in some reasonable time frame.

This will require that we test & tune:

  1. JupyterHub
  2. Proxy implementation
  3. Spawner implementation
  4. Kubernetes itself

Ideally, we want Kubernetes itself to be our bottleneck, and our components introduce no additional delays.

This issue will document and track efforts to achieve this, as well as to define what 'this' is.

@yuvipanda
Collaborator Author

So far, we've:

  1. Switched to using pycurl inside the hub, to make communication with the proxy service easier (sketch below)
  2. Made polls orders of magnitude faster in kubespawner with Use reflector pattern to speed up polls kubespawner#63
  3. Started the process of making the connection pool's maxsize configurable in the official kubernetes client: Allow setting maxsize for PoolManager kubernetes-client/python-base#18
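
For reference, switching the Hub's HTTP client to pycurl is essentially a one-line tornado configuration. A rough sketch of what that amounts to (not the exact JupyterHub patch):

```python
# A rough sketch of what "use pycurl inside the hub" amounts to: tell tornado
# to use its curl-based client (requires pycurl) and raise the number of
# simultaneous requests. Not the exact JupyterHub patch.
from tornado.httpclient import AsyncHTTPClient

AsyncHTTPClient.configure(
    "tornado.curl_httpclient.CurlAsyncHTTPClient",
    max_clients=64,  # allow more concurrent proxy/API requests in flight
)
```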

@yuvipanda
Collaborator Author

A big problem we ran into was that we were constantly hit by redirect loops. @minrk and I dug through this (yay teamwork? I made some comments and went to sleep, and he had solved it by the time I woke up!), and jupyterhub/jupyterhub#1221 seems to get rid of them all!

When I ran a 2000 user test, about 150 failed to start. Some of them were because of this curl error:

[E 2017-07-14 20:09:46.147 JupyterHub ioloop:638] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f8cfbdc1ea0>, <tornado.concurrent.Future object at 0x7f8c30a87048>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 605, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 626, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1382, in update_last_activity
        routes = yield self.proxy.get_all_routes()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 556, in get_all_routes
        resp = yield self.api_request('', client=client)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 20553 milliseconds with 163693 bytes received

This could partially be fixed by tuning pycurl. However, it failed a 20s timeout, which is pretty generous. I think a k8s ingress based Proxy implementation would help here, and might be the next bottleneck to solve.

The spawn times increased linearly with time, eventually going over the 300s timeout we have for image starts. We should take a profiler to the hub process (not sure how to profile tornado code, though) and figure out what is taking a linearly increasing amount of time.

@yuvipanda
Collaborator Author

Ran another 5000-user test, and am now convinced that the next step is a proxy implementation that does not need to make a network request for each get_all_routes() call.

@minrk
Member

minrk commented Jul 15, 2017

I suspect the slowness of get_all_routes may be a bit misleading, because lots of actions are using the same IOLoop (and even the same httpclient) while waiting for Spawners to start, etc. My guess is that it's all tied up in the busyness of the Hub IOLoop. Plus, get_all_routes only happens once every five minutes, so it seems a bit odd for it to be a major contributing factor.

When I run get_all_routes on its own when things are otherwise idle, it takes 50ms to fetch 20k routes, so it's unlikely that the fetch is actually what's taking that time. My guess is that it's something like:

  1. start timer
  2. initiate fetch
  3. yield to other tasks
  4. check timer (all those other tasks ate my 20 seconds!)

Under load, presumably CHP would take longer to respond to the request, but 20s seems like a lot compared to 50ms.
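
One way to test that hypothesis (a minimal sketch, not something from the Hub itself) is to measure how late a periodic callback fires on the Hub's IOLoop; lag anywhere near the 20s request timeout would mean the loop, not the proxy fetch, is eating the time:

```python
# Minimal sketch (not from the Hub) to test the busy-IOLoop hypothesis:
# schedule a callback every second and measure how late it actually fires.
import time

from tornado.ioloop import IOLoop, PeriodicCallback

last = time.monotonic()

def report_lag():
    global last
    now = time.monotonic()
    lag = now - last - 1.0  # anything beyond the 1s interval is loop lag
    if lag > 0.1:
        print("IOLoop lag: %.2fs" % lag)
    last = now

PeriodicCallback(report_lag, 1000).start()
IOLoop.current().start()
```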

I think it's important to separate what we're testing as concurrent:

  1. concurrent spawns
  2. concurrent active users (talking only to their server, not the Hub)

In particular, JupyterHub is based on the idea that login/spawn is a relatively infrequent event compared to actually talking to their server, so while stress testing all users concurrently spawning is useful as a worst-case scenario, it doesn't give an answer to the question "how many users can I support" because it is not a realistic load of concurrent users.

I'd love a load test that separated the number of people trying to spawn from the number of people concurrently trying to use their running notebook servers (opening & saving, running code, etc.). e.g. being able to ask the questions:

  • What's the performance of 5k users when nobody is spawning new servers
  • What's the performance of one login when 5k users are active
  • 100 users always logging in / out, 5k active
  • etc.

@yuvipanda
Collaborator Author

+1 on load testing a matrix of factors!

We should also identify what 'performance' means. I think it's:

  1. Total time to get from login to notebook on browser (+ failure counts)
  2. Time taken to start a kernel and execute a reference notebook over time (this will happen in a loop)

Anything else, @minrk? If we nail these down as things we wanna measure, then we can form the test matrix properly...
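
For (1), a hypothetical sketch of how login-to-server time could be measured through the Hub's REST API rather than a browser (the URL and token are placeholders; this is not an existing test script):

```python
# Hypothetical sketch for metric 1: time a spawn through the Hub REST API.
# HUB_API and the token are placeholders; not an existing test script.
import time

import requests

HUB_API = "http://hub.example.com/hub/api"
HEADERS = {"Authorization": "token YOUR_ADMIN_TOKEN"}

def time_spawn(username):
    start = time.perf_counter()
    # Ask the Hub to start the user's server...
    requests.post("%s/users/%s/server" % (HUB_API, username), headers=HEADERS)
    # ...then poll until the spawn is no longer pending and the server is up.
    while True:
        user = requests.get("%s/users/%s" % (HUB_API, username),
                            headers=HEADERS).json()
        if user.get("server") and not user.get("pending"):
            break
        time.sleep(1)
    return time.perf_counter() - start
```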

@willingc
Contributor

This is awesome @yuvipanda

@yuvipanda
Collaborator Author

I think the ideal way for us to stress test is to define a function that draws a 'current active users' graph - spikes will have more users logging in, slumps will have users dropping out, and steady states will do nothing. This will let us generate varying kinds of load just by tweaking the shape of this graph (which you can do interactively).
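
Roughly something like this (a hypothetical sketch; the function names are made up, and start_user/stop_user would wrap whatever the simulated-user harness does):

```python
# Hypothetical sketch: the load is defined entirely by a target_users(t)
# curve, and the runner logs simulated users in or out to track it.
import math
import time

def target_users(t):
    """Desired number of concurrently active users, t seconds into the test."""
    base = 2000
    spike = 1500 * max(0.0, math.sin(t / 300.0))  # periodic login spikes
    return int(base + spike)

def run(duration, start_user, stop_user):
    active = []
    t0 = time.time()
    while time.time() - t0 < duration:
        want = target_users(time.time() - t0)
        while len(active) < want:    # spike: more users log in
            active.append(start_user())
        while len(active) > want:    # slump: users drop out
            stop_user(active.pop())
        time.sleep(5)
```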

@minrk
Member

minrk commented Jul 17, 2017

That looks like a good starting point.

Some 'performance' metrics:

Hub:

  1. login + spawn time
  2. HTTP roundtrip (GET /hub/home)
  3. API roundtrip (GET /hub/api/users)

Singleuser (spawned and ready to go):

  1. full-go (open notebook, start kernel, run all, save)
  2. websocket roundtrip (single message)
  3. HTTP roundtrip (GET /api/contents)
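
A sketch of sampling the two Hub roundtrip metrics above (base URL and token are placeholders):

```python
# Sketch of sampling the Hub HTTP and API roundtrip metrics listed above;
# the base URL and token are placeholders.
import time

import requests

BASE = "http://hub.example.com"
HEADERS = {"Authorization": "token YOUR_API_TOKEN"}

def roundtrip(path):
    start = time.perf_counter()
    requests.get(BASE + path, headers=HEADERS)
    return time.perf_counter() - start

print("GET /hub/home: %.3fs" % roundtrip("/hub/home"))            # HTTP roundtrip
print("GET /hub/api/users: %.3fs" % roundtrip("/hub/api/users"))  # API roundtrip
```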

@yuvipanda
Collaborator Author

So I did more stuff around this in the last couple of days.

  1. I switched to a kubernetes ingress provider for proxying. This is great, but still a bit racy. Hopefully we'll tackle all of the races.
  2. I did a performance profile (using pyflame), and it looks like we're spending a good chunk of CPU on auth-related methods. There are ways to optimize there.
  3. I'm currently trying out pypy to see if it gives us a boost!

@yuvipanda
Collaborator Author

I also filed jupyterhub/jupyterhub#1253 - that's the most CPU-intensive hotspot we've found. Plus, there are lots of well-tested libraries for this that we should use.

@yuvipanda
Collaborator Author

yuvipanda commented Jul 23, 2017

Filed kubernetes-client/python-base#23 for issues with kubernetes api-client.

Update: It's most likely a PyPy bug: https://bitbucket.org/pypy/pypy/issues/2615/this-looks-like-a-thread-safety-bug

@yuvipanda
Collaborator Author

We can offload TLS to a kubectl proxy instance that's running as a sidecar container, providing a http endpoint that's bound to localhost. I've filed kubernetes-client/python#303 on figuring out how to do this.
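
With a kubectl proxy sidecar, the Python client just needs to be pointed at the plain-HTTP localhost endpoint. A sketch, assuming recent client versions and that the sidecar listens on port 8001:

```python
# Sketch: point the official kubernetes client at a kubectl proxy sidecar on
# localhost, so the Python side does no TLS work. The port (8001) is whatever
# the sidecar is configured to listen on.
from kubernetes import client

configuration = client.Configuration()
configuration.host = "http://localhost:8001"
v1 = client.CoreV1Api(client.ApiClient(configuration))

# All calls now go over plain HTTP to the sidecar, which handles TLS to the
# API server.
pods = v1.list_namespaced_pod(namespace="jupyterhub")
```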

@yuvipanda
Collaborator Author

@minrk and I tracked down the SQLAlchemy issues to jupyterhub/jupyterhub#1257 - and it has improved performance considerably!

hashlib is the next step!

@yuvipanda
Collaborator Author

More notes:

  1. We have lots of scaling things to do that are dependent on 'number of spawns pending'.
  2. We're now using threads that do IO, and our main thread depends on dicts updated by those threads. Figure out how to profile GIL contention.
  3. Try to reduce the amount of busy looping that we do in the proxy & spawner, and implement exponential backoffs instead of 'gen.sleep(1)'.
  4. Figure out if we really can stop salting our hashes if they're from UUIDs - or at least stop salting the specific hashes that come from UUIDs. The prefix based scheme might be part of our problem.
  5. Move all sqlalchemy operations to a single external thread (not a threadpool probably)

@minrk
Member

minrk commented Jul 24, 2017

Move all sqlalchemy operations to a single external thread (not a threadpool probably)

This is probably the hardest one, because we have to make sure that we bundle db operations within a single synchronous function that we pass to a thread, whereas right now we only have to ensure that we commit in between yields.
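
A sketch of what that bundling could look like (not JupyterHub's actual code; the table, column names, and session_factory are placeholders):

```python
# Sketch of the single-external-DB-thread idea: each logical operation's
# SQLAlchemy work is bundled into one synchronous function and submitted to a
# one-thread executor, so the session is only ever touched from that thread.
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import text
from tornado import gen

db_executor = ThreadPoolExecutor(max_workers=1)  # the single DB thread

def _update_last_activity(session_factory, user_name, timestamp):
    """All db access for this operation, bundled in one synchronous call."""
    session = session_factory()  # a sessionmaker bound to the engine
    try:
        session.execute(
            text("UPDATE users SET last_activity = :ts WHERE name = :name"),
            {"ts": timestamp, "name": user_name},
        )
        session.commit()
    finally:
        session.close()

@gen.coroutine
def update_last_activity(session_factory, user_name, timestamp):
    # The coroutine never touches the session; it only waits for the bundled
    # synchronous function to finish on the DB thread.
    yield db_executor.submit(_update_last_activity,
                             session_factory, user_name, timestamp)
```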

@yuvipanda
Collaborator Author

Additional point - make sure that each tornado tick isn't doing too much work. We want the work spread out as evenly as possible, so we don't get ticks that are too long followed by ticks that are too short. I'm implementing some jitter in the exponential backoff algorithm for this now...
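
A sketch of what the jittered backoff could look like for the spawn/proxy polling loops (not the exact JupyterHub implementation; the function and parameter names are illustrative):

```python
# Sketch of jittered exponential backoff for polling loops, replacing flat
# gen.sleep(1); not the exact JupyterHub implementation.
import random

from tornado import gen

@gen.coroutine
def wait_until_ready(ready, initial=0.2, max_wait=10, timeout=300):
    """Poll ready() with exponentially growing, jittered sleeps."""
    waited = 0
    limit = initial
    while waited < timeout:
        result = yield ready()
        if result:
            return True
        # Full jitter: pick a random sleep up to the current limit, so many
        # waiters don't all wake up in the same tornado tick.
        sleep = random.uniform(0, limit)
        yield gen.sleep(sleep)
        waited += sleep
        limit = min(limit * 2, max_wait)
    raise TimeoutError("not ready after %ss" % timeout)
```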

@yuvipanda
Collaborator Author

We fixed a ton of stuff doing this! YAY!

@nnbaokhang

Sorry, but how do we run the stress test? I don't think I see any tutorial yet. Correct me if I'm wrong.
