Stress test 5000 concurrent users #46

Closed
yuvipanda opened this issue Jul 9, 2017 · 18 comments

@yuvipanda
Collaborator

I want to be able to simulate 5000 concurrent users, doing some amount of notebook activity (cell execution) and getting results back in some reasonable time frame.

This will require that we test & tune:

  1. JupyterHub
  2. Proxy implementation
  3. Spawner implementation
  4. Kubernetes itself

Ideally, we want Kubernetes itself to be our bottleneck, and our components introduce no additional delays.

This issue will document and track efforts to achieve this, as well as to define what 'this' is.

@yuvipanda
Collaborator Author

So far, we've:

  1. Switched to using pycurl inside the hub, to make communication with the proxy service easier (sketch below)
  2. Made polls orders of magnitude faster in kubespawner with Use reflector pattern to speed up polls kubespawner#63
  3. Started the process of making the connection pool's maxsize configurable in the official kubernetes client: Allow setting maxsize for PoolManager kubernetes-client/python-base#18
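
For reference, switching the Hub's HTTP client to pycurl is essentially a one-line tornado configuration. A rough sketch of what that amounts to (not the exact JupyterHub patch):

```python
# A rough sketch of what "use pycurl inside the hub" amounts to: tell tornado
# to use its curl-based client (requires pycurl) and raise the number of
# simultaneous requests. Not the exact JupyterHub patch.
from tornado.httpclient import AsyncHTTPClient

AsyncHTTPClient.configure(
    "tornado.curl_httpclient.CurlAsyncHTTPClient",
    max_clients=64,  # allow more concurrent proxy/API requests in flight
)
```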

@yuvipanda
Collaborator Author

A big problem we ran into was that we were constantly hit by redirect loops. @minrk and I dug through this (yay teamwork? I made some comments and went to sleep, and he had solved it by the time I woke up!), and jupyterhub/jupyterhub#1221 seems to get rid of them all!

When I ran a 2000 user test, about 150 failed to start. Some of them were because of this curl error:

[E 2017-07-14 20:09:46.147 JupyterHub ioloop:638] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f8cfbdc1ea0>, <tornado.concurrent.Future object at 0x7f8c30a87048>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 605, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.5/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
        return fn(*args, **kwargs)
      File "/usr/local/lib/python3.5/dist-packages/tornado/ioloop.py", line 626, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/app.py", line 1382, in update_last_activity
        routes = yield self.proxy.get_all_routes()
      File "/usr/local/lib/python3.5/dist-packages/jupyterhub/proxy.py", line 556, in get_all_routes
        resp = yield self.api_request('', client=client)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 20553 milliseconds with 163693 bytes received

This could partially be fixed by tuning pycurl. However, it failed a 20s timeout, which is pretty generous. I think a k8s ingress based Proxy implementation would help here, and might be the next bottleneck to solve.

The spawn times increased linearly with time, eventually going over the 300s timeout we have for image starts. We should take a profiler to the hub process (not sure how to profile tornado code, though) and figure out what is taking a linearly increasing amount of time.

@yuvipanda
Collaborator Author

Ran another 5000-user test, and am now convinced that the next step is a proxy implementation that does not need to make a network request for each get_all_routes() call.

@minrk
Member

minrk commented Jul 15, 2017

I suspect the slowness of get_all_routes may be a bit misleading, because lots of actions are using the same IOLoop (and even the same httpclient) while waiting for Spawners to start, etc. My guess is that it's all tied up in the busyness of the Hub IOLoop. Plus, get_all_routes only happens once every five minutes, so it seems a bit odd for it to be a major contributing factor.

When I run get_all_routes on its own when things are otherwise idle, it takes 50ms to fetch 20k routes, so it's unlikely that the fetch is actually what's taking that time. My guess is that it's something like:

  1. start timer
  2. initiate fetch
  3. yield to other tasks
  4. check timer (all those other tasks ate my 20 seconds!)

Under load, presumably CHP would take longer to respond to the request, but 20s seems like a lot compared to 50ms.
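
One way to test that hypothesis (a minimal sketch, not something from the Hub itself) is to measure how late a periodic callback fires on the Hub's IOLoop; lag anywhere near the 20s request timeout would mean the loop, not the proxy fetch, is eating the time:

```python
# Minimal sketch (not from the Hub) to test the busy-IOLoop hypothesis:
# schedule a callback every second and measure how late it actually fires.
import time

from tornado.ioloop import IOLoop, PeriodicCallback

last = time.monotonic()

def report_lag():
    global last
    now = time.monotonic()
    lag = now - last - 1.0  # anything beyond the 1s interval is loop lag
    if lag > 0.1:
        print("IOLoop lag: %.2fs" % lag)
    last = now

PeriodicCallback(report_lag, 1000).start()
IOLoop.current().start()
```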

I think it's important to separate what we're testing as concurrent:

  1. concurrent spawns
  2. concurrent active users (talking only to their server, not the Hub)

In particular, JupyterHub is based on the idea that login/spawn is a relatively infrequent event compared to actually talking to their server, so while stress testing all users concurrently spawning is useful as a worst-case scenario, it doesn't give an answer to the question "how many users can I support" because it is not a realistic load of concurrent users.

I'd love a load test that separated the number of people trying to spawn from the number of people concurrently trying to use their running notebook servers (opening & saving, running code, etc.). e.g. being able to ask the questions:

  • What's the performance of 5k users when nobody is spawning new servers
  • What's the performance of one login when 5k users are active
  • 100 users always logging in / out, 5k active
  • etc.

@yuvipanda
Collaborator Author

+1 on load testing a matrix of factors!

We should also identify what 'performance' means. I think it's:

  1. Total time to get from login to notebook on browser (+ failure counts)
  2. Time taken to start a kernel and execute a reference notebook over time (this will happen in a loop)

Anything else, @minrk? If we nail these down as things we wanna measure, then we can form the test matrix properly...
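
For (1), a hypothetical sketch of how login-to-server time could be measured through the Hub's REST API rather than a browser (the URL and token are placeholders; this is not an existing test script):

```python
# Hypothetical sketch for metric 1: time a spawn through the Hub REST API.
# HUB_API and the token are placeholders; not an existing test script.
import time

import requests

HUB_API = "http://hub.example.com/hub/api"
HEADERS = {"Authorization": "token YOUR_ADMIN_TOKEN"}

def time_spawn(username):
    start = time.perf_counter()
    # Ask the Hub to start the user's server...
    requests.post("%s/users/%s/server" % (HUB_API, username), headers=HEADERS)
    # ...then poll until the spawn is no longer pending and the server is up.
    while True:
        user = requests.get("%s/users/%s" % (HUB_API, username),
                            headers=HEADERS).json()
        if user.get("server") and not user.get("pending"):
            break
        time.sleep(1)
    return time.perf_counter() - start
```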

@willingc
Contributor

This is awesome @yuvipanda

@yuvipanda
Collaborator Author

I think the ideal way for us to stress test is to define a function that draws a 'current active users' graph - spikes will have more users logging in, slumps will have users dropping out, and steady states will do nothing. This will let us generate varying kinds of load just by tweaking the shape of this graph (which you can do interactively).
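
Roughly something like this (a hypothetical sketch; the function names are made up, and start_user/stop_user would wrap whatever the simulated-user harness does):

```python
# Hypothetical sketch: the load is defined entirely by a target_users(t)
# curve, and the runner logs simulated users in or out to track it.
import math
import time

def target_users(t):
    """Desired number of concurrently active users, t seconds into the test."""
    base = 2000
    spike = 1500 * max(0.0, math.sin(t / 300.0))  # periodic login spikes
    return int(base + spike)

def run(duration, start_user, stop_user):
    active = []
    t0 = time.time()
    while time.time() - t0 < duration:
        want = target_users(time.time() - t0)
        while len(active) < want:    # spike: more users log in
            active.append(start_user())
        while len(active) > want:    # slump: users drop out
            stop_user(active.pop())
        time.sleep(5)
```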

@minrk
Member

minrk commented Jul 17, 2017

That looks like a good starting point.

Some 'performance' metrics:

Hub:

  1. login + spawn time
  2. HTTP roundtrip (GET /hub/home)
  3. API roundtrip (GET /hub/api/users)

Singleuser (spawned and ready to go):

  1. full-go (open notebook, start kernel, run all, save)
  2. websocket roundtrip (single message)
  3. HTTP roundtrip (GET /api/contents)
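
A sketch of sampling the two Hub roundtrip metrics above (base URL and token are placeholders):

```python
# Sketch of sampling the Hub HTTP and API roundtrip metrics listed above;
# the base URL and token are placeholders.
import time

import requests

BASE = "http://hub.example.com"
HEADERS = {"Authorization": "token YOUR_API_TOKEN"}

def roundtrip(path):
    start = time.perf_counter()
    requests.get(BASE + path, headers=HEADERS)
    return time.perf_counter() - start

print("GET /hub/home: %.3fs" % roundtrip("/hub/home"))            # HTTP roundtrip
print("GET /hub/api/users: %.3fs" % roundtrip("/hub/api/users"))  # API roundtrip
```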

@yuvipanda
Collaborator Author

So I did more stuff around this in the last couple of days.

  1. I switched to a kubernetes ingress provider for proxying. This is great, but still a bit racy. Hopefully we'll tackle all of the races.
  2. I did a performance profile (using pyflame), and it looks like we're spending a good chunk of CPU on auth-related methods. There are ways to optimize there.
  3. I'm currently trying out pypy to see if it gives us a boost!

@yuvipanda
Collaborator Author

I also filed jupyterhub/jupyterhub#1253 - that's the most CPU-intensive hotspot we've found. Plus, there are lots of well-tested libraries for this that we should use.

@yuvipanda
Collaborator Author

yuvipanda commented Jul 23, 2017

Filed kubernetes-client/python-base#23 for issues with kubernetes api-client.

Update: It's most likely a PyPy bug: https://bitbucket.org/pypy/pypy/issues/2615/this-looks-like-a-thread-safety-bug

@yuvipanda
Collaborator Author

We can offload TLS to a kubectl proxy instance that's running as a sidecar container, providing a http endpoint that's bound to localhost. I've filed kubernetes-client/python#303 on figuring out how to do this.
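
With a kubectl proxy sidecar, the Python client just needs to be pointed at the plain-HTTP localhost endpoint. A sketch, assuming recent client versions and that the sidecar listens on port 8001:

```python
# Sketch: point the official kubernetes client at a kubectl proxy sidecar on
# localhost, so the Python side does no TLS work. The port (8001) is whatever
# the sidecar is configured to listen on.
from kubernetes import client

configuration = client.Configuration()
configuration.host = "http://localhost:8001"
v1 = client.CoreV1Api(client.ApiClient(configuration))

# All calls now go over plain HTTP to the sidecar, which handles TLS to the
# API server.
pods = v1.list_namespaced_pod(namespace="jupyterhub")
```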

@yuvipanda
Collaborator Author

@minrk and I tracked down the SQLAlchemy issues to jupyterhub/jupyterhub#1257 - and it has improved performance considerably!

hashlib is the next step!

@yuvipanda
Collaborator Author

More notes:

  1. We have lots of scaling things to do that are dependent on 'number of spawns pending'.
  2. We're now using threads that do IO, and our main thread depends on dicts updated by those threads. Figure out how to profile GIL contention.
  3. Try to reduce the amount of busy looping that we do in the proxy & spawner, and implement exponential backoffs instead of 'gen.sleep(1)'.
  4. Figure out if we really can stop salting our hashes if they're from UUIDs - or at least stop salting the specific hashes that come from UUIDs. The prefix based scheme might be part of our problem.
  5. Move all sqlalchemy operations to a single external thread (not a threadpool probably)

@minrk
Member

minrk commented Jul 24, 2017

Move all sqlalchemy operations to a single external thread (not a threadpool probably)

This is probably the hardest one, because we have to make sure that we bundle db operations within a single synchronous function that we pass to a thread, whereas right now we only have to ensure that we commit in between yields.
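
A sketch of what that bundling could look like (not JupyterHub's actual code; the table, column names, and session_factory are placeholders):

```python
# Sketch of the single-external-DB-thread idea: each logical operation's
# SQLAlchemy work is bundled into one synchronous function and submitted to a
# one-thread executor, so the session is only ever touched from that thread.
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import text
from tornado import gen

db_executor = ThreadPoolExecutor(max_workers=1)  # the single DB thread

def _update_last_activity(session_factory, user_name, timestamp):
    """All db access for this operation, bundled in one synchronous call."""
    session = session_factory()  # a sessionmaker bound to the engine
    try:
        session.execute(
            text("UPDATE users SET last_activity = :ts WHERE name = :name"),
            {"ts": timestamp, "name": user_name},
        )
        session.commit()
    finally:
        session.close()

@gen.coroutine
def update_last_activity(session_factory, user_name, timestamp):
    # The coroutine never touches the session; it only waits for the bundled
    # synchronous function to finish on the DB thread.
    yield db_executor.submit(_update_last_activity,
                             session_factory, user_name, timestamp)
```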

@yuvipanda
Collaborator Author

Additional point - make sure that each tornado tick isn't doing too much work. We want the work spread out as evenly as possible, so we don't get ticks that are too long followed by ticks that are too short. I'm implementing some jitter in the exponential backoff algorithm for this now...
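
A sketch of what the jittered backoff could look like for the spawn/proxy polling loops (not the exact JupyterHub implementation; the function and parameter names are illustrative):

```python
# Sketch of jittered exponential backoff for polling loops, replacing flat
# gen.sleep(1); not the exact JupyterHub implementation.
import random

from tornado import gen

@gen.coroutine
def wait_until_ready(ready, initial=0.2, max_wait=10, timeout=300):
    """Poll ready() with exponentially growing, jittered sleeps."""
    waited = 0
    limit = initial
    while waited < timeout:
        result = yield ready()
        if result:
            return True
        # Full jitter: pick a random sleep up to the current limit, so many
        # waiters don't all wake up in the same tornado tick.
        sleep = random.uniform(0, limit)
        yield gen.sleep(sleep)
        waited += sleep
        limit = min(limit * 2, max_wait)
    raise TimeoutError("not ready after %ss" % timeout)
```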

@yuvipanda
Collaborator Author

We fixed a ton of stuff doing this! YAY!

@nnbaokhang

Sorry, but how do we run the stress test? I don't think I see any tutorial yet. Correct me if I'm wrong.
