Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python) #114

Open
sylus opened this issue Aug 11, 2020 · 10 comments

sylus commented Aug 11, 2020

/kind bug

What steps did you take and what happened:

Hello amazing kubeflow community.

Apologies in advance if I struggle to explain this issue properly but here is my attempt.

It seems that the longer the Jupyter Web App Flask application has been running, and the more frequently it is queried, the more likely subsequent API calls are to fail. This has resulted in people being unable to submit their workloads or to have certain fields populated.

The API queries that often first go into pending and then fail quite a few minutes later are:

  • /jupyter/api/namespaces/USERNAME/notebooks
  • /jupyter/api/namespaces/USERNAME/pvcs
  • /jupyter/api/namespaces/USERNAME/poddefaults

We also notice env-info taking a long time, but there is a corresponding and separate issue for that, and that call always seems to return.

The following picture shows that we are unable to open the configurations dropdown, which is populated from the PodDefaults, and that two API calls are pending that will ultimately fail. (A log with the stack trace from after the failure happens is also attached.)

jupyter-web-app
jupyter-web-app.txt

In the stack trace we get the following messages, which come from the kubernetes-client/python library.

2020-08-11 00:02:39,869 | kubeflow_jupyter.default.app | ERROR | Exception on /api/namespaces/USERNAME/poddefaults [GET]
...
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api/authorization_v1_api.py", line 389, in create_subject_access_review
...
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)
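
The read timeout in that call is None, so a blocked connection can hang a worker indefinitely. For reference, a minimal sketch of passing a per-call timeout to kubernetes-client/python (the resource attributes below are illustrative, not the web app's exact call):

from kubernetes import client, config

config.load_incluster_config()
authz = client.AuthorizationV1Api()

review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="USERNAME",
        resource_attributes=client.V1ResourceAttributes(
            verb="list", group="kubeflow.org",
            resource="poddefaults", namespace="USERNAME"),
    )
)
# _request_timeout takes a single float or a (connect, read) tuple in seconds,
# so a stuck read fails fast instead of blocking the worker forever.
authz.create_subject_access_review(review, _request_timeout=(5, 15))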

What did you expect to happen:

All of the API calls to succeed quickly rather than going into pending and eventually timing out.

Anything else you would like to add:

We have approximately 125 users in our kubeflow environment and populate a PodDefault for each user. All of the API calls are fast and work for roughly an hour (sometimes less) after we restart the Jupyter Web App:

kubectl rollout restart deploy -n kubeflow jupyter-web-app-deployment

It should also be noted that for the first two months of operating our kubeflow cluster we didn't seem to experience any of these issues, which leads us to assume this might be a scaling issue, though we absolutely don't rule out something we may have done ourselves.

As a further debugging step we have made some slight adjustments to the Jupyter Web App (Flask) application, such as running it behind gunicorn (with minor adjustments in our fork, linked below), but the problem still seems to arise, though with a noticeable improvement in latency.

StatCan/kubeflow@d2a2f5f
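
For reference, a minimal gunicorn.conf.py along these lines (the exact values here are illustrative, not necessarily what is in the commit above):

# gunicorn.conf.py
bind = "0.0.0.0:5000"   # port is an assumption; match the container's exposed port
workers = 3             # separate processes, so one blocked K8s call doesn't stall everything
threads = 8             # >1 thread switches to the gthread worker, allowing concurrent requests per process
timeout = 120           # recycle workers stuck longer than this instead of hanging forever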

Finally, we have started to rewrite all of the Jupyter API requests in a Go backend to see if we get the same problems. At the moment we don't have any issues whatsoever with the Go backend. We also set it up to use a watch stream from the API server rather than making API calls all the time, so it has a local cache of everything; the only API call it makes per request is checking your authentication.

https://github.com/StatCan/jupyter-apis (Go Backend)
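
For illustration, a rough Python analogue of that approach (a local cache kept up to date by a watch stream instead of listing on every request); the group/version/plural values here are assumptions for the Notebook CRD:

import threading
from kubernetes import client, config, watch

config.load_incluster_config()
api = client.CustomObjectsApi()
CACHE = {}  # (namespace, name) -> notebook object

def watch_notebooks():
    w = watch.Watch()
    for event in w.stream(api.list_cluster_custom_object,
                          group="kubeflow.org", version="v1", plural="notebooks"):
        obj = event["object"]
        key = (obj["metadata"]["namespace"], obj["metadata"]["name"])
        if event["type"] == "DELETED":
            CACHE.pop(key, None)
        else:
            CACHE[key] = obj  # ADDED / MODIFIED

# Request handlers then read CACHE instead of calling the API server.
threading.Thread(target=watch_notebooks, daemon=True).start()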

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): 1.0.1
  • kfctl version: (use kfctl version): 1.0.1
  • Kubernetes platform: (e.g. minikube) AKS
  • Kubernetes version: (use kubectl version): 1.15.10
  • OS (e.g. from /etc/os-release): 16.04
@k8s-ci-robot added the kind/bug label Aug 11, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label: area/jupyter (probability 0.89)


@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@sylus changed the title from "Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python) running with scale" to "Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python)" Aug 11, 2020
@DanielSCon40

DanielSCon40 commented Aug 30, 2020

We also experienced this behavior with a smaller number (<15) of profiles.

@kimwnasptd
Member

Thanks for the detailed report @sylus!

We should indeed move to gunicorn for serving the app instead of the Flask development server.

What about increasing the number of replicas the jupyter web app has?
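
For example, something along these lines (deployment name taken from the restart command earlier in this issue):

kubectl scale deployment -n kubeflow jupyter-web-app-deployment --replicas=3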

@stale

stale bot commented Dec 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@sylus
Author

sylus commented Dec 28, 2020

/remove-lifecycle stale

@yanniszark

/lifecycle frozen

@davidspek

@kimwnasptd Would it be an idea to look into moving the notebooks API to the Go version created over at https://github.com/StatCan/jupyter-apis after release 1.3, once the testing is hopefully all up and running?

@kimwnasptd
Member

Hey everyone, I've made some progress on this and have a PR that implements caching in the backend: kubeflow/kubeflow#7080. This should initially help with the load on the backend, since it will no longer need to perform requests to K8s for every call and will have its own cache.

A next step afterwards will be to extend the frontends to keep polling, but with proper pagination against the backend.

For example, now that the backend always has the full list of objects in memory, it can answer page requests like: I want the 3rd page, where each page has 20 items.
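
A minimal sketch of what such a paged endpoint could look like on top of an in-memory cache (the names here are hypothetical, not the actual code in that PR):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical per-namespace cache, kept fresh by the caching layer in the PR.
NOTEBOOK_CACHE = {}

@app.route("/api/namespaces/<namespace>/notebooks")
def list_notebooks(namespace):
    page = int(request.args.get("page", 1))    # 1-based page index
    size = int(request.args.get("size", 20))   # items per page
    items = NOTEBOOK_CACHE.get(namespace, [])  # served from cache, no live K8s call
    start = (page - 1) * size
    return jsonify({"items": items[start:start + size],
                    "total": len(items),
                    "page": page})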

@andreyvelich
Member

/transfer notebooks

@github-project-automation bot moved this to Needs Triage in Kubeflow Notebooks Nov 11, 2024
@google-oss-prow bot transferred this issue from kubeflow/kubeflow Nov 11, 2024