Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python) #114

Open
sylus opened this issue Aug 11, 2020 · 10 comments

sylus commented Aug 11, 2020

/kind bug

What steps did you take and what happened:

Hello amazing kubeflow community.

Apologies in advance if I struggle to explain this issue properly but here is my attempt.

It seems that the longer the Jupyter Web App Flask application has been running, and the more frequently it is queried, the more likely subsequent API calls are to fail. This has resulted in people being unable to submit their workloads or to have certain fields populated.

The API queries that often first go into pending and then fail quite a few minutes later are:

  • /jupyter/api/namespaces/USERNAME/notebooks
  • /jupyter/api/namespaces/USERNAME/pvcs
  • /jupyter/api/namespaces/USERNAME/poddefaults

We also notice env-info taking a long time, but there is a corresponding and separate issue for that, and that call always seems to return.

The following picture shows that we are unable to open the configurations dropdown, which is populated from the PodDefaults, and that two API calls are pending that will ultimately fail. (A log with the stack trace from after the failure happens is also attached.)

jupyter-web-app
jupyter-web-app.txt

In the stack trace we get the following messages, which come from the kubernetes-client/python library.

2020-08-11 00:02:39,869 | kubeflow_jupyter.default.app | ERROR | Exception on /api/namespaces/USERNAME/poddefaults [GET]
...
File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api/authorization_v1_api.py", line 389, in create_subject_access_review
...
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out. (read timeout=None)
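
The read timeout in that call is None, so a blocked connection can hang a worker indefinitely. For reference, a minimal sketch of passing a per-call timeout to kubernetes-client/python (the resource attributes below are illustrative, not the web app's exact call):

from kubernetes import client, config

config.load_incluster_config()
authz = client.AuthorizationV1Api()

review = client.V1SubjectAccessReview(
    spec=client.V1SubjectAccessReviewSpec(
        user="USERNAME",
        resource_attributes=client.V1ResourceAttributes(
            verb="list", group="kubeflow.org",
            resource="poddefaults", namespace="USERNAME"),
    )
)
# _request_timeout takes a single float or a (connect, read) tuple in seconds,
# so a stuck read fails fast instead of blocking the worker forever.
authz.create_subject_access_review(review, _request_timeout=(5, 15))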

What did you expect to happen:

All of the API calls to succeed quickly rather than going into pending and eventually timing out.

Anything else you would like to add:

We have approximately 125 users in our kubeflow environment and populate a PodDefault for each user. All of the API calls are fast and work for roughly an hour (sometimes less) after we restart the Jupyter Web App:

kubectl rollout restart deploy -n kubeflow jupyter-web-app-deployment

It should also be noted that for the first two months of operating our kubeflow cluster we didn't seem to experience any of these issues, which leads us to assume this might be a scaling issue, though we absolutely don't rule out something we may have done ourselves.

As a further debugging step we have made some slight adjustments to the Jupyter Web App (Flask) application, such as running it behind gunicorn (with minor adjustments in our fork, linked below), but the problem still seems to arise, though with a noticeable improvement in latency.

StatCan/kubeflow@d2a2f5f
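
For reference, a minimal gunicorn.conf.py along these lines (the exact values here are illustrative, not necessarily what is in the commit above):

# gunicorn.conf.py
bind = "0.0.0.0:5000"   # port is an assumption; match the container's exposed port
workers = 3             # separate processes, so one blocked K8s call doesn't stall everything
threads = 8             # >1 thread switches to the gthread worker, allowing concurrent requests per process
timeout = 120           # recycle workers stuck longer than this instead of hanging forever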

Finally, we have started to rewrite all of the Jupyter API requests in a Go backend to see if we get the same problems. At the moment we don't have any issues whatsoever with the Go backend. We also set it up to use a watch stream from the API server rather than making API calls all the time, so it has a local cache of everything; the only API call it makes per request is checking your authentication.

https://github.com/StatCan/jupyter-apis (Go Backend)
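
For illustration, a rough Python analogue of that approach (a local cache kept up to date by a watch stream instead of listing on every request); the group/version/plural values here are assumptions for the Notebook CRD:

import threading
from kubernetes import client, config, watch

config.load_incluster_config()
api = client.CustomObjectsApi()
CACHE = {}  # (namespace, name) -> notebook object

def watch_notebooks():
    w = watch.Watch()
    for event in w.stream(api.list_cluster_custom_object,
                          group="kubeflow.org", version="v1", plural="notebooks"):
        obj = event["object"]
        key = (obj["metadata"]["namespace"], obj["metadata"]["name"])
        if event["type"] == "DELETED":
            CACHE.pop(key, None)
        else:
            CACHE[key] = obj  # ADDED / MODIFIED

# Request handlers then read CACHE instead of calling the API server.
threading.Thread(target=watch_notebooks, daemon=True).start()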

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): 1.0.1
  • kfctl version: (use kfctl version): 1.0.1
  • Kubernetes platform: (e.g. minikube) AKS
  • Kubernetes version: (use kubectl version): 1.15.10
  • OS (e.g. from /etc/os-release): 16.04
@k8s-ci-robot added the kind/bug label Aug 11, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label: area/jupyter (probability 0.89)


@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

@sylus changed the title from "Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python) running with scale" to "Possible multi-threading issue w/Jupyter Web App (kubernetes-client/python)" Aug 11, 2020
@DanielSCon40

DanielSCon40 commented Aug 30, 2020

We also experienced this behavior with a smaller number (<15) of profiles.

@kimwnasptd
Member

Thanks for the detailed report @sylus!

We should indeed move to gunicorn for serving the app instead of the Flask development server.

What about increasing the number of replicas the jupyter web app has?
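
For example, something along these lines (deployment name taken from the restart command earlier in this issue):

kubectl scale deployment -n kubeflow jupyter-web-app-deployment --replicas=3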

@stale

stale bot commented Dec 28, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@sylus
Author

sylus commented Dec 28, 2020

/remove-lifecycle stale

@yanniszark

/lifecycle frozen

@davidspek

@kimwnasptd Would it be an idea to look into moving the notebooks API to the Go version created over at https://github.com/StatCan/jupyter-apis after release 1.3, once the testing is hopefully all up and running?

@kimwnasptd
Member

Hey everyone, I've made some progress on this and have a PR that implements caching in the backend: kubeflow/kubeflow#7080. This should initially help with the load on the backend, since it will no longer need to perform requests to K8s for every call and will have its own cache.

A next step afterwards will be to extend the frontends to keep polling, but with proper pagination against the backend.

For example, now that the backend always has the full list of objects in memory, it can answer page requests like: I want the 3rd page, where each page has 20 items.
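
A minimal sketch of what such a paged endpoint could look like on top of an in-memory cache (the names here are hypothetical, not the actual code in that PR):

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical per-namespace cache, kept fresh by the caching layer in the PR.
NOTEBOOK_CACHE = {}

@app.route("/api/namespaces/<namespace>/notebooks")
def list_notebooks(namespace):
    page = int(request.args.get("page", 1))    # 1-based page index
    size = int(request.args.get("size", 20))   # items per page
    items = NOTEBOOK_CACHE.get(namespace, [])  # served from cache, no live K8s call
    start = (page - 1) * size
    return jsonify({"items": items[start:start + size],
                    "total": len(items),
                    "page": page})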

@andreyvelich
Member

/transfer notebooks

@github-project-automation bot moved this to Needs Triage in Kubeflow Notebooks Nov 11, 2024
@google-oss-prow bot transferred this issue from kubeflow/kubeflow Nov 11, 2024