
[ENH] - Make idle culler settings easily configurable and documented how to change #1283

Closed
costrouc opened this issue May 14, 2022 · 12 comments · Fixed by #1689
Labels
area: user experience 👩🏻‍💻 · needs: investigation 🔍 · type: enhancement 💅🏼

Comments

@costrouc
Member

Feature description

Currently, much of the idle-culler configuration is hard-coded. @rsignell-usgs raised this as a concern: the current timeout is too short in some cases.

Value and/or benefit

The default idle timeout does not work for everyone.

Anything else?

No response

@costrouc added the type: enhancement 💅🏼 and needs: triage 🚦 labels on May 14, 2022
@viniciusdc
Contributor

Hi @costrouc, do they want a per-user configuration, or are they happy to have it set in the qhub-config?

@rsignell-usgs
Contributor

rsignell-usgs commented May 16, 2022

@viniciusdc and @costrouc , we would be happy to set this in the qhub-config.
One of the worst aspects of the timeout being so short is that any terminal sessions disappear.
Thanks for taking a look!

@rsignell-usgs
Contributor

rsignell-usgs commented Jul 27, 2022

Folks, what would it take to enable this?

This is the top complaint I've heard from ESIP Qhub users.

Even if it weren't configurable and the qhub devs just made it longer, that would be wonderful. Right now it must be 5 minutes, right?

It would be great if dask clusters spun down in 30 min, and notebooks spun down in 90 min or 3 hours.

Just for comparison, AWS SageMaker Studio Lab, the free notebook offering from AWS, times out after 4 hours for a GPU, 12 hours for a CPU.

@iameskild
Member

Hi @rsignell-usgs, I will make sure this issue is prioritized for our next sprint (which starts next week). I can't promise it will be configurable from the qhub-config.yaml but I will work with the team to come up with a workable solution asap. Thanks again for the reminder!!

@iameskild added the needs: investigation 🔍 label on Jul 29, 2022
@iameskild self-assigned this on Jul 29, 2022
@rsignell-usgs
Contributor

Okay, thanks @iameskild. The users will definitely appreciate any improvement in the situation, even if not configurable!

@rsignell-usgs
Contributor

@iameskild , I remember you showed me how to (temporarily) override the short culler settings by connecting to some pod and editing a config file, right? After the upgrade from 0.4.3 to 0.4.4, the users are screaming again about the too-short timeout for their servers.

@iameskild
Member

Hey @rsignell-usgs, for now, you can manually edit the etc-jupyter configmap if you want to make changes to the timeout settings.
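For reference, a rough sketch of what that ConfigMap might contain (the key name jupyter_notebook_config.py and the dev namespace are assumptions for illustration; inspect the actual ConfigMap in your deployment for the real file name and namespace):

apiVersion: v1
kind: ConfigMap
metadata:
  name: etc-jupyter
  namespace: dev    # assumed namespace; use your deployment's namespace
data:
  # the key name below is an assumption; check the real ConfigMap for the actual file name
  jupyter_notebook_config.py: |
    # culling settings read by each user's Jupyter server at startup
    c.MappingKernelManager.cull_idle_timeout = 30 * 60
    c.MappingKernelManager.cull_interval = 30 * 60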

I still have to circle back to this when I have more time, but as a quick update: I was looking into using Terraform's templatefile function to make these values more easily configurable.

@viniciusdc
Contributor

viniciusdc commented Oct 25, 2022

This can also be achieved by using overrides in the jupyterhub configuration to change the idle-culling values. Right now, the values that can be changed are those here:

jupyterhub:
  overrides:
    cull:
      users: true

Some values come from the idle-culler extension; as of now, the override method above is the only way to update them.
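For illustration, here is a fuller sketch of what such an override block could look like in the qhub-config. The cull keys map onto the JupyterHub Helm chart's culler settings, so the exact keys available depend on the chart version deployed; the values below (in seconds) are assumptions, not recommendations:

jupyterhub:
  overrides:
    cull:
      enabled: true    # run the hub-side idle culler
      users: true      # also remove the user object, not just stop the server
      timeout: 5400    # seconds of inactivity before a server is culled (90 min)
      every: 600       # how often (seconds) the culler checks for idle servers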

@rsignell-usgs
Contributor

To change these, can I use k9s to ssh into the hub-** pod and then just edit them?

@iameskild
Member

@rsignell-usgs yep, just edit the file. You may need to kill the hub pod for the changes to take effect.

@rsignell-usgs
Contributor

What is the filename once I've ssh'ed into the hub pod?

@rsignell-usgs
Contributor

rsignell-usgs commented Oct 25, 2022

Here's the workaround recipe that should modify the cull settings (at least until the next qhub/nebari version is deployed):

  • in k9s, type ":configmap"
  • use the arrow keys to highlight the etc-jupyter configmap
  • hit the e key to edit (make the changes below), then press Esc
  • still in k9s, type ":pod"
  • use the arrow keys to highlight the pod that starts with hub-xx
  • kill the pod (don't worry, it will regenerate in just a few seconds)

Just for the record, I set everything to 30 minutes:


    # The interval (in seconds) on which to check for terminals exceeding the
    # inactive timeout value.
    c.TerminalManager.cull_interval = 30 * 60

    # cull_idle_timeout: timeout (in seconds) after which an idle kernel is
    # considered ready to be culled
    c.MappingKernelManager.cull_idle_timeout = 30 * 60

    # cull_interval: the interval (in seconds) on which to check for idle
    # kernels exceeding the cull timeout value
    c.MappingKernelManager.cull_interval = 30 * 60

    # cull_connected: whether to consider culling kernels which have one
    # or more connections
    c.MappingKernelManager.cull_connected = True

    # cull_busy: whether to consider culling kernels which are currently
    # busy running some code
    c.MappingKernelManager.cull_busy = False

    # Shut down the server after N seconds with no kernels or terminals
    # running and no activity.
    c.NotebookApp.shutdown_no_activity_timeout = 30 * 60
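For what it's worth, these c.* settings are read by each user's Jupyter server when it starts, so user servers that were already running will most likely only pick up the new values after they are stopped and started again.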
