Add initial uptime metrics #2609

Merged
merged 26 commits into develop from add-uptime-monitoring
Aug 30, 2024

Conversation

dcmcand
Contributor

@dcmcand dcmcand commented Aug 5, 2024

Reference Issues or PRs

part of #2557

What does this implement/fix?

Adds a kuberhealthy service, which allows in-cluster synthetic testing. This initial PR includes the service and basic HTTP checks for conda-store, Keycloak, and JupyterHub. The check results are exposed as metrics in Grafana, where they can be queried to show uptime or to create alerts.

For example, to show the average uptime of conda-store over the past 30 days, you can run:
1 - (sum(count_over_time(kuberhealthy_check{check="dev/conda-store-http-check", status="0"}[30d])) OR vector(0))/(sum(count_over_time(kuberhealthy_check{check="dev/conda-store-http-check", status="1"}[30d])) * 100)
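
For readers less familiar with PromQL, here is a minimal Python sketch of the kind of arithmetic an uptime query performs on the two sample counts. It assumes `status="1"` marks a passing sample and `status="0"` a failing one, as the label values in the query suggest. The function name `uptime_ratio` is hypothetical, and this is one plain way to turn the counts into a ratio, not a line-by-line transcription of the query above.

```python
def uptime_ratio(passed: int, failed: int) -> float:
    """Return the fraction of check samples that passed.

    `passed` and `failed` stand in for the two count_over_time() sums
    (status="1" and status="0"); both names are illustrative assumptions.
    """
    if passed < 0 or failed < 0:
        raise ValueError("sample counts cannot be negative")
    total = passed + failed
    if total == 0:
        # With no samples in the window there is nothing to average over.
        raise ValueError("no check samples in the window")
    return passed / total


if __name__ == "__main__":
    # Hypothetical counts for one 30-day window.
    print(f"uptime: {uptime_ratio(passed=8600, failed=40):.2%}")
```

Treating a missing failure series as zero failures, which is what `OR vector(0)` does in the query, corresponds to passing `failed=0` here.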

Screenshot from 2024-08-27 15-19-53

Kuberhealthy is controlled by a new config setting. It defaults to disabled for the moment. To enable kuberhealthy, add

monitoring:
  healthchecks:
    enabled: true

to your nebari-config.yaml file.
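
As a rough illustration of the "disabled by default" behaviour, the sketch below reads the flag from an already-parsed config mapping. It is not Nebari's schema code: the helper name `healthchecks_enabled` is hypothetical, and the only thing taken from this PR is the `monitoring.healthchecks.enabled` key path.

```python
from typing import Any, Mapping


def healthchecks_enabled(config: Mapping[str, Any]) -> bool:
    """Return True only if monitoring.healthchecks.enabled is set to true.

    A missing or empty `monitoring` or `healthchecks` block counts as
    disabled, which mirrors the default described in this PR.
    """
    monitoring = config.get("monitoring") or {}
    healthchecks = monitoring.get("healthchecks") or {}
    return bool(healthchecks.get("enabled", False))


if __name__ == "__main__":
    parsed = {"monitoring": {"healthchecks": {"enabled": True}}}
    print(healthchecks_enabled(parsed))
    print(healthchecks_enabled({"provider": "local"}))
```

The `or {}` fallbacks also cover a key that is present but left empty in the YAML, which a parser would typically load as `None`.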

Limitations and follow-on work

Currently, kuberhealthy and all checks deploy to the dev namespace, which is the Nebari default. If you are using a namespace other than dev, kuberhealthy should be left disabled.

A follow-up PR is planned to address this and take the namespace from the config.

A follow-up PR to add an uptime monitoring dashboard is also planned.

These checks are currently set to run every 5 minutes with a 10-minute timeout and a failing percentage of 80%. I intend to make these values configurable in a future PR as well.
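
To put the 5-minute interval in perspective, this small sketch works out how many runs a single check would produce over a query window, assuming it runs exactly on schedule. The function name `expected_runs` is hypothetical.

```python
from datetime import timedelta


def expected_runs(window: timedelta, interval: timedelta) -> int:
    """Number of whole check intervals that fit in the query window."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # Floor division of two timedeltas yields an int.
    return window // interval


if __name__ == "__main__":
    runs = expected_runs(timedelta(days=30), timedelta(minutes=5))
    print(runs)
    # Share of the 30-day window represented by a single run.
    print(f"{1 / runs:.4%}")
```

With these settings, a 30-day window holds 8640 runs per check, so each individual run accounts for a little over 0.01% of the window.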

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@dcmcand dcmcand changed the title from "Begin adding kuberhealthy" to "[DO NOT MERGE] Draft: Begin adding kuberhealthy" Aug 5, 2024
@dcmcand dcmcand marked this pull request as draft August 5, 2024 13:23
@dcmcand dcmcand changed the title from "[DO NOT MERGE] Draft: Begin adding kuberhealthy" to "Begin adding kuberhealthy" Aug 5, 2024
@dcmcand dcmcand marked this pull request as ready for review August 26, 2024 17:04
@dcmcand dcmcand changed the title from "Begin adding kuberhealthy" to "Add initial uptime metrics" Aug 27, 2024
@marcelovilla
Member

@dcmcand I'm trying this locally (on an M1) and I'm seeing this when deploying:

Downloading https://get.helm.sh/helm-v3.15.3-darwin-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3
helm installed into /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3/helm
helm not found. Is /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3 on your $PATH?
Failed to install helm with the arguments provided: -v v3.15.3 --no-sudo
Accepted cli arguments are:
	[--help|-h ] ->> prints this help
	[--version|-v <desired_version>] . When not defined it fetches the latest release from GitHub
	e.g. --version v3.0.0 or -v canary
	[--no-sudo]  ->> install without sudo
	For support, go to https://github.com/helm/helm.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/ │
│ deploy.py:79 in deploy                                                       │
│                                                                              │
│   76 │   │   config = read_configuration(config_filename, config_schema=conf │
│   77 │   │                                                                   │
│   78 │   │   if not disable_render:                                          │
│ ❱ 79 │   │   │   render_template(output_directory, config, stages)           │
│   80 │   │                                                                   │
│   81 │   │   if skip_remote_state_provision:                                 │
│   82 │   │   │   for stage in stages:                                        │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/render.py:32 │
│ in render_template                                                           │
│                                                                              │
│    29 │   contents = {}                                                      │
│    30 │   for stage in stages:                                               │
│    31 │   │   contents.update(                                               │
│ ❱  32 │   │   │   stage(output_directory=output_directory, config=config).re │
│    33 │   │   )                                                              │
│    34 │                                                                      │
│    35 │   new, untracked, updated, deleted = inspect_files(                  │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base. │
│ py:85 in render                                                              │
│                                                                              │
│    82 │   │   │   │   │   f"{temp_dir}",                                     │
│    83 │   │   │   │   │   "--enable-helm",                                   │
│    84 │   │   │   │   │   "--helm-command",                                  │
│ ❱  85 │   │   │   │   │   f"{helm.download_helm_binary()}",                  │
│    86 │   │   │   │   │   f"{self.template_directory}",                      │
│    87 │   │   │   │   ]                                                      │
│    88 │   │   │   )                                                          │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/hel │
│ m.py:38 in download_helm_binary                                              │
│                                                                              │
│   35 │   │   │   stdout=subprocess.PIPE,                                     │
│   36 │   │   │   check=True,                                                 │
│   37 │   │   )                                                               │
│ ❱ 38 │   │   subprocess.run(                                                 │
│   39 │   │   │   [                                                           │
│   40 │   │   │   │   "bash",                                                 │
│   41 │   │   │   │   "-s",                                                   │
│                                                                              │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/su │
│ bprocess.py:571 in run                                                       │
│                                                                              │
│    568 │   │   │   raise                                                     │
│    569 │   │   retcode = process.poll()                                      │
│    570 │   │   if check and retcode:                                         │
│ ❱  571 │   │   │   raise CalledProcessError(retcode, process.args,           │
│    572 │   │   │   │   │   │   │   │   │    output=stdout, stderr=stderr)    │
│    573 │   return CompletedProcess(process.args, retcode, stdout, stderr)    │
│    574                                                                       │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['bash', '-s', '--', '-v', 'v3.15.3', '--no-sudo']'
returned non-zero exit status 1.

This is what my config looks like:

provider: local
namespace: dev
nebari_version: 2024.7.2
project_name: nebari-local
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: someverystrongpassword
    overrides:
      image:
        repository: quay.io/aktech/keycloak
        tag: 15.0.2
  authentication:
    type: password
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.6.1
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.6.1
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.6.1
conda_store:
  image: quay.io/aktech/conda-store-server
  image_tag: sha-558beb8
theme:
  jupyterhub:
    hub_title: Nebari - nebari-local
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
kuberhealthy:
  enabled: true

@dcmcand
Contributor Author

dcmcand commented Aug 29, 2024

@marcelovilla can you try again? I just pushed a fix.

@viniciusdc
Contributor

For future reference, we will need a follow-up PR to dynamically render the Kustomize patch arguments, such as the namespace value.

@marcelovilla
Member

Thanks @dcmcand, I was able to deploy it successfully and run the query you added in the PR description. I also see that two kuberhealthy pods are running.

I tried to re-deploy with this change in my config:

kuberhealthy:
  enabled: false

But I still see the two kuberhealthy pods in the deployment afterwards. Removing the block entirely didn't help either. Did you experience this as well?

@dcmcand
Contributor Author

dcmcand commented Aug 29, 2024

No, it should destroy the resources if it isn't enabled. I'll look into it tomorrow.

@dcmcand
Contributor Author

dcmcand commented Aug 30, 2024

@marcelovilla destroy should work now, and I moved the config as @aktech suggested. To enable kuberhealthy now, use

monitoring:
  healthchecks:
    enabled: true

@marcelovilla
Member

@dcmcand I'm getting the following error when trying to deploy adding the block you suggested:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/ │
│ deploy.py:79 in deploy                                                       │
│                                                                              │
│   76 │   │   config = read_configuration(config_filename, config_schema=conf │
│   77 │   │                                                                   │
│   78 │   │   if not disable_render:                                          │
│ ❱ 79 │   │   │   render_template(output_directory, config, stages)           │
│   80 │   │                                                                   │
│   81 │   │   if skip_remote_state_provision:                                 │
│   82 │   │   │   for stage in stages:                                        │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/render.py:32 │
│ in render_template                                                           │
│                                                                              │
│    29 │   contents = {}                                                      │
│    30 │   for stage in stages:                                               │
│    31 │   │   contents.update(                                               │
│ ❱  32 │   │   │   stage(output_directory=output_directory, config=config).re │
│    33 │   │   )                                                              │
│    34 │                                                                      │
│    35 │   new, untracked, updated, deleted = inspect_files(                  │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base. │
│ py:78 in render                                                              │
│                                                                              │
│    75 │   │   │   │   "kustomization.yaml file not found in template directo │
│    76 │   │   │   )                                                          │
│    77 │   │   with tempfile.TemporaryDirectory() as temp_dir:                │
│ ❱  78 │   │   │   kustomize.run_kustomize_subprocess(                        │
│    79 │   │   │   │   [                                                      │
│    80 │   │   │   │   │   "build",                                           │
│    81 │   │   │   │   │   "-o",                                              │
│                                                                              │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/kus │
│ tomize.py:51 in run_kustomize_subprocess                                     │
│                                                                              │
│   48 def run_kustomize_subprocess(processargs, **kwargs) -> None:            │
│   49 │   kustomize_path = download_kustomize_binary()                        │
│   50 │   if run_subprocess_cmd([kustomize_path] + processargs, **kwargs):    │
│ ❱ 51 │   │   raise KustomizeException("Kustomize returned an error")         │
│   52                                                                         │
│   53                                                                         │
│   54 def version() -> str:                                                   │
╰──────────────────────────────────────────────────────────────────────────────╯
KustomizeException: Kustomize returned an error

I also see that the unit tests are failing.

@dcmcand dcmcand force-pushed the add-uptime-monitoring branch from 679a0d1 to 496416d Compare August 30, 2024 13:21
Member

@marcelovilla marcelovilla left a comment

Thanks @dcmcand 🚀 !

I was able to confirm that deploying with

monitoring:
  healthchecks:
    enabled: true

works as expected and that removing the block or disabling it removes the deployed resources.

@dcmcand dcmcand merged commit 498e569 into develop Aug 30, 2024
26 of 27 checks passed
@dcmcand dcmcand deleted the add-uptime-monitoring branch August 30, 2024 14:46
Labels
area: k8s ⎈ area: monitoring 🔍 needs: review 👀 This PR is complete and ready for reviewing type: enhancement 💅🏼 New feature or request
Projects
Status: Done 💪🏾
4 participants