Add initial uptime metrics #2609
@dcmcand I'm trying this locally (on an M1) and I'm seeing this when deploying:
Downloading https://get.helm.sh/helm-v3.15.3-darwin-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3
helm installed into /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3/helm
helm not found. Is /var/folders/ch/slky97nd0zz1zdw_qk0nqvw00000gn/T/helm/v3.15.3 on your $PATH?
Failed to install helm with the arguments provided: -v v3.15.3 --no-sudo
Accepted cli arguments are:
[--help|-h ] ->> prints this help
[--version|-v <desired_version>] . When not defined it fetches the latest release from GitHub
e.g. --version v3.0.0 or -v canary
[--no-sudo] ->> install without sudo
For support, go to https://github.com/helm/helm.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/ │
│ deploy.py:79 in deploy │
│ │
│ 76 │ │ config = read_configuration(config_filename, config_schema=conf │
│ 77 │ │ │
│ 78 │ │ if not disable_render: │
│ ❱ 79 │ │ │ render_template(output_directory, config, stages) │
│ 80 │ │ │
│ 81 │ │ if skip_remote_state_provision: │
│ 82 │ │ │ for stage in stages: │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/render.py:32 │
│ in render_template │
│ │
│ 29 │ contents = {} │
│ 30 │ for stage in stages: │
│ 31 │ │ contents.update( │
│ ❱ 32 │ │ │ stage(output_directory=output_directory, config=config).re │
│ 33 │ │ ) │
│ 34 │ │
│ 35 │ new, untracked, updated, deleted = inspect_files( │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base. │
│ py:85 in render │
│ │
│ 82 │ │ │ │ │ f"{temp_dir}", │
│ 83 │ │ │ │ │ "--enable-helm", │
│ 84 │ │ │ │ │ "--helm-command", │
│ ❱ 85 │ │ │ │ │ f"{helm.download_helm_binary()}", │
│ 86 │ │ │ │ │ f"{self.template_directory}", │
│ 87 │ │ │ │ ] │
│ 88 │ │ │ ) │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/hel │
│ m.py:38 in download_helm_binary │
│ │
│ 35 │ │ │ stdout=subprocess.PIPE, │
│ 36 │ │ │ check=True, │
│ 37 │ │ ) │
│ ❱ 38 │ │ subprocess.run( │
│ 39 │ │ │ [ │
│ 40 │ │ │ │ "bash", │
│ 41 │ │ │ │ "-s", │
│ │
│ /nix/store/03q8gn91mj95y5bqbcl90hyvmpqpz738-python3-3.11.7/lib/python3.11/su │
│ bprocess.py:571 in run │
│ │
│ 568 │ │ │ raise │
│ 569 │ │ retcode = process.poll() │
│ 570 │ │ if check and retcode: │
│ ❱ 571 │ │ │ raise CalledProcessError(retcode, process.args, │
│ 572 │ │ │ │ │ │ │ │ │ output=stdout, stderr=stderr) │
│ 573 │ return CompletedProcess(process.args, retcode, stdout, stderr) │
│ 574 │
╰──────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['bash', '-s', '--', '-v', 'v3.15.3', '--no-sudo']'
returned non-zero exit status 1.
This is what my config looks like:
provider: local
namespace: dev
nebari_version: 2024.7.2
project_name: nebari-local
ci_cd:
  type: none
terraform_state:
  type: remote
security:
  keycloak:
    initial_root_password: someverystrongpassword
    overrides:
      image:
        repository: quay.io/aktech/keycloak
        tag: 15.0.2
  authentication:
    type: password
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.6.1
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.6.1
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.6.1
conda_store:
  image: quay.io/aktech/conda-store-server
  image_tag: sha-558beb8
theme:
  jupyterhub:
    hub_title: Nebari - nebari-local
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    hub_subtitle: Your open source data science platform, hosted
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
kuberhealthy:
  enabled: true
@marcelovilla can you try again? I just pushed a fix.
For future reference, we will need a follow-up PR to dynamically render the Kustomize patch arguments, such as the namespace value.
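One way such a follow-up could render the namespace is sketched below. This is only an illustration, not the approach taken in this PR: the template text, the resource file name `kuberhealthy.yaml`, and the helper `render_kustomization` are all hypothetical. It relies on the `namespace` field of a `kustomization.yaml`, which Kustomize applies to the resources it builds.

```python
from string import Template

# Hypothetical kustomization template. The resource file name is a
# placeholder for illustration and is not taken from the PR.
KUSTOMIZATION_TEMPLATE = Template(
    """\
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: $namespace
resources:
  - kuberhealthy.yaml
"""
)


def render_kustomization(namespace: str) -> str:
    """Return kustomization.yaml text with the given namespace filled in."""
    if not namespace:
        raise ValueError("namespace must be a non-empty string")
    return KUSTOMIZATION_TEMPLATE.substitute(namespace=namespace)


if __name__ == "__main__":
    # "dev" is the Nebari default namespace mentioned in the PR description.
    print(render_kustomization("dev"))
```

The rendered text would then be written to the stage's output directory before `kustomize build` runs, so the namespace comes from the config instead of being hard-coded.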
Thanks @dcmcand, I was able to deploy it successfully and run the query you added in the PR description. I also see that two I tried to re-deploy with this change in my config:
But I still see the two
No, it should destroy the resources if it isn't enabled. I'll look into it tomorrow.
@marcelovilla destroy should work now, and I moved the config as @aktech suggested. Now, to enable it, use:
monitoring:
  healthchecks:
    enabled: true
@dcmcand I'm getting the following error when trying to deploy after adding the block you suggested:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/subcommands/ │
│ deploy.py:79 in deploy │
│ │
│ 76 │ │ config = read_configuration(config_filename, config_schema=conf │
│ 77 │ │ │
│ 78 │ │ if not disable_render: │
│ ❱ 79 │ │ │ render_template(output_directory, config, stages) │
│ 80 │ │ │
│ 81 │ │ if skip_remote_state_provision: │
│ 82 │ │ │ for stage in stages: │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/render.py:32 │
│ in render_template │
│ │
│ 29 │ contents = {} │
│ 30 │ for stage in stages: │
│ 31 │ │ contents.update( │
│ ❱ 32 │ │ │ stage(output_directory=output_directory, config=config).re │
│ 33 │ │ ) │
│ 34 │ │
│ 35 │ new, untracked, updated, deleted = inspect_files( │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/stages/base. │
│ py:78 in render │
│ │
│ 75 │ │ │ │ "kustomization.yaml file not found in template directo │
│ 76 │ │ │ ) │
│ 77 │ │ with tempfile.TemporaryDirectory() as temp_dir: │
│ ❱ 78 │ │ │ kustomize.run_kustomize_subprocess( │
│ 79 │ │ │ │ [ │
│ 80 │ │ │ │ │ "build", │
│ 81 │ │ │ │ │ "-o", │
│ │
│ /Users/marcelo/projects/quansight/nebari-dev/nebari/src/_nebari/provider/kus │
│ tomize.py:51 in run_kustomize_subprocess │
│ │
│ 48 def run_kustomize_subprocess(processargs, **kwargs) -> None: │
│ 49 │ kustomize_path = download_kustomize_binary() │
│ 50 │ if run_subprocess_cmd([kustomize_path] + processargs, **kwargs): │
│ ❱ 51 │ │ raise KustomizeException("Kustomize returned an error") │
│ 52 │
│ 53 │
│ 54 def version() -> str: │
╰──────────────────────────────────────────────────────────────────────────────╯
KustomizeException: Kustomize returned an error
I also see the unit tests are failing.
679a0d1 to 496416d
Thanks @dcmcand 🚀 !
I was able to confirm that deploying with
monitoring:
  healthchecks:
    enabled: true
works as expected and that removing the block or disabling it removes the deployed resources.
Reference Issues or PRs
part of #2557
What does this implement/fix?
Adds a kuberhealthy service which allows in-cluster synthetic testing. This initial PR includes the service and basic HTTP checks for conda-store, Keycloak, and JupyterHub. The results of these checks are visible in Grafana as metrics that can be queried to show uptime or to create alerts.
For example, to show the average uptime of conda-store over the past 30 days, you can run
1 - (sum(count_over_time(kuberhealthy_check{check="dev/conda-store-http-check", status="0"}[30d])) OR vector(0))/(sum(count_over_time(kuberhealthy_check{check="dev/conda-store-http-check", status="1"}[30d])) * 100)
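To make the arithmetic in this query concrete, here is a small sketch in Python. It assumes that samples with status="1" are passing checks and samples with status="0" are failing ones, and the counts used are made up for illustration rather than taken from a real cluster. Reading the parentheses literally, the query evaluates 1 - failed / (passed * 100), which gives a value close to 1 rather than a percentage; the second function shows the more conventional "passing share of all samples" for comparison.

```python
def query_as_written(failed: int, passed: int) -> float:
    """Evaluate 1 - failed / (passed * 100), following the query's grouping.

    `passed` must be greater than zero, otherwise the division is undefined.
    """
    return 1 - failed / (passed * 100)


def uptime_percent(failed: int, passed: int) -> float:
    """Return passing samples as a percentage of all samples in the window."""
    total = failed + passed
    if total == 0:
        raise ValueError("no samples in the window")
    return 100 * passed / total


if __name__ == "__main__":
    # Made-up counts: 10 failing and 90 passing samples.
    print(query_as_written(failed=10, passed=90))
    print(uptime_percent(failed=10, passed=90))
```

If a percentage is what the dashboard or alert should show, the second form is usually easier to read and to set thresholds on.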
Kuberhealthy is controlled by a new config setting. It defaults to disabled for the moment. To enable kuberhealthy, add
monitoring:
  healthchecks:
    enabled: true
to your nebari-config.yaml file
Limitations and follow-on work
Currently kuberhealthy and all checks deploy to the dev namespace, which is the Nebari default. If you are using a namespace other than dev, kuberhealthy should be left disabled. A follow-on PR is planned to address this and take the namespace from the config.
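The namespace limitation above can be expressed as a small pre-flight check. This is a sketch only: the function name `healthchecks_supported` is hypothetical, and it assumes the config has already been loaded into a plain dictionary with the `monitoring.healthchecks.enabled` key discussed in this PR.

```python
def healthchecks_supported(config: dict) -> bool:
    """Return True if healthchecks are off, or on with the 'dev' namespace."""
    # `or {}` guards against keys that are present but empty in the YAML.
    monitoring = config.get("monitoring") or {}
    healthchecks = monitoring.get("healthchecks") or {}
    enabled = bool(healthchecks.get("enabled", False))
    # "dev" is the Nebari default namespace, per the PR description.
    namespace = config.get("namespace", "dev")
    return (not enabled) or namespace == "dev"


if __name__ == "__main__":
    config = {
        "namespace": "prod",
        "monitoring": {"healthchecks": {"enabled": True}},
    }
    if not healthchecks_supported(config):
        print("healthchecks currently only support the 'dev' namespace")
```

A check like this could run before rendering so that an unsupported combination fails early with a clear message.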
A follow-on PR to add an uptime monitoring dashboard is also planned.
These checks are currently set to run every 5 minutes with a 10-minute timeout and a failing percentage of 80%. I intend to make this configurable in a future PR as well.
Put an x in the boxes that apply
Testing
Any other comments?