Send anonymous health metrics #1000

Closed
webern opened this issue Jul 22, 2020 · 0 comments · Fixed by #1006
webern commented Jul 22, 2020

What I'd like:

We need a way to know if Bottlerocket instances are healthy. This is particularly important around the first wave of an update, where we need feedback to know if the update has caused any issues. We propose adding a component to send anonymous health pings for this purpose.

Bottlerocket Metrics GA Design - Health Pings

The purpose of this document is to describe the design of Bottlerocket’s metrics-sending functionality as we intend it to exist at GA.

A mechanism already exists that sends the version, target version, seed value, and migration list via updog. updog's contribution may be redesigned wholesale, but we will first get healthdog working before addressing changes to updog (see #1001).

We will add a new program to Bottlerocket specifically for the purpose of sending metrics.
healthdog ← 🚲🏠 here.

Settings

Behaviour will be determined by new settings:

defaults.toml

[settings.metrics]
# a GET request with query params will be sent here.
metrics-url = "https://metrics.bottlerocket.aws/metrics"
# defaults to `true`, but can be set to `false`
send-metrics = true
# a list of the services that must be running to consider a host 'healthy'.
# in defaults.toml some minimal set will be used, variants will override.
service-health = ["apiserver", "containerd", "host-containerd", "kubelet"]

A config file will be produced from this: /etc/healthdog.toml

metrics_url = "https://metrics.bottlerocket.aws/metrics"
send_metrics = true
region = "us-west-2"
seed = 1292
service_health = ["apiserver", "containerd", "host-containerd", "kubelet"]

Sending Metrics

Healthy Boot

The existing mark-successful-boot.service will be extended to also call /bin/healthdog --config /etc/healthdog.toml report-successful-boot. If the user has opted out, or the URL setting is null, this will be a no-op that exits 0. Otherwise it will GET https://metrics.bottlerocket.aws/metrics?sender=healthdog&...etc with the following key-value pairs:

  • sender: healthdog
  • event: boot-success
  • version: 0.5.2
  • variant: aws-k8s-1.17
  • arch: x86_64
  • region: us-west-2
  • seed: 1292
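
As an illustration only, the boot-success request described above could be assembled like this. This is a Python sketch, not healthdog's implementation: the function name build_metrics_url and the dictionary layout are assumptions, and the values are the example values from this issue.

```python
from urllib.parse import urlencode


def build_metrics_url(config, event, extra=None):
    """Return the GET URL for a metrics event, or None when sending is a no-op.

    `config` is a hypothetical dict mirroring /etc/healthdog.toml, plus the
    version, variant and arch values, which would come from the OS itself.
    """
    # Opt-out or a missing URL means nothing is sent.
    if not config.get("send_metrics", True) or not config.get("metrics_url"):
        return None
    params = {
        "sender": "healthdog",
        "event": event,
        "version": config["version"],
        "variant": config["variant"],
        "arch": config["arch"],
        "region": config["region"],
        "seed": config["seed"],
    }
    # Event-specific fields (for example is_healthy) are appended last.
    params.update(extra or {})
    return config["metrics_url"] + "?" + urlencode(params)


example_config = {
    "metrics_url": "https://metrics.bottlerocket.aws/metrics",
    "send_metrics": True,
    "version": "0.5.2",
    "variant": "aws-k8s-1.17",
    "arch": "x86_64",
    "region": "us-west-2",
    "seed": 1292,
}
boot_url = build_metrics_url(example_config, "boot-success")
opted_out = build_metrics_url({**example_config, "send_metrics": False}, "boot-success")
```

Returning None for the opt-out case keeps the "no-op, exit 0" decision in one place, so the caller only has to skip the request.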

Health Pings

healthdog will run on a schedule, once 120 seconds after boot, and every six hours thereafter. It will use systemctl to check the status of each service listed in the service-health setting. If all of these are running and report healthy, then the system will be reported as healthy. If one or more of these is not running, or is in a bad state, the system will be reported as unhealthy.
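
A rough sketch of that check, in Python for illustration. It assumes `systemctl is-active --quiet <service>` as the probe, which exits 0 when the unit is active. The issue does not say which systemctl invocation or which failure code healthdog will report, so the probe's exit status stands in for the failure code here, and the function names are hypothetical.

```python
import subprocess


def systemctl_is_active(service):
    """Return the exit status of `systemctl is-active --quiet <service>`.

    An exit status of 0 means the unit is active (assumed probe, see above).
    """
    return subprocess.run(["systemctl", "is-active", "--quiet", service]).returncode


def check_services(services, probe=systemctl_is_active):
    """Return {service: code} for every listed service whose probe is non-zero.

    An empty result means the host is considered healthy.
    """
    failed = {}
    for service in services:
        code = probe(service)
        if code != 0:
            failed[service] = code
    return failed


# Usage with a fake probe, so the logic can be exercised without systemd.
fake_status = {"apiserver": 0, "containerd": 127, "host-containerd": 0, "kubelet": 1}
failed = check_services(list(fake_status), probe=fake_status.get)
is_healthy = not failed
```

Passing the probe as a parameter keeps the health decision testable on machines where systemd is not available.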

The health ping will be set up as a systemd timer and service, for example:
healthdog.timer

[Unit]
Description=Scheduled Healthdog Pings
## Allow manual starts
RefuseManualStart=no
## Allow manual stops
RefuseManualStop=no

[Timer]
## Execute job if it missed a run due to machine being off
## (note: systemd only applies Persistent= to OnCalendar= timers)
Persistent=true
## Run 120 seconds after boot for the first time
OnBootSec=120
## Run every 6 hours thereafter
OnUnitActiveSec=6h
## File describing job to execute
Unit=healthdog.service

[Install]
WantedBy=timers.target

healthdog.service

[Unit]
Description=Send a Healthdog Ping

[Service]
Type=oneshot
## No RemainAfterExit=: the unit must go inactive after each run,
## otherwise the timer cannot start it again
ExecStart=/bin/healthdog --config /etc/healthdog.toml send-health-ping

[Install]
WantedBy=multi-user.target

Healthy Ping

  • sender: healthdog
  • event: health-ping
  • version: 0.5.2
  • variant: aws-k8s-1.17
  • arch: x86_64
  • region: us-west-2
  • seed: 1292
  • is_healthy: true
  • failed_services: (← empty)

Unhealthy Ping

  • sender: healthdog
  • event: health-ping
  • version: 0.5.2
  • variant: aws-k8s-1.17
  • arch: x86_64
  • region: us-west-2
  • seed: 1292
  • is_healthy: false
  • failed_services: kubelet:1,containerd:127 (← includes failure codes)
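
The two health-specific fields in the pings above could be derived from a mapping of failed services as in this Python sketch. The function name health_ping_fields is hypothetical, and the name:code,name:code format is taken from the unhealthy example; how healthdog obtains the codes is not specified in this issue.

```python
def health_ping_fields(failed):
    """Build is_healthy and failed_services from a {service: code} mapping.

    An empty mapping produces a healthy ping with an empty failed_services.
    """
    return {
        "is_healthy": "false" if failed else "true",
        # Join entries as name:code, comma-separated, in insertion order.
        "failed_services": ",".join(f"{name}:{code}" for name, code in failed.items()),
    }


unhealthy = health_ping_fields({"kubelet": 1, "containerd": 127})
healthy = health_ping_fields({})
```

When these values are placed in a query string, the colons and commas in failed_services will normally be percent-encoded by the URL encoder, so the receiver sees them only after decoding.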
@webern webern added the type/enhancement New feature or request label Jul 22, 2020
@webern webern self-assigned this Jul 22, 2020
@webern webern added this to the GA milestone Jul 28, 2020
@tjkirch tjkirch removed this from the GA milestone Aug 18, 2020
@srgothi92 srgothi92 reopened this Sep 1, 2020