Send anonymous health metrics #1000

webern · 2020-07-22T23:25:58Z

What I'd like:

We need a way to know if Bottlerocket instances are healthy. This is particularly important around the first wave of an update, where we need feedback to know if the update has caused any issues. We propose adding a component to send anonymous health pings for this purpose.

Bottlerocket Metrics GA Design - Health Pings

The purpose of this document is to describe the design of Bottlerocket’s metrics-sending functionality as we intend it to exist at GA.

A mechanism already exists: updog sends the version, target version, seed value, and migration list. There is a potential wholesale redesign of updog’s contribution, but we will first get healthdog working before addressing changes to updog (see #1001).

We will add a new program to Bottlerocket specifically for the purpose of sending metrics.
healthdog ← 🚲🏠 here.

Settings

Behaviour will be determined by new settings:

defaults.toml

[settings.metrics]
# a GET request with query params will be sent here.
metrics-url = "https://metrics.bottlerocket.aws/metrics"
# defaults to `true`, but can be set to `false`
send-metrics = true
# a list of the services that must be running to consider a host 'healthy'.
# in defaults.toml some minimal set will be used, variants will override.
service-health = ["apiserver", "containerd", "host-containerd", "kubelet"]

A config file will be produced from this: /etc/healthdog.toml

metrics_url = "https://metrics.bottlerocket.aws/metrics"
send_metrics = true
region = "us-west-2"
seed = 1292
service_health = ["apiserver", "containerd", "host-containerd", "kubelet"]
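As a rough illustration of how the settings map onto the generated file, the sketch below renders the config text from a settings dictionary. It is not Bottlerocket's actual templating mechanism: the function names `render_healthdog_config` and `_toml_value` are hypothetical, and passing `region` and `seed` in as arguments is an assumption, since the issue does not say where those values come from.

```python
import json


def _toml_value(value) -> str:
    """Format a Python value as a TOML literal (handles only the types used here)."""
    if isinstance(value, bool):  # check bool first, because bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        return json.dumps(value)  # double-quoted string with escapes
    if isinstance(value, list):
        return "[" + ", ".join(_toml_value(item) for item in value) + "]"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def render_healthdog_config(settings: dict, region: str, seed: int) -> str:
    """Render config text from [settings.metrics] values (hypothetical helper).

    Kebab-case setting names become snake_case keys; region and seed are
    supplied by the caller, which is an assumption of this sketch.
    """
    values = {key.replace("-", "_"): value for key, value in settings.items()}
    values["region"] = region
    values["seed"] = seed
    key_order = ("metrics_url", "send_metrics", "region", "seed", "service_health")
    lines = [f"{key} = {_toml_value(values[key])}" for key in key_order]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example_settings = {
        "metrics-url": "https://metrics.bottlerocket.aws/metrics",
        "send-metrics": True,
        "service-health": ["apiserver", "containerd", "host-containerd", "kubelet"],
    }
    print(render_healthdog_config(example_settings, region="us-west-2", seed=1292), end="")
```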

Sending Metrics

Healthy Boot

The existing mark-successful-boot.service will be extended to also call /bin/healthdog --config /etc/healthdog.toml report-successful-boot. If the user has opted out, or the URL setting is null, this will be a no-op that exits 0. Otherwise it will GET https://metrics.bottlerocket.aws/metrics?sender=healthdog&...etc with the following key-value pairs:

sender: healthdog
event: boot-success
version: 0.5.2
variant: aws-k8s-1.17
arch: x86_64
region: us-west-2
seed: 1292
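The request described above could be assembled roughly as follows. This is a minimal sketch, not healthdog's implementation: the function name `boot_success_url` is hypothetical, and treating a missing `send_metrics` key as opted in is an assumption based on the documented default of `true`.

```python
from typing import Optional
from urllib.parse import urlencode


def boot_success_url(config: dict, version: str, variant: str, arch: str) -> Optional[str]:
    """Build the boot-success GET URL, or return None when nothing should be sent.

    `config` holds the keys from /etc/healthdog.toml. None corresponds to the
    no-op case (opt-out or null URL), where the program would just exit 0.
    """
    if not config.get("send_metrics", True) or not config.get("metrics_url"):
        return None
    params = {
        "sender": "healthdog",
        "event": "boot-success",
        "version": version,
        "variant": variant,
        "arch": arch,
        "region": config["region"],
        "seed": config["seed"],
    }
    # urlencode percent-encodes values and joins the pairs with '&'.
    return config["metrics_url"] + "?" + urlencode(params)


if __name__ == "__main__":
    example_config = {
        "metrics_url": "https://metrics.bottlerocket.aws/metrics",
        "send_metrics": True,
        "region": "us-west-2",
        "seed": 1292,
    }
    print(boot_success_url(example_config, "0.5.2", "aws-k8s-1.17", "x86_64"))
```

Returning `None` rather than raising keeps the opt-out path a quiet no-op, which matches the "exit 0" behaviour described above.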

Health Pings

healthdog will run on a schedule: once 120 seconds after boot, and every six hours thereafter. It will use systemctl to check the status of each service listed in the service-health setting. If all of these are running and report healthy, the system will be reported as healthy. If one or more of them is not running, or is in a bad state, the system will be reported as unhealthy.
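One way to sketch the per-service check is shown below. The issue only says that systemctl will be used, so the choice of `systemctl is-active --quiet` is an assumption, as are the function names. The code recorded for a failing service here is systemctl's own exit status, which may differ from the failure codes the design intends to report.

```python
import subprocess
from typing import Callable, Dict, Iterable


def systemctl_is_active(service: str) -> int:
    """Return the exit status of `systemctl is-active --quiet SERVICE` (0 means active)."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", service],
        check=False,  # a non-zero status is an expected outcome, not an error
    )
    return result.returncode


def failed_services(
    services: Iterable[str],
    check: Callable[[str], int] = systemctl_is_active,
) -> Dict[str, int]:
    """Map each service that is not healthy to its non-zero status code.

    An empty result means every listed service passed, so the host would be
    reported as healthy. `check` is injectable so the logic can be exercised
    without systemd.
    """
    failures: Dict[str, int] = {}
    for service in services:
        code = check(service)
        if code != 0:
            failures[service] = code
    return failures


if __name__ == "__main__":
    # Stand-in statuses for a dry run on a machine without these services.
    fake_status = {"apiserver": 0, "containerd": 0, "host-containerd": 0, "kubelet": 1}
    failures = failed_services(fake_status, check=fake_status.__getitem__)
    print("healthy" if not failures else f"unhealthy: {failures}")
```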

The health ping will be set up as a timer and a service in systemd, for example:

healthdog.timer

[Unit]
Description=Scheduled Healthdog Pings
## Allow manual starts
RefuseManualStart=no
## Allow manual stops
RefuseManualStop=no

[Timer]
## Execute job if it missed a run due to machine being off
Persistent=true
## Run 120 seconds after boot for the first time
OnBootSec=120
## Run every 6 hours thereafter
OnUnitActiveSec=6h
## File describing job to execute
Unit=healthdog.service

[Install]
WantedBy=timers.target

healthdog.service

[Unit]
Description=Send a Healthdog Ping

[Service]
Type=oneshot
## Left false so the unit becomes inactive after each run and the timer can start it again
RemainAfterExit=false
ExecStart=/bin/healthdog --config /etc/healthdog.toml send-health-ping

[Install]
WantedBy=multi-user.target

Healthy Ping

sender: healthdog
event: health-ping
version: 0.5.2
variant: aws-k8s-1.17
arch: x86_64
region: us-west-2
seed: 1292
is_healthy: true
failed_services: (← empty)

Unhealthy Ping

sender: healthdog
event: health-ping
version: 0.5.2
variant: aws-k8s-1.17
arch: x86_64
region: us-west-2
seed: 1292
is_healthy: false
failed_services: kubelet:1,containerd:127 (← includes failure codes)
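The two payloads above differ only in `is_healthy` and `failed_services`, so they could be produced by one helper. The sketch below is illustrative only: `health_ping_params` is a hypothetical name, and the `name:code` pairs joined by commas simply mirror the example values shown above.

```python
from typing import Dict


def health_ping_params(base: Dict[str, str], failures: Dict[str, int]) -> Dict[str, str]:
    """Build the health-ping key-value pairs.

    `base` carries the fields shared with other events (version, variant,
    arch, region, seed). `failures` maps a failed service name to its code;
    an empty mapping yields a healthy ping with an empty failed_services.
    """
    params = {"sender": "healthdog", "event": "health-ping"}
    params.update(base)
    params["is_healthy"] = "false" if failures else "true"
    params["failed_services"] = ",".join(
        f"{name}:{code}" for name, code in failures.items()
    )
    return params


if __name__ == "__main__":
    shared = {
        "version": "0.5.2",
        "variant": "aws-k8s-1.17",
        "arch": "x86_64",
        "region": "us-west-2",
        "seed": "1292",
    }
    print(health_ping_params(shared, {}))
    print(health_ping_params(shared, {"kubelet": 1, "containerd": 127}))
```

If these pairs are sent as query parameters, the colon and comma in `failed_services` would need percent-encoding, which a standard URL encoder handles.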

webern added the type/enhancement New feature or request label Jul 22, 2020

webern self-assigned this Jul 22, 2020

webern mentioned this issue Jul 22, 2020: Improve updog's anonymous metrics #1001 (Open)

webern added this to the GA milestone Jul 28, 2020

webern mentioned this issue Jul 31, 2020: metricdog: anonymous bottlerocket metrics #1006 (Merged)

tjkirch removed this from the GA milestone Aug 18, 2020

srgothi92 closed this as completed Sep 1, 2020

srgothi92 reopened this Sep 1, 2020

jhaynes added the priority/p1 label Dec 10, 2020

webern mentioned this issue Feb 2, 2021: metrics: respect proxy settings #1298 (Closed)

webern closed this as completed in #1006 Feb 12, 2021
