We need a way to know if Bottlerocket instances are healthy. This is particularly important around the first wave of an update, where we need feedback to know if the update has caused any issues. We propose adding a component to send anonymous health pings for this purpose.
Bottlerocket Metrics GA Design - Health Pings
The purpose of this document is to describe the design of Bottlerocket’s metrics-sending functionality as we intend it to exist at GA.
A mechanism already exists that sends the version, target version, seed value and migration list via updog. updog's contribution may be redesigned wholesale, but we will first get healthdog working before addressing changes to updog (see #1001).
We will add a new program to Bottlerocket specifically for the purpose of sending metrics. healthdog ← 🚲🏠 here.
Settings
Behaviour will be determined by new settings:
defaults.toml
[settings.metrics]
# a GET request with query params will be sent here.
metrics-url = "https://updates.bottlerocket.aws/metrics"
# defaults to `true`, but can be set to `false`
send-metrics = true
# a list of the services that must be running to consider a host 'healthy'.
# in defaults.toml some minimal set will be used, variants will override.
service-health = ["apiserver", "containerd", "host-containerd", "kubelet"]
A config file will be produced from this: /etc/healthdog.toml
Sending Metrics
Healthy Boot
The existing mark-successful-boot.service will be extended to also call /bin/healthdog --config /etc/healthdog.toml report-successful-boot. In the event of opt-out or a null URL setting, this will be a no-op that exits 0. Otherwise it will GET https://metrics.bottlerocket.aws/metrics?sender=healthdog&...etc with the following key-value pairs:
sender: healthdog
event: boot-success
version: 0.5.2
variant: aws-k8s-1.17
arch: x86_64
region: us-west-2
seed: 1292
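The boot-success request described above could be assembled roughly as follows. This is only a sketch of the URL construction and the opt-out rule; the function name `boot_success_url` and its parameters are hypothetical, and the actual HTTP GET is left out.

```python
from urllib.parse import urlencode


def boot_success_url(metrics_url, send_metrics, *, version, variant, arch, region, seed):
    """Return the URL to GET, or None when reporting should be a no-op.

    Reporting is skipped when the user has opted out (send_metrics is false)
    or when no metrics URL is configured.
    """
    if not send_metrics or not metrics_url:
        return None
    params = {
        "sender": "healthdog",
        "event": "boot-success",
        "version": version,
        "variant": variant,
        "arch": arch,
        "region": region,
        "seed": seed,
    }
    # urlencode percent-escapes values and joins the pairs with '&'.
    return f"{metrics_url}?{urlencode(params)}"
```

A caller that receives `None` would simply exit 0 without touching the network, which matches the no-op behaviour described for opt-out.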
Health Pings
healthdog will run on a schedule, once 120 seconds after boot, and every six hours thereafter. It will use systemctl check status to check the status of each service listed in the service-health setting. If all of these are running and report healthy, then the system will be reported as healthy. If one or more of these is not running, or in a bad state, the system will be reported as unhealthy.
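The all-or-nothing health rule can be sketched as below. This example assumes `systemctl is-active --quiet <unit>`, which exits 0 only when the unit is active, as the per-service check; that is a stand-in chosen for the sketch, not necessarily the command healthdog will use. The helper names `is_active` and `host_is_healthy` are hypothetical.

```python
import subprocess


def is_active(service: str) -> bool:
    """Return True when `systemctl is-active --quiet <service>` exits 0."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", service],
        check=False,  # a non-zero exit means "not active", not an error
    )
    return result.returncode == 0


def host_is_healthy(services, check=is_active) -> bool:
    """The host is healthy only if every listed service passes the check."""
    return all(check(service) for service in services)
```

Passing the check function as a parameter keeps the rule testable without systemd, for example `host_is_healthy(["kubelet"], check=lambda s: True)`.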
The health ping will be set up as a service in systemd, for example healthdog.timer
[Unit]
Description=Scheduled Healthdog Pings
## Allow manual starts
RefuseManualStart=no
## Allow manual stops
RefuseManualStop=no
[Timer]
## Execute job if it missed a run due to machine being off
Persistent=true
## Run 120 seconds after boot for the first time
OnBootSec=120
## Run every 6 hours thereafter
OnUnitActiveSec=6h
## File describing job to execute
Unit=healthdog.service
[Install]
WantedBy=timers.target
healthdog.service
Healthy Ping
Unhealthy Ping