RFD 203 - Database Health Checks #52400

GavinFrazar · 2025-02-21T21:34:37Z

Rendered

rfd/0203-database-healthchecks.md

r0mant

Some comments on the UX and configuration.

rfd/0203-database-healthchecks.md

r0mant · 2025-02-27T00:06:56Z

rfd/0203-database-healthchecks.md

+
+Recall that there may be multiple `db_server` heartbeats for the same database, but the web UI only shows a single resource tile for the database.
+If this is the case, then the tooltip will be shown if any of the heartbeats have an unhealthy `target_health.status`.
+The tooltip content may look something like this (or equivalent singular form verbiage if there's only one agent):


The tooltip might be ok to start with but I think eventually this sort of info needs to go on the resource's own "details" page. Similar to how we have status pages for integrations. cc @roraback

Oh like you would click the resource and see an overview of info about it, including health status?

rfd/0203-database-healthchecks.md

r0mant · 2025-02-27T00:22:58Z

rfd/0203-database-healthchecks.md

+      health_check:
+        enabled: true
+        interval: 30s
+        timeout: 5s
+        healthy_threshold: 3
+        unhealthy_threshold: 2


I have a suggestion/idea: instead of adding this to every resource, what do you guys think about making the health check configuration either into a separate configurable resource or extending one of existing cluster config resources to support health checks based on configured matching resources for example?

Just brainstorming:

kind: health_probe version: v1 metadata: name: default spec: match_labels: env: prod interval: xxx timeout: xxx healthy_threshold: xxx

This will make it more flexible and easier extendable to other resource types in future, and give users to configure health checks for different resources and see them all in a centralized fashion. And also separation of concerns FTW.

What do you guys think? @GavinFrazar @greedy52 @rosstimothy

If multiple of these match a resource then we would sort them by name I assume?

I like that it centralizes the config and it's really flexible.
I slightly don't like the added complexity for users to understand how settings are applied.
Overall I like it though.

If we did this then do you see any problems with creating a preset default that enables health checks with *:* match_labels in v18?
That would be nice because most users will probably benefit from it without ever messing with the settings, but for people who need to disable health checks or use specific settings, they have plenty of flexibility to tinker.

The RFD initially proposed a cluster level setting.

The label-based health check rules would be a lot more flexible than the cluster level one. To avoid multiple rules matching the same database target though, we might have to introduce rule priority? And i feel it might be hard for the user to implement the label matching rules without some dry-run tests or visualization tool to see the "end" configurations.

Yep, exactly what I was thinking - out of the box we would create a default config that enables health check for all supported resources and users can always tweak it if necessary. Then it's zero config and follows the principle "make simple things easy and hard things possible".

For conflict resolution, if we go a separate resource route (as opposed to cluster config for example), if multiple match the same resource we would probably have the service that performs health checks just process all matching configurations.

And yeah, regarding user complexity, I actually think this approach will be more familiar and easy to understand for them as we employ the same approach in many different places - database import rules, Okta import rules, login rules, so for those who want to fine-tune things it should be easy to understand.

Agree that specifying a health check per resource is untenable. If this is going to be enabled by default we need to ensure it doesn't have any negative impacts on scalability or backend activity.

I like the dedicated health probe resource but I do think we should think about its shape for a bit. Will a general health probe be a one size fits all solution for database, kube, app, desktop, etc. Are there any protocol specific knobs that may need to exist?

Since we discussed a new config object that can affect every resource at once, should we add some jitter delay before the first check to avoid triggering a lot of connections simultaneously?

Will a general health probe be a one size fits all solution for database, kube, app, desktop, etc. Are there any protocol specific knobs that may need to exist?

I can't think of what else we'd need for a TCP health probe to support all of db, kube, app, desktop, etc. The only special example I can think of is when the resource has multiple endpoints like a MongoDB cluster, but the config we have still generalizes for that case.

Eventually I think it would be nice to expand on this and allow TLS (w/o client cert) and HTTP(S) checks.
TCP, TLS, HTTP probes are all pretty generic already though.

Updated to move all config into a dynamic config resource: 9ea351e

rfd/0203-database-healthchecks.md

GavinFrazar added rfd Request for Discussion no-changelog Indicates that a PR does not require a changelog entry labels Feb 21, 2025

GavinFrazar changed the title ~~Initial draft~~ RFD 203 - Database Health Checks Feb 21, 2025

GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 4b5a393 to 2372383 Compare February 21, 2025 22:00

GavinFrazar requested review from r0mant, roraback and greedy52 February 21, 2025 22:00

GavinFrazar marked this pull request as ready for review February 21, 2025 22:00

github-actions bot added the size/lg label Feb 21, 2025

github-actions bot requested review from bernardjkim and kimlisa February 21, 2025 22:01

GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 2372383 to a4cce8f Compare February 21, 2025 22:33

RFD 203 - Database Health Checks

4fd0121

GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from a4cce8f to 4fd0121 Compare February 21, 2025 23:16

rosstimothy reviewed Feb 22, 2025

View reviewed changes

rfd/0203-database-healthchecks.md Outdated Show resolved Hide resolved

rosstimothy reviewed Feb 22, 2025

View reviewed changes

rfd/0203-database-healthchecks.md Outdated Show resolved Hide resolved

rfd/0203-database-healthchecks.md Outdated Show resolved Hide resolved

greedy52 reviewed Feb 24, 2025

View reviewed changes

GavinFrazar removed request for bernardjkim and kimlisa February 24, 2025 19:51

greedy52 reviewed Feb 24, 2025

View reviewed changes

rfd/0203-database-healthchecks.md Outdated Show resolved Hide resolved

GavinFrazar added 7 commits February 24, 2025 12:52

remove global config option

b6d1da6

replace discovery config with dynamic resource config

7483c4d

move ux user stories to the beginning of details

85d8c44

address tooltip feedback

0c56f55

address uri feedback

c3d6d6e

misc

a757ac6

add more restrictions on config

2b24a84

GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 70f2443 to 2b24a84 Compare February 25, 2025 03:14

GavinFrazar added 2 commits February 24, 2025 19:18

warn on opt-out in discover flow

9e11950

update init behavior

bec2c3f

r0mant reviewed Feb 27, 2025

View reviewed changes

rosstimothy requested review from espadolini and fspmarshall February 27, 2025 19:43

GavinFrazar added 7 commits February 28, 2025 15:50

prioritize user benefit over impl details in intro

2cda811

update motivations

4c65dac

remove enrollment health check toggle

81d39de

move health check config into a dynamic config resource

9ea351e

rename "init" status to "initialized"

a88a6f3

clarify init behavior for pass/fail

b8d7a8c

clarify scope includes all db protocols

a5626ab

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFD 203 - Database Health Checks #52400

RFD 203 - Database Health Checks #52400

GavinFrazar commented Feb 21, 2025 •

edited

Loading

r0mant left a comment

r0mant Feb 27, 2025

GavinFrazar Feb 27, 2025

r0mant Feb 27, 2025

GavinFrazar Feb 27, 2025

greedy52 Feb 27, 2025

r0mant Feb 27, 2025

rosstimothy Feb 27, 2025

GavinFrazar Feb 28, 2025

GavinFrazar Mar 1, 2025

RFD 203 - Database Health Checks #52400

Are you sure you want to change the base?

RFD 203 - Database Health Checks #52400

Conversation

GavinFrazar commented Feb 21, 2025 • edited Loading

r0mant left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

GavinFrazar commented Feb 21, 2025 •

edited

Loading