Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD 203 - Database Health Checks #52400

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

GavinFrazar
Copy link
Contributor

@GavinFrazar GavinFrazar commented Feb 21, 2025

@GavinFrazar GavinFrazar added rfd Request for Discussion no-changelog Indicates that a PR does not require a changelog entry labels Feb 21, 2025
@GavinFrazar GavinFrazar changed the title Initial draft RFD 203 - Database Health Checks Feb 21, 2025
@GavinFrazar GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 4b5a393 to 2372383 Compare February 21, 2025 22:00
@GavinFrazar GavinFrazar marked this pull request as ready for review February 21, 2025 22:00
@GavinFrazar GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 2372383 to a4cce8f Compare February 21, 2025 22:33
@GavinFrazar GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from a4cce8f to 4fd0121 Compare February 21, 2025 23:16
@GavinFrazar GavinFrazar force-pushed the rfd/0203-database-healthchecks branch from 70f2443 to 2b24a84 Compare February 25, 2025 03:14
Copy link
Collaborator

@r0mant r0mant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on the UX and configuration.


Recall that there may be multiple `db_server` heartbeats for the same database, but the web UI only shows a single resource tile for the database.
If this is the case, then the tooltip will be shown if any of the heartbeats have an unhealthy `target_health.status`.
The tooltip content may look something like this (or equivalent singular form verbiage if there's only one agent):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tooltip might be ok to start with but I think eventually this sort of info needs to go on the resource's own "details" page. Similar to how we have status pages for integrations. cc @roraback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh like you would click the resource and see an overview of info about it, including health status?

Comment on lines 317 to 322
health_check:
enabled: true
interval: 30s
timeout: 5s
healthy_threshold: 3
unhealthy_threshold: 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a suggestion/idea: instead of adding this to every resource, what do you guys think about making the health check configuration either into a separate configurable resource or extending one of existing cluster config resources to support health checks based on configured matching resources for example?

Just brainstorming:

kind: health_probe
version: v1
metadata:
  name: default
spec:
  match_labels:
    env: prod
  interval: xxx
  timeout: xxx
  healthy_threshold: xxx

This will make it more flexible and easier extendable to other resource types in future, and give users to configure health checks for different resources and see them all in a centralized fashion. And also separation of concerns FTW.

What do you guys think? @GavinFrazar @greedy52 @rosstimothy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If multiple of these match a resource then we would sort them by name I assume?

I like that it centralizes the config and it's really flexible.
I slightly don't like the added complexity for users to understand how settings are applied.
Overall I like it though.

If we did this then do you see any problems with creating a preset default that enables health checks with *:* match_labels in v18?
That would be nice because most users will probably benefit from it without ever messing with the settings, but for people who need to disable health checks or use specific settings, they have plenty of flexibility to tinker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RFD initially proposed a cluster level setting.

The label-based health check rules would be a lot more flexible than the cluster level one. To avoid multiple rules matching the same database target though, we might have to introduce rule priority? And i feel it might be hard for the user to implement the label matching rules without some dry-run tests or visualization tool to see the "end" configurations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, exactly what I was thinking - out of the box we would create a default config that enables health check for all supported resources and users can always tweak it if necessary. Then it's zero config and follows the principle "make simple things easy and hard things possible".

For conflict resolution, if we go a separate resource route (as opposed to cluster config for example), if multiple match the same resource we would probably have the service that performs health checks just process all matching configurations.

And yeah, regarding user complexity, I actually think this approach will be more familiar and easy to understand for them as we employ the same approach in many different places - database import rules, Okta import rules, login rules, so for those who want to fine-tune things it should be easy to understand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that specifying a health check per resource is untenable. If this is going to be enabled by default we need to ensure it doesn't have any negative impacts on scalability or backend activity.

I like the dedicated health probe resource but I do think we should think about its shape for a bit. Will a general health probe be a one size fits all solution for database, kube, app, desktop, etc. Are there any protocol specific knobs that may need to exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we discussed a new config object that can affect every resource at once, should we add some jitter delay before the first check to avoid triggering a lot of connections simultaneously?

Will a general health probe be a one size fits all solution for database, kube, app, desktop, etc. Are there any protocol specific knobs that may need to exist?

I can't think of what else we'd need for a TCP health probe to support all of db, kube, app, desktop, etc. The only special example I can think of is when the resource has multiple endpoints like a MongoDB cluster, but the config we have still generalizes for that case.

Eventually I think it would be nice to expand on this and allow TLS (w/o client cert) and HTTP(S) checks.
TCP, TLS, HTTP probes are all pretty generic already though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to move all config into a dynamic config resource: 9ea351e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-changelog Indicates that a PR does not require a changelog entry rfd Request for Discussion size/lg
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants