RFD 203 - Database Health Checks #52400
Conversation
Some comments on the UX and configuration.
> Recall that there may be multiple `db_server` heartbeats for the same database, but the web UI only shows a single resource tile for the database.
> If this is the case, then the tooltip will be shown if any of the heartbeats have an unhealthy `target_health.status`.
> The tooltip content may look something like this (or equivalent singular-form verbiage if there's only one agent):
The tooltip might be ok to start with, but I think eventually this sort of info needs to go on the resource's own "details" page, similar to how we have status pages for integrations. cc @roraback
Oh, like you would click the resource and see an overview of info about it, including health status?
rfd/0203-database-healthchecks.md (Outdated)
```yaml
health_check:
  enabled: true
  interval: 30s
  timeout: 5s
  healthy_threshold: 3
  unhealthy_threshold: 2
```
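For context, a minimal sketch of where this block would sit in an agent's static file configuration. The surrounding fields follow Teleport's standard `db_service` database entry; the names and values here are illustrative only:

```yaml
db_service:
  enabled: true
  databases:
    - name: example-postgres          # illustrative database entry
      protocol: postgres
      uri: postgres.example.com:5432
      # Proposed per-database health check settings from the diff above.
      health_check:
        enabled: true
        interval: 30s
        timeout: 5s
        healthy_threshold: 3
        unhealthy_threshold: 2
```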
I have a suggestion/idea: instead of adding this to every resource, what do you guys think about making the health check configuration either a separate configurable resource, or extending one of the existing cluster config resources to support health checks based on matching the configured resources, for example?
Just brainstorming:
```yaml
kind: health_probe
version: v1
metadata:
  name: default
spec:
  match_labels:
    env: prod
  interval: xxx
  timeout: xxx
  healthy_threshold: xxx
```
This would make it more flexible and more easily extendable to other resource types in the future, and it would let users configure health checks for different resources and see them all in a centralized fashion. And also, separation of concerns FTW.
What do you guys think? @GavinFrazar @greedy52 @rosstimothy
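If the separate-resource route were taken, it would presumably flow through the usual dynamic-resource workflow. A hypothetical sketch, with the caveat that the `health_probe` kind and its fields come from the brainstorm above and are not an implemented API:

```yaml
# health_probe.yaml -- hypothetical resource; if supported, it would be
# applied with `tctl create health_probe.yaml` and inspected with
# `tctl get health_probe/prod-databases`, like other dynamic resources.
kind: health_probe
version: v1
metadata:
  name: prod-databases
spec:
  match_labels:
    env: prod
  interval: 30s
  timeout: 5s
  healthy_threshold: 3
```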
If multiple of these match a resource, then we would sort them by name, I assume?
I like that it centralizes the config, and it's really flexible.
I slightly don't like the added complexity for users to understand how settings are applied.
Overall I like it though.
If we did this, then do you see any problems with creating a preset default that enables health checks with `*:*` match_labels in v18?
That would be nice because most users will probably benefit from it without ever messing with the settings, while people who need to disable health checks or use specific settings still have plenty of flexibility to tinker.
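A sketch of what that preset might look like, assuming the resource shape from the brainstorm above (entirely hypothetical; the wildcard matcher mirrors Teleport's usual `'*': '*'` label syntax, and the values are borrowed from the per-database example earlier in the diff):

```yaml
kind: health_probe
version: v1
metadata:
  name: default
spec:
  match_labels:
    '*': '*'   # match every supported resource out of the box
  interval: 30s
  timeout: 5s
  healthy_threshold: 3
  unhealthy_threshold: 2
```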
The RFD initially proposed a cluster-level setting.
The label-based health check rules would be a lot more flexible than the cluster-level one. To avoid multiple rules matching the same database target, though, we might have to introduce rule priority? And I feel it might be hard for the user to implement the label-matching rules without some dry-run tests or a visualization tool to see the "end" configurations.
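To make the overlap concern concrete, here is a hypothetical pair of rules that would both match a database labeled `env: prod` (names and values are illustrative, and the resource shape again follows the brainstorm above):

```yaml
kind: health_probe
version: v1
metadata:
  name: prod-fast
spec:
  match_labels:
    env: prod    # matches prod databases specifically
  interval: 10s
---
kind: health_probe
version: v1
metadata:
  name: catch-all
spec:
  match_labels:
    '*': '*'     # matches everything, including prod databases
  interval: 60s
```

Without a priority field or a documented tie-break (e.g. sorting by name), it is ambiguous whether the prod database gets checked every 10s or every 60s.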
Yep, exactly what I was thinking - out of the box we would create a default config that enables health checks for all supported resources, and users can always tweak it if necessary. Then it's zero config and follows the principle "make simple things easy and hard things possible".
For conflict resolution, if we go the separate-resource route (as opposed to cluster config, for example) and multiple configs match the same resource, we would probably have the service that performs health checks just process all matching configurations.
And yeah, regarding user complexity, I actually think this approach will be more familiar and easier to understand, since we employ the same approach in many different places - database import rules, Okta import rules, login rules - so for those who want to fine-tune things it should be easy to pick up.
Agree that specifying a health check per resource is untenable. If this is going to be enabled by default, we need to ensure it doesn't have any negative impacts on scalability or backend activity.
I like the dedicated health probe resource, but I do think we should think about its shape for a bit. Will a general health probe be a one-size-fits-all solution for database, kube, app, desktop, etc.? Are there any protocol-specific knobs that may need to exist?
Since we discussed a new config object that can affect every resource at once, should we add some jitter delay before the first check to avoid triggering a lot of connections simultaneously?
> Will a general health probe be a one-size-fits-all solution for database, kube, app, desktop, etc.? Are there any protocol-specific knobs that may need to exist?

I can't think of what else we'd need for a TCP health probe to support all of db, kube, app, desktop, etc. The only special example I can think of is when the resource has multiple endpoints, like a MongoDB cluster, but the config we have still generalizes for that case.
Eventually I think it would be nice to expand on this and allow TLS (without client cert) and HTTP(S) checks.
TCP, TLS, and HTTP probes are all pretty generic already, though.
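If protocol-specific knobs ever do become necessary, one way to keep the resource generic would be a typed probe sub-section. None of this is in the proposal; it is purely an illustration of the design question:

```yaml
spec:
  match_labels:
    '*': '*'
  probe:
    type: tls                          # hypothetical: tcp | tls | http
    # tls_server_name: db.example.com  # example of a protocol-specific knob
  interval: 30s
  timeout: 5s
```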
Updated to move all config into a dynamic config resource: 9ea351e