[v10.1.x] Alerting: Do not exit if Redis ping fails when using redis-based Alertmanager clustering #74399
Backport 5c9aeae from #74144
What is this feature?
The normal gossip-based Alertmanager clustering tolerates network problems. If the peers cannot reach each other, Grafana still runs, albeit without deduplication of high-availability alerts.
grafana/pkg/services/ngalert/notifier/multiorg_alertmanager.go
Lines 130 to 133 in 64652a9
(note the error is ignored)
The Redis-based clustering mostly works the same way. The one exception is the ping that happens when the connection is first created: if that ping fails, the Grafana process exits.
Given we heavily favor availability for notifications, this PR fixes the Redis implementation to follow the same failure semantics as gossip.
The trade-off is that genuinely bad Redis configurations are now harder to discover, because a failed ping no longer stops startup. This is consistent with the other HA configurations, though.
Why do we need this feature?
It removes a failure mode in which an unreachable Redis stops Grafana from starting. The cost is that duplicate notifications may be sent while Redis is unavailable.
Which issue(s) does this PR fix?:
n/a
Special notes for your reviewer:
Please check that: