[v10.1.x] Alerting: Do not exit if Redis ping fails when using redis-based Alertmanager clustering #74399

Merged: 1 commit, Sep 5, 2023

grafana-delivery-bot[bot]

Backport 5c9aeae from #74144


What is this feature?

The normal gossip-based Alertmanager clustering tolerates network problems. The peers might not be able to reach each other, but Grafana will still run, albeit without deduplication for high-availability alerts.

err = peer.Join(cluster.DefaultReconnectInterval, cluster.DefaultReconnectTimeout)
if err != nil {
	moa.logger.Error("msg", "Unable to join gossip mesh while initializing cluster for high availability mode", "error", err)
}

(Note that the error is only logged, not returned, so startup continues.)

Redis-based clustering mostly works the same way. The one exception is the ping that happens when the connection is first created: if that ping fails, the Grafana process exits.

Given we heavily favor availability for notifications, this PR fixes the Redis implementation to follow the same failure semantics as gossip.

The trade-off is that genuinely bad Redis configurations are now harder to discover. It's consistent with other HA configurations, though.

Why do we need this feature?

It removes a failure mode in which Grafana exits when the initial Redis ping fails, at the cost of possibly sending duplicate notifications.

Which issue(s) does this PR fix?:

n/a

Special notes for your reviewer:

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

…tmanager clustering (#74144)

Do not fail redis peer construction if ping fails

(cherry picked from commit 5c9aeae)
@grafana-delivery-bot grafana-delivery-bot bot requested review from a team, rwwiv, JacobsonMT, yuri-tceretian and grobinson-grafana and removed request for a team September 5, 2023 15:44
@grafana-delivery-bot grafana-delivery-bot bot added this to the 10.1.x milestone Sep 5, 2023
@alexweav alexweav merged commit 01d039b into v10.1.x Sep 5, 2023
@alexweav alexweav deleted the backport-74144-to-v10.1.x branch September 5, 2023 16:07
@zerok zerok modified the milestones: 10.1.x, 10.1.3 Sep 19, 2023
@zerok zerok modified the milestones: 10.1.3, 10.1.5 Oct 11, 2023