[2025-01-30] Incident Thread #150334
Replies: 6 comments 1 reply
-
Update: Users may experience timeouts in various GitHub services. We have identified an issue with our caching infrastructure and are working to mitigate it.
-
Update: We are seeing recovery in our caching infrastructure. We are continuing to monitor.
-
Update: We will be failing over one of our primary caching hosts to complete our mitigation of the problem. Users will experience some temporary service disruptions until the failover is complete.
-
Update: We have completed the failover. Services are operating normally.
-
Incident Resolved: This incident has been resolved.
-
Incident Summary: On January 30th, 2025, from 14:22 UTC to 14:48 UTC, web requests to GitHub.com experienced failures (the error rate peaked at 44%), with the average successful request taking over 3 seconds to complete. The outage was caused by a hardware failure in the caching layer that supports rate limiting, and the impact was prolonged by the lack of automated failover for that layer. Following recovery, we manually failed the primary over to trusted hardware to ensure that the issue would not recur under similar circumstances. As a result of this incident, we will be moving to a high-availability cache configuration and adding resilience to cache failures at this layer so that requests can still be handled should similar circumstances arise in the future.
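For readers wondering what "resilience to cache failures" in a rate-limiting layer could look like, here is a minimal sketch of one common pattern: failing open when the cache is unreachable. This is not GitHub's implementation, and the summary does not say which approach will be used. `FailOpenRateLimiter`, `CacheUnavailable`, and the `incr` callable are hypothetical names invented for illustration, and the fixed-window counter is an assumed limiting scheme.

```python
import time
from typing import Callable


class CacheUnavailable(Exception):
    """Hypothetical error raised by a cache client when the backend is down."""


class FailOpenRateLimiter:
    """Fixed-window rate limiter that lets requests through if the cache fails."""

    def __init__(
        self,
        incr: Callable[[str], int],  # hypothetical: increments a key, returns the new count
        limit: int,
        window_s: int = 60,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._incr = incr
        self._limit = limit
        self._window_s = window_s
        self._clock = clock

    def allow(self, client_id: str) -> bool:
        # All requests in the same time window share one counter key.
        window = int(self._clock() // self._window_s)
        key = f"rl:{client_id}:{window}"
        try:
            count = self._incr(key)
        except CacheUnavailable:
            # Fail open: a cache outage should not turn into failed requests.
            return True
        return count <= self._limit


# In-memory stand-in for the cache, for demonstration only.
_counts: dict[str, int] = {}


def in_memory_incr(key: str) -> int:
    _counts[key] = _counts.get(key, 0) + 1
    return _counts[key]


def broken_incr(key: str) -> int:
    # Simulates the cache host being unreachable.
    raise CacheUnavailable(key)


healthy = FailOpenRateLimiter(in_memory_incr, limit=2, clock=lambda: 0.0)
degraded = FailOpenRateLimiter(broken_incr, limit=2, clock=lambda: 0.0)

healthy_results = [healthy.allow("client-a") for _ in range(3)]
degraded_results = [degraded.allow("client-a") for _ in range(3)]
```

With a healthy cache the third request in the window is rejected. With the cache down every request is allowed. The trade-off is that rate limits are not enforced during the outage, which is why a design like this would typically be paired with the high-availability cache configuration the summary mentions rather than replace it.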
-
❗ An incident has been declared:
Incident with Pull Requests and Issues
Subscribe to this Discussion for updates on this incident. Please upvote or emoji react instead of commenting +1 on the Discussion to avoid overwhelming the thread. Any account guidance specific to this incident will be shared in thread and on the Incident Status Page.