GitHub Issue | Incident Report | Logs
Include:
- A basic narrative of what happened
- Timing: when the incident occurred, when it was resolved, and any other key moments
- Any other relevant facts about what occurred
Include:
- Messages from affected customers
- Links to these conversations within your support system, if applicable
Include:
- Root causes of the incident
- Lessons learned about your team's processes, communication flow, or code
- Ideas for mitigating or avoiding incidents like this in the future
Describe the problem and how it manifested. Include detail on who was affected and how.
Map out every event that led up to the problem. Map out each inflection point where the problem could have been avoided, had the team done something differently. Map out every point where the problem grew worse.
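One way to keep such a map workable is to record each entry with a timestamp and a tag saying whether it was a plain event, a point where the problem could have been avoided, or a point where it grew worse. The sketch below is only an illustration: the `TimelineEntry` and `Kind` names and the sample incident are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Kind(Enum):
    EVENT = "event"            # something that happened
    MISSED_CHANCE = "missed"   # the problem could have been avoided here
    ESCALATION = "escalation"  # the problem grew worse here


@dataclass
class TimelineEntry:
    when: datetime
    what: str
    kind: Kind = Kind.EVENT


def build_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda e: e.when)


def of_kind(timeline, kind):
    """Return only the entries of the given kind, preserving order."""
    return [e for e in timeline if e.kind is kind]


# Hypothetical incident, invented purely for illustration.
timeline = build_timeline([
    TimelineEntry(datetime(2024, 1, 10, 9, 30), "Error rate alert fires"),
    TimelineEntry(datetime(2024, 1, 10, 9, 0),
                  "Config change deployed without review", Kind.MISSED_CHANCE),
    TimelineEntry(datetime(2024, 1, 10, 10, 15),
                  "Retry storm overloads the database", Kind.ESCALATION),
])

for entry in timeline:
    print(entry.when.isoformat(), entry.kind.value, entry.what)
```

Filtering with `of_kind(timeline, Kind.MISSED_CHANCE)` then gives the team a short list of inflection points to discuss.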
Ask yourself and your team the following questions. Repeat for each potential root cause you identify.
- What might have caused the problem?
- Why did that arise? (ask this repeatedly, until you get to the deepest underlying cause)
- Is this a root cause (i.e., the problem could not have occurred had this not happened) or a contributing factor?
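The questions above can be captured as a chain of causes, where each cause links to the answer to "why did that arise?". The following is a minimal sketch under that assumption. The `Cause` class, its `necessary` flag, and the example causes are hypothetical, and deciding whether a cause was truly necessary remains the team's judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cause:
    description: str
    # True if the problem could not have occurred without this cause.
    necessary: bool
    # The answer to "why did that arise?", if one has been identified.
    why: Optional["Cause"] = None


def why_chain(cause):
    """Follow the 'why' links from a cause down to the deepest one."""
    chain = []
    while cause is not None:
        chain.append(cause)
        cause = cause.why
    return chain


def classify(cause):
    """Label a cause using the necessity test from the question list."""
    return "root cause" if cause.necessary else "contributing factor"


# Hypothetical causes, invented purely for illustration.
process_gap = Cause("Config changes are not peer reviewed", necessary=True)
bad_deploy = Cause("A deploy halved the connection pool size",
                   necessary=True, why=process_gap)
symptom = Cause("Database connection pool was exhausted",
                necessary=True, why=bad_deploy)
traffic = Cause("Traffic was above normal that morning", necessary=False)

deepest = why_chain(symptom)[-1]
print(deepest.description, "->", classify(deepest))
print(traffic.description, "->", classify(traffic))
```

Repeating this for each potential cause yields one chain per candidate, which makes it easier to see where separate chains converge on the same underlying cause.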
What evidence do you have to show that each root cause was the culprit?
Can you find evidence that demonstrates the problem could have been avoided, had things been handled differently?
What could be done differently in the future to fix these root causes and avoid problems?
Are there multiple approaches that can help avoid future problems? If so, is there one approach that’s less expensive or less labor-intensive than the others, while still being as effective?
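If the team is willing to put rough numbers on each option, the comparison can be sketched as: keep the approaches whose estimated effectiveness is close to the best one, then pick the cheapest of those. Everything here is an assumption for illustration: the `Approach` fields, the 0.0 to 1.0 effectiveness scale, the person-day cost unit, the `tolerance` default, and the sample options.

```python
from dataclasses import dataclass


@dataclass
class Approach:
    name: str
    effectiveness: float  # team's estimate, from 0.0 (none) to 1.0 (full)
    cost: float           # team's estimate of effort, e.g. in person-days


def cheapest_effective(approaches, tolerance=0.05):
    """Pick the lowest-cost approach among those roughly as effective as the best."""
    if not approaches:
        raise ValueError("at least one approach is required")
    best = max(a.effectiveness for a in approaches)
    # Treat approaches within `tolerance` of the best as equally effective.
    candidates = [a for a in approaches if a.effectiveness >= best - tolerance]
    return min(candidates, key=lambda a: a.cost)


# Hypothetical options, invented purely for illustration.
options = [
    Approach("Rewrite the deploy pipeline", effectiveness=0.9, cost=30),
    Approach("Require review for config changes", effectiveness=0.88, cost=2),
    Approach("Add a reminder to the wiki", effectiveness=0.3, cost=0.5),
]

choice = cheapest_effective(options)
print(choice.name)
```

The numbers are estimates, so the result is best treated as a starting point for discussion rather than a decision.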
After you’ve implemented changes to address the root causes, how effective were they? That is, did you prevent the problem from occurring again?
Have you noticed additional root causes you should address?
Discuss, as a team, how well (or poorly) the process worked. Discuss whether there are things you plan to adjust in the process when you next use the RCA approach.