Incident Postmortems: Tips, Templates and the #1 Success Factor

Stuff happens! Our IT systems are incredibly complex. Inevitably, things will break and customers will experience the consequences of these failures. You will feel stressed, angry, frustrated, and pressured to get it fixed ASAP. Once you have the problem fixed and major systems restored, you probably want to forget the whole thing ever happened. Don’t.

When we have experienced a major loss or degradation of our IT services, it is essential that we learn from what happened. A learning approach ensures either that the incident doesn’t happen again, or that we can remedy the situation more expediently than the first time around.

Of course, our learning is no use if we can’t remember what we learned. Thanks to how our brains work, we tend to forget the specific highs and lows of a project, especially when trying to recall them months or years later. And that’s why we must document our lessons learned—in a document often known as a postmortem.

Major incident reviews, or incident postmortems, form an important part of any continual improvement program. These reviews are opportunities to improve both our IT infrastructure and, possibly more importantly, our processes for dealing with these events. A mature organization will treat these events as valuable learning opportunities rather than as occasions to apportion blame for errors.

Let’s explore incident postmortems, including the #1 factor for their success. Then, we’ll cover the benefits, rules, and best practices for creating your incident reviews.

(This tutorial is part of our IT Leadership & Best Practices Guide.)

What is a postmortem?

Performing a postmortem may sound a bit dark and depressing—it literally translates to “after death”—but it’s actually meant to shed light on a significant problem. A postmortem process comes at the end of a project and helps you both determine and analyze successes, non-successes, and failures. The outcome of this process is a document or report that aims to inform best practices and mitigate risks in the future.

Postmortems, or lessons learned reports, can be performed after anything:

In IT, most postmortems tackle incidents: a severe problem, downtime, or outage that has an immediate impact on users. The postmortem should document detailed information regarding every aspect of the incident: from the root cause to the successful resolution, and all the lessons you might glean from the whole thing.

Postmortems that fail

Perhaps you’ve been involved in an incident postmortem, but decided to scrap it for more “important” work. Maybe you filed the report but, now that it’s hidden away, the recommendations therein haven’t been adopted.

These are the two biggest problems with creating IT postmortems: people dismiss them as non-essential, so the reports aren’t always read, let alone adopted, by the people who can effect change. Because of this, many people immediately see postmortems as an unworthy investment of time and resources.

A few reasons point to why we might dismiss documenting these lessons learned:

For a postmortem to be useful, it must provide specific recommendations for changes, such as changes to policy or processes. If it’s just documenting for documenting’s sake, it’s a waste of everyone’s time.

#1 factor for incident postmortems to succeed

In my opinion, the most critical success factor for incident reviews is that they are blameless.

To use a popular phrase: do not make your incident postmortem a witch hunt. ‘Blamestorming’ sessions do not benefit anyone. If your company culture seeks out the person who may have caused, through error or omission, a major outage, it is extremely unlikely that you will get truthful answers during the review. (Besides, most incidents are more nuanced than one person failing at their duties.) In this culture, no smart person would be willing to raise their hand and admit a mistake. When that happens, your postmortem has failed before it’s begun.

Consider a company culture that rewards honesty rather than demonizing mistakes. People will put up their hand willingly to flag an error they may have made. Then, real and useful changes can be made to prevent it being made again in the future.

Benefits of incident reviews

A successful postmortem goes well beyond reviewing how you handled the incident’s resolution. The best ones uncover unknown system problems and highlight areas you can improve or automate to reduce risk. A well-run postmortem allows your team to come together in a less stressful environment to achieve several goals:

Of course, incident reviews aren’t just for internal stakeholders. Ultimately, your incident reviews show your customers two important characteristics about your company, which provide invaluable benefits:

How to conduct incident postmortems

Like many things in IT, incident postmortems run much more smoothly (and take significantly less time) if you have a process and some basic rules in place. So, let’s set a few:

  1. Have a template. Create a template that you will work from for each review. This ensures you don’t miss anything. A template also provides the basis for the reporting that goes to your management team and the communications that go out to affected customers and stakeholders.
  2. Define roles and owners. The owner of the review is responsible for managing the meeting and producing the subsequent report. The owner should be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact.
  3. Set rules around which incidents need reviews. You must have clear, well-defined rules about which incidents will trigger the postmortem process. A good rule of thumb is to review any incident that has been given a severity one rating. There may be other incidents where a review would be useful. Consider establishing a process whereby service owners can request reviews of incidents that do not meet the severity criteria but that may have severely impacted their services and customers.
  4. Act promptly. A critical incident will almost always require some downtime for your team, but do not delay any longer than necessary. Procrastinating too long means that important details are forgotten. So, when a critical incident occurs, convene within 24 to 48 hours, and certainly do not delay more than a week.
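Rules 3 and 4 are simple enough to write down as code, which can help keep them unambiguous. The following is only a minimal sketch: the `Incident` record, its field names, and the choice to measure the review window from the resolution time are all assumptions made for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

# Thresholds taken from the rules of thumb above.
REVIEW_TARGET = timedelta(hours=48)  # convene within 24 to 48 hours
REVIEW_LIMIT = timedelta(days=7)     # certainly no more than a week


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    title: str
    severity: int                   # 1 = most severe
    resolved_at: datetime           # assumption: the clock starts at resolution
    review_requested: bool = False  # set when a service owner asks for a review


def needs_postmortem(incident: Incident) -> bool:
    """Severity-one incidents always qualify; others only on request."""
    return incident.severity == 1 or incident.review_requested


def review_window(incident: Incident) -> Tuple[datetime, datetime]:
    """Return (target time, hard limit) for holding the review meeting."""
    return (incident.resolved_at + REVIEW_TARGET,
            incident.resolved_at + REVIEW_LIMIT)
```

A lower-severity incident is skipped by default, but setting `review_requested` to `True` brings it into the process, mirroring the service-owner request described in rule 3.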

Create a postmortem template

The responsibility to research, write, and publish a postmortem report lies with the project manager or the person most responsible for a particular outage or data loss. (By responsible for, we mean the person who immediately begins fixing it, not the person who caused it—as many times, these outages occur without human interference.)

An IT postmortem report need not be complicated. In fact, simplicity encourages us to complete these reports and encourages others to actually read them. Include specific information that focuses on the key factors of the incident without bogging the reader down with unnecessary details. Here are the core components of a successful postmortem report:

Summary

First, create a brief summary of the incident. This part of the document should be short, just 1-2 sentences that answer the question “What happened?” This lets readers determine if this report applies to them. Also include details like a relevant, easy-to-understand title; the authors and date; and the most recent status.
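If your template lives in code rather than in a document, the summary fields could be captured in a small structure like the one below. This is a sketch only: the class name, field names, and output layout are hypothetical choices for illustration, not a standard format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class PostmortemSummary:
    """Hypothetical container for the summary fields described above."""
    title: str          # relevant and easy to understand
    authors: List[str]
    written_on: date
    status: str         # most recent status
    what_happened: str  # keep this to one or two sentences

    def render(self) -> str:
        """Format the summary as a plain-text report header."""
        return "\n".join([
            f"Title: {self.title}",
            f"Authors: {', '.join(self.authors)}",
            f"Date: {self.written_on.isoformat()}",
            f"Status: {self.status}",
            f"Summary: {self.what_happened}",
        ])
```

Making every field required means a report cannot be rendered with the title, authors, date, or status left out, which is the main point of working from a template.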

Background

Next, immediately after the brief summary, include any supporting information that’s necessary for understanding the incident. This information offers supplementary (but still concise!) details to help the reader understand the context of the incident.

Incident

Now you’re into the body of the postmortem report. Include a description of the events that’s detailed enough so that someone who wasn’t involved in the incident can understand what occurred. Use timestamps to provide insight into how and when everything unfolded. Use these questions to guide your writing: