I never lose, I either win or learn.
- Nelson Mandela
A lot can be said about that quote, and it has been paraphrased multiple times. It applies to all areas of life, business, and sports. If you are doing a thing, there is a chance that something might go wrong. That something going wrong can mean a multitude of things, and anyone could be at fault there.
Uncovering and then writing down what happened will help you understand what went wrong. If you do an excellent job with that, it will be a learning opportunity for you and your whole company. If your company is large enough, publishing your mistakes is a wonderful way to appear “human” to others. And you can teach more people how to avoid your mistake.
Software is a fragile beast and tends to break down in the most unusual places. It doesn’t matter how much you QA it, or what percentage of the code is covered by automated tests. If you have users using your software, their actions will break it from time to time. What we can do is spot those issues (with error tracking and performance monitoring) before the users notice them. That philosophy requires being proactive and having a dedicated person or team doing it.
A good post-mortem in my book doesn’t have to be long or overly detailed. It should answer a couple of questions:
- What happened?
- What caused it?
- What did we do to fix it?
- How could we have prevented it?
What happened?
The answer to this question is easy: we need to provide the details of the outage. E.g. We had an unplanned downtime of the service on 13th June 2020 at 13:34 UTC.
What caused it?
The answer to this question needs to provide more details on what caused the downtime/P1 bug. E.g. We onboarded a new customer the day before, who had an unexpected dataset. The dataset exposed our backend issue with side-loading records. Side-loading records, in addition to N+1 queries, caused all our backend processes to lock the database, and we experienced downtime.
What did we do to fix it?
This should be the entire process of fixing it, even if we made mistakes along the way. Mistakes happen, and since it’s impossible to know all the facts beforehand, we could be misinformed while we are trying to solve it. E.g. Upgraded the database instance to have more memory/CPU on board to handle huge loads. Optimised the code to reduce N+1 queries. Added pagination to the breaking page…
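If "reduce N+1 queries" sounds abstract, here is a minimal sketch of what that kind of fix can look like. It is not the code from the incident above: the `customers` and `orders` tables, the sample rows, and the function names are all made up for illustration, and it uses Python's built-in `sqlite3` module only so that it runs anywhere.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            total REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex"), (3, "Initech")],
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(1, 10.0), (1, 5.0), (2, 7.5)],
    )
    return conn


def totals_n_plus_one(conn):
    """N+1 pattern: one query for the list, then one more per customer."""
    queries = 0
    result = {}
    customers = conn.execute(
        "SELECT id, name FROM customers ORDER BY id"
    ).fetchall()
    queries += 1
    for customer_id, name in customers:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE customer_id = ?",
            (customer_id,),
        ).fetchone()
        queries += 1
        result[name] = row[0]
    return result, queries


def totals_single_query(conn):
    """Same result from a single JOIN, however many customers there are."""
    rows = conn.execute(
        """
        SELECT c.name, COALESCE(SUM(o.total), 0)
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.id
        """
    ).fetchall()
    return dict(rows), 1


if __name__ == "__main__":
    conn = make_db()
    print(totals_n_plus_one(conn))
    print(totals_single_query(conn))
```

With three customers the first version issues four queries and the second issues one. The gap grows with the size of the dataset, which is why a single unusually large customer can be enough to expose it.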
How could we have prevented it?
They say hindsight is always 20/20, because we know what should have been done after the problem was resolved. Ideally, we would get to the root cause of the issue and then extend our procedures to prevent it from happening again. But it’s not that easy since there are always unknown unknowns in the world. The next big customer you onboard might have a weird hierarchy structure that works for them and their previous IT solution supports it, but it never came up in the pre-sales discussions.
The main point is not to stress too much if the issue wasn’t catastrophic, and most issues aren’t. Mistakes happen, and the only way to prevent them from happening is not to release anything ever. Software is a living thing that sometimes works in mysterious ways, and you must accept it as such.