The subtle art of debugging weird issues
As a senior reliability engineer I’m tasked with solving weird issues on a weekly basis. Sometimes you could classify them as heisenbugs; at other times they are legacy code left over from the dark ages (6+ months ago in IT terms). From time to time, you manage to identify colossal fuckups made by the engineering department, yourself included. Those issues range from old data not adhering to the new data schema, through semi-breaking changes to the API, to plain wrong code. Sometimes you make great efforts to implement security measures, which then get reverted by someone who sees the UI being “broken” for an annoying customer and wants to make them go away.
The basis of all this is problem-solving skills: the skills that you pick up along the way, as you gain more experience and get absorbed in the problem. Beyond that experience, one of the most important things in problem solving is being able to deconstruct the problem into tiny pieces and explain it to someone else. The problem arrives as either an error tracking entry or a customer bug report. In the former case, the error tracking service you are using should send you enough information on the issue itself. In the latter case, sometimes you end up with a customer that was born with the ability to understand tech, and sometimes you end up dealing with cave people. In any case, being able to properly replicate and describe the issue is your main goal. If you can replicate it, you can script it in a test.
With complex issues, writing a test scenario (in code or in plain text steps) which fails under the same conditions as the case you are tasked with is crucial. That way, not only can you understand and begin solving the issue at hand, but you can also reach out for help to someone more experienced in the software module you are fixing. Whatever your position is in the engineering food chain, you never want to get a report that says: “X doesn’t work, can you fix it?” without any additional info. First line support doesn’t have that luxury, and I appreciate the work that they do more and more as I grow older. Their people skills are among the best in the business, and the ability to squeeze water from a stone ranks high on their skills list.
Once you have the issue described and repeatable, it shouldn’t take long to figure out what’s wrong. Following the path through the codebase, you (or a person more experienced with the module) should be able to figure out where the issue is and add a guard clause or fix the code in question. Don’t be afraid to ask for help or pair-program with other engineers on complex issues. Having someone (smarter than you) there looking at the same issue, and discussing it from each other’s perspectives, should be enough to light the lightbulb in someone’s head. Et voilà, the problem is now solved.
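For the guard clause route, a minimal sketch in Python might look like the one below. The scenario is hypothetical: old user records lack the `profile` field that newer code expects, and `display_name`, `profile` and `name` are names invented for this example, not anything from a real codebase.

```python
def display_name(user: dict) -> str:
    profile = user.get("profile")
    # Guard clause: legacy records have no "profile" (assumed old schema),
    # so fall back to the old top-level "name" field and return early.
    if profile is None:
        return user.get("name", "unknown")
    # New-schema records carry the name inside the profile.
    return profile["display_name"]
```

Whether a guard clause is the right call, or the old data should be migrated instead, depends on how many legacy records are still around; the guard is simply the smallest change that stops the error.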
We all wear blinders that steer us in one direction or another. As proven once again in the recent pandemic, you should never let one group of experts steer the business in their direction. You should help your teams grow in more than their basic skills. Topics like philosophy, psychology and general systems thinking should be mandatory reading for everyone in the company. And employees should be encouraged to communicate with other departments as much as possible. A well-rounded team is worth much more than a 10x specialised one (engineer, marketeer, salesman).