I plan to (let’s see how it pans out) write a series of articles on what I learned (and am still learning) during my stay at intuo. There is so much stuff I want to at least document for myself, and I’m afraid the experience will get lost if I don’t do it. There are some acronyms I will be throwing around in this series:
- RPO (Recovery Point Objective)
- RTO (Recovery Time Objective)
- SLA (Service Level Agreement)
- CVE (Common Vulnerabilities and Exposures)
- DR (Disaster Recovery)
- SRE (System Reliability Engineer)

Plus many more that I will remember when I get to them later on.
To preface this: recently I found myself in my new role as an SRE. To be honest, there is nothing new about this role, or about the tasks that I’ve been doing for ages now. When the company was acquired last year, the engineering team doubled in size and us seniors got to play with our own things. My thing was always tinkering with hardware and fiddling with different OSs (sysadmin was my first paid job).
My role in the company was to do whatever was necessary to push features out, while also keeping the servers running, supporting high-scale customers and their integrations, and hunting heisenbugs. Now it’s the senior SRE position, where I do the same thing as before, with a bit less feature work, though features still arrive on my plate from time to time. At this moment (in time, in my life/career) I found myself wondering what I was actually trying to do.
SRE is an acronym for System Reliability Engineer. We have a system in place, and I’m somewhat of an engineer (at least by trade), so I’ve got those covered already. Some people find this role stressful, since the whole life of the company depends on a select few keeping the lights on. But this is far from a loner’s job. It’s impossible to be stuck in your own basement, looking at the blinking lights and replacing drives when a red light comes on. Usually you are so detached from the hardware that everything is running ‘somewhere’ in the cloud. More often than not, Amazon owns that cloud.
As an SRE you have to handle most of the ops stuff (think DevOps, but with a security cherry on top). Infrastructure architecture, provisioning, and deploy strategies all fall under running the product your company is selling. So does monitoring the systems and applications for exceptions, identifying code bottlenecks, and watching for server outages, then dropping everything else to fix whatever could cause downtime. Every couple of months you get a report from the security officer, and I’ve been lucky to have worked with two excellent security people in my career. That report is the top priority after preventing downtime, because no one wants to leave open holes for someone to break in. That whole soup of tasks also includes reviewing pull requests for potential performance issues or evident bugs.
But the main thing here is preventing downtime, because downtime is the true measure of an SRE’s self-worth. No one in the company should ever notice that we exist, but everyone starts noticing when things fall apart. This is why most people find the role stressful, until you realise that some downtime is okay, as long as it’s well within the SLA conditions.
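To make that SLA point concrete, here is a quick back-of-the-envelope calculation of how much downtime a given uptime percentage actually allows. The percentages are illustrative examples, not any real contract’s numbers:

```python
# Back-of-the-envelope: how much downtime an uptime SLA actually permits.
# The SLA percentages below are illustrative, not real contract figures.

MINUTES_PER_DAY = 24 * 60

def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes over `days` days for a given uptime SLA."""
    return (1 - sla_percent / 100) * days * MINUTES_PER_DAY

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime -> {downtime_budget_minutes(sla):.1f} min per 30 days")
```

Even a 99.9% SLA leaves roughly 43 minutes a month to play with, which is why a short, well-handled outage doesn’t have to ruin anyone’s day.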
It took a bit of pointless rambling to answer the question: what is a reliable system, in the end? Having a working system should be the start, so something that doesn’t blow up as soon as some traffic hits it. A system that’s secure enough that a malicious person can’t take it down without a large orchestrated effort. A system that’s scalable enough that you can throw more hardware at it faster than the traffic rate increases. A system that is fast enough to handle at least 10 times the usual load. This way you have enough time to scale up when sales or marketing does a tremendous job and lands a huge customer. A reliable system is one where no one notices the SRE team exists, because then everyone else is able to do their jobs better. Marketers market the product for sales to sell it, so we have enough resources to pay Amazon for more compute power if necessary.
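The 10x headroom figure above is really about buying time. As a rough sketch (the growth rate here is a made-up illustrative number, not a measured one), you can estimate how many weeks of compound traffic growth it takes to eat through a given amount of capacity headroom:

```python
# Sketch of why headroom buys time: weeks until compound traffic growth
# catches up with a capacity multiple. The 5% weekly growth is made up.
import math

def weeks_until_saturation(headroom: float, weekly_growth: float) -> float:
    """Weeks until load grows by a factor of `headroom` at compound `weekly_growth`."""
    return math.log(headroom) / math.log(1 + weekly_growth)

# With 10x headroom and traffic growing 5% week over week:
print(f"{weeks_until_saturation(10, 0.05):.0f} weeks of runway")
```

At that (hypothetical) growth rate, 10x headroom is close to a year of runway, which is plenty of time to provision more hardware, unless a single huge customer lands and jumps the curve overnight.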
To be continued…