Sunday, June 01, 2014

Why should people get paged at night *ever*?

Someone over at Etsy posted a nice "Sleep Driven Development" article and it brought to mind my personal jihad against pager alerts.

There are a handful of major, well-known tech employers that adhere to "DevOps" or "NoOps" practices and have all engineers on pagers. If you ask engineers leaving these companies what they didn't like about working there, being on-call comes up very often. I've also noticed a strong relationship between being on-call and short tenures at companies.

My first foray into pager duty was the email system at Groupon. When I got there, it used a third-party email sender and a really ill-designed process for getting 100% customized email through to our users based on relevance. A lot of that process had come from consulting companies (note to self: never have consultants work on systems that you ultimately end up responsible for on pager duty). Anyway, pretty much every night I got a call from Indianapolis that something broke. Every night. I had to move out of my room so as not to wake my wife. I still have PTSD when I get a call from a 317 area code.

My goal at that job became to build a system so reliable that no one would ever get paged, ever. I can still picture the SPOFs in that system and how I wanted to get rid of them (I left for another opportunity before getting the chance). I really hope that today, the people working on that system never get paged and it "just works".

Anyway, therein began my attempt to destroy pager duty forever. Once a company gets to a certain size, there's no reason anyone should have to carry a pager. The system should be so redundant that failovers are completely automated, and nobody hears about it unless Pingdom reports the site down. Then someone gets called.

Yes, there are times when a company is small and can't manage redundancy like this. But once you hit the threshold, spend the money on it. There's no reason not to. Most of the time what we're talking about here is a website. Hypertext over port 80, for God's sake. We're not talking about the primary heat exchanger on a nuclear reactor in a tsunami zone. Spending money on redundancy will not break the bank. Spend as much as you have to on technology to make failovers seamless, to the point where someone can come in in the morning and see a list of what failed and needs fixing.
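To make that policy concrete, here is a minimal sketch of what I mean, not a real implementation. Everything in it is hypothetical: the `FailoverPolicy` class, the replica names, and the `is_healthy`, `site_is_up` and `page` callables are stand-ins for whatever health checks, external monitor (something Pingdom-like) and paging tool you actually run.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverPolicy:
    """Hypothetical policy: fail over silently, page only on a user-visible outage."""
    replicas: List[str]                 # redundant instances currently in rotation (assumed names)
    is_healthy: Callable[[str], bool]   # internal health probe (assumed)
    site_is_up: Callable[[], bool]      # external check, standing in for a Pingdom-style monitor
    page: Callable[[str], None]         # wakes a human; the last resort
    morning_report: List[str] = field(default_factory=list)

    def run_once(self) -> None:
        # Pull unhealthy replicas out of rotation and note them for the morning.
        for name in list(self.replicas):
            if not self.is_healthy(name):
                self.replicas.remove(name)
                self.morning_report.append(f"{name} failed and was removed from rotation")
        # Only a failure that users can actually see justifies waking someone.
        if not self.site_is_up():
            self.page("site is down despite automated failover")


# Example: one replica dies overnight, the site stays up, nobody is paged.
pages: List[str] = []
policy = FailoverPolicy(
    replicas=["web-1", "web-2", "web-3"],
    is_healthy=lambda name: name != "web-2",
    site_is_up=lambda: True,
    page=pages.append,
)
policy.run_once()
```

In this example `web-2` ends up on the morning list and `pages` stays empty. The pager only fires when the external check says the site itself is gone.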

Ultimately, if you don't spend the time, effort and money to make your systems redundant, all you're going to do is burn your engineers out until they leave. You asked them to do "devops" or "noops", then didn't spend the money or time to make it so they won't need a pager.

Then again, there's another telling type of engineer: the one who designed and built the system that's now paging him all the time and driving him to leave. That should be a red flag for hiring, am I right?
