The Nightmare Before Devops

Techniques, stories, and rants about running web services using DevOps principles



The Old Lion and the Ford

In Aesop’s fable, “The Old Lion and the Fox,” a fox passing by a cave hears a voice calling out. The voice claims to be a sick, old lion too infirm to leave the cave, and asks the fox to come into the cave to help ease his loneliness. The fox, noting that the only tracks to be seen went into the cave but none came out, wisely refuses. The moral of the story is to learn from the failures of others instead of repeating their mistakes.

Perhaps oddly, I see a corollary to this story in one of my father’s jokes. My father was born and raised in Georgia (the state, not the country), and thus had a large collection of redneck jokes which generally made me cringe. The only one I can remember, though, goes like this:

One man gets a ride from another in his pickup. As they’re driving along, they come up to an intersection with a red light. The driver flies through the intersection.

The passenger says, “Hey, you just ran that red light!”

The driver replies, “Naw, it’s ok. My brother does it all the time.”

They come to another intersection, and again, the pickup blows through the red light.

“Hey, man, that light was red, too! That’s really dangerous!”

“Naw, I told you, my brother drives through red lights all the time and he’s fine.”

At the third intersection, they have a green light. The pickup stops.

The passenger: “Hey, man, what’s wrong with you? Why are you stopping for a green light?”

The driver: “My brother might be coming the other way!”

Dan Luu wrote a blog post, one that I have re-read and shared many times over the years, about the ubiquity in Silicon Valley of a sociological phenomenon called the “normalization of deviance.” The term was coined by Diane Vaughan to explain the gradual, unconscious acceptance of dangerous engineering practices and beliefs in NASA and its contractors that led to the preventable explosion of the Space Shuttle Challenger.

These delusions usually start to take root the first few times someone takes a shortcut or compromises on quality. If there is no bad outcome, the risk must have been overestimated, right? This behavior quickly becomes acceptable, and then it becomes… normal. “Everyone must do it this way, right?” And when they finally hit those 10:1, 100:1, or 1,000,000:1 odds, *that* must have been a fluke, right? That can’t possibly happen again. There are good reasons why the government appointed commissions independent of NASA to investigate the agency’s major fatal disasters. In most tech startups, though, there are few if any external factors to force an objective evaluation of what went wrong. So nothing changes. Then new people come in, and those who are, maybe, in their first job, or joining what’s considered a world-class organization, will often start thinking this must be normal, too, because surely it wouldn’t happen here otherwise. Right?

It’s very frustrating to come into an organization with these delusions. Silicon Valley is supposed to be full of smart people, right? Iconoclasts, critical thinkers, all that, right? I tend to agree with the conclusions in Dan Luu’s post, adding that the arrogance and optimism which drive this industry are actually sometimes more likely to feed normalization of deviance than to defeat it. And whether or not (frequently, not) they can learn from someone else’s mistakes, the resulting cultures all too often cannot learn from their own, because, oh, hey, that wasn’t a mistake. Probably some butterfly in China was flapping its wings. (I’m pretty sure there are lots of butterflies in China that will flap their wings, so, uh, if one can take down your entire website, shouldn’t you still try to build in protection against that?)

I also believe normalization of deviance goes a long way to explain why so many attempts to adopt that whole “devops” thing fail. First, there is the very common mistake of thinking devops == Agile and you just need to put your code in git, throw up a Jenkins server, and change someone’s title from sysadmin/syseng/prodeng/whatever to “devops engineer.” Even when an organization is savvy enough to realize that, hey, this devops thing requires some cultural change, it’s one thing to create new, good habits. What’s often neglected is realizing that you also need to break old, bad habits, something that becomes that much more unlikely if no one can actually recognize these habits as toxic anymore.

I don’t have any obvious solutions. But maybe as a starting place, every time someone makes a compromise in the name of start-up necessity, write it on a Post-It note and put it on a Wall of Deviance in the office. And, just to drive the lesson home, as you stick one on the wall, repeat the mantra, “This is not normal. This is not normal.”

Night on Balderdash Mountain

In the Disney animated classic Fantasia, we were meant to be the most terrified by the segment featuring Mussorgsky’s “Night on Bald Mountain,” with its endless stream of dancing spirits risen from their graves. Balderdash. The most terrifying story was earlier in the film. We were fooled because it starred the plucky Mickey Mouse as The Sorcerer’s Apprentice.

What was so terrifying about a bunch of dancing brooms and buckets? What wasn’t terrifying? When we meet him, Apprentice Mickey is stuck with a neverending rotation of menial, manual tasks. Sweep, sweep, sweep. Fetch, fetch, fetch. So Mickey does what any good engineer with a devops inclination would do: he tries to automate. However, also like any novice engineer without a strong mentor to peer review his pull requests, Mickey creates a fork bomb of brooms, with his marching army flooding the tower. (Furthering the broken culture where he works, his boss, aka The Sorcerer, then gets angry at him. And even then, he probably still doesn’t help mentor Mickey, let alone pay down any of their technical debt.)

One of the hallmarks of a good, strong devops culture is the constant drive towards more and better automation of what were, to varying degrees, manual processes. Performing releases by hand. Manually invoking scripts when cued by an alert. Configuring software or infrastructure via graphical user interfaces. An ever-growing number of tools and services aimed at devops engineers and organizations are driving and driven by the empowerment of automation and the automation of empowerment. And, like all software, and also like many companies saying that they are “now doing DevOps,” some of these offerings deliver on the devops promise better than others.

Maybe you’ve heard the term “code smell,” referring to what may seem like small, cosmetic flaws or questionable decisions in source code, but which are usually indicative of deeper, more serious issues, kind of like traces of digital comorbidity. Similar watermarks can often be found in services that are marketed to devops teams. Some real-world examples:

  • the SaaS is only configurable via a graphical UI
  • the Linux server-side agent is written in JavaScript (you want me to run what on my production servers?)
  • the (rapidly mutating) version numbers of the (again, you’re asking me to run what on my production servers?) agent are hardcoded in the agent’s configuration file
  • the public API makes backward-incompatible changes on a very frequent cadence

Cads Oakley suggests the term “operational whiffs” to cover the analogous markers in devops tools, because it’s a thing. While symptoms like the ones listed above may not always cause issues, they can be indicative of much deeper issues in the producers of the tools: a lack of true understanding of or empathy for devops practitioners. They can create workflows that are not intuitive or native to devops practice; unnecessary pain at upgrade time because of constant backwards-incompatible changes that also require updating any internal automation; and, at worst, a lack of respect for their customers’ time. All of this amounts to creating work and friction for the devops teams the tools are supposed to be helping.

How can software companies trying to sell to devops teams avoid these issues, and create devops-native tooling?

  • Dogfood your tool or service internally. Run it the way a customer would and note any pain points that require human intervention
  • Listen to customer feedback, not just the direct comments, but also the soft signals. “Well, after I edited that file and did this other stuff, it started to work…” And make sure those messages pass from your Customer Success team to the Product and Engineering teams as actionable items, if possible. (For every one of those comments you hear, there are probably a thousand being said under someone’s breath, bookended by expletives.)
  • Hire experienced devops engineers, and more importantly, listen to them

A product can have all the required bullet-point marketing features for its class in the devops tool ecosystem, but if it’s a pain in the ass to configure and maintain, well, then it doesn’t really have all the required features. Eventually, those are the tools most likely to get optimized out in the sweep of iterative improvement of devops momentum.

Don’t Fear the Beeper

In one of Aesop’s fables, a shepherd boy gets his jollies by raising the alarm that a wolf is threatening the herd, tricking the villagers into running out to protect it.  After the boy has pulled his prank a few times, a wolf really does come to threaten the sheep, but by now, the villagers no longer believe the boy’s cries of “Wolf!” and the herd, and in some variants, the boy, are consumed by the predator.

Any engineer who has spent any amount of time in an on-call rotation that invariably bombards them with constant alerts, particularly when a sizable number of those are either invalid or not really critical, has probably also been tempted to risk letting a wolf eat the bratty shepherd boy.   This “pager fatigue” gets compounded when the same issues arise time after time and never get fixed at the root cause level, and the engineer feels powerless to improve the situation over time.  When the on-call rotation is 24 hours, with alerts coming any time of day or night, the physical effects of sleep interruption and deprivation can quickly compound the psychological issues.

I’ve had more than my fair share of on-call rotations from hell over the years.  What would invariably make an already awful situation even worse was the knowledge that it was not likely to get any better.  Particularly when the software developers were well-removed from the bugs in their code that were making the service brittle and unreliable, but also when management would not take the time to prioritize those fixes, or, worse, thought a week every month or two of no sleep for an operations engineer was “normal” and part of the job description (not, of course, of the job description they actually posted to hire you), the situation starts to seem hopeless.  MTTA (Mean Time To Acknowledge) starts to rise as people are too demoralized or physically tired to jump on every page.

For those deeply-unempathetic upper management types who still believe that, well, this is just part of the job, they’re missing the business costs of this effect.  Time spent responding over and over to the same issues which never get fixed at the root cause really adds up over time.  Site uptime and service-level agreements also get put at risk, which can sometimes result in loss of income or financial penalties to customers.  And one of the well-known effects of sleep deprivation (which, after all, is used as a form of torture for a reason) is that it greatly increases the risk of accidents.  Do you really want an engineer who can barely keep their eyes open trying to fix a complex, broken system, when there’s a very non-negligible chance they will just make it worse?

I’ve also personally witnessed in more than one organization a form of “learned helplessness,” where engineers, feeling disempowered and just plain tired, stop pushing for fixes and don’t even take care of simple root-cause fixes within their ability.  Even on the off-chance that everything else about the job is great (and it’s probably not when there’s an elephant in the room everyone is ignoring, so no one is cleaning up the elephant-scale dung heaps, either), the never-ending cycle of frustration and stress becomes demoralizing.

While this may be an unwinnable war at many organizations because of the almost industry-wide, deeply-entrenched normalization of deviance around how pagers are just going to be noisy and engineers are just going to lose sleep, on-call duty should not have to be an unavoidably soul-sucking experience.  And since I’ve just started a new job, and after one week noticed the engineer on-call seemed to be ack’ing a lot of pages, I knew I had to nip that in the bud.  Specifically, before I went on-call.

Here’s my plan of action.

  1. Technical or configuration actions
    1. Get rich metrics for paging alerts.  Unfortunately, the canned reports and analytics for the paging service we’re using leave a lot to be desired, so I will probably have to go through the API and generate the per-service, per-error metrics myself.  I was also looking at Etsy’s opsweekly tool, but it doesn’t support our pager service, either, so we’d have to write the plugin.
    2. Identify the alerts that are non-actionable and stop making them page.  If the on-call can’t fix (or temporarily ameliorate) an issue, they shouldn’t get woken up for it.  If a situation does need handling but does not pose an immediate threat, can it wait until morning?  Then wait until morning.  If it’s still important that people be aware of a situation, send it to Slack or something.  If it’s a bug that can only be fixed by code change, which may take time, you may need to mute it temporarily, although that’s risky.  (Do you always remember to re-enable an alert when the underlying condition is fixed?  Really?)
    3. Prioritize fixes for the most frequent and most time-consuming alerts.  If it’s broke, fix it.  Developer time is precious, yes, and there are new features to deliver, but those same developers are the people getting paged in this on-call rotation.  We’re robbing Peter to pay Paul.
  2. Engineering culture and attitudes towards on call
    1. Get top-down buy-in.  Engineers generally have a natural inclination to fix what’s broken.  However, they need the time and power to do that.  All levels of management need to be cognizant of the entire on-call experience, including being willing to make short-term trade-offs of priorities when possible, with the understanding that the investment will pay off in the longer term in time saved, team effectiveness, and morale.  (Fortunately, I have that now.)
    2. Empower team members to own their on-call experience.  Again, as the developers are in this rotation, they are the ones who can fix issues.  They’re also not removed from the incentive of knowing if they fix a certain issue, it won’t wake them up the next time they’re on-call.  (This very separation of power from pain is one of the factors that has made the traditional dev vs ops silos and their associated services so dysfunctional.)  And if it’s not something that can be fixed quickly, make sure it gets tracked and prioritized as needed for a permanent fix.
    3. Use those metrics to show improvements.  Being able to see in hard numbers that, over time, yeah, we really aren’t getting woken up as often, or interrupted by alerts that we can’t fix is both the goal and incentive.  A noisy pager isn’t something you fix for once and for all, but requires ongoing vigilance and incentives.
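As a concrete starting point for step 1.1, here’s a minimal sketch of the kind of per-service, per-error aggregation I have in mind. The incident record shape here is an assumption for illustration; every paging service’s API returns different field names, so you’d adapt the keys to whatever your provider actually emits.

```python
from collections import Counter, defaultdict

def summarize_alerts(incidents):
    """Aggregate raw incident records into per-service and per-error counts.

    Each incident is assumed (hypothetically) to be a dict with "service"
    and "summary" keys; real paging APIs differ, so map their fields into
    this shape first."""
    per_service = Counter(i["service"] for i in incidents)
    per_error = defaultdict(Counter)
    for i in incidents:
        per_error[i["service"]][i["summary"]] += 1
    return per_service, per_error

# Example: three incidents across two services.
incidents = [
    {"service": "api", "summary": "high latency"},
    {"service": "api", "summary": "high latency"},
    {"service": "db", "summary": "replication lag"},
]
per_service, per_error = summarize_alerts(incidents)
# The repeated "high latency" page on the api service is the obvious
# first candidate for a root-cause fix.
```

Even something this simple, run weekly over the pager API’s incident export, gives you the hard numbers for step 2.3.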

I admit, I’ve been woken up too many times by unimportant, misdirected, or déjà vu millions de fois alerts.  It kills morale, breeds resentment, and has probably shortened my life because of all the sleep deprivation and stress.  I would love to help build an org where the pager is usually so quiet, engineers start to check their phones to make sure the battery didn’t die.  No one’s going to jump for joy when they go on-call, but it’s a win if they aren’t filled with dread.


Driving a Lemon

There’s a popular urban myth known as “The Killer in the Backseat.”  The most common variations of the story involve a lone driver, almost always a woman, on a deserted backroad at night.  Another vehicle, usually some sort of truck, is following her, flashing its lights, trying to pass her, or even trying to ram her. The car’s driver naturally panics and tries to evade the truck, but ultimately either gets pushed off the road or finds a gas station to pull into.  The truck’s driver then runs up to the window to tell her there’s a threatening stranger in the backseat of her car.  She had been mistaken about the true threat all along, instead fearing the person who was trying to help her.

A more realistic automobile fear may be of buying a lemon, a car, usually used, that just ends up having either a series of non-stop issues or one major issue that never quite gets fixed.  (My mother ended up with a lemon once; a Buick with a sticky gas pedal.  You would be at a stop, then gradually press the gas to start moving, and nothing would happen.  If you kept pressing the gas, it would eventually engage and then lurch forward, like it suddenly applied the amount of gas that it should have been ramping up to as the pedal was pushing down.  Apparently neither my father nor the mechanic could replicate the behavior, which, granted, didn’t happen all the time, but when I came to visit and borrowed the car, it happened to me, too.)

How does all this car stuff relate to devops?  Well, let’s set up the following analogy: the car is a web application or service, the mechanic (and car builders!) is the software engineering team, and the driver is the “traditional” operations engineering team.

What do I mean by “traditional” operations engineer?  Basically, when web companies became a thing, they had a clear separation between the people writing the code and the people who kept it running in production.  Much of this is because the operations side generally evolved from pre-web system administrators, the people who kept the servers running in any given company.  Except those companies, whether they were shrink-wrap software companies or government research labs or visual effects companies, rarely scaled in size and customer base at the rate of the new web businesses.  The traditional silo model didn’t translate to web applications, and in fact, it helped create and perpetuate major issues.

[Diagram: the “traditional” silo model of web application development and operations.  Note the one-way arrow.]

With the silo model, developers are so isolated from the environment and reality of keeping their code operational and performant in a 24/7 web economy that they don’t get the proper feedback to help them avoid designs and assumptions that inevitably create issues.  Operations engineers, who are the ones who understand what breaks and what scales in a production environment, can, at best, only give advice and request changes after the fact, when the design is long since finished and the code is already in place and many of the developers have been assigned to a new project.  The broken app stays broken, and as traffic scales linearly or exponentially, often the team that supports the application must scale with it.  If that team is already relatively large because the service is a brittle piece of engineering riddled with technical debt, then the company is faced with either trying to hire more people, assuming it can even attract skilled engineers in this economy, or watching its uptime become an industry joke as the overworked ops people get burned out and leave.

So it is with a lemon.  Maybe the driver can do a few things to mitigate chronic issues, like using a specific kind of higher-grade oil or octane gas, changing belts or filters more frequently, etc., but that can be relatively costly over the life of the car, if it works at all.  Or, as with my mother’s car, she could tell the mechanic the exact behavior, but if the mechanic is not skilled or not sympathetic, they may just ignore her.  But since the mechanic is probably the only one capable of fixing the root cause, that car and its owner are doomed to a lifetime of expensive and frustrating palliative care.

So it goes with web companies.  If the operations team only comes in after the fact to try to manage a poorly-designed, buggy, or non-scalable service, the company is going to throw money at it for the entire life of the service.  Even if the development team has the bandwidth and the desire or requirement to fix bugs escalated by the operations team (and in my experience, which can’t be unique, this is not always the case), if the major issues lie deep in the application design or its fundamental execution, those fixes won’t be easy.

I still encounter and hear of far too many companies and “old-school” engineering higher-ups who think that an operations team that was not consulted (or didn’t exist) during the design and original coding of a service should still somehow magically be able to make any pile of code run in production.  Well, maybe, but only if the bosses are willing to hire a large enough team.  It would generally be more cost-effective, as in most things, though, to fix the root cause: the code.

Let’s take a trivial example.  Say a software developer has written some incredibly inefficient SQL queries for dealing with the backend database.  Exactly how is an operations team supposed to mitigate that on their own?  Well, maybe they could scale the database infrastructure, but that takes money, money that will almost certainly far exceed, over the lifetime of the application (probably just within a couple of days, actually), the amount of money it would take to get the developer to fix the errant SQL queries.
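To make the trivial example concrete, here’s a sketch of the most common form of this problem, the classic N+1 query pattern, against a hypothetical users/orders schema (the schema and data are invented for illustration). The fix is a single aggregate query, which no amount of operations-side hardware scaling substitutes for as cheaply.

```python
import sqlite3

# Hypothetical schema: users and their orders, in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

def totals_n_plus_one():
    """The inefficient version: one query per user.  With a million
    users, that's a million round trips to the database."""
    totals = {}
    for (uid, name) in conn.execute("SELECT id, name FROM users"):
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (uid,),
        ).fetchone()
        totals[name] = total
    return totals

def totals_single_query():
    """The root-cause fix: one aggregate query with a JOIN, one round trip."""
    return dict(conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """))
```

Both functions return the same answer; only one of them melts the database under load. The operations team can see the symptom in the query logs, but only the developer can make this change.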

To sum up, the traditional silos create and perpetuate web services that are brittle and fiscally expensive to run, because the people designing and implementing the services rarely have practical experience of what does and does not work well in production, especially at web scale.  After-the-fact operations teams can only mitigate some of those issues and only at great cost over the life of the application.

This is a simplified overview of the cultural and organizational issues the DevOps movement and its cousin, site reliability engineering, evolved to address and prevent.  I’ll delve into it more later.




Winchester Mystery App

Several years ago, I took the tour of the Winchester Mystery House in San Jose.  The mansion was built by Sarah Winchester, widow of the heir of the Winchester gun fortune.  After suffering through several personal tragedies, she was said to believe that her family and fortune were haunted by the victims of Winchester rifles, and only constant building could appease them.  She spent the last 38 years of her life constantly adding on to the house without ever using an architect and with no real plan, leading to a house with doors that go nowhere or open onto rooms a floor or two down and priceless Tiffany windows set in north-facing walls which receive no direct sunlight.

(My tour group included a little girl named Olivia.  I know her name is Olivia because she kept wandering ahead and touching things she shouldn’t, resulting in her parents’ consistently ineffective pleas to stand still.  Given her likely-conditioned lack of heed, I mentally renamed her “Oblivia.”)

Unfortunately, I’ve seen far too many software projects that bear more than a few similarities to the Winchester Mystery House.  No cohesive architecture, inconsistent conventions and APIs, dead-end methods.  This tends to result in a brittle codebase which is hard to test and even harder to change safely and predictably.  Just as the rooms in Sarah Winchester’s house were connected in unpredictable ways, an application with strange, non-standard dependencies between classes/modules/libraries resists easy, safe changes.  Say you decide you want to change your backing database: if it’s not accessed via a consistent interface class, it’s going to be a lot more work to find all the special snowflake cases in your code and fix them before they destabilize your production service, because, let’s face it, you probably don’t have very good test coverage, either.
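What does “a consistent interface class” look like in practice? A minimal sketch, with invented names and sqlite3 standing in for whatever your real backing store is: every caller goes through one class, so swapping the engine later means rewriting one file instead of hunting snowflake queries across the codebase.

```python
import sqlite3

class UserStore:
    """The single doorway to the backing database.  (Illustrative only;
    the class and schema are not from any real project.)  Application
    code calls these methods and never touches sqlite3 directly, so the
    day you move to another database, only this class changes."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def add_user(self, name):
        cur = self._conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        return cur.lastrowid

    def get_user(self, user_id):
        row = self._conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None

store = UserStore()
uid = store.add_user("sarah")
```

The discipline isn’t free, but it’s what keeps a door in your codebase from opening onto a two-story drop.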

I’m sure a lot of people, especially at start-ups, think this is normal, and maybe it does happen more often than not, but normal should never be automatically conflated with “good.”  So let’s consider scenarios where this kind of application development might arise.

  • Most likely scenario: a small team needs to get an application running for proof-of-concept funding, so they just start writing code.  Developer A is writing the database interface, but dev B, who is writing the storage engine that depends on it, can’t wait for the interface, so just accesses the database directly.  Hey, it works, they get funding, and they’ll just fix it later, except now they need to add features so they can get paying customers, and there’s never going to be enough time.  At some point the team will agree on conventions and so forth, but the tech debt is still in the codebase, accruing interest which will likely result in a balloon payment (probably production downtime) sooner or later.
  • A lack of team discipline, usually under the mantle of “total freedom to get your job done.”  Maybe the project did actually have an architect or at least some agreements on the architecture, but the developers still ended up doing their own thing when they wanted to or just decided it was more expedient to get a feature out the door.  Usually this involves some excuse like, “I didn’t realize we already had a class for that,” or “I know we already use library X but I like library Y better,” or, my personal favorite, “I wanted to try out cool new tool Z, so what better place than in production?”  And now you’re stuck with it until someone has time to rip it all out, except remember, this is Winchester Mystery Code, so that’s a lot harder than it should be, and there’s never enough time to begin with.
  • The company or at least the intended functionality of the app “pivoted,” and rather than start clean, everyone started building on top of the existing code in a new direction.
  • The architect really had no idea what they were doing.

It could be a combination of factors, too, but the net result is the same.  By “the same,” I mean a completely unique bundle of code, but the pain of maintaining it and the services that run off of it remains the same.

I also like to use Katamari Damacy as an analogy for this type of application development.  Katamari Damacy is an old Japanese console video game whose title translates to something like “clump of souls.”  The backstory has the King of All Cosmos destroying all the stars in a drunken rage, so he sends his son (the player), the Prince of All Cosmos, out with a sticky ball to roll up masses large enough to replace the stars.  As the ball picks up bulky, misshapen objects (think skyscrapers pointy side out), the clump becomes much more difficult to steer, and you’re never going to be able to smooth it out.  A badly (or not-at-all) designed application becomes a big clump of bugs, and if the piece of code you need to fix or replace is buried under many interlocking accretion layers, imagine how fun it’s going to be to change it.

Some cases are more avoidable than others, but that doesn’t mean it’s impossible to prevent or mitigate large-scale issues.   While it may sometimes seem like there’s no time to waste because you need to start writing code now (now! NOW!), getting clear plans can speed up development in the short and long runs, because developers won’t be duplicating effort or stepping on each other’s toes or organizing code in a painfully haphazard way which is going to make modification difficult.  Some discipline should be required to make sure new code is using established conventions, and friction should be required before pulling in new third-party dependencies to make sure they aren’t redundant or inappropriate.
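That “friction before pulling in new third-party dependencies” can be as lightweight as a CI check. Here’s a minimal sketch: parse the requirements file and fail the build when a package appears that no one has reviewed. The allowlist contents and requirements lines below are invented for illustration, and real requirements parsing has more cases (extras, URLs, environment markers) than this handles.

```python
# Hypothetical reviewed-and-approved dependency allowlist.
APPROVED = {"requests", "flask", "sqlalchemy"}

def unapproved_dependencies(requirement_lines, approved=APPROVED):
    """Return package names from requirements.txt-style lines that are
    not on the reviewed allowlist."""
    found = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        # Take the bare package name, ignoring simple version specifiers.
        name = line.split("==")[0].split(">=")[0].split("<=")[0].strip().lower()
        if name not in approved:
            found.append(name)
    return found

# Example: someone quietly added "leftpad" to requirements.txt.
reqs = ["requests==2.31.0", "flask>=2.0", "leftpad==1.0  # TODO: why?"]
flagged = unapproved_dependencies(reqs)  # ["leftpad"]
```

A failing build forces exactly the conversation you want: is this redundant with something we already use, and who’s going to own it?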

Once Upon a Time…

There was a kingdom that made most of its money by placing little advertisements everywhere (everywhere!) it could fit them on a webpage. It had to record each of these ads very carefully so it could tax the advertisers and so the kingdom’s tax collectors didn’t come after it for misreporting. The monarch had two regiments to work on making sure the ad serves were processed accurately. One regiment wrote the code to process the ad serves and the other ran the machinery on which the code ran. The two regiments were separated by a big wall. When the first regiment received orders to change their code, they would finish a couple of weeks later and throw the code over the wall separating them from the second regiment, much like a grenade, because the new code would usually explode in the second regiment’s faces. Sometimes it would even explode when the first regiment hadn’t pulled the pin from the grenade they made. However, the second regiment was virtually powerless to get the first regiment to stop lobbing grenades over the wall, and no one lived happily ever after, but the first regiment was definitely getting more sleep.

The End.

Well, not the end. I have countless horror stories in the same vein, though, and sadly, they’re all true. Even now, that’s how software services are still developed and deployed in too many places. The software developers (the first regiment) implement the features requested by the product managers. In a separate silo, the operations team (the second regiment) deals with deploying the code to the servers and trying to keep the whole thing running in the real world, with varying degrees of recourse if the code is unreliable, non-performant, or just a big pile of crap. That model doesn’t work and requires an ever-growing army of round-the-clock operations people, which costs real money over the lifecycle of a service, to keep the show running in this 24/7 Internet economy.

So, yes, this is yet another devops blog. I’ve worked in traditional “operations” roles for many years (past titles include variants of Unix system administrator, systems engineer, service engineer, site reliability engineer, and devops engineer). I plan to share some more of those horror stories, talk about best practices for designing reliable, scalable services, and whatever else seems relevant or interesting.
