
The Nightmare Before Devops

Techniques, stories, and rants about running web services using DevOps principles

Burst Water Mains

Scary story: commuting in LA traffic.

That’s it. If you’ve ever lived there, that’s enough.

Now imagine you’re on a major surface street during an evening commute. (“Surface street” == not a highway; “major” means it’s as wide as one. LA traffic has its own jargon, and when I moved to the Bay Area, I was surprised by how much of that jargon was local to SoCal specifically.) You’re stopped at a major intersection (where two major surface streets meet) and suddenly water starts gushing up into the intersection from the maintenance holes.

Yeah, you’re not getting home any time soon.

In the summer of 2009, when I was still living just outside LA proper, this scenario played out multiple times. The LA Department of Water and Power (DWP) recorded 101 water main breaks, more than double the preceding year. (The city’s water mains tend to run beneath major streets, hence the locations for the flooding.)

Some of the bursts just caused traffic disruptions, but some were large enough to cause real property damage to buildings. One blowout opened up a sinkhole that tried to eat a fire engine.

As the realization hit that the numbers were much higher than in previous years, the DWP and others began floating theories, including:

  • The most obvious reason: the age of the mains. Most of them were iron pipes dating back almost a full century. The DWP had already started planning replacement work.
  • Temperature variations, although the summer of 2009 was about average.
  • Increased seismic activity.
  • Statistical anomaly.

City engineers were puzzled, though, because the breaks were taking place all over the city. (They also seemed to happen at many different times of the day and night, but I can’t find a comprehensive list.)

However, 2009 was a bit different from previous years in one critical way. The DWP had instituted new lawn-watering rules for water conservation that went into effect at the beginning of the summer. Automatic sprinkler systems could only run on Mondays and Thursdays and only outside the hours of 9AM-4PM.

The new water rationing was not a leading theory at the DWP, though, because many other cities in the area, including Long Beach, had similar rules but without the increased incidence of pipe blowouts.

The city created a panel of scientists and engineers to investigate. In the end, they found that water rationing was the key player here. While the age and condition of the pipes played a major role, the extreme changes in water pressure in the pipes between days with and without residential watering proved to be the tipping point. As a result of the findings, the city changed the water conservation policy to try to maintain more consistent flows.

(While we’re here, let’s note the irony of a policy designed to conserve water that led directly to conditions that sent millions of gallons of it flooding city streets.)

Production service incidents tend to follow similar patterns. There is never just one root cause. Instead, as in LA in the summer of 2009, pre-existing, less-than-ideal conditions suddenly get pushed past their breaking point by a sometimes small change. I don’t know how much research was conducted before the city decided to institute the original conservation policy, or whether water pressure changes and their effects on the mains were even considered. If the DWP was not consulted, that’s a contributing factor. If they were consulted and, for whatever reason, did not anticipate issues given the known state of their infrastructure (remember, they were already planning to replace century-old pipes, a process which is ongoing to this day), that’s a contributing factor, too.

One of my favorite examples of a (very, in this case) complex chain reaction of events colliding with less-than-ideal conditions comes from the write-up of the AWS S3 outage in us-east-1 in February 2017. In addition to the sheer size and length of the outage, it also gave many engineering teams and users a chaos monkey look into which services had hard dependencies on that specific S3 region; one of these impacted services was AWS’s own status board. AWS had to use Twitter to supply updates to customers during the outage.

The media at the time kept writing stories with headlines like, “Typo Takes Down S3,” but that was not only a gross oversimplification, it was… well, maybe not wrong, but arbitrary scapegoating. Here are some headlines they could have used that would have been equally valid, and equally incomplete given the lack of scope:

  • “Maintenance Script That Did Not Check Input Parameters Takes Down S3”
  • “S3 Suffers Regional Outage Because AWS Stopped Testing for Regional Outages”
  • “S3 Collapses Under the Weight of Its Own Scale”

Honestly, someone should make a Mad Libs based on those headlines.

At any rate, focusing on a chain of discrete or tightly-coupled events in a post-mortem makes less sense than focusing on the contributing conditions, at least if you genuinely want to prevent future issues. These conditions, especially where humans are involved (which is everywhere), are highly contextual and directly related to the organization’s culture. If the engineering teams involved have a culture of strong ownership and collaboration, your set of solutions, both technical and process-related, can and probably should be very different than if the team has a lack of discipline or a lack of footing in reality. And in the latter case, ideally (but probably not realistically), the existing culture should be a target of remediation.

Those century-old cast-iron water mains would have failed sooner or later. In fact, they still do on a regular basis. However, a well-intentioned policy change meant to address a situation unrelated to the health of the infrastructure, and which may or may not have taken that infrastructure’s decrepit, degraded condition into account, created some chaotic water main blowouts that summer. If you’re looking at a production incident and your “root cause” is singular, whether it’s just one event, or one pre-existing condition, or a simple combination, you’re not going to prevent anything in the future, except perhaps effective incident prevention itself.

(P.S. You should read John Allspaw’s more definitive writings on incident analysis.)

Fake It Until Ik Spreek Het

In the story “The Page Who Feigned to Know the Speech of Birds” from 1001 Nights, a servant overhears his rich boss telling his wife that she should spend the following day relaxing in their garden. That night, the servant sneaks into the garden, placing several items.

The next day, the servant accompanied the lady to the garden. As they walked around, a crow cawed out. The servant thanked the bird and told the lady it had said there was food under a nearby tree, which she should eat. Since the lady was apparently not too bright, she took this to mean that the servant could understand the birds’ language. The next time the crow cawed, she asked the servant to translate. He replied that she could find some wine under another nearby tree. Drinking the wine that was, in fact, there, the lady became even more impressed with the servant. The third time the crow cawed, the servant thanked it and told the lady the bird had told him there were sweets under yet another tree, by which time she found the servant completely fascinating.

The next time the crow cawed, the servant threw a rock at it. The lady asked him why he would do that, and he replied…

Ok, this story gets a little adult here, so you can go read the ending on your own if so inclined.


I recently gave an online talk on zero trust architectures in Kubernetes for Cloud Native Day. When I learned that the conference was based out of Québec, I was told that bilingual slides weren’t required, but I decided to try my hand at them anyway.

I am by no means fluent in French, although I took French all through high school, and in the past couple years, I’ve been practicing on Duolingo to refresh and update it. My skills are mostly along the lines of « Je peux probablement trouver mon hôtel, commander mon dîner et m’excuser pour mon terrible accent » (“I can probably find my hotel, order dinner, and apologize profusely for my horrible accent.”) (I’m also learning Dutch and Spanish on Duolingo, hence the wonky English/Dutch bilingual post title.)

Embedded slide deck from Trust No 8 / Ne faisez pas confiance à 8

Here are some tips if you ever happen to find yourself in the position of preparing bilingual slides for a technical talk when you are familiar with, but not fluent in, the second language.

  • Puns probably don’t translate. I gave my talk a (very bad) title before I realized I was going to try to make it at least a little bilingual-friendly.
  • Keep the slides simple. Really, most recommendations for technical slide content say you should limit the amount of text on slides. People should be listening, not reading. Tersely-worded slides make even more sense when balancing two languages. Avoid idioms or other non-literal phrasings that are unlikely to translate well.

    My slides feature a mix of my high-quality stick figure illustrations and some small groups of bullet points.
  • Have a native or fluent speaker check your translations. Hopefully you can find someone with a technical background, but if not, even just simple grammar and spelling checks are useful.

But how do you find the accepted translations for technical terms? Often, even in somewhat closely related languages, the accepted translation for a term may not be the literal translation. For example, the Dutch term for “peanut butter” is pindakaas, which literally translates to “peanut cheese.”

Finding accepted translations for uncommon technical jargon can require some digging. I was writing about zero trust networks, Kubernetes, and Istio, so I did a lot of googling and ended up using a mix of the following sites and methods:

  • While the Istio docs only come in English and Chinese, the official Kubernetes documentation comes in many translations, although not all pages are translated for all languages. Check the docs for the technology you’re covering to see if there are translations. You don’t even need to be able to comprehend everything, only to pick out the phrasings used for the concepts you want to cover.
  • If the official docs don’t cover your language, try finding hits on documentation sites of large, multi-national companies which may use or leverage the tech in question. One of my page hits for Istio was on the IBM Cloud site. They have a language pull-down menu in the page footer, so I switched to French and got some useful jargon translations there.
  • Modify your Google search settings to return pages in the language you need. This won’t be immediately useful unless you also disable English, because most page hits will likely be in English. However, once you start using the translated terms you’ve been able to find, you will start getting hits in the second language.
  • Once you start finding the key terms you need, you may want to double-check that they are the most commonly used by googling those and making sure you get a good number of legitimate hits back.
  • Reverso is not a technical site, but they have a huge database of examples in actual texts, so you may be able to find localized translations for some terms you need there.
  • And of course there’s Google Translate, but even for the most popular languages, its translations for all but the simplest phrases still feel unnatural if not plain wrong.

So, that’s it! Alors, c’est tout !

Cassandra, or How Ignorance is Bliss Right Up Until the Greek Armies Pop Out of that Wooden Horse

In Greek mythology, Cassandra is the daughter of King Priam and Queen Hecuba of Troy. Considered a great beauty, she was pursued by the god Apollo, who granted her the gift of prophecy in order to woo her. When she rebuffed his advances, he turned the gift into a curse — although her prophecies were always true, no one would ever believe her warnings.

When her brother Paris brought the married queen Helen to Troy as his lover, Cassandra warned that the pair would doom the city. Everyone called her mad, even when she told them that the Trojan Horse was not a parting gift from the retreating Greek armies, but rather the means to trick the Trojans into opening their gates to the enemy. She was forced to watch helplessly when the Greek soldiers hiding in the wooden horse finally destroyed the city from the inside.


Aside from the incredible and ongoing human suffering and loss of life, watching this pandemic play out from the safety of my couch, I spent a while trying to break down the complex reactions I was experiencing. Whenever epidemiologists and public health officials try to warn about what may come and, more importantly, how to prevent COVID-19’s spread from turning into a tidal wave, and they are met with straw man arguments, jingoistic retorts that wouldn’t hold up to elementary critical thinking, and sometimes even what seems to be a complete and utter disregard for (mostly poor and/or non-white and/or elderly) human life, I have a borderline physical reaction.

A lot of that reaction is the manifestation of grief at the losses, fear that they may still hit someone close to me, and outrage that too many people are either not getting current, accurate information or they just refuse to listen. But some of that physical tension is almost a form of muscle memory, formed from years of experience trying to prevent predictable problems and outages and all-too-often meeting with limited success. I’ve only managed systems which, if they fail, will not put anyone in real, physical danger. I can only begin to imagine the frustration felt by epidemiologists, virologists, doctors, and other public health officials who are trying to save people’s lives.


In SRE and devops roles, I’ve had more than one Cassandra moment. “Um, if you do [or don’t do, as the case may be] that, there’s a non-zero chance that something 100% bad is going to happen.”

One of the most personally memorable of these actually involved Cassandra, as in the NoSQL database. A Cassandra ring was becoming increasingly wobbly. The JVMs on the C* nodes were spending an increasingly large amount of time performing garbage collection, but the cluster’s transaction rates were not growing at a comparable rate and the size of the individual data rows stayed constant. Basically, the use case and schema design for the cluster did not even begin to map to Cassandra best practices.

Venn Diagram: Ideal Cassandra data model vs this cluster. “And never the twain shall meet.”

In order to extract a small data set, the nodes ended up having to read increasingly large tables into memory. The dev who had inherited the application that used the DB and I warned that, unless we took the cluster down to compact and truncate runaway tables, eventually and in the very near future, the table size would outgrow the heap, trapping the JVM in an infinite GC loop. No data would go in or come out of that cluster once that happened. The Cassandra ring would look more like a black hole at that point.

Except, no, we could not take a planned production downtime. It was not a good time for that.

A week or so later, we took an unplanned production downtime on a weekday morning (peak use time). That Cassandra cluster had gone into, wait for it… an infinite GC loop.

For certain production outages in certain industries, it is most definitely not better to beg for forgiveness than to ask for permission.
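For what it’s worth, this particular failure mode telegraphs itself well before the cluster becomes a black hole. Here is a minimal sketch, assuming the nodes log safepoint pauses (for example, via the JVM’s -XX:+PrintGCApplicationStoppedTime option) to a GC log whose path and alert threshold I’ve made up for illustration, of flagging a node that is spending too much of its time paused:

```python
#!/usr/bin/env python3
"""Rough GC-pressure check for a JVM-based node such as Cassandra.

Assumes pause lines in the form the JVM emits with
-XX:+PrintGCApplicationStoppedTime, e.g.:
  Total time for which application threads were stopped: 0.1234567 seconds ...
The log path and threshold below are illustrative, not gospel.
"""
import re
import sys

PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def pause_fraction(log_path: str, window_seconds: float) -> float:
    """Very roughly: total pause time found in the log, divided by the window.

    A real check would parse timestamps and only count pauses inside the
    window; rotating the GC log on the same interval gets you most of the way.
    """
    paused = 0.0
    with open(log_path) as log:
        for line in log:
            match = PAUSE_RE.search(line)
            if match:
                paused += float(match.group(1))
    return paused / window_seconds

if __name__ == "__main__":
    fraction = pause_fraction("/var/log/cassandra/gc.log", window_seconds=300.0)
    print(f"GC pause fraction over window: {fraction:.0%}")
    if fraction > 0.25:  # more than a quarter of wall-clock time spent paused
        sys.exit("WARNING: node is spending too much time in GC; page someone")
```

A crude report like this, run on a cron or fed into your monitoring, turns “trust me, it’s going to melt” into a graph that management can watch climb.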


That was a time where I could not prevent an outage despite relaying credible warnings. I also have numerous examples of issues I prevented without anyone really being aware that there had ever been a danger, because I fixed a problem or added preventive safeguards before it ever spawned an issue that ended up on other people’s radar. If a risk does not produce the worst-case outcome, it’s often not because the worst-case outcome had been exaggerated from the beginning. If I don’t oversleep in the morning because my alarm woke me up on time, the risk that I could have overslept does not retroactively disappear. I did not actually oversleep because I recognized the risk and I acted to prevent it by… setting my alarm. Even if I had gone to bed 12 hours before I needed to wake up in the morning, I would still set the damn alarm.

Granted, many of the preventive measures that we as a society needed to and still need to take in order to fight COVID-19 are not as free of negative consequences as setting an alarm clock. But the potential loss from not adopting those measures is also exponentially greater than missing a morning meeting.

So seeing the frustration of people who are motivated by a desire to prevent needless death and suffering as they can only continue to appeal to reason and simple human decency genuinely triggers me. Maybe I’m projecting here, but it seems to me that while maybe some of those individuals may have separate agendas, the vast majority of the scientists, healthcare workers, and other public policymakers

  • have a lot more expertise than naysayers in the Dunning-Kruger peanut gallery
  • have been and continue to warn about possible if not probable catastrophic outcomes, not in order to make themselves seem important but because the risk still exists
  • are not “choosing” to prioritize one important area over another, because the two are actually integrally linked and symbiotic and it would be great if everyone recognized that and acted accordingly
  • would probably really rather not be “right” about their predictions, especially given the difficulty in driving unpopular preventive measures, but wishing doesn’t make it so
  • do not change their predictions as “cover” for having overstated the possible impact but rather because they keep recalculating with new information, including feedback based on current interventions
  • have no “agenda” other than fixing shit that’s broken and preventing dire outcomes
  • are totally waiting for you to shout “why didn’t you warn us?” when the worst actually comes to pass after you ignored or undermined them
  • are basically just doing their job because it’s their job and that’s what they’re supposed to do

Which sounds a hell of a lot like the experience of being an SRE in an engineering org that doesn’t prioritize intelligent and sustainable engineering risk management with a collaborative rather than combative or at least counterproductive dynamic.

Basically, it sucks to know your job, to try to do your job, and not only to be blocked from doing that job effectively but to be demonized in the process. And all I do is try to keep software from falling over. It hurts to think about being in the same basic situation if I were trying to save human lives.

Oh, P.S. Wear a goddamn mask, people.

Shared Irresponsibility Model

In the Grimm brothers’ fairy tale “The Mouse, the Bird, and the Sausage,” the titular characters (including, yes, an anthropomorphic sausage, because why not) all live together happily. Each has their own specific household task: the bird gathers wood in the forest, the mouse brings in water, and cooking falls to the sausage, because why not. Apparently the trio have no utensils, so the sausage rolls around in the pot to stir the cooking food, because why not.

One day, some neighborhood frenemy birds shame the bird into going home and going on strike until they switch the jobs around. The bird ends up in charge of water, the mouse does the cooking, and the sausage has to go collect wood, because why not.

The very next day, the sausage fails to return with firewood, and the bird goes out to the forest to search. The bird meets a dog and asks after the sausage. The dog (who, in reality, saw the sausage, said “Sausage!” and then ate the sausage) claims the sausage was carrying forged letters, because why not, so the dog was forced to execute it. The bird can’t argue with that, shrugs, and goes home to tell the mouse.

While the bird goes to collect firewood, the mouse tries to cook for the first time, rolling around in the pot to stir, and burns to death. The bird returns home and, not knowing what had happened to the mouse, panics and drops the wood. The house catches fire, so the bird runs to the well to get water, falls in, and with no roommates to help, drowns.

The end.

Cloud providers like Amazon Web Services and Google Cloud operate under a shared responsibility model for security. The provider takes responsibility for the security of their infrastructure and software, leaving their customers to take responsibility for running secure applications and using the provider’s controls for locking down their piece of the virtual environment.

Sure, that sounds fair. Except why have leaky S3 buckets in AWS been exposing sensitive data on a regular basis for years now?

My take:

  • S3 has been around forever (since 2006). It was the first cloud service that Amazon launched and it’s still one of the most popular. That, along with AWS’s market share, which is almost equal to all other cloud providers combined, means there are a crapload of S3 buckets out there.
  • An early popular use of S3 buckets was to create static websites. This meant that content had to be publicly available. As a side effect, every bucket automatically gets a public DNS entry. Savvy users will at least use obfuscated names, but most people end up defaulting to human- (and hacker-) readable names like big-corporation-customer-database.
  • The overlapping options for setting bucket access control create confusion. Users can manage access through IAM, the AWS Identity and Access Management service, which is the standard method for defining access control across all AWS services. But they can also use bucket policies, which bypass IAM, and per-object ACLs. Every one of these methods has its own learning curve. (See the sketch just after this list.)
  • Until just a few years ago, AWS accounts had a hard limit of 100 S3 buckets. While that may sound like a lot, especially these days when large companies often have 100 different AWS accounts, which is also a recent phenomenon, companies with a large number of diverse teams and applications often hit that limit hard. This situation resulted in “sharing” buckets for multiple applications, meaning the data in these buckets may have ended up needing different access requirements. So, you open up the bucket and then rely on those annoying object ACLs and hope everyone who writes to the bucket puts the correct permissions on the object.
  • AWS, and this is something all major cloud providers are guilty of, at least to some extent, defaults to creating resources for customers with no security controls. You are responsible for locking that shit down, whether it’s your S3 bucket, your VPC network firewall, or whatever else. They want you to use their products, so they make it easy to get started, because (almost) no one wants to figure out how to enable access for a locked-down service before they can start running their insecure application in the cloud.
  • I can’t even tell you how often I’ve seen people, faced with connecting all the pieces of their application together, struggle with the principle of least privilege and with managing fine-grained access. When they finally hit their frustration limit and just want everything to work, all too often they simply remove all the protections.
  • A lot of people seem to assume that AWS magically secures their buffer overflowing, cross-site scripting, plain-text authenticating application.
  • And last, but not least, most people just don’t want to Read the Fucking Manual.
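To make that overlap concrete (as promised above), here is a minimal boto3 sketch, with a hypothetical bucket name, of three different places the same object’s exposure can be decided. This is an illustration of the confusion, not a recommendation.

```python
"""Three overlapping places an S3 object's visibility gets decided.

The bucket name is hypothetical, and this sketch exists to show the overlap.
Do not actually run it against anything you love.
"""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "big-corporation-customer-database"  # hypothetical, per the joke above

# 1. A bucket policy that opens every object in the bucket to the world...
public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{BUCKET}/*",
    }],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(public_read_policy))

# 2. ...or a canned ACL applied to a single object at upload time...
s3.put_object(Bucket=BUCKET, Key="dump.csv", Body=b"...", ACL="public-read")

# 3. ...while the IAM policies attached to your users and roles may say
#    something else entirely. All of these get evaluated together, and each
#    has its own syntax, console page, and learning curve.
```

Whichever mechanism an engineer forgets about is usually the one that leaks.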

Most of this is actually AWS’ fault.

  • Like I mentioned above, most resources get created with wide-open access. AWS doesn’t want to scare away potential customers by making them learn how to modify the security settings before they can start paying AWS lots of money.
  • As I also mentioned, the controls are very often confusing.
  • It is too damned easy to get everything set up just right, just to have a careless or clueless person with admin privileges blow away all the protections. (In my experience, the people who do that tend very often to be upper management.)
  • AWS makes a show of providing additional tools and controls to prevent leaky bucket issues, but the effort comes across as disingenuous. The S3 UI now does force users to set some kind of bucket security, or explicitly opt out, during bucket creation. Great, except that’s not how all buckets get created, and again, there’s that thing about blowing away the best protections later. Oh, and you can also scan your infrastructure for open buckets, but people have to pay attention to those alerts.

One control that could genuinely help many customers do a better job of keeping private S3 contents private would be the ability to make a bucket permanently private, limiting access only to authorized IAM users. Yes, some customers will still set everything to public, in case they want to mix private and public data, and some older buckets with little active oversight will still be honeypots for data miners. But a lot of S3 customers would value the peace of mind knowing that they can create buckets without the risk of having the contents exposed to the Internet, no matter what C-level execs that still have full admin on the account do to it.
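For the record, the closest existing control is S3 Block Public Access, the feature behind that bucket-creation prompt I mentioned. Here is a hedged boto3 sketch of applying it to a single (hypothetical) bucket; the catch, and my whole point, is that it is not permanent, because anyone with sufficient privileges can remove the configuration later.

```python
"""Bucket-level Block Public Access via boto3 (bucket name is hypothetical).

This is the closest existing knob to "permanently private," but it is not
permanent: an admin can strip the configuration back off at any time.
"""
import boto3

s3 = boto3.client("s3")

s3.put_public_access_block(
    Bucket="definitely-not-customer-pii",  # hypothetical
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,        # reject new public bucket/object ACLs
        "IgnorePublicAcls": True,       # treat any existing public ACLs as private
        "BlockPublicPolicy": True,      # reject bucket policies that grant public access
        "RestrictPublicBuckets": True,  # limit an existing public policy to in-account access
    },
)
```

Applying the same settings at the account level helps too, but again, nothing stops an admin from quietly undoing it.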

Maybe it sounds like I’m picking on AWS (which I am), but all the major cloud providers (probably all cloud providers) exhibit the same behaviors, at least to some degree. AWS’ own page on their shared responsibility model for security lays out what they handle and, with a lot more verbiage, what the customer needs to do. But the areas of responsibility which fall to AWS have one incredibly important omission: they do not take responsibility for supplying their customers with all the controls needed to secure their cloud usage.

In some cases, AWS does offer adequate controls. Their move from EC2 instances that simply shared the same network space with each other, regardless of customer, to per-customer Virtual Private Clouds added a lot of new protections and tools for users to secure their EC2 instances. Well, it did for the customers who actually use them. But particularly in the case of their managed offerings of open-source software, AWS frequently disallows certain standard protections to ease their own management of the service. A few examples:

  • For years, the AWS Elasticache for Redis service disabled the option for password-protection of the Redis endpoint, which was an option for open-source Redis. (AWS only added it this past October.)
  • EFS (Elastic File System), AWS’s NFS (Network File System, or, my name for it, Nasty Fucking Shit) implementation, didn’t support root squashing, which maps client users with UID 0, aka the root user, to an unprivileged, anonymous UID. Users can enable it on the client EC2 instance mounting the volume, but that’s not an adequate protection, particularly if the instance gets compromised. AWS did finally add a way to simulate root squashing just last month (January 2020), but it requires a fair amount of IAM configuration, because why not. For years, though, users were just out of luck.
  • Amazon’s managed Kubernetes service, EKS… I can’t even begin to enumerate the ways this platform is a security nightmare. Well, the pending content I’ve written for my employer’s website is the main reason I can’t tell you. I will give you a teaser, though: the AWS IAM user or role who creates the cluster has permanent authentication access to the cluster’s Kubernetes API service, and by default has full cluster admin access in the default Kubernetes RBAC configuration. (The official EKS docs somehow omit that “permanent” bit.)

So, there you go. I bashed on AWS a lot, but at least in part, that’s because I just have the most experience with it and its… um, quirks. Google Cloud has its own issues, trust me, not the least of which is the lack of any granularity in its IAM permission sets. And I’ve mostly been able to give Azure a wide berth, but in AKS, the only service I’ve spent any time with, they install the notorious Kubernetes dashboard in every cluster, with zero authentication required within the cluster and no way to opt out. I’m more than willing to bet this highly questionable “feature” represents the tip of a Titanic-sized iceberg.

So what can customers do? Complain. Shop around. Do careful research about services and their security options before using them. Don’t settle.

And please, for the love of all that is holy, if you somehow have any of my personal data, don’t use EKS.

Engineering Quality of the Anti-Road Runner Traps of Wile E. Coyote, Esq.

At one point a number of years ago, when I was working at Yahoo!, my team somehow got into an IRC discussion about why Wile E. Coyote always failed to catch the Road Runner.  Was Coyote just a bad engineer or was ACME Co.’s quality assurance just non-existent?  (Of course, the latter option begs the question, if ACME goods were so shoddy, how smart was Coyote to keep using them?)

At the time, I firmly came down on the side of Coyote being a crappy engineer.  Not long after, I realized that one of the Looney Tunes DVD collections I had contained an entire disc dedicated to golden-era Coyote v. Road-Runner cartoons, so it seemed like the perfect opportunity to do some research.  I watched all those cartoons, taking notes and tallying the root cause of the failure of each of Coyote’s traps.  I posted it on Google+ and then… forgot about it.  Until after Google shut Google+ down, taking the content with it.  (Dammit, Google!)

If you’re familiar with this blog’s format, that story does, in fact, serve as my standard horror story opener for this post.  I think at this point Google’s penchant for taking beloved services out back and putting a bullet in their head (I still remember you, Reader!) has reached legendary status, making it a perfect opener.

That said, I decided I should recreate the original post.  I will be watching the same cartoons on the same disc.  I will spare you the gory details of how I hadn’t used my DVD player in at least a year, so I spent almost half an hour looking in and behind and under the sofa for its remote control.  In the process, I managed to dislodge a lot of detritus and realized I really need to vacuum the couch, although I did make the cats happy with all the toys that came out from under it.  (No, I really was sparing you the gory details there.)  Finally, I thought to look for the remote control in the unlikeliest of places: next to the DVD player.  How the remote got there, I have no earthly idea, but there it was, and after swapping in some fresh batteries, I was good to go.

For reference, all these cartoons were directed by the legendary Chuck Jones.  Also, I’ll only give full descriptions for the first few cartoons and simply show tallies for the rest, both because otherwise a 5-minute cartoon ends up taking half an hour to write up, and because I did the exact same thing last time.

So, without further ado:

  • Beep, Beep (1952)
    • Traps
      • Trap 1: Boxing glove affixed to large, compressed spring, which was fastened to a boulder.  Coyote hides behind the boulder, ready to release the spring and knock out the passing Road Runner
        • Result: When releasing the spring, the glove remained stationary while the boulder was propelled backward, smashing Coyote into a rocky overhang
        • RCA: Cartoon physics.
      • Trap 2: Coyote carries an anvil across a tightwire between two cliffs lining the road, preparing to drop it on Road Runner
        • Result: Instead of being a taut tightwire, the wire stretched down to the ground when Coyote got to the middle with the anvil.  The recoil then shot him up into the air.
        • RCA: Unknown.  We don’t know if the ACME product was mislabeled or defective, or if Coyote did not properly research the material’s properties (aka, failure to do load testing)
      • Trap 3: Coyote rigs a free glass of water to light a barrel of TNT when picked up.
        • Result: Apparently road runners can’t read and don’t drink water.  They can write signs communicating as much, however.
        • RCA: We’ll chalk this one up to unexpected user behavior.
      • Trap 4: Coyote positions himself on catapult consisting of two hinged boards with a spring compressed between them, held ready by a restraining rope.
        • Result: When he cuts the rope with a knife, the catapult slams him into the pavement
        • RCA: Engineering failure.  Seriously.
      • Trap 5: Coyote uses belts to attach himself to a rocket mounted horizontally on four wheels, with bicycle handlebars affixed to the rocket.
        • Result: Upon lighting of its fuse, the rocket still manages to launch itself into the sky, taking the Coyote with it, before exploding into fireworks.
        • RCA: Another tough call.  We’ll put this one under “unknown.”
      • Trap 6: Wearing ACME brand Rocket Powered Roller Skates, Coyote attempts to catch Road Runner at his own speed.
        • Result: It seems Coyote can’t actually roller skate.  Also, the skates don’t appear to have brakes.
        • RCA: I’m calling this an engineering failure.  I see no evidence that Coyote attempted to RTFM.
      • Trap 3 revisited: Once the rocket-powered roller skates run out of fuel, after having knocked Coyote around mightily, he’s left in front of the water-glass-triggered-TNT trap.  And he takes the water glass.  And the TNT goes off.
        • Revised result: Engineering success!  I’ll give Coyote credit, even though leaving the trap out does also have a whiff of technical debt.
      • Trap 7: Coyote builds a fake railroad crossing with a short length of track, obscuring the ends with bushes.
        • Result: Road Runner races over the track, knocking Coyote onto it, leaving him dazed.  Then a train comes.
        • RCA: Cartoon physics strikes again.
    • RCA Tally
      • Engineering failures: 2
      • ACME failures: 0
      • Unknown: 2
      • Cartoon physics: 2
      • Engineering success: 1
  • Going! Going! Gosh! (1952)
    • Traps
      • Trap 1: Coyote ties a stick of dynamite to an arrow.  As Road Runner approaches, Coyote lights the fuse and draws the arrow.
        • Result: Coyote releases the bow instead of the arrow.  Dynamite explodes.
        • RCA: Engineering failure.  Pretty sure he tested his shooting skills on production data (aka lit dynamite).
      • Trap 2: Coyote sets up an over-sized slingshot in the ground, using himself (with knife and fork and bib) as the projectile.
        • Result: As he walks back to get maximum tension, the slingshot gets pulled from the ground and pins him to a large (pink) cactus.
        • RCA: Engineering failure.
      • Trap 3: Coyote covers a section of road with a layer of quick-drying cement (not explicitly labeled as ACME, btw)
        • Result: When Road Runner plows through the cement at speed, he splashes it onto the Coyote.  Who then gets stuck with the sizable grimace on his face when the cement dries.
        • RCA: We’ll be fair and call this unknown.  While the trap probably would not have succeeded, it only back-fired because of bad luck on timing.
      • Trap 4: Coyote hides under a man-hole cover in the road, armed with a grenade.
        • Result: When he hears the Road Runner “meep meep,” Coyote pulls the pin and prepares to throw.  Except the Road Runner was on a road above his and knocked a large boulder onto the manhole cover, trapping Coyote with the live grenade.
        • RCA: We’ll call this engineering.  Verify that user input.
      • Trap 5: Coyote dresses as a hitchhiking blond with high heels and a short skirt.
        • Result: Road Runner rushes past, knocking Coyote down, then returns with a sign explaining, “I already have a date.”
        • RCA: We’ll call this one blameless.
      • Trap 6: Coyote places a canvas painted with a trompe l’oeil to disguise the dead-end of a road that ends with a cliff.
        • Result: Road Runner continues into the painting’s road.  Coyote, dumb-founded, stands in front, scratching his head.  Then a truck comes out of the painting road, flattening Coyote.  Coyote, furious, decides to run into the painting to catch up with Road Runner.  Instead, he runs through the canvas, falling off the cliff.
        • RCA: Cartoon physics X 2
      • Trap 7: Coyote rolls a large boulder onto a descending mountain pass road.
        • Result: Road Runner, coming up the road, evades the boulder, which then somehow manages to roll up a circular overhang which launches it into the air, where it lands above Coyote’s position, surprising and then immediately flattening him.
        • RCA: Cartoon physics strikes again.  While the plan was never going to work, that boulder had air that would make professional skateboarders envious.
      • Trap 8: Using a variety of products (admittedly, only one, the “Street Cleaner’s Wagon,” was labeled as coming from ACME), Coyote creates a fan-propelled, (not hot-air) balloon vehicle, intending to use it to fly over Road Runner and drop an anvil on him.
        • Result: When the anvil is released, the vehicle shoots up into the air, at which point the string tying the balloon shut comes undone, dropping Coyote into the road.  At which point the anvil lands on his head.
        • RCA: I can’t really chalk this up to cartoon physics (Newton’s third law in action!)  Engineering failure.  The ACME Street Cleaner’s Wagon actually came out remarkably intact, incidentally.
      • Trap 9: Coyote ties one end of a length of rope to a log overhanging the road, suspended across two cliffs, tying the other end around his waist.  When he hears the familiar “meep, meep,” he swings off the cliff, holding a spear.
        • Result: He is met with a truck whose horn sounds just like a road runner.  The force of the collision sends him spinning around the suspended log, where he somehow gets tied to the log under the length of rope.
        • RCA: We’ll call this unknown.  We do have some cartoon physics, but… Road Runner was driving the truck.
      • RCA Tally
        • Engineering failures: 4
        • ACME failures: 0
        • Unknown: 2
        • Cartoon physics: 3
        • Engineering success: 0
  • Zipping Along (1953)
    • Traps
      • Trap 1: Coyote + grenade
        • Result: Attempting to remove the pin with his teeth, he instead throws the pin at Road Runner and is left with a live grenade in his mouth.
        • RCA: Engineer(ing) failure
      • Trap 2: Coyote places a couple dozen mouse traps on the road.
        • Result: Road Runner blows past the spot, sending the traps flying into Coyote’s hiding spot.
        • RCA: Engineering failure.  I have no idea how this was supposed to work.
      • Trap 3: Using an ACME Giant Kite Kit and an unlabeled bomb, Coyote attempts to become airborne, hanging onto the kite with one hand and holding the bomb in the other.
        • Result: After getting a few short-lived airborne hops while getting a running start, Coyote reaches the cliff’s edge and plummets to the ground, where the bomb explodes.
        • RCA: Engineering failure.
      • Trap 4: Coyote attempts to chop down a conveniently-located roadside tree to fall on a passing Road Runner.
        • Result: The tree was actually a telephone pole, and when it fell, it pulled its nearest neighbor down onto Coyote.
        • RCA: Engineering failure.  Coyote did not practice reasonable situational awareness.
      • Trap 5: Coyote mixes Ace brand steel shot into a box of ACME bird seed and pours the mixture into the road.  When Road Runner consumes the mixture, Coyote attempts to use a giant magnet to pull the bird to him.
        • Result: Instead of Road Runner, the magnet attracts a barrel of TNT, which promptly explodes.
        • RCA:  … cartoon physics?
      • Trap 6: Coyote reads a book on hypnotism to learn how to make a person jump off a cliff.  Coyote practices on a fly, which promptly jumps off a rock.
        • Result: When Coyote attempts to use his newfound skill on Road Runner, the bird whips out a mirror, causing Coyote to hypnotize himself.
        • RCA: Engineering success!
      • Trap 7: Coyote constructs a see-saw with a board using a rock as a fulcrum.  He then stands on one end, holding a large rock, which he throws at the other end to catapult himself into the air.
        • Result: Instead of landing on the other end of the board, the boulder lands on Coyote.
        • RCA: Engineering failure.
      • Trap 8: Coyote mounts a number of rifles to a board positioned on a cliff over the road, tying strings to each trigger.
        • Result: When Coyote pulls the strings, the guns all shoot him.
        • RCA: Engineering failure.  He was standing right next to Road Runner when he pulled the strings.
      • Trap 9: Coyote prepares to cut the end of a rope bridge when Road Runner crosses it.
        • Result: Instead of the rope bridge, the cliff Coyote is standing on plummets to the… well, plummets somewhere
        • RCA: Cartoon physics.
      • Trap 10: Coyote loads himself in a Human Cannonball cannon after lighting the fuse.
        • Result: The cannon is propelled backward, while Coyote hangs, singed, in midair.  At least he was wearing a helmet.
        • RCA: Cartoon physics.
      • Trap 11: Coyote stands on one board suspended between two cliffs, holding a giant wrecking ball that is tied by rope to another board some distance down the road.  When Road Runner approaches, Coyote releases the ball.
        • Result: Road Runner stops short of the ball’s reach.  The ball then loops over its anchoring board, arcs through the air without losing any tension in the rope, and smacks Coyote from above.
        • RCA: While we do have some cartoon physics at play, the ball was going to swing back close enough to its starting point either way.  Engineering failure.
      • Trap 12: Coyote constructs a fake storefront across a narrow mountain pass, offering free birdseed “inside.”  Behind it, he has set up a large amount of TNT, all connected to an ACME detonator, placed so the plunger should theoretically trigger when the door is opened.  Coyote then climbs over the facade and hides nearby.
        • Result: A truck comes barreling down the narrow pass, chasing Coyote back to the facade… where he opens the door.
        • RCA: Engineering success!
    • RCA tallies
      • Engineering failures: 7
      • ACME failures: 0
      • Unknown: 0
      • Cartoon physics: 3
      • Engineering success: 2
  • Stop! Look! And Hasten! (1954)
    • Engineering failures: 5
    • ACME failures: 0
    • Unknown: 0
    • Cartoon physics: 1
    • Engineering success: 2 (The Burmese tiger trap did, in fact, catch a Burmese tiger.  And those ACME Triple-Strength Fortified Leg Vitamins worked like a charm!)
  • Ready, Set, Zoom! (1955)
    • Engineering failures: 6
    • ACME failures: 0
    • Unknown: 1
    • Cartoon physics: 1
    • Engineering success:  1
  • Guided Muscle (1955)
    • Engineering failures: 7
    • ACME failures: 0
    • Unknown: 1
    • Cartoon physics: 2
    • Engineering success: 0
  • Gee Whiz-z-z-z-z-z-z (1956)
    • Engineering failures: 4.5
    • ACME failures: .5 (While that ACME Triple-Strength Battleship Steel Armor Plate proved extremely flimsy, Coyote clearly had no idea how momentum works, so I’ll count that as .5.)
    • Unknown: 2 (That exploding bullet did not explode reliably, but we don’t know the brand, in addition to a second unbranded product failure.)
    • Cartoon physics: 1
    • Engineering success: 0
  • There They Go-Go-Go! (1956)
    • Engineering failures: 6 (Granted, one of these started as a cartoon physics issue, but I’ve seen firsthand small failures turn into massive outages because people who apparently didn’t understand the systems as well as they thought they did just started turning knobs.  “I got this bottle of water out of Karen’s freezer.  I’ll just throw it on the fire to put it out.”  Now we have two problems: the clear liquid which, mysteriously, was still liquid at 32°F, has actually seemed to accelerate the fire, and now Karen is out of vodka.)
    • ACME failures: 0
    • Unknown: 1 (Coyote clearly had a defective rocket, but again, it was not branded.)
    • Cartoon physics: 1
    • Engineering success: 0
  • Scrambled Aches (1957)
    • Engineering failures: 6
    • ACME failures: 0
    • Unknown: 0
    • Cartoon physics: 3
    • Engineering success: 0
  • Zoom and Bored (1957)
    • Engineering failures: 6
    • ACME failures: 0
    • Unknown: 1
    • Cartoon physics: 1
    • Engineering success: 0
  • Whoa, Be-Gone! (1958)
    • Engineering failures: 6
    • ACME failures: 0 (Those ACME Instant Tornado Seeds worked like a charm! Tech debt: not putting the lid back on.)
    • Unknown: 1 (unbranded product failure)
    • Cartoon physics: 1
    • Engineering success: 0
            Eng Failures   ACME Failures   Unknown Cause   Cartoon Physics   Eng Successes   Total Traps
  Totals    59.5           0.5             11              19                3                93
  %         64             0.5             11.8            20.4              3.2

In sum, I’d say that not only is Coyote, well, a lousy engineer, but that ACME products, at least during this era, often worked a little too well.

A few notes that may or may not be worth reading:

  • Everything can be cast as a DevOps parable.
  • I worked for a time for Warner Bros Animation.  I may or may not have a film credit on Osmosis Jones.  (I can’t bring myself to watch it again to find out. I only saw a rough cut while it was still in production.)  I definitely had to fix Brad Bird’s email once while he was directing The Iron Giant, though.  Netscape Navigator, RIP.
  • My father started as a mechanical engineer and was, probably not coincidentally, also a Road Runner/Wile E Coyote fan.  I had bought this Looney Tunes collection to watch with him while he was ill, but… I never got the chance to watch it with him.  It sat on my shelf, still shrink-wrapped, for a couple years, until I finally had a reason to watch these cartoons.
  • The spirit (and (lack of) engineering discipline and basic common sense) lives on in, well, most of the Silicon Valley start-ups I’ve worked for, to some degree or another.
  • Seriously, leave a cat toy under the couch for a few months, and when you finally fish it out, it’s brand-new to the cats who knocked it there in the first place.
  • I still have no idea how the DVD player remote ended up next to the DVD player.

The Old Lion and the Ford

In Aesop’s fable, “The Old Lion and the Fox,” a fox passing by a cave hears a voice calling out. The voice claims to be a sick, old lion too infirm to leave the cave, and asks the fox to come into the cave to help ease his loneliness. The fox, noting that the only tracks to be seen went into the cave but none came out, wisely refuses. The moral of the story is to learn from the failures of others instead of repeating their mistakes.

Perhaps oddly, I see a corollary to this story in one of my father’s jokes. My father was born and raised in Georgia (the state, not the country), and thus had a large collection of redneck jokes which generally made me cringe. The only one I can remember, though, goes like this:

One man gets a ride from another in his pickup. As they’re driving along, they come up to an intersection with a red light. The driver flies through the intersection.

The passenger says, “Hey, you just ran that red light!”

The driver replies, “Naw, it’s ok. My brother does it all the time.”

They come to another intersection, and again, the pickup blows through the red light.

“Hey, man, that light was red, too! That’s really dangerous!”

“Naw, I told you, my brother drives through red lights all the time and he’s fine.”

At the third intersection, they have a green light. The pickup stops.

The passenger: “Hey, man, what’s wrong with you? Why are you stopping for a green light?”

The driver: “My brother might be coming the other way!”

This blog post, which I have re-read and shared many times over the years, talks about the ubiquity in Silicon Valley of a sociological phenomenon called the “normalization of deviance.” The term was coined by Diane Vaughan to explain the gradual, unconscious acceptance of dangerous engineering practices and beliefs at NASA and its contractors that led to the preventable explosion of the Space Shuttle Challenger.

These delusions usually start to take root the first few times someone takes a shortcut or compromises on quality. If there is no bad outcome, the risk must have been overestimated, right? This behavior quickly becomes acceptable, and then it becomes… normal. “Everyone must do it this way, right?” And when they finally hit those 10:1, 100:1, or 1,000,000:1 odds, *that* must have been a fluke, right? That can’t possibly happen again. There are good reasons why the government appointed commissions independent of NASA to investigate the agency’s major fatal disasters. In most tech startups, though, there are few if any external factors to force an objective evaluation of what went wrong. So nothing changes. Then new people come in, and particularly those who are, maybe, in their first job, or who are joining what’s considered a world-class organization, will often start thinking this must be normal, too, because surely it wouldn’t happen here otherwise. Right?

It’s very frustrating to come into an organization with these delusions. Silicon Valley is supposed to be full of smart people, right? Iconoclasts, critical thinkers, all that, right? I tend to agree with the conclusions in Dan Luu’s post, adding that the arrogance and optimism which drive this industry are actually sometimes more likely to feed normalization of deviance than to defeat it. And whether or not (frequently, not) these organizations can learn from someone else’s mistakes, the resulting cultures all too often cannot learn from their own, because, oh, hey, that wasn’t a mistake. Probably some butterfly in China was flapping its wings. (I’m pretty sure there are lots of butterflies in China that will flap their wings, so, uh, if one can take down your entire website, shouldn’t you still try to build in protection against that?)

I also believe normalization of deviance goes a long way to explain why so many attempts to adopt that whole “devops” thing fail. First, there is the very common mistake of thinking devops == Agile and you just need to put your code in git, throw up a Jenkins server, and change someone’s title from sysadmin/syseng/prodeng/whatever to “devops engineer.” Even when an organization is savvy enough to realize that, hey, this devops thing requires some cultural change, it’s one thing to create new, good habits. What’s often neglected is realizing that you also need to break old, bad habits, something that becomes that much more unlikely if no one can actually recognize these habits as toxic anymore.

I don’t have any obvious solutions. But maybe as a starting place, every time someone makes a compromise in the name of start-up necessity, write it on a Post-It note and put it on a Wall of Deviance in the office. And, just to drive the lesson home, as you stick one on the wall, repeat the mantra, “This is not normal. This is not normal.”

Night on Balderdash Mountain

In the Disney animated classic Fantasia, we were meant to be the most terrified by the segment featuring Mussorgsky’s “Night on Bald Mountain,” with its endless stream of dancing spirits risen from their graves. Balderdash. The most terrifying story was earlier in the film. We were fooled because it starred the plucky Mickey Mouse as The Sorcerer’s Apprentice.

What was so terrifying about a bunch of dancing brooms and buckets? What wasn’t terrifying? When we meet him, Apprentice Mickey is stuck with a neverending rotation of menial, manual tasks. Sweep, sweep, sweep. Fetch, fetch, fetch. So Mickey does what any good engineer with a devops inclination would do: he tries to automate. However, also like any novice engineer without a strong mentor to peer review his pull requests, Mickey creates a broom fork bomb, his marching brooms flooding the tower. (Furthering the broken culture where he works, his boss, aka The Sorcerer, then gets angry at him. And even then, he probably still doesn’t help mentor Mickey, let alone pay down any of their technical debt.)

One of the hallmarks of a good, strong devops culture is the constant drive toward more and better automation of what were, to varying degrees, manual processes. Performing releases by hand. Manually invoking scripts when cued by an alert. Configuring software or infrastructure via graphical user interfaces. An ever-growing number of tools and services aimed at devops engineers and organizations are driving and driven by the empowerment of automation and the automation of empowerment. And, like all software, and also like many companies saying that they are “now doing DevOps,” some of these offerings deliver on the devops promise better than others.

Maybe you’ve heard the term “code smell,” referring to what may seem like small, cosmetic flaws or questionable decisions in source code, but which are usually indicative of deeper, more serious issues, kind of like traces of digital comorbidity. Similar watermarks can often be found in services that are marketed to devops teams. Some real-world examples:

  • the SaaS is only configurable via a graphical UI
  • the Linux server-side agent is written in JavaScript (you want me to run what on my production servers?)
  • the (rapidly mutating) version numbers of the (again, you’re asking me to run what on my production servers?) agent are hardcoded in the agent’s configuration file
  • the public API makes backward-incompatible changes on a very frequent cadence

Cads Oakley suggests the term “operational whiffs” to cover the analogous markers in devops tools, because it’s a thing. While symptoms like the ones listed above may not always cause issues, they can be indicative of much deeper issues in the producers of the tools: a lack of true understanding of or empathy for devops practitioners. They can create workflows that are not intuitive or native to devops practice; unnecessary pain at upgrade time because of constant backwards-incompatible changes that also require updating any internal automation; and, at worst, a lack of respect for their customer’s time. All of this amounts to creating work and friction for the devops teams the tools are supposed to be helping.

How can software companies trying to sell to devops teams avoid these issues, and create devops-native tooling?

  • Dogfood your tool or service internally. Run it the way a customer would and note any pain points that require human intervention
  • Listen to customer feedback, not just the direct comments, but also the soft signals. “Well, after I edited that file and did this other stuff, it started to work…” And make sure those messages pass from your Customer Success team to the Product and Engineering teams as actionable items, if possible. (For every one of those comments you hear, there are probably a thousand being said under someone’s breath, bookended by expletives.)
  • Hire experienced devops engineers, and more importantly, listen to them

A product can have all the required bullet-point marketing features for its class in the devops tool ecosystem, but if it’s a pain in the ass to configure and maintain, well, then it doesn’t really have all the required features. Eventually, those are the tools most likely to get optimized out in the sweep of iterative improvement of devops momentum.

Don’t Fear the Beeper

In one of Aesop’s fables, a shepherd boy gets his jollies by raising a false alarm that a wolf is threatening the herd, tricking the villagers into running out to offer protection only to find they’d been had.  After the boy has pulled his prank a few times, a wolf really does come to threaten the sheep, but by now the villagers no longer believe the boy’s cries of “Wolf!” and the herd, and in some variants the boy, are consumed by the predator.

Any engineer who has spent any amount of time in an on-call rotation that invariably bombards them with constant alerts, particularly when a sizable number of those are either invalid or not really critical, has probably also been tempted to risk letting a wolf eat the bratty shepherd boy.   This “pager fatigue” gets compounded when the same issues arise time after time and never get fixed at the root cause level, and the engineer feels powerless to improve the situation over time.  When the on-call rotation is 24 hours, with alerts coming any time of day or night, the physical effects of sleep interruption and deprivation can quickly compound the psychological issues.

I’ve had more than my fair share of on-call rotations from hell over the years.  What would invariably make an already awful situation even worse was the knowledge that it was not likely to get any better.  When the software developers were well-removed from the bugs in their code that were making the service brittle and unreliable, when management would not take the time to prioritize those fixes, or, worse, thought a week of no sleep every month or two was “normal” for an operations engineer and part of the job description (not, of course, the job description they actually posted to hire you), the situation starts to seem hopeless.  MTTA (Mean Time To Acknowledge) rises as people become too demoralized or physically tired to jump on every page.

Those deeply-unempathetic upper management types who still believe that, well, this is just part of the job are missing the business costs of this effect.  Time spent responding over and over to the same issues that never get fixed at the root cause really adds up.  Site uptime and service-level agreements also get put at risk, which can mean lost revenue or financial penalties owed to customers.  And one of the well-known effects of sleep deprivation (which, after all, is used as a form of torture for a reason) is that it greatly increases the risk of accidents.  Do you really want an engineer who can barely keep their eyes open trying to fix a complex, broken system, when there’s a very non-negligible chance they will just make it worse?

I’ve also personally witnessed in more than one organization a form of “learned helplessness,” where engineers, feeling disempowered and just plain tired, stop pushing for fixes and don’t even take care of simple root-cause fixes within their ability.  Even on the off-chance that everything else about the job is great (and it’s probably not when there’s an elephant in the room everyone is ignoring, so no one is cleaning up the elephant-scale dung heaps, either), the never-ending cycle of frustration and stress becomes demoralizing.

While this may be an unwinnable war at many organizations, given the almost industry-wide, deeply-entrenched normalization of deviance that says pagers are just going to be noisy and engineers are just going to lose sleep, on-call duty should not have to be a soul-sucking experience.  I’ve just started a new job, and after one week I noticed the on-call engineer seemed to be ack’ing a lot of pages, so I knew I had to nip that in the bud.  Specifically, before I went on-call myself.

Here’s my plan of action.

  1. Technical or configuration actions
    1. Get rich metrics for paging alerts.  Unfortunately, the canned reports and analytics for the paging service we’re using leave a lot to be desired, so I will probably have to go through the API and generate the per-service, per-error metrics myself (see the first sketch after this list).  I also looked at Etsy’s opsweekly tool, but it doesn’t support our pager service either, so we’d have to write the plugin.
    2. Identify the alerts that are non-actionable and stop making them page.  If the on-call can’t fix (or temporarily ameliorate) an issue, they shouldn’t get woken up for it.  If a situation does need handling but does not pose an immediate threat, can it wait until morning?  Then wait until morning.  If it’s still important that people be aware of a situation, send it to Slack or something (see the second sketch after this list).  If it’s a bug that can only be fixed by a code change, which may take time, you may need to mute the alert temporarily, although that’s risky.  (Do you always remember to re-enable an alert when the underlying condition is fixed?  Really?)
    3. Prioritize fixes for the most frequent and most time-consuming alerts.   If it’s broke, fix it.  Developer time is precious, yes, and there are new features to deliver, but those same developers are the people getting paged in this on-call rotation.  We’re borrowing from Peter to pay Paul.
  2. Engineering culture and attitudes towards on call
    1. Get top-down buy-in.  Engineers generally have a natural inclination to fix what’s broken.  However, they need the time and power to do that.  All levels of management need to be cognizant of the entire on-call experience, including being willing to make short-term trade-offs in priorities when possible, with the understanding that the investment will pay off in the longer term, both in time saved and in team effectiveness and morale.  (Fortunately, I have that now.)
    2. Empower team members to own their on-call experience.  Again, as the developers are in this rotation, they are the ones who can fix issues.  They also have the direct incentive of knowing that if they fix a certain issue, it won’t wake them up the next time they’re on-call.  (That very separation of power from pain is one of the factors that has made the traditional dev vs ops silos and their associated services so dysfunctional.)  And if something can’t be fixed quickly, make sure it gets tracked and prioritized as needed for a permanent fix.
    3. Use those metrics to show improvements.  Being able to see in hard numbers that, over time, yeah, we really aren’t getting woken up as often, or interrupted by alerts we can’t fix, is both the goal and the incentive.  A noisy pager isn’t something you fix once and for all; it requires ongoing vigilance.
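
As a rough illustration of step 1.1, here’s a minimal sketch of the kind of aggregation I have in mind, assuming you’ve already pulled an incident export out of your paging service’s API and saved it as JSON.  The field names (service, summary, created_at) are placeholders and will almost certainly differ from whatever your provider actually returns.

```python
#!/usr/bin/env python3
"""Rough per-service, per-error paging metrics from an exported incident list.

Assumes incidents.json is a JSON array of objects with (hypothetical) fields:
  service     - name of the service that paged
  summary     - the alert description
  created_at  - ISO 8601 timestamp of when the incident opened
Adjust the field names to match your paging provider's API.
"""
import json
from collections import Counter
from datetime import datetime

with open("incidents.json") as f:
    incidents = json.load(f)

# Count pages per service, per distinct alert, and per hour of day.
by_service = Counter(i["service"] for i in incidents)
by_alert = Counter((i["service"], i["summary"]) for i in incidents)
by_hour = Counter(
    datetime.fromisoformat(i["created_at"].replace("Z", "+00:00")).hour
    for i in incidents
)

print("Pages per service:")
for service, count in by_service.most_common():
    print(f"  {service}: {count}")

print("\nTop 10 noisiest alerts:")
for (service, summary), count in by_alert.most_common(10):
    print(f"  [{service}] {summary}: {count}")

print("\nPages by hour of day (spot the 3AM wake-ups):")
for hour in range(24):
    print(f"  {hour:02d}:00  {'#' * by_hour[hour]}")
```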

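For step 1.2, the “needs awareness but doesn’t need a page” bucket, here’s an equally minimal sketch of routing an informational alert to a Slack channel via an incoming webhook instead of paging anyone.  The webhook body format ({"text": ...}) is Slack’s standard incoming-webhook contract; the webhook URL, the notify_slack helper, and the alert payload are all placeholders for illustration.

```python
#!/usr/bin/env python3
"""Send a non-critical alert to a Slack channel instead of paging the on-call."""
import json
import urllib.request

# Placeholder: a real Slack incoming-webhook URL goes here.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(alert: dict) -> None:
    """Post a formatted, non-paging alert message to the channel."""
    text = (
        f":warning: *{alert['service']}*: {alert['summary']}\n"
        f"Severity: {alert['severity']} (not paging; review during business hours)"
    )
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with "ok" on success

if __name__ == "__main__":
    notify_slack({
        "service": "billing-api",
        "summary": "Nightly report job took twice as long as usual",
        "severity": "low",
    })
```
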
I admit, I’ve been woken up too many times by unimportant, misdirected, or déjà-vu-for-the-millionth-time alerts.  It kills morale, breeds resentment, and has probably shortened my life because of all the sleep deprivation and stress.  I would love to help build an org where the pager is usually so quiet, engineers start to check their phones to make sure the battery didn’t die.  No one’s going to jump for joy when they go on-call, but it’s a win if they aren’t filled with dread.


Driving a Lemon

There’s a popular urban myth known as “The Killer in the Backseat.”  The most common variations of the story involve a lone driver, almost always a woman, on a deserted backroad at night.  Another vehicle, usually some sort of truck, is following her, flashing its lights, trying to pass her, or even trying to ram her.  The car’s driver naturally panics and tries to evade the truck, but ultimately either gets pushed off the road or finds a gas station to pull into.  The truck’s driver then runs up to the window to tell her there’s a threatening stranger in the backseat of her car.  She had been mistaken about the true threat all along, instead fearing the person who was trying to help her.

A more realistic automobile fear may be of buying a lemon: a car, usually used, that ends up having either a series of non-stop issues or one major issue that never quite gets fixed.  (My mother ended up with a lemon once: a Buick with a sticky gas pedal.  You would be at a stop, then gradually press the gas to start moving, and nothing would happen.  If you kept pressing, it would eventually engage and lurch forward, as if it suddenly applied all the gas it should have been ramping up to as the pedal was pushed down.  Apparently neither my father nor the mechanic could replicate the behavior, which, granted, didn’t happen all the time, but when I came to visit and borrowed the car, it happened to me, too.)

How does all this car stuff relate to devops?  Well, let’s set up the following analogy: the car is a web application or service, the mechanic (and car builders!) is the software engineering team, and the driver is the “traditional” operations engineering team.

What do I mean by “traditional” operations engineer?  Basically, when web companies became a thing, they had a clear separation between the people writing the code and the people who kept it running in production.  Much of this is because the operations side generally evolved from pre-web system administrators, the people who kept the servers running in any given company.  Except those companies, whether they were shrink-wrap software companies or government research labs or visual effects companies, rarely scaled in size and customer base at the rate of the new web businesses.  The traditional silo model didn’t translate to web applications, and in fact, it helped create and perpetuate major issues.

[Figure: The “traditional” silo model of web application development and operations.  Note the one-way arrow.]

With the silo model, developers are so isolated from the environment and reality of keeping their code operational and performant in a 24/7 web economy that they don’t get the proper feedback to help them avoid designs and assumptions that inevitably create issues.  Operations engineers, who are the ones who understand what breaks and what scales in a production environment, can, at best, only give advice and request changes after the fact, when the design is long since finished and the code is already in place and many of the developers have been assigned to a new project.  The broken app stays broken, and as traffic scales linearly or exponentially, often the team that supports the application must scale with it.  If that team is already relatively large because the service is a brittle piece of engineering riddled with technical debt, then the company is faced with either trying to hire more people, assuming it can even attract skilled engineers in this economy, or watching its uptime become an industry joke as the overworked ops people get burned out and leave.

So it is with a lemon.  Maybe the driver can do a few things to mitigate chronic issues, like using a specific kind of higher-grade oil or higher-octane gas, changing belts or filters more frequently, etc., but that can be relatively costly over the life of the car, if it works at all.  Or, as with my mother’s car, she could tell the mechanic the exact behavior, but if the mechanic is not skilled or not sympathetic, they may just ignore her.  But since the mechanic is probably the only one capable of fixing the root cause, that car and its owner are doomed to a lifetime of expensive and frustrating palliative care.

So it goes with web companies.  If the operations team only comes in after the fact to try to manage a poorly-designed, buggy, or non-scalable service, the company is going to throw money at it for the entire life of the service.  Even if the development team has the bandwidth and the desire (or a mandate) to fix bugs escalated by the operations team, and in my experience, which is hardly unique, that is not always the case, fixes won’t come easily when the major issues lie deep in the application design or its fundamental execution.

I still encounter and hear of far too many companies and “old-school” engineering higher-ups who think that an operations team that was not consulted (or didn’t exist) during the design and original coding of a service should still somehow magically be able to make any pile of code run in production.  Well, maybe, but only if the bosses are willing to hire a large enough team.  It would generally be more cost-effective, as in most things, though, to fix the root cause: the code.

Let’s take a trivial example.  Say a software developer has written some incredibly inefficient SQL queries for dealing with the backend database.  Exactly how is an operations team supposed to mitigate that on their own?  Well, maybe they could scale the database infrastructure, but that takes money, money that will almost certainly far exceed, over the lifetime of the application (probably just within a couple of days, actually), the amount of money it would take to get the developer to fix the errant SQL queries.
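
To make that concrete, here’s a hedged sketch of one classic flavor of “incredibly inefficient SQL”: the N+1 query pattern, where the application fires one query per row instead of a single JOIN.  The users/orders schema is made up for illustration; the point is that no amount of operations-side hardware makes the first version cheap at scale, while the developer-side fix is a few lines of SQL.

```python
#!/usr/bin/env python3
"""The N+1 query anti-pattern vs. the single-JOIN fix (illustrative schema)."""
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 24.50), (3, 2, 5.00);
""")

def order_totals_slow():
    """One query for the users, then one more query per user: N+1 round trips."""
    totals = {}
    for user_id, name in conn.execute("SELECT id, name FROM users"):
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals

def order_totals_fast():
    """The developer-side fix: one query, one round trip, the database does the work."""
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """)
    return dict(rows)

print(order_totals_slow())  # {'ada': 34.49, 'grace': 5.0}
print(order_totals_fast())  # same answer, one query instead of N+1
```

With a couple of rows the difference is invisible; with a few million users and a network hop per query, the slow version is exactly the kind of thing an operations team can only paper over with bigger database hardware.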

To sum up, the traditional silos create and perpetuate web services that are brittle and fiscally expensive to run, because the people designing and implementing the services rarely have practical experience of what does and does not work well in production, especially at web scale.  After-the-fact operations teams can only mitigate some of those issues and only at great cost over the life of the application.

This is a simplified overview of the cultural and organizational issues the DevOps movement and its cousin, site reliability engineering, evolved to address and prevent.  I’ll delve into it more later.
