The Nightmare Before Devops

Techniques, stories, and rants about running web services using DevOps principles


A number of cultures have the concept of a doppelgänger, a person’s double, presaging death or doom.  One might say the modern American would feel the same terror upon realizing another person was walking around not with the same face, but using the same name and Social Security Number.  With the revelation last week that the credit reporting agency Equifax was hacked and just this kind of critical personally identifiable information for 143 million people was stolen, the incidence of credit doppelgängers is likely to skyrocket.

Apparently the hackers exploited a known vulnerability in the Apache Struts web application framework.  The bug had been patched in March; Equifax said their site had been hacked starting in May, although they didn’t realize it for two months.  Had they patched the version of Apache Struts they were using in a timely manner, the hack never would have happened.  (There’s also no way to tell right now what, if any, intrusion detection and prevention systems they had in place, but they were clearly inadequate if no one noticed that the information of almost half the country’s population was being systematically stolen for two months.)
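The general lesson here, that you must track the versions of your third-party components and compare them against the first version known to fix a published vulnerability, can be sketched in a few lines.  The component names and version numbers below are purely illustrative, not a real vulnerability database:

```python
# Sketch: flag third-party components still below the minimum patched
# version for a known vulnerability. Names and versions are illustrative.

KNOWN_PATCHED = {
    # component: first version that fixes the known vulnerability
    "example-web-framework": (2, 3, 32),
    "example-xml-parser": (1, 4, 10),
}

def parse_version(text):
    """Turn '2.3.5' into a comparable tuple (2, 3, 5)."""
    return tuple(int(part) for part in text.split("."))

def vulnerable(deployed):
    """Return the components whose deployed version predates the fix."""
    return [
        name
        for name, version in deployed.items()
        if name in KNOWN_PATCHED and parse_version(version) < KNOWN_PATCHED[name]
    ]

if __name__ == "__main__":
    deployed = {"example-web-framework": "2.3.5", "example-xml-parser": "1.4.10"}
    print(vulnerable(deployed))  # the framework is still below 2.3.32
```

In practice you would feed this from your build manifest and a real advisory feed, but the point stands: the check is cheap, and not running it is how a March patch becomes a May breach.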

I will grant that web application and system security is hard.  It’s very hard.  However, when a company is collecting some of the most critically sensitive data of virtually everyone in the country, they should have a certain burden of Doing the Right Thing.  Unfortunately for American consumers, the forces that normally push companies to formulate and follow strict security practices simply don’t exist for the credit reporting agencies.  Those forces are generally either government or industry self-regulation, or the negative incentive that a company’s customers would lose trust after a breach and take their business elsewhere.

While the credit card industry requires those processing payments to follow PCI DSS, the Payment Card Industry Data Security Standard, and federal law requires any organization or company handling health records to comply with HIPAA, the Health Insurance Portability and Accountability Act, to secure data, the credit reporting industry, dominated by three companies (Equifax, TransUnion, and Experian), has no such security standard.  The industry is only barely legally regulated as it is.  While the breach has generated huge outrage, the government is not in a position to punish, let alone shut down, the company, even if the present administration were amenable.  But wait, it gets worse.

The second motivation for a company to practice rigorous site security, fear of losing customers, also doesn’t apply here.  The average citizen is not their main customer, although they do make money when individuals request to see their own credit score, use the credit monitoring service, or even freeze and unfreeze their credit.  The main customers are all the businesses, particularly banks and credit card companies, that regularly request these reports.  American consumers have no way to remove or prevent the collection of their credit information by Equifax or any credit reporting agency.  Our information is their product, but we have few legal rights over how it is collected or used, and we have few avenues to punish Equifax, other than lawsuits and pushing our lawmakers to respond, neither of which is guaranteed to succeed.

Like I said, security is very hard.  It has real overhead costs for a company, in terms of training every employee, because every employee can pose a risk, even if a given person’s sphere of risk is limited; building and maintaining the infrastructure to maintain and monitor security; software engineer time to understand and follow secure coding practices, and to respond to vulnerabilities as they become known; and specialized personnel to formulate the company’s security practices and make sure they are implemented and followed continuously, updating the requirements as new classes of threats arise.  Even if a company does follow all the best practices stringently, there is always risk, whether it’s because of a zero-day exploit in third-party software or, when it comes down to it, because people, even the most well-trained and skilled, will make mistakes.  The best security policies and practices will take measures to defend against the mistake of a single person creating great risk, but it can still happen.

Web application security has to be designed in from the beginning; you cannot bolt it on later and expect it to be adequate, or think it’s irrelevant to make your code secure if you have a web application firewall or intrusion prevention system.  And it’s not any one person’s job in an engineering team: architects need to consider what type of data may be collected or transferred and design controls on handling it; software developers need to understand how to write safe code and need to follow the security controls laid out in the design, and if they see an issue, raise it accordingly; whoever is responsible for builds needs to make sure third-party libraries and frameworks are constantly updated so hackers can’t exploit a bug that was patched months ago; infrastructure engineers need to harden systems and implement controls.  The list goes on.  Security has to be part of everyone’s job.

Yes, it’s a huge amount of work.  And too many developers and executives overestimate their security and underestimate their risk.  Cybersecurity has to be seen as one of the standard costs of doing business on the Internet, though.  While, sadly, Equifax will survive this security breach, a serious hack at a start-up can kill the company, particularly when the company’s ability to do business rests on securing critical customer data.

Winchester Mystery App

Several years ago, I took the tour of the Winchester Mystery House in San Jose.  The mansion was built by Sarah Winchester, widow of the heir of the Winchester gun fortune.  After suffering through several personal tragedies, she was said to believe that her family and fortune were haunted by the victims of Winchester rifles, and only constant building could appease them.  She spent the last 38 years of her life constantly adding on to the house without ever using an architect and with no real plan, leading to a house with doors that go nowhere or open onto rooms a floor or two down and priceless Tiffany windows set in north-facing walls which receive no direct sunlight.

(My tour group included a little girl named Olivia.  I know her name is Olivia because she kept wandering ahead and touching things she shouldn’t, resulting in her parents’ consistently ineffective pleas to stand still.  Given her likely-conditioned lack of heed, I mentally renamed her “Oblivia.”)

Unfortunately, I’ve seen far too many software projects that bear more than a few similarities to the Winchester Mystery House.  No cohesive architecture, inconsistent conventions and APIs, dead-end methods.  This tends to result in a brittle codebase which is hard to test and even harder to change safely and predictably.  Just as the rooms in Sarah Winchester’s house were connected in unpredictable ways, an application with strange, non-standard dependencies between classes/modules/libraries resists easy, safe changes.  Say you want to change your backing database: if it isn’t accessed via a consistent interface class, it’s going to be a lot more work to find all the special snowflake cases in your code and fix them before they destabilize your production service, because, let’s face it, you probably don’t have very good test coverage, either.
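To make the database point concrete, here’s a minimal sketch (the class and method names are my own invention) of the kind of single interface class that keeps a backend swap from rippling through the whole codebase:

```python
# Sketch: route all storage access through one interface so the backing
# store can be swapped in a single place. Names are illustrative.

from abc import ABC, abstractmethod

class UserStore(ABC):
    """The one interface the rest of the codebase is allowed to use."""

    @abstractmethod
    def save(self, user_id, record): ...

    @abstractmethod
    def load(self, user_id): ...

class InMemoryUserStore(UserStore):
    """Toy backend; a real one might wrap PostgreSQL, DynamoDB, etc."""

    def __init__(self):
        self._rows = {}

    def save(self, user_id, record):
        self._rows[user_id] = record

    def load(self, user_id):
        return self._rows.get(user_id)

# Application code depends only on UserStore, so changing databases
# means writing one new subclass, not hunting down snowflake call sites.
def rename_user(store: UserStore, user_id, new_name):
    record = store.load(user_id) or {}
    record["name"] = new_name
    store.save(user_id, record)
```

The payoff isn’t the abstraction itself; it’s that when the backend changes, the blast radius is one file instead of every room in the house.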

I’m sure a lot of people, especially at start-ups, think this is normal, and maybe it does happen more often than not, but normal should never be automatically conflated with “good.”  So let’s consider scenarios where this kind of application development might arise.

  • Most likely scenario: a small team needs to get an application running for proof-of-concept funding, so they just start writing code.  Developer A is writing the database interface class, but developer B, whose storage engine depends on it, can’t wait for the interface, so B just accesses the database directly.  Hey, it works, they get funding, and they’ll just fix it later, except now they need to add features so they can get paying customers, and there’s never going to be enough time.  At some point the team will agree on conventions and so forth, but the tech debt is still in the codebase, accruing interest which will likely result in a balloon payment (probably production downtime) sooner or later.
  • A lack of team discipline, usually under the mantle of “total freedom to get your job done.”  Maybe the project did actually have an architect or at least some agreements on the architecture, but the developers still ended up doing their own thing when they wanted to or just decided it was more expedient to get a feature out the door.  Usually this involves some excuse like, “I didn’t realize we already had a class for that,” or “I know we already use library X but I like library Y better,” or, my personal favorite, “I wanted to try out cool new tool Z, so what better place than in production?”  And now you’re stuck with it until someone has time to rip it all out, except remember, this is Winchester Mystery Code, so that’s a lot harder than it should be, and there’s never enough time to begin with.
  • The company or at least the intended functionality of the app “pivoted,” and rather than start clean, everyone started building on top of the existing code in a new direction.
  • The architect really had no idea what they were doing.

It could be a combination of factors, too, but the net result is the same.  By “the same,” I mean a completely unique bundle of code, but the pain of maintaining it and services that run off of it remains the same.

I also like to use Katamari Damacy as an analogy for this type of application development.  Katamari Damacy is an old Japanese console video game whose title translates to something like “clump of souls.”  The backstory has the King of All Cosmos destroying all the stars in a drunken rage, so he sends his son (the player), the Prince of All Cosmos, out with a sticky ball to roll up masses large enough to replace the stars.  As the ball picks up bulky, misshapen objects (think skyscrapers pointy side out), the clump becomes much more difficult to steer, and you’re never going to be able to smooth it out.  A badly (or not-at-all) designed application becomes a big clump of bugs, and if the piece of code you need to fix or replace is buried under many interlocking accretion layers, imagine how fun it’s going to be to change it.

Some cases are more avoidable than others, but that doesn’t mean it’s impossible to prevent or mitigate large-scale issues.  While it may sometimes seem like there’s no time to waste because you need to start writing code now (now! NOW!), getting clear plans can speed up development in both the short and long runs, because developers won’t be duplicating effort, stepping on each other’s toes, or organizing code in a painfully haphazard way which is going to make modification difficult.  Some discipline is required to make sure new code follows established conventions, and some friction should exist before pulling in new third-party dependencies, to make sure they aren’t redundant or inappropriate.
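One cheap way to add that friction is a build check that fails when a dependency appears that nobody has signed off on.  This is just a sketch; the allowlist contents and the requirements-style format are invented for illustration:

```python
# Sketch: fail fast when a new third-party dependency shows up that
# isn't on the team's reviewed allowlist. Contents are illustrative.

APPROVED = {"requests", "sqlalchemy", "pytest"}

def unapproved_dependencies(requirements_text):
    """Return dependency names in a requirements-style file that are
    not on the approved list, ignoring comments and blank lines."""
    found = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("==")[0].strip().lower()
        if name not in APPROVED:
            found.append(name)
    return found

if __name__ == "__main__":
    reqs = "requests==2.31.0\n# dev tools\npytest==7.4.0\nshiny-new-lib==0.1.0\n"
    print(unapproved_dependencies(reqs))  # ['shiny-new-lib']
```

Wire something like this into CI and “I like library Y better” becomes a conversation with the team instead of a surprise in production.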

Once Upon a Time…

There was a kingdom that made most of its money by placing little advertisements everywhere (everywhere!) it could fit them on a webpage. It had to record each of these ads very carefully so it could tax the advertiser and so the kingdom’s tax collectors didn’t come after them for misreporting. The monarch had two regiments to work on making sure the ad serves were processed accurately. One regiment wrote the code to process the ad serves and the other ran the machinery on which the code ran. The two regiments were separated by a big wall. When the first regiment received orders to change their code, they would finish a couple of weeks later and throw the code over the wall separating them from the second regiment, much like a grenade, because the new code would usually explode in the second regiment’s faces. Sometimes it would even explode when the first regiment hadn’t pulled the pin from the grenade they made. However, the second regiment was virtually powerless to get the first regiment to stop lobbing grenades over the wall, and no one lived happily ever after, but the first regiment was definitely getting more sleep.

The End.

Well, not the end. I have countless horror stories in the same vein, though, and sadly, they’re all true. Even now, that’s how software services are still developed and deployed in too many places. The software developers (the first regiment) implement the features requested by the product managers. In a separate silo, the operations team (the second regiment) deals with deploying the code to the servers and trying to keep the whole thing running in the real world, with varying degrees of recourse if the code is unreliable, non-performant, or just a big pile of crap. That model doesn’t work and requires an ever-growing army of round-the-clock operations people, which costs real money over the lifecycle of a service, to keep the show running in this 24/7 Internet economy.

So, yes, this is yet another devops blog. I’ve worked in traditional “operations” roles for many years (past titles include variants of Unix system administrator, systems engineer, service engineer, site reliability engineer, and devops engineer). I plan to share some more of those horror stories, talk about best practices for designing reliable, scalable services, and whatever else seems relevant or interesting.
