“Bugs are 100x more expensive to fix in production” study might not exist (2021)

  • I work in the HW industry, but doing SW.

    If I deploy a bug despite "unit" tests, it will probably be caught by 2nd-tier tests a day later, which is already more expensive since I'll need to read the report, context switch from another task, fix it, redeploy, wait for CI, wait for the 2nd-tier tests, etc.

    If it isn't caught by the 2nd tier, it will probably be caught by validation engineers, which not only costs their time: they also have to contact me / file a bug, I need to get familiar with it, context switch again, fix it, redeploy, wait for tests #1 and #2, and wait for their confirmation.

    If it gets released to actual customers, then there's reputational damage (can be huge depending on impact), possibly some lost sales / customers, and probably a few weeks before it gets back to me through customer -> ??? -> validation -> me.

    So, while 100x more expensive is an extreme case, the cost is usually very significantly higher the later you find the bug.

    But what about HW bugs?

    I think those can be really, really expensive.

    Imagine catching a hardware bug which affects some computation and MUST be fixed at the hardware level.

    Catching it before release versus catching it after selling 100m units is a difference of tens of billions of dollars for top GPUs and CPUs.

    Why do you think those companies are willing to build the same product twice so the implementations can check each other? It significantly increases the cost, but reduces the risk.

    Think about CrowdStrike's incident: catching that bug at the dev level would have cost them a shitton less than what actually happened :P

    Their stock dropped by like 45%

  • I have some thoughts on this (in the context of modern SaaS companies).

    The most expensive parts of fixing a bug are discovering/diagnosing/triaging the bug, cleaning up corrupted records, and customer communication. If you discover a bug in development, or even better while you are writing the function or during a code review, you get to bypass triaging, customer calls, escalations, RCAs, etc. At a SaaS company with enterprise customers, each of those steps involves multiple meetings with your Support, Account Manager, Senior Engineer, Product Manager, Engineering Manager, Department Manager, sometimes Legal or a Security Engineer, and then finally the actual coder. So of course resolving an issue (at a modern SaaS company) during development can be 10-100x less expensive, just because of how much bureaucracy is involved in running a large-scale enterprise SaaS company.

    It also brings up an interesting side effect of companies adopting non-deterministic coding (AI code): bugs that could have been discovered during design/development by a human engineer writing the code can now leak all the way into prod.

  • If you ship firmware to devices, it could be far more expensive. [1]

    [1] https://www.bleepingcomputer.com/news/hardware/botched-firmw...

  • Forget the study, let's just do a simple thought experiment. Your developer gets paid $140k/yr (let's round up to ~$70/hr). Let's say a given bug found in testing takes 1 hour to fix; that's $70 (not counting the costs of CI/CD etc). If they miss it in test and it hits production, would it cost $7,000 to fix? That depends on what you mean by "bug", what it affects, and what you mean by "fix in production" (a rough sketch of the arithmetic follows the examples below).

    - Did you screw up the font size on some text you just published? Ok, you can fix that in about 5 seconds, and it affects pretty much nothing. Doesn't cost 100x.

    - Did your SQL migration just delete all records in the production database? Ok, that's going to take longer than 5 seconds to fix. People's data is gone, apps stop working, missing or bad data fed to other systems causes larger downstream issues, there's the reputational harm, the money you'll have to pay back to advertisers for their ads / your content being down, and all of that multiplied by however long it takes you to restore the database from backup (um... you do test restoring your backups... right?). That's closer to 100x more expensive to fix in production.

    - Did you release a car, airplane, satellite, etc with a bug? We're looking at potentially millions in losses. Way more than 1000x.
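
    A minimal back-of-the-envelope sketch of that arithmetic (the hourly rate matches the $140k figure above; the fix times and downstream costs are purely illustrative guesses, not numbers from any study):

      # Back-of-the-envelope bug-cost comparison. All numbers are illustrative.
      HOURLY_RATE = 70  # ~$140k/yr spread over ~2,000 working hours

      def fix_cost(engineer_hours, other_costs=0):
          """Engineering time plus everything that happens outside the editor."""
          return engineer_hours * HOURLY_RATE + other_costs

      scenarios = {
          "caught in test":         fix_cost(engineer_hours=1),
          "font-size typo in prod": fix_cost(engineer_hours=0.1),
          "wiped prod table":       fix_cost(engineer_hours=40, other_costs=50_000),
      }

      baseline = scenarios["caught in test"]
      for name, cost in scenarios.items():
          print(f"{name}: ${cost:,.0f} ({cost / baseline:.1f}x the in-test fix)")

    The multiplier is dominated by the "other costs" term, which is exactly what the 100x rule of thumb is gesturing at.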

    And those are just the easy ones. What about a bug you release that then gets adopted (and depended on) by downstream API consumers, and that you then spend decades patching over and engineering around? How about when production bugs cause your product team to lose confidence in deployments, so they spend weeks and weeks getting "ready" for a single deploy, afraid of it failing and of not being able to respond quickly? That fear dramatically slows down the pace of development/shipping.

    The "long tail" of fixing bugs in production involves a lot more complexity than in non-production; that's where the extra cost comes from. These costs could end up costing 10,000x over the long term, when all is said and done. Security bugs, reliability bugs, performance bugs, user interface bugs, etc. There's a universe of bugs which are much harder/costlier to fix in production.

    But you know what is certain? It always costs more to fix in production. 1.2x, 10x, 1000x, that's not the point; the point is, fix your bugs before they go to production. ("Shift left" is how we refer to this in the DevOps space, but it applies to everything in the world that has to do with quality: improve quality before it gets shipped to customers, and you save money in the long run.)

  • HN discussion about the Register article from 2021: https://news.ycombinator.com/item?id=27917595

    HN discussion about the original blog post from 2021: https://news.ycombinator.com/item?id=27892615

  • Subtitle:

    > It's probably still true, though, says formal methods expert

    Seems like clickbait. The thesis is predicated on the idea that people claim this is the result of some study. I've never once heard it presented that way. It's a rule of thumb.

  • There was another 1981 publication that went into this, Boehm's "Software Engineering Economics", but I can't find the details right now.

    This NASA publication [1] cites a number of studies and the cost increase factors they estimate for various stages of development, including the Boehm study.

    [1]: https://ntrs.nasa.gov/api/citations/20100036670/downloads/20...

  • Spelling mistake in the "about us" section of your continuously deployed website? Production or pre-production matters little if it's a developer catching the bug. Maybe if it's caught by a customer, you have some overhead as it's triaged through QA, product leads, routed to someone with commit permissions, etc.

    Spelling mistake in the "about us" section of your program baked into ROMs of internationally sold hardware? 100x is a vast underestimate of the cost multiplier to "fix in production", which would likely involve recalls, if not trashing product outright, and ROMs were a lot more common in the era this 100x figure supposedly came from. You might fix it for the next batch, or include the fix if you had a more critical bug that might make the recall worth it, but otherwise that bug lives in production for the life of the hardware.

    Spelling mistake in the HTTP 1.x "Referer" field? You update the world's dictionaries, because that's significantly cheaper than mutating the existing bits of the protocol. Any half measure that required maintaining backwards compatibility would cause more problems than the spelling fix solves, and any full measure that fixed all the old software for everyone might require a war or three after bankrupting a few billionaires. That bug isn't just in software now; it's burrowed into books and minds.
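
    For the unfamiliar, the misspelling really is frozen into the wire format, so every HTTP client has to keep reproducing it; a tiny illustration using Python's standard library (the URLs are just placeholders):

      # The header is spelled "Referer" in the HTTP spec, and every client must
      # keep using the typo; "fixing" it would break existing servers.
      import urllib.request

      req = urllib.request.Request("https://example.com/")
      req.add_header("Referer", "https://news.ycombinator.com/")  # sic: one "r"
      print(req.get_header("Referer"))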

    "Same" bug, but different contexts lead to wildly different multipliers. If you want useful numbers for your context, you probably don't need or want a generic study and statistics - I'd bet even a dataless wild guess would be more accurate. Alternatively, you can run your own study in your own context...

  • I don't get the hairsplitting here; it seems obvious to me that if you build the wrong feature, you have to replace it with something else, which needs building, as well as something akin to demolition of the first feature.

    Repeat this cycle more than once for the same feature and it clearly adds up to real impact…

    the 100x may be exaggerated but that’s beside the point to me — I think even 2x or 3x on a feature is regrettable and oftentimes avoidable

  • > Laurent Bossavit, an Agile methodology expert

    Congratulations, you got me to stop reading just at the start of the article.

    On topic, I don't think any good engineer ever claimed what the title of the article says. The "more expensive" part stems from having to rush and maybe do a sloppy job, introducing regressions, higher hosting costs, or other maladies.

    So the "higher cost" might just be a compounding value borne out of panicky measures. Sometimes you really do have to get your sleeves rolled up and timebox any fix you have in mind and just progress and/or actually kill the problem. Often though, you just deflect the problem to somewhere else temporarily where the "bleeding" will not be as significant. Which buys you the time to do a better job.

    Titles like this are highly dramatized. I am surprised any serious working person ever took them seriously.

  • The 100x also is kind of meaningless. 100x compared to what? I can introduce a bug by being careless for 1 minute, discover it in my own tests, and spend the next 2 days figuring it out. That's already 960x (two 8-hour days is 960 minutes) compared to the time it took me to introduce it.

  • Waterfall development never existed, either.

    Business-management and self-help publishing long predates research, and nothing has changed. For some reason, software development has been extra susceptible to their nonsense.

  • Fixing it in production means it could have affected your production users, who in turn couldn't do whatever it is that they do to actually make or give you money, with an unknown but potentially significant effect on your bottom line.

    It also involves more, and often more senior, people who are paid more, as the bug must be triaged, assigned, and managed.

    Whilst it is unlikely that this falls neatly onto exact orders of magnitude, e.g. exactly 10x and 100x more, if it's taken to mean "substantially and very substantially more expensive" it seems fine.

  • Some production systems are unique or expensive enough that emulating/virtualizing a setup in development is a large enough effort that it's cheaper to punt the observation and correction of environment-specific bugs to production.

  • Every major vendor now does its testing on customers' machines. Release early and often; record bugs; fix some. Rely on community forums for support.

    This sounds like it's cheaper to fix in production, by orders of magnitude?

  • Bug found by you, the developer vs. user in production? Easily 100x.

  • Speaking of presenting information in a misleading manner, did anybody actually get interviewed for this article, or is this just a reheating of some blog articles?

  • Claiming that a supposedly well-known study doesn't exist, when that study is in fact not at all well known or even cited (although the premise is), is pure rage-bait/clickbait gold.

    edit reply because fuck you hn mods:

    What? How?? I'm not claiming it's from a scientific study; I'm claiming it's "conventional wisdom" firmly in the zeitgeist.

  • Actual title:

    Everyone cites that 'bugs are 100x more expensive to fix in production' research, but the study might not even exist

  • 2021