Testing on production

  • An interesting perspective I once heard from an information security expert is that there's a difference between risks and 'things that can go wrong'. Something is only an actual risk if it hurts the bottom line. In particular, quite a few things that can go wrong don't carry that much risk, and conversely something that is unlikely (but not impossible) to go wrong may carry huge amounts of risk.

    The trick with this perspective is that after identifying the real risks you can then link the risks and possible mitigations by looking at all 'things' and identifying the ways in which they might fail (and how this may be prevented from happening). This way you can easily identify which mitigations are helping prevent risks and which risks are not sufficiently mitigated. It's a fair bit of work, but it's not complicated and often gives useful insights.

    What this article basically does is note that you should first assess what risks a failed deployment carries. It correctly states that in quite a few cases this risk is low, so the mitigations (of which there can be many) may not be necessary and may in fact be doing harm without meaningfully reducing any risk.
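
    To make the risk/mitigation mapping concrete, here is a minimal sketch of the bookkeeping. Everything in it is made up for illustration: the failure modes, the "high"/"low" cost labels, and the mitigation names are assumptions, not anything from the article.

```python
# Hypothetical inventory: each "thing that can go wrong" gets a rough cost
# to the bottom line and the mitigations currently in place for it.
failure_modes = {
    "bad deploy takes down checkout": {
        "cost": "high",
        "mitigations": ["canary release"],
    },
    "typo on internal admin page": {
        "cost": "low",
        "mitigations": ["staging sign-off", "manual QA"],
    },
    "database migration corrupts orders": {
        "cost": "high",
        "mitigations": [],
    },
}


def unmitigated_risks(modes):
    """Real risks (high cost) that have no mitigation at all."""
    return sorted(
        name
        for name, info in modes.items()
        if info["cost"] == "high" and not info["mitigations"]
    )


def mitigations_without_risk(modes):
    """Mitigations that only ever guard low-cost failure modes."""
    useful = {
        m
        for info in modes.values()
        if info["cost"] == "high"
        for m in info["mitigations"]
    }
    everything = {m for info in modes.values() for m in info["mitigations"]}
    return sorted(everything - useful)


print(unmitigated_risks(failure_modes))
print(mitigations_without_risk(failure_modes))
```

    The first list is where you are exposed; the second is where you may be paying for process that protects nothing important.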

  • I have a dumb question as a non-SWE who is curious about software engineering.

    I've heard "feature flags" are popular these days, and I understand that that's where you commit code for a new way of doing things but hide it behind a flag so you don't have to turn it on right away.

    Now, if I want to test in prod, couldn't I just make the flag for my new feature turn on if I log in on a special developer test account? And if everything goes well, I change the condition to apply to everyone?
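
    As a rough sketch of what that could look like (the `FeatureFlag` class, the account IDs, and the `checkout` function are all hypothetical, and real flag systems usually read this from a config service rather than hard-coding it):

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    """A flag that is on for an allow-list of accounts, or for everyone."""

    name: str
    enabled_for_everyone: bool = False
    allowed_accounts: Set[str] = field(default_factory=set)

    def is_enabled(self, account_id: str) -> bool:
        return self.enabled_for_everyone or account_id in self.allowed_accounts


def checkout(account_id: str, flag: FeatureFlag) -> str:
    # The new code path ships to prod but only runs when the flag allows it.
    if flag.is_enabled(account_id):
        return "new checkout"
    return "old checkout"


# Step 1: only the special developer test account sees the new behaviour.
new_checkout = FeatureFlag("new_checkout", allowed_accounts={"dev-test-account"})
print(checkout("dev-test-account", new_checkout))
print(checkout("regular-customer", new_checkout))

# Step 2: everything went well, so flip the condition to apply to everyone.
new_checkout.enabled_for_everyone = True
print(checkout("regular-customer", new_checkout))
```

    The rollout is then a data change (flipping the flag) rather than a new deploy, which is a large part of the appeal.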

  • I enjoyed the entire article except this part:

    > Unfortunately there is no easy way to distinguish between people who are good and need a paycheck from people who just need a paycheck. But you sure as hell don’t want the latter in your team.

    If you can't tell them apart, then the distinction is unimportant. So if among the group of people who need paychecks, good is indistinguishable from non-good, the comment serves no purpose other than needless elitism.

  • this made me chuckle

    > If GitHub makes a mistake it can affect thousands of businesses but they’ll likely shrug and their DevOps team will just post “GitHub is down, nothing we can do” on some Slack channel.

    Gonna try and read the rest of this on my lunch break, as it was surprisingly meaty for a clickbait title ;)

  • Good article, but it's a bit binary on the notion of incident. For the same company, it can be very serious to have a global 1h outage, but not so serious to have the internal admin interface down for 1h. This allows for a more fine-grained assessment of the validation required to push to prod: the "checks" only have to test the critical parts of the application. Developer experience starts deteriorating when the non-critical parts are over-tested.

  • Love this article. So many great points that I deeply agree with but have never really put into words, and all written in such an engaging style.

  • Just keep the environments separate, but similar. What works in the test environment should work in production.

    Of course, there are always exceptions to this rule. Adapt and modify the code as needed.

    We keep three environments at work: Dev, Test and Prod. However, dev environments are sometimes neglected and some features land in Test only.

    So, use Dev as a development playground. Use Test to test the changes made in Dev. If the change is approved in Test, it goes to the Prod environment.

  • Everybody has got a test environment. Some also have a production environment.

  • >> If Tesla makes a mistake in their autopilot software, people might die.

    In this case, a good "Testing on Production" rule would be to not let customers test your software, period.

    There's plenty of land and resources to construct towns and cities that simulate real-life commute very accurately.

    In the case of self-driving (or even autopilot), you're not really testing a feature, you're researching a new product. The difference is vast.

  • > Shipping confidence We can define “shipping confidence” as the feeling a mentally sane developer has when they know their code is about to be deployed to production (whether it can be updated over the air or not).

    A bug which must be fixed in production is much more expensive than a bug fixed during development.

    People here complain when you bash Microsoft, but their philosophy was (and still is) to let the users test the product.

  • Double negation is hard... :) (yes and no should be switched)

    > Ask yourself a question: do you have any reason to think that your engineers will not do a good job? If the answer is no: why are they still there? If the answer is yes: let them do their damn job.

  • We all test in production but some people are in denial and refuse to accept it.

  • > The TL;DR is that some (“best”) practices are contextual and understanding when to use them is ultimately what gives us the title of “engineers”.

    So well put. Just today I implemented a feature and kept asking myself whether I should extend the component (leaning more towards OOP) or just add an additional argument to it. The latter would have stuck closer to the current style, but I also realized there's no obviously better way. Extending made sense, and I realized that understanding the nuance and standing up for those design decisions is what I am here to do :)

    Thanks for putting that in fewer words.

  • "Everybody has a testing environment. Some people are lucky enough to have a totally separate environment to run production in."

  • I don’t understand why people redefine words just to make their point. It is confusing at best, and at worst it changes the meaning of words when the post goes viral. “Smart” people means smart people. It shouldn’t be used to mean junior devs who are trying too hard to prove themselves and over-engineer or choose the wrong approach. So many words have lost their original meaning because someone decided to write a viral post and redefine them to make a point.

  • Welp! For some reason someone at HN decided to change the title and bump this down to the 11th position (atm). Not sure what I did wrong here but it feels pretty crappy...

    @dang any chance you could help here? :(