What Data Can’t Do

  • I am increasingly worried about people applying ML to everything without any rigour.

    Statistical inference generally only works well under very specific conditions:

    1 - You know the distribution of the phenomenon under study (or make an explicit assumption and accept the risk of being wrong)

    2 - Using (1), you calculate how much data you need to get an estimation error below x%
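    A minimal sketch of step (2), assuming the classic textbook setting: a normally-distributed estimator with a known standard deviation (exactly the kind of explicit assumption point (1) demands). The helper name is hypothetical:

    ```python
    import math

    def sample_size_for_mean(sigma: float, margin: float, z: float = 1.96) -> int:
        """Samples needed so the 95% confidence-interval half-width on the
        sample mean is at most `margin`, assuming i.i.d. draws with known
        standard deviation `sigma` (normal approximation via the CLT)."""
        return math.ceil((z * sigma / margin) ** 2)

    # e.g. estimating a mean with sigma = 15 to within +/- 1 at 95% confidence:
    print(sample_size_for_mean(sigma=15.0, margin=1.0))  # 865
    ```

    The point is that the answer only means anything if the distributional assumption behind `z` and `sigma` actually holds.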

    Even though most ML models are essentially statistics and share all the same limitations (convergence issues, fat-tailed distributions, etc.), the industry standard seems to be to pretend none of that exists and hope for the best.
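    A rough illustration of the fat-tail point, assuming we simulate with a Pareto distribution with alpha = 1.1 (infinite variance): repeated "large" samples from the same distribution still produce wildly different means, so the usual error estimates are not to be trusted.

    ```python
    import random
    import statistics

    random.seed(42)

    def sample_mean(n: int, alpha: float = 1.1) -> float:
        # Pareto with alpha <= 2 has infinite variance; the sample mean
        # converges painfully slowly (and for alpha <= 1, not at all).
        return statistics.fmean(random.paretovariate(alpha) for _ in range(n))

    # Five independent "large" samples from the same distribution:
    means = [sample_mean(50_000) for _ in range(5)]
    print([round(m, 2) for m in means])
    ```

    With a well-behaved distribution these five numbers would agree closely; here they scatter, because rare huge draws dominate each sample.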

    IMO the best moneymaking opportunities of the decade will involve exploiting unsecured IoT devices and naive ML models; we will have plenty of both.

  • This author has published a couple of articles like this at The New Yorker. They all have this in common: the author works through some interesting and in some ways unusual cases where data or statistics have been improperly or naively applied, with some social cost. I really enjoy the articles themselves.

    Then the New Yorker packages it up with a cartoon and a headline and subheadline like "Big Data: When will it eat our children?" or "Numbers: Do they even have souls?", and serves it up to their technophobic audience in a palatable way.

    https://www.newyorker.com/contributors/hannah-fry

  • The article seems to stop pretty early, as if something is missing.

    It’s an anecdote about a government incentive for doctors to see patients within 48 hours, which caused doctors to refuse to schedule patients more than 48 hours out in order to collect the incentive bonus.

    This is not an example of limits of data, but an example of perverse incentives.

  • Data is not a substitute for good judgment, for empathy, for proper incentives.

    The article focuses on governments and bureaucracies but there's no better example than "data-driven" tech companies, as we A/B test our key engagement metrics all the way to soulless products (with, of course, a little machine learning thrown in to juice the metrics).

    I wrote about this before: https://somehowmanage.com/2020/08/23/data-is-not-a-substitut...

  • Kind of love the initial story in the article about 48-hour wait times.

    I had a stint writing conferencing software, and every once in a while we'd come across a customer requirement with capabilities that we developers could see would obviously be misused. As a result, we did the "Thinking, Fast and Slow" pre-mortem to help surface other ways the system could be attacked (along with what we would do to prevent it and how that impacted the original feature).

    If you create something, and open it to the public, and there's any way for someone to misuse it for financial incentive (especially if they can do so without consequence), it will be misused. In fact, depending on the incentive, you may find that the misuse becomes the only way that the service is used.

  • > doctors would be given a financial incentive to see patients within forty-eight hours.

    Not measuring that from the patient's first contact is simply dishonest.

    "Call back in three days to make the appointment, so I can claim you were seen within 48 hours, and therefore collect a bonus" amounts to fraud because the transaction for obtaining that appointment has already been initiated.

    I mean, they might as well just give the person the appointment in a secret, private appointment registry, and then copy the appointments from that registry into the public one in such a way that it appears most of the appointments are being made within the 48-hour window. Nothing changes, other than that bonuses are still being fraudulently collected, but at least the doctor's office isn't being a dick to the patients.

  • Data always needs to be paired with empathy. ML/AI simply doesn't have empathy so it will always be missing a piece of the overall pie.

    Let AI crunch the numbers, but combine it with a human who can understand the "why" of things and you can really kick butt.

  • Confusing performance metrics with strategic objectives is not a data problem, it is a human problem. It happens to a lot of people outside the usual Blair/white-nationalist/IQ crowd. I do not think that advanced technical knowledge in ML or stats is required to avoid this mistake; what is required is the ability to make valid counterfactual statements.

    A good example of what I mean can be found on Wikipedia:

    > His instinctive preference for offensive movement was typified by an answer Patton gave to war correspondents in a 1944 press conference. In response to a question on whether the Third Army's rapid offensive across France should be slowed to reduce the number of U.S. casualties, Patton replied, "Whenever you slow anything down, you waste human lives."[103]

    https://en.wikipedia.org/wiki/George_S._Patton

    Here, US General Patton is not confusing a performance metric (number of casualties) with the strategic goal (winning the war). His counterfactual statement might be: "if we slow things down, we are simply delaying future battles and increasing the total number of casualties needed to achieve victory."

    I'm not surprised at Blair's decision. When we choose leaders, do we favor long-term strategic thinkers, or opportunistic pretty faces?

  • From the ungated archive [1]:

    > Whenever you try to force the real world to do something that can be counted, unintended consequences abound. That’s the subject of two new books about data and statistics: “Counting: How We Use Numbers to Decide What Matters”, by Deborah Stone, which warns of the risks of relying too heavily on numbers, and “The Data Detective”, by Tim Harford, which shows ways of avoiding the pitfalls of a world driven by data.

    Data is a powerful feedback mechanism that can enable the gamification of a system; it can also expose it. The evil is extracting unearned value from a system through gamification, not the tools employed to do so. I’m looking forward to reading both books.

    [1] https://archive.is/ynOm2

  • Data is very limited, indeed. We can't predict outside the distribution, or unrelated events (without a causal link), or random events in the future. We should be humble about the limits of data.

  • Does the use of statistics actually amplify misunderstanding, or merely reveal misunderstandings that were already there? In any of these examples given - predicting rearrests, infant mortality, or so on - it's hard to imagine that someone not using numbers would have reached a conclusion that was any closer to the truth.

    Data has its limits, but the solution is usually - maybe even always - more data, not less.

  • This article is total gibberish. It's a terrible mixture of many unrelated things. Just because those things all have something to do with data (anything can be presented in numeric form) does not make their issues about data.

    First, the Tony Blair example is not about data. It is a failure of government planning. It's bad politics and bad economics.

    The G.D.P. example is laughable. G.D.P. was never intended to be used to compare individual cases. What kind of nonsense is this?

    And the IQ example: the results are backed by decades of extensive studies, but the author thinks picking a few critics can invalidate the whole field. And look, the white supremacist who gave Asians the highest IQ; what a disgrace to his own ideology.

    There are many more. It feels like there's a tactic for producing this kind of article: glue together a bunch of things that seem vaguely related, and bam, you've got an article.

  • An absolutely fantastic article that captures my concerns as a user, purveyor, and automator of systems that help with numbers. I'm always very cautious about the jump from numbers informing to numbers deciding.

  • This article is not so much about the data, as it is about rules and thresholds used to divide that data into groups.

  • "once a useful number becomes a measure of success, it ceases to be a useful number"

    Two other unintended consequences of incentives I learned in economics:

    1. Increasing fuel efficiency does not reduce gas consumption. People just use their cars more often.

    2. Asking people to pay-per-bag for garbage pickup resulted in people dumping trash on the outskirts of town.

    Edit: Did more research after a downvote. Definitely double-check things you learn in college:

    1. The jury is still out: https://en.wikipedia.org/wiki/Jevons_paradox

    2. Seems false: https://en.wikipedia.org/wiki/Pay_as_you_throw#Diversion_eff...

  • They might as well have included the granddaddy example (as far as the age of computing goes): the Vietnam War, McNamara, and body counts.

  • All observation is theory-laden; data cannot speak for itself.

  • Around the paywall: https://archive.vn/ynOm2

  • Solve covid: not a vaccine but a treatment. Those are forbidden.