The Missing 11th of the Month (2015)

  • Interesting! Be sure to follow the link to the second post about what happened to the 2nd, 3rd, 22nd, and 23rd. It's simpler but still worth the read:

    https://drhagen.com/blog/the-missing-23rd-of-the-month/

  • This is why one of my principles is to be skeptical of outliers. Often they are not real and therefore misrepresent the true data.

    It's one reason to prefer the median over the mean at the outset, and to try throwing out the outliers just to see what things look like.
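
    A quick sketch of the median-versus-mean point, using made-up numbers (not data from the article) where the last value stands in for a bogus outlier:

```python
from statistics import mean, median

# Hypothetical measurements; 500 is an invented outlier for illustration.
values = [10, 11, 9, 10, 12, 10, 500]

# Crude trimming: drop the single smallest and single largest value.
trimmed = sorted(values)[1:-1]

print(mean(values))    # dragged far upward by the one outlier
print(median(values))  # stays at the typical value
print(mean(trimmed))   # mean once the extremes are thrown out
```

    The median and the trimmed mean land near 10, while the plain mean is roughly eight times larger because of a single point.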

  • You can tell how much they cared about data quality because they never took the time to look at context-dependent glyph equivalencies. Some context-sensitive algorithms might not make the same mistakes as a naive “guess what characters are here” algorithm that just uses glyph shapes. You run into this a LOT with ALPR systems because some of the presses excluded some characters. O and 0 are the most common character equivalency, but only in certain positions.

    OCR is actually complicated if you’re trying to rely on the data for something.
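
    A minimal sketch of what a position-aware equivalence could look like. The plate format ("LLLDDD"), the confusion table, and the function name are all assumptions for illustration, not any real ALPR system's rules:

```python
# Hypothetical plate format: three letters followed by three digits.
PATTERN = "LLLDDD"

# Assumed look-alike pairs (digit -> letter); the reverse map is derived from it.
TO_LETTER = {"0": "O", "1": "I", "5": "S", "8": "B"}
TO_DIGIT = {letter: digit for digit, letter in TO_LETTER.items()}

def normalize(raw: str, pattern: str = PATTERN) -> str:
    """Swap look-alike glyphs based on what each position is allowed to hold."""
    if len(raw) != len(pattern):
        return raw  # unexpected length: leave it alone rather than guess
    out = []
    for ch, kind in zip(raw.upper(), pattern):
        table = TO_LETTER if kind == "L" else TO_DIGIT
        out.append(table.get(ch, ch))
    return "".join(out)

print(normalize("A0C1O5"))  # AOC105
```

    The same glyph gets read as O in a letter slot and as 0 in a digit slot, which a shape-only recognizer cannot do.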

  • The last time this hit HN, my hosting provider complained about the traffic, but I've since migrated the blog to GitHub Pages, so I guess that won't be an issue this time.

  • I love stuff like this.

    However, shouldn't every date with a "1" be less common if that is the case? Why 22 and 23?

    I think 11 might be somewhat explained by scanner errors if we assume, e.g., that l2 is corrected to 12 but ll is not corrected to 11.

    But I guess maybe 2, 3, 11, 22, and 23 are less common because people overcompensate to avoid picking dates that don't look randomly sampled?
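
    To make the l2-versus-ll guess concrete, here is a toy corrector. The rule it uses (only repair a token that already contains a real digit) is purely an assumption about how such a post-processing step might behave, not something the article confirms:

```python
import re

def naive_fix(token: str) -> str:
    """Hypothetical cleanup: treat a token as a number only if it has a digit."""
    if re.search(r"\d", token):
        # Looks numeric, so read lowercase l and capital I as the digit 1.
        return token.replace("l", "1").replace("I", "1")
    # No digit at all: the token looks like a word, so it is left unchanged.
    return token

print(naive_fix("l2"))  # 12
print(naive_fix("ll"))  # ll
```

    Under that assumed rule a misread 12 is recovered but a misread 11 is silently lost, which would depress 11 specifically.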

  • Well, that was surprisingly fun to read!

  • tl;dr: It's an OCR error.

  • Naming an event after its date will have a limited run.