Ask HN: Identifying duplicate data from a large dataset?

  • This is easier than deduplicating the many different URLs that serve the same content. A harder problem awaits you!

    ML & basic stats
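
For the URL case mentioned above, here is a minimal sketch of one common approach: fingerprint each page body and group URLs that share a fingerprint. The function names (`content_fingerprint`, `group_duplicate_urls`) and the input shape (a dict mapping URL to already-fetched body text) are assumptions made for illustration, not something from the thread. It only catches content that is identical after whitespace and case normalization; near-duplicates would need a different technique.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the text after light normalization."""
    # Collapse runs of whitespace and lowercase, so trivial formatting
    # differences do not produce different fingerprints.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicate_urls(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose bodies share a fingerprint.

    `pages` is a hypothetical mapping of URL -> page body text.
    Only groups with more than one URL are returned.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[content_fingerprint(body)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Storing one digest per page keeps memory use small on a large dataset, since the bodies themselves never need to be compared pairwise.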