I'm surprised to see that ML-based semantic search is barely touched on in this article. There's a strong focus on entity matching, but an arguably more powerful way to conduct similarity search is to leverage embedding vectors from trained models.
A great upside to this approach is that it works for many types of unstructured data (images, video, molecular structures, geospatial data, etc.), not just text. The rise of multimodal models such as CLIP (https://openai.com/blog/clip) makes this even more relevant today. Combine it with a vector database such as Milvus (https://milvus.io) and you can do this at scale with minimal effort.
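To make the idea concrete, here's a minimal sketch of embedding-based similarity search. The toy vectors are placeholders: in practice they would come from a trained model such as CLIP, and a vector database would replace the brute-force scan for large corpora.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> list:
    """Return indices of the k corpus vectors most similar to the query.

    Brute-force scan for illustration only; a vector database
    (e.g. Milvus) would use an approximate index (HNSW, IVF) instead.
    """
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm  # cosine scores against every vector
    return list(np.argsort(-scores)[:k])

# Toy 2-d "embeddings" standing in for real model outputs.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(nearest(query, corpus, k=2))
```

The same two functions work unchanged whether the vectors encode text, images, or any other modality, which is exactly the appeal of the embedding approach.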
I would like to know if any of these techniques could be used to identify articles that are copies of each other, near-copies, or different articles covering the same story.
I have been working on an entity matching solution for two years now, and I decided to write down some of the lessons I picked up along the way. It turns out there are too many relevant details to cover in a single post, so I will cover the topic in multiple parts.
This first part is the high-level introduction, useful for project planning and the architecture decisions that need to be made early in development. Any feedback is welcome, along with requests for the follow-up parts if there is something specific you would like covered.