I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.
It reminds me a lot of the classic KDB/Q architecture pattern. Even at this point, it's a marvel of tech.
There are other reasons for duplicates in event streams - not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open-source) at Snowplow; this is a good starting point:
http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...
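To make the at-least-once kind concrete, here's a minimal sketch (Python, assuming each event is a dict carrying an event_id field - hypothetical names, not our actual pipeline code; the post above covers what we really do). It keeps the first occurrence of each (event_id, payload fingerprint) pair, which only catches "natural" duplicates where the exact same event body is delivered more than once:

    import hashlib
    import json

    def fingerprint(event: dict) -> str:
        # Hash the full payload so byte-identical redeliveries collapse to one key.
        return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

    def drop_exact_duplicates(events):
        # Keep the first occurrence of each (event_id, fingerprint) pair.
        # This only handles "natural" dupes from at-least-once delivery,
        # where the duplicate body is identical to the original.
        seen = set()
        for event in events:
            key = (event["event_id"], fingerprint(event))
            if key not in seen:
                seen.add(key)
                yield event

In a real streaming job the seen set would of course be windowed state in the framework's state store, not an unbounded in-memory set.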
Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:
http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...
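For that second kind - distinct events that collide on the same event_id because of a weak client-side UUID generator or a replaying bot - one common approach is to mint a fresh ID server-side so the colliding events stay queryable as separate rows. A rough sketch under the same hypothetical assumptions as above, not a reproduction of what the release actually ships:

    import uuid
    from collections import Counter

    def dedupe_synthetic(events):
        # "Synthetic" duplicates: different payloads sharing one event_id.
        # Run this after exact duplicates have been dropped; any remaining
        # collisions get a fresh UUID, with the original kept for reference.
        events = list(events)
        id_counts = Counter(e["event_id"] for e in events)
        for event in events:
            if id_counts[event["event_id"]] > 1:
                event = {**event,
                         "original_event_id": event["event_id"],
                         "event_id": str(uuid.uuid4())}
            yield event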