I would be curious to know if they evaluated any cloud-based data stores or streaming services from AWS or GCP before deciding to build this from scratch. It seems like a common set of requirements for event analytics pipelines.
It reminds me a lot of the classic KDB/Q architecture pattern. Even at this point, it's a marvel of tech.
There are other reasons for duplicates in event streams - not just the dupes introduced by at-least-once processing in Kinesis or Kafka workers. We've done a lot of thinking about this (all open-source) at Snowplow; this is a good starting point:
http://snowplowanalytics.com/blog/2015/08/19/dealing-with-du...
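To make the at-least-once kind concrete, here's a minimal sketch (Python, assuming each event is a dict carrying an event_id field - hypothetical names, not our actual pipeline code; the post above covers what we really do). It keeps the first occurrence of each (event_id, payload fingerprint) pair, which only catches "natural" duplicates where the exact same event body is delivered more than once:

    import hashlib
    import json

    def fingerprint(event: dict) -> str:
        # Hash the full payload so byte-identical redeliveries collapse to one key.
        return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

    def drop_exact_duplicates(events):
        # Keep the first occurrence of each (event_id, fingerprint) pair.
        # This only handles "natural" dupes from at-least-once delivery,
        # where the duplicate body is identical to the original.
        seen = set()
        for event in events:
            key = (event["event_id"], fingerprint(event))
            if key not in seen:
                seen.add(key)
                yield event

In a real streaming job the seen set would of course be windowed state in the framework's state store, not an unbounded in-memory set.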
Our last release started to tackle dupes caused by bots, spiders and dodgy UUID algos:
http://snowplowanalytics.com/blog/2016/12/20/snowplow-r86-pe...
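For that second kind - distinct events that collide on the same event_id because of a weak client-side UUID generator or a replaying bot - one common approach is to mint a fresh ID server-side so the colliding events stay queryable as separate rows. A rough sketch under the same hypothetical assumptions as above, not a reproduction of what the release actually ships:

    import uuid
    from collections import Counter

    def dedupe_synthetic(events):
        # "Synthetic" duplicates: different payloads sharing one event_id.
        # Run this after exact duplicates have been dropped; any remaining
        # collisions get a fresh UUID, with the original kept for reference.
        events = list(events)
        id_counts = Counter(e["event_id"] for e in events)
        for event in events:
            if id_counts[event["event_id"]] > 1:
                event = {**event,
                         "original_event_id": event["event_id"],
                         "event_id": str(uuid.uuid4())}
            yield event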