Hacker News

The Distributed Data Mesh (2019)

by 0asa on 3/18/2021, 10:24:59 AM with 8 comments

by chatmasta on 3/19/2021, 1:22:28 PM
We're building a data mesh at Splitgraph [0]. We provide a unified interface to query and discover data products. You can query the data using the Postgres wire protocol at a single endpoint, with any of your existing tools. And you can discover it in the catalog, using a familiar GitHub-like interface. You can try this right now on the public website, where we federate access to 40k open datasets. Every dataset is addressable with a `namespace/repository:tag` format. The `tag` can refer to either the live data, in which case we forward the query upstream, or to a versioned snapshot of data (a "data image") that you build with declarative, Docker-like tooling. [1]
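To make the `namespace/repository:tag` addressing above concrete, here is a minimal Python sketch that splits such an address into its parts. It is an illustration only, not Splitgraph's own parser: the `DatasetAddress` type, the `parse_address` function, the fallback tag `latest`, and the example address are all assumptions made for this sketch.

```python
from typing import NamedTuple


class DatasetAddress(NamedTuple):
    """Parts of a `namespace/repository:tag` address (hypothetical type)."""
    namespace: str
    repository: str
    tag: str


def parse_address(address: str, default_tag: str = "latest") -> DatasetAddress:
    """Split an address into namespace, repository, and tag.

    The `default_tag` fallback is an assumption of this sketch; the comment
    above only describes the full three-part format.
    """
    # Separate the optional ":tag" suffix from the "namespace/repository" path.
    path, colon, tag = address.partition(":")
    namespace, slash, repository = path.partition("/")

    # Require exactly one "/" with non-empty text on both sides.
    if not slash or not namespace or not repository or "/" in repository:
        raise ValueError(f"expected namespace/repository[:tag], got {address!r}")
    # A trailing ":" with nothing after it is treated as malformed.
    if colon and not tag:
        raise ValueError(f"empty tag in {address!r}")

    return DatasetAddress(namespace, repository, tag if colon else default_tag)


if __name__ == "__main__":
    # Hypothetical address, used only to show the shape of the result.
    print(parse_address("example-namespace/example-repo:v1"))
```

Keeping the tag as a separate field mirrors the distinction drawn above: the same repository can be addressed either as live data or as a versioned snapshot, depending on the tag.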
On the enterprise side, integrating the access and discovery layers gives a lot of advantages, especially around data governance. On the web, we give users tools to connect data sources, document them, and share/audit access to them. When a query comes through the endpoint, since we're implemented as a Postgres proxy, we can rewrite/filter/drop it in accordance with rules, or we can forward it along to the upstream data source(s) and/or join across them. If you use Splitfiles to generate versioned data, we can also provide data lineage/provenance and full reproducibility.
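As a rough illustration of the rewrite/filter/drop idea described above, the following Python sketch shows how a proxy might pick an action for a query based on access rules. Everything here is hypothetical: the `Rule` and `Action` types, the role names, the default-deny behaviour, and the idea of attaching a row filter are assumptions for the sketch and do not describe Splitgraph's actual rule engine. Parsing and rewriting real SQL is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional, Tuple


class Action(Enum):
    FORWARD = "forward"   # pass the query to the upstream source unchanged
    REWRITE = "rewrite"   # forward it with an extra row filter applied
    DROP = "drop"         # refuse the query


@dataclass(frozen=True)
class Rule:
    """One access rule (hypothetical shape)."""
    table: str                        # table the rule applies to
    roles: FrozenSet[str]             # roles the rule applies to
    action: Action
    row_filter: Optional[str] = None  # predicate to add when action is REWRITE


def decide(role: str, table: str, rules: List[Rule]) -> Tuple[Action, Optional[str]]:
    """Return the action of the first rule matching this role and table.

    Falling back to DROP when nothing matches (default deny) is an
    assumption of this sketch, not something stated in the comment.
    """
    for rule in rules:
        if rule.table == table and role in rule.roles:
            return rule.action, rule.row_filter
    return Action.DROP, None


if __name__ == "__main__":
    rules = [
        Rule("sales", frozenset({"analyst"}), Action.REWRITE, "region = 'EU'"),
        Rule("sales", frozenset({"admin"}), Action.FORWARD),
    ]
    for role in ("analyst", "admin", "guest"):
        print(role, decide(role, "sales", rules))
```

Because every query passes through one endpoint, a single decision function like this can sit in front of all upstream sources, which is the governance advantage the comment points to.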
We've been working on this for ~3 years but are still pretty early. If anyone wants to help, we just raised a seed round and are hiring a remote team -- check my comment history for links.
[0] https://www.splitgraph.com
[1] https://www.splitgraph.com/docs/working-with-data/using-spli...
by joshuanapoli on 3/19/2021, 12:56:27 PM
I think that the main point of the article is that a company’s data strategy should result in discrete data products aligned with business domains. A domain-oriented team should be responsible for each data product. Data infrastructure should cover universal data-processing concerns, but should not include business logic. These characteristics contrast with a centralized data lake, where a single organization is responsible for both the infrastructure and content of the data resource.
by eternalban on 3/19/2021, 12:49:29 PM
I can't distinguish what is described here from a service-oriented architecture (SOA) approach:
```
    Discoverable
    Addressable
    Trustworthy and truthful
    Self-describing semantics and syntax
    Inter-operable and governed by global standards
    Secure and governed by a global access control
```
A reminder that Thoughtworks was highly influential in pushing microservices. This may be an elaborate mea culpa ("oops, SOA was actually more sensible") without admitting 'culpa': a rehash of SOA with a set of features (above) that look awfully like those highly elaborate SOA proposals, with XML and all sorts of metadata to 'couple' these "data products" (previously called Services).
by Sladik on 3/19/2021, 10:38:21 AM
The following link describes how Zalando implemented this concept in practice: https://databricks.com/session_na20/data-mesh-in-practice-ho...
by datameshlearn on 3/19/2021, 5:29:28 PM
If anyone wants to learn more about data mesh, we have a (vendor-independent) Slack to share ideas and insights [0]. I teamed up with Zhamak, the author, to launch it. It's still early days, but it's at 1K+ in a month, so hopefully it can really help people get the content they need to learn about it all. I also compiled a list of public user stories [1].
[0] https://launchpass.com/data-mesh-learning
[1] https://www.reddit.com/r/datamesh/comments/m6ecuz/data_mesh_...
by bsenftner on 3/19/2021, 11:52:39 AM
Just wanna say: "data lakes"? Is this a real term? The buzzwords are so thick, it's hard to see past the gushing propaganda.
by brnNbrd2rpNshrd on 3/19/2021, 7:17:28 PM
deEeEeEeeecent
by timdaub on 3/19/2021, 10:55:48 AM
TL;DR?