The Distributed Data Mesh (2019)

  • We're building a data mesh at Splitgraph [0]. We provide a unified interface to query and discover data products. You can query the data using the Postgres wire protocol at a single endpoint, with any of your existing tools. And you can discover it in the catalog, using a familiar GitHub-like interface. You can try this right now on the public website, where we federate access to 40k open datasets. Every dataset is addressable with a `namespace/repository:tag` format. The `tag` can refer to either the live data, in which case we forward the query upstream, or to a versioned snapshot of data (a "data image") that you build with declarative, Docker-like tooling. [1]
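
    As an illustration only (not Splitgraph's actual code), the `namespace/repository:tag` addressing scheme could be parsed roughly like this. The `DatasetAddress` type, the `parse_address` helper, and the `latest` default tag are all assumptions made for this sketch.

```python
from typing import NamedTuple


class DatasetAddress(NamedTuple):
    """Hypothetical container for a namespace/repository:tag address."""
    namespace: str
    repository: str
    tag: str


def parse_address(address: str, default_tag: str = "latest") -> DatasetAddress:
    """Split 'namespace/repository[:tag]' into its three parts.

    The default tag is an assumption for this sketch, not documented behavior.
    """
    path, colon, tag = address.partition(":")
    namespace, slash, repository = path.partition("/")
    if not slash or not namespace or not repository or "/" in repository:
        raise ValueError(f"expected namespace/repository[:tag], got {address!r}")
    if colon and not tag:
        raise ValueError(f"empty tag in {address!r}")
    return DatasetAddress(namespace, repository, tag if colon else default_tag)
```

    For example, `parse_address("acme/sales:v1")` yields the parts `acme`, `sales`, and `v1`, where `acme/sales` is a made-up dataset name.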

    On the enterprise side, integrating the access and discovery layers gives a lot of advantages, especially around data governance. On the web, we give users tools to connect data sources, document them, and share/audit access to them. When a query comes through the endpoint, since we're implemented as a Postgres proxy, we can rewrite/filter/drop it in accordance with rules, or we can forward it along to the upstream data source(s) and/or join across them. If you use Splitfiles to generate versioned data, we can also provide data lineage/provenance and full reproducibility.
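
    To make the rewrite/filter/drop idea concrete, here is a minimal sketch of rule-based query handling. It is not how Splitgraph implements its proxy: the `Rule` and `Query` types, the `apply_rules` function, and the table names are hypothetical, and a real proxy would work on a parsed SQL statement rather than this simplified table-plus-predicate form.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """Hypothetical governance rule attached to one table."""
    table: str
    action: str          # "drop" rejects the query, "filter" adds a predicate
    predicate: str = ""  # only used when action == "filter"


@dataclass(frozen=True)
class Query:
    """Simplified stand-in for a parsed SELECT statement."""
    table: str
    where: Optional[str] = None

    def to_sql(self) -> str:
        sql = f"SELECT * FROM {self.table}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql


def apply_rules(query: Query, rules: Sequence[Rule]) -> Optional[Query]:
    """Return the query to forward upstream, or None if a rule drops it."""
    for rule in rules:
        if rule.table != query.table:
            continue
        if rule.action == "drop":
            return None
        if rule.action == "filter":
            # AND the rule's predicate onto any existing WHERE clause.
            where = (
                f"({query.where}) AND ({rule.predicate})"
                if query.where
                else rule.predicate
            )
            query = Query(query.table, where)
    return query
```

    A query against a table with no matching rule passes through unchanged, which corresponds to the "forward it along to the upstream data source" case.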

    We've been working on this for ~3 years but are still pretty early. If anyone wants to help, we just raised a seed round and are hiring a remote team -- check my comment history for links.

    [0] https://www.splitgraph.com

    [1] https://www.splitgraph.com/docs/working-with-data/using-spli...

  • I think that the main point of the article is that a company’s data strategy should result in discrete data products aligned with business domains. A domain-oriented team should be responsible for each data product. Data infrastructure should cover universal data-processing concerns, but should not include business logic. These characteristics contrast with a centralized data lake, where a single organization is responsible for both the infrastructure and content of the data resource.

  • I can't distinguish between what is described and a service-oriented architecture (SOA) approach:

        Discoverable
        Addressable
        Trustworthy and truthful
        Self-describing semantics and syntax
        Inter-operable and governed by global standards
        Secure and governed by a global access control
    
    A reminder that Thoughtworks was highly influential in pushing microservices. This may be an elaborate mea culpa ("oops, SOA was actually more sensible") without admitting 'culpa', rehashing SOA with a set of features (above) that look awfully like those highly elaborate SOA proposals with XML and all sorts of metadata to 'couple' these "data products" (previously called services).

  • At the following link you can learn how Zalando implemented this concept in the real world: https://databricks.com/session_na20/data-mesh-in-practice-ho...

  • If any folks want to learn more about data mesh, we have a (vendor-independent) Slack to share ideas and insights. [0] I teamed up with Zhamak, the author, to launch it. It's still early days, but it's at 1K+ members in a month, so hopefully it can really help people get the content they need to learn about it all. I also compiled a list of public user stories. [1]

    [0] https://launchpass.com/data-mesh-learning

    [1] https://www.reddit.com/r/datamesh/comments/m6ecuz/data_mesh_...

  • Just wanna say: "data lakes"? Is this a real term? The buzzwords are so thick, it's hard to see past the gushing propaganda.

  • deEeEeEeeecent

  • TL;DR?