Launch HN: Datafold (YC S20) – Diff Tool for SQL Databases

  • I was curious about pricing, but I see it's "call me pricing" with buttons to schedule a demo, so at least I can see this is squarely aimed at the enterprise. If I'm being honest, I don't like seeing "call me pricing" on HN; there are no rules against it, but it just doesn't feel right on HN.

    Are you able to say anything about pricing here?

  • I see a lot of these things and I don't understand them. I've done too much ETL so I'm not naive. Now either 1) people are making a mountain out of a molehill (not saying that's happening here, but in other cases I think so) 2) there's something my experience of ETL hasn't taught me or 3) these tools are specialised for niches. This one talks about 'large datasets' but I don't know how large that is.

    Some questions then

    > Often it’s important to see how data changes between every iteration, and particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless.

    select is fine for diffing. You just do an either-way except, something like

      (
      select f1, f2, f3 ... f100
      from t1
      except
      select f1, f2, f3 ... f100
      from t2
      )
      union 
      (
      select f1, f2, f3 ... f100
      from t2
      except
      select f1, f2, f3 ... f100
      from t1
      )
    
    I've used this and it's fine on many rows (millions is fine, but I do recommend an index and a DB with a halfway decent optimiser).
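
    For anyone who wants to try the either-way except without a server, here is a minimal sketch using Python's built-in sqlite3 module. The table names t1/t2, the three columns, and the helper symmetric_diff are made up for illustration. SQLite does not accept the parenthesised form above, so each half is wrapped in a subquery instead; other engines may take the original query as-is.

```python
import sqlite3

def symmetric_diff(conn, left, right, columns):
    """Rows present in only one of the two tables, tagged with the table they came from."""
    cols = ", ".join(columns)
    # Table and column names are interpolated directly: only pass trusted identifiers.
    query = f"""
        SELECT '{left}' AS only_in, * FROM (
            SELECT {cols} FROM {left}
            EXCEPT
            SELECT {cols} FROM {right}
        )
        UNION ALL
        SELECT '{right}' AS only_in, * FROM (
            SELECT {cols} FROM {right}
            EXCEPT
            SELECT {cols} FROM {left}
        )
    """
    return conn.execute(query).fetchall()

# Hypothetical demo data: two identical 1000-row tables, then one row changed in t2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (f1 INTEGER, f2 TEXT, f3 REAL)")
conn.execute("CREATE TABLE t2 (f1 INTEGER, f2 TEXT, f3 REAL)")
rows = [(i, f"name{i}", i * 1.5) for i in range(1000)]
conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
conn.executemany("INSERT INTO t2 VALUES (?, ?, ?)", rows)
conn.execute("UPDATE t2 SET f2 = 'changed' WHERE f1 = 500")

diff = symmetric_diff(conn, "t1", "t2", ["f1", "f2", "f3"])
for row in sorted(diff):
    print(row)
```

    One caveat with this approach: EXCEPT works on sets, so if the two tables differ only in how many times an identical row is duplicated, this query will not report it.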

    > (2)

    Interesting. OK.

    > (3) Data transfer validation: moving large volumes of data between databases is error-prone

    Really? I never had a problem. What is 'large'? What problems have you seen? There are easy solutions with checksums, error correction (comes free with networks) or round-tripping. Is that a problem?
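
    To make the checksum idea concrete, here is a rough sketch, again with Python's sqlite3 and hashlib. The events table, its columns, and the helper table_checksum are hypothetical. It hashes rows in key order and compares row count plus digest on both sides. It assumes both sides return values with identical Python types and formatting; across different database engines you would likely need to normalise values before hashing.

```python
import hashlib
import sqlite3

def table_checksum(conn, table, columns, key):
    """Return (row_count, sha256 hex digest) over the table's rows, read in key order."""
    cols = ", ".join(columns)
    digest = hashlib.sha256()
    count = 0
    # Identifiers are interpolated directly: only pass trusted names.
    for row in conn.execute(f"SELECT {cols} FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()

# Hypothetical source and target databases holding the same 100 rows.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
rows = [(i, f"payload-{i}") for i in range(100)]
for db in (src, dst):
    db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    db.executemany("INSERT INTO events VALUES (?, ?)", rows)

args = ("events", ["id", "payload"], "id")
matches_before = table_checksum(src, *args) == table_checksum(dst, *args)

# Simulate a transfer error in a single row on the target side.
dst.execute("UPDATE events SET payload = 'corrupted' WHERE id = 42")
matches_after = table_checksum(src, *args) == table_checksum(dst, *args)

print(matches_before, matches_after)
```

    A checksum like this only tells you that something differs, not which row; for that you still need a row-level diff.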

    Edit: just did that with MSSQL tables, 8 cols, 38 bytes per row, ~776,000 rows (identical but for one row). The diff as above takes 2 seconds without an index (with a PK it takes 5 seconds. Sigh. Well done MS). The single-row discrepancy shows up fine. Totally trivial to extend it to 100 columns (did that too in a previous job).

  • Hey! Looks great! Is there an example of the GitHub integration? What does it look like?

    I'm one of the developers and maintainers of the DVC project, and we recently released CML.dev, which integrates with GitHub and can be used to run some checks on data as well. But in our case it's more or less about analyzing files. I'm curious what that integration looks like in your case.

  • It is always good to see new approaches to testing but I don't see how this one is going to work. I've worked at multiple database companies. Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

    Diffs are relatively slow, when they fail you get a blizzard of errors, and the oracles (i.e. the "good" output) have to be updated constantly as the product changes. Plus I don't see how this helps with schema migration or performance issues, which are major problems in data management. And don't get me started on handling things like dates, which change constantly, hence break diffs.

    If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data. They can run blindingly fast and give you actionable data about regressions. They're also a pain to code but are most productive in the long run.
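
    As a rough illustration of what such focused checks might look like (not a prescription), here is a sketch using Python's sqlite3. The orders table, its columns, and the three predicates are invented for the example. Each check is a query that counts rows violating one specific rule, so a failure points straight at the broken invariant.

```python
import sqlite3

# Each check counts rows that violate one predicate; zero means the check passes.
CHECKS = {
    "no_null_ids": "SELECT COUNT(*) FROM orders WHERE id IS NULL",
    "no_negative_amounts": "SELECT COUNT(*) FROM orders WHERE amount < 0",
    "unique_ids": (
        "SELECT COUNT(*) FROM "
        "(SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1)"
    ),
}

def run_checks(conn, checks):
    """Return {check_name: violation_count} for every check that fails."""
    failures = {}
    for name, sql in checks.items():
        (violations,) = conn.execute(sql).fetchone()
        if violations:
            failures[name] = violations
    return failures

# Hypothetical data with one negative amount and one duplicated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, -5.0), (2, 7.5)],
)

failures = run_checks(conn, CHECKS)
print(failures)
```

    Unlike a full diff, there is no stored "good" output to keep up to date here; the predicates only change when the rules about the data change.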

  • At the moment I use https://github.com/djrobstep/migra to make PostgreSQL diffs.

  • 1) Will it allow me to diff AWS RDS Aurora/MySQL serverA database schemaA against AWS RDS Aurora/MySQL serverB database schemaB? 2) Are there APIs to initiate and view/parse the diffs that you generate, or is it all through the UI?

  • To whom it may concern, we have written a paper on the subject (I'm not affiliated with Datafold): https://www.researchgate.net/publication/323563341_A_Time-co...

    The paper describes the original algorithm with examples.

  • Much needed! Analysts struggle with this all the time - trying to explain why an analysis is off and having to manually debug every column in a new database

  • In case you're unaware, your logo unfortunately looks uncannily similar to that of https://www.sigmacomputing.com/, and you are both in a broadly similar product category! At first I actually thought your logo throughout the site was a reference to an integration with Sigma.

  • I got the chance to play with Datafold and I would have loved to have had it when I was working for Facebook on data pipelines.

  • I've recently discovered, and highly recommend, Daff [0]. It's an open source tool that can diff CSVs and SQLite database tables. The tabular diff format is fantastic.

    [0]: https://github.com/paulfitz/daff

  • Hey Gleb, congrats on the launch. This is an interesting tool.

    As the founder of a product in the space of tools for data-driven companies, I wanted to ask:

    Is your product aimed entirely at data engineers? The description seemed very technical, and the problem seemed like one that mostly very large companies would have. Did I understand correctly?

  • Can this be used locally for datasets with non-transferable PII? Thinking about this for non-profit work.

  • Great tool! I am only interested in running such a tool locally (on-prem). This avoids security/privacy issues and data transfer time/cost issues.

    A good model for me would be a 30-day free license to get it integrated into our workflows.

  • For Cassandra: https://github.com/apache/cassandra-diff

    (Designed for correctness testing at petabyte scale)

  • Can you import/work with .bcp files (Microsoft SQL Server bulk export files)? For example, diffing 2 bcp files, even if you need to import and set them up as databases again.

  • How does it work behind the scenes? Does it simply sample a portion of the data and then do the diff? What if I need 100% accuracy?

  • Congrats on the launch.

    How does it compare to enterprise ETL tools like Informatica or Talend? Is it not possible to do this within them?

  • Will there be integration with Oracle DBs?

  • Very cool, nice work! SQL Server / Azure SQL support available or on the roadmap?

  • It is interesting to see Ant Design used other than my personal projects.

  • Very cool! Does it work with MS SQL in our own DC?

  • love it, fucked up and made a startup with a solution so you'll never make that mistake again.