Replibyte – Seed your database with real data

  • Trying to work out how to anonymise datetimes hurts my head. You might want to randomise the date of an event, but the random date also needs to stay consistent with both the current time and the order of other related rows in the database.
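
    One possible way around this, sketched below, is to shift dates instead of randomising them: derive one fixed offset per entity and apply it to every timestamp belonging to that entity. This is not something Replibyte is confirmed to do; `shift_for`, `anonymise_events`, `MAX_SHIFT_DAYS` and the secret are hypothetical names made up for illustration. The shift only goes backwards, so a past date can never end up in the future, and order and gaps are preserved within one entity (but not across entities).

```python
import hashlib
from datetime import datetime, timedelta

MAX_SHIFT_DAYS = 365  # hypothetical upper bound on how far dates are moved


def shift_for(entity_key: str, secret: str = "hypothetical-secret") -> timedelta:
    """Derive a stable backwards offset (1..MAX_SHIFT_DAYS days) for one entity."""
    digest = hashlib.sha256((secret + entity_key).encode("utf-8")).digest()
    days = int.from_bytes(digest[:4], "big") % MAX_SHIFT_DAYS + 1
    # Negative offset: past timestamps stay in the past relative to "now".
    return timedelta(days=-days)


def anonymise_events(entity_key: str, timestamps: list[datetime]) -> list[datetime]:
    """Apply the same offset to every timestamp of the entity."""
    delta = shift_for(entity_key)
    return [ts + delta for ts in timestamps]


# Usage: three related rows belonging to the same (made-up) user.
events = [
    datetime(2022, 1, 1, 9, 0),
    datetime(2022, 1, 1, 9, 30),
    datetime(2022, 3, 5, 12, 0),
]
shifted = anonymise_events("user-42", events)
```

    The intervals between events are unchanged, which may itself leak information, so whether this is acceptable depends on the data.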

  • How does it keep personal data safe? I had a look at “how it works” and the FAQs, but they don’t explain how you keep the data safe. And it gets uploaded to S3?

    I might have missed it, but I need to know exactly where our PII is stored (so not on a dev laptop). How do you know what to replace, and what do you do with any info you do replace?

    Edit: To answer my own question: via transformers. But that seems to suggest each dev has to keep them up to date with any schema changes, etc.

    (Also some links are broken on GitHub)

  • One feature I’d love to see is a transformer that, instead of providing a random value, provides a cryptographic one-way hash of the data (e.g. SHA-2). That way key uniqueness stays the same (so unique constraints on columns aren’t violated), and the same value used in one place will still match the corresponding value in another table after transformation, which more accurately reflects the “shape” of the data.
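
    A minimal sketch of the idea, not an existing Replibyte transformer: the function name `pseudonymise`, the key and the sample rows are all hypothetical. It uses a keyed HMAC-SHA256 instead of a bare SHA-2 digest, on the assumption that low-entropy values such as emails could otherwise be guessed by hashing candidate inputs.

```python
import hashlib
import hmac

# Hypothetical secret; it would have to be kept out of the sanitized dump.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic one-way pseudonym (HMAC-SHA256 hex digest)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up rows from two tables that reference the same email.
users = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": "bob@example.com"},
]
orders = [{"id": 10, "customer_email": "alice@example.com"}]

anon_users = [{**u, "email": pseudonymise(u["email"])} for u in users]
anon_orders = [
    {**o, "customer_email": pseudonymise(o["customer_email"])} for o in orders
]
```

    Equal inputs map to equal outputs, so the order row still joins to the right user, and distinct inputs give distinct digests for all practical purposes, so a unique column stays unique.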

  • I recommend checking out clickhouse-obfuscator. It's a more sophisticated tool for dataset obfuscation.

    Installation (single binary Linux/Mac/FreeBSD):

    curl https://clickhouse.com/ | sh

    ./clickhouse obfuscator --help

    Docs: https://clickhouse.com/docs/en/operations/utilities/clickhou...

  • The default seems to be to store the sanitized dump on S3.

    S3 isn’t always available in a professional context, and uploading there might be considered data extraction.

    Keeping everything local and detailing exactly what goes where and how would be helpful.

  • I think the description in the man entry is better than the one in the README. Other than that, cool tool!