Using Parquet's Bloom Filters

  • One thing I have wondered: would it make sense to reduce file size? The general advice I’ve seen is to keep files at around 250 MB to 1 GB, but if you’re leaning heavily on bloom filters, it feels like it could make sense to use smaller files, so that less data has to be read each time a per-file filter reports a possible match.

  • With large datasets, wouldn't partitioning the data on low-cardinality columns give the same benefit without the space overhead?
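
To put rough numbers on both questions, here is a back-of-the-envelope sketch in Python. It uses the textbook Bloom filter sizing formula, which is only an approximation of Parquet's split-block filters, and it assumes a point lookup where exactly one unit (file or row group) really contains the value. The function names and parameters are made up for illustration and are not part of any Parquet library.

```python
import math


def classic_bloom_bits(ndv: int, fpp: float) -> int:
    """Bits a textbook Bloom filter needs for `ndv` distinct values at a
    target false-positive probability `fpp`.

    Assumption: Parquet's split-block filters are sized differently, so
    treat this as an order-of-magnitude estimate only.
    """
    if ndv <= 0 or not 0.0 < fpp < 1.0:
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def expected_scan_bytes(total_bytes: float, unit_bytes: float, fpp: float) -> float:
    """Expected bytes read for a point lookup over data split into
    equal-sized units, each guarded by its own filter.

    Assumption: exactly one unit holds the value and is always read; every
    other unit is read only on a false positive.
    """
    if not 0.0 <= fpp <= 1.0 or unit_bytes > total_bytes:
        raise ValueError("need 0 <= fpp <= 1 and unit_bytes <= total_bytes")
    return unit_bytes + fpp * (total_bytes - unit_bytes)


if __name__ == "__main__":
    # Space overhead: one million distinct values at a 1% false-positive rate.
    filter_bytes = classic_bloom_bits(1_000_000, 0.01) / 8
    print(f"filter size: {filter_bytes / 1e6:.2f} MB")

    # Unit size: same 100 GB of data, 1 GB units versus 250 MB units.
    for unit in (1e9, 250e6):
        scanned = expected_scan_bytes(100e9, unit, 0.01)
        print(f"unit {unit / 1e6:.0f} MB -> expected scan {scanned / 1e9:.2f} GB")
```

Under these assumptions the filter for a million distinct values comes to about 1.2 MB. The false-positive term in the scan estimate is nearly the same for both unit sizes, so the saving from smaller units comes almost entirely from reading less data on the one true match.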