Hacker News

Pg_lakehouse: Query Any Data Lake from Postgres

by landingunless on 5/13/2024, 1:29:17 PM with 19 comments

by nathanwallace on 5/13/2024, 8:57:19 PM
Readers may also enjoy Steampipe [1], an open source tool to live query 140+ services with SQL (e.g. AWS, GitHub, CSV, Kubernetes, etc). It uses Postgres Foreign Data Wrappers under the hood and supports joins etc with other tables. (Disclaimer - I'm a lead on the project.)
1 - https://github.com/turbot/steampipe
by whalesalad on 5/14/2024, 2:53:43 AM
How many folks here struggle to adopt tooling like this because it isn’t possible to add psql extensions to places like RDS?
by arduanika on 5/13/2024, 9:21:45 PM
The name seems to be an allusion to the author P.G. Wodehouse, creator of the character Jeeves.
https://en.wikipedia.org/wiki/P._G._Wodehouse
Very clever naming!
by ahachete on 5/14/2024, 9:20:13 AM
The (internal) use of DataFusion to create new, powerful extensions for Postgres is a very clever idea. Very good work by the ParadeDB team.
I like this one very much. It's a very simple way to avoid having to use a different set of tools and query languages (or more limited query languages) to query lakes.
by kiwicopple on 5/13/2024, 7:38:06 PM
Neat that you plan to support both Delta Lake and Apache Iceberg.
I'm curious about HN's position on these two formats. I'm having a hard time deciphering which might be the industry winner (or perhaps they both have a place, no "winner" necessary).
by tehlike on 5/13/2024, 7:29:23 PM
ParadeDB is doing a lot of good work with Postgres. pg_analytics, and now pg_lakehouse...
by jeadie on 5/13/2024, 9:45:45 PM
This looks functionally similar to using http://github.com/spiceai/spiceai with a PostgreSQL data accelerator.
by yrashk on 5/13/2024, 7:36:21 PM
As somebody who writes a lot of Postgres extensions, I can say this is quite interesting!
I think I can see some parallels to Supabase's wrappers project.
Keep up the good work!
by mcdonje on 5/13/2024, 8:15:13 PM
Looks like pg as a replacement for Databricks SQL, which is already a query engine for data lakes. It's not a lakehouse, but it calls itself one. Seems like a cool and useful project, but the name is problematic.
by nikita on 5/14/2024, 1:16:57 AM
I have another question. So far on the ClickBench leaderboard it's 15x slower than the baseline. The number 1 place is 1.67x slower than the baseline.
I assume that's DataFusion speed. What's the plan to improve upon it?
by nikita on 5/13/2024, 9:01:52 PM
This is great work! Could you please comment on your choice of license? Most Postgres extensions that achieve wide adoption use the Postgres, MIT, or Apache license.
by mustafabal on 5/14/2024, 2:03:03 AM
Very nice addition! Do you plan to support Snowflake as an object store in the near future? It's not currently in pg_lakehouse's README.
by tarasglek on 5/14/2024, 6:01:02 AM
I am not up to date on the various lakes. Is this read-only? Are you able to init a lake from scratch?
What's the model for feeding such a lake from some queue?
by epsilonic on 5/13/2024, 11:13:31 PM
How does this compare to Hydra? https://www.hydra.so/
by sdairs on 5/13/2024, 8:04:38 PM
Very cool!
Could you share the key difference between this and the previous pg_analytics, and the motivation for making it a separate plugin?
by samber on 5/13/2024, 8:22:25 PM
It seems very promising!
2 questions:
- Do you distribute query processing over multiple pg nodes?
- Do you store the metadata in PG, instead of a traditional metastore?
by brunoqc on 5/13/2024, 7:46:16 PM
Nice. I wish TimescaleDB open-sourced their S3 storage thing.
by hardwaresofton on 5/14/2024, 3:02:01 AM
Yet another amazing Postgres plugin made possible by pgrx (https://github.com/pgcentralfoundation/pgrx).
It's really crazy how some projects just instantly enable a whole generation of new possibilities.
If you are impressed by this and want to build something like it -- check out pgrx, it's a pretty great experience.
by q9tE6uHb7yKq on 5/14/2024, 1:34:20 AM
looks interesting!