While it is fun to see how to creatively solve such issues, it does raise the question of manageability. When sharding data into loosely coupled (FDW) silos, it becomes tricky to make consistent backups, ensure locking mechanisms work when sharded data is sometimes directly related, handle zone/region failures gracefully, prevent hot spots, perform multi-region schema changes reliably, etc. I suppose this pattern really only works when the data is in fact not strongly related and the silos are quite independent. I wouldn't call that a distributed system at all, really. That may be a matter of opinion, of course.
It gives me a "when all you have is a hammer..." vibe and raises the question: why not use a system that's designed for use cases like this and do it reliably and securely? i.e.: https://www.cockroachlabs.com/docs/stable/multiregion-overvi... (yes, I know full data domiciling requires something even stricter, but I currently don't know of any system that can transparently span the globe and stay performant while not sharing any metadata or caching between regions)
Ever since the stored procedure wars, I've been very reluctant to put any more logic than absolutely necessary into the database. The simple truth of the matter is that DB server codebases are awful, complicated places to develop this kind of functionality. I strongly believe these features should be handled at higher layers.
At my last job, in fintech, we used application-level shard routing, with each aggregate root served by a particular RPC service (users, accounts, transactions, etc.). When one of these aggregate services was asked to pull data, it would query a central routing service for the server and shard where the requested data resided.
Between them and the DB servers we had a cluster of pgbouncer instances to handle connection oversubscription and the routing of queries around DB instances in the face of failures or maintenance.
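Roughly, the lookup flow looked like the sketch below (hypothetical names and a stand-in routing client, not our actual code; in practice the connection pointed at the pgbouncer tier rather than the DB host directly):

    # Hypothetical sketch: an aggregate service asks a central routing
    # service which server/shard holds an entity, then queries that
    # shard (through the pgbouncer tier in practice).
    import psycopg2  # assumed Postgres client library

    def locate(routing_client, entity_type, entity_id):
        # e.g. returns {"host": "pg-accounts-03", "db": "shard_07"}
        return routing_client.lookup(entity_type, entity_id)

    def fetch_account(routing_client, account_id):
        loc = locate(routing_client, "account", account_id)
        conn = psycopg2.connect(host=loc["host"], dbname=loc["db"])
        with conn, conn.cursor() as cur:
            cur.execute("SELECT * FROM accounts WHERE id = %s", (account_id,))
            return cur.fetchone()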
While this was pretty effective, the work of managing sharded Postgres, DDL updates, caching, and the locating and balancing of data was still very complicated and error-prone, and it was the root of many production outages.
I didn't design this system, but after a couple of years leading a platform migration, which required getting into all of this at a deep level, I would do it differently. Instead of query routing and DB sharding, I would shard the aggregate root services themselves. Each "sharded" set of these instances would have a simple PG instance (with a replica chain) behind it that knows nothing about any other set.
At that point, instead of the aggregate root services locating and querying data from DB shards on huge vertically scaled servers, each set only pulls data from its dedicated DB. A routing facade sits in front of the set of sets and sends each consumer request to the one holding the desired data.
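Something like the following, kept deliberately small (a hypothetical sketch; assume a tenant ID is the shard key and set addresses come from static config):

    # Hypothetical routing facade: each "set" is an aggregate-root
    # service with its own small Postgres behind it. The facade only
    # maps a shard key to a set; it never talks to the databases.
    SETS = {
        0: "https://accounts-set-0.internal",
        1: "https://accounts-set-1.internal",
        2: "https://accounts-set-2.internal",
    }

    def set_for(tenant_id: int) -> str:
        # Simple modulo placement for illustration; a real facade would
        # use a lookup table or consistent hashing so sets can be split
        # or moved without rehashing everything.
        return SETS[tenant_id % len(SETS)]

    def route(tenant_id: int, path: str) -> str:
        # Forward the request to whichever set owns this tenant.
        return f"{set_for(tenant_id)}{path}"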
With this architecture, the complexity of sharding and vertically scaling at the DB layer, and of handling connection oversubscription with a query router like pgbouncer, just falls away.
I would keep these sets, and their DBs, rather small to also reap the benefits of small datasets. One of the biggest issues we had with huge DBs was the time it took to do anything; for example, restarting replication in a failed chain could take many hours because of the sheer amount of data, and any hiccup in connectivity would quickly overrun our ability to catch replication up, so we'd have to fall back to zfs send to re-snapshot.
A larger number of smaller DBs would not improve the total time needed to do something like backups or DDL mutations, but it would significantly reduce the time for any particular instance which reduces overall risk and blast radius of a DB failure.
Another thing I think small DBs can help with is data locality, DB automation, and possibly making the instances effectively ephemeral. When your dataset is small, bringing an instance up from a hot backup can take on the order of a few seconds, and could allow you to schedule the root services and their DB on the same host.
For geographical distribution, the routing facade can also send writes to the set serving that shard in some other region.
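Concretely, the facade's placement map could simply carry a region alongside each set's address (again hypothetical):

    # Hypothetical region-aware placement: the facade forwards writes
    # for a shard to the set that owns it, in whatever region that set
    # lives, without the per-set databases knowing about each other.
    PLACEMENT = {
        "shard-07": {"region": "eu-west-1", "addr": "https://set-07.eu-west-1.internal"},
        "shard-42": {"region": "us-east-1", "addr": "https://set-42.us-east-1.internal"},
    }

    def route_write(shard_id: str, path: str) -> str:
        owner = PLACEMENT[shard_id]
        return f"{owner['addr']}{path}"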
(I work at Notion, one of the larger Notion clones)
We experimented with using partitioned tables and doing the fancy postgres_fdw thing when we sharded our tenant-clustered tables in 2020. Even with all the Postgres instances in the same region, we found the approach unwieldy. Routing queries in the application gives you a lot more control, versus needing to teach Postgres the exact logic to route a query, plus doing a (dangerous? one-way? global?) schema migration whenever you want to change the routing logic.
We touched on this briefly in our sharding blog post here: https://www.notion.so/blog/sharding-postgres-at-notion
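To make the contrast concrete, the application-side version can be as small as this (a rough sketch with illustrative numbers, not our actual code; assume the tenant ID is the shard key):

    # Hypothetical application-side router: the shard map lives in
    # application code/config, so changing routing logic is a normal
    # code deploy rather than a Postgres schema migration.
    NUM_LOGICAL_SHARDS = 480                             # illustrative value
    PG_HOSTS = [f"pg-{i}.internal" for i in range(32)]   # illustrative topology

    def logical_shard(tenant_id: int) -> int:
        return tenant_id % NUM_LOGICAL_SHARDS

    def host_for(tenant_id: int) -> str:
        # Many logical shards map onto fewer physical hosts, so shards
        # can be rebalanced later by changing only this mapping.
        return PG_HOSTS[logical_shard(tenant_id) % len(PG_HOSTS)]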
Another reason to avoid complex routing in Postgres is risk. If something goes wrong at 3am in a load-bearing cross-Postgres-instance query, how easy will it be to mitigate if the routing is happening in Postgres (with Citus, pgzx, postgres_fdw, or whatever advanced clustering system) versus in the application? For example, what if there's a network partition between the "global" Postgres instance and the us-east-2 cluster? Maybe you're Postgres wizards and know how to handle this with a quick schema change or something in the psql CLI, but I'd bet more on my team's ability to write fault tolerance in the application than in Postgres internal logic.
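For example, application-level routing lets you degrade per shard in a way that's hard to express inside Postgres (a rough sketch with hypothetical names):

    # Hypothetical per-shard failure handling in the application: if
    # the us-east-2 cluster is unreachable, fail just those tenants
    # fast (or serve a degraded experience) instead of letting a
    # cross-instance query hang inside Postgres.
    import time

    UNHEALTHY_UNTIL: dict[str, float] = {}

    def query_shard(host: str, run_query):
        if time.time() < UNHEALTHY_UNTIL.get(host, 0):
            raise RuntimeError(f"{host} marked unhealthy, failing fast")
        try:
            return run_query(host)
        except OSError:
            # Crude 30-second circuit breaker; the caller decides
            # whether to retry elsewhere or degrade the feature.
            UNHEALTHY_UNTIL[host] = time.time() + 30
            raise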