A major Postgres upgrade with zero downtime

  • > But we had less than a terabyte of data

    I really wonder how an in-place `pg_upgrade` of such a small amount of data would take 30+ minutes.

    My experience, from a less mission-critical situation where 5 minutes of maintenance is absolutely acceptable, is that an in-place `pg_upgrade` with `--link` of an 8 TB database takes less than a minute and will not accidentally lose data or fail to properly configure schema search paths or cause whatever other mess the article was talking about.

    I understand that 30 minutes of downtime is not acceptable. But if it's 5 minutes or less, I would seriously consider an offline upgrade using `pg_upgrade`.

    And if it takes 30 minutes to hard-link less than 1 TB of data files, you should seriously consider changing hosts because that's absolutely unacceptable performance.
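
    For anyone curious, a minimal sketch of what that looks like (paths are placeholders; the new cluster is already initdb'd and both clusters are shut down; run it once with `--check` first):

      # --link hard-links data files instead of copying them, which is why
      # it finishes in roughly the same time whether the cluster is 1 TB or 8 TB.
      pg_upgrade --link \
        --old-bindir=/usr/lib/postgresql/15/bin \
        --new-bindir=/usr/lib/postgresql/16/bin \
        --old-datadir=/var/lib/postgresql/15/main \
        --new-datadir=/var/lib/postgresql/16/main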

  • This is impressive! I know others are questioning the "no downtime" bit, but that is why service level objectives exist -- because it really depends on the customer experience.

    If you managed to have a cutover with no noticeable dip in business metrics (aka the users didn't notice) then I'd call that a no-downtime upgrade!

    Very clever on the improvement over Lyft's methods. Thanks for the writeup. Now maybe someone can get it down from 3 seconds of pausing. :)

  • Logical replication is a great tool, and we are using it for our next DB upgrade coming up in the next few months. It just has a few limitations I wish they would address, especially since logical replication is quickly becoming one of the more popular ways to upgrade databases with minimal downtime.

    I understand that logical replication handles partial replication and the data warehouse use case, and I understand WHY many of the design choices behind it are there.

    I just wish that there was a flag you could set in a subscription that would enable it to be a complete 1:1 copy of the database, including DDL, sequences, etc., without having to do all the workarounds that are necessary now.

    Currently if a dev adds a column to a table on the publisher, and forgets to do it on the subscriber, there is no error, no indication, until it actually tries to replicate data for that table.

    Then you have to be monitoring for that error, and go figure out what other new tables, columns, etc., might have been added to the publisher that are missed on the subscriber. It's a huge opportunity for problems.
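
    One crude way to catch that before it bites is to periodically diff the schemas yourself; a rough sketch, with placeholder connection strings:

      # Hypothetical drift check: compare column lists on publisher vs subscriber.
      QUERY="SELECT table_schema, table_name, column_name, data_type
             FROM information_schema.columns
             WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
             ORDER BY 1, 2, 3"
      diff <(psql "$PUBLISHER_URL" -At -c "$QUERY") \
           <(psql "$SUBSCRIBER_URL" -At -c "$QUERY")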

    Pause all writes > let 16 catch up > resume writes on 16

    Isn’t that….. downtime? Unless you mean downtime to be only when reads are also not available.

    Zero-downtime Postgres upgrades have been kind of normalized, at least in the environments I have been exposed to, with pgcat:

      https://github.com/postgresml/pgcat

  • They say the "stop the world" approach that causes more downtime is

      Turn off all writes.
      Wait for 16 to catch up
      Enable writes again — this time they all go to 16
    
    and instead they used a better algorithm:

      Pause all writes.
      Wait for 16 to catch up. 
      Resume writes on 16.
    
    These seem pretty similar.

    1. What is the difference in the algorithm? Is it just that in the "stop the world" approach the client sees their txns fail until "wait for 16 to catch up" is done, whereas in the latter approach the client never sees their txns fail and just gets a bit more latency?

    2. Why does the second approach result in less downtime?

  • Awesome!

    The best we’ve achieved is 0 downtime for read operations and less than 1 minute downtime for write ops [1]

    Achieving 0 downtime for write ops is super hard!

    [1] https://www.listennotes.com/blog/a-practical-way-to-upgrade-...

  • I can't believe they took the time to make such an amazing write-up. With formatting and everything. Normally I just grab whatever broken crayon is closest and scribble on the back of an overdue bill "don't fsck up next time"

    The title is pretty misleading. They're not even running Postgres, but AWS Aurora, which is Postgres-compatible but is not Postgres.

    Also, pausing queries does count as downtime. The system was unavailable for that period of time.

  • I have to wonder – are they using a connection pooler? I'm leaning towards no, since what they did in code can be natively done with PgBouncer, PgCat, et al. That would also explain the last footnote:

    > The big bottleneck is all the active connections

    For anyone who is unaware, Postgres (and Aurora's Postgres-compatible flavor, which sucks but has a great marketing team) uses a process per connection, unlike MySQL (and others, I think), which uses a thread per connection. This is inevitably the bottleneck at scale, long before anything else.

    I did feel for them here:

    > We couldn’t create a blue-green deployment when the master DB had active replication slots. The AWS docs did not mention this. [emphasis mine]

    The docs also used to explicitly say that you could run limited DDL, like creating or dropping indices, on the Green DB. I found this to be untrue in practice, notified them, and I see they've since updated their docs. A painful problem to discover, though, especially with a huge DB where creating the B/G took a long time in the first place.
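
    To the pooler point above: with PgBouncer the same pause-and-cutover can be driven from the admin console. A rough sketch (host, port, admin user, and database name are all placeholders):

      # In-flight queries finish, new ones queue at the pooler.
      psql -h pooler-host -p 6432 -U admin -d pgbouncer -c "PAUSE mydb;"
      # ...repoint mydb at the new primary in pgbouncer.ini, then...
      psql -h pooler-host -p 6432 -U admin -d pgbouncer -c "RELOAD;"
      # Let the queued clients through again.
      psql -h pooler-host -p 6432 -U admin -d pgbouncer -c "RESUME mydb;"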

    I wonder why they didn't use the synchronous_commit option. That would eliminate replication lag and allow for truly zero downtime.
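
    For anyone wanting to try it with logical replication: the subscriber's walsender reports the subscription name as its application_name by default, so you can list it in synchronous_standby_names on the publisher. A rough sketch (subscription name is a placeholder; the obvious trade-off is that commits on the publisher hang if the subscriber stalls or disconnects):

      # On the publisher; the subscription's own synchronous_commit option
      # also needs to match the guarantees you want.
      psql "$PUBLISHER_URL" \
        -c "ALTER SYSTEM SET synchronous_standby_names = 'pg16_sub'" \
        -c "ALTER SYSTEM SET synchronous_commit = 'remote_apply'" \
        -c "SELECT pg_reload_conf()"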

  • > The next few hours was frustrating: we would change a setting, start again, wait 30 minutes, and invariably end up with the same error.

    Sounds about right for cloud services.

    I haven't tried it, but in another post recently someone mentioned enabling synchronous replication after the clone is up to date:

    https://www.postgresql.org/docs/current/warm-standby.html#SY...

  • When I see instantdb in the domain on here I always know it's going to be a good read :)

    Nicely done!

  • > all Instant databases are hosted under one Aurora Postgres instance

    Doesn't that increase the chances of one bad tenant taking the database down for all other tenants?