Query planners unexpectedly "helping" can lead to very real availability and performance problems. For large-scale workloads, I really want to know what my database is doing. Having it suddenly switch which index it uses, for reasons no one can predict or explain, because some ML algorithm decided it was the right thing sounds incredibly dangerous.
One of the authors is famous for his love of the Wu-Tang Clan.
Cool to see bits of this love in his code. https://github.com/cmu-db/noisepage/blob/ed86192c8ea5c7a0754...
ML-assisted query optimization is super interesting to me (I've had to fight the uphill battle against anomalies in Postgres stats and estimations[1]), but I'd also love to see more optimization and auto-tuning across the "stack":
- Use column-oriented storage for columns that are frequently scanned in analytical queries
- Automated index management, based on user-provided boundaries for write throughput, disk footprint, query latency
- Calibrated optimism for heap updates. AFAIK, current DBs either are optimistic (assume transactions are more likely to commit than roll back), updating the heap in place and writing the old value elsewhere in case of rollback, or pessimistically write all updates as new, versioned tuples and let MVCC (and garbage collection) handle the rest. Would be interesting to see the performance improvement that could come from modeling commit/rollback outcomes and optimizing accordingly (a rough sketch of both strategies follows below).
- Using variable-sized pages to reduce overhead for older pages, as a sort of light cold storage
Anyone know of any DBs that automatically tune these aspects?
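To make the "calibrated optimism" point concrete, here's a rough Python sketch of the two update strategies the bullet above contrasts, plus a toy chooser based on observed commit rate. All names here are made up for illustration; this is not how NoisePage or any production engine actually implements its heap.

```python
class OptimisticHeap:
    """Update in place; keep the old value in an undo slot for rollback."""
    def __init__(self):
        self.rows = {}     # key -> value
        self.undo = {}     # key -> old value, kept only until commit/rollback

    def update(self, key, value):
        self.undo[key] = self.rows.get(key)
        self.rows[key] = value

    def commit(self):
        self.undo.clear()                       # cheap: just drop the undo data

    def rollback(self):
        for key, old in self.undo.items():      # expensive: restore old values
            if old is None:
                self.rows.pop(key, None)
            else:
                self.rows[key] = old
        self.undo.clear()


class VersionedHeap:
    """Write every update as a new versioned tuple; MVCC/GC cleans up later."""
    def __init__(self):
        self.versions = {}  # key -> list of (txn_id, value), newest last

    def update(self, key, value, txn_id):
        self.versions.setdefault(key, []).append((txn_id, value))

    def commit(self, txn_id):
        pass                                    # new versions are already in place

    def rollback(self, txn_id):
        for chain in self.versions.values():    # discard only this txn's versions
            chain[:] = [(t, v) for t, v in chain if t != txn_id]


def choose_strategy(commits, rollbacks, threshold=0.95):
    """Toy 'calibrated optimism': prefer in-place updates only if commits dominate."""
    total = commits + rollbacks
    rate = commits / total if total else 1.0
    return "in-place + undo" if rate >= threshold else "versioned tuples"
```

The interesting part would be replacing that fixed threshold with a learned model of commit/rollback outcomes per table or per workload, which is exactly the kind of knob a self-driving DBMS could tune.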
I love this project. This is part of my long term goal for my main product [1] which is all about making databases easier to work with for everyone while abstracting away SQL.
The use of ML/RL is exactly the method I have in mind for the long term. Since I am not from that background, my immediate goals are more like a set of best practices and migration management with a Python layer (SQLAlchemy, sqlalchemy-migrate, and automation on top).
Somewhat related is Pavlo's musings on Naming a Database Management System [1]:
"In my opinion, the best DBMS names from the last thirty years are Postgres[4] and Clickhouse.
These two names did not mean anything before, but now they only have one connotation. There is no ambiguity. There is no overlap with other DBMS names or non-database entities. They are easy to spell correctly[5]. Everyone refers to them by their full name. Nobody mistakenly calls them "PostgresDB" or "ClickhouseSQL."
After reflecting on what makes them so good, I realized what the secret was to them. They are a two-syllable name that is derived from combining two unrelated one-syllable words together (e.g., Post + Gres, Click + House). Each individual word has its own meaning. It is only when you put them together does it mean the database.
Given this, I henceforth contend that the best way to name a DBMS is to combine two one-syllable words."
Noise + Page
[1]: https://www.cs.cmu.edu/~pavlo/blog/2020/03/on-naming-a-datab...
Let me first say sorry, I know it's a bit unrelated... but it seems like the right people are here. Is there any kind of open source database that is close to Amazon's QLDB? Ideally as a PostgreSQL plugin.
Also: is there any "distributed database system" that allows synchronizing, for example, some tables (or even specific fields) of a server-side database with a third-party client (which might be SQLite, while the server side is PostgreSQL), so that only part of the database is synchronized, in a secure way? This "slave" should not be trusted; think of it as a web user.
Looks like Peloton finally got too popular for Andy Pavlo to keep using their name for the project.
Sort of a shame, I thought the old name was a more apt analogy for the goals of the system.
This is really cool.
Does anybody know how you implement multiversion concurrency control?
I'm trying to implement it in a toy project that simulates bank accounts withdrawing and depositing money in a multithreaded scenario where no money should be lost or created.
I found that the algorithm only worked when I considered write timestamps alone: if a tuple has a write timestamp greater than this transaction's commit timestamp, then abort and retry.
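Not the NoisePage approach specifically, but here is a minimal sketch of one way to do it for the bank-account case: optimistic MVCC with per-account version chains and a write-timestamp conflict check at commit, which is essentially the rule you describe. All class and function names are invented for illustration.

```python
import itertools
import threading
from dataclasses import dataclass

_ts = itertools.count(1)          # global monotonically increasing timestamp source
_ts_lock = threading.Lock()

def next_ts():
    with _ts_lock:
        return next(_ts)

@dataclass
class Version:
    write_ts: int                  # timestamp of the transaction that wrote this version
    balance: int

class Account:
    def __init__(self, balance):
        self.lock = threading.Lock()
        self.versions = [Version(write_ts=0, balance=balance)]   # newest last

    def snapshot(self, ts):
        """Return the newest version written at or before ts (snapshot read)."""
        with self.lock:
            return max((v for v in self.versions if v.write_ts <= ts),
                       key=lambda v: v.write_ts)

class Transaction:
    def __init__(self):
        self.start_ts = next_ts()
        self.writes = {}           # Account -> new balance, buffered until commit

    def read(self, acct):
        if acct in self.writes:
            return self.writes[acct]
        return acct.snapshot(self.start_ts).balance

    def write(self, acct, balance):
        self.writes[acct] = balance

    def commit(self):
        """Install new versions; abort on any write-write conflict."""
        accts = sorted(self.writes, key=id)        # fixed lock order avoids deadlock
        for a in accts:
            a.lock.acquire()
        try:
            # Abort if someone committed a newer version of anything we wrote.
            for a in accts:
                if max(v.write_ts for v in a.versions) > self.start_ts:
                    return False
            commit_ts = next_ts()
            for a, balance in self.writes.items():
                a.versions.append(Version(commit_ts, balance))
            return True
        finally:
            for a in accts:
                a.lock.release()

def transfer(src, dst, amount):
    """Retry until the transfer commits; total money is conserved."""
    while True:
        t = Transaction()
        t.write(src, t.read(src) - amount)
        t.write(dst, t.read(dst) + amount)
        if t.commit():
            return
```

Because the commit check rejects any transaction whose written accounts gained a newer version after its start timestamp, lost updates are impossible: concurrent transfers either see each other's results or retry, so the total balance stays constant.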
This is using several buzzwords some database vendors are already using to promote their fully managed solutions, yet the "autonomy" of this DBMS appears to be limited to SQL tuning.
I don't even see references to high-availability configuration, or to how this would withstand transactional load and be as resilient as more mature engines.
Would it be possible to apply some (most) of these techniques to other RDBMSs like PostgreSQL and MySQL? Maybe some of the (non-private, non-sensitive) information about queries/stats could be sent back to a centralized server and ingested by the ML model, making it even "smarter".
They have to limit the scope somewhere.
It seems like "Self-Driving" to the authors means simply that it has a query optimizer using ML.
I would counter that idea by suggesting that a more foolproof database would also incorporate truly solid auto-scaling and global replication, so that one can just throw immense amounts of data into tables while retaining fast CRUD operations worldwide, without any DBA support required. A little bit like the direction CockroachDB and TiDB are headed, perhaps.