Choosing vector database: a side-by-side comparison

  • As others have correctly pointed out, building a vector search or recommendation application requires a lot more than similarity alone. We have seen HNSW become commoditised, and the real value lies elsewhere. Just because a database has vector functionality doesn’t mean it will actually service anything beyond “hello world” semantic search applications. IMHO these have questionable value, much like the simple Q&A RAG applications that have proliferated.

    The elephant in the room with these systems is that if you are relying on machine learning models to produce the vectors, you are going to need to invest heavily in the ML components of the system. Domain-specific models are a must if you want to be a serious contender to an existing search system, and all the usual considerations still apply regarding frequent retraining and monitoring of the models. Currently this is left as an exercise for the reader - and a very large one at that.

    We (https://github.com/marqo-ai/marqo, I am a co-founder) are investing heavily in making the ML components production-worthy and in continuous learning from model feedback as part of the system. There is a lot more to think about: how you represent documents with multiple vectors, multimodality, late interaction, the interplay between embedding quality and HNSW graph quality (i.e. recall), and much more.

  • Everyone I talk to who is building some vector db based thing sooner or later realizes they also care about the features of a full-text search engine.

    They care about filtering, they care to some degree about direct lexical matches, they care about paging, getting groups / facet counts, etc.

    Vectors, IMO, are just one feature that a regular search engine should have. Currently Vespa does the best job of this, though lately it seems the Lucene-based engines (Elasticsearch and OpenSearch) are really working hard to compete.

  • Let me half hijack to ask a related question:

    I'm building a RAG for my personal use: Say I have a lot of notes on various topics I've compiled over the years. They're scattered over a lot of text files (and org nodes). I want to be able to ask questions in a natural language and have the system query my notes and give me an answer.

    The approach I'm going for is to store those notes in a vector DB. When I ask my query, a search is performed and, say, the top 5 vectors are sent to GPT for parsing (along with my query). GPT will then come back with an answer.

    I can build something like this, but I'm struggling in figuring out metrics for how good my system is. There are many variables (e.g. amount of content in a given vector, amount of overlap amongst vectors, number of vectors to send to GPT, and many more). I'd like to tweak them, but I also want some objective way to compare different setups. Right now all I do is ask a question, look at the answer, and try to subjectively gauge whether I think it did a good job.

    Any tips on how people measure the performance/effectiveness for these types of problems?
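
    One common starting point is a small hand-labelled "golden set": questions you already know the answer to, paired with the notes that contain that answer, scored with standard retrieval metrics such as recall@k and MRR. A minimal sketch (all names and the `retrieve` interface here are hypothetical; plug in your real retriever):

```python
# Sketch: evaluating retrieval quality against a hand-labelled golden set.
# Assumes a retrieve(question) -> ranked list of note ids interface.

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant notes that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant note (0 if none retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Golden set: (question, ids of the notes that actually answer it).
golden = [
    ("how do I rotate pdf pages?", {"notes/pdf.org"}),
    ("emacs keybinding for rectangle edit", {"notes/emacs.org"}),
]

def evaluate(retrieve, k=5):
    """Average recall@k and MRR over the golden set."""
    recalls, mrrs = [], []
    for question, relevant in golden:
        ranked = retrieve(question)
        recalls.append(recall_at_k(ranked, relevant, k))
        mrrs.append(mrr(ranked, relevant))
    return sum(recalls) / len(recalls), sum(mrrs) / len(mrrs)
```

    This only scores retrieval, not the final GPT answer, but retrieval is where most of the tunable variables (chunk size, overlap, top-k) live, so it lets you compare setups objectively before judging generation quality by hand.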

  • I'll add txtai to the list: https://github.com/neuml/txtai

    txtai is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows.

    Embeddings databases are a union of vector indexes (sparse and dense), graph networks and relational databases. This enables vector search with SQL, topic modeling and retrieval augmented generation.

    txtai adopts a local-first approach. A production-ready instance can be run locally within a single Python instance. It can also scale out when needed.

    txtai can use Faiss, Hnswlib or Annoy as its vector index backend. This is relevant in terms of the ANN-Benchmarks scores.

    Disclaimer: I am the author of txtai
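
    The "union of vector indexes and relational databases" idea can be sketched in plain Python (this is an illustration of the concept, not txtai's actual API; a real system would use Faiss/Hnswlib/Annoy instead of the brute-force scoring here):

```python
import math
import sqlite3

# Vectors live alongside relational metadata, so a search can combine
# similarity scoring with SQL filtering.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, topic TEXT, text TEXT)")
vectors = {}  # id -> embedding; stands in for a real ANN index

rows = [(1, "db", "postgres tuning notes", [1.0, 0.0]),
        (2, "ml", "embedding model eval", [0.0, 1.0]),
        (3, "db", "vector index benchmarks", [0.7, 0.7])]
for doc_id, topic, text, vec in rows:
    db.execute("INSERT INTO docs VALUES (?, ?, ?)", (doc_id, topic, text))
    vectors[doc_id] = vec

def search(query_vec, topic, k=2):
    # SQL narrows the candidate set, then vector similarity ranks it.
    ids = [r[0] for r in db.execute("SELECT id FROM docs WHERE topic = ?", (topic,))]
    return sorted(ids, key=lambda i: cosine(query_vec, vectors[i]), reverse=True)[:k]
```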

  • I really appreciate comparisons like this, although I find myself wanting to know more about why certain things are listed the way they are.

    For example, pgvector is listed as not having role-based access control, but the Postgres manual dedicates an entire chapter to it: https://www.postgresql.org/docs/current/user-manag.html

    That’s why I’d be interested to know more about the supporting details for the different categories. It may help uncover some inadvertent errors in the analysis, but it would also serve as a useful jumping-off point for people doing their own research.

  • I made this table to compare vector databases in order to help me choose the best one for a new project. I spent quite a few hours on it, so I wanted to share it here too in hopes it might help others as well. My main criteria when choosing a vector DB were speed, scalability, DX, community, and price. You'll find all of the comparison parameters in the article.

  • I'd love to know how vector databases compare in their ability to do hybrid queries: vector similarity filtered by metadata values. For example, find the 100 items with the closest cosine similarity where genre = jazz and publication date is between 1990 and 2000.

    Can the vector index operate on a subset of records? Or when searching for 100 closest matches does the database have to find 1000 matches and then apply the metadata filter, and hope that doesn't reduce the result set down to zero and exclude relevant vectors?

    It seems like measuring precision and recall for hybrid queries would be illuminating.
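
    The difference between the two strategies is easy to demonstrate with a toy brute-force search (real ANN indexes make "filter first" harder, because the graph is built over all vectors; the data here is random and purely illustrative):

```python
import math
import random

random.seed(0)
items = [{"id": i,
          "genre": "jazz" if i % 10 == 0 else "rock",  # 10% jazz
          "vec": [random.random() for _ in range(8)]}
         for i in range(1000)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

query = [random.random() for _ in range(8)]

def prefilter_search(k):
    # Restrict to the matching subset first, then rank by similarity:
    # always returns min(k, subset size) results.
    subset = [it for it in items if it["genre"] == "jazz"]
    return sorted(subset, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k]

def postfilter_search(k, overfetch=3):
    # Rank everything, over-fetch, then filter: can come up short when the
    # filter is selective, which is exactly the failure mode described above.
    top = sorted(items, key=lambda it: cosine(query, it["vec"]), reverse=True)[:k * overfetch]
    return [it for it in top if it["genre"] == "jazz"][:k]
```

    With a 10%-selective filter, asking for 100 results via post-filtering with a 3x over-fetch typically yields only a few dozen, while pre-filtering returns the full 100.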

  • Curious about the lack of Vespa, especially given the thoroughness of the article and Vespa's long-time reputation. OpenSearch is also missing, but perhaps it can be lumped in with Elasticsearch since both are based on Lucene. The products are starting to diverge, though, so it would be nice to see, especially since it is open source.

    For the performance-based columns, it would also be helpful to see which versions were tested. There is so much attention on vector databases lately that they are all making great strides forward. The Lucene updates are notable.

  • What advantage do vector databases provide over using an index in conjunction with a mature database? I’m not sold on this as a separate technology.

    Vector search is useful, but I don’t understand why I would go out of my way when I could implement FAISS or HNSWlib as an adjunct to postgres or a document store.

  • Strongly disagree with PGVector's DX being worse than Chroma. Installing, configuring, and working with Chroma was infuriating -- it's alpha software and has the bugs and rough edges to prove it. The tools to support and interface with postgres are battle-tested and so much nicer by comparison; getting Chroma working took over a week, ripping it out and replacing with PGVector took a couple hours.

    Also agree with this[0] article that vector search is only one type of search, and even for RAG it isn't necessarily the one you want to start with.

    [0]: https://colinharman.substack.com/p/beware-tunnel-vision-in-a...

  • I've been wondering about Redis as vector database [0].

    [0]: https://twitter.com/sh_reya/status/1661136833848438784

  • Nice post! I think this could be a very good page to bookmark.

    There is also this series of articles detailing the options and it includes some that the OP is missing: https://thedataquarry.com/posts/vector-db-1/#key-takeaways

    I'm currently in the market for a self-hosted DB for a personal project. The project is an app you can run on your own system that provides Q&A over your text files. So I'm looking for something lightweight, but I'm also looking for the best possible search, and ANN retrieval is just a single part of that.

  • Their definition of hybrid search is, I think, wrong.

    Though these terms tend not to be consistently defined at all, so "wrong" is maybe the wrong word.

    Their definition seems to be about filtering results during (approximate) KNN vector search.

    But that is filtering, not hybrid search. It might sometimes be implemented as a form of hybrid search, but that's an internal implementation detail, and you should probably hope it's not implemented that way.

    Hybrid search is when you do both a vector search and a more classical text based search (e.g. bm25) and combine both results in a reasonable way.
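
    A minimal sketch of that combination, using Reciprocal Rank Fusion (RRF), one common way to merge a lexical and a vector result list (the k=60 constant follows the usual RRF formulation; the document ids here are made up):

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per document,
# so documents ranked highly by both searches float to the top.

def rrf(result_lists, k=60):
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]    # classical text ranking (e.g. BM25)
vector_hits = ["d1", "d9", "d3"]  # semantic / vector ranking
fused = rrf([bm25_hits, vector_hits])
```

    RRF is attractive because it needs only ranks, not scores, so it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales.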

  • This is interesting because it does not mention the vector database powered by Apache Cassandra, or the hosted serverless version, DataStax Astra. Here is a write-up we did on 5 hard problems in vector search and how we solved them: https://thenewstack.io/5-hard-problems-in-vector-search-and-...

    In full transparency: I work for DataStax and lead engineering for the vector database.

  • I don't think we need specialized databases for vectors. Relational databases can easily be extended with vector data types and operations. They will eventually catch up by supporting what was once a unique feature of the new systems: https://medium.com/@magda7817/two-things-to-keep-in-mind-bef...

  • The Postgres vector store has been the simplest, and will remain so if you are at a smaller scale. You can use it directly with something like Spring Boot.

  • Quick question regarding scalability and support for multiple vector databases under a single cloud service. Suppose an enterprise SaaS product served multiple customers, each requiring a unique RAG vector knowledge base for product and company info. Do any of these solutions allow for a large number (dozens or hundreds) of small, distinct knowledge bases? Do any offer easily integrated, automated pipelines for documents to be parsed and ingested?

  • Postgres with PGVector is the best database, plus vectors.

    All of the "Vector DBs" suffer horribly when trying to do basic things.

    Want to include any record whose field matches a value in an array of keys? Easy in SQL. Requires an entire song and dance in Pinecone or Weaviate.

    After implementing Chroma, Weaviate, Pinecone, Sqlite with HNSW indices and Qdrant-- I'm not impressed. Postgres is measurably faster since so much relies on pre-filtering, joins, etc.
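
    For illustration, a sketch of the kind of filtered KNN query that is trivial in SQL with pgvector (assuming a hypothetical `items` table with a `vector`-typed `embedding` column and pgvector's `<=>` cosine-distance operator; the string would be passed to a driver such as psycopg with the query embedding as a parameter):

```python
# Parameterized SQL for "nearest neighbours where genre is in a key array".
# Table and column names are made up for the example.

FILTERED_KNN = """
SELECT id, genre, embedding <=> %(query_vec)s AS distance
FROM items
WHERE genre = ANY(%(genres)s)          -- relational pre-filter
ORDER BY embedding <=> %(query_vec)s   -- pgvector cosine distance
LIMIT 100;
"""
```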

  • Strongly disagree about the Pinecone developer experience. Not that they don't have SDKs, but last I checked they didn't have documentation on how to approach local dev environments.

    The implication being that you spin up a separate index for $70/mo, and then you have to upsert any relevant data yourself. Sure that's not difficult, but why do you have to do it at all? Why doesn't Pinecone make it easy to replicate data to another index for use in dev/staging?

  • You might want to add https://turbopuffer.com/ as well now in the benchmarks.

  • You might like the 'Which Search Engine?' panel I ran at Buzzwords earlier this year with some of the leading contenders (Vespa, Qdrant, Elastic, Solr, Weaviate) https://www.youtube.com/watch?v=iI40L4wMtyI - vector search was obviously part of the discussion

  • Pricing for pg should be easy to compute

    20M vectors at 768 dimensions is about 62GB as 32-bit floats, not even quantized. AWS RDS will put that at $83/mo (db.t4g.small, 2 vCPU, 2GB RAM), though that's without egress, backups, etc.

    Seems acceptable, at least for a POC?

    An even better option if you already have the data in the same instance, but the reportedly low developer experience scares me. Has anyone tried it? How did it go?
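
    The back-of-envelope storage math checks out; a quick sketch of the arithmetic (raw float32 storage only, ignoring index and row overhead):

```python
# Raw storage for n float32 vectors of the given dimensionality.

def raw_vector_bytes(n_vectors, dims, bytes_per_dim=4):
    return n_vectors * dims * bytes_per_dim

gb = raw_vector_bytes(20_000_000, 768) / 1e9
# ~61.4 GB of raw floats; HNSW graph links, table row overhead, and any
# WAL/backup storage come on top of this.
```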

  • I'm interested to try some of these others next time around, but I've used qdrant self-hosted in two projects and been pleased. Milvus was recommended so I gave that a try but found it over complicated. Pgvector seems like an obvious choice if you are already using postgres and if that performance is ok.

  • Latency from embedding models is still going to be the performance bottleneck, however fast the DB is. Plus, the overhead of synthesising answers and summaries with an LLM is going to weigh you down.

  • I'm actually curious how the new vector DB from Cloudflare compares.

  • And soon, on MySQL/Vitess as well: https://planetscale.com/ai

  • Redis is definitely missing in the comparison.

  • A 16x difference between pg and Milvus?

    I thought for most use cases this would be quite performance-sensitive.

  • What do people think about MongoDB's search offering and its pivot into vectors?

  • None of these vector DBs seem economical outside of the enterprise.

  • also, typesense

  • Somehow I felt that at least part of the article was generated by an LLM. It’s unfortunate to see a new bias start to creep in: whatever I read now, I second-guess and wonder whether it may be partially or fully generated by an LLM.