Investing in new vector database development vs enhancing existing databases

  • Has anyone seen a convincing argument for why you would want a dedicated vector database in place of a normal database with a good, fast vector index implementation?

    The existing DB + vector index option seems so obvious to me that I'm worried I'm missing something.

  • Quick "ask HN": I'm currently working on a semantic search solution, and one of the challenges is to be able to query billions of embeddings easily (single-digit seconds). I've been testing different approaches with a small dataset (50-100 million embeddings, 512 or 768 dimensions), and all databases I've tried have somewhat severe issues with this volume of data (<100GB of data) on my local machine. I've tried milvus, chroma, clickhouse, pgvector, faiss and probably some others I don't recall right now. Any suggestions on additional databases to try out?

  • Why is this flagged? Is there a problem with the content of the article?

    I don't see this reflected in the comments here.

  • ANN-benchmarks [1] compares 30 OSS libraries. pgvector comes up dead last.

    [1] https://ann-benchmarks.com/

  • pgvector seems to be getting a lot of mindshare, but it's worth noting that pgvecto.rs also exists and has some features that pgvector doesn't, including multithreaded indexing.

  • pgvector is "only" at v 0.5 but does look interesting to me. Any HNers tried it?

  • Adding txtai to the list https://github.com/neuml/txtai

    txtai is an all-in-one embeddings database for semantic search, LLM orchestration, and language model workflows. It can satisfy most vector database use cases, such as serving as a knowledge source for retrieval-augmented generation (RAG); a minimal usage sketch follows below.

    txtai is independently developed (not VC-backed) and released under an Apache 2.0 license.
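
    A minimal usage sketch of the txtai API described above; the model name and example texts are placeholders, and the exact constructor options may differ between txtai versions:

    ```python
    from txtai.embeddings import Embeddings

    # Embeddings index backed by a sentence-transformers model
    # (the model name is an illustrative choice, not a recommendation).
    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})

    data = [
        "The weather service issued a flood warning for the coast",
        "Researchers release an open-source embeddings database",
        "Local team wins the championship after extra time",
    ]

    # index() accepts (id, text, tags) tuples; tags are optional metadata
    embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

    # Returns (id, score) pairs for the closest semantic matches
    print(embeddings.search("open source vector search", 1))
    ```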

  • Most devs would be better off adding vector support to the database they already use and storing their vectors there. They need to change the schema, but that's far easier than introducing a new database.

    A specialized vector database is only needed in a few relatively rare use cases.

  • Just use PostgreSQL and pgvector
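
    For reference, a rough sketch of what the "just use Postgres and pgvector" route looks like from Python. The table name, column sizes, and index parameters are assumptions, and it uses plain psycopg2 with the vector passed as a string literal rather than the pgvector client helpers:

    ```python
    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # connection details are placeholders
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(768)
        );
    """)
    # HNSW index for cosine distance (available in pgvector >= 0.5.0)
    cur.execute(
        "CREATE INDEX IF NOT EXISTS items_embedding_idx "
        "ON items USING hnsw (embedding vector_cosine_ops);"
    )

    embedding = [0.01] * 768  # stand-in for a real model output
    literal = "[" + ",".join(str(x) for x in embedding) + "]"

    cur.execute(
        "INSERT INTO items (body, embedding) VALUES (%s, %s::vector);",
        ("example document", literal),
    )

    # Nearest neighbours by cosine distance
    cur.execute(
        "SELECT id, body FROM items ORDER BY embedding <=> %s::vector LIMIT 5;",
        (literal,),
    )
    print(cur.fetchall())
    conn.commit()
    ```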

  • One from the Qdrant team here. We genuinely recommend starting with whatever you already have in your stack when prototyping or building applications with vector search. You probably shouldn't start a new project with a complex micro-service architecture from day one, the same way you can start with full-text search using just Postgres or whatever you use as your main DB. However, as the requirements grow larger and more complex, you should probably switch to a dedicated search engine to avoid the monolithic persistence anti-pattern. https://learn.microsoft.com/en-us/azure/architecture/antipat...

    Regarding a dedicated solution, the difference is always in the details, and those details can matter a lot if you care about precision or work with large amounts of data. First of all, filtering a vector similarity search is quite tricky: neither pre- nor post-filtering works well, because pre-filtering breaks the graph's connectivity and post-filtering can leave you with too few results. That is why we introduced filterable HNSW from the very beginning. https://qdrant.tech/articles/filtrable-hnsw/ Here is another good explanation by James Briggs: https://www.pinecone.io/learn/vector-search-filtering/

    Going further, it is not only about being able to search; it is also about being able to scale: optionally keeping cold data on disk as more affordable storage while hot data stays in expensive RAM, or using one of the built-in compression features. The recently introduced Binary Quantization can compress vector embeddings up to 32 times and speed up search up to 40 times at the same time. This makes billion-scale vector search affordable, and not only for enterprise companies. https://qdrant.tech/articles/binary-quantization/

    Native vector databases build all their features around vectors; vectors are first-class citizens in the architecture of the database core engine, not just another supported index type.

    Vector search is not only about text search. See: https://qdrant.tech/articles/vector-similarity-beyond-search... It is not even only about search. Qdrant offers, for example, a dedicated Recommendation API, where you can submit positive and negative vector examples and get ready-made recommendation results back. https://qdrant.tech/articles/new-recommendation-api/ The upcoming version of the engine will introduce even more functionality, such as a new Discovery API. (Both the filtered search and the Recommendation API are sketched in code after this comment.)

    Ultimately, you do not need a vector database if you are only looking for simple vector search functionality. But you do need one if you want to do more around it and it is a central piece of your application. It is the difference between using a multi-tool for something quick and using a dedicated instrument highly optimized for the use case. Thank you for your attention.
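
    To make the points above concrete, a rough client-side sketch using the Python qdrant-client: a collection with Binary Quantization enabled, a filtered similarity search, and a recommendation call with positive/negative examples. The collection name, payload fields, point IDs, and vectors are illustrative assumptions, and method names may differ slightly between client versions:

    ```python
    from qdrant_client import QdrantClient, models

    client = QdrantClient(url="http://localhost:6333")  # local instance assumed

    # Collection with 768-d cosine vectors and binary quantization enabled
    client.create_collection(
        collection_name="docs",
        vectors_config=models.VectorParams(size=768, distance=models.Distance.COSINE),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(always_ram=True),
        ),
    )

    # A few points with payload metadata to filter on
    client.upsert(
        collection_name="docs",
        points=[
            models.PointStruct(id=42, vector=[0.01] * 768, payload={"lang": "en"}),
            models.PointStruct(id=7, vector=[0.02] * 768, payload={"lang": "en"}),
            models.PointStruct(id=13, vector=[0.03] * 768, payload={"lang": "de"}),
        ],
    )

    # Filterable HNSW: the payload condition is applied during graph traversal,
    # not as a separate pre- or post-filtering step.
    hits = client.search(
        collection_name="docs",
        query_vector=[0.02] * 768,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="lang", match=models.MatchValue(value="en"))]
        ),
        limit=5,
    )

    # Recommendation API: positive/negative point IDs instead of a raw query vector
    recs = client.recommend(
        collection_name="docs",
        positive=[42, 7],
        negative=[13],
        limit=5,
    )
    ```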

  • from the article: "Instead of investing in new vector database products, it would be better to focus on existing databases and explore how they can be enhanced by incorporating vector search functionalities to become more powerful."

    Didn't read the rest but this looks like sound advice; same would go for graph DBs.

  • This article is about why you shouldn't enter the vector database field, and it's reasonable.

    But I want to comment on another thing I often hear: "You don't need a vector database - just use Postgres or Numpy, etc". As someone who moved to Pinecone from a Numpy-based solution, I have to disagree.

    Using a hosted vector database is straightforward: get an API key from Pinecone, send them your vectors, and then query with new vectors (a minimal sketch is below). It's fast, supports metadata filtering, and scales horizontally.

    On the other hand, setting up pgvector is a hassle, especially since none of the cloud vendors support it natively, and a Numpy-based solution, while great for a POC, quickly becomes painful when you try to append to it and scale it horizontally.

    If you need a vector database, use a vector database. You won't regret it.
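
    For comparison with the pgvector/Numpy route, a minimal sketch of the hosted workflow described above using the Pinecone Python client. The index name, vector values, and metadata fields are placeholders, and the client interface has changed between versions:

    ```python
    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key
    index = pc.Index("semantic-search")    # assumes the index already exists

    vec = [0.03] * 768  # stand-in for a real embedding

    # Upsert vectors together with arbitrary metadata
    index.upsert(vectors=[
        {"id": "doc-1", "values": vec, "metadata": {"source": "blog", "lang": "en"}},
    ])

    # Query with a new vector, optionally filtering on metadata
    results = index.query(
        vector=vec,
        top_k=5,
        filter={"lang": {"$eq": "en"}},
        include_metadata=True,
    )
    print(results)
    ```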