The primary functionality of the vector store is straightforward: identifying the most similar vectors to a given vector. While the concept is simple, translating it into a practical product poses significant challenges.

A simple and basic approach to searching in a vector database is to perform an exhaustive search by comparing a query vector to every other vector stored in the database one by one. However, this consumes too many resources and results in very high latencies, making it not very practical. To address this problem, Approximate Nearest Neighbor (ANN) algorithms are used. ANN search approximates the true nearest neighbor, which means it might not find the absolute closest point, but it will find one that’s close enough, with a low-latency and by consuming fewer resources. In the literature, the comparison of the results of ANNS with exhaustive search is called the recall rate. The higher the recall rate the better the results.

Several ANNS algorithms, such as HNSW[1], NSG[2], and DiskANN[3], are available for use, each with its distinct characteristics. One of the difficult problems in ANN algorithms is that indexing and querying vectors may require storing the whole data in memory. When the dataset is huge, then memory requirements for indexing may exceed available memory. DiskANN algorithm tries to solve this problem by using disk as the main storage for indexes and for performing queries directly on disk.DiskANN paper acknowledges that, if you try to store your vectors in disk and use HNSW or NSG, you may end up with again very high latencies. DiskANN is focused on serving queries from disk with low-latency and good recall rate. And this helps Upstash Vector to be cost-effective, therefore cheaper compared to alternatives.

Even though DiskANN has its advantages, it also requires more work to be practical. Main problem is that, you can’t insert/update existing index without reindexing all the vectors. For this problem, there is another improved paper FreshDiskANN[4]. FreshDiskANN improves DiskANN via introducing a temporary index for up-to-date data in memory. Queries are served from both the temporary(up-to-date) index and also from the disk. And these temporary indexes are merged to the disk from time-to-time behind the scene.

Upstash Vector is based on DiskANN and FreshDiskANN with more improvements based on our tests and observations.

References

  1. Malkov, Y. A., Yashunin, D. A. (2016). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. CoRR, abs/1603.09320 (2016). [https://arxiv.org/abs/1603.09320]
  2. Fu, C., Xiang, C., Wang, C., Cai, D. (2019). Fast Approximate Nearest Neighbor Search with Navigating Spreading-Out Graphs. Proceedings of the VLDB, 12(5), 461–474. doi: 10.14778/3303753.3303754. [https://www.vldb.org/pvldb/vol12/p461-fu.pdf]
  3. Subramanya, S. J., Devvrit, Kadekodi, R., Krishaswamy, R., Simhadri, H. V. (2019). DiskANN: Fast Accurate Billion-Point Nearest Neighbor Search on a Single Node. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS ‘19), Article No.: 1233, Pages 13766–13776. [https://dl.acm.org/doi/abs/10.5555/3454287.3455520]
  4. Singh, A., Subramanya, S. J., Krishnaswamy, R., Simhadri, H. V. (2021). FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search. CoRR abs/2105.09613 (2021). [https://arxiv.org/abs/2105.09613]