By Albert Mao
Jan 12, 2024
What Is a Vector Database / Vector Search?
This report provides an overview of vector databases, outlines existing studies on vector search, and discusses the prospects of vector databases in the context of Large Language Models (LLMs).
Vector search is a method of information retrieval that represents data, including text, images, audio and video, as mathematical vectors.
When objects such as words or entire documents are encoded as vectors through a process known as vector embedding, the resulting numbers can be added, subtracted or compared to capture relationships and explore meaning, producing more relevant search results.
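The idea can be sketched with a toy example. The embedding values below are illustrative, not taken from any real model, but they show how arithmetic on vectors can surface semantic relationships:

```python
import math

# Toy 3-dimensional "embeddings" (illustrative values, not from a real model).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.3, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Vector arithmetic: king - man + woman should land near "queen".
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # -> queen (with these toy vectors)
```

Real embeddings have hundreds or thousands of dimensions, but the principle is the same: proximity in vector space stands in for similarity in meaning.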
There are currently multiple approaches to vectorizing data depending on its type, such as text, images or audio. For text data, for example, word embedding has developed from vector space models (VSM) based on semantics, as in Word2vec, to neural word embeddings based on neural networks.
What Is a Vector Database (aka Vector Search Engine)?
As implied by its name, a vector database is a storage system for multi-dimensional vectors. Unlike traditional databases, which answer queries based on exact matches or fixed criteria, vector databases offer searches based on the similarity and contextual meaning of data.
By capturing the attributes of various data, vector databases are a better fit for unstructured or complex data such as video, audio or text. Due to their architecture, vector databases can effectively handle the large data volumes of the modern digital environment without bottlenecks or latency issues.
Examples of Vector Search
Several algorithms are currently used to search vector databases across various types of data. These include:
Nearest Neighbor Search (NNS) to find the point in a dataset that is most similar to a given query point. For example, the NNS algorithm can be used to find images similar to a target image based on their style and content.
Approximate Nearest Neighbor Search (ANNS), which allows approximation in the search results. In the ANNS method, lower accuracy of the match is traded for speed and space efficiency. An example of an ANNS application is finding products similar to a target one based on their features.
Other notable tools and algorithms include FAISS (Facebook AI Similarity Search), a library for efficient similarity search, SPTAG (Space Partition Tree And Graph) and HNSW (Hierarchical Navigable Small World).
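The exact NNS baseline that the faster approximate methods try to match can be sketched as a brute-force scan, a minimal illustration with made-up vectors:

```python
import math

def nearest_neighbor(query, vectors):
    """Exact (brute-force) nearest-neighbor search by Euclidean distance."""
    return min(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))

# Stand-ins for image or document embeddings (illustrative values).
db = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 1.1, 1.0, 0.8],
    [5.0, 5.0, 5.0, 5.0],
]
idx = nearest_neighbor([1.0, 1.0, 1.0, 0.9], db)
print(idx)  # -> 1, the index of the vector closest to the query
```

This scan is O(n) per query, which is why production systems rely on ANNS index structures such as HNSW graphs or quantization to avoid comparing against every vector.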
Overview of Studies on Vector Databases
Representing data digitally by vectors has a long history, with the first attempts to represent words as vectors dating back to the 1950s. Below are just a few of the fundamental studies on the subject, providing a glimpse into the exploration of vector search and vector databases in the context of neural networks and large language models.
Naseem et al. (2020)
A fundamental work by Naseem et al. (2020), titled A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models, provides one of the most thorough overviews of methods and algorithms for inferring vectors from texts at the word, character and document level. The report covers methods ranging from manual feature selection, aka feature engineering, to state-of-the-art representation-learning methods that leverage neural networks for vector embedding.
Han et al. (2023)
In a work titled A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Han et al. review multiple algorithms for vector search, including hash-, tree-, graph- and quantization-based approaches. The team developed their report as a comprehensive review of vector databases to provide comparison and discuss the challenges of vector search algorithms in various settings. The report also mentions perspectives on using vector databases with LLMs.
Limits of Vector Search
Despite its strength in finding results by relevance, vector search still has limitations. For many types of queries, keyword search still provides more relevant results than vector search. It is also important to remember that scaling vector search comes at the cost of additional hardware, since vector search relies on complex vector calculations requiring significant computing power.
To address scalability and cost, some approaches use vector search as a supplement to keyword queries. Others rely on caching to reduce the demand for computing power and deliver instant results. While these techniques come with issues of their own, several further methods have been developed to improve vector search outcomes.
Improving Vector Search Outcomes
Using a Filter on Metadata
By attaching metadata key-value pairs to vectors, it becomes possible to restrict a vector search based on that metadata. This approach retrieves the requested number of nearest-neighbor results that match the metadata filter, while offering lower search latency than unfiltered searches.
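A minimal sketch of this pre-filtering pattern, with hypothetical metadata fields (`lang`, `year`) chosen only for illustration:

```python
import math

# Each record pairs a vector with metadata (fields and values are illustrative).
records = [
    {"vec": [0.1, 0.9], "meta": {"lang": "en", "year": 2023}},
    {"vec": [0.8, 0.2], "meta": {"lang": "de", "year": 2022}},
    {"vec": [0.2, 0.8], "meta": {"lang": "en", "year": 2021}},
]

def filtered_search(query, records, k, **filters):
    """Pre-filter on metadata key-value pairs, then rank survivors by distance."""
    candidates = [r for r in records
                  if all(r["meta"].get(key) == val for key, val in filters.items())]
    candidates.sort(key=lambda r: math.dist(r["vec"], query))
    return candidates[:k]

hits = filtered_search([0.0, 1.0], records, k=1, lang="en")
print(hits[0]["meta"]["year"])  # -> 2023, the closest English-language record
```

Because the filter shrinks the candidate set before any distances are computed, the nearest-neighbor stage has less work to do, which is where the latency benefit comes from.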
Hybrid Search
In the hybrid search method, keyword search and vector search are combined into a single API. While running sequential keyword and vector searches is a poor trade-off, combining keyword and vector search into a single query yields accurate search results with high speed and scalability.
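One simple way to combine the two signals is a weighted blend of a keyword-overlap score and vector similarity; the weighting scheme below is a sketch, not any particular engine's formula:

```python
import math

def hybrid_score(query_terms, doc_terms, query_vec, doc_vec, alpha=0.5):
    """Blend keyword overlap with cosine similarity; alpha weights the two."""
    keyword = len(query_terms & doc_terms) / max(len(query_terms), 1)
    dot = sum(x * y for x, y in zip(query_vec, doc_vec))
    cosine = dot / (math.hypot(*query_vec) * math.hypot(*doc_vec))
    return alpha * keyword + (1 - alpha) * cosine

# Toy documents: a term set plus a 2-d embedding each (illustrative values).
docs = [
    ({"fast", "vector", "search"}, [0.9, 0.1]),
    ({"keyword", "search"},        [0.2, 0.9]),
]
q_terms, q_vec = {"vector", "search"}, [1.0, 0.0]
scores = [hybrid_score(q_terms, t, q_vec, v) for t, v in docs]
best_doc = scores.index(max(scores))
print(best_doc)  # -> 0: strong on both the keywords and the vector
```

Production systems typically use a ranking function such as BM25 for the keyword side and fuse the two ranked lists, but the principle of scoring both signals in one pass is the same.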
Re-ranking
Re-ranking vector search proceeds in two stages:
vector search to get to the top relevant search results,
re-ranking using a cross-encoder.
The first stage relies on vector search, which can return fast results even when searching among literally a billion vectors, as demonstrated in Searching in One Billion Vectors: Re-rank with Source Coding by Jégou et al. In the second stage, a cross-encoder re-scores the candidates to retain only high-quality rankings.
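The two-stage pipeline can be sketched as follows. The `cross_score` function here is a stand-in for a real cross-encoder model, which would jointly encode the query and each candidate document:

```python
import math

def rerank_search(query, doc_vecs, cross_score, k, top_n):
    """Stage 1: fast vector search keeps k candidates.
    Stage 2: a costlier cross-encoder score re-ranks only those k."""
    by_dist = sorted(range(len(doc_vecs)),
                     key=lambda i: math.dist(doc_vecs[i], query))
    candidates = by_dist[:k]  # cheap first pass over the whole collection
    return sorted(candidates, key=cross_score, reverse=True)[:top_n]

doc_vecs = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.9], [5.0, 5.0]]
# Stand-in for a cross-encoder: an arbitrary relevance score per document.
quality = [0.2, 0.5, 0.9, 0.1]
top = rerank_search([1.0, 1.0], doc_vecs, lambda i: quality[i], k=3, top_n=2)
print(top)  # -> [2, 1]
```

The design rationale is cost asymmetry: the expensive cross-encoder only ever sees the k candidates the cheap vector stage surfaced, so overall latency stays close to that of plain vector search.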
Vector Search with VectorShift
Vector databases and vector search hold tremendous potential for various applications and tasks, enabling searches over complex data such as images, text and audio, and delivering better search experiences. In addition, vector databases expand the opportunities of large language models by offering diverse, higher-quality data for LLMs, and by enabling distributed LLM training, model compression and real-time knowledge search.
Implementing vector databases and vector search can be facilitated by the application of platforms like VectorShift, offering no-code or SDK interfaces for custom applications. For more information on VectorShift capabilities and utilizing LLMs and vector databases, please don't hesitate to get in touch with our team or request a free demo.