I’ve seen a lot written recently on vector databases, so I wanted to get an idea of how they work and in what situations they could be useful. I worked with the Pinecone vector database, which is one of the most popular vector database systems on DB-Engines.
What Are Vector Databases:
A vector database allows us to store vector embeddings. These vector embeddings are arrays of numbers, often with hundreds or thousands of dimensions, that represent an object. The object can be the text of a document, an image, or a sound clip, for example. There are different methods available for converting an object to a vector embedding, like the OpenAI API for converting strings.
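As a toy illustration, an embedding is just a fixed-length array of floats. The values below are made up, and real models return far more dimensions:
# Hypothetical 4-dimension embedding of a short string; real models
# (like the OpenAI embeddings endpoint) return hundreds or thousands
# of dimensions
embedding = [0.12, -0.03, 0.88, 0.45]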
The vector embeddings can be searched to find documents that match a given string, or to find images or audio clips that are similar to one another. With text, we can perform a semantic search, which means that terms with similar meanings can be matched in addition to exact matches. SQL Server does have some semantic search capability with full-text search, but it works on single words rather than phrases, and I’m not sure how it compares in accuracy to a vector search.
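To make “similar” concrete: vector searches typically score candidates with a distance metric such as cosine similarity. Here’s a minimal sketch with NumPy, using made-up vectors as stand-ins for real embeddings:
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means identical direction; values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings where 'football' and 'soccer' point in similar directions
football = np.array([0.9, 0.1, 0.3])
soccer = np.array([0.8, 0.2, 0.35])
weather = np.array([-0.5, 0.9, -0.2])
print(cosine_similarity(football, soccer))   # high score: semantically close
print(cosine_similarity(football, weather))  # lower score: less related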
There are also applications for AI, such as chatbots, LLMs, and natural language processing.
Pinecone:
Pinecone is a managed vector database, which can scale to multiple nodes if necessary. There is a free tier that we can use to get familiar with the system. There’s a UI we can use to manage the system, or we can interact with a CLI. I’m going to use Python, and I’ll also use Python to generate the vector embeddings.
Pinecone terms:
Record: Each record has a unique ID and a vector of numeric values, plus optional metadata.
Index: Stores records – Similar to a table in a relational database.
Collection: A copy of an Index. Used for backups, experimenting with configuration changes, or transferring data.
Namespace: An index can be split into separate namespaces; upserts and queries target one namespace at a time (see the sketch after this list).
Project: Contains one or more indexes.
Pod: Different pod sizes determine the amount of compute power and storage.
Replica: Read-only copy of data.
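As a quick sketch of how namespaces work, assuming the ‘textindex1’ index created later in this post already exists and pinecone.init has been called (the namespace name and placeholder vector here are arbitrary):
import pinecone
# Upsert into and query a single namespace; records in other
# namespaces are not searched ('teams' is an arbitrary name)
index = pinecone.Index("textindex1")
index.upsert([("example_id", [0.0] * 100)], namespace="teams")
index.query(vector=[0.0] * 100, top_k=1, namespace="teams")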
Pinecone Signup:
We can go to the Pinecone site and sign up for the free tier, which will let us create one project with one index at no cost. They’ll send a code to your email that you’ll use to sign in (unless you want to sign in with GitHub, Google, or Microsoft). You’ll be asked about your preferred programming language and your project type, but these answers won’t affect the setup.
Create Index And Work With Data:
You can use the Pinecone Console to create indexes and work with data. I’m going to use Python to work with the database, so that I can call a library to give me vector embeddings for some text.
I’ll install the Pinecone client, as well as Gensim and NLTK for the vector embeddings.
pip install pinecone-client
pip install gensim  # Includes numpy, scipy, smart_open
pip install nltk
We also need to download some data for the NLTK tokenizer.
import nltk
nltk.download('punkt')
Creating an index is straightforward. You’ll need to go into the Pinecone console under ‘API Keys’ and get the API key and environment values for your project, then put those values into the init call.
You’ll also need to know how many dimensions are used for each vector embedding, since that number is needed to set up the index. If the number of dimensions in a vector doesn’t match the index, an error will be raised. With the Gensim model, I’ll have 100 dimensions in each vector.
import pinecone
# Connect using the API key and environment values from the console
pinecone.init(api_key="{apikey}", environment="{environment}")
# Create an index matching the 100-dimension Gensim vectors
pinecone.create_index("textindex1", dimension=100)
# Confirm the index was created
pinecone.list_indexes()
Next, we’ll insert some values. I’ve created a dictionary with string keys and some text describing sports teams in the Southeastern US. We’ll use the NLTK library to tokenize the text values, which basically splits each string into individual words and does some cleanup for the model. We’ll use the Gensim library to train a Word2Vec model and create the vector embeddings.
There is an option for the insert to include metadata, like adding a category. I didn’t add any metadata in my example, but that data can be used to filter query results, as shown in the sketch below.
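As a rough sketch, not part of my example, metadata can be attached as a third element in each upsert tuple and then referenced in a query filter (the ‘sport’ field and the placeholder vectors here are just for illustration):
import pinecone
index = pinecone.Index("textindex1")
# Attach a metadata dictionary as the third element of the tuple
index.upsert([("ATL_United", [0.0] * 100, {"sport": "soccer"})])
# Return only matches whose metadata has sport equal to 'soccer'
index.query(vector=[0.0] * 100, top_k=2, filter={"sport": {"$eq": "soccer"}})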
Initially, I was getting an error with the insert: ‘Unable to prepare type ndarray for serialization’. Luckily, I found a post from someone getting the same error. The model.wv lookup returns a NumPy ndarray, so it needs to be converted to a Python list for the Pinecone insert.
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
# Text dictionary
text = {
"ATL_Falcons": "Atlanta Falcons - Football - Atlanta, Georgia",
"Auburn": "Auburn University - Football - Auburn, Alabama",
"NASH_SC": "Nashville SC - Soccer - Nashville, Tennessee",
"ATL_United": "Atlanta United - Soccer - Atlanta, Georgia",
"CHAR_Hornets": "Charlotte Hornets - Basketball - Charlotte, North Carolina"
}
# Tokenization
tokenized = {key: word_tokenize(value.lower()) for key, value in text.items()}
# Train Word2Vec model
model = Word2Vec(list(tokenized.values()), vector_size=100, window=5, min_count=1, sg=0)
# Create reference to index and insert values
i = pinecone.Index("textindex1")
# Should see count for each record inserted
for key, value in tokenized.items():
    i.upsert([(key, model.wv[value].mean(axis=0).tolist())])
# Get index info
i.describe_index_stats()
Using the same index reference in the i variable, we can search for matches. The model we created before can be used to convert the search string into a vector, and the Pinecone query function will use that vector to search. The top_k value specifies how many matches to return. Each match returns the key of the matched record and a ‘score’ for how close the match was. By default, the query doesn’t return the vector for each match, but there is an option (include_values=True) to have it return those.
# Search for matches; top_k sets the number of results to return
i.query(vector=model.wv['football'].tolist(), top_k=2)
i.query(vector=model.wv['georgia'].tolist(), top_k=2)
i.query(vector=model.wv['soccer'].tolist(), top_k=2)
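For example, to also return the stored vector with each match, we can pass the include_values option mentioned above:
# Return the stored vectors along with the id and score of each match
i.query(vector=model.wv['soccer'].tolist(), top_k=2, include_values=True)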
Conclusion:
This was a very quick introduction to vector databases and to Pinecone. There’s a lot of functionality still left to go through. It would be interesting to work with images and image search, among other things.