How to Create Vector Embeddings
Vector embeddings represent your data as points in multi-dimensional space. These embeddings capture meaningful relationships in your data and enable tasks such as semantic search and retrieval. You can store vector embeddings along with your other data in Atlas and use Atlas Vector Search to query your vectorized data.
To perform Atlas Vector Search queries, you must:
Choose a method to create vector embeddings.
Create vector embeddings from your data and store them in Atlas.
Create a vector embedding that represents your search query and run the query.
Atlas Vector Search returns documents whose vector embeddings are closest in distance to the embedding that represents your query. This indicates that they are similar in meaning.
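For instance, many embedding models are designed so that cosine similarity between vectors approximates semantic similarity. The following is a minimal, illustrative sketch; the three-dimensional vectors are hypothetical placeholders, and Atlas Vector Search performs this comparison for you at query time:

import numpy as np

def cosine_similarity(a, b):
    """Returns a value between -1 and 1; higher means closer in meaning."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical three-dimensional embeddings for illustration only;
# real embedding models produce hundreds or thousands of dimensions.
query_vector = [0.9, 0.1, 0.0]
similar_doc = [0.8, 0.2, 0.1]
unrelated_doc = [0.0, 0.1, 0.9]

print(cosine_similarity(query_vector, similar_doc))    # higher score
print(cosine_similarity(query_vector, unrelated_doc))  # lower score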
Choose a Method to Create Embeddings
To create vector embeddings, you must use an embedding model. To connect to an embedding model and create embeddings for Atlas Vector Search, use one of the following methods:
Call an embedding service. Most AI providers offer APIs for their proprietary embedding models that you can use to create vector embeddings.
For a sample implementation with OpenAI, see Create and Store Embeddings From Data.
Load an open-source model. If you don't have API keys or credits with an embedding service, you can use an open-source embedding model by loading it locally from your application.
For a sample implementation, see Create and Store Embeddings From Data.
Leverage an integration. You can integrate Atlas Vector Search with open-source frameworks like LangChain and LlamaIndex, services like Amazon Bedrock, and more. These integrations include built-in libraries and tools to help you quickly connect to open-source and proprietary embedding models and generate vector embeddings for Atlas Vector Search.
To get started, see Integrate Vector Search with AI Technologies.
Create and Store Embeddings From Data
The following procedure demonstrates how to create vector embeddings and store them in Atlas by using an open-source or proprietary embedding model and the MongoDB PyMongo Driver.
Prerequisites
To run these examples, you must have the following:
An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later (including RCs). Ensure that your IP address is included in your Atlas project's access list.
An environment to run Python interactive notebooks such as Colab.
Procedure
Complete the following steps to create vector embeddings from a sample dataset, and then store them in a collection in Atlas.
Note
This example covers how to create vector embeddings from a new dataset. If you want to create embeddings for an existing collection, you must add a new field that contains the embedding and update each document in the collection.
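For an existing collection, one possible approach is to iterate over documents that don't yet have an embedding field and update each one. The following is a minimal sketch that assumes a collection with a text field and a get_embedding function like the one defined later in this procedure:

import pymongo

# Connect to your Atlas cluster; replace the placeholders with your own values
client = pymongo.MongoClient("<connection-string>")
collection = client["<database>"]["<collection>"]

# Embed only documents that have text but no embedding yet
for doc in collection.find({"text": {"$exists": True}, "text_embedding": {"$exists": False}}):
    embedding = get_embedding(doc["text"])  # assumed embedding function, defined later in this procedure
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"text_embedding": embedding}}
    )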
Set up the environment.
Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install dependencies:
pip install --quiet datasets pandas nomic sentence-transformers einops pymongo
Note
If you experience warnings about version compatibility, you can ignore them as they do not prevent you from completing this tutorial.
Load and prepare the sample data.
This tutorial uses a sample dataset that contains text from a variety of how-to articles. This dataset is available on the Hugging Face dataset library for easy access to the data from your application.
Paste and run the following code in your notebook. This code does the following:
Loads the dataset from the Hugging Face dataset library.
Keeps only the first 100 entries of the dataset.
Converts the dataset to a pandas DataFrame so you can easily process the data.
Filters the data for non-null entries.
from datasets import load_dataset
import pandas as pd

# Load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(100)

# Create the DataFrame
df = pd.DataFrame(data_head)

# Only keep entries where the text field is not null
df = df[df["text"].notna()]

# Preview contents of the data
df.head()
Create vector embeddings from your data.
Paste and run the following code in your notebook to create vector embeddings by using an open-source embedding model from Nomic AI. This code does the following:
Loads the nomic-embed-text-v1 embedding model.
Creates a function named get_embedding that uses the model to generate an embedding for a given text input.
Calls the function to generate embeddings from the text field in your DataFrame and stores these embeddings in a new text_embedding field.
from sentence_transformers import SentenceTransformer

# Load the embedding model (https://huggingface.co/nomic-ai/nomic-embed-text-v1)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = model.encode(text)
    return embedding.tolist()

# Create embeddings and store them as a new field
df["text_embedding"] = df["text"].apply(get_embedding)
df.head()
Store your data in Atlas.
Paste and run the following code in your notebook to connect to your Atlas cluster and store your data in the sample_db.articles collection. Replace the placeholder value with your Atlas cluster's SRV connection string.
Note
Your connection string should use the following format:
mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")

# Ingest data into Atlas
db = mongo_client["sample_db"]
collection = db["articles"]
documents = df.to_dict("records")
collection.insert_many(documents)
After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the sample_db.articles collection in your cluster.
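You can also verify the embeddings from your notebook. The following is a minimal check, using the collection handle from the previous step, that retrieves one document and prints the length of its embedding, which should match the number of dimensions in your index definition:

# Fetch one document and confirm that the embedding was stored
doc = collection.find_one({}, {"text": 1, "text_embedding": 1})
print(doc["text"][:80])
print(len(doc["text_embedding"]))  # 768 dimensions for nomic-embed-text-v1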
Set up the environment.
Create an interactive Python notebook by saving a file with the .ipynb extension, and then run the following command in the notebook to install dependencies:
pip install --quiet datasets pandas openai pymongo
Note
If you experience warnings about version compatibility, you can ignore them as they do not prevent you from completing this tutorial.
Load and prepare the sample data.
This tutorial uses a sample dataset that contains text from a variety of how-to articles. This dataset is available on the Hugging Face dataset library for easy access to the data from your application.
Paste and run the following code in your notebook. This code does the following:
Loads the dataset from the Hugging Face dataset library.
Keeps only the first 100 entries of the dataset.
Converts the dataset to a pandas DataFrame so you can easily process the data.
Filters the data for non-null entries.
from datasets import load_dataset
import pandas as pd

# Load the dataset without downloading it fully
data = load_dataset("MongoDB/cosmopedia-wikihow-chunked", split="train", streaming=True)
data_head = data.take(100)

# Create the DataFrame
df = pd.DataFrame(data_head)

# Only keep entries where the text field is not null
df = df[df["text"].notna()]

# Preview contents of the data
df.head()
Create vector embeddings from your data.
Paste and run the following code in your notebook to create vector embeddings by using a proprietary embedding model from OpenAI. Replace the placeholder value with your OpenAI API key. This code does the following:
Specifies the text-embedding-3-small embedding model.
Creates a function named get_embedding that calls the model's API to generate an embedding for a given text input.
Calls the function to generate embeddings from the text field in your DataFrame and stores these embeddings in a new text_embedding field.
import os
from openai import OpenAI

# Specify your OpenAI API key and embedding model
os.environ["OPENAI_API_KEY"] = "<api-key>"
model = "text-embedding-3-small"
openai_client = OpenAI()

def get_embedding(text):
    """Generates vector embeddings for the given text."""
    embeddings = openai_client.embeddings.create(input=[text], model=model).data[0].embedding
    return embeddings

# Create embeddings and store them as a new field
df["text_embedding"] = df["text"].apply(get_embedding)
df.head()
Tip
See also:
For API details and a list of available models, refer to the OpenAI documentation.
Store your data in Atlas.
Paste and run the following code in your notebook to connect to your Atlas cluster and store your data in the sample_db.articles collection. Replace the placeholder value with your Atlas cluster's SRV connection string.
Note
Your connection string should use the following format:
mongodb+srv://<username>:<password>@<clusterName>.<hostname>.mongodb.net
import pymongo

# Connect to your Atlas cluster
mongo_client = pymongo.MongoClient("<connection-string>")

# Ingest data into Atlas
db = mongo_client["sample_db"]
collection = db["articles"]
documents = df.to_dict("records")
collection.insert_many(documents)
After running the sample code, you can view your vector embeddings in the Atlas UI by navigating to the sample_db.articles collection in your cluster.
Create Embeddings for Queries
The following procedure demonstrates how to create an embedding for Atlas Vector Search queries by using an open-source or proprietary embedding model and the MongoDB PyMongo Driver.
Procedure
Once you've created embeddings from the sample data, complete the following steps to create an Atlas Vector Search index on your data and create an embedding that you can use for vector search queries.
Create the Atlas Vector Search index.
To enable vector search queries on your data, create an Atlas Vector Search index on the sample_db.articles collection.
The following index definition specifies the text_embedding field as the vector type, 768 vector dimensions, and euclidean as the similarity measure.
The method that you can use to create the index depends on your cluster tier:
For free and shared clusters, follow the steps to create an index through the Atlas UI. Name the index vector_index and use the following index definition:

{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 768,
      "similarity": "euclidean"
    }
  ]
}

For dedicated clusters, you can also create the index by using a supported MongoDB driver. Paste and run the following code in your notebook to create the index by using the PyMongo driver helper method:
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "text_embedding",
                "numDimensions": 768,
                "similarity": "euclidean"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)
To learn more, see Create an Atlas Vector Search Index.
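The index takes some time to build after you create it. The following is a minimal sketch, assuming PyMongo 4.5 or later (which provides the list_search_indexes helper), that polls until the index reports that it is queryable:

import time

# Poll until the new search index reports that it is queryable
while True:
    indexes = list(collection.list_search_indexes("vector_index"))
    if indexes and indexes[0].get("queryable"):
        print("Index is ready to run queries.")
        break
    time.sleep(5)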
Create embeddings for vector search queries and run a query.
To generate the query vector for your vector search queries, you can use the same method that you used to create embeddings from your data.
For example, paste and run the following code to do the following:
Create an embedding for the string home improvement by calling the embedding function that you defined in the previous example.
Pass the embedding into the queryVector field in your aggregation pipeline.
Run a sample vector search query and return the output.
# Generate embedding for the search query
query_embedding = get_embedding("home improvement")

# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "text_embedding",
            "numCandidates": 100,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
    print(i)
{'text': "**Step 9: Analyze Findings**\nReview collected information meticulously. Identify maximum deviations, average variances, patterns, etc. Decide whether remedial actions are needed based on severity and implications of revealed disparities. Common solutions include shimming low spots, grinding high ones, repairing damaged sections, or even consulting experts about potential structural concerns.\n\nBy diligently adhering to this procedure, you'll successfully check your floor's level condition, thereby facilitating informed decisions concerning maintenance, renovation, or construction projects!", 'score': 0.4972769618034363}
{'text': '**Step 5: Deep Clean Surfaces**\nNow that the room is free of excess clutter, focus on deep cleaning surfaces. Start high and work your way down to avoid recontaminating cleaned areas. Dust light fixtures, ceiling fans, windowsills, shelves, and furniture. Vacuum or sweep floors thoroughly. Mop hard floor surfaces using a suitable cleanser. Pay attention to often neglected spots like baseboards and door frames.\n\nKey Tips:\n- Always start with the highest points to prevent falling dust from settling on already cleaned surfaces.\n- Move large pieces of furniture away from walls to ensure thorough cleaning beneath them.\n- Allow ample drying time before replacing stored items to prevent moisture damage.', 'score': 0.48243528604507446}
{'text': "Remember to include support columns if needed, especially if designing multi-story structures.\n\n**Step 5: Designing Interiors**\nNow comes the fun part - decorating! Add lighting with torches, lanterns, or glowstone. Install staircases leading upstairs or downstairs. Create cozy seating areas with chairs and tables. Adorn walls with paintings, banners, or vines. And don't forget about adding bathroom facilities!\n\nBe creative but consistent with your theme. If going for a luxury feel, opt for gold accents and fine furniture pieces. Alternatively, go minimalist with clean lines and neutral colors.\n\n**Step 6: Creating Upper Levels & Roofs**\nRepeat steps four and five for additional floors, ensuring structural integrity throughout. When reaching the topmost level, cap off the building with a roof. Common roof shapes include gable, hip, mansard, and skillion. Whichever style you choose, ensure symmetry and proper alignment.", 'score': 0.4739491045475006}
{'text': '**Step 7: Landscaping Exteriors**\nFinally, beautify your surroundings. Plant trees, flowers, and grass. Dig ponds or rivers nearby. Pathway bricks or gravel paths towards entrances. Build outdoor sitting areas, gardens, or even swimming pools!\n\nAnd there you have it - a grand hotel standing tall amidst the virtual landscape! With careful planning, patient collection of materials, thoughtful interior design, meticulous upper levels, and picturesque landscaping, you now possess both a functional space and impressive architectural feat. Happy building!', 'score': 0.4724790155887604}
{'text': 'Title: How to Create and Maintain a Compost Pile\n\nIntroduction:\nComposting is an easy and environmentally friendly way to recycle organic materials and create nutrient-rich soil for your garden or plants. By following these steps, you can learn how to build and maintain a successful compost pile that will help reduce waste and improve the health of your plants.\n\n**Step 1: Choose a Location **\nSelect a well-draining spot in your backyard, away from your house or other structures, as compost piles can produce odors. Ideally, locate the pile in partial shade or a location with morning sun only. This allows the pile to retain moisture while avoiding overheating during peak sunlight hours.\n\n_Key tip:_ Aim for a minimum area of 3 x 3 feet (0.9m x 0.9m) for proper decomposition; smaller piles may not generate enough heat for optimal breakdown of materials.', 'score': 0.471458375453949}
Create the Atlas Vector Search index.
To enable vector search queries on your data, create an Atlas Vector Search index on the sample_db.articles collection.
The following index definition specifies the text_embedding field as the vector type, 1536 vector dimensions, and euclidean as the similarity measure.
The method that you can use to create the index depends on your cluster tier:
For free and shared clusters, follow the steps to create an index through the Atlas UI. Name the index vector_index and use the following index definition:

{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1536,
      "similarity": "euclidean"
    }
  ]
}

For dedicated clusters, you can also create the index by using a supported MongoDB driver. Paste and run the following code in your notebook to create the index by using the PyMongo driver helper method:
from pymongo.operations import SearchIndexModel

# Create your index model, then create the search index
search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "text_embedding",
                "numDimensions": 1536,
                "similarity": "euclidean"
            }
        ]
    },
    name="vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=search_index_model)
To learn more, see Create an Atlas Vector Search Index.
Create embeddings for vector search queries and run a query.
To generate the query vector for your vector search queries, you can use the same method that you used to create embeddings from your data.
For example, paste and run the following code to do the following:
Create an embedding for the string home improvement by calling the embedding function that you defined in the previous example.
Pass the embedding into the queryVector field in your aggregation pipeline.
Run a sample vector search query and return the output.
# Generate embedding for the search query
query_embedding = get_embedding("home improvement")

# Sample vector search pipeline
pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_embedding,
            "path": "text_embedding",
            "numCandidates": 100,
            "limit": 5
        }
    },
    {
        "$project": {
            "_id": 0,
            "text": 1,
            "score": {"$meta": "vectorSearchScore"}
        }
    }
]

# Execute the search
results = collection.aggregate(pipeline)

# Print results
for i in results:
    print(i)
{'text': '**Step 6: Regular Maintenance**\nAfter investing effort into cleaning and organizing a crowded room, maintaining its orderliness is crucial. Establish habits that promote ongoing tidiness, such as regularly putting things back where they belong, scheduling weekly cleanup sessions, and addressing new clutter promptly rather than letting it accumulate over time.\n\nBy consistently applying these steps, you can successfully clean and maintain a very crowded room, creating a peaceful and enjoyable living space.', 'score': 0.42446020245552063}
{'text': "**Step 9: Analyze Findings**\nReview collected information meticulously. Identify maximum deviations, average variances, patterns, etc. Decide whether remedial actions are needed based on severity and implications of revealed disparities. Common solutions include shimming low spots, grinding high ones, repairing damaged sections, or even consulting experts about potential structural concerns.\n\nBy diligently adhering to this procedure, you'll successfully check your floor's level condition, thereby facilitating informed decisions concerning maintenance, renovation, or construction projects!", 'score': 0.421939879655838}
{'text': 'Check If a Floor Is Level: A Comprehensive Step-by-Step Guide\n==========================================================\n\nA level floor is crucial for various reasons such as safety, aesthetics, and proper functioning of appliances or furniture that require stability. This tutorial will guide you through checking whether your floor is level with accuracy and precision using tools available at most hardware stores. By following these steps, you can identify any irregularities, enabling necessary corrections before installing new floors, fixtures, or equipment.\n\n**Duration:** Approximately 30 minutes (excluding correction time)', 'score': 0.4213894307613373}
{'text': '**Step 7: Landscaping Exteriors**\nFinally, beautify your surroundings. Plant trees, flowers, and grass. Dig ponds or rivers nearby. Pathway bricks or gravel paths towards entrances. Build outdoor sitting areas, gardens, or even swimming pools!\n\nAnd there you have it - a grand hotel standing tall amidst the virtual landscape! With careful planning, patient collection of materials, thoughtful interior design, meticulous upper levels, and picturesque landscaping, you now possess both a functional space and impressive architectural feat. Happy building!', 'score': 0.41135403513908386}
{'text': "**Step 2: Gather Necessary Materials**\nTo efficiently clean a crowded room, gather all necessary materials beforehand. Some essential items include:\n\n* Trash bags\n* Recycling bins or bags\n* Boxes or storage containers\n* Cleaning supplies (e.g., broom, vacuum cleaner, dustpan, mop, all-purpose cleaner)\n* Gloves\n* Label maker or markers\n\nHaving everything at hand ensures smooth progress without wasting time searching for tools during the cleaning process.\n\n**Step 3: Declutter Systematically**\nStart by removing unnecessary items from the room. Divide objects into categories such as trash, recyclables, donations, and items to keep. Be ruthless when deciding which belongings are truly valuable or needed. If you haven't used something within the past year, consider whether it's worth keeping. Donating unused items not only frees up space but also benefits those in need.", 'score': 0.407828688621521}
Tip
See also:
You can also create embeddings by calling the API endpoint directly. To learn more, see OpenAI API Reference.
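For example, the following is a minimal sketch of calling the embeddings endpoint directly with the requests library; verify the request parameters against the OpenAI API Reference before using it:

import os
import requests

# Call the OpenAI embeddings endpoint directly over HTTP
response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"input": "home improvement", "model": "text-embedding-3-small"},
)
response.raise_for_status()
query_embedding = response.json()["data"][0]["embedding"]
print(len(query_embedding))  # 1536 dimensions for text-embedding-3-small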
To learn more about running vector search queries, see Run Vector Search Queries.