This guide explains how to perform semantic queries on documents in CapyDB. Semantic queries retrieve documents by matching the meaning of the provided query text with EmbJSONs in the database.
The query operation returns a list of matched chunks from EmbJSONs in the collection. Only EmbJSONs that use the same emb_model as the query text are included in the semantic search; EmbJSONs with a different emb_model are excluded.
The simplest way to use the query operation is to provide only the query text. This lets you search your data semantically without setting any additional parameters.
# Simple query example
query_text = "Software engineer with expertise in AI"
response = collection.query(query_text)
A successful query operation will return a JSON response containing the matching documents. By default, the response includes the matched text chunks, their location in the document, similarity scores, and basic document metadata:
{
"matches": [
{
"chunk": "John is a software engineer with expertise in AI.",
"path": "bio",
"chunk_n": 0,
"score": 0.95,
"document": {
"_id": ObjectId("64d2f8f01234abcd5678ef90")
// All document fields are returned here (name, bio, skills, etc.)
}
},
{
"chunk": "Alice is a data scientist with a background in machine learning.",
"path": "bio",
"chunk_n": 1,
"score": 0.89,
"document": {
"_id": ObjectId("64d2f8f01234abcd5678ef91")
// Complete document data is returned by default
}
}
]
}
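A response shaped like the one above can be walked like any list of dictionaries. The following is a minimal sketch, assuming the Python client hands back the response as a plain dict with the documented `matches` layout. The `sample_response` data and the `summarize_matches` helper are hypothetical and exist only for illustration.

```python
# Hypothetical stand-in for a query response, shaped like the documented example.
sample_response = {
    "matches": [
        {
            "chunk": "John is a software engineer with expertise in AI.",
            "path": "bio",
            "chunk_n": 0,
            "score": 0.95,
            "document": {"_id": "64d2f8f01234abcd5678ef90"},
        },
        {
            "chunk": "Alice is a data scientist with a background in machine learning.",
            "path": "bio",
            "chunk_n": 1,
            "score": 0.89,
            "document": {"_id": "64d2f8f01234abcd5678ef91"},
        },
    ]
}


def summarize_matches(response, min_score=0.0):
    """Return (document id, path, score, chunk) tuples for matches at or above min_score."""
    summary = []
    for match in response.get("matches", []):
        if match["score"] >= min_score:
            summary.append(
                (match["document"]["_id"], match["path"], match["score"], match["chunk"])
            )
    return summary


# Print only the matches that clear a (hypothetical) relevance threshold.
for doc_id, path, score, chunk in summarize_matches(sample_response, min_score=0.9):
    print(f"{score:.2f} {doc_id} [{path}] {chunk}")
```

Filtering on `score` on the client side is a design choice for this sketch; the documented parameters only limit the number of matches through top_k.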
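By default, the system embeds the query text with text-embedding-3-small, returns up to 10 matches, omits the embedding vector values, and returns the complete document for each match.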
For more control over your semantic searches, you can customize the query operation with additional parameters. These parameters allow you to fine-tune the search behavior, filter results, and specify what data to include in the response.
# Advanced query with optional parameters
query_text = "Software engineer with expertise in AI"
emb_model = "text-embedding-3-small" # Optional
top_k = 3 # Optional
include_values = True # Optional
projection = {
"mode": "include",
"fields": ["name", "bio"]
} # Optional
response = collection.query(
query_text,
filter={"status": "active"},
projection=projection,
emb_model=emb_model,
top_k=top_k,
include_values=include_values
)
When you customize the query with additional parameters like include_values or projection, the response can include more detailed information:
{
"matches": [
{
"path": "bio",
"chunk": "John is a software engineer with expertise in AI.",
"chunk_n": 0,
"score": 0.95,
"values": [
0.123, 0.456, 0.789, ...
],
"document": {
"_id": ObjectId("64d2f8f01234abcd5678ef90"),
"name": "John Doe",
"bio": EmbText("John is a software engineer with expertise in AI.")
}
},
{
"path": "bio",
"chunk": "Alice is a data scientist with a background in machine learning.",
"chunk_n": 1,
"score": 0.89,
"values": [
0.234, 0.567, 0.890, ...
],
"document": {
"_id": ObjectId("64d2f8f01234abcd5678ef91"),
"name": "Alice Smith",
"bio": EmbText("Alice is a data scientist with a background in machine learning.")
}
}
]
}
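When include_values is set, each match carries its raw embedding vector in `values`, which makes client-side processing possible. As one possible use, the sketch below computes the cosine similarity between two vectors with only the standard library. The short vectors are placeholders copied from the truncated example above (real embeddings have many more dimensions), and `cosine_similarity` is a hypothetical helper, not part of the CapyDB client.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for the "values" field of two matches.
values_john = [0.123, 0.456, 0.789]
values_alice = [0.234, 0.567, 0.890]

similarity = cosine_similarity(values_john, values_alice)
print(f"similarity between the two matched chunks: {similarity:.3f}")
```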
Parameter | Description |
---|---|
query | The text to be embedded and matched against stored EmbJSON fields. This parameter is required. |
filter (optional) | MongoDB-style query filter to apply to documents before semantic search. This helps narrow down the document set before performing the semantic search. |
projection (optional) | Specifies which fields to include or exclude in the returned documents. Format: {"mode": "include", "fields": ["field1", "field2"]} or {"mode": "exclude", "fields": ["field3"]}. |
emb_model (optional) | The embedding model used to embed the query text. Defaults to OpenAI's text-embedding-3-small. You can select any supported embedding model. Only EmbJSON fields stored with the specified model are searched; fields that use a different model are skipped. |
top_k (optional) | The maximum number of matches to return. Defaults to 10. Increase this value to get more results, decrease it to improve performance and reduce response size. |
include_values (optional) | Whether to include the embedding vector values in the response. Defaults to false. Set to true if you need the raw vector data for further processing. |
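
The parameter rules in the table can be checked on the client before a request is sent. The sketch below shows one way to do that. `build_query_kwargs` is a hypothetical helper, not part of the CapyDB client, and the defaults mentioned in its docstring are the ones listed in the table.

```python
def build_query_kwargs(filter=None, projection=None, emb_model=None,
                       top_k=None, include_values=None):
    """Validate optional query parameters and return only the ones that were set.

    Hypothetical helper: parameters left as None are omitted so the documented
    defaults (text-embedding-3-small, top_k=10, include_values=False) apply.
    """
    kwargs = {}
    if filter is not None:
        if not isinstance(filter, dict):
            raise TypeError("filter must be a MongoDB-style dict")
        kwargs["filter"] = filter
    if projection is not None:
        if projection.get("mode") not in ("include", "exclude"):
            raise ValueError('projection["mode"] must be "include" or "exclude"')
        if not isinstance(projection.get("fields"), list):
            raise ValueError('projection["fields"] must be a list of field names')
        kwargs["projection"] = projection
    if emb_model is not None:
        kwargs["emb_model"] = emb_model
    if top_k is not None:
        if not isinstance(top_k, int) or top_k < 1:
            raise ValueError("top_k must be a positive integer")
        kwargs["top_k"] = top_k
    if include_values is not None:
        kwargs["include_values"] = bool(include_values)
    return kwargs


kwargs = build_query_kwargs(
    filter={"status": "active"},
    projection={"mode": "include", "fields": ["name", "bio"]},
    top_k=3,
)
print(kwargs)
# The result could then be passed along as collection.query(query_text, **kwargs).
```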
When you need to quickly search for documents related to a concept:
results = collection.query("climate change impact")
When you need to search within a specific category or subset of documents:
results = collection.query(
"renewable energy solutions",
filter={"category": "science", "published": True}
)
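To make the effect of the filter concrete, here is a conceptual sketch of narrowing a document set before any semantic matching takes place. It supports only top-level equality checks and runs on made-up in-memory documents; `matches_filter` is a hypothetical helper and is not meant to describe how CapyDB evaluates filters internally.

```python
def matches_filter(document, query_filter):
    """True if every key in query_filter equals the same top-level field in document."""
    return all(document.get(key) == value for key, value in query_filter.items())


# Made-up documents for illustration.
documents = [
    {"title": "Solar storage", "category": "science", "published": True},
    {"title": "Wind policy draft", "category": "science", "published": False},
    {"title": "Energy stocks", "category": "finance", "published": True},
]

query_filter = {"category": "science", "published": True}

# Only documents that pass the filter would go on to the semantic search step.
candidates = [doc for doc in documents if matches_filter(doc, query_filter)]
print([doc["title"] for doc in candidates])
```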
When you only need the top few most relevant matches:
results = collection.query("machine learning techniques", top_k=3)
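Conceptually, top_k keeps the highest-scoring matches and drops the rest. The sketch below illustrates that idea on made-up scores using the standard library's `heapq.nlargest`; it is an illustration of the concept only, under the assumption that matches are ranked by `score`, and not a description of the service's ranking code.

```python
import heapq

# Made-up scored chunks for illustration.
scored_chunks = [
    {"chunk": "Gradient descent basics", "score": 0.72},
    {"chunk": "Decision trees explained", "score": 0.88},
    {"chunk": "Cooking with cast iron", "score": 0.11},
    {"chunk": "Neural network training", "score": 0.93},
    {"chunk": "Support vector machines", "score": 0.81},
]

top_k = 3
# Keep the top_k entries with the highest score, ordered from best to worst.
top_matches = heapq.nlargest(top_k, scored_chunks, key=lambda match: match["score"])

for match in top_matches:
    print(f'{match["score"]:.2f} {match["chunk"]}')
```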
When you need to include specific fields in the response:
projection = {"mode": "include", "fields": ["title", "abstract", "author"]}
results = collection.query("quantum computing", projection=projection)
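The two projection modes can be pictured as a simple dictionary operation on each returned document. The sketch below is a hypothetical client-side illustration that handles top-level fields only. It keeps `_id` in include mode because the documented example response shows `_id` alongside the included fields; treating that as a general rule is an assumption.

```python
def apply_projection(document, projection):
    """Return a copy of document with top-level fields included or excluded."""
    fields = set(projection["fields"])
    mode = projection["mode"]
    if mode == "include":
        # Assumption: _id is always kept, mirroring the documented example response.
        return {k: v for k, v in document.items() if k in fields or k == "_id"}
    if mode == "exclude":
        return {k: v for k, v in document.items() if k not in fields}
    raise ValueError('mode must be "include" or "exclude"')


# Made-up document for illustration.
paper = {
    "_id": "64d2f8f01234abcd5678ef92",
    "title": "Notes on qubits",
    "abstract": "A short overview.",
    "author": "A. Example",
    "internal_notes": "not for display",
}

included = apply_projection(paper, {"mode": "include", "fields": ["title", "abstract", "author"]})
excluded = apply_projection(paper, {"mode": "exclude", "fields": ["internal_notes"]})
print(sorted(included))
print(sorted(excluded))
```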