Implementing Gemini Text Embeddings for Production Applications
Note: This guide is based on Google Generative AI API documentation, the Gemini embedding model specifications for text-embedding-004, and documented RAG (Retrieval-Augmented Generation) patterns. All code examples use the official google-generativeai Python SDK and follow Google Cloud best practices.
Text embeddings transform text into dense vector representations that capture semantic meaning, enabling applications such as semantic search, document clustering, and Retrieval-Augmented Generation (RAG). Google’s Gemini embedding models, particularly text-embedding-004, provide strong semantic performance with configurable output dimensions and task-specific optimization.
This guide demonstrates practical implementation of Gemini embeddings for production applications, from API integration through vector database deployment and RAG systems.
Prerequisites
Required Knowledge:
- Python 3.9+ and async programming concepts
- Basic understanding of vector similarity (cosine similarity, dot product)
- Familiarity with REST APIs and API keys
- Optional: Experience with vector databases or semantic search
Required Tools:
# Install Google Generative AI SDK
pip install google-generativeai==0.4.0
# Vector database clients
pip install chromadb==0.4.22 # Local/embedded vector DB
pip install pinecone-client==3.0.0 # Managed vector DB
# Data processing and utilities
pip install numpy==1.24.0
pip install pandas==2.1.0
pip install scikit-learn==1.3.0
# Optional: For visualization
pip install matplotlib==3.8.0 umap-learn==0.5.5
API Access:
- Google AI API key (free tier available: https://makersuite.google.com/app/apikey)
- Optional: Google Cloud account for Vertex AI deployment (enterprise)
Gemini Embedding Models Overview
Model Specifications
Google’s Gemini embedding models as of March 2025:
| Model | Dimensions | Max Input | Task Types | Free Tier Limit |
|---|---|---|---|---|
| text-embedding-004 | 768 (default), 256/512 configurable | 2048 tokens | Retrieval, similarity, classification, clustering | 1,500 requests/day |
| text-multilingual-embedding-002 | 768 | 2048 tokens | Multilingual (100+ languages) | 1,500 requests/day |
Key Features:
- Task-specific optimization: Optimize for retrieval, similarity, classification, or clustering
- Configurable dimensions: Trade accuracy for storage/speed with dimension reduction
- Batch processing: Process up to 100 texts per API call
- Multilingual support: Native support for 100+ languages
Comparison with Other Embedding Models
| Model | Dimensions | Strengths | Best For |
|---|---|---|---|
| Gemini text-embedding-004 | 768 | Task-specific optimization, free tier | General-purpose, RAG, semantic search |
| OpenAI text-embedding-3-small | 1536 | Good quality at lower cost | Cost-sensitive production workloads |
| OpenAI text-embedding-3-large | 3072 | Highest embedding quality | Quality-critical applications |
| Cohere embed-english-v3.0 | 1024 | Strong English performance | English-only applications |
API Setup and Authentication
Basic Configuration
# config.py - Google Generative AI setup
import google.generativeai as genai
import os
# Configure API key
API_KEY = os.environ.get("GOOGLE_API_KEY")
if not API_KEY:
raise ValueError("GOOGLE_API_KEY environment variable not set")
genai.configure(api_key=API_KEY)
# List available embedding models
def list_embedding_models():
"""Display available embedding models and their details"""
for model in genai.list_models():
if 'embedContent' in model.supported_generation_methods:
print(f"Model: {model.name}")
print(f" Display Name: {model.display_name}")
print(f" Description: {model.description}")
print(f" Input Token Limit: {model.input_token_limit}")
print()
list_embedding_models()
Generate Embeddings
# embeddings.py - Core embedding generation
from typing import List
import google.generativeai as genai  # assumes genai.configure(api_key=...) has already run (see config.py)
import numpy as np
def generate_embedding(
text: str,
model: str = "models/text-embedding-004",
task_type: str = "RETRIEVAL_DOCUMENT",
output_dimensionality: int = 768
) -> np.ndarray:
"""
Generate embedding for a single text.
Args:
text: Input text to embed
model: Gemini embedding model name
task_type: Optimization task type:
- RETRIEVAL_QUERY: Optimize for search queries
- RETRIEVAL_DOCUMENT: Optimize for indexed documents
- SEMANTIC_SIMILARITY: General similarity comparison
- CLASSIFICATION: Optimize for text classification
- CLUSTERING: Optimize for clustering tasks
output_dimensionality: Output vector dimensions (256, 512, or 768)
Returns:
NumPy array of embedding vector
"""
result = genai.embed_content(
model=model,
content=text,
task_type=task_type,
output_dimensionality=output_dimensionality
)
return np.array(result['embedding'])
# Example usage
text = "Gemini embedding models provide state-of-the-art semantic understanding."
embedding = generate_embedding(text)
print(f"Text: {text}")
print(f"Embedding shape: {embedding.shape}")
print(f"First 5 dimensions: {embedding[:5]}")
Batch Embedding Generation
Process multiple texts efficiently:
def generate_embeddings_batch(
texts: List[str],
model: str = "models/text-embedding-004",
task_type: str = "RETRIEVAL_DOCUMENT",
output_dimensionality: int = 768
) -> np.ndarray:
"""
Generate embeddings for multiple texts in a single API call.
Args:
texts: List of texts to embed (max 100 per batch)
model: Gemini embedding model name
task_type: Optimization task type
output_dimensionality: Output vector dimensions
Returns:
NumPy array of shape (len(texts), output_dimensionality)
"""
# API supports max 100 texts per request
if len(texts) > 100:
# Process in chunks
embeddings = []
for i in range(0, len(texts), 100):
batch = texts[i:i+100]
batch_embeddings = generate_embeddings_batch(batch, model, task_type, output_dimensionality)
embeddings.append(batch_embeddings)
return np.vstack(embeddings)
result = genai.embed_content(
model=model,
content=texts,
task_type=task_type,
output_dimensionality=output_dimensionality
)
    return np.array(result['embedding'])
# Example: Batch processing
documents = [
"Machine learning is a subset of artificial intelligence.",
"Deep learning uses neural networks with multiple layers.",
"Natural language processing enables computers to understand human language.",
"Computer vision allows machines to interpret visual information."
]
embeddings = generate_embeddings_batch(documents)
print(f"Generated {len(embeddings)} embeddings with shape {embeddings[0].shape}")
Semantic Search Implementation
Build Document Index
# semantic_search.py - Complete semantic search system
from typing import Dict, List, Tuple
from dataclasses import dataclass
import numpy as np
from embeddings import generate_embedding, generate_embeddings_batch  # helpers defined earlier in this guide
@dataclass
class Document:
"""Document with metadata"""
id: str
text: str
embedding: np.ndarray
metadata: Dict = None
class SemanticSearchIndex:
"""In-memory semantic search index using cosine similarity"""
def __init__(self, model: str = "models/text-embedding-004"):
self.model = model
self.documents: List[Document] = []
def add_documents(self, texts: List[str], metadata: List[Dict] = None):
"""
Add documents to the index with embeddings.
Args:
texts: List of document texts
metadata: Optional metadata for each document
"""
# Generate embeddings with RETRIEVAL_DOCUMENT task type
embeddings = generate_embeddings_batch(
texts,
model=self.model,
task_type="RETRIEVAL_DOCUMENT"
)
# Store documents
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
doc_metadata = metadata[i] if metadata else {}
doc = Document(
id=f"doc_{len(self.documents)}",
text=text,
embedding=embedding,
metadata=doc_metadata
)
self.documents.append(doc)
print(f"Added {len(texts)} documents. Total: {len(self.documents)}")
def search(self, query: str, top_k: int = 5) -> List[Tuple[Document, float]]:
"""
Search for documents similar to query.
Args:
query: Search query text
top_k: Number of top results to return
Returns:
List of (Document, similarity_score) tuples sorted by relevance
"""
# Generate query embedding with RETRIEVAL_QUERY task type
query_embedding = generate_embedding(
query,
model=self.model,
task_type="RETRIEVAL_QUERY"
)
# Calculate cosine similarity with all documents
similarities = []
for doc in self.documents:
similarity = self._cosine_similarity(query_embedding, doc.embedding)
similarities.append((doc, similarity))
# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
@staticmethod
def _cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
"""Calculate cosine similarity between two vectors"""
dot_product = np.dot(vec1, vec2)
norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
return dot_product / norm_product
# Example usage
index = SemanticSearchIndex()
# Add technical documentation
documents = [
"Python is a high-level, interpreted programming language known for simplicity.",
"JavaScript is the primary language for web development and browser scripting.",
"Rust provides memory safety without garbage collection through ownership.",
"Go (Golang) is designed for concurrent programming and microservices.",
"TypeScript adds static typing to JavaScript for better tooling.",
"Kubernetes orchestrates containerized applications across clusters.",
"Docker containers package applications with their dependencies.",
"Terraform enables infrastructure as code for cloud resources."
]
index.add_documents(documents)
# Search
query = "What language is best for web browsers?"
results = index.search(query, top_k=3)
print(f"\nQuery: {query}\n")
for i, (doc, score) in enumerate(results, 1):
print(f"{i}. [Score: {score:.4f}] {doc.text}")
Vector Database Integration
ChromaDB (Local/Embedded)
ChromaDB provides a simple embedded vector database:
# chromadb_integration.py - ChromaDB vector database
from typing import Dict, List
import chromadb
from embeddings import generate_embedding, generate_embeddings_batch  # helpers defined earlier in this guide
class GeminiChromaDB:
"""Semantic search using Gemini embeddings + ChromaDB"""
def __init__(self, collection_name: str = "documents"):
# Initialize ChromaDB client (persistent storage)
self.client = chromadb.PersistentClient(path="./chroma_db")
# Create or get collection
self.collection = self.client.get_or_create_collection(
name=collection_name,
metadata={"description": "Gemini embeddings collection"}
)
def add_documents(
self,
texts: List[str],
ids: List[str] = None,
metadatas: List[Dict] = None
):
"""Add documents with Gemini embeddings to ChromaDB"""
# Generate embeddings
embeddings = generate_embeddings_batch(
texts,
task_type="RETRIEVAL_DOCUMENT"
)
# Generate IDs if not provided
if ids is None:
ids = [f"doc_{i}" for i in range(len(texts))]
# Add to ChromaDB
self.collection.add(
embeddings=embeddings.tolist(),
documents=texts,
ids=ids,
metadatas=metadatas
)
print(f"Added {len(texts)} documents to ChromaDB")
def search(
self,
query: str,
n_results: int = 5,
where: Dict = None
) -> Dict:
"""
Search for similar documents.
Args:
query: Search query text
n_results: Number of results to return
where: Metadata filters (e.g., {"category": "tech"})
Returns:
Dictionary with documents, distances, and metadata
"""
# Generate query embedding
query_embedding = generate_embedding(
query,
task_type="RETRIEVAL_QUERY"
)
# Query ChromaDB
results = self.collection.query(
query_embeddings=[query_embedding.tolist()],
n_results=n_results,
where=where
)
return results
# Example usage
chroma_db = GeminiChromaDB(collection_name="tech_docs")
# Add documents with metadata
documents = [
"Kubernetes pods are the smallest deployable units.",
"Docker containers provide isolation and portability.",
"Microservices architecture decomposes applications into services.",
"RESTful APIs use HTTP methods for CRUD operations.",
"GraphQL provides a query language for APIs."
]
metadata = [
{"category": "orchestration", "year": 2014},
{"category": "containerization", "year": 2013},
{"category": "architecture", "year": 2011},
{"category": "api", "year": 2000},
{"category": "api", "year": 2015}
]
chroma_db.add_documents(documents, metadatas=metadata)
# Search with metadata filter
results = chroma_db.search(
query="How to deploy containers?",
n_results=3,
where={"category": "orchestration"}
)
print("\nSearch Results:")
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
print(f"{i}. [Distance: {distance:.4f}] {doc}")
Pinecone (Managed Vector Database)
For production deployments at scale:
# pinecone_integration.py - Pinecone managed vector database
import os
from typing import Dict, List
from pinecone import Pinecone, ServerlessSpec
from embeddings import generate_embedding, generate_embeddings_batch  # helpers defined earlier in this guide
class GeminiPinecone:
"""Semantic search using Gemini embeddings + Pinecone"""
def __init__(
self,
api_key: str,
index_name: str,
dimension: int = 768,
metric: str = "cosine"
):
# Initialize Pinecone
pc = Pinecone(api_key=api_key)
# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
pc.create_index(
name=index_name,
dimension=dimension,
metric=metric,
spec=ServerlessSpec(
cloud='aws',
region='us-east-1'
)
)
print(f"Created Pinecone index: {index_name}")
self.index = pc.Index(index_name)
def upsert_documents(
self,
texts: List[str],
ids: List[str] = None,
metadata: List[Dict] = None
):
"""Upsert documents with Gemini embeddings to Pinecone"""
# Generate embeddings
embeddings = generate_embeddings_batch(
texts,
task_type="RETRIEVAL_DOCUMENT"
)
# Prepare vectors for Pinecone
vectors = []
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
vector_id = ids[i] if ids else f"doc_{i}"
vector_metadata = metadata[i] if metadata else {}
vector_metadata['text'] = text # Store original text
vectors.append({
"id": vector_id,
"values": embedding.tolist(),
"metadata": vector_metadata
})
# Upsert to Pinecone (batch size 100)
for i in range(0, len(vectors), 100):
batch = vectors[i:i+100]
self.index.upsert(vectors=batch)
print(f"Upserted {len(texts)} documents to Pinecone")
def search(
self,
query: str,
top_k: int = 5,
filter: Dict = None
) -> Dict:
"""
Search for similar documents.
Args:
query: Search query text
top_k: Number of results
filter: Metadata filters (e.g., {"category": {"$eq": "tech"}})
Returns:
Dictionary with matches, scores, and metadata
"""
# Generate query embedding
query_embedding = generate_embedding(
query,
task_type="RETRIEVAL_QUERY"
)
# Query Pinecone
results = self.index.query(
vector=query_embedding.tolist(),
top_k=top_k,
filter=filter,
include_metadata=True
)
return results
# Example usage
pinecone_db = GeminiPinecone(
api_key=os.environ["PINECONE_API_KEY"],
index_name="gemini-embeddings"
)
# Add documents
pinecone_db.upsert_documents(
texts=documents,
metadata=metadata
)
# Search
results = pinecone_db.search("container orchestration", top_k=3)
for match in results['matches']:
print(f"Score: {match['score']:.4f} - {match['metadata']['text']}")
Retrieval-Augmented Generation (RAG)
Combine semantic search with generative AI:
# rag_system.py - RAG implementation with Gemini
from typing import Any, Dict, List
import google.generativeai as genai
from semantic_search import SemanticSearchIndex  # defined earlier in this guide
class GeminiRAG:
"""RAG system using Gemini embeddings + Gemini Pro"""
def __init__(
self,
embedding_model: str = "models/text-embedding-004",
generation_model: str = "gemini-pro"
):
self.search_index = SemanticSearchIndex(model=embedding_model)
self.generation_model = genai.GenerativeModel(generation_model)
def add_knowledge_base(self, documents: List[str], metadata: List[Dict] = None):
"""Add documents to the knowledge base"""
self.search_index.add_documents(documents, metadata)
def query(
self,
question: str,
top_k: int = 3,
max_tokens: int = 500
    ) -> Dict[str, Any]:
"""
Answer question using RAG pattern.
Steps:
1. Retrieve relevant documents via semantic search
2. Construct prompt with retrieved context
3. Generate answer using Gemini Pro
Returns:
Dictionary with answer, sources, and relevance scores
"""
# Step 1: Retrieve relevant documents
search_results = self.search_index.search(question, top_k=top_k)
# Extract context from top results
context_docs = []
sources = []
for doc, score in search_results:
context_docs.append(doc.text)
sources.append({
"text": doc.text,
"score": float(score),
"metadata": doc.metadata
})
# Step 2: Construct RAG prompt
context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context_docs)])
prompt = f"""You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain relevant information, say so.
Context:
{context}
Question: {question}
Answer:"""
# Step 3: Generate answer
response = self.generation_model.generate_content(
prompt,
generation_config=genai.GenerationConfig(
max_output_tokens=max_tokens,
temperature=0.2 # Lower temperature for factual answers
)
)
return {
"answer": response.text,
"sources": sources,
"question": question
}
# Example usage
rag_system = GeminiRAG()
# Add knowledge base
knowledge_base = [
"Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). It was first released in 2014.",
"Docker containers package applications with their dependencies, ensuring consistency across environments. Docker was released in 2013.",
"Microservices architecture breaks applications into small, independently deployable services. Each service handles a specific business capability.",
"RESTful APIs use HTTP methods: GET (retrieve), POST (create), PUT (update), DELETE (remove). REST was defined in Roy Fielding's 2000 dissertation.",
"GraphQL was developed by Facebook in 2012 and open-sourced in 2015. It allows clients to request exactly the data they need."
]
rag_system.add_knowledge_base(knowledge_base)
# Ask question
result = rag_system.query("When was Kubernetes first released and who created it?")
print(f"Question: {result['question']}\n")
print(f"Answer: {result['answer']}\n")
print("Sources:")
for i, source in enumerate(result['sources'], 1):
print(f"{i}. [Relevance: {source['score']:.4f}] {source['text'][:100]}...")
Document Clustering
Group similar documents automatically:
# clustering.py - Document clustering with Gemini embeddings
from typing import Dict, List
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from embeddings import generate_embeddings_batch  # helper defined earlier in this guide
def cluster_documents(
texts: List[str],
n_clusters: int = 3,
model: str = "models/text-embedding-004"
) -> Dict[int, List[str]]:
"""
Cluster documents using K-Means on Gemini embeddings.
Returns:
Dictionary mapping cluster_id -> list of documents
"""
# Generate embeddings
embeddings = generate_embeddings_batch(
texts,
task_type="CLUSTERING"
)
# K-Means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(embeddings)
# Calculate silhouette score (measure of cluster quality)
silhouette_avg = silhouette_score(embeddings, cluster_labels)
print(f"Silhouette Score: {silhouette_avg:.4f} (higher is better)")
# Group documents by cluster
clusters = {}
for i, label in enumerate(cluster_labels):
if label not in clusters:
clusters[label] = []
clusters[label].append(texts[i])
return clusters
# Example: Cluster technical documents
documents = [
# Programming languages
"Python is great for data science and machine learning",
"JavaScript runs in web browsers and Node.js servers",
"Go is designed for concurrent systems programming",
# Databases
"PostgreSQL is a powerful relational database",
"MongoDB stores documents in JSON-like format",
"Redis is an in-memory key-value store",
# Cloud platforms
"AWS provides scalable cloud infrastructure",
"Google Cloud Platform offers AI/ML services",
"Azure integrates with Microsoft ecosystem"
]
clusters = cluster_documents(documents, n_clusters=3)
for cluster_id, docs in clusters.items():
print(f"\nCluster {cluster_id}:")
for doc in docs:
print(f" - {doc}")
Cost Optimization and Best Practices
Dimension Reduction for Storage Savings
def compare_dimensions():
    """Compare storage cost at different embedding dimensions (assumes float32 storage, the typical vector database format)"""
    text = "Sample text for dimension comparison"
    dimensions = [256, 512, 768]
    for dim in dimensions:
        embedding = generate_embedding(
            text,
            output_dimensionality=dim
        )
        # Cast to float32 (4 bytes per dimension), which is how most vector databases store vectors
        storage_bytes = embedding.astype(np.float32).nbytes
        storage_gb_per_million = (storage_bytes * 1_000_000) / (1024 ** 3)
        print(f"Dimensions: {dim}")
        print(f"  Storage: {storage_bytes} bytes per embedding (float32)")
        print(f"  Storage for 1M embeddings: {storage_gb_per_million:.2f} GB")
        print()
compare_dimensions()
# Output (float32 storage):
# Dimensions: 256 - ~1 GB per 1M embeddings (cheapest)
# Dimensions: 512 - ~2 GB per 1M embeddings
# Dimensions: 768 - ~3 GB per 1M embeddings (highest quality)
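Depending on the model and the requested dimensionality, returned vectors are not guaranteed to be unit-length. L2-normalizing them before storage is a cheap precaution: after normalization, cosine similarity is just a dot product, which is the metric most vector databases compute fastest. A minimal sketch:
def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so that cosine similarity equals the dot product"""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
# Example: normalize a reduced-dimension embedding before storing it
reduced = generate_embedding("Sample text for dimension comparison", output_dimensionality=256)
print(f"Norm before: {np.linalg.norm(reduced):.4f}, after: {np.linalg.norm(l2_normalize(reduced)):.4f}")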
Caching for Repeated Queries
from functools import lru_cache

@lru_cache(maxsize=1000)
def _cached_embedding(text: str, task_type: str) -> tuple:
    """Generate and cache an embedding; tuples are hashable and immutable, so they work cleanly with lru_cache"""
    return tuple(generate_embedding(text, task_type=task_type))

def get_cached_embedding(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> np.ndarray:
    """Generate embedding with in-memory caching for frequently repeated texts"""
    return np.array(_cached_embedding(text, task_type))
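lru_cache lives only in memory and only within one process. For pipelines that re-embed the same documents across runs, a disk-backed cache keyed by a content hash avoids repeated API calls entirely; a sketch, assuming a local ./embedding_cache directory:
import hashlib
from pathlib import Path

CACHE_DIR = Path("./embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_persistent_embedding(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> np.ndarray:
    """Disk-backed embedding cache keyed by a hash of the text and task type"""
    key = hashlib.sha256(f"{task_type}:{text}".encode("utf-8")).hexdigest()
    cache_path = CACHE_DIR / f"{key}.npy"
    if cache_path.exists():
        return np.load(cache_path)
    embedding = generate_embedding(text, task_type=task_type)
    np.save(cache_path, embedding)
    return embedding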
Production Deployment Considerations
Rate Limiting
Google Generative AI API has rate limits:
- Free tier: 1,500 requests per day
- Paid tier: 10,000 requests per minute (RPM)
Implement retries with exponential backoff (quota errors, HTTP 429, are typically raised as google.api_core.exceptions.ResourceExhausted):
from google.api_core import exceptions as gexceptions
from google.api_core import retry

@retry.Retry(predicate=retry.if_exception_type(gexceptions.ResourceExhausted))
def generate_embedding_with_retry(text: str, **kwargs) -> np.ndarray:
    """Generate embedding, automatically retrying 429 (quota) errors with exponential backoff"""
    return generate_embedding(text, **kwargs)
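The Retry decorator reacts to errors after they happen; for batch jobs it is usually cheaper to stay under quota in the first place with a client-side throttle. A minimal sketch (the 100 requests/minute value is illustrative; set it from your tier's actual quota):
import time

class SimpleRateLimiter:
    """Naive client-side throttle: allow at most max_calls within any rolling period (seconds)"""
    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.call_times = []  # timestamps of recent calls

    def wait(self):
        """Block until another call is allowed under the limit"""
        now = time.monotonic()
        self.call_times = [t for t in self.call_times if now - t < self.period]
        if len(self.call_times) >= self.max_calls:
            time.sleep(self.period - (now - self.call_times[0]))
        self.call_times.append(time.monotonic())

# Example: cap embedding calls at 100 per minute (illustrative value)
limiter = SimpleRateLimiter(max_calls=100, period=60.0)
for doc in documents:
    limiter.wait()
    _ = generate_embedding_with_retry(doc)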
Monitoring and Observability
Track embedding generation metrics:
import logging
from datetime import datetime
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def generate_embedding_monitored(text: str, **kwargs) -> np.ndarray:
"""Generate embedding with monitoring"""
start_time = datetime.now()
try:
embedding = generate_embedding(text, **kwargs)
# Log success
duration_ms = (datetime.now() - start_time).total_seconds() * 1000
logger.info(f"Embedding generated: {len(text)} chars, {duration_ms:.2f}ms")
return embedding
except Exception as e:
logger.error(f"Embedding failed: {str(e)}")
raise
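For dashboards, per-call logs are usually rolled up into counters and latency averages. A minimal in-process sketch (in production, swap this for your metrics backend, such as Prometheus or Cloud Monitoring):
class EmbeddingMetrics:
    """Minimal in-process counters for embedding calls"""
    def __init__(self):
        self.calls = 0
        self.errors = 0
        self.total_ms = 0.0

    def record(self, duration_ms: float, ok: bool = True):
        self.calls += 1
        self.total_ms += duration_ms
        if not ok:
            self.errors += 1

    def summary(self) -> str:
        avg = self.total_ms / self.calls if self.calls else 0.0
        return f"calls={self.calls} errors={self.errors} avg_latency={avg:.1f}ms"

# Example usage alongside generate_embedding_monitored
metrics = EmbeddingMetrics()
metrics.record(duration_ms=42.0, ok=True)
logger.info(metrics.summary())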
Conclusion and Resources
This guide covered practical implementation of Gemini text embeddings, from API integration through production RAG systems and vector database deployment. Key takeaways:
- Gemini embeddings provide task-specific optimization for retrieval, similarity, and clustering
- Configurable dimensions enable storage/performance tradeoffs
- ChromaDB and Pinecone offer complementary vector database solutions
- RAG systems combine semantic search with generative AI for accurate, grounded responses
- Production deployment requires rate limiting, caching, and monitoring
Further Resources:
- Google Generative AI Docs: https://ai.google.dev/docs/embeddings
- Gemini API Quickstart: https://ai.google.dev/gemini-api/docs/quickstart
- ChromaDB Documentation: https://docs.trychroma.com/
- Pinecone Documentation: https://docs.pinecone.io/
- RAG Patterns: https://cloud.google.com/blog/products/ai-machine-learning/rag-patterns (Google Cloud blog)