Implementing Gemini Text Embeddings for Production Applications

Note: This guide is based on the Google Generative AI API documentation, the Gemini embedding model specifications (text-embedding-004), and documented RAG (Retrieval-Augmented Generation) patterns. All code examples use the official google-generativeai Python SDK and follow Google Cloud best practices.

Text embeddings transform text into dense vector representations that capture semantic meaning, enabling applications such as semantic search, document clustering, and Retrieval-Augmented Generation (RAG). Google’s Gemini embedding models, particularly text-embedding-004, provide strong retrieval performance with configurable output dimensions and task-specific optimization.

This guide demonstrates practical implementation of Gemini embeddings for production applications, from API integration through vector database deployment and RAG systems.

Prerequisites

Required Knowledge:

  • Python 3.9+ and async programming concepts
  • Basic understanding of vector similarity (cosine similarity, dot product)
  • Familiarity with REST APIs and API keys
  • Optional: Experience with vector databases or semantic search

Required Tools:

# Install Google Generative AI SDK
pip install google-generativeai==0.4.0

# Vector database clients
pip install chromadb==0.4.22        # Local/embedded vector DB
pip install pinecone-client==3.0.0  # Managed vector DB

# Data processing and utilities
pip install numpy==1.24.0
pip install pandas==2.1.0
pip install scikit-learn==1.3.0

# Optional: For visualization
pip install matplotlib==3.8.0 umap-learn==0.5.5

API Access:

  • Generate an API key in Google AI Studio and export it as the GOOGLE_API_KEY environment variable; the configuration code below reads it from os.environ.

Gemini Embedding Models Overview

Model Specifications

Google’s Gemini embedding models as of March 2025:

  • text-embedding-004: 768 dimensions by default (256 or 512 configurable); 2048-token input limit; task types: retrieval, similarity, classification, clustering; free tier of 1,500 requests/day
  • text-multilingual-embedding-002: 768 dimensions; 2048-token input limit; multilingual (100+ languages); free tier of 1,500 requests/day

Key Features:

  • Task-specific optimization: Optimize for retrieval, similarity, classification, or clustering
  • Configurable dimensions: Trade accuracy for storage/speed with dimension reduction
  • Batch processing: Process up to 100 texts per API call
  • Multilingual support: Native support for 100+ languages

Comparison with Other Embedding Models

  • Gemini text-embedding-004 (768 dimensions): task-specific optimization and a free tier; best for general-purpose use, RAG, and semantic search
  • OpenAI text-embedding-3-small (1536 dimensions): high performance; best for production applications
  • OpenAI text-embedding-3-large (3072 dimensions): highest quality; best for critical applications
  • Cohere embed-english-v3.0 (1024 dimensions): strong English performance; best for English-only applications

API Setup and Authentication

Basic Configuration

# config.py - Google Generative AI setup
import google.generativeai as genai
import os

# Configure API key
API_KEY = os.environ.get("GOOGLE_API_KEY")
if not API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable not set")

genai.configure(api_key=API_KEY)

# List available embedding models
def list_embedding_models():
    """Display available embedding models and their details"""
    for model in genai.list_models():
        if 'embedContent' in model.supported_generation_methods:
            print(f"Model: {model.name}")
            print(f"  Display Name: {model.display_name}")
            print(f"  Description: {model.description}")
            print(f"  Input Token Limit: {model.input_token_limit}")
            print()

list_embedding_models()

Generate Embeddings

# embeddings.py - Core embedding generation
from typing import List

import google.generativeai as genai  # assumes genai.configure(...) has already run (see config.py)
import numpy as np

def generate_embedding(
    text: str,
    model: str = "models/text-embedding-004",
    task_type: str = "RETRIEVAL_DOCUMENT",
    output_dimensionality: int = 768
) -> np.ndarray:
    """
    Generate embedding for a single text.

    Args:
        text: Input text to embed
        model: Gemini embedding model name
        task_type: Optimization task type:
            - RETRIEVAL_QUERY: Optimize for search queries
            - RETRIEVAL_DOCUMENT: Optimize for indexed documents
            - SEMANTIC_SIMILARITY: General similarity comparison
            - CLASSIFICATION: Optimize for text classification
            - CLUSTERING: Optimize for clustering tasks
        output_dimensionality: Output vector dimensions (256, 512, or 768)

    Returns:
        NumPy array of embedding vector
    """
    result = genai.embed_content(
        model=model,
        content=text,
        task_type=task_type,
        output_dimensionality=output_dimensionality
    )

    return np.array(result['embedding'])

# Example usage
text = "Gemini embedding models provide state-of-the-art semantic understanding."
embedding = generate_embedding(text)

print(f"Text: {text}")
print(f"Embedding shape: {embedding.shape}")
print(f"First 5 dimensions: {embedding[:5]}")

Batch Embedding Generation

Process multiple texts efficiently:

def generate_embeddings_batch(
    texts: List[str],
    model: str = "models/text-embedding-004",
    task_type: str = "RETRIEVAL_DOCUMENT",
    output_dimensionality: int = 768
) -> np.ndarray:
    """
    Generate embeddings for multiple texts in a single API call.

    Args:
        texts: List of texts to embed (max 100 per batch)
        model: Gemini embedding model name
        task_type: Optimization task type
        output_dimensionality: Output vector dimensions

    Returns:
        NumPy array of shape (len(texts), output_dimensionality)
    """
    # API supports max 100 texts per request
    if len(texts) > 100:
        # Process in chunks
        embeddings = []
        for i in range(0, len(texts), 100):
            batch = texts[i:i+100]
            batch_embeddings = generate_embeddings_batch(batch, model, task_type, output_dimensionality)
            embeddings.append(batch_embeddings)
        return np.vstack(embeddings)

    result = genai.embed_content(
        model=model,
        content=texts,
        task_type=task_type,
        output_dimensionality=output_dimensionality
    )

    return np.array([embed for embed in result['embedding']])

# Example: Batch processing
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with multiple layers.",
    "Natural language processing enables computers to understand human language.",
    "Computer vision allows machines to interpret visual information."
]

embeddings = generate_embeddings_batch(documents)
print(f"Generated {len(embeddings)} embeddings with shape {embeddings[0].shape}")

Semantic Search Implementation

Build Document Index

# semantic_search.py - Complete semantic search system
from typing import Dict, List, Tuple
from dataclasses import dataclass

import numpy as np

from embeddings import generate_embedding, generate_embeddings_batch

@dataclass
class Document:
    """Document with metadata"""
    id: str
    text: str
    embedding: np.ndarray
    metadata: Dict = None

class SemanticSearchIndex:
    """In-memory semantic search index using cosine similarity"""

    def __init__(self, model: str = "models/text-embedding-004"):
        self.model = model
        self.documents: List[Document] = []

    def add_documents(self, texts: List[str], metadata: List[Dict] = None):
        """
        Add documents to the index with embeddings.

        Args:
            texts: List of document texts
            metadata: Optional metadata for each document
        """
        # Generate embeddings with RETRIEVAL_DOCUMENT task type
        embeddings = generate_embeddings_batch(
            texts,
            model=self.model,
            task_type="RETRIEVAL_DOCUMENT"
        )

        # Store documents
        for i, (text, embedding) in enumerate(zip(texts, embeddings)):
            doc_metadata = metadata[i] if metadata else {}
            doc = Document(
                id=f"doc_{len(self.documents)}",
                text=text,
                embedding=embedding,
                metadata=doc_metadata
            )
            self.documents.append(doc)

        print(f"Added {len(texts)} documents. Total: {len(self.documents)}")

    def search(self, query: str, top_k: int = 5) -> List[Tuple[Document, float]]:
        """
        Search for documents similar to query.

        Args:
            query: Search query text
            top_k: Number of top results to return

        Returns:
            List of (Document, similarity_score) tuples sorted by relevance
        """
        # Generate query embedding with RETRIEVAL_QUERY task type
        query_embedding = generate_embedding(
            query,
            model=self.model,
            task_type="RETRIEVAL_QUERY"
        )

        # Calculate cosine similarity with all documents
        similarities = []
        for doc in self.documents:
            similarity = self._cosine_similarity(query_embedding, doc.embedding)
            similarities.append((doc, similarity))

        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)

        return similarities[:top_k]

    @staticmethod
    def _cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity between two vectors"""
        dot_product = np.dot(vec1, vec2)
        norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
        return dot_product / norm_product

# Example usage
index = SemanticSearchIndex()

# Add technical documentation
documents = [
    "Python is a high-level, interpreted programming language known for simplicity.",
    "JavaScript is the primary language for web development and browser scripting.",
    "Rust provides memory safety without garbage collection through ownership.",
    "Go (Golang) is designed for concurrent programming and microservices.",
    "TypeScript adds static typing to JavaScript for better tooling.",
    "Kubernetes orchestrates containerized applications across clusters.",
    "Docker containers package applications with their dependencies.",
    "Terraform enables infrastructure as code for cloud resources."
]

index.add_documents(documents)

# Search
query = "What language is best for web browsers?"
results = index.search(query, top_k=3)

print(f"\nQuery: {query}\n")
for i, (doc, score) in enumerate(results, 1):
    print(f"{i}. [Score: {score:.4f}] {doc.text}")

Vector Database Integration

ChromaDB (Local/Embedded)

ChromaDB provides a simple embedded vector database:

# chromadb_integration.py - ChromaDB vector database
from typing import Dict, List

import chromadb

from embeddings import generate_embedding, generate_embeddings_batch

class GeminiChromaDB:
    """Semantic search using Gemini embeddings + ChromaDB"""

    def __init__(self, collection_name: str = "documents"):
        # Initialize ChromaDB client (persistent storage)
        self.client = chromadb.PersistentClient(path="./chroma_db")

        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"description": "Gemini embeddings collection"}
        )

    def add_documents(
        self,
        texts: List[str],
        ids: List[str] = None,
        metadatas: List[Dict] = None
    ):
        """Add documents with Gemini embeddings to ChromaDB"""

        # Generate embeddings
        embeddings = generate_embeddings_batch(
            texts,
            task_type="RETRIEVAL_DOCUMENT"
        )

        # Generate IDs if not provided
        if ids is None:
            ids = [f"doc_{i}" for i in range(len(texts))]

        # Add to ChromaDB
        self.collection.add(
            embeddings=embeddings.tolist(),
            documents=texts,
            ids=ids,
            metadatas=metadatas
        )

        print(f"Added {len(texts)} documents to ChromaDB")

    def search(
        self,
        query: str,
        n_results: int = 5,
        where: Dict = None
    ) -> Dict:
        """
        Search for similar documents.

        Args:
            query: Search query text
            n_results: Number of results to return
            where: Metadata filters (e.g., {"category": "tech"})

        Returns:
            Dictionary with documents, distances, and metadata
        """
        # Generate query embedding
        query_embedding = generate_embedding(
            query,
            task_type="RETRIEVAL_QUERY"
        )

        # Query ChromaDB
        results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_results,
            where=where
        )

        return results

# Example usage
chroma_db = GeminiChromaDB(collection_name="tech_docs")

# Add documents with metadata
documents = [
    "Kubernetes pods are the smallest deployable units.",
    "Docker containers provide isolation and portability.",
    "Microservices architecture decomposes applications into services.",
    "RESTful APIs use HTTP methods for CRUD operations.",
    "GraphQL provides a query language for APIs."
]

metadata = [
    {"category": "orchestration", "year": 2014},
    {"category": "containerization", "year": 2013},
    {"category": "architecture", "year": 2011},
    {"category": "api", "year": 2000},
    {"category": "api", "year": 2015}
]

chroma_db.add_documents(documents, metadatas=metadata)

# Search with metadata filter
results = chroma_db.search(
    query="How to deploy containers?",
    n_results=3,
    where={"category": "orchestration"}
)

print("\nSearch Results:")
for i, (doc, distance) in enumerate(zip(results['documents'][0], results['distances'][0]), 1):
    print(f"{i}. [Distance: {distance:.4f}] {doc}")

Pinecone (Managed Vector Database)

For production deployments with scale:

# pinecone_integration.py - Pinecone managed vector database
import os
from typing import Dict, List

from pinecone import Pinecone, ServerlessSpec

from embeddings import generate_embedding, generate_embeddings_batch

class GeminiPinecone:
    """Semantic search using Gemini embeddings + Pinecone"""

    def __init__(
        self,
        api_key: str,
        index_name: str,
        dimension: int = 768,
        metric: str = "cosine"
    ):
        # Initialize Pinecone
        pc = Pinecone(api_key=api_key)

        # Create index if it doesn't exist
        if index_name not in pc.list_indexes().names():
            pc.create_index(
                name=index_name,
                dimension=dimension,
                metric=metric,
                spec=ServerlessSpec(
                    cloud='aws',
                    region='us-east-1'
                )
            )
            print(f"Created Pinecone index: {index_name}")

        self.index = pc.Index(index_name)

    def upsert_documents(
        self,
        texts: List[str],
        ids: List[str] = None,
        metadata: List[Dict] = None
    ):
        """Upsert documents with Gemini embeddings to Pinecone"""

        # Generate embeddings
        embeddings = generate_embeddings_batch(
            texts,
            task_type="RETRIEVAL_DOCUMENT"
        )

        # Prepare vectors for Pinecone
        vectors = []
        for i, (text, embedding) in enumerate(zip(texts, embeddings)):
            vector_id = ids[i] if ids else f"doc_{i}"
            vector_metadata = metadata[i] if metadata else {}
            vector_metadata['text'] = text  # Store original text

            vectors.append({
                "id": vector_id,
                "values": embedding.tolist(),
                "metadata": vector_metadata
            })

        # Upsert to Pinecone (batch size 100)
        for i in range(0, len(vectors), 100):
            batch = vectors[i:i+100]
            self.index.upsert(vectors=batch)

        print(f"Upserted {len(texts)} documents to Pinecone")

    def search(
        self,
        query: str,
        top_k: int = 5,
        filter: Dict = None
    ) -> Dict:
        """
        Search for similar documents.

        Args:
            query: Search query text
            top_k: Number of results
            filter: Metadata filters (e.g., {"category": {"$eq": "tech"}})

        Returns:
            Dictionary with matches, scores, and metadata
        """
        # Generate query embedding
        query_embedding = generate_embedding(
            query,
            task_type="RETRIEVAL_QUERY"
        )

        # Query Pinecone
        results = self.index.query(
            vector=query_embedding.tolist(),
            top_k=top_k,
            filter=filter,
            include_metadata=True
        )

        return results

# Example usage
pinecone_db = GeminiPinecone(
    api_key=os.environ["PINECONE_API_KEY"],
    index_name="gemini-embeddings"
)

# Add documents
pinecone_db.upsert_documents(
    texts=documents,
    metadata=metadata
)

# Search
results = pinecone_db.search("container orchestration", top_k=3)
for match in results['matches']:
    print(f"Score: {match['score']:.4f} - {match['metadata']['text']}")

Retrieval-Augmented Generation (RAG)

Combine semantic search with generative AI:

# rag_system.py - RAG implementation with Gemini
from typing import Any, Dict, List

import google.generativeai as genai

from semantic_search import SemanticSearchIndex

class GeminiRAG:
    """RAG system using Gemini embeddings + Gemini Pro"""

    def __init__(
        self,
        embedding_model: str = "models/text-embedding-004",
        generation_model: str = "gemini-pro"
    ):
        self.search_index = SemanticSearchIndex(model=embedding_model)
        self.generation_model = genai.GenerativeModel(generation_model)

    def add_knowledge_base(self, documents: List[str], metadata: List[Dict] = None):
        """Add documents to the knowledge base"""
        self.search_index.add_documents(documents, metadata)

    def query(
        self,
        question: str,
        top_k: int = 3,
        max_tokens: int = 500
    ) -> Dict[str, Any]:
        """
        Answer question using RAG pattern.

        Steps:
        1. Retrieve relevant documents via semantic search
        2. Construct prompt with retrieved context
        3. Generate answer using Gemini Pro

        Returns:
            Dictionary with answer, sources, and relevance scores
        """
        # Step 1: Retrieve relevant documents
        search_results = self.search_index.search(question, top_k=top_k)

        # Extract context from top results
        context_docs = []
        sources = []
        for doc, score in search_results:
            context_docs.append(doc.text)
            sources.append({
                "text": doc.text,
                "score": float(score),
                "metadata": doc.metadata
            })

        # Step 2: Construct RAG prompt
        context = "\n\n".join([f"[{i+1}] {doc}" for i, doc in enumerate(context_docs)])

        prompt = f"""You are a helpful assistant. Answer the question based on the provided context. If the context doesn't contain relevant information, say so.

Context:
{context}

Question: {question}

Answer:"""

        # Step 3: Generate answer
        response = self.generation_model.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                max_output_tokens=max_tokens,
                temperature=0.2  # Lower temperature for factual answers
            )
        )

        return {
            "answer": response.text,
            "sources": sources,
            "question": question
        }

# Example usage
rag_system = GeminiRAG()

# Add knowledge base
knowledge_base = [
    "Kubernetes was originally developed by Google and is now maintained by the Cloud Native Computing Foundation (CNCF). It was first released in 2014.",
    "Docker containers package applications with their dependencies, ensuring consistency across environments. Docker was released in 2013.",
    "Microservices architecture breaks applications into small, independently deployable services. Each service handles a specific business capability.",
    "RESTful APIs use HTTP methods: GET (retrieve), POST (create), PUT (update), DELETE (remove). REST was defined in Roy Fielding's 2000 dissertation.",
    "GraphQL was developed by Facebook in 2012 and open-sourced in 2015. It allows clients to request exactly the data they need."
]

rag_system.add_knowledge_base(knowledge_base)

# Ask question
result = rag_system.query("When was Kubernetes first released and who created it?")

print(f"Question: {result['question']}\n")
print(f"Answer: {result['answer']}\n")
print("Sources:")
for i, source in enumerate(result['sources'], 1):
    print(f"{i}. [Relevance: {source['score']:.4f}] {source['text'][:100]}...")

Document Clustering

Group similar documents automatically:

# clustering.py - Document clustering with Gemini embeddings
from typing import Dict, List

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from embeddings import generate_embeddings_batch

def cluster_documents(
    texts: List[str],
    n_clusters: int = 3,
    model: str = "models/text-embedding-004"
) -> Dict[int, List[str]]:
    """
    Cluster documents using K-Means on Gemini embeddings.

    Returns:
        Dictionary mapping cluster_id -> list of documents
    """
    # Generate embeddings
    embeddings = generate_embeddings_batch(
        texts,
        task_type="CLUSTERING"
    )

    # K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(embeddings)

    # Calculate silhouette score (measure of cluster quality)
    silhouette_avg = silhouette_score(embeddings, cluster_labels)
    print(f"Silhouette Score: {silhouette_avg:.4f} (higher is better)")

    # Group documents by cluster
    clusters = {}
    for i, label in enumerate(cluster_labels):
        if label not in clusters:
            clusters[label] = []
        clusters[label].append(texts[i])

    return clusters

# Example: Cluster technical documents
documents = [
    # Programming languages
    "Python is great for data science and machine learning",
    "JavaScript runs in web browsers and Node.js servers",
    "Go is designed for concurrent systems programming",

    # Databases
    "PostgreSQL is a powerful relational database",
    "MongoDB stores documents in JSON-like format",
    "Redis is an in-memory key-value store",

    # Cloud platforms
    "AWS provides scalable cloud infrastructure",
    "Google Cloud Platform offers AI/ML services",
    "Azure integrates with Microsoft ecosystem"
]

clusters = cluster_documents(documents, n_clusters=3)

for cluster_id, docs in clusters.items():
    print(f"\nCluster {cluster_id}:")
    for doc in docs:
        print(f"  - {doc}")

Cost Optimization and Best Practices

Dimension Reduction for Storage Savings

def compare_dimensions():
    """Compare embedding storage requirements at different output dimensions"""
    text = "Sample text for dimension comparison"

    dimensions = [256, 512, 768]
    for dim in dimensions:
        embedding = generate_embedding(
            text,
            output_dimensionality=dim
        ).astype(np.float32)  # vector databases typically store embeddings as float32

        # Calculate storage size
        storage_bytes = embedding.nbytes
        storage_gb_per_million = (storage_bytes * 1_000_000) / (1024 ** 3)

        print(f"Dimensions: {dim}")
        print(f"  Storage: {storage_bytes} bytes per embedding")
        print(f"  Storage for 1M embeddings: {storage_gb_per_million:.2f} GB")
        print()

compare_dimensions()
# Approximate output (float32 storage):
# Dimensions: 256 - ~1 GB per 1M embeddings (cheapest)
# Dimensions: 512 - ~2 GB per 1M embeddings
# Dimensions: 768 - ~3 GB per 1M embeddings (highest quality)
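
One caveat when reducing output_dimensionality: truncated embeddings are not guaranteed to be unit-length, so cosine similarity scores can drift unless the vectors are re-normalized first. A minimal sketch of the precaution (the normalize_embedding helper is illustrative, not part of the SDK):

import numpy as np

def normalize_embedding(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

reduced = normalize_embedding(generate_embedding("Sample text", output_dimensionality=256))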

Caching for Repeated Queries

from functools import lru_cache

@lru_cache(maxsize=1000)
def _cached_embedding(text: str, task_type: str) -> tuple:
    """Compute and cache the embedding for a text/task pair (tuples are hashable and immutable)"""
    return tuple(generate_embedding(text, task_type=task_type))

def get_cached_embedding(text: str, task_type: str = "RETRIEVAL_DOCUMENT") -> np.ndarray:
    """Generate embedding with caching for frequently repeated texts"""
    return np.array(_cached_embedding(text, task_type))

Production Deployment Considerations

Rate Limiting

The Google Generative AI API enforces rate limits, which vary by model and billing tier (check the current quota documentation for exact values):

  • Free tier: 1,500 requests per day
  • Paid tier: 10,000 requests per minute (RPM)

Implement backoff retry:

from google.api_core import exceptions, retry

@retry.Retry(
    predicate=retry.if_exception_type(exceptions.ResourceExhausted),  # HTTP 429: rate limit / quota exceeded
    initial=1.0,     # first backoff delay in seconds
    maximum=60.0,    # cap individual waits at one minute
    multiplier=2.0,  # exponential backoff
    timeout=300.0    # give up after five minutes of retrying
)
def generate_embedding_with_retry(text: str, **kwargs) -> np.ndarray:
    """Generate embedding with exponential backoff on rate-limit errors"""
    return generate_embedding(text, **kwargs)

Monitoring and Observability

Track embedding generation metrics:

import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def generate_embedding_monitored(text: str, **kwargs) -> np.ndarray:
    """Generate embedding with monitoring"""
    start_time = datetime.now()

    try:
        embedding = generate_embedding(text, **kwargs)

        # Log success
        duration_ms = (datetime.now() - start_time).total_seconds() * 1000
        logger.info(f"Embedding generated: {len(text)} chars, {duration_ms:.2f}ms")

        return embedding

    except Exception as e:
        logger.error(f"Embedding failed: {str(e)}")
        raise

Conclusion and Resources

This guide covered practical implementation of Gemini text embeddings, from API integration through production RAG systems and vector database deployment. Key takeaways:

  • Gemini embeddings provide task-specific optimization for retrieval, similarity, and clustering
  • Configurable dimensions enable storage/performance tradeoffs
  • ChromaDB and Pinecone offer complementary vector database solutions
  • RAG systems combine semantic search with generative AI for accurate, grounded responses
  • Production deployment requires rate limiting, caching, and monitoring

Further Resources: