Implementing GenAIOps on Azure: A Practical Guide

Note: This guide is based on official Azure documentation, Azure OpenAI Service API specifications, and Azure Machine Learning MLOps patterns. All code examples use current Azure SDK versions (openai 1.0+ for Azure OpenAI, azure-ai-ml 1.12+, azure-identity 1.14+) and follow documented Azure best practices.

GenAIOps (Generative AI Operations) applies MLOps principles to generative AI systems, focusing on deployment, monitoring, versioning, and governance of large language models (LLMs). Azure provides a comprehensive platform for GenAIOps through Azure OpenAI Service, Azure Machine Learning, and supporting infrastructure services.

This guide demonstrates practical implementation of GenAIOps pipelines on Azure, from model deployment through production monitoring and cost optimization.

Prerequisites

Required Knowledge:

  • Azure subscription with access to Azure OpenAI Service (requires application approval)
  • Python 3.8+ and familiarity with async programming
  • Basic understanding of REST APIs and authentication patterns
  • Optional: Experience with Azure Machine Learning or MLOps concepts

Required Tools and SDKs:

# Install Azure SDKs
pip install "openai>=1.0.0"  # Includes the AzureOpenAI class
pip install azure-ai-ml==1.12.0
pip install azure-identity==1.14.0
pip install azure-monitor-opentelemetry-exporter==1.0.0b21
pip install opentelemetry-instrumentation==0.42b0
pip install opentelemetry-instrumentation-fastapi==0.42b0  # Used by the FastAPI example later in this guide

# Optional: Azure CLI for infrastructure management
curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login

Azure Resources Required:

  • Azure OpenAI Service resource (requires approval: https://aka.ms/oai/access)
  • Azure Machine Learning workspace (for experiment tracking)
  • Azure Key Vault (for secure credential storage)
  • Azure Monitor + Application Insights (for telemetry)
  • Azure Kubernetes Service (for scalable deployment)

Azure OpenAI Service Fundamentals

Service Architecture

Azure OpenAI Service provides managed access to OpenAI models through Azure infrastructure:

Model Family      | Use Case                        | Token Limits    | Pricing Tier
------------------|---------------------------------|-----------------|------------------------------------
GPT-4 Turbo       | Complex reasoning, long context | 128K context    | Premium ($0.01/1K input tokens)
GPT-4             | Production applications         | 8K/32K context  | Standard ($0.03/1K input tokens)
GPT-3.5 Turbo     | High-throughput apps            | 16K context     | Low-cost ($0.0005/1K input tokens)
Embedding models  | Semantic search, RAG            | 8K context      | Very low ($0.0001/1K tokens)
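
As a quick sanity check on the table above (rates are indicative and change over time; always confirm against the current Azure OpenAI price list), per-request cost is simply tokens divided by 1,000 times the per-1K rate. A minimal sketch:

# cost_estimate.py - back-of-the-envelope cost arithmetic (illustrative rates; verify current pricing)
def estimate_request_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_1k: float = 0.01,   # e.g., GPT-4 Turbo input rate in USD (assumed)
    output_rate_per_1k: float = 0.03   # e.g., GPT-4 Turbo output rate in USD (assumed)
) -> float:
    """Estimate the USD cost of a single chat completion."""
    return (input_tokens / 1000) * input_rate_per_1k + (output_tokens / 1000) * output_rate_per_1k

# A 2,000-token prompt with an 800-token answer at the rates above:
print(f"${estimate_request_cost(2000, 800):.4f}")  # ≈ $0.0440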

Key Differences from OpenAI API:

  • Regional deployment (data residency guarantees)
  • Enterprise-grade SLAs (99.9% uptime)
  • Managed identity authentication (no API keys in code)
  • Integration with Azure security/compliance tools

Authentication Setup

Use Azure Managed Identity for production (no secrets in code):

# authentication.py - Secure Azure authentication patterns
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

def get_azure_openai_client(endpoint: str) -> AzureOpenAI:
    """
    Create Azure OpenAI client using DefaultAzureCredential.

    Authentication priority order:
    1. Managed Identity (in Azure)
    2. Azure CLI credentials (local dev)
    3. Environment variables (CI/CD)

    Args:
        endpoint: Azure OpenAI resource endpoint (e.g., https://myopenai.openai.azure.com)

    Returns:
        Authenticated AzureOpenAI client
    """
    # DefaultAzureCredential tries multiple auth methods
    credential = DefaultAzureCredential()

    # Create token provider for Azure OpenAI scope
    token_provider = get_bearer_token_provider(
        credential,
        "https://cognitiveservices.azure.com/.default"
    )

    client = AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version="2024-02-15-preview"  # Use latest stable API version
    )

    return client

# Usage example
client = get_azure_openai_client("https://myopenai.openai.azure.com")

Alternative: Key-Based Authentication (Development Only):

from openai import AzureOpenAI
import os

# NEVER hardcode keys - use environment variables or Azure Key Vault
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-15-preview"
)
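
If a key must be used (for example in CI), a safer pattern than plain environment variables is to fetch it from Azure Key Vault at startup. A minimal sketch using the azure-keyvault-secrets package (pip install azure-keyvault-secrets); the vault URL and secret name below are illustrative placeholders:

# keyvault_auth.py - fetch the API key from Key Vault instead of hardcoding it (sketch)
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
from openai import AzureOpenAI

def get_client_from_key_vault(vault_url: str, endpoint: str) -> AzureOpenAI:
    """Retrieve the Azure OpenAI key from Key Vault and build a client."""
    secret_client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    api_key = secret_client.get_secret("AZURE-OPENAI-KEY").value  # example secret name
    return AzureOpenAI(
        azure_endpoint=endpoint,
        api_key=api_key,
        api_version="2024-02-15-preview"
    )

client = get_client_from_key_vault(
    vault_url="https://my-keyvault.vault.azure.net",
    endpoint="https://myopenai.openai.azure.com"
)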

Deploying and Using Azure OpenAI Models

Model Deployment

Deploy models through Azure Portal or Azure CLI:

# Create Azure OpenAI resource (one-time setup)
az cognitiveservices account create \
  --name myopenai \
  --resource-group myResourceGroup \
  --kind OpenAI \
  --sku S0 \
  --location eastus

# Deploy GPT-4 model
az cognitiveservices account deployment create \
  --name myopenai \
  --resource-group myResourceGroup \
  --deployment-name gpt-4-deployment \
  --model-name gpt-4 \
  --model-version "0613" \
  --model-format OpenAI \
  --sku-name "Standard" \
  --sku-capacity 10  # Capacity unit = 1,000 tokens per minute (10 = 10K TPM)

Basic Completion API Usage

# completions.py - Azure OpenAI completion patterns
from openai import AzureOpenAI
from typing import List, Dict

def generate_completion(
    client: AzureOpenAI,
    deployment_name: str,
    messages: List[Dict[str, str]],
    temperature: float = 0.7,
    max_tokens: int = 800
) -> str:
    """
    Generate completion using Azure OpenAI chat API.

    Args:
        client: Authenticated AzureOpenAI client
        deployment_name: Name of deployed model (e.g., "gpt-4-deployment")
        messages: Chat history in OpenAI format
        temperature: Sampling temperature (0.0-2.0)
        max_tokens: Maximum tokens in response

    Returns:
        Generated text completion
    """
    response = client.chat.completions.create(
        model=deployment_name,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0
    )

    return response.choices[0].message.content

# Example usage
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain Azure OpenAI Service in 3 sentences."}
]

completion = generate_completion(
    client=client,
    deployment_name="gpt-4-deployment",
    messages=messages
)

print(f"Response: {completion}")
print(f"Tokens used: {response.usage.total_tokens}")

Streaming Responses

Implement streaming for better user experience:

# streaming.py - Streaming completion for real-time responses
def stream_completion(
    client: AzureOpenAI,
    deployment_name: str,
    messages: List[Dict[str, str]]
) -> None:
    """
    Stream completion chunks as they're generated.

    Useful for:
    - Real-time user interfaces
    - Reducing perceived latency
    - Processing partial responses
    """
    response_stream = client.chat.completions.create(
        model=deployment_name,
        messages=messages,
        stream=True
    )

    print("Streaming response:")
    for chunk in response_stream:
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)

    print("\n")  # New line after stream completes

# Usage
stream_completion(client, "gpt-4-deployment", messages)
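
Since the prerequisites assume familiarity with async programming, here is a minimal async variant built on the AsyncAzureOpenAI client. It uses key-based authentication from environment variables purely for brevity (swap in a bearer token provider for production):

# async_streaming.py - async streaming sketch (assumes key-based auth for brevity)
import asyncio
import os
from openai import AsyncAzureOpenAI

async def stream_completion_async(deployment_name: str, prompt: str) -> None:
    """Stream chunks without blocking the event loop (useful inside async web handlers)."""
    client = AsyncAzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_KEY"],
        api_version="2024-02-15-preview"
    )
    stream = await client.chat.completions.create(
        model=deployment_name,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_completion_async("gpt-4-deployment", "Summarize GenAIOps in one sentence."))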

Prompt Engineering and Management

Prompt Template System

Create reusable prompt templates with version control:

# prompts.py - Prompt template management system
from typing import Any, Dict, List
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PromptTemplate:
    """Version-controlled prompt template"""
    name: str
    version: str
    system_prompt: str
    user_template: str
    temperature: float
    max_tokens: int
    created_at: datetime = field(default_factory=datetime.now)

    def format(self, **kwargs: Any) -> List[Dict[str, str]]:
        """Format template with provided variables"""
        return [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": self.user_template.format(**kwargs)}
        ]

# Define prompt templates
PROMPTS = {
    "code_review": PromptTemplate(
        name="code_review",
        version="1.2.0",
        system_prompt="""You are an expert code reviewer. Review code for:
        1. Security vulnerabilities
        2. Performance issues
        3. Code quality and maintainability
        Provide specific, actionable feedback.""",
        user_template="Review this {language} code:\n\n```\n{code}\n```",
        temperature=0.3,  # Low temperature for consistent analysis
        max_tokens=1000
    ),

    "documentation": PromptTemplate(
        name="documentation",
        version="1.0.0",
        system_prompt="""You are a technical writer. Create clear, comprehensive
        documentation with examples. Follow the style guide provided.""",
        user_template="Document this function:\n\n{function_signature}\n\nPurpose: {purpose}",
        temperature=0.5,
        max_tokens=1500
    )
}

# Usage
def review_code(client: AzureOpenAI, code: str, language: str = "python") -> str:
    """Review code using versioned prompt template"""
    template = PROMPTS["code_review"]
    messages = template.format(language=language, code=code)

    return generate_completion(
        client=client,
        deployment_name="gpt-4-deployment",
        messages=messages,
        temperature=template.temperature,
        max_tokens=template.max_tokens
    )

# Example
code_sample = '''
def process_user_input(user_input):
    return eval(user_input)  # Security issue!
'''

review = review_code(client, code_sample)
print(review)
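
Because templates are plain dataclasses, they are easy to serialize and keep under source control next to the code, so prompt changes show up in review diffs. A minimal sketch (the output file name is an example):

# prompt_registry.py - persist a template as JSON for versioning (illustrative sketch)
import json
from dataclasses import asdict

def save_template(template: PromptTemplate, path: str) -> None:
    """Write a PromptTemplate to JSON so prompt changes are reviewable in git."""
    data = asdict(template)
    data["created_at"] = data["created_at"].isoformat()  # datetime is not JSON-serializable
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

save_template(PROMPTS["code_review"], "code_review_v1.2.0.json")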

Few-Shot Learning Patterns

Implement few-shot learning for better results:

# few_shot.py - Few-shot learning examples
def create_few_shot_prompt(
    examples: List[Dict[str, str]],
    task_description: str,
    user_input: str
) -> List[Dict[str, str]]:
    """
    Create few-shot prompt with examples.

    Pattern: System instruction → Examples → User query
    """
    messages = [
        {"role": "system", "content": task_description}
    ]

    # Add few-shot examples as conversation history
    for example in examples:
        messages.append({"role": "user", "content": example["input"]})
        messages.append({"role": "assistant", "content": example["output"]})

    # Add actual user query
    messages.append({"role": "user", "content": user_input})

    return messages

# Example: Entity extraction
entity_extraction_examples = [
    {
        "input": "Apple Inc. announced new products in Cupertino on June 5, 2024.",
        "output": "Entities: [ORG: Apple Inc.], [LOC: Cupertino], [DATE: June 5, 2024]"
    },
    {
        "input": "The meeting with John Smith is scheduled for next Monday.",
        "output": "Entities: [PERSON: John Smith], [DATE: next Monday]"
    }
]

messages = create_few_shot_prompt(
    examples=entity_extraction_examples,
    task_description="Extract named entities from text. Format: [TYPE: entity]",
    user_input="Microsoft's CEO Satya Nadella spoke at the Seattle conference yesterday."
)

result = generate_completion(client, "gpt-4-deployment", messages)
print(result)  # Expected output (model responses vary): Entities: [ORG: Microsoft], [PERSON: Satya Nadella], [LOC: Seattle], [DATE: yesterday]

Experiment Tracking with Azure ML

Azure ML Integration

Track experiments, compare models, and manage artifacts:

# mlops.py - Azure ML experiment tracking
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import mlflow  # Logging to Azure ML also requires the azureml-mlflow plugin (pip install azureml-mlflow)

def setup_azure_ml_tracking(
    subscription_id: str,
    resource_group: str,
    workspace_name: str
) -> MLClient:
    """
    Set up Azure ML tracking for GenAI experiments.

    Returns:
        Authenticated MLClient for experiment tracking
    """
    credential = DefaultAzureCredential()
    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )

    # Configure MLflow to use Azure ML backend
    mlflow.set_tracking_uri(ml_client.workspaces.get(workspace_name).mlflow_tracking_uri)

    return ml_client

# Track prompt experiments
def track_prompt_experiment(
    experiment_name: str,
    prompt_template: PromptTemplate,
    test_inputs: List[str],
    client: AzureOpenAI,
    deployment_name: str
):
    """
    Track prompt engineering experiments with MLflow.

    Logs:
    - Prompt template version
    - Hyperparameters (temperature, max_tokens)
    - Input/output examples
    - Token usage metrics
    """
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run(run_name=f"{prompt_template.name}_v{prompt_template.version}"):
        # Log prompt template
        mlflow.log_param("template_name", prompt_template.name)
        mlflow.log_param("template_version", prompt_template.version)
        mlflow.log_param("temperature", prompt_template.temperature)
        mlflow.log_param("max_tokens", prompt_template.max_tokens)
        mlflow.log_text(prompt_template.system_prompt, "system_prompt.txt")

        # Test prompt on multiple inputs
        total_tokens = 0
        results = []

        for i, test_input in enumerate(test_inputs):
            messages = [
                {"role": "system", "content": prompt_template.system_prompt},
                {"role": "user", "content": test_input}
            ]

            response = client.chat.completions.create(
                model=deployment_name,
                messages=messages,
                temperature=prompt_template.temperature,
                max_tokens=prompt_template.max_tokens
            )

            output = response.choices[0].message.content
            tokens = response.usage.total_tokens

            total_tokens += tokens
            results.append({"input": test_input, "output": output, "tokens": tokens})

            # Log individual test case
            mlflow.log_text(f"Input: {test_input}\nOutput: {output}", f"test_case_{i}.txt")

        # Log aggregate metrics
        mlflow.log_metric("total_tokens", total_tokens)
        mlflow.log_metric("avg_tokens_per_request", total_tokens / len(test_inputs))
        mlflow.log_metric("test_cases", len(test_inputs))

        print(f"Experiment logged. Total tokens: {total_tokens}")
        return results

# Usage
ml_client = setup_azure_ml_tracking(
    subscription_id="your-subscription-id",
    resource_group="your-rg",
    workspace_name="your-workspace"
)

test_cases = [
    "Explain quantum computing",
    "Describe machine learning",
    "What is cloud computing?"
]

track_prompt_experiment(
    experiment_name="prompt_engineering",
    prompt_template=PROMPTS["documentation"],
    test_inputs=test_cases,
    client=client,
    deployment_name="gpt-4-deployment"
)

Production Deployment Patterns

Azure Kubernetes Service (AKS) Deployment

Deploy GenAI application with autoscaling:

1. Create Kubernetes Deployment Manifest:

# k8s/genai-api-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: genai-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: genai-api
  template:
    metadata:
      labels:
        app: genai-api
    spec:
      serviceAccountName: genai-api-sa  # Service account federated with a managed identity (workload identity)
      containers:
      - name: api
        image: myacr.azurecr.io/genai-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: AZURE_OPENAI_ENDPOINT
          value: "https://myopenai.openai.azure.com"
        - name: DEPLOYMENT_NAME
          value: "gpt-4-deployment"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: genai-api-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: genai-api
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: genai-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: genai-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

2. FastAPI Application with Monitoring:

# api.py - Production FastAPI application
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
import os
import time

# Initialize OpenTelemetry tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export traces to Azure Monitor
azure_exporter = AzureMonitorTraceExporter(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(azure_exporter))

# Initialize FastAPI
app = FastAPI(title="GenAI API", version="1.0.0")

# Instrument FastAPI with OpenTelemetry
FastAPIInstrumentor.instrument_app(app)

# Initialize Azure OpenAI client (using Managed Identity in AKS)
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(
    credential, "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_ad_token_provider=token_provider,
    api_version="2024-02-15-preview"
)

DEPLOYMENT_NAME = os.environ["DEPLOYMENT_NAME"]

class CompletionRequest(BaseModel):
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 800

class CompletionResponse(BaseModel):
    completion: str
    tokens_used: int
    latency_ms: float

@app.post("/completions", response_model=CompletionResponse)
async def create_completion(request: CompletionRequest):
    """Generate completion with telemetry tracking"""

    with tracer.start_as_current_span("generate_completion") as span:
        span.set_attribute("prompt_length", len(request.prompt))
        span.set_attribute("temperature", request.temperature)
        span.set_attribute("max_tokens", request.max_tokens)

        start_time = time.time()

        try:
            response = client.chat.completions.create(
                model=DEPLOYMENT_NAME,
                messages=[{"role": "user", "content": request.prompt}],
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )

            latency_ms = (time.time() - start_time) * 1000

            span.set_attribute("tokens_used", response.usage.total_tokens)
            span.set_attribute("latency_ms", latency_ms)
            span.set_attribute("status", "success")

            return CompletionResponse(
                completion=response.choices[0].message.content,
                tokens_used=response.usage.total_tokens,
                latency_ms=latency_ms
            )

        except Exception as e:
            span.set_attribute("status", "error")
            span.record_exception(e)
            raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    """Kubernetes liveness probe"""
    return {"status": "healthy"}

@app.get("/ready")
async def readiness_check():
    """Kubernetes readiness probe"""
    try:
        # Verify Azure OpenAI connectivity (each probe consumes a small number of tokens)
        client.chat.completions.create(
            model=DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": "test"}],
            max_tokens=1
        )
        return {"status": "ready"}
    except Exception:
        raise HTTPException(status_code=503, detail="Not ready")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

3. Deploy to AKS:

# Build and push Docker image
docker build -t myacr.azurecr.io/genai-api:latest .
docker push myacr.azurecr.io/genai-api:latest

# Apply Kubernetes manifests
kubectl apply -f k8s/genai-api-deployment.yaml

# Verify deployment
kubectl get pods -n production
kubectl get hpa -n production

Monitoring and Cost Optimization

Azure Monitor Integration

Track token usage and costs:

# monitoring.py - Cost tracking and monitoring
from azure.monitor.query import LogsQueryClient, MetricsQueryClient
from azure.identity import DefaultAzureCredential
from datetime import datetime, timedelta
from typing import Dict, List

def get_token_usage_metrics(
    workspace_id: str,
    days: int = 7
) -> Dict[str, float]:
    """
    Query Azure Monitor for token usage and cost metrics.

    Returns:
        Dictionary with total tokens, requests, estimated cost
    """
    credential = DefaultAzureCredential()
    logs_client = LogsQueryClient(credential)

    # KQL query to aggregate token usage.
    # Note: table and property names depend on how your OpenTelemetry spans are exported;
    # span attributes may land in AppDependencies rather than AppTraces in your workspace.
    query = f"""
    AppTraces
    | where TimeGenerated > ago({days}d)
    | where todouble(Properties.tokens_used) > 0
    | summarize
        TotalTokens = sum(todouble(Properties.tokens_used)),
        TotalRequests = count(),
        AvgTokensPerRequest = avg(todouble(Properties.tokens_used)),
        P95Latency = percentile(todouble(Properties.latency_ms), 95)
    """

    response = logs_client.query_workspace(
        workspace_id=workspace_id,
        query=query,
        timespan=timedelta(days=days)
    )

    if response.tables:
        row = response.tables[0].rows[0]
        total_tokens = row[0]
        total_requests = row[1]
        avg_tokens = row[2]
        p95_latency = row[3]

        # Estimate cost (GPT-4 pricing: $0.03 per 1K input tokens, $0.06 per 1K output)
        # Assuming 50/50 input/output split
        estimated_cost = (total_tokens / 1000) * 0.045

        return {
            "total_tokens": total_tokens,
            "total_requests": total_requests,
            "avg_tokens_per_request": avg_tokens,
            "p95_latency_ms": p95_latency,
            "estimated_cost_usd": estimated_cost
        }

    return {}

# Usage
metrics = get_token_usage_metrics(workspace_id="your-workspace-id", days=7)
print(f"Weekly Metrics:")
print(f"  Total Tokens: {metrics['total_tokens']:,.0f}")
print(f"  Total Requests: {metrics['total_requests']:,.0f}")
print(f"  Estimated Cost: ${metrics['estimated_cost_usd']:.2f}")

Cost Optimization Strategies

1. Implement Response Caching:

# caching.py - Response caching for identical requests
from typing import Dict
import hashlib

from openai import AzureOpenAI

# Simple in-process cache; use Redis or Azure Cache for Redis to share across replicas
_completion_cache: Dict[str, str] = {}

def generate_with_caching(
    client: AzureOpenAI,
    deployment_name: str,
    prompt: str,
    temperature: float = 0.7,
    max_tokens: int = 800
) -> str:
    """Generate completion with caching to avoid paying for repeated identical requests"""

    # Create cache key from prompt + parameters
    cache_key = hashlib.sha256(
        f"{prompt}_{temperature}_{max_tokens}".encode()
    ).hexdigest()

    # Return cached result if available (no tokens consumed)
    if cache_key in _completion_cache:
        return _completion_cache[cache_key]

    # Generate and cache if not seen before
    response = client.chat.completions.create(
        model=deployment_name,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens
    )

    result = response.choices[0].message.content
    _completion_cache[cache_key] = result

    return result

2. Use Provisioned Throughput for High Volume:

# For consistent high-volume workloads, use Provisioned Throughput (PTU)
# instead of pay-per-token pricing

az cognitiveservices account deployment create \
  --name myopenai \
  --resource-group myResourceGroup \
  --deployment-name gpt-4-ptu \
  --model-name gpt-4 \
  --model-version "0613" \
  --model-format OpenAI \
  --sku-name ProvisionedManaged \
  --sku-capacity 100  # Number of PTUs; throughput per PTU varies by model (see Azure PTU sizing guidance)

CI/CD Pipeline with GitHub Actions

Complete MLOps pipeline:

# .github/workflows/deploy-genai.yml
name: Deploy GenAI API

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  RESOURCE_GROUP: genai-rg
  AKS_CLUSTER: genai-aks
  ACR_NAME: myacr

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run tests
        env:
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          AZURE_OPENAI_KEY: ${{ secrets.AZURE_OPENAI_KEY }}
        run: |
          pytest tests/ --cov=. --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Build and push image
        run: |
          az acr build --registry $ACR_NAME \
            --image genai-api:${{ github.sha }} \
            --image genai-api:latest \
            --file Dockerfile .

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Set AKS context
        run: |
          az aks get-credentials \
            --resource-group $RESOURCE_GROUP \
            --name $AKS_CLUSTER

      - name: Update deployment
        run: |
          kubectl set image deployment/genai-api \
            api=$ACR_NAME.azurecr.io/genai-api:${{ github.sha }} \
            -n production

          kubectl rollout status deployment/genai-api -n production

Best Practices and Limitations

Production Recommendations

1. Rate Limiting and Throttling:

  • Implement a token bucket algorithm for API rate limiting (see the sketch after this list)
  • Use Azure API Management for centralized throttling
  • Monitor quota usage; default tokens-per-minute quotas vary by model, region, and subscription
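
A minimal in-process token bucket sketch (for multi-replica deployments, enforce limits centrally, e.g., with Azure API Management or a shared Redis counter):

# rate_limit.py - simple in-process token bucket (illustrative sketch)
import threading
import time

class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()
        self._lock = threading.Lock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        with self._lock:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = TokenBucket(rate=5, capacity=10)  # ~5 requests/second with bursts of 10
if not bucket.allow():
    print("Throttled: retry later")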

2. Error Handling:

from openai import APIError, APITimeoutError, RateLimitError
import time

def generate_with_retry(client, deployment_name, messages, max_retries=3):
    """Generate with exponential backoff retry on transient OpenAI SDK errors"""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model=deployment_name,
                messages=messages
            )
        except (RateLimitError, APITimeoutError, APIError):
            if attempt == max_retries - 1:
                raise

            # Exponential backoff before retrying
            wait_time = 2 ** attempt
            time.sleep(wait_time)

3. Security:

  • Use Managed Identity (never hardcode keys)
  • Enable Azure Key Vault for sensitive prompts/data
  • Implement content filtering (built into Azure OpenAI); see the handling sketch after this list
  • Use Azure Private Link for network isolation
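
Azure OpenAI's built-in content filtering surfaces in two ways: filtered prompts are rejected with a 400 error, and filtered completions come back with finish_reason set to "content_filter". A minimal handling sketch:

# content_filter.py - handle content filtering outcomes (sketch)
from openai import AzureOpenAI, BadRequestError

def safe_completion(client: AzureOpenAI, deployment_name: str, prompt: str) -> str:
    """Return the completion, or a fallback message when content filtering triggers."""
    try:
        response = client.chat.completions.create(
            model=deployment_name,
            messages=[{"role": "user", "content": prompt}]
        )
    except BadRequestError as e:
        # The prompt itself was filtered; inspect the error body for category details
        return f"Request blocked by content filter: {e}"

    choice = response.choices[0]
    if choice.finish_reason == "content_filter":
        # The model's output (not the prompt) was filtered
        return "Response withheld by content filter"
    return choice.message.content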

Known Limitations

Limitation            | Impact                                                      | Mitigation
----------------------|-------------------------------------------------------------|----------------------------------------------------------
Regional availability | Not all models in all regions                               | Deploy to supported regions (e.g., East US, West Europe)
Token limits          | 128K max context (GPT-4 Turbo)                              | Implement chunking for long documents (see sketch below)
Rate limits           | Default TPM quotas vary by model, region, and subscription | Request a quota increase or use PTU
Cold start latency    | First request may be slow                                   | Implement warm-up requests
Cost unpredictability | Token usage varies by task                                  | Implement cost alerts and budgets
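
As a simple illustration of the chunking mitigation above, a naive character-based splitter is often enough; token-aware splitting (e.g., with tiktoken) is more precise:

# chunking.py - naive overlapping chunks for long documents (illustrative sketch)
from typing import List

def chunk_text(text: str, max_chars: int = 12000, overlap: int = 500) -> List[str]:
    """Split text into overlapping chunks that fit comfortably in the context window."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across chunk boundaries
    return chunks

sample = "lorem ipsum " * 5000  # placeholder for a long document
print(f"{len(chunk_text(sample))} chunks")  # summarize each chunk, then combine the partial summaries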

Conclusion and Resources

This guide covered practical GenAIOps implementation on Azure, from authentication and deployment through production monitoring and cost optimization. Key takeaways:

  • Azure OpenAI Service provides enterprise-grade GenAI with security/compliance
  • Managed Identity eliminates API key management complexity
  • Azure ML enables experiment tracking and model comparison
  • AKS deployment with autoscaling handles production workloads
  • Cost optimization through caching and Provisioned Throughput reduces expenses

Further Resources: