AI & Infrastructure: The New DevOps Frontier

The boundaries between DevOps and AI are dissolving. What were once separate disciplines are now converging into new hybrid practices that require engineers to understand both worlds. This convergence is not a future prediction—it’s happening now, reshaping how we build, deploy, and operate software systems.

This article explores the intersection of AI and infrastructure: MLOps for machine learning operations, AIOps for AI-powered IT operations, treating LLMs as infrastructure components, building platforms to support AI workloads, and the evolving skillset required of modern platform engineers.

The Convergence

Traditional DevOps brought us CI/CD pipelines, container orchestration, infrastructure as code, and observability practices. These remain foundational. But AI has introduced new layers of complexity that demand new approaches.

On the AI side, we have Large Language Models, machine learning pipelines, training infrastructure, and inference services. These systems don’t fit neatly into traditional DevOps patterns.

(Figure: DevOps and AI Convergence)

The convergence creates three new disciplines:

  • MLOps: Applying DevOps principles to machine learning systems
  • AIOps: Using AI to improve IT operations
  • LLMOps: Operating LLM infrastructure at production scale

Each represents a different direction of integration, but all require engineers who can bridge both worlds.

Section 1: MLOps

MLOps extends DevOps principles to machine learning, but the challenges are fundamentally different. In traditional software, the same code produces the same behavior. In ML systems, the same code with different data produces different behavior. This is the core complexity that MLOps addresses.

DevOps vs MLOps

Aspect      | DevOps                 | MLOps
Versioning  | Code (Git)             | Code + Data + Models
Pipelines   | CI/CD                  | CT/CD (Continuous Training)
Testing     | Unit & Integration     | Model & Data validation
Monitoring  | Application metrics    | Model drift detection
Rollback    | Previous code version  | Model + data dependencies

The shift from deterministic to probabilistic systems changes everything. Testing becomes statistical rather than binary. Monitoring must detect gradual degradation, not just failures. Rollback requires understanding which data produced which model.

The MLOps Pipeline

A mature MLOps pipeline consists of five stages:

Data Stage

# Data versioning with DVC
import dvc.api

# Track data changes alongside code
data_path = dvc.api.get_url(
    path='data/training/dataset.csv',
    repo='[email protected]:org/ml-project.git',
    rev='v1.2.0'
)

# Data quality checks (great_expectations; the exact API differs across versions)
from great_expectations.core import ExpectationSuite, ExpectationConfiguration

suite = ExpectationSuite("training_data")
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "target"},
    )
)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={"column": "feature_1", "min_value": 0, "max_value": 100},
    )
)

Training Stage

# Experiment tracking with MLflow
import mlflow

with mlflow.start_run():
    mlflow.log_params({
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100
    })

    model = train_model(X_train, y_train)  # training helper and data splits defined elsewhere

    # accuracy, f1, auc are computed on a held-out test set (not shown)
    mlflow.log_metrics({
        "accuracy": accuracy,
        "f1_score": f1,
        "auc_roc": auc
    })

    mlflow.sklearn.log_model(model, "model")

Validation Stage

# Model validation against baseline
def validate_model(new_model, baseline_model, test_data):
    new_metrics = evaluate(new_model, test_data)
    baseline_metrics = evaluate(baseline_model, test_data)

    # Statistical significance test
    improvement = new_metrics['accuracy'] - baseline_metrics['accuracy']
    p_value = statistical_test(new_model, baseline_model, test_data)

    return {
        'passes': improvement > 0.01 and p_value < 0.05,
        'improvement': improvement,
        'p_value': p_value
    }
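
The statistical_test helper above is left abstract. A minimal sketch, assuming sklearn-style models with a predict() method and test_data as an (X, y) pair, is a paired bootstrap on the per-sample accuracy difference:

import numpy as np

def statistical_test(new_model, baseline_model, test_data, n_boot=1000, seed=0):
    # Paired bootstrap p-value for "the new model is not better than the baseline"
    X, y = test_data
    new_correct = (new_model.predict(X) == y).astype(float)
    base_correct = (baseline_model.predict(X) == y).astype(float)
    diffs = new_correct - base_correct  # per-sample accuracy difference

    rng = np.random.default_rng(seed)
    n = len(diffs)
    boot_means = np.array([
        diffs[rng.integers(0, n, n)].mean() for _ in range(n_boot)
    ])

    # One-sided p-value: share of bootstrap replicates with no improvement
    return float((boot_means <= 0).mean())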

Deployment Stage

# Model registry and deployment
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register model
model_uri = f"runs:/{run_id}/model"
mv = client.create_model_version(
    name="fraud_detection",
    source=model_uri,
    run_id=run_id
)

# Promote to production after validation
client.transition_model_version_stage(
    name="fraud_detection",
    version=mv.version,
    stage="Production"
)

Monitoring Stage

# Drift detection with Evidently (preset import path varies by version)
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

def check_drift(reference_data, production_data):
    report = Report(metrics=[DataDriftPreset()])
    report.run(
        reference_data=reference_data,
        current_data=production_data
    )

    drift_detected = report.as_dict()['metrics'][0]['result']['dataset_drift']

    if drift_detected:
        trigger_retraining_pipeline()

    return drift_detected
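
The trigger_retraining_pipeline() call is where monitoring feeds back into training. A minimal sketch, assuming the training pipeline is an Airflow DAG reachable through Airflow's stable REST API (the DAG name, URL, and basic-auth credentials below are placeholders):

import requests

AIRFLOW_URL = "https://airflow.internal.example.com"  # placeholder endpoint

def trigger_retraining_pipeline(dag_id="retrain_fraud_detection", reason="data_drift"):
    # Start a DAG run via Airflow's stable REST API (Airflow 2.x)
    response = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {"trigger_reason": reason}},
        auth=("svc-mlops", "REDACTED"),  # assumes basic auth is enabled
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dag_run_id"]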

The MLOps Technology Stack

The tooling landscape for MLOps has matured significantly:

Experiment Tracking & Model Registry

  • MLflow: Open source, comprehensive, good starting point
  • Weights & Biases: Excellent visualization, collaborative features
  • Neptune: Strong experiment management focus

Feature Stores

  • Feast: Open source, Kubernetes-native
  • Tecton: Enterprise features, real-time serving
  • Hopsworks: Integrated with data platforms

Orchestration

  • Kubeflow Pipelines: Kubernetes-native ML workflows
  • Airflow: General purpose, widely adopted
  • Prefect/Dagster: Modern alternatives with better DX

Model Serving

  • KServe: Serverless inference on Kubernetes
  • Seldon Core: Kubernetes-native, feature rich
  • Triton Inference Server: Multi-framework, GPU optimized
  • BentoML: Framework for packaging and serving

MLOps Maturity Model

Most organizations fall into one of four maturity levels:

Level 0 - Manual Process

  • Data scientists work in notebooks
  • Manual model export and deployment
  • No versioning or reproducibility
  • This describes 70-80% of organizations

Level 1 - Pipeline Automation

  • Automated training pipelines
  • Experiment tracking in place
  • Basic model versioning
  • Manual deployment decisions

Level 2 - CI/CD for ML

  • Automated testing for data and models
  • Continuous training triggers
  • Model registry with approval workflows
  • Feature store adoption

Level 3 - Full MLOps

  • Automated retraining based on drift
  • A/B testing and gradual rollouts
  • Complete observability
  • Self-healing pipelines

The goal is incremental improvement, not a leap to Level 3.

Section 2: AIOps

AIOps inverts the relationship—instead of applying DevOps to AI, we apply AI to DevOps. The premise is simple: modern systems generate more data than humans can process. AI can help.

The Scale Problem

A typical enterprise environment generates:

  • Millions of log entries per hour
  • Thousands of metric time series
  • Hundreds of alerts per day

Traditional rule-based monitoring cannot keep pace. Static thresholds generate false positives. Alert fatigue causes teams to miss real issues.

AIOps applies machine learning to:

  • Detect anomalies automatically
  • Correlate related alerts
  • Predict failures before they occur (see the sketch after this list)
  • Automate incident response
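
Failure prediction can start as simple trend extrapolation. A minimal sketch (illustrative only) that estimates when a disk will fill from recent usage samples:

import numpy as np

def hours_until_disk_full(timestamps_h, used_gb, capacity_gb):
    # Fit a linear trend to disk usage and extrapolate to capacity.
    # Returns estimated hours until full, or None if usage is not growing.
    slope, _ = np.polyfit(timestamps_h, used_gb, 1)  # slope in GB per hour
    if slope <= 0:
        return None
    return (capacity_gb - used_gb[-1]) / slope

# Example: alert if the disk is predicted to fill within 48 hours
eta = hours_until_disk_full(np.arange(24), np.linspace(400, 460, 24), capacity_gb=500)
if eta is not None and eta < 48:
    print(f"Disk predicted to fill in ~{eta:.0f} hours")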

ML-Based Anomaly Detection

The fundamental improvement over static thresholds:

# Static threshold (traditional)
def static_alert(cpu_usage):
    return cpu_usage > 80  # Always alerts at 80%

# ML-based dynamic baseline
from sklearn.ensemble import IsolationForest
import numpy as np

class DynamicBaseline:
    def __init__(self, window_size=168, min_samples=24):  # 1 week of hourly data
        self.model = IsolationForest(contamination=0.01)
        self.history = []
        self.window_size = window_size
        self.min_samples = min_samples
        self.fitted = False

    def update(self, value, hour_of_week):
        # Include temporal features so the model learns weekly seasonality
        features = [value, hour_of_week, np.sin(2*np.pi*hour_of_week/168)]
        self.history.append(features)

        # Keep a rolling window and refit once enough history has accumulated
        if len(self.history) > self.window_size:
            self.history.pop(0)
        if len(self.history) >= self.min_samples:
            self.model.fit(self.history)
            self.fitted = True

    def is_anomaly(self, value, hour_of_week):
        if not self.fitted:
            return False  # not enough history to judge yet
        features = [[value, hour_of_week, np.sin(2*np.pi*hour_of_week/168)]]
        return self.model.predict(features)[0] == -1

The key insight: a CPU spike at Monday 9 AM during batch processing is normal. The same spike at Sunday 3 AM is not. Static thresholds cannot distinguish these cases.

Self-Healing Infrastructure

The ultimate goal of AIOps is autonomous remediation:

(Figure: Self-Healing Infrastructure Cycle)

# Example: Kubernetes-based self-healing with custom controller
apiVersion: aiops.io/v1
kind: RemediationPolicy
metadata:
  name: memory-pressure-remediation
spec:
  trigger:
    type: Anomaly
    metric: container_memory_usage_bytes
    model: isolation_forest
    sensitivity: 0.95

  diagnosis:
    correlations:
      - metric: container_restarts
      - metric: http_error_rate
      - logs: "OutOfMemoryError"

  actions:
    - type: ScaleUp
      condition: "correlation.http_error_rate > 0.8"
      target:
        deployment: api-server
        replicas: "+2"

    - type: Restart
      condition: "correlation.container_restarts > 3"
      target:
        pod: "selector:app=api-server"

  feedback:
    successMetric: http_error_rate
    successThreshold: 0.01
    learningEnabled: true

The cycle: Detect (anomaly detection) → Diagnose (correlation analysis) → Remediate (automated runbook) → Learn (feedback loop).
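
A controller implementing such a policy is, at its core, a loop over these four steps. The sketch below is conceptual: the detector, correlator, actions, and success_metric objects are placeholders, not a real controller API.

import time

def reconcile(policy, detector, correlator, actions, success_metric):
    # One pass of the detect -> diagnose -> remediate -> learn cycle
    anomaly = detector.check(policy["trigger"]["metric"])
    if anomaly is None:
        return

    # Diagnose: gather the correlated signals declared in the policy
    evidence = correlator.collect(policy["diagnosis"]["correlations"], anomaly)

    # Remediate: run the first action whose condition matches the evidence
    for action in policy["actions"]:
        if evidence.matches(action["condition"]):
            actions.execute(action)
            break

    # Learn: record whether the success metric recovered after remediation
    time.sleep(60)
    recovered = success_metric.value() < policy["feedback"]["successThreshold"]
    detector.record_outcome(anomaly, recovered)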

Intelligent Alert Correlation

Converting thousands of alerts into actionable insights:

class AlertCorrelator:
    def __init__(self, topology_graph, time_window=300):
        self.topology = topology_graph
        self.time_window = time_window
        self.alert_buffer = []

    def process_alert(self, alert):
        self.alert_buffer.append(alert)
        self.cleanup_old_alerts()

        # Group alerts by time proximity
        time_groups = self.group_by_time(self.alert_buffer)

        # Correlate by topology
        for group in time_groups:
            correlated = self.correlate_by_topology(group)

            # Find root cause candidates
            root_causes = self.find_root_causes(correlated)

            # Score by business impact
            scored = self.score_impact(root_causes)

            yield Incident(
                alerts=correlated,
                probable_causes=scored[:3],
                suggested_actions=self.get_runbooks(scored[0])
            )

    def correlate_by_topology(self, alerts):
        # Use graph algorithms to find related services
        clusters = []
        for alert in alerts:
            service = alert.source
            neighbors = self.topology.neighbors(service, depth=2)
            related = [a for a in alerts if a.source in neighbors]
            clusters.append(set(related + [alert]))

        # Merge overlapping clusters
        return self.merge_clusters(clusters)

The funnel: 10,000 raw alerts → 1,000 after noise reduction → 50 correlated incidents → 10 prioritized issues → 5 actionable items with suggested remediations.

Section 3: LLMs as Infrastructure

Large Language Models have transitioned from research curiosities to critical infrastructure components. This shift demands that we treat them with the same operational rigor as databases or message queues.

The Evolution of API Value

2005: Google Maps API → Returns location data
2010: Twitter API → Returns social data
2015: Stripe API → Handles payments
2020: Twilio API → Enables communications
2024+: LLM APIs → Return intelligence

Previous APIs returned data or performed actions. LLM APIs return reasoning. This is a fundamental shift in what an API can provide.

LLM Serving Challenges

LLM serving differs from traditional API serving:

Aspect             | Traditional API | LLM API
Response time      | 10-100 ms       | 1-30 seconds
Memory per request | MBs             | 10s-100s of GB
Compute            | CPU-bound       | GPU-bound
Cost per request   | $0.00001        | $0.01-1.00

Optimization techniques that enable production-scale serving:

# Performance comparison (requests per second on single A100)
optimizations = {
    "naive": 10,          # Basic implementation
    "continuous_batching": 50,   # 5x improvement
    "paged_attention": 150,      # 15x improvement (vLLM)
    "quantization_int8": 250,    # 25x improvement
    "quantization_int4": 300,    # 30x improvement
}

Continuous Batching: Process multiple requests together, maximizing GPU utilization.

Paged Attention (vLLM): Efficiently manage GPU memory for the KV cache using virtual memory concepts.

Quantization: Reduce precision from FP16 to INT8 or INT4, trading minimal quality loss for significant speedup.

# vLLM deployment example
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    tensor_parallel_size=4,  # Distribute across 4 GPUs
    quantization="awq",       # 4-bit quantization
    gpu_memory_utilization=0.9
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
    top_p=0.95
)

# Handles batching automatically
outputs = llm.generate(prompts, sampling_params)

The AI Gateway Pattern

A new architectural pattern for managing LLM traffic:

(Figure: AI Gateway Pattern)

Key capabilities beyond traditional API gateways:

Semantic Caching: “What is the capital of France?” and “France’s capital city?” should hit the same cache entry.

from sentence_transformers import SentenceTransformer
import numpy as np

class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.cache = []  # list of (embedding, response) pairs
        self.threshold = similarity_threshold

    def get(self, query):
        # Normalized embeddings make the dot product equal to cosine similarity
        query_embedding = self.encoder.encode(query, normalize_embeddings=True)

        for cached_embedding, response in self.cache:
            similarity = float(np.dot(query_embedding, cached_embedding))
            if similarity > self.threshold:
                return response
        return None

    def set(self, query, response):
        embedding = self.encoder.encode(query, normalize_embeddings=True)
        self.cache.append((embedding, response))

Prompt Injection Protection:

import re

def detect_prompt_injection(user_input):
    dangerous_patterns = [
        r"ignore (all )?(previous|prior|above) instructions",
        r"disregard (the )?(system|initial) prompt",
        r"you are now",
        r"new instruction:",
        r"```system",
    ]

    for pattern in dangerous_patterns:
        if re.search(pattern, user_input, re.IGNORECASE):
            return True

    # ML-based detection for sophisticated attacks
    # (injection_classifier: a pre-trained text classifier loaded elsewhere)
    injection_score = injection_classifier.predict_proba([user_input])[0][1]
    return injection_score > 0.8

Model Routing for cost optimization:

class ModelRouter:
    def __init__(self):
        self.complexity_classifier = load_classifier()
        self.models = {
            "simple": ("gpt-3.5-turbo", 0.002),    # $/1K tokens
            "medium": ("claude-3-haiku", 0.003),
            "complex": ("gpt-4", 0.06)
        }

    def route(self, prompt):
        complexity = self.complexity_classifier.predict([prompt])[0]
        model, cost = self.models[complexity]
        return model

    # 40-60% cost savings by routing simple queries to cheaper models

Prompts as Code

Treating prompts with the same rigor as application code:

# prompts/customer_support/v2.1.yaml
metadata:
  name: customer_support_agent
  version: "2.1"
  author: "platform-team"
  created: "2026-01-15"

model:
  provider: anthropic
  name: claude-3-5-sonnet
  temperature: 0.3
  max_tokens: 1024

system_prompt: |
  You are a helpful customer support agent for TechCorp.

  Guidelines:
  - Be concise and professional
  - If you don't know the answer, say so
  - Never make up information about products
  - For billing issues, always offer to escalate

  Available actions:
  - lookup_order(order_id)
  - check_inventory(product_id)
  - create_ticket(category, description)  

user_prompt_template: |
  Customer: {customer_name}
  Membership: {membership_tier}
  Previous interactions: {interaction_count}

  Query: {user_message}  

tests:
  - input:
      user_message: "Where is my order #12345?"
    expected_contains:
      - "lookup_order"
    expected_not_contains:
      - "I don't have access"

  - input:
      user_message: "Your product is terrible"
    expected_tone: "empathetic"
    expected_not_contains:
      - "I agree"

Prompt CI/CD pipeline:

# .github/workflows/prompt-ci.yaml
name: Prompt CI/CD

on:
  push:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run prompt tests
        run: |
          python scripts/test_prompts.py prompts/          

      - name: Evaluate quality metrics
        run: |
          python scripts/evaluate_prompts.py \
            --prompts prompts/ \
            --min-relevancy 0.8 \
            --min-faithfulness 0.9          

      - name: Deploy to staging
        if: github.ref == 'refs/heads/main'
        run: |
          python scripts/deploy_prompts.py \
            --env staging \
            --prompts prompts/          
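
The scripts referenced in the workflow are not shown above. A minimal sketch of what scripts/test_prompts.py might do, assuming the prompt YAML schema from the previous section and a hypothetical llm_call() wrapper around the provider SDK:

import glob
import sys
import yaml

def run_prompt_tests(prompt_dir):
    failures = 0
    for path in glob.glob(f"{prompt_dir}/**/*.yaml", recursive=True):
        with open(path) as f:
            spec = yaml.safe_load(f)

        for test in spec.get("tests", []):
            context = {
                "customer_name": "Test User",
                "membership_tier": "basic",
                "interaction_count": 0,
                **test["input"],
            }
            # llm_call() is a hypothetical wrapper around the provider SDK
            output = llm_call(
                model=spec["model"],
                system=spec["system_prompt"],
                user=spec["user_prompt_template"].format(**context),
            )

            for needle in test.get("expected_contains", []):
                if needle not in output:
                    print(f"FAIL {path}: expected output to contain '{needle}'")
                    failures += 1
            for needle in test.get("expected_not_contains", []):
                if needle in output:
                    print(f"FAIL {path}: output must not contain '{needle}'")
                    failures += 1
    return failures

if __name__ == "__main__":
    sys.exit(1 if run_prompt_tests(sys.argv[1]) else 0)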

Section 4: Infrastructure FOR AI

Building infrastructure to support AI workloads requires understanding their unique characteristics: GPU-intensive computation, large memory footprints, and specialized orchestration needs.

Kubernetes for AI Workloads

# GPU-enabled Kubernetes deployment for model training
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: training-image:v1.0
              resources:
                limits:
                  nvidia.com/gpu: 8
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: pytorch
              image: training-image:v1.0
              resources:
                limits:
                  nvidia.com/gpu: 8
          nodeSelector:
            accelerator: nvidia-h100
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule

GPU Sharing Strategies

GPUs are expensive ($10-30K per card). Maximizing utilization is critical.

MIG (Multi-Instance GPU): Physically partition an A100/H100 into up to 7 isolated instances.

# MIG configuration for inference workloads
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-balanced:
        - devices: all
          mig-enabled: true
          mig-devices:
            "3g.40gb": 2  # Two 3-GPU slices with 40GB each    

Time-Slicing: Share GPU across pods with context switching.

# Time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Allow 4 pods to share each GPU    

MPS (Multi-Process Service): CUDA-level sharing for concurrent kernel execution.

ML Platform Reference Architecture

(Figure: ML Platform Reference Architecture)

Cost Optimization

AI infrastructure costs are dominated by GPU compute:

Component           | Typical % of Budget
GPU Compute         | 60-70%
Storage             | 15-20%
Networking/Egress   | 5-10%
Other (CPU, memory) | 5-10%

Optimization strategies:

Spot Instances for Training

# Training with automatic checkpointing for spot instances
import signal
import sys

def checkpoint_handler(signum, frame):
    # Cloud providers send SIGTERM shortly before reclaiming a spot instance
    save_checkpoint(model, optimizer, epoch)
    sys.exit(0)

signal.signal(signal.SIGTERM, checkpoint_handler)

# Resume from checkpoint if exists
if checkpoint_exists():
    model, optimizer, start_epoch = load_checkpoint()
else:
    start_epoch = 0

for epoch in range(start_epoch, total_epochs):
    train_epoch(model, data)
    save_checkpoint(model, optimizer, epoch)  # Regular checkpoints

Idle Detection and Scale-to-Zero

# Knative Service with scale-to-zero
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inference-service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "10"
        autoscaling.knative.dev/scale-down-delay: "5m"
    spec:
      containers:
        - image: inference:v1
          resources:
            limits:
              nvidia.com/gpu: 1

Workload Placement Strategies

Train in Cloud, Serve On-Prem: Use cloud GPUs for burst training, deploy models on-premises for data sovereignty and predictable inference costs.

Multi-Cloud Resilience: Use ONNX for model portability across providers.

# Export to ONNX for portability
import torch.onnx

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
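
Once exported, the same model file can be served with ONNX Runtime on any provider. A minimal sketch (the input shape is illustrative and must match the exported model; GPU execution providers can be listed first when available):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Input name matches input_names=['input'] in the export above
batch = np.random.rand(8, 3, 224, 224).astype(np.float32)  # illustrative shape
outputs = session.run(None, {"input": batch})
print(outputs[0].shape)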

Edge AI: Deploy quantized models on edge devices for low-latency inference.
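
A minimal sketch of shrinking a model for edge deployment with PyTorch dynamic quantization (the model below is a stand-in; real edge targets often go further with ONNX Runtime or TensorRT):

import torch
import torch.nn as nn

# Stand-in model; replace with the trained network
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic quantization: weights stored as INT8, activations quantized at runtime
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

torch.save(quantized.state_dict(), "model_int8.pt")
print("Saved INT8-quantized weights to model_int8.pt")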

Section 5: The Evolving Role

The DevOps engineer of 2015 needed shell scripting, CI/CD, containers, cloud services, and monitoring. The platform engineer of 2026 needs all of that plus:

  • ML/AI Literacy: Understanding how models work, their failure modes, and operational requirements
  • GPU Infrastructure: Managing GPU clusters, understanding memory hierarchies, optimizing utilization
  • LLM Operations: Deploying and managing LLM services, prompt management, cost control
  • AI Tool Orchestration: Building pipelines that incorporate AI components

This is not about becoming a data scientist. It’s about understanding AI systems well enough to operate them reliably.

AI-Assisted DevOps

The tools we use are also becoming AI-powered:

Production Ready

  • Code completion (GitHub Copilot, Cursor)
  • Log analysis and summarization
  • Documentation generation
  • Code review assistance

Emerging

  • Infrastructure as Code generation from natural language
  • Incident diagnosis from symptoms
  • Test generation from code
  • Security vulnerability detection

Future

  • Autonomous debugging
  • Self-healing systems with minimal human intervention
  • Predictive capacity planning
  • Automatic performance optimization

Natural Language to Infrastructure

Human: "Deploy a PostgreSQL database with 3 replicas for high
        availability, daily backups retained for 30 days,
        accessible only from the backend namespace"

AI generates:
├── terraform/
│   └── database.tf
├── kubernetes/
│   ├── statefulset.yaml
│   ├── service.yaml
│   ├── networkpolicy.yaml
│   └── backup-cronjob.yaml
└── README.md

The workflow becomes: describe intent → AI generates code → human reviews and refines → human tests and deploys. This is augmentation, not replacement.

Key Takeaways

  1. MLOps is the new DevOps: Same principles, new challenges around data, models, and drift.

  2. AIOps augments human operators: AI handles scale, humans handle judgment.

  3. LLMs are infrastructure now: Treat them like databases—version, cache, monitor, manage costs.

  4. AI infrastructure is expensive: GPU costs dominate. Optimization is a core skill.

  5. The role is evolving: More abstraction, more impact, new skills required. The engineers who thrive will be those who bridge both worlds.

The convergence of AI and infrastructure is not optional. Whether you’re operating ML systems or using AI to improve operations, understanding both domains is becoming essential for modern platform engineering.



Achraf SOLTANI — January 24, 2026