
How to Build Production AI Agents in 2026: The No-Bullshit Way

Written by Joseph on March 26, 2026


The Bridge Between “It Works” and “It Makes Money”

Here’s what they don’t tell you in the tutorials: there’s a chasm between an agent that demos well and one that handles your Black Friday traffic without melting your infrastructure and your career prospects.

Let me show you what screwed me (and three companies I consulted for) when we tried to scale from “it works in Jupyter” to “it handles 50K transactions per hour without bankrupting us with token costs.”

The Production Nightmare Nobody Mentions

Last year, a fintech startup brought me in after their “genius” demo agent went live. What happened? Their cute little RAG agent started hallucinating at 3 AM, ignored the circuit breakers I’d recommended, and kept retrying until it had called their payment API 47,000 times. No big deal, right? Except that API charges $0.10 per call, so they woke up to a $4,700 bill and a CFO asking “who authorized this bot thing again?”

That’s not even the worst part. The worst part is how common this is.

The Dunning-Kruger Curve of Agent Development

Stage 1: “This is EASY!” You’ve built a cute agent that orders pizza. You show your boss. Everyone’s impressed. You feel like a god.

Stage 2: “Wait, it’s doing WHAT?” You give it access to more tools. Suddenly it’s calling external APIs, writing files, talking to databases. It’s the weekend, but your Slack is blowing up with “the bot is behaving weirdly” messages.

Stage 3: “Oh no, we’re bleeding money” The production bill arrives. Testing “just one more thing,” your agent has spent more on API calls than your annual salary. The finance team wants to “chat with whoever authorized this.”

Stage 4: “Let me show you how to actually do this…” That’s where we are today.

Why Demo Agents Fail in the Real World

Here’s the brutal truth: demo agents are designed to succeed one time on perfect data, in perfect conditions, with a human watching. Production agents need to survive when everything goes to hell, nobody’s watching, and there’s a million dollars on the line.

Let me show you three patterns I see destroying production systems every week:

Pattern 1: The Token Cost Bomb

What happens: Your agent hits a rate limit, ignores the 429 error, and keeps retrying. Each retry burns more tokens. Without a cost limit, nothing stops it short of bankrupting the company.

The Real Story: In 2024, an e-commerce company let their discount finder agent loose on Black Friday. The agent got excited finding “deals”, hit rate limits, started retrying every millisecond, and burned through their $50K monthly AI budget in 45 minutes. By the time monitoring caught it, they’d spent $78K on tokens trying to find discounts that didn’t exist.

The Fix: Cost tracking on every API call with automatic circuit breakers.
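Here’s a minimal sketch of that fix. The class name and budget are illustrative, not from any particular library; costs are tracked in integer cents so floating-point drift can never let a call slip past the limit.

```python
class CostCircuitBreaker:
    """Refuse further calls once cumulative spend reaches the budget."""
    def __init__(self, budget_cents: int):
        self.budget_cents = budget_cents
        self.spent_cents = 0
        self.tripped = False

    def record_call(self, cost_cents: int) -> bool:
        """Charge one call; return False once the budget is exhausted."""
        if self.tripped:
            return False
        self.spent_cents += cost_cents
        if self.spent_cents >= self.budget_cents:
            self.tripped = True
            return False
        return True

# A $1.00 budget with $0.10 calls: the 10th call trips the breaker.
breaker = CostCircuitBreaker(budget_cents=100)
allowed = [breaker.record_call(10) for _ in range(15)]
print(sum(allowed))  # 9 calls allowed before the breaker trips
```

Wrap every billable API call in `record_call` and treat a `False` return as a hard stop, not a retry signal.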

Pattern 2: The Context Poisoning Loop

What happens: Agent gets into a state where bad outputs keep feeding back into its context. Each iteration makes it dumber. By iteration 5, it’s suggesting completely insane “solutions.”

The Real Story: A customer support agent learned that giving discounts made customers happy. After the 50th bug report about “the agent is giving away millions in discounts,” we realized its context had been corrupted during two days of retry loops. It kept pairing “customer is happy” with “discount” and started handing out 90% discounts automatically.

The Fix: Mandatory context refresh and session length limits.
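One way to implement that fix, sketched with made-up class names and thresholds: a small guard that flags when a session should be checkpointed and restarted with fresh context, either by turn count or by wall-clock age.

```python
import time

class SessionGuard:
    """Flag when an agent session should checkpoint and restart fresh."""
    def __init__(self, max_turns: int = 20, max_seconds: float = 3600):
        self.max_turns = max_turns
        self.max_seconds = max_seconds
        self.turns = 0
        self.started = time.time()

    def needs_refresh(self) -> bool:
        """Call once per agent turn; True means wipe context and restart."""
        self.turns += 1
        too_many_turns = self.turns > self.max_turns
        too_old = time.time() - self.started > self.max_seconds
        return too_many_turns or too_old

guard = SessionGuard(max_turns=3)
flags = [guard.needs_refresh() for _ in range(5)]
print(flags)  # [False, False, False, True, True]
```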

Pattern 3: The Multi-Agent Civil War

What happens: You create three agents. Agent A makes a decision. Agent B disagrees. Agent C tries to mediate but makes it worse. None of them knows when to stop arguing. The database gets corrupted during their “discussion.”

The Real Story: Three agents coordinating inventory management at a major retailer somehow got into a three-day argument about product reorder quantities. Agent A wanted to order 500 units, Agent B said 200, Agent C tried to compromise at 350. They went in circles. The result? They all updated the same database record 40,000 times, creating a queue backup that cost 48 hours of sales.

The Fix: Hard timeouts and explicit conflict-resolution rules.

Part 1: The Agent That Actually Runs Production

Here’s how we prevent all three disasters by building production infrastructure BEFORE agent capabilities:

production_agent_lives.py

from typing import TypedDict, Optional
from dataclasses import dataclass
import time

@dataclass
class AgentConstraints:
    max_cost_per_hour: float = 2.0    # Dollar amount that triggers emergency stop
    max_retries: int = 3              # Prevent infinite retry loops
    max_session_length: int = 3600    # Seconds before context poisoning risk
    required_context_keys: frozenset = frozenset({"project_id", "user_id", "budget_limit"})

# This state type tracks what WILL break your system if you ignore it
class ProductionAgentState(TypedDict):
    goal: str
    context: dict
    current_task: str
    tools_used: list
    total_cost: float       # Running token/AI cost in dollars
    retry_count: int        # How many times we've tried
    last_check: str         # Timestamp for tracking drift
    meta_info: dict         # Business context (avoid audit hell)
    session_start: float    # When this started (for cost tracking)
    error: Optional[str]    # What went wrong (for debugging)
    status: str

def check_agent_safety(state: ProductionAgentState, constraints: AgentConstraints) -> dict:
    """
    This is your agent's guardian angel. EVERY action goes through this.

    Here's what this catches that demos ignore:
    1. Cost explosion: "We're burning $5/hour on tokens"
    2. Retry loops: Agent retrying itself into bankruptcy
    3. Session length: Context poisoning over long sessions
    4. Missing context: Agent running without required business context
    """
    current_cost = state["total_cost"]
    session_time = time.time() - state["session_start"]
    hourly_burn = (current_cost / session_time) * 3600 if session_time > 0 else 0

    # This check right here saves companies thousands in token bills
    if hourly_burn > constraints.max_cost_per_hour:
        return {
            "status": "critical",
            "message": f"🔥 COST EXPLOSION: Current rate ${hourly_burn:.2f}/hr exceeds ${constraints.max_cost_per_hour}",
            "next_action": "emergency_halt"
        }

    # This prevents the "infinite apology loop" - agent tries again, fails, apologizes, tries again...
    if state["retry_count"] > constraints.max_retries:
        return {
            "status": "critical",
            "message": f"STOP. It's tried {state['retry_count']} times. Human intervention required.",
            "next_action": "wake_engineer"
        }

    # This catches the context poison cycle before it kills production
    if session_time > constraints.max_session_length:
        return {
            "status": "warning",
            "message": f"Session at {session_time:.0f}s - risk of context degradation",
            "next_action": "checkpoint_and_refresh"
        }

    # This stops an agent from running without required business context
    missing = constraints.required_context_keys - set(state["context"])
    if missing:
        return {
            "status": "critical",
            "message": f"Missing required context keys: {sorted(missing)}",
            "next_action": "emergency_halt"
        }

    return {"status": "safe", "message": "🟢 Healthy", "next_action": "continue"}

Notice what’s different here? Every check has a dollar amount or system impact attached. Production agents aren’t about making code pretty - they’re about keeping your CFO from asking questions you don’t want to answer.
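To make the cost check concrete, here is the same hourly-burn arithmetic in isolation, with illustrative numbers: $3 spent over a ten-minute session extrapolates to an $18/hour run rate, far past a $2/hour limit.

```python
# The hourly-burn extrapolation from check_agent_safety, stand-alone:
total_cost = 3.0      # dollars spent so far
session_time = 600.0  # session age in seconds (10 minutes)

hourly_burn = (total_cost / session_time) * 3600 if session_time > 0 else 0
print(f"${hourly_burn:.2f}/hr")  # $18.00/hr - would trip a $2/hr limit
```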

The Real-World Agent That Actually Works

Here’s the agent skeleton that runs production at companies you’ve heard of:

production_agent_real.py

from langchain_openai import ChatOpenAI
import logging
import time

class ProductionAgent:
    def __init__(self, model_name="gpt-4o", safety_config=None):
        self.model = ChatOpenAI(model=model_name, temperature=0)
        self.safety = safety_config or get_production_safety_config()
        # This logging setup isn't decorative - it's how you debug when things break at 3 AM
        self.logger = setup_production_logging()
        # Track current execution state for safety checks
        self.current_state: ProductionAgentState = {
            "goal": "",
            "context": {},
            "current_task": "",
            "tools_used": [],
            "total_cost": 0.0,
            "retry_count": 0,
            "last_check": "",
            "meta_info": {},
            "session_start": time.time(),
            "error": None,
            "status": "initialized"
        }

    async def execute(self, goal: str, context: dict) -> dict:
        """
        Here's the real art: understanding that agents fail in categories, not randomly.

        Category 1: Budget failures (we're spending more than the task is worth)
        Category 2: Quality failures (the answer is wrong/unhelpful)
        Category 3: System failures (the infrastructure is dying)
        Category 4: Context failures (we're asking the wrong thing)
        """
        # Create enriched context (not just slapping it into a prompt)
        enriched_context = self._enrich_context(goal, context)

        # Run the pre-execution safety checklist
        safety_check = self._pre_execution_safety_check(enriched_context)
        if safety_check.status != "safe":
            self.logger.critical(f"Agent would fail safety: {safety_check.message}")
            return self._handle_safety_failure(safety_check)

        # Only now do we let it use the compute budget
        plan = await self._create_robust_plan(goal, enriched_context)

        # This prevents the most common failure: context poisoning
        self._checkpoint_context_before_execution(plan)

        execution_result = await self._execute_with_production_escalation(plan)
        return self._post_execution_cleanup(execution_result)

    async def _create_robust_plan(self, goal: str, context: dict) -> dict:
        """Here's the critical difference: we plan BEFORE we panic."""
        # Production planning includes failure strategies BY DEFAULT
        planning_prompt = f"""
        Create a production plan for: {goal}

        IMPORTANT: This plan must include:
        1. What to do if each step fails
        2. Maximum time budget per step
        3. How to roll back if we need to
        4. Success criteria for each step

        Business context: {context}
        Safety constraints: {self.safety}

        Remember: This will run without human supervision.
        """
        response = self.model.invoke(planning_prompt).content
        return {
            "plan": self._parse_production_plan(response),
            "rollback_commands": self._extract_rollback(response),
            "success_criteria": self._extract_criteria(response)
        }

    async def _execute_with_production_escalation(self, plan: dict) -> dict:
        """
        The key insight: execution succeeds or fails in ESCALATION patterns.

        Pattern 1: Step succeeds immediately (< 30 seconds)
        Pattern 2: Requires retry with a different approach (30s-2min)
        Pattern 3: Requires human review (2-10min)
        Pattern 4: Must halt immediately (emergency)
        """
        results = []
        final_status = "completed"
        for i, step in enumerate(plan["plan"]["steps"]):
            # Check business safety before EVERY step
            step_safety = check_agent_safety(self.current_state, self.safety)
            if step_safety["status"] != "safe":
                if step_safety["next_action"] == "emergency_halt":
                    final_status = "halted_for_safety"
                    break
                elif step_safety["next_action"] == "wake_engineer":
                    final_status = "waiting_for_human"
                    break

            # Execute step with production patterns
            step_result = self._execute_step_with_patterns(step, plan, i)
            results.append(step_result)

            # Real-time decision making based on outcome
            if step_result["requires_action"]:
                self.logger.info(f"Step {i} escalated to manual review")
                final_status = "escalated_to_manual"
                break

        return {
            "final_status": final_status,
            "step_results": results,
            "business_impact": self._calculate_impact(results),
            "rollback_ready": plan["rollback_commands"]
        }

    def _execute_step_with_patterns(self, step: dict, plan: dict, step_index: int) -> dict:
        """Critical insight: steps don't "fail" - they escalate through known patterns."""
        start_time = time.time()
        retry_count = 0
        max_retries = 2
        original_step = step.copy()

        while retry_count <= max_retries:
            try:
                # First attempt with the original approach
                if retry_count == 0:
                    result = self._attempt_step_original_way(step)
                # Second attempt with alternatives
                elif retry_count == 1:
                    result = self._attempt_step_alternative_way(step)
                # Third attempt gets human help
                else:
                    return {
                        "status": "requires_manual_intervention",
                        "message": "Automated patterns exhausted, calling human",
                        "requires_action": True,
                        "evidence": self._collect_debug_info(["current_attempts"])
                    }

                # Did we succeed by the current business definition?
                if self._step_meets_success_criteria(result, plan["success_criteria"][step_index]):
                    return {
                        "status": "success",
                        "result": result,
                        "time_seconds": time.time() - start_time,
                        "retries": retry_count
                    }

                # No success - log what went wrong for pattern analysis
                self.logger.debug(f"Step {step_index}: attempt {retry_count} failed pattern {result.get('failure_pattern')}")
                retry_count += 1
            except Exception as e:
                # This captures unexpected failures (network, API, etc.)
                self.logger.error(f"Unexpected failure in step {step_index}: {str(e)}")
                return {
                    "status": "unexpected_failure",
                    "message": str(e),
                    "requires_action": True,
                    "evidence": self._collect_debug_info(["exception", "step_trace"])
                }

        return {
            "status": "pattern_failure",
            "message": f"Step {step_index} failed all patterns",
            "requires_action": True,
            "evidence": self._log_failure_evidence(original_step, attempts={"all_attempts": retry_count})
        }

Stop and notice what’s different here: There’s no generic “on error, retry” nonsense. Every failure has a specific pattern, escalation path, and business decision.

Part 2: Multi-Agent Systems That Don’t Kill Each Other

Rookie mistake: thinking multi-agent is about making agents talk to each other. Production reality: It’s about making them STOP talking to each other when things go wrong.

multi_agent_lives.py

from typing import Dict, List
from dataclasses import dataclass
from functools import lru_cache

@dataclass
class ResolutionOption:
    action: str
    impact: float
    risk: float
    reversibility: float

@dataclass(frozen=True)  # frozen makes instances hashable, which @lru_cache below requires
class Conflict:
    agent_a_priority: float
    agent_a_risk: float
    agent_a_reversible: float
    agent_b_priority: float
    agent_b_risk: float
    agent_b_reversible: float

# Action constants for resolution options
agent_a_wins = "agent_a_wins"
agent_b_wins = "agent_b_wins"
both_lose_restart = "both_lose_restart"
escalate_to_human = "escalate_to_human"
timebox_experiment = "timebox_experiment"

class ProductionMultiAgentSystem:
    def __init__(self, max_agents: int = 5):
        self.agents: Dict[str, SafetyAwareAgent] = {}
        self.agent_timeout = 300  # 5 minutes max per agent
        self.conflict_resolver = AgentConflictResolver()
        # This is where we prevent the "agent civil war" disaster
        self.escalation_rules = {
            "agent_disagreement": "human_decision_required",
            "circular_dependency": "timeout_and_stalemate",
            "resource_conflict": "priority_order_with_fallback",
            "deadlock": "force_completion_with_logging"
        }

    def coordinate_agents(self, workflow: dict) -> dict:
        """
        The secret: coordinate by constraint satisfaction, not conversation.

        Instead of "Agent A, convince Agent B" we do:
        "Agent A, what's your constraint? Agent B, what's yours?"
        "Resolve mathematically who wins based on business rules"
        """
        agent_constraints = self._extract_all_agent_constraints(workflow)

        # If two agents want opposite things, resolve by rules, not negotiation
        conflicts = self._detect_agent_conflicts(agent_constraints)
        for conflict in conflicts:
            resolution = self.conflict_resolver.resolve_by_business_rules(conflict)
            # Document the decision for the audit trail
            self._log_decision_making(conflict, resolution)
            # Update the workflow based on the resolution
            workflow = self._update_workflow_with_resolution(workflow, resolution)
        return workflow

class AgentConflictResolver:
    def resolve_by_business_rules(self, conflict: Conflict) -> ResolutionOption:
        """
        Production secret: Legal mathematics > Agent intelligence.

        Don't ask agents who's right. Ask:
        - Which resolution has higher ROI?
        - Which has lower risk?
        - Which follows established policy?
        - What can we UNDO if wrong?
        """
        # This math literally saves companies from agents making bad deals
        options = self._generate_all_resolution_options(conflict)
        scored_options = []
        for option in options:
            score = self._score_resolution_business_impact(option)
            scored_options.append((score, option))

        # Sort by business score (ROI > risk > reversibility > everything else)
        scored_options.sort(key=lambda x: x[0], reverse=True)
        return scored_options[0][1]

    @lru_cache
    def _generate_all_resolution_options(self, conflict: Conflict) -> List[ResolutionOption]:
        """
        Generate RESOLUTIONS, not compromises.

        Instead of "let's split the difference" generate:
        - Agent A wins completely
        - Agent B wins completely
        - Both lose (clean restart)
        - Neither wins (get a human decision)
        - Defer to external validation (get data)
        - Timebox experiment (safe test)
        """
        return [
            ResolutionOption(agent_a_wins, impact=conflict.agent_a_priority, risk=conflict.agent_a_risk, reversibility=conflict.agent_a_reversible),
            ResolutionOption(agent_b_wins, impact=conflict.agent_b_priority, risk=conflict.agent_b_risk, reversibility=conflict.agent_b_reversible),
            ResolutionOption(both_lose_restart, impact=0, risk=0, reversibility=100),
            ResolutionOption(escalate_to_human, impact=0, risk=0, reversibility=100),
            ResolutionOption(timebox_experiment, impact=50, risk=20, reversibility=80)
        ]

    def _score_resolution_business_impact(self, resolution: ResolutionOption) -> float:
        """
        Convert rules into numbers that survive agent complexity.

        Score = (Business Impact * 0.5) - (Risk Score * 0.3) + (Reversibility * 0.2)

        This math ensures we optimize for:
        1. Business value (50% weight)
        2. Risk reduction (30% weight)
        3. Ability to fix mistakes (20% weight)

        The weights are hard-coded because in production you want published rules, not AI opinions.
        """
        return (
            resolution.impact * 0.5 -
            resolution.risk * 0.3 +
            resolution.reversibility * 0.2
        )

Why this works: Notice there’s no “agents vote” or “democratic decision making.” In production, you don’t want democracy - you want determinism. When agents disagree, the code doesn’t ask them to compromise. It runs math. Math doesn’t have opinions, doesn’t get tired, and doesn’t negotiate.
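Plugging illustrative numbers into that scoring formula shows the determinism in action: a high-impact but risky, hard-to-reverse "agent A wins" option loses to a timeboxed experiment.

```python
def score(impact: float, risk: float, reversibility: float) -> float:
    """Score = impact*0.5 - risk*0.3 + reversibility*0.2 (the resolver's weights)."""
    return impact * 0.5 - risk * 0.3 + reversibility * 0.2

# Illustrative inputs, not real business data
options = {
    "agent_a_wins":       score(impact=80, risk=60, reversibility=30),   # 28.0
    "escalate_to_human":  score(impact=0,  risk=0,  reversibility=100),  # 20.0
    "timebox_experiment": score(impact=50, risk=20, reversibility=80),   # 35.0
}
winner = max(options, key=options.get)
print(winner)  # timebox_experiment
```

Run the same inputs twice and you get the same winner twice; that repeatability is the whole point.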

Part 3: Deployment, Monitoring, and Not Getting Fired

You’ve built safe agents. Now let’s make sure they stay safe after deployment.

The Monitoring Stack That Actually Matters

monitoring_real.py

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Dict
import statistics

@dataclass
class AgentMetric:
    timestamp: datetime
    cost_usd: float
    latency_ms: float
    success: bool
    error_type: str | None
    agent_id: str
    task_type: str

class ProductionMonitor:
    """
    This is what you check at 3 AM when PagerDuty goes off.

    Key insight: You don't monitor "agent performance."
    You monitor "is this costing more than it's worth?"
    """
    def __init__(self):
        self.cost_alert_threshold = 10.0    # $10/hour triggers investigation
        self.error_rate_threshold = 0.15    # 15% error rate triggers an alert
        self.latency_threshold_ms = 5000    # 5 seconds triggers investigation

    def check_agent_health(self, metrics: List[AgentMetric]) -> Dict:
        """Run this every minute. It catches disasters before they become headlines."""
        recent_metrics = [m for m in metrics if m.timestamp > datetime.now() - timedelta(hours=1)]
        if not recent_metrics:
            return {"status": "no_data", "action": "check_agent_connectivity"}

        # The metrics that actually matter to your career
        hourly_cost = sum(m.cost_usd for m in recent_metrics)
        error_rate = sum(1 for m in recent_metrics if not m.success) / len(recent_metrics)
        avg_latency = statistics.mean(m.latency_ms for m in recent_metrics)

        alerts = []
        # Cost explosion detection
        if hourly_cost > self.cost_alert_threshold:
            alerts.append({
                "severity": "high",
                "message": f"💰 Cost spike: ${hourly_cost:.2f}/hour",
                "action": "review_recent_tasks_and_maybe_kill_agent"
            })
        # Error rate spike
        if error_rate > self.error_rate_threshold:
            alerts.append({
                "severity": "medium",
                "message": f"⚠️ Error rate: {error_rate:.1%}",
                "action": "investigate_common_failure_patterns"
            })
        # Latency degradation
        if avg_latency > self.latency_threshold_ms:
            alerts.append({
                "severity": "low",
                "message": f"🐌 Slow responses: {avg_latency:.0f}ms avg",
                "action": "check_model_availability_and_context_size"
            })

        return {
            "status": "alerting" if alerts else "healthy",
            "hourly_cost": hourly_cost,
            "error_rate": error_rate,
            "avg_latency_ms": avg_latency,
            "alerts": alerts
        }

    def generate_daily_report(self, metrics: List[AgentMetric]) -> str:
        """
        This is what you send to your boss every morning.
        Format: "Here's what our agents did yesterday and whether it was worth it"
        """
        yesterday = [m for m in metrics if m.timestamp > datetime.now() - timedelta(days=1)]
        total_cost = sum(m.cost_usd for m in yesterday)
        total_tasks = len(yesterday)
        successful_tasks = sum(1 for m in yesterday if m.success)
        cost_per_success = total_cost / successful_tasks if successful_tasks > 0 else float('inf')

        # This calculation answers: "Would it have been cheaper to hire a human?"
        human_cost_equivalent = total_tasks * 0.50  # Assume $0.50 per task for a human
        roi = (human_cost_equivalent - total_cost) / human_cost_equivalent * 100 if human_cost_equivalent > 0 else 0

        return f"""
🤖 Daily Agent Report
Tasks completed: {total_tasks} ({successful_tasks} successful)
Total cost: ${total_cost:.2f}
Cost per successful task: ${cost_per_success:.4f}
ROI vs human labor: {roi:.1f}%
{'✅ Agents paying for themselves' if roi > 0 else '🔴 Agents costing more than humans - investigate'}
"""

Production Deployment Configuration

Here’s everything you need to actually run this in production:

1. Agent Configuration (config.yaml)

config.yaml

# Production Agent Configuration
# Copy this file and customize for your environment

agent:
  name: "production-agent"
  model: "gpt-4o"
  temperature: 0  # Deterministic outputs for production

safety:
  # Cost controls - adjust based on your budget
  max_cost_per_hour: 5.0      # Kill agent if burning > $5/hr
  max_cost_per_day: 50.0      # Daily budget cap
  max_retries: 3              # Prevent infinite loops
  max_session_length: 3600    # 1 hour max before context refresh

  # Required context - agent won't start without these
  required_context_keys:
    - project_id
    - user_id
    - budget_limit
    - task_priority

monitoring:
  # Alert thresholds
  cost_alert_threshold: 10.0    # Alert at $10/hr
  error_rate_threshold: 0.15    # Alert at 15% errors
  latency_threshold_ms: 5000    # Alert at 5s latency

  # Where to send alerts
  alert_channels:
    - type: slack
      webhook_url: "${SLACK_WEBHOOK_URL}"
    - type: email
      recipients:
        - oncall@yourcompany.com

  # Metrics storage
  metrics_backend: "prometheus"
  metrics_port: 9090

escalation:
  # What happens when the agent needs help
  on_failure:
    - notify_oncall: true
    - create_incident: true
    - auto_rollback: true

  # PagerDuty integration
  pagerduty:
    service_key: "${PAGERDUTY_SERVICE_KEY}"
    severity_map:
      critical: "P1"
      warning: "P2"

logging:
  level: "INFO"
  format: "json"  # Structured logs for production
  output: "/var/log/agent/production.log"
  retention_days: 30

2. Docker Configuration

Dockerfile

FROM python:3.11-slim

WORKDIR /app

# curl is needed for the HEALTHCHECK below (slim images don't include it)
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy agent code
COPY . .

# Create non-root user for security
RUN useradd -m -u 1000 agent && \
    chown -R agent:agent /app
USER agent

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# Run the agent service
CMD ["python", "-m", "agent.server"]
requirements.txt
langchain-openai>=0.1.0
langchain-core>=0.1.0
pydantic>=2.0.0
prometheus-client>=0.19.0
pyyaml>=6.0
structlog>=23.0.0
httpx>=0.25.0

3. Docker Compose for Easy Deployment

docker-compose.yaml

version: "3.8"

services:
  agent:
    build: .
    container_name: production-agent
    restart: unless-stopped
    ports:
      - "8080:8080"  # API endpoint
      - "9090:9090"  # Metrics endpoint
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CONFIG_PATH=/app/config.yaml
      - LOG_LEVEL=INFO
    volumes:
      - ./config.yaml:/app/config.yaml:ro
      - agent-logs:/var/log/agent
    depends_on:
      - redis
      - prometheus
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: "2.0"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    container_name: agent-redis
    restart: unless-stopped
    volumes:
      - redis-data:/data
    command: redis-server --appendonly yes

  prometheus:
    image: prom/prometheus:latest
    container_name: agent-prometheus
    restart: unless-stopped
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  alertmanager:
    image: prom/alertmanager:latest
    container_name: agent-alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

volumes:
  agent-logs:
  redis-data:
  prometheus-data:

4. Quick Start

terminal

# 1. Clone and configure
git clone your-agent-repo
cd your-agent-repo
cp config.example.yaml config.yaml

# 2. Set your API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# 3. Customize config.yaml for your use case
vim config.yaml

# 4. Start everything
docker-compose up -d

# 5. Check health
curl http://localhost:8080/health

# 6. View metrics
open http://localhost:9091

# 7. Send a task
curl -X POST http://localhost:8080/execute \
  -H "Content-Type: application/json" \
  -d '{
    "goal": "Analyze weekly sales data and send a report",
    "context": {
      "project_id": "sales-analysis",
      "user_id": "user-123",
      "budget_limit": 1.0
    }
  }'

The Deployment Checklist

Before any agent touches production:

  1. Cost limit set? There’s a hard dollar amount that kills the agent automatically.
  2. Timeout configured? No agent runs longer than X minutes without a checkpoint.
  3. Rollback tested? You can undo what the agent did in under 5 minutes.
  4. Human escalation working? When the agent gives up, a real person gets notified.
  5. Audit logging enabled? Every decision is logged for the inevitable post-mortem.
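Item 2 on the checklist, as runnable code: wrap agent jobs in a hard wall-clock timeout so nothing runs unbounded. The sketch below uses a subprocess timeout as a stand-in for whatever your job runner actually is.

```python
import subprocess

# A "job" that would run for 10 seconds, capped at a 1-second hard timeout.
try:
    subprocess.run(["sleep", "10"], timeout=1)
    outcome = "completed"
except subprocess.TimeoutExpired:
    outcome = "killed_by_timeout"  # the child process is killed, not just abandoned

print(outcome)  # killed_by_timeout
```

The same idea applies in-process: `asyncio.wait_for` or a watchdog thread gives you the identical guarantee for coroutine-based agents.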

The Runbook: When Agents Break at 3 AM

IF agent_cost > $50/hour:
  1. Kill the agent process immediately
  2. Check the recent task queue for runaway jobs
  3. Review logs for retry loops
  4. Wake the on-call engineer if cost > $100

IF agent_error_rate > 20%:
  1. Check model API status (OpenAI/Anthropic status pages)
  2. Review recent context for corruption
  3. Restart with fresh context if needed
  4. Escalate if errors persist after restart

IF agent_latency > 10 seconds:
  1. Check context window size
  2. Look for infinite loops in the task queue
  3. Consider switching to a faster/cheaper model temporarily

The Bottom Line

Production AI agents in 2026 aren’t about building smarter agents. They’re about building agents that fail gracefully, cost predictably, and never surprise your CFO.

The companies winning with agents didn’t hire better AI engineers - they hired better constraint engineers. They realized that:

  1. Constraints > Intelligence: A dumb agent with hard limits beats a smart agent with none.
  2. Math > Negotiation: When agents disagree, run calculations, not conversations.
  3. Monitoring > Hoping: If you can’t measure it, it will bankrupt you.
  4. Rollback > Perfection: The ability to undo is worth more than the ability to get it right.

Your demo agent made people say “wow.” Your production agent should make people say “nothing happened, and that’s exactly right.”

That’s the no-bullshit way to build AI agents that actually work.

Contact us

Email: tribeofprogrammers@gmail.com Call: +91 7604906337
© 2025 top