# Monitoring & Observability

This guide covers monitoring Station in production: metrics, logging, alerting, and dashboards.

:::tip[Looking for Tracing?]
For OpenTelemetry/Jaeger tracing, see Observability.
:::

## Key Metrics

### Agent Execution

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `agent_runs_total` | Total agent executions | - |
| `agent_runs_success` | Successful runs | - |
| `agent_runs_failed` | Failed runs | > 5% failure rate |
| `agent_run_duration_seconds` | Execution time | P95 > 60s |
| `agent_steps_total` | Steps per run | Avg > `max_steps` |

### LLM Usage

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `llm_requests_total` | Total LLM API calls | - |
| `llm_tokens_input` | Input tokens | > budget |
| `llm_tokens_output` | Output tokens | > budget |
| `llm_latency_seconds` | API response time | P95 > 5s |
| `llm_errors_total` | API errors | > 1% error rate |
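
A quick way to sanity-check consumption against your budget is to sum the token counters over a window. The queries below are a PromQL sketch and assume the counters are exported under exactly the names shown in the table:

```promql
# Input and output tokens consumed over the last hour (metric names assumed from the table above)
sum(increase(llm_tokens_input[1h]))
sum(increase(llm_tokens_output[1h]))
```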

### System Health

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `station_uptime_seconds` | Time since start | - |
| `mcp_servers_active` | Connected MCP servers | < expected |
| `scheduler_runs_pending` | Scheduled runs queued | > 10 |
| `database_size_bytes` | SQLite database size | > 1GB |
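
To collect these metrics with Prometheus, point a scrape job at the Station server. The snippet below is a sketch only; the target port and the `/metrics` path are assumptions, so confirm where your Station build actually exposes Prometheus metrics:

```yaml
# prometheus.yml -- hypothetical scrape job for Station
scrape_configs:
  - job_name: station
    scrape_interval: 30s
    metrics_path: /metrics          # assumption; adjust to your deployment
    static_configs:
      - targets: ["localhost:8585"] # assumption; adjust to your deployment
```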

## Logging

### Log Levels

| Level | When Used |
| --- | --- |
| ERROR | Failures requiring attention |
| WARN | Potential issues |
| INFO | Normal operations |
| DEBUG | Detailed debugging |

### Configure Logging

```yaml
# config.yaml
log_level: info
log_format: json  # json or text
```

Or via environment variables:

```bash
export STN_LOG_LEVEL=debug
export STN_LOG_FORMAT=json
```

### Log Output

```json
{
  "level": "info",
  "ts": "2024-01-15T10:30:00Z",
  "msg": "Agent execution completed",
  "agent_id": 21,
  "agent_name": "cost-analyzer",
  "run_id": 123,
  "duration_ms": 4532,
  "status": "success"
}
```
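
Because the logs are structured JSON, standard tooling can slice them. For example, assuming log lines are captured to a file (the path below is a placeholder), `jq` can pull out error-level entries per agent:

```bash
# Print timestamp, agent, run id, and message for every error-level log line.
# /var/log/station.log is a placeholder path; use wherever your logs land.
jq -r 'select(.level == "error") | "\(.ts) \(.agent_name // "-") run=\(.run_id // "-") \(.msg)"' /var/log/station.log
```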

### View Logs

```bash
# All logs
stn logs

# Follow logs
stn logs -f

# Filter by level
stn logs --level error

# Filter by agent
stn logs --agent cost-analyzer
```

## Health Checks

### HTTP Endpoint

```bash
curl http://localhost:8585/health
```

```json
{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": "2h 15m 30s",
  "agents": 12,
  "mcp_servers": 5,
  "database": "ok",
  "scheduler": "running"
}
```
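
For a lightweight external check outside Kubernetes or Docker, a cron-driven script can poll the endpoint and notify a webhook when Station stops reporting healthy. This is a minimal sketch; `ALERT_WEBHOOK_URL` is a placeholder for whatever notification target you use:

```bash
#!/usr/bin/env bash
# Poll Station's health endpoint and post to a webhook when it is not healthy.
status=$(curl -fsS http://localhost:8585/health | jq -r '.status')
if [ "$status" != "healthy" ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"Station health check failed: status=${status:-unreachable}\"}" \
    "$ALERT_WEBHOOK_URL"
fi
```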

### Kubernetes Probes

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: station
        livenessProbe:
          httpGet:
            path: /health
            port: 8585
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8585
          initialDelaySeconds: 5
          periodSeconds: 10
```

### Docker Healthcheck

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8585/health || exit 1
```
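
The same check expressed in Docker Compose (the service and image names below are placeholders):

```yaml
services:
  station:
    image: station:latest   # placeholder image reference
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8585/health"]
      interval: 30s
      timeout: 10s
      start_period: 5s
      retries: 3
```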

## Alerting

### Webhook Notifications

Configure alerts via webhooks:

```yaml
# config.yaml
notifications:
  webhook_url: https://hooks.slack.com/services/...
  webhook_events:
    - agent.failed
    - agent.timeout
    - scheduler.error
```
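
To verify delivery before wiring up Slack, you can temporarily point `webhook_url` at a throwaway local listener. The one-liner below is a rough sketch (it requires `ncat` and handles a single delivery); it just dumps whatever Station posts and makes no assumptions about the payload shape:

```bash
# Listen on http://localhost:9000, return an empty 200, and print the incoming request.
printf 'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n' | ncat -l 9000
```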

### Example Alerts

**Agent Failure Rate High:**

```yaml
- alert: AgentFailureRateHigh
  expr: rate(agent_runs_failed[5m]) / rate(agent_runs_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent failure rate above 5%"
```

**LLM Latency High:**

```yaml
- alert: LLMLatencyHigh
  expr: histogram_quantile(0.95, sum(rate(llm_latency_seconds_bucket[5m])) by (le)) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM API P95 latency above 5s"
```
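
**Token Budget Exceeded** (a sketch following the same pattern; the counter names are assumed from the LLM Usage table and the threshold is only an example, so substitute your own budget):

```yaml
- alert: TokenBudgetExceeded
  expr: sum(increase(llm_tokens_input[24h])) + sum(increase(llm_tokens_output[24h])) > 5000000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Daily LLM token usage above budget"
```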

## Dashboards

### Grafana Dashboard

Import the Station dashboard:

```json
{
  "dashboard": {
    "title": "Station Overview",
    "panels": [
      {
        "title": "Agent Executions",
        "type": "graph",
        "targets": [
          {"expr": "rate(agent_runs_total[5m])"}
        ]
      },
      {
        "title": "Success Rate",
        "type": "gauge",
        "targets": [
          {"expr": "rate(agent_runs_success[1h]) / rate(agent_runs_total[1h]) * 100"}
        ]
      },
      {
        "title": "LLM Token Usage",
        "type": "graph",
        "targets": [
          {"expr": "sum(rate(llm_tokens_total[1h])) by (type)"}
        ]
      }
    ]
  }
}
```

### Key Panels

1. Agent Execution Rate - Runs per minute
2. Success Rate - Percentage of successful runs
3. Execution Duration - P50, P95, P99 latency (see the example queries after this list)
4. Token Usage - Input/output tokens over time
5. Error Rate - Failures by type
6. Active Schedules - Running scheduled agents
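
For the duration panel, the P50/P95/P99 series can be derived with `histogram_quantile`, assuming `agent_run_duration_seconds` is exported as a Prometheus histogram:

```promql
histogram_quantile(0.50, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
```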

## Performance Monitoring

### Slow Agent Detection

```bash
# Find slow runs
stn runs list --min-duration 30s

# Inspect slow run
stn runs inspect 123 --verbose
```

### Token Usage Tracking

```bash
# View token usage by agent
stn stats tokens --group-by agent --period 7d

# View token usage by model
stn stats tokens --group-by model --period 7d
```

### Resource Usage

```bash
# CPU and memory
docker stats station

# Database size
ls -lh ~/.config/station/station.db

# Run history size
stn stats runs --period 30d
```
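
To catch database growth before it becomes a problem, a small cron-able check against the 1GB threshold from the System Health table might look like this (the path matches the example above; adjust for your install):

```bash
#!/usr/bin/env bash
# Warn when the SQLite database passes 1 GB.
db="$HOME/.config/station/station.db"
size=$(stat -c %s "$db" 2>/dev/null || stat -f %z "$db")
if [ "$size" -gt $((1024 * 1024 * 1024)) ]; then
  echo "station.db is over 1 GB; consider pruning old runs (stn runs prune)" >&2
fi
```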

## Production Checklist

### Pre-Deployment

- Configure structured logging (JSON format)
- Set up health check endpoint monitoring
- Configure alerting webhooks
- Set resource limits (memory, CPU); see the sketch after this list
- Configure database backups
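
For the resource-limits item, one starting point on Kubernetes is to extend the Deployment from the probes example above; the values below are placeholders to tune for your workload:

```yaml
containers:
- name: station
  resources:
    requests:
      cpu: 250m       # placeholder value
      memory: 512Mi   # placeholder value
    limits:
      cpu: "1"
      memory: 1Gi
```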

### Post-Deployment

- Verify health checks passing
- Test alerting pipeline
- Baseline normal metrics
- Set up dashboard
- Document runbooks

### Ongoing

- Review error logs daily
- Monitor token usage vs budget
- Track success rate trends
- Review slow agent runs
- Prune old run history

## Troubleshooting

### High Error Rate

1. Check logs: `stn logs --level error`
2. Review failed runs: `stn runs list --status error`
3. Inspect specific failure: `stn runs inspect <id>`
4. Check MCP server health: `stn status`

### Slow Execution

1. Check LLM latency in traces (Jaeger)
2. Review tool call durations
3. Check for retry loops in agent output
4. Consider increasing the timeout or reducing `max_steps`

### Memory Issues

1. Check database size
2. Prune old runs: `stn runs prune --older-than 30d`
3. Review agent output sizes
4. Consider a cloud database for large deployments

## Next Steps