# Monitoring & Observability

This guide covers monitoring Station in production: metrics, logging, alerting, and dashboards.

:::tip[Looking for Tracing?]
For OpenTelemetry/Jaeger tracing, see Observability.
:::

## Key Metrics

### Agent Execution

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `agent_runs_total` | Total agent executions | - |
| `agent_runs_success` | Successful runs | - |
| `agent_runs_failed` | Failed runs | > 5% failure rate |
| `agent_run_duration_seconds` | Execution time | P95 > 60s |
| `agent_steps_total` | Steps per run | Avg > `max_steps` |

### LLM Usage

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `llm_requests_total` | Total LLM API calls | - |
| `llm_tokens_input` | Input tokens | > budget |
| `llm_tokens_output` | Output tokens | > budget |
| `llm_latency_seconds` | API response time | P95 > 5s |
| `llm_errors_total` | API errors | > 1% error rate |
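
A quick way to sanity-check consumption against your budget is to sum the token counters over a window. The queries below are a PromQL sketch and assume the counters are exported under exactly the names shown in the table:

```promql
# Input and output tokens consumed over the last hour (metric names assumed from the table above)
sum(increase(llm_tokens_input[1h]))
sum(increase(llm_tokens_output[1h]))
```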

### System Health

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| `station_uptime_seconds` | Time since start | - |
| `mcp_servers_active` | Connected MCP servers | < expected |
| `scheduler_runs_pending` | Scheduled runs queued | > 10 |
| `database_size_bytes` | SQLite database size | > 1GB |
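
To collect these metrics with Prometheus, point a scrape job at the Station server. The snippet below is a sketch only; the target port and the `/metrics` path are assumptions, so confirm where your Station build actually exposes Prometheus metrics:

```yaml
# prometheus.yml -- hypothetical scrape job for Station
scrape_configs:
  - job_name: station
    scrape_interval: 30s
    metrics_path: /metrics          # assumption; adjust to your deployment
    static_configs:
      - targets: ["localhost:8585"] # assumption; adjust to your deployment
```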

## Logging

### Log Levels

| Level | When Used |
| --- | --- |
| ERROR | Failures requiring attention |
| WARN | Potential issues |
| INFO | Normal operations |
| DEBUG | Detailed debugging |

### Configure Logging

```yaml
# config.yaml
log_level: info
log_format: json  # json or text
```

Or via environment variables:

```bash
export STN_LOG_LEVEL=debug
export STN_LOG_FORMAT=json
```

### Log Output

```json
{
  "level": "info",
  "ts": "2024-01-15T10:30:00Z",
  "msg": "Agent execution completed",
  "agent_id": 21,
  "agent_name": "cost-analyzer",
  "run_id": 123,
  "duration_ms": 4532,
  "status": "success"
}
```
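
Because the logs are structured JSON, standard tooling can slice them. For example, assuming log lines are captured to a file (the path below is a placeholder), `jq` can pull out error-level entries per agent:

```bash
# Print timestamp, agent, run id, and message for every error-level log line.
# /var/log/station.log is a placeholder path; use wherever your logs land.
jq -r 'select(.level == "error") | "\(.ts) \(.agent_name // "-") run=\(.run_id // "-") \(.msg)"' /var/log/station.log
```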

### View Logs

```bash
# All logs
stn logs

# Follow logs
stn logs -f

# Filter by level
stn logs --level error

# Filter by agent
stn logs --agent cost-analyzer
```

## Health Checks

### HTTP Endpoint

```bash
curl http://localhost:8585/health
```

```json
{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": "2h 15m 30s",
  "agents": 12,
  "mcp_servers": 5,
  "database": "ok",
  "scheduler": "running"
}
```
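
For a lightweight external check outside Kubernetes or Docker, a cron-driven script can poll the endpoint and notify a webhook when Station stops reporting healthy. This is a minimal sketch; `ALERT_WEBHOOK_URL` is a placeholder for whatever notification target you use:

```bash
#!/usr/bin/env bash
# Poll Station's health endpoint and post to a webhook when it is not healthy.
status=$(curl -fsS http://localhost:8585/health | jq -r '.status')
if [ "$status" != "healthy" ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"Station health check failed: status=${status:-unreachable}\"}" \
    "$ALERT_WEBHOOK_URL"
fi
```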

### Kubernetes Probes

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: station
        livenessProbe:
          httpGet:
            path: /health
            port: 8585
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /health
            port: 8585
          initialDelaySeconds: 5
          periodSeconds: 10
```

### Docker Healthcheck

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8585/health || exit 1
```
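
The same check expressed in Docker Compose (the service and image names below are placeholders):

```yaml
services:
  station:
    image: station:latest   # placeholder image reference
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8585/health"]
      interval: 30s
      timeout: 10s
      start_period: 5s
      retries: 3
```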

## Alerting

### Webhook Notifications

Configure alerts via webhooks:

```yaml
# config.yaml
notifications:
  webhook_url: https://hooks.slack.com/services/...
  webhook_events:
    - agent.failed
    - agent.timeout
    - scheduler.error
```
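
To verify delivery before wiring up Slack, you can temporarily point `webhook_url` at a throwaway local listener. The one-liner below is a rough sketch (it requires `ncat` and handles a single delivery); it just dumps whatever Station posts and makes no assumptions about the payload shape:

```bash
# Listen on http://localhost:9000, return an empty 200, and print the incoming request.
printf 'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n' | ncat -l 9000
```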

### Example Alerts

**Agent Failure Rate High:**

```yaml
- alert: AgentFailureRateHigh
  expr: rate(agent_runs_failed[5m]) / rate(agent_runs_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent failure rate above 5%"
```

**LLM Latency High:**

```yaml
- alert: LLMLatencyHigh
  expr: histogram_quantile(0.95, sum(rate(llm_latency_seconds_bucket[5m])) by (le)) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM API P95 latency above 5s"
```
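
**Token Budget Exceeded** (a sketch following the same pattern; the counter names are assumed from the LLM Usage table and the threshold is only an example, so substitute your own budget):

```yaml
- alert: TokenBudgetExceeded
  expr: sum(increase(llm_tokens_input[24h])) + sum(increase(llm_tokens_output[24h])) > 5000000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Daily LLM token usage above budget"
```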

## Dashboards

### Grafana Dashboard

Import the Station dashboard:

```json
{
  "dashboard": {
    "title": "Station Overview",
    "panels": [
      {
        "title": "Agent Executions",
        "type": "graph",
        "targets": [
          {"expr": "rate(agent_runs_total[5m])"}
        ]
      },
      {
        "title": "Success Rate",
        "type": "gauge",
        "targets": [
          {"expr": "rate(agent_runs_success[1h]) / rate(agent_runs_total[1h]) * 100"}
        ]
      },
      {
        "title": "LLM Token Usage",
        "type": "graph",
        "targets": [
          {"expr": "sum(rate(llm_tokens_total[1h])) by (type)"}
        ]
      }
    ]
  }
}
```

### Key Panels

1. Agent Execution Rate - Runs per minute
2. Success Rate - Percentage of successful runs
3. Execution Duration - P50, P95, P99 latency (see the example queries after this list)
4. Token Usage - Input/output tokens over time
5. Error Rate - Failures by type
6. Active Schedules - Running scheduled agents
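
For the duration panel, the P50/P95/P99 series can be derived with `histogram_quantile`, assuming `agent_run_duration_seconds` is exported as a Prometheus histogram:

```promql
histogram_quantile(0.50, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(agent_run_duration_seconds_bucket[5m])) by (le))
```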

## Performance Monitoring

### Slow Agent Detection

```bash
# Find slow runs
stn runs list --min-duration 30s

# Inspect slow run
stn runs inspect 123 --verbose
```

### Token Usage Tracking

```bash
# View token usage by agent
stn stats tokens --group-by agent --period 7d

# View token usage by model
stn stats tokens --group-by model --period 7d
```

### Resource Usage

```bash
# CPU and memory
docker stats station

# Database size
ls -lh ~/.config/station/station.db

# Run history size
stn stats runs --period 30d
```
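
To catch database growth before it becomes a problem, a small cron-able check against the 1GB threshold from the System Health table might look like this (the path matches the example above; adjust for your install):

```bash
#!/usr/bin/env bash
# Warn when the SQLite database passes 1 GB.
db="$HOME/.config/station/station.db"
size=$(stat -c %s "$db" 2>/dev/null || stat -f %z "$db")
if [ "$size" -gt $((1024 * 1024 * 1024)) ]; then
  echo "station.db is over 1 GB; consider pruning old runs (stn runs prune)" >&2
fi
```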

## Production Checklist

### Pre-Deployment

- Configure structured logging (JSON format)
- Set up health check endpoint monitoring
- Configure alerting webhooks
- Set resource limits (memory, CPU); see the sketch after this list
- Configure database backups
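
For the resource-limits item, one starting point on Kubernetes is to extend the Deployment from the probes example above; the values below are placeholders to tune for your workload:

```yaml
containers:
- name: station
  resources:
    requests:
      cpu: 250m       # placeholder value
      memory: 512Mi   # placeholder value
    limits:
      cpu: "1"
      memory: 1Gi
```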

### Post-Deployment

- Verify health checks passing
- Test alerting pipeline
- Baseline normal metrics
- Set up dashboard
- Document runbooks

### Ongoing

- Review error logs daily
- Monitor token usage vs budget
- Track success rate trends
- Review slow agent runs
- Prune old run history

## Troubleshooting

### High Error Rate

1. Check logs: `stn logs --level error`
2. Review failed runs: `stn runs list --status error`
3. Inspect specific failure: `stn runs inspect <id>`
4. Check MCP server health: `stn status`

### Slow Execution

1. Check LLM latency in traces (Jaeger)
2. Review tool call durations
3. Check for retry loops in agent output
4. Consider increasing the timeout or reducing `max_steps`

### Memory Issues

1. Check database size
2. Prune old runs: `stn runs prune --older-than 30d`
3. Review agent output sizes
4. Consider a cloud database for large deployments

## Next Steps