# Monitoring & Observability

This guide covers monitoring Station in production: metrics, logging, alerting, and dashboards.

:::tip[Looking for Tracing?]
For OpenTelemetry/Jaeger tracing, see Observability.
:::
## Key Metrics

### Agent Execution

| Metric | Description | Alert Threshold |
|---|---|---|
| `agent_runs_total` | Total agent executions | - |
| `agent_runs_success` | Successful runs | - |
| `agent_runs_failed` | Failed runs | > 5% failure rate |
| `agent_run_duration_seconds` | Execution time | P95 > 60s |
| `agent_steps_total` | Steps per run | Avg approaching `max_steps` |
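These metrics are named in Prometheus style. Assuming Station exposes them on a `/metrics` endpoint on the same port as the health endpoint shown later (an assumption to verify for your build), a minimal scrape config sketch:

```yaml
# prometheus.yml -- scrape sketch; the /metrics path and port 8585
# are assumptions, not confirmed Station defaults.
scrape_configs:
  - job_name: station
    static_configs:
      - targets: ["localhost:8585"]
    metrics_path: /metrics
```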
### LLM Usage

| Metric | Description | Alert Threshold |
|---|---|---|
| `llm_requests_total` | Total LLM API calls | - |
| `llm_tokens_input` | Input tokens | > budget |
| `llm_tokens_output` | Output tokens | > budget |
| `llm_latency_seconds` | API response time | P95 > 5s |
| `llm_errors_total` | API errors | > 1% error rate |
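Budget comparisons fall out of the token counters directly; for example, hourly consumption by direction in PromQL (assuming both metrics are standard Prometheus counters):

```promql
# Tokens consumed over the last hour, by direction
sum(increase(llm_tokens_input[1h]))
sum(increase(llm_tokens_output[1h]))
```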
### System Health

| Metric | Description | Alert Threshold |
|---|---|---|
| `station_uptime_seconds` | Time since start | - |
| `mcp_servers_active` | Connected MCP servers | < expected |
| `scheduler_runs_pending` | Scheduled runs queued | > 10 |
| `database_size_bytes` | SQLite database size | > 1GB |
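The `< expected` threshold needs a concrete number per deployment. A sketch in the same Prometheus rule format used under Alerting below; the `5` is a placeholder for your expected server count:

```yaml
- alert: MCPServersBelowExpected
  expr: mcp_servers_active < 5  # placeholder: your expected server count
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Fewer MCP servers connected than expected"
```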
## Logging

### Log Levels

| Level | When Used |
|---|---|
| ERROR | Failures requiring attention |
| WARN | Potential issues |
| INFO | Normal operations |
| DEBUG | Detailed debugging |
### Configure Logging

```yaml
# config.yaml
log_level: info
log_format: json  # json or text
```

Or via environment variables:

```bash
export STN_LOG_LEVEL=debug
export STN_LOG_FORMAT=json
```
### Log Output

With `log_format: json`, each entry is a single JSON object:

```json
{
  "level": "info",
  "ts": "2024-01-15T10:30:00Z",
  "msg": "Agent execution completed",
  "agent_id": 21,
  "agent_name": "cost-analyzer",
  "run_id": 123,
  "duration_ms": 4532,
  "status": "success"
}
```
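JSON logs are easy to post-process. For instance, a quick jq filter over the fields shown above, assuming one object per line and that failed runs carry `"status": "failed"` (an assumption mirroring `agent_runs_failed`, not confirmed from Station's log schema):

```bash
# List failed runs with timestamp, agent name, and duration;
# the "failed" status value is assumed, verify against your logs.
stn logs | jq -r 'select(.status == "failed") | "\(.ts) \(.agent_name) \(.duration_ms)ms"'
```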
### View Logs

```bash
# All logs
stn logs

# Follow logs
stn logs -f

# Filter by level
stn logs --level error

# Filter by agent
stn logs --agent cost-analyzer
```
## Health Checks

### HTTP Endpoint

```bash
curl http://localhost:8585/health
```

```json
{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": "2h 15m 30s",
  "agents": 12,
  "mcp_servers": 5,
  "database": "ok",
  "scheduler": "running"
}
```
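External monitors that need a strict pass/fail signal can assert on the response body rather than just the HTTP status code; a minimal sketch with jq:

```bash
# Exit non-zero unless the server reports healthy
curl -fsS http://localhost:8585/health | jq -e '.status == "healthy"' > /dev/null
```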
### Kubernetes Probes

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: station
          livenessProbe:
            httpGet:
              path: /health
              port: 8585
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8585
            initialDelaySeconds: 5
            periodSeconds: 10
```
### Docker Healthcheck

```dockerfile
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8585/health || exit 1
```
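The docker-compose equivalent, assuming curl is available in the image:

```yaml
services:
  station:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8585/health"]
      interval: 30s
      timeout: 10s
      start_period: 5s
      retries: 3
```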
## Alerting

### Webhook Notifications

Configure alerts via webhooks:

```yaml
# config.yaml
notifications:
  webhook_url: https://hooks.slack.com/services/...
  webhook_events:
    - agent.failed
    - agent.timeout
    - scheduler.error
```
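Before wiring it into Station, the Slack URL itself can be smoke-tested with Slack's own incoming-webhook payload (note this is Slack's format, not Station's event format):

```bash
curl -X POST -H 'Content-Type: application/json' \
  -d '{"text": "Station alert test"}' \
  https://hooks.slack.com/services/...
```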
### Example Alerts

**Agent Failure Rate High:**

```yaml
- alert: AgentFailureRateHigh
  expr: rate(agent_runs_failed[5m]) / rate(agent_runs_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Agent failure rate above 5%"
```

**LLM Latency High:**

```yaml
- alert: LLMLatencyHigh
  expr: histogram_quantile(0.95, rate(llm_latency_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "LLM API P95 latency above 5s"
```
## Dashboards

### Grafana Dashboard

Import the Station dashboard:

```json
{
  "dashboard": {
    "title": "Station Overview",
    "panels": [
      {
        "title": "Agent Executions",
        "type": "graph",
        "targets": [
          {"expr": "rate(agent_runs_total[5m])"}
        ]
      },
      {
        "title": "Success Rate",
        "type": "gauge",
        "targets": [
          {"expr": "rate(agent_runs_success[1h]) / rate(agent_runs_total[1h]) * 100"}
        ]
      },
      {
        "title": "LLM Token Usage",
        "type": "graph",
        "targets": [
          {"expr": "rate(llm_tokens_input[1h])"},
          {"expr": "rate(llm_tokens_output[1h])"}
        ]
      }
    ]
  }
}
```
### Key Panels

- **Agent Execution Rate** - Runs per minute
- **Success Rate** - Percentage of successful runs
- **Execution Duration** - P50, P95, P99 latency
- **Token Usage** - Input/output tokens over time
- **Error Rate** - Failures by type
- **Active Schedules** - Running scheduled agents
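For the duration panel, percentiles come from the duration histogram; a sketch assuming `agent_run_duration_seconds` is exported with Prometheus histogram buckets, as `llm_latency_seconds` is in the alert above:

```promql
histogram_quantile(0.50, rate(agent_run_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(agent_run_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(agent_run_duration_seconds_bucket[5m]))
```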
## Performance Monitoring

### Slow Agent Detection

```bash
# Find slow runs
stn runs list --min-duration 30s

# Inspect slow run
stn runs inspect 123 --verbose
```
### Token Usage Tracking

```bash
# View token usage by agent
stn stats tokens --group-by agent --period 7d

# View token usage by model
stn stats tokens --group-by model --period 7d
```
### Resource Usage

```bash
# CPU and memory
docker stats station

# Database size
ls -lh ~/.config/station/station.db

# Run history size
stn stats runs --period 30d
```
## Production Checklist

### Pre-Deployment
- Configure structured logging (JSON format)
- Set up health check endpoint monitoring
- Configure alerting webhooks
- Set resource limits (memory, CPU)
- Configure database backups
### Post-Deployment
- Verify health checks passing
- Test alerting pipeline
- Baseline normal metrics
- Set up dashboard
- Document runbooks
### Ongoing
- Review error logs daily
- Monitor token usage vs budget
- Track success rate trends
- Review slow agent runs
- Prune old run history
## Troubleshooting

### High Error Rate

- Check logs: `stn logs --level error`
- Review failed runs: `stn runs list --status error`
- Inspect specific failure: `stn runs inspect <id>`
- Check MCP server health: `stn status`
### Slow Execution
- Check LLM latency in traces (Jaeger)
- Review tool call durations
- Check for retry loops in agent output
- Consider increasing timeout or reducing max_steps
### Memory Issues

- Check database size
- Prune old runs: `stn runs prune --older-than 30d` (automation sketch below)
- Review agent output sizes
- Consider a cloud database for large deployments
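Pruning can be scheduled outside Station with cron; a minimal sketch using the command above:

```bash
# crontab entry: prune run history older than 30 days, Sundays at 03:00
0 3 * * 0 stn runs prune --older-than 30d
```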
## Next Steps

- **Observability** - Distributed tracing with Jaeger
- **Security** - Security configuration
- **Production** - Production deployment guide