Operations runbook
Running the control plane in production. Configuration is entirely environment
variables — .env.example in the repo is the canonical list, and the internal
deployment guide
covers the CI/CD and Railway path. This document explains the operational
subsystems and how to tune them.
Identity
The CP has its own AITP identity (Ed25519), served at
/.well-known/aitp-manifest and used to sign the revocation list.
CP_AID_SEED_HEX— 32-byte hex seed. Required in production. Without it the seed is regenerated on every boot, so the CP's AID changes on restart and any peer that pinned the old key breaks. Generate once and store it as a secret:node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"CP_BASE_URL— public URL embedded in the CP's own manifest. Set it to the externally reachable origin.
Rotating CP_AID_SEED_HEX rotates the control-plane identity — treat it like a
signing key, not a config toggle.
Authentication & exposure
API_KEYS— comma-separated allowlist for gated routes. Required in production: emptyAPI_KEYSin prod makes every gated route return503 SERVER_MISCONFIGURED(fail-safe). Empty in non-prod disables auth and logs a one-time warning.ENROLLMENT_SECRET— server-side HMAC key for minting/verifying one-time enrollment tokens. Required. Callers never see it.CORS_ORIGIN— allowed browser origin. Required in prod; falls back to*with a warning if unset.
See api.md for the full auth matrix.
Rate limiting
In-memory, per-process token buckets on every /api/* route except the probes
(/api/health, /api/readyz, /api/metrics). Over-limit → 429 RATE_LIMITED
with Retry-After and X-RateLimit-* headers.
| Bucket | Default | Env var | Keyed by |
|---|---|---|---|
enroll-ip | 5/min | RATE_LIMIT_ENROLLMENT_PER_IP_MIN | client IP (brute-force guard on enrollment) |
public-ip | 60/min | RATE_LIMIT_PUBLIC_PER_IP_MIN | client IP |
api-key | 600/min | RATE_LIMIT_API_KEY_PER_MIN | API key prefix |
RATE_LIMIT_WINDOW_MS(default 60000) is the accumulation window.RATE_LIMIT_ENABLED=falsedisables the limiter entirely (dev / load tests).- Set any individual limit to
0to disable that bucket.
Buckets are per-process. Behind multiple replicas the effective limit is roughly
N × limit. For a hard global limit, put a shared limiter at the edge.
Client-IP trust (important behind a proxy)
X-Forwarded-For is client-controllable, so per-IP buckets are spoofable unless
you tell the CP which hop to trust:
CLIENT_IP_HEADER— a single trusted header your edge sets to the real client IP (e.g.cf-connecting-ip,x-vercel-forwarded-for). Wins when set.TRUSTED_PROXY_HOPS— number of trusted proxies appending to XFF; the client IP is read this many entries from the right. Default0= XFF not trusted at all (leftmost is spoofable).
Misconfigure these and per-IP limits either bucket every request under one key or are trivially bypassed. Match them to your actual edge.
SSE capacity
GET /api/events/stream holds an in-process subscription per open stream.
MAX_SSE_CONNECTIONS(default 500) caps concurrent streams per process; over the cap returns503 SSE_CAPACITY. Clients should back off and retry.MAX_AUDIT_EVENTS_MEMORY(default 500) is the in-memory backlog each new subscriber replays before going live.
If you front the CP with a fan-out proxy that opens its own upstream pool, raise
MAX_SSE_CONNECTIONS accordingly.
Data retention
A periodic sweep keeps storage bounded. It is multi-instance safe via a Postgres
advisory lock (pg_try_advisory_xact_lock), so replicas don't duplicate work.
RETENTION_ENABLED(default true) — master switch.RETENTION_INTERVAL_MS(default 1800000 / 30 min) — sweep cadence.RETENTION_BATCH_LIMIT(default 10000) — max rows deleted per sweep, so a sweep never locks a table for minutes.
What is swept (set any TTL to 0 to keep that table indefinitely):
| Table | Env var | Default |
|---|---|---|
audit_events | AUDIT_EVENTS_TTL_DAYS | 90 |
webhook_deliveries (terminal rows) | WEBHOOK_DELIVERY_TTL_DAYS | 14 |
admin_audit_log | ADMIN_AUDIT_TTL_DAYS | 365 |
idempotency_keys | IDEMPOTENCY_KEY_TTL_DAYS | 7 |
enrollment_jtis (past expiry) | — | token TTL |
agents with status='deregistered' | EXPIRED_AGENT_GRACE_DAYS | 30 |
Despite its name,
EXPIRED_AGENT_GRACE_DAYSGCs operator-deregistered agents, notexpiredones —expiredrows are left in place so they can be re-enrolled. Authoritative records (revocation_entries,issued_tcts,delegations,trust_anchors,pinned_keys) are never swept.
Observability
- Metrics:
GET /api/metricsexposes Prometheus text format (public, exempt from rate limiting). - Logs: structured JSON via pino.
LOG_LEVEL∈trace|debug|info|warn|error|fatal(defaultinfo). Every request/response carriesx-request-idfor correlation. - Tracing (OpenTelemetry): off by default. Set
OTEL_ENABLED=trueto export spans to the OTLP HTTP endpoint atOTEL_EXPORTER_OTLP_ENDPOINT(path/v1/tracesis appended unlessOTEL_EXPORTER_OTLP_TRACES_ENDPOINTis set).OTEL_SERVICE_NAMEdefaults toaitp-control-plane. HTTP,pg, andfetchare auto-instrumented.
Health, readiness & graceful shutdown
GET /api/health— liveness + DB ping. Stays200even while draining.GET /api/readyz— readiness (DB reachable, identity initialized).
On SIGTERM the process enters a drain window: /api/readyz flips to
503 { "ready": false, "reason": "shutting_down" } so a load balancer pulls the
pod out of rotation, while /api/health stays 200 so the orchestrator doesn't
hard-kill it mid-drain. Point your LB/orchestrator readiness probe at
/api/readyz and the liveness probe at /api/health.
Database
DATABASE_URL— Postgres connection string (required).DB_POOL_MAX(default 20) — connection pool size.- Migrations run via
npm run db:migratefrom a checkout; the runtime image does not bundledrizzle-kit. See the internal deployment guide for the migration step against a hosted database.
Multi-tenancy
Namespaces (namespace column, X-Aitp-Namespace header, ?namespace= filter)
are an opt-in scoping convention, not an enforced boundary. GET /api/registry/agents without ?namespace= returns rows across all tenants by
design — registry discovery is an operational, non-normative layer in
AITP, not a protocol-defined isolation boundary. If you need isolation, your
callers must set the namespace on both discovery and enrollment; the CP enforces
no implicit boundary.