Operations runbook

Running the control plane in production. Configuration is entirely environment variables — .env.example in the repo is the canonical list, and the internal deployment guide covers the CI/CD and Railway path. This document explains the operational subsystems and how to tune them.

Identity

The CP has its own AITP identity (Ed25519), served at /.well-known/aitp-manifest and used to sign the revocation list.

  • CP_AID_SEED_HEX — 32-byte hex seed. Required in production. Without it the seed is regenerated on every boot, so the CP's AID changes on restart and any peer that pinned the old key breaks. Generate once and store it as a secret:
    node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"
  • CP_BASE_URL — public URL embedded in the CP's own manifest. Set it to the externally reachable origin.

Rotating CP_AID_SEED_HEX rotates the control-plane identity — treat it like a signing key, not a config toggle.

Authentication & exposure

  • API_KEYS — comma-separated allowlist for gated routes. Required in production: empty API_KEYS in prod makes every gated route return 503 SERVER_MISCONFIGURED (fail-safe). Empty in non-prod disables auth and logs a one-time warning.
  • ENROLLMENT_SECRET — server-side HMAC key for minting/verifying one-time enrollment tokens. Required. Callers never see it.
  • CORS_ORIGIN — allowed browser origin. Required in prod; falls back to * with a warning if unset.

See api.md for the full auth matrix.

Rate limiting

In-memory, per-process token buckets on every /api/* route except the probes (/api/health, /api/readyz, /api/metrics). Over-limit → 429 RATE_LIMITED with Retry-After and X-RateLimit-* headers.

BucketDefaultEnv varKeyed by
enroll-ip5/minRATE_LIMIT_ENROLLMENT_PER_IP_MINclient IP (brute-force guard on enrollment)
public-ip60/minRATE_LIMIT_PUBLIC_PER_IP_MINclient IP
api-key600/minRATE_LIMIT_API_KEY_PER_MINAPI key prefix
  • RATE_LIMIT_WINDOW_MS (default 60000) is the accumulation window.
  • RATE_LIMIT_ENABLED=false disables the limiter entirely (dev / load tests).
  • Set any individual limit to 0 to disable that bucket.

Buckets are per-process. Behind multiple replicas the effective limit is roughly N × limit. For a hard global limit, put a shared limiter at the edge.

Client-IP trust (important behind a proxy)

X-Forwarded-For is client-controllable, so per-IP buckets are spoofable unless you tell the CP which hop to trust:

  • CLIENT_IP_HEADER — a single trusted header your edge sets to the real client IP (e.g. cf-connecting-ip, x-vercel-forwarded-for). Wins when set.
  • TRUSTED_PROXY_HOPS — number of trusted proxies appending to XFF; the client IP is read this many entries from the right. Default 0 = XFF not trusted at all (leftmost is spoofable).

Misconfigure these and per-IP limits either bucket every request under one key or are trivially bypassed. Match them to your actual edge.

SSE capacity

GET /api/events/stream holds an in-process subscription per open stream.

  • MAX_SSE_CONNECTIONS (default 500) caps concurrent streams per process; over the cap returns 503 SSE_CAPACITY. Clients should back off and retry.
  • MAX_AUDIT_EVENTS_MEMORY (default 500) is the in-memory backlog each new subscriber replays before going live.

If you front the CP with a fan-out proxy that opens its own upstream pool, raise MAX_SSE_CONNECTIONS accordingly.

Data retention

A periodic sweep keeps storage bounded. It is multi-instance safe via a Postgres advisory lock (pg_try_advisory_xact_lock), so replicas don't duplicate work.

  • RETENTION_ENABLED (default true) — master switch.
  • RETENTION_INTERVAL_MS (default 1800000 / 30 min) — sweep cadence.
  • RETENTION_BATCH_LIMIT (default 10000) — max rows deleted per sweep, so a sweep never locks a table for minutes.

What is swept (set any TTL to 0 to keep that table indefinitely):

TableEnv varDefault
audit_eventsAUDIT_EVENTS_TTL_DAYS90
webhook_deliveries (terminal rows)WEBHOOK_DELIVERY_TTL_DAYS14
admin_audit_logADMIN_AUDIT_TTL_DAYS365
idempotency_keysIDEMPOTENCY_KEY_TTL_DAYS7
enrollment_jtis (past expiry)token TTL
agents with status='deregistered'EXPIRED_AGENT_GRACE_DAYS30

Despite its name, EXPIRED_AGENT_GRACE_DAYS GCs operator-deregistered agents, not expired ones — expired rows are left in place so they can be re-enrolled. Authoritative records (revocation_entries, issued_tcts, delegations, trust_anchors, pinned_keys) are never swept.

Observability

  • Metrics: GET /api/metrics exposes Prometheus text format (public, exempt from rate limiting).
  • Logs: structured JSON via pino. LOG_LEVELtrace|debug|info|warn|error|fatal (default info). Every request/response carries x-request-id for correlation.
  • Tracing (OpenTelemetry): off by default. Set OTEL_ENABLED=true to export spans to the OTLP HTTP endpoint at OTEL_EXPORTER_OTLP_ENDPOINT (path /v1/traces is appended unless OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is set). OTEL_SERVICE_NAME defaults to aitp-control-plane. HTTP, pg, and fetch are auto-instrumented.

Health, readiness & graceful shutdown

  • GET /api/health — liveness + DB ping. Stays 200 even while draining.
  • GET /api/readyz — readiness (DB reachable, identity initialized).

On SIGTERM the process enters a drain window: /api/readyz flips to 503 { "ready": false, "reason": "shutting_down" } so a load balancer pulls the pod out of rotation, while /api/health stays 200 so the orchestrator doesn't hard-kill it mid-drain. Point your LB/orchestrator readiness probe at /api/readyz and the liveness probe at /api/health.

Database

  • DATABASE_URL — Postgres connection string (required).
  • DB_POOL_MAX (default 20) — connection pool size.
  • Migrations run via npm run db:migrate from a checkout; the runtime image does not bundle drizzle-kit. See the internal deployment guide for the migration step against a hosted database.

Multi-tenancy

Namespaces (namespace column, X-Aitp-Namespace header, ?namespace= filter) are an opt-in scoping convention, not an enforced boundary. GET /api/registry/agents without ?namespace= returns rows across all tenants by design — registry discovery is an operational, non-normative layer in AITP, not a protocol-defined isolation boundary. If you need isolation, your callers must set the namespace on both discovery and enrollment; the CP enforces no implicit boundary.