ADR 43: Centralized Log Observability (Vector, Loki, Grafana)
Status: Proposed
Date: 2026-06-12
Last Updated: 2026-06-12
Terms (this ADR)
| ID | Term | Meaning |
|---|---|---|
| Rn | Functional requirement | Numbered obligation the system must meet (R1, R2, …). |
| NRn | Non-functional requirement | Quality attribute: security, performance, operability (NR1, NR2, …). |
| Cn | Constraint | Non-negotiable boundary; violating it invalidates the decision (C1, C2, …). |
| CCn | Cross-cutting challenge | Risk or tension that spans components, with a documented mitigation (CC1, …). |
| Log agent | Vector | Lightweight observability pipeline installed on each edge droplet; collects from journald and Docker, transforms, and ships logs to Loki. |
| Log store | Loki | Open-source log aggregation system; indexes labels only; stores compressed log chunks. |
| Log UI | Grafana | Open-source visualization and alerting; queries Loki via LogQL. |
| Observability stack | infra/observability | Dedicated DigitalOcean droplet running Loki + Grafana (Garage v1); separate from customer-facing edge stacks. |
| Edge droplet | Production VM | app, admin, pds, or marketing stack droplet — runs Caddy and application processes. |
Canonical product vocabulary: Glossary.
Context
Substratum production runs on DigitalOcean droplets provisioned by Pulumi (infra/app, infra/admin, infra/pds, infra/marketing). Application services follow Factor XI (Logs) of ADR 13: Twelve-Factor App. Rust backends (substratum-gateway, ops-api, substratum-pds-authz-proxy) emit structured events to stdout via tracing; systemd and Docker capture those streams into journald. There is no centralized log sink today: operators debug via SSH and journalctl on each host (infra/AGENTS.md treats host logs as a separate ops concern).
Garage v1 (~0–500 Substratum-login customers) needs low-cost, open-source centralized logging with a web UI for incident response and entitlement/support workflows (ADR 38). DigitalOcean Monitoring (do-agent) provides metrics and alerts only — not application log search.
Log sources (production)
| Stack | Host tag(s) | Primary log producers |
|---|---|---|
| app | app, gateway-edge, substratum | substratum-gateway (systemd), Caddy |
| admin | admin, operator, substratum | ops-api (systemd), Caddy |
| pds | pds, substratum | substratum-pds-authz-proxy (systemd), Tranquil PDS (Docker), Caddy |
| marketing | marketing, landing, substratum | Caddy (static SPA) |
| data | managed Postgres | DO control panel DB insights (out of scope for this ADR) |
Managed Postgres logs remain in the DO Databases UI; this ADR covers edge droplet application and edge-proxy logs only.
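
The table above can double as a lookup for log labels. The following is a minimal TypeScript sketch under stated assumptions, not the project's actual helper: the names `Stack`, `SYSTEMD_UNITS`, `stackFromTags`, and `serviceFromUnit` are hypothetical, as is the rule that the first tag matching a stack name decides the stack.

```typescript
// Stack names taken from the "Stack" column of the table above.
type Stack = "app" | "admin" | "pds" | "marketing";

const STACKS: Stack[] = ["app", "admin", "pds", "marketing"];

// systemd-managed producers per stack. Tranquil PDS runs in Docker,
// so it is not a systemd unit and is not listed here.
const SYSTEMD_UNITS: Record<Stack, string[]> = {
  app: ["substratum-gateway", "caddy"],
  admin: ["ops-api", "caddy"],
  pds: ["substratum-pds-authz-proxy", "caddy"],
  marketing: ["caddy"],
};

// Resolve the stack from a droplet's tags. Shared tags such as
// "substratum" and role tags such as "gateway-edge" are ignored.
function stackFromTags(tags: string[]): Stack | undefined {
  for (const stack of STACKS) {
    if (tags.indexOf(stack) !== -1) {
      return stack;
    }
  }
  return undefined;
}

// Turn a systemd unit name into a service name by dropping ".service".
function serviceFromUnit(unit: string): string {
  return unit.replace(/\.service$/, "");
}

console.log(stackFromTags(["pds", "substratum"]));          // pds
console.log(serviceFromUnit("substratum-gateway.service")); // substratum-gateway
```

Keeping the mapping in one typed constant means a new stack cannot be added without also declaring its units, because `Record<Stack, string[]>` requires an entry per stack.
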
Requirements
Functional requirements
| ID | Requirement |
|---|---|
| R1 | Production Rust services (substratum-gateway, ops-api, substratum-pds-authz-proxy) SHALL emit JSON log lines in production via SUBSTRATUM_LOG_JSON=1 (or equivalent env documented in service AGENTS.md). |
| R2 | Each edge droplet SHALL run a Vector log agent that collects from journald (systemd units: substratum-gateway, substratum-pds-authz-proxy, ops-api, caddy) and, on the PDS droplet, Docker logs for Tranquil PDS. |
| R3 | Vector SHALL ship logs to a central Loki endpoint over HTTPS (or private network if co-located in nyc3). |
| R4 | Operators SHALL search and filter logs across all edge stacks in Grafana using LogQL, with labels at minimum: stack (app, admin, pds, or marketing), service, and host. |
| R5 | Garage v1 Phase 1 storage SHALL use local disk on the observability droplet with configurable retention (default 7 days). |
| R6 | Phase 2 (optional) SHALL support Loki chunk storage in a dedicated DO Spaces bucket (substratum-logs or stack-configured name) with scoped access keys — same S3-compatible pattern as PDS blobstore (ADR 41). |
| R7 | Observability infrastructure SHALL be provisioned as infra/observability Pulumi stack (or equivalent) following infra/AGENTS.md: ObservabilityConfig.load(), thin index.ts, secrets via pulumi config set --secret. |
| R8 | Vector installation SHALL be bootstrapped via shared cloud-init helpers in infra/shared/ and invoked from each edge stack's cloud-init.ts — not one-off SSH steps. |
| R9 | Loki ingest credentials and Grafana admin credentials SHALL NOT appear in cloud-init userData in plaintext; they SHALL be supplied via Pulumi secrets or CI-rendered env on the observability host. |
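
R2, R3, R4, R8, and R9 together suggest a shared helper that renders the Vector agent configuration for each edge stack. The sketch below is one possible shape, not the project's implementation. Assumptions: the function name `renderVectorConfig`, its options interface, the example endpoint, and the environment variable names `LOKI_USER` and `LOKI_PASSWORD` are all hypothetical. The Docker source for the PDS droplet is omitted for brevity. The emitted YAML uses Vector's `journald` source and `loki` sink, and relies on Vector's environment variable interpolation so that no credential is written into the rendered file.

```typescript
interface VectorAgentOptions {
  stack: string;        // value for the stack label, e.g. "app"
  units: string[];      // systemd units to read from journald
  lokiEndpoint: string; // base URL of the Loki ingest endpoint
}

// Render a Vector YAML config as a string. Credentials are referenced as
// ${LOKI_USER} / ${LOKI_PASSWORD} placeholders that Vector resolves from
// its own environment at startup; these are plain string literals here,
// not TypeScript template expressions.
function renderVectorConfig(opts: VectorAgentOptions): string {
  const lines: string[] = [
    "sources:",
    "  journal:",
    "    type: journald",
    "    include_units:",
  ];
  for (const unit of opts.units) {
    lines.push("      - " + unit);
  }
  lines.push(
    "sinks:",
    "  loki:",
    "    type: loki",
    "    inputs: [journal]",
    "    endpoint: " + opts.lokiEndpoint,
    "    encoding:",
    "      codec: json",
    "    auth:",
    "      strategy: basic",
    "      user: ${LOKI_USER}",
    "      password: ${LOKI_PASSWORD}",
    "    labels:",
    "      stack: " + opts.stack,
    // _SYSTEMD_UNIT yields e.g. "caddy.service"; stripping the suffix
    // would need an extra transform, which this sketch leaves out.
    '      service: "{{ _SYSTEMD_UNIT }}"',
    '      host: "{{ host }}"',
  );
  return lines.join("\n") + "\n";
}

console.log(
  renderVectorConfig({
    stack: "app",
    units: ["substratum-gateway", "caddy"],
    lokiEndpoint: "https://loki.example.invalid", // placeholder URL
  }),
);
```

A stack's cloud-init.ts could write the returned string to the agent's config path, while the credential values themselves arrive through a separately rendered environment file.
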
Non-functional requirements
| ID | Requirement |
|---|---|
| NR1 | Garage COGS target: observability droplet default s-1vcpu-1gb (~$6/mo); 512 MB droplets MUST NOT host Loki + Grafana together. |
| NR2 | Agent footprint: Vector on edge droplets SHOULD consume ≤ 128 MB RAM under normal Garage v1 load. |
| NR3 | Retention default: 7 days on local disk; configurable via stack config. Phase 2 Spaces lifecycle SHOULD expire objects after retention window. |
| NR4 | Security: logs MUST NOT contain session JWTs, OAuth codes, passwords, or DPoP private material; production RUST_LOG default info unless break-glass debugging. |
| NR5 | Replaceability: edge droplets remain cattle — Vector reinstalls via cloud-init on reprovision; optional Spaces backend preserves log chunks if the observability droplet is replaced. |
| NR6 | Twelve-factor alignment: applications continue to log only to stdout; no application log files on production droplets (ADR 13). |
| NR7 | Region alignment: observability droplet SHOULD use default region nyc3 with edge droplets to minimize ingest latency. |
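
NR1, NR3, and NR7 are the kind of rules a stack config loader can enforce before any droplet is created. The following sketch shows one way to do that. It is an illustration only: `ObservabilitySettings`, `memoryMbFromSlug`, and `checkSettings` are hypothetical names, the slug `s-1vcpu-512mb-10gb` is used as an example of a 512 MB size, and the slug parsing assumes the RAM figure directly follows the vCPU count.

```typescript
interface ObservabilitySettings {
  sizeSlug: string;      // DigitalOcean droplet size slug
  retentionDays: number; // local-disk retention window
  region: string;
}

// Defaults stated in NR1, NR3, and NR7.
const DEFAULTS: ObservabilitySettings = {
  sizeSlug: "s-1vcpu-1gb",
  retentionDays: 7,
  region: "nyc3",
};

// Extract RAM in MB from slugs such as "s-1vcpu-1gb" or "s-1vcpu-512mb-10gb".
function memoryMbFromSlug(slug: string): number | undefined {
  const match = /vcpu-(\d+)(mb|gb)/.exec(slug);
  if (match === null) {
    return undefined;
  }
  const amount = parseInt(match[1], 10);
  return match[2] === "gb" ? amount * 1024 : amount;
}

// Apply defaults, reject MUST-level violations, collect SHOULD-level warnings.
function checkSettings(
  overrides: Partial<ObservabilitySettings>,
): { settings: ObservabilitySettings; warnings: string[] } {
  const settings: ObservabilitySettings = { ...DEFAULTS, ...overrides };
  const warnings: string[] = [];

  const memoryMb = memoryMbFromSlug(settings.sizeSlug);
  if (memoryMb === undefined) {
    throw new Error("unrecognized size slug: " + settings.sizeSlug);
  }
  if (memoryMb <= 512) {
    throw new Error("NR1: " + settings.sizeSlug + " must not host Loki + Grafana");
  }
  if (settings.retentionDays < 1 || Math.floor(settings.retentionDays) !== settings.retentionDays) {
    throw new Error("NR3: retentionDays must be a positive whole number");
  }
  if (settings.region !== DEFAULTS.region) {
    warnings.push("NR7: region " + settings.region + " differs from " + DEFAULTS.region);
  }
  return { settings, warnings };
}

console.log(checkSettings({}).settings.sizeSlug);                // s-1vcpu-1gb
console.log(checkSettings({ region: "sfo3" }).warnings.length);  // 1
```

Throwing on the MUST-level rule and only warning on the SHOULD-level rule mirrors the requirement keywords used in the table.
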
Constraints
| ID | Constraint |
|---|---|
| C1 | Open-source stack only for Garage v1: Vector, Loki, and Grafana — no proprietary log SaaS as the primary store (Grafana Cloud MAY be used for a time-boxed trial without changing application code). |
| C2 | Spaces is not a log UI — raw object storage alone does not satisfy R4; Loki (or equivalent indexer) is required for queryable logs. |
| C3 | Separate bucket — log chunks MUST NOT share a Spaces bucket with PDS blobs, blockstore, or installer artifacts. |
| C4 | No log files in apps — do not add file-based logging to Rust services to work around journald; the agent reads journald/Docker natively. |
| C5 | Pulumi state discipline — observability droplet lifecycle follows the same pulumi / Spindle patterns as other operated stacks; no console-only deletes without state repair. |
Cross-cutting challenges
| ID | Challenge | Mitigation |
|---|---|---|
| CC1 | Loki query memory spikes on small VMs | Default s-1vcpu-1gb; monolithic Loki; low query concurrency; short retention; optional swap documented in ops runbook. |
| CC2 | Caddy access logs are high-volume | Sample or exclude health-check paths in Vector transforms; default retention 7 days. |
| CC3 | Observability droplet loss with local-only storage | Phase 1 accepts log loss on rebuild; Phase 2 enables Spaces-backed chunks (R6). |
| CC4 | Secret leakage in structured JSON logs | Code review + NR4; never log auth headers or token fields; audit tracing calls in ingress/auth crates during rollout. |
| CC5 | Firewall / network path from edge to Loki | Edge firewalls already allow outbound HTTPS; Loki ingest on observability droplet restricted to known edge IPs or mTLS/token auth. |
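
The CC2 mitigation (exclude health-check paths, sample the rest) can be modeled as a small decision rule. In production this logic would live in Vector's own transforms (Vector provides `filter` and `sample` transforms); the TypeScript below only models the rule so it can be reasoned about and tested. Assumptions: the paths `/healthz` and `/health`, the choice to always keep responses with status 400 or above, and the names `AccessEntry` and `makeAccessLogFilter` are all hypothetical.

```typescript
// Minimal view of a Caddy access log entry: request URI and response status.
interface AccessEntry {
  uri: string;
  status: number;
}

// Hypothetical health-check paths; the real list depends on the services.
const HEALTH_PATHS: string[] = ["/healthz", "/health"];

// Build a stateful predicate that returns true when an entry should be shipped.
// Health checks are dropped, errors are always kept, and successful requests
// are sampled deterministically at one in `keepOneIn`.
function makeAccessLogFilter(keepOneIn: number): (entry: AccessEntry) => boolean {
  if (keepOneIn < 1 || Math.floor(keepOneIn) !== keepOneIn) {
    throw new Error("keepOneIn must be a positive whole number");
  }
  let seen = 0;
  return (entry: AccessEntry): boolean => {
    const path = entry.uri.split("?")[0];
    if (HEALTH_PATHS.indexOf(path) !== -1) {
      return false;
    }
    if (entry.status >= 400) {
      return true;
    }
    seen += 1;
    return (seen - 1) % keepOneIn === 0;
  };
}

const keep = makeAccessLogFilter(3);
console.log(keep({ uri: "/healthz", status: 200 })); // false
console.log(keep({ uri: "/api", status: 502 }));     // true
console.log(keep({ uri: "/api", status: 200 }));     // true
console.log(keep({ uri: "/api", status: 200 }));     // false
```

Keeping every error while sampling only successful requests preserves the lines most useful during incident response, at the cost of exact request counts.
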
Decision
Adopt a three-layer open-source log pipeline for DigitalOcean production:
- Collection: Vector on each edge droplet, with journald and Docker (PDS) sources; enrich with stack, service, and host labels derived from droplet tags and systemd unit names.
- Storage & query: Loki on a dedicated s-1vcpu-1gb observability droplet; Phase 1 uses local disk with 7-day retention.
- Visualization: Grafana on the same observability droplet, providing LogQL search, dashboards, and alerts for operator workflows.
- Production JSON: CI deploy scripts set SUBSTRATUM_LOG_JSON=1 when rendering /etc/substratum/*.env for gateway, ops-api, and authz-proxy.
- Phase 2 (optional): Pulumi-provisioned substratum-logs Spaces bucket + scoped keys; Loki storage_config points chunk storage at Spaces; lifecycle rule expires objects after retention window.
- Out of scope (Garage v1): distributed tracing, metrics unification, and managed OpenSearch; may be revisited post-Garage.
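
The difference between Phase 1 and Phase 2 comes down to which storage block Loki is given. The sketch below builds the relevant fragment of a Loki configuration as a plain object and converts the retention window from days to Loki's hour-based duration (7 days becomes 168h). It is a sketch under stated assumptions, not a complete or authoritative config: the function names, the chunk directory `/var/lib/loki/chunks`, and the environment variable names are hypothetical; the credential placeholders assume Loki is started with environment expansion enabled; `schema_config` is omitted; and, depending on the Loki version, the compactor may need further settings.

```typescript
interface SpacesBackend {
  bucket: string;   // e.g. "substratum-logs"
  endpoint: string; // Spaces endpoint host for the region
  region: string;
}

interface LokiFragment {
  limits_config: { retention_period: string };
  compactor: { retention_enabled: boolean };
  storage_config: {
    filesystem?: { directory: string };
    aws?: {
      endpoint: string;
      region: string;
      bucketnames: string;
      access_key_id: string;
      secret_access_key: string;
    };
  };
}

// Loki durations are expressed here in hours: 7 days -> "168h".
function retentionPeriod(days: number): string {
  return String(days * 24) + "h";
}

// Phase 1 when `spaces` is omitted, Phase 2 otherwise. `reservedBuckets`
// lists buckets already used for other data (constraint C3).
function lokiFragment(
  retentionDays: number,
  spaces?: SpacesBackend,
  reservedBuckets: string[] = [],
): LokiFragment {
  const base = {
    limits_config: { retention_period: retentionPeriod(retentionDays) },
    compactor: { retention_enabled: true },
  };
  if (spaces === undefined) {
    return { ...base, storage_config: { filesystem: { directory: "/var/lib/loki/chunks" } } };
  }
  if (reservedBuckets.indexOf(spaces.bucket) !== -1) {
    throw new Error("C3: bucket " + spaces.bucket + " is already used for other data");
  }
  return {
    ...base,
    storage_config: {
      aws: {
        endpoint: spaces.endpoint,
        region: spaces.region,
        bucketnames: spaces.bucket,
        // Literal placeholders, resolved by Loki from its environment.
        access_key_id: "${LOKI_S3_ACCESS_KEY}",
        secret_access_key: "${LOKI_S3_SECRET_KEY}",
      },
    },
  };
}

console.log(retentionPeriod(7)); // 168h
console.log(JSON.stringify(lokiFragment(7), null, 2));
```

Because both phases share the same retention fragment, moving from local disk to Spaces changes only the storage block, which keeps the Phase 2 migration small.
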
Rejected alternatives
| Alternative | Why rejected |
|---|---|
| SSH + journalctl only | Does not scale across four edge stacks; no cross-host search or alerting (status quo). |
| DigitalOcean Monitoring alone | Metrics only — no application log aggregation or LogQL search. |
| Vector → Spaces without Loki | Blob storage is not queryable; fails R4 and C2. |
| Elasticsearch / OpenSearch (self-hosted) | 2–4 GB+ RAM floor; poor fit for Garage v1 COGS; DO Managed OpenSearch adds cost and vendor coupling without meeting C1 self-host preference. |
| Graylog all-in-one | Heavier operational footprint (MongoDB/OpenSearch dependencies) for a small team. |
| Loki + Grafana on 512 MB droplet | Insufficient headroom for OS + idle stack; OOM during queries (NR1). |
| Grafana Cloud as permanent primary | Acceptable for trial; violates C1 long-term open-source/self-host goal and sends customer-adjacent logs off DO unless carefully reviewed. |
| Application log files + tail | Violates ADR 13 Factor XI and C4; duplicates journald. |
Consequences
Positive
- Single pane for gateway, ops-api, PDS proxy, and Caddy logs across all edge hosts.
- Aligns with ADR 13 — apps stay stdout-only; infra adds routing, not app complexity.
- Low Garage COGS — ~$6/mo observability droplet + $0 incremental on existing edge sizes for Vector.
- Open-source — no per-GB SaaS lock-in; LogQL and Grafana skills transfer widely.
- Phased durability — local disk trial first; Spaces when droplet cattle model matters for log retention.
Negative
- Another droplet to operate — backups, Grafana admin auth, Loki upgrades, disk pressure monitoring.
- Not full-text search — Loki indexes labels, not every token; deep message grep is slower than Elasticsearch.
- Query RAM spikes — small droplet requires tuning and discipline (CC1).
- Phase 1 log loss — reprovisioning the observability droplet without Spaces loses historical logs.
Neutral
- Managed Postgres query/slow logs remain in DO Databases UI — correlate manually by timestamp until trace IDs are added in a future ADR.
- do-agent metrics remain complementary: Grafana MAY later scrape node metrics; not required for the Garage v1 log rollout.
Verification
| Scenario | Expected |
|---|---|
| Gateway error on app droplet | Line appears in Grafana within 60 s with stack=app, service=substratum-gateway. |
| PDS authz proxy denial | Searchable with stack=pds, service=substratum-pds-authz-proxy. |
| Observability droplet reprovision (Phase 1) | Edge agents reconnect; historical logs before reprovision are gone; new logs flow. |
| SUBSTRATUM_LOG_JSON=1 | Log lines parse as JSON in Loki; plain-text fallback still ingested if unset during rollout. |
| Secret discipline | No JWT/password patterns in sampled production logs during smoke test. |
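
The secret-discipline scenario can be partly automated by scanning a sample of exported log lines for token-like patterns. The following is a rough sketch, not a complete detector: the pattern list, the names `SECRET_PATTERNS` and `findSecrets`, and the decision to also flag bearer authorization headers are assumptions, and a clean result does not prove that no secrets are logged.

```typescript
interface SecretPattern {
  kind: string;
  pattern: RegExp;
}

interface SecretHit {
  line: number; // 1-based index into the sampled lines
  kind: string;
}

// Heuristic patterns only. None use the "g" flag, so test() is stateless.
const SECRET_PATTERNS: SecretPattern[] = [
  // Three dot-separated base64url segments; JWT headers begin with "eyJ".
  { kind: "jwt", pattern: /eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/ },
  // An authorization header or field carrying a bearer credential.
  { kind: "bearer", pattern: /authorization["':=\s]+bearer\s+\S+/i },
  // A password key followed by ":" or "=" and a value.
  { kind: "password", pattern: /"?password"?\s*[:=]\s*"?[^\s",}]+/i },
];

// Return one hit per (line, pattern) match across the sampled lines.
function findSecrets(lines: string[]): SecretHit[] {
  const hits: SecretHit[] = [];
  lines.forEach((text: string, index: number) => {
    for (const candidate of SECRET_PATTERNS) {
      if (candidate.pattern.test(text)) {
        hits.push({ line: index + 1, kind: candidate.kind });
      }
    }
  });
  return hits;
}

const sample: string[] = [
  '{"level":"INFO","message":"session created"}',
  '{"token":"eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxIn0.c2lnbmF0dXJl"}',
  "password reset requested",
];
console.log(findSecrets(sample).length); // 1
```

A smoke test could fail the rollout when `findSecrets` returns any hit, then route the flagged line numbers to a reviewer rather than printing the matched text.
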