ADR 43: Centralized Log Observability (Vector, Loki, Grafana)

Status: Proposed
Date: 2026-06-12
Last Updated: 2026-06-12

Terms (this ADR)

| ID | Term | Meaning |
| --- | --- | --- |
| Rn | Functional requirement | Numbered obligation the system must meet (R1, R2, …). |
| NRn | Non-functional requirement | Quality attribute: security, performance, operability (NR1, NR2, …). |
| Cn | Constraint | Non-negotiable boundary; violating it invalidates the decision (C1, C2, …). |
| CCn | Cross-cutting challenge | Risk or tension that spans components, with a documented mitigation (CC1, …). |
| Log agent | Vector | Lightweight observability pipeline installed on each edge droplet; collects from journald and Docker, transforms, and ships logs to Loki. |
| Log store | Loki | Open-source log aggregation system; indexes labels only; stores compressed log chunks. |
| Log UI | Grafana | Open-source visualization and alerting; queries Loki via LogQL. |
| Observability stack | infra/observability | Dedicated DigitalOcean droplet running Loki + Grafana (Garage v1); separate from customer-facing edge stacks. |
| Edge droplet | Production VM | app, admin, pds, or marketing stack droplet; runs Caddy and application processes. |

Canonical product vocabulary: Glossary.

Context

Substratum production runs on DigitalOcean droplets provisioned by Pulumi (infra/app, infra/admin, infra/pds, infra/marketing). Application services follow Factor XI (Logs) of ADR 13: Twelve-Factor App. Rust backends (substratum-gateway, ops-api, substratum-pds-authz-proxy) emit structured events to stdout via tracing, and systemd and Docker capture those streams into journald. There is no centralized log sink today; operators debug via SSH and journalctl on each host (infra/AGENTS.md treats host logs as a separate ops concern).

Garage v1 (~0–500 Substratum-login customers) needs low-cost, open-source centralized logging with a web UI for incident response and entitlement/support workflows (ADR 38). DigitalOcean Monitoring (do-agent) provides metrics and alerts only — not application log search.

Log sources (production)

| Stack | Host tag(s) | Primary log producers |
| --- | --- | --- |
| app | app, gateway-edge, substratum | substratum-gateway (systemd), Caddy |
| admin | admin, operator, substratum | ops-api (systemd), Caddy |
| pds | pds, substratum | substratum-pds-authz-proxy (systemd), Tranquil PDS (Docker), Caddy |
| marketing | marketing, landing, substratum | Caddy (static SPA) |
| data | managed Postgres | DO control panel DB insights; out of scope for this ADR |

Managed Postgres logs remain in the DO Databases UI; this ADR covers edge droplet application and edge-proxy logs only.

Requirements

Functional requirements

| ID | Requirement |
| --- | --- |
| R1 | Production Rust services (substratum-gateway, ops-api, substratum-pds-authz-proxy) SHALL emit JSON log lines in production via SUBSTRATUM_LOG_JSON=1 (or equivalent env documented in service AGENTS.md). |
| R2 | Each edge droplet SHALL run a Vector log agent that collects from journald (systemd units: substratum-gateway, substratum-pds-authz-proxy, ops-api, caddy) and, on the PDS droplet, Docker logs for Tranquil PDS. |
| R3 | Vector SHALL ship logs to a central Loki endpoint over HTTPS (or private network if co-located in nyc3). |
| R4 | Operators SHALL search and filter logs across all edge stacks in Grafana using LogQL, with labels at minimum: stack (app, admin, pds, or marketing), service, and host. |
| R5 | Garage v1 Phase 1 storage SHALL use local disk on the observability droplet with configurable retention (default 7 days). |
| R6 | Phase 2 (optional) SHALL support Loki chunk storage in a dedicated DO Spaces bucket (substratum-logs or stack-configured name) with scoped access keys, following the same S3-compatible pattern as PDS blobstore (ADR 41). |
| R7 | Observability infrastructure SHALL be provisioned as infra/observability Pulumi stack (or equivalent) following infra/AGENTS.md: ObservabilityConfig.load(), thin index.ts, secrets via pulumi config set --secret. |
| R8 | Vector installation SHALL be bootstrapped via shared cloud-init helpers in infra/shared/ and invoked from each edge stack's cloud-init.ts, not via one-off SSH steps. |
| R9 | Loki ingest credentials and Grafana admin credentials SHALL NOT appear in cloud-init userData in plaintext; they SHALL be supplied via Pulumi secrets or CI-rendered env on the observability host. |
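
As one way to picture R2, R3, R8, and R9 together, the sketch below renders a Vector config from a shared helper. It is a minimal illustration under stated assumptions, not the stack's actual helper: the function name `renderVectorConfig`, the `VectorOptions` shape, and the env var names `LOKI_USER` and `LOKI_PASSWORD` are hypothetical. It relies on Vector's `journald` and `docker_logs` sources, `remap` transform, `loki` sink, and `${VAR}` environment interpolation, so the rendered text carries no credential and expects the values to arrive through the agent's environment.

```typescript
// Hypothetical helper for infra/shared; names and shape are illustrative only.
type Stack = "app" | "admin" | "pds" | "marketing";

interface VectorOptions {
  stack: Stack;
  lokiEndpoint: string;        // Loki base URL (assumed to be supplied by stack config)
  units: string[];             // systemd units to read from journald
  dockerContainers?: string[]; // only set on the PDS droplet
}

function renderVectorConfig(opts: VectorOptions): string {
  const docker = opts.dockerContainers ?? [];
  const sinkInputs = ["journal_labeled"];

  const lines = [
    "sources:",
    "  journal:",
    "    type: journald",
    `    include_units: ${JSON.stringify(opts.units)}`,
  ];
  if (docker.length > 0) {
    lines.push(
      "  docker:",
      "    type: docker_logs",
      `    include_containers: ${JSON.stringify(docker)}`,
    );
  }

  lines.push(
    "transforms:",
    "  journal_labeled:",
    "    type: remap",
    "    inputs: [journal]",
    // VRL: derive the service label from the unit name without its ".service" suffix.
    "    source: '.service = replace(string!(._SYSTEMD_UNIT), \".service\", \"\")'",
  );
  if (docker.length > 0) {
    lines.push(
      "  docker_labeled:",
      "    type: remap",
      "    inputs: [docker]",
      // VRL: use the container name as the service label for Docker logs.
      "    source: '.service = .container_name'",
    );
    sinkInputs.push("docker_labeled");
  }

  lines.push(
    "sinks:",
    "  loki:",
    "    type: loki",
    `    inputs: ${JSON.stringify(sinkInputs)}`,
    `    endpoint: ${JSON.stringify(opts.lokiEndpoint)}`,
    "    encoding:",
    "      codec: json",
    "    labels:",
    `      stack: ${JSON.stringify(opts.stack)}`,
    '      service: "{{ service }}"',
    '      host: "{{ host }}"',
    "    auth:",
    "      strategy: basic",
    // Left as placeholders on purpose: Vector resolves them from its environment at startup.
    '      user: "${LOKI_USER}"',
    '      password: "${LOKI_PASSWORD}"',
  );
  return lines.join("\n") + "\n";
}
```

An edge stack's cloud-init.ts could call this with its own unit list, passing `dockerContainers` only on the PDS droplet; the container name used there would need to match the real Tranquil PDS container, which this ADR does not specify.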

Non-functional requirements

| ID | Requirement |
| --- | --- |
| NR1 | Garage COGS target: observability droplet default s-1vcpu-1gb (~$6/mo); 512 MB droplets MUST NOT host Loki + Grafana together. |
| NR2 | Agent footprint: Vector on edge droplets SHOULD consume ≤ 128 MB RAM under normal Garage v1 load. |
| NR3 | Retention default: 7 days on local disk; configurable via stack config. Phase 2 Spaces lifecycle SHOULD expire objects after the retention window. |
| NR4 | Security: logs MUST NOT contain session JWTs, OAuth codes, passwords, or DPoP private material; production RUST_LOG defaults to info unless break-glass debugging. |
| NR5 | Replaceability: edge droplets remain cattle; Vector reinstalls via cloud-init on reprovision; optional Spaces backend preserves log chunks if the observability droplet is replaced. |
| NR6 | Twelve-factor alignment: applications continue to log only to stdout; no application log files on production droplets (ADR 13). |
| NR7 | Region alignment: the observability droplet SHOULD share the default region (nyc3) with the edge droplets to minimize ingest latency. |
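
The retention default in NR3 and R5 can be expressed as a small config-rendering step. The following is a sketch only: the function name `lokiRetentionFragment` is hypothetical, and the fragment covers just the two Loki settings most directly tied to retention (`compactor.retention_enabled` and `limits_config.retention_period`). Depending on the Loki version in use, the compactor may need further settings, so treat the output as a starting point rather than a complete config.

```typescript
// Hypothetical helper: turns a retention window in days into a Loki YAML fragment.
function lokiRetentionFragment(retentionDays: number = 7): string {
  // Whole days of at least 1 keep the period at or above 24h.
  if (!Number.isInteger(retentionDays) || retentionDays < 1) {
    throw new RangeError("retentionDays must be a whole number of days >= 1");
  }
  const hours = retentionDays * 24;
  return [
    "compactor:",
    "  retention_enabled: true",
    "limits_config:",
    `  retention_period: ${hours}h`,
  ].join("\n");
}
```

With the default of 7 days the rendered period is `168h`; a stack that overrides retention would pass its own value.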

Constraints

| ID | Constraint |
| --- | --- |
| C1 | Open-source stack only for Garage v1: Vector, Loki, and Grafana; no proprietary log SaaS as the primary store (Grafana Cloud MAY be used for a time-boxed trial without changing application code). |
| C2 | Spaces is not a log UI: raw object storage alone does not satisfy R4; Loki (or equivalent indexer) is required for queryable logs. |
| C3 | Separate bucket: log chunks MUST NOT share a Spaces bucket with PDS blobs, blockstore, or installer artifacts. |
| C4 | No log files in apps: do not add file-based logging to Rust services to work around journald; the agent reads journald/Docker natively. |
| C5 | Pulumi state discipline: observability droplet lifecycle follows the same pulumi / Spindle patterns as other operated stacks; no console-only deletes without state repair. |

Cross-cutting challenges

| ID | Challenge | Mitigation |
| --- | --- | --- |
| CC1 | Loki query memory spikes on small VMs | Default s-1vcpu-1gb; monolithic Loki; low query concurrency; short retention; optional swap documented in ops runbook. |
| CC2 | Caddy access logs are high-volume | Sample or exclude health-check paths in Vector transforms; default retention 7 days. |
| CC3 | Observability droplet loss with local-only storage | Phase 1 accepts log loss on rebuild; Phase 2 enables Spaces-backed chunks (R6). |
| CC4 | Secret leakage in structured JSON logs | Code review + NR4; never log auth headers or token fields; audit tracing calls in ingress/auth crates during rollout. |
| CC5 | Firewall / network path from edge to Loki | Edge firewalls already allow outbound HTTPS; Loki ingest on observability droplet restricted to known edge IPs or mTLS/token auth. |
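
The CC2 mitigation (sample or exclude health-check paths) amounts to a keep-or-drop decision per access-log entry. In production this logic would live in a Vector transform; the TypeScript below only models the decision so it can be reasoned about and tested. The paths `/healthz` and `/health`, the factory name `makeHealthCheckFilter`, and the choice to always keep failing probes are assumptions for illustration, and the entry shape mirrors only the `request.uri` and `status` fields of Caddy's JSON access log.

```typescript
// Minimal model of a Caddy JSON access-log entry (only the fields used here).
interface CaddyAccessLog {
  request: { uri: string };
  status: number;
}

// Assumed health-check paths; the ADR does not name them.
const HEALTH_PATHS = new Set(["/healthz", "/health"]);

// Returns a predicate that decides whether an entry should be shipped to Loki.
// sampleEvery = 0 excludes successful health checks entirely;
// sampleEvery = N keeps every Nth successful health check.
function makeHealthCheckFilter(sampleEvery: number): (entry: CaddyAccessLog) => boolean {
  let seen = 0;
  return (entry) => {
    const path = entry.request.uri.split("?")[0];
    if (!HEALTH_PATHS.has(path)) return true; // ordinary traffic is always kept
    if (entry.status >= 400) return true;     // failing probes are kept (assumption)
    if (sampleEvery <= 0) return false;       // exclude mode
    seen += 1;
    return seen % sampleEvery === 0;          // sample mode
  };
}
```

Keeping failing probes is a design choice rather than something this ADR mandates; it preserves the entries most likely to matter during an incident while still cutting routine volume.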

Decision

Adopt a three-layer open-source log pipeline for DigitalOcean production (items 1 to 3), together with the supporting rollout and scoping decisions in items 4 to 6:

  1. Collection: Vector on each edge droplet — journald + Docker (PDS) sources; enrich with stack, service, host labels derived from droplet tags and systemd unit names.
  2. Storage & query: Loki on a dedicated s-1vcpu-1gb observability droplet — Phase 1 uses local disk with 7-day retention.
  3. Visualization: Grafana on the same observability droplet — LogQL search, dashboards, and alerts for operator workflows.
  4. Production JSON: CI deploy scripts set SUBSTRATUM_LOG_JSON=1 when rendering /etc/substratum/*.env for gateway, ops-api, and authz-proxy.
  5. Phase 2 (optional): Pulumi-provisioned substratum-logs Spaces bucket + scoped keys; Loki storage_config points chunk storage at Spaces; lifecycle rule expires objects after retention window.
  6. Out of scope (Garage v1): distributed tracing, metrics unification, and managed OpenSearch — may be revisited post-Garage.
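
Item 1 derives the stack and service labels from droplet tags and systemd unit names, and R4 expects operators to filter on those labels in LogQL. The sketch below shows one plausible derivation plus a helper that builds a LogQL stream selector. All function names are hypothetical, and the tag-matching rule (first stack name found among the droplet's tags) is an assumption based on the log sources table, not a rule this ADR states.

```typescript
const STACKS = ["app", "admin", "pds", "marketing"] as const;
type StackName = (typeof STACKS)[number];

// Assumed rule: the stack label is the first known stack name present in the droplet tags.
function stackFromTags(tags: string[]): StackName | undefined {
  return STACKS.find((s) => tags.includes(s));
}

// "substratum-gateway.service" -> "substratum-gateway"
function serviceFromUnit(unit: string): string {
  return unit.replace(/\.service$/, "");
}

// Builds a LogQL stream selector such as {stack="app", service="substratum-gateway"}.
// JSON.stringify yields a double-quoted, backslash-escaped string, which is
// adequate for the simple label values used here.
function logqlSelector(labels: Record<string, string>): string {
  const pairs = Object.entries(labels).map(([k, v]) => `${k}=${JSON.stringify(v)}`);
  return `{${pairs.join(", ")}}`;
}
```

For example, a droplet tagged `app, gateway-edge, substratum` running `substratum-gateway.service` would yield the selector `{stack="app", service="substratum-gateway"}`, which can be pasted into Grafana's Explore view and extended with LogQL filters.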

Rejected alternatives

| Alternative | Why rejected |
| --- | --- |
| SSH + journalctl only | Does not scale across four edge stacks; no cross-host search or alerting (status quo). |
| DigitalOcean Monitoring alone | Metrics only; no application log aggregation or LogQL search. |
| Vector → Spaces without Loki | Blob storage is not queryable; fails R4 and C2. |
| Elasticsearch / OpenSearch (self-hosted) | 2–4 GB+ RAM floor; poor fit for Garage v1 COGS; DO Managed OpenSearch adds cost and vendor coupling without meeting C1 self-host preference. |
| Graylog all-in-one | Heavier operational footprint (MongoDB/OpenSearch dependencies) for a small team. |
| Loki + Grafana on 512 MB droplet | Insufficient headroom for OS + idle stack; OOM during queries (NR1). |
| Grafana Cloud as permanent primary | Acceptable for trial; violates C1 long-term open-source/self-host goal and sends customer-adjacent logs off DO unless carefully reviewed. |
| Application log files + tail | Violates ADR 13 Factor XI and C4; duplicates journald. |

Consequences

Positive

  • Single pane for gateway, ops-api, PDS proxy, and Caddy logs across all edge hosts.
  • Aligns with ADR 13 — apps stay stdout-only; infra adds routing, not app complexity.
  • Low Garage COGS — ~$6/mo observability droplet + $0 incremental on existing edge sizes for Vector.
  • Open-source — no per-GB SaaS lock-in; LogQL and Grafana skills transfer widely.
  • Phased durability — local disk trial first; Spaces when droplet cattle model matters for log retention.

Negative

  • Another droplet to operate — backups, Grafana admin auth, Loki upgrades, disk pressure monitoring.
  • Not full-text search — Loki indexes labels, not every token; deep message grep is slower than Elasticsearch.
  • Query RAM spikes — small droplet requires tuning and discipline (CC1).
  • Phase 1 log loss — reprovisioning the observability droplet without Spaces loses historical logs.

Neutral

  • Managed Postgres query/slow logs remain in DO Databases UI — correlate manually by timestamp until trace IDs are added in a future ADR.
  • do-agent metrics remain complementary — Grafana MAY later scrape node metrics; not required for Garage v1 log rollout.

Verification

| Scenario | Expected |
| --- | --- |
| Gateway error on app droplet | Line appears in Grafana within 60 s with stack=app, service=substratum-gateway. |
| PDS authz proxy denial | Searchable with stack=pds, service=substratum-pds-authz-proxy. |
| Observability droplet reprovision (Phase 1) | Edge agents reconnect; historical logs before reprovision are gone; new logs flow. |
| SUBSTRATUM_LOG_JSON=1 | Log lines parse as JSON in Loki; plain-text fallback still ingested if unset during rollout. |
| Secret discipline | No JWT/password patterns in sampled production logs during smoke test. |
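
The secret-discipline scenario could be partly automated by scanning a sample of log lines for token-shaped strings. The sketch below is one possible smoke check, not an exhaustive detector: the function name `findSecretLeaks` and the three patterns (JWT-shaped strings, JSON `password` fields, `Bearer` tokens) are assumptions chosen for illustration, and a clean result does not prove the absence of other secrets such as OAuth codes or DPoP material.

```typescript
// Illustrative patterns only; tune them to the token formats actually in use.
const SECRET_PATTERNS: Array<[string, RegExp]> = [
  // Three base64url segments whose first starts with "eyJ" (an encoded '{"').
  ["jwt", /\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+/],
  // A non-empty JSON string value under a "password" key.
  ["password_field", /"password"\s*:\s*"[^"]+"/i],
  // An HTTP bearer credential.
  ["bearer_token", /\bBearer\s+[A-Za-z0-9._~+\/-]+/],
];

interface Leak {
  line: number; // 1-based index into the sampled lines
  kind: string; // which pattern matched
}

// Returns one entry per (line, pattern) match across the sampled log lines.
function findSecretLeaks(lines: string[]): Leak[] {
  const hits: Leak[] = [];
  lines.forEach((text, i) => {
    for (const [kind, re] of SECRET_PATTERNS) {
      if (re.test(text)) hits.push({ line: i + 1, kind });
    }
  });
  return hits;
}
```

A smoke test could export a sample of lines from Loki, pass them through this function, and fail if the returned list is non-empty.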