Architecture Decision Record (ADR) 35: Drive Node Delete (Four-Layer Removal)

Status: Proposed
Date: 2026-06-04
Last Updated: 2026-06-05

Terms (this ADR)

ID	Term	Meaning
Rn	Functional requirement	Numbered obligation the system must meet (R1, R2, …).
NRn	Non-functional requirement	Quality attribute: security, performance, operability (NR1, NR2, …).
Cn	Constraint	Non-negotiable boundary; violating it invalidates the decision (C1, C2, …).
CCn	Cross-cutting challenge	Risk or tension that spans components, with a documented mitigation (CC1, …).
Catalog	PostgreSQL plane	`drive`, `drive_entry`, ACL junction tables, `passport` index — fast listing and RLS (ADR 09).
Blockstore	Content plane	FlatFS or S3-compatible object storage for file bytes (catalog vs blockstore).
Receipt sync	Async pipeline	`receipt_sync_outbox` + `ReceiptSyncWorker` — owner-repo `cloud.substratum.passport.receipt` (ADR 28).
Catalog sync	Async pipeline (planned)	`catalog_sync_outbox` + `CatalogSyncWorker` — owner-repo `cloud.substratum.filesystem.*` (ADR 30).
Subtree delete	HTTP semantics	`DELETE /api/v1/drives/{drive_id}/nodes?path=` removes the node at `path` and all descendants.
Global Triangle	Cloud mesh nodes	Configured gateway replicas (`PINNING_TARGETS`, bootstrap peers) that received blocks via `/substratum/replication/1.0.0` (ADR 17).
Mesh unpin	Replication teardown	Owner-initiated removal of a CID from local and triangle blockstores after catalog refcount is zero.
Mesh unpin outbox	Durable retry queue	Postgres `mesh_unpin_outbox` + `MeshUnpinWorker`; terminal `failed` rows are the DLQ, mirroring `receipt_sync_outbox` (ADR 28).
Sync failures dashboard	Account DLQ UI	`/account/sync-failures` in file-explorer (not a separate SPA per ADR 33); lists/replays failed `receipt_sync` / `mesh_unpin` outbox rows under Postgres RLS.

Canonical product vocabulary: Glossary. Use Drive for the user-scoped root (ADR 16).

Context

The file-explorer exposes Delete… for owned files and folders (ADR 15). Ingress already implements `delete_drive_node`, and the UI calls `DELETE /api/v1/drives/{drive_id}/nodes`, but the delete panel still states that the action is unavailable in alpha.

Shipped behavior is incomplete. `delete_subtree` removes matching `drive_entry` rows only. It does not:

Remove `passport` / `passport_access_control` catalog rows.
Enqueue receipt sync to tombstone owner-repo receipts (mesh still consults PDS per ADR 27).
Enqueue catalog sync for `filesystem.driveEntry` (ADR 30; pipeline not shipped).
Decrement SaaS `entitlement.used_bytes` (ADR 32), even though upload charges the quota via `commit_quota`.
Reclaim blockstore bytes when the last catalog reference to an `asset_cid` is gone.
Unpin replicated blocks on all configured Global Triangle nodes (not only the originating gateway).

Users reasonably expect delete to mean three things: gone from the library, access revoked, and storage freed on every node that held a copy. Catalog-only removal satisfies only the first, and even that poorly while passport rows, quota, block bytes, or triangle replicas remain.

Grantee shared entries must not use owner delete; grantees use self-removal via ACL patch (ADR 28 §3). Drive roots are not deletable via this endpoint.

Requirements

Functional requirements (R1–R18)

ID	Requirement
R1	Owner (or sole writer per RLS) may delete a file or folder at a non-empty path via `DELETE /api/v1/drives/{drive_id}/nodes?path=`.
R2	Subtree delete removes the target path and every descendant path under `{path}/`.
R3	Delete is forbidden for the drive root, for entries the caller does not own, and for grantee-shared browse (product: context menu disabled; API: 403/404).
R4	In one database transaction, ingress removes catalog rows (`drive_entry`, related `passport` / ACL index rows) for the subtree.
R5	For each deleted file with a passport receipt, ingress enqueues `file.deleted` on `receipt_sync_outbox`; the worker calls `com.atproto.repo.deleteRecord` on `cloud.substratum.passport.receipt` at the deterministic rkey (`receipt_record_rkey(owner_did, asset_cid)`).
R6	When ADR 30 catalog-sync is enabled, ingress enqueues `entry.deleted` per removed `drive_entry` row; `CatalogSyncWorker` issues `deleteRecord` on `cloud.substratum.filesystem.driveEntry`.
R7	HTTP response returns refreshed parent listing (`NodeMutationResponse`) and optional `receipt_sync: pending` when receipt jobs were enqueued (same honesty model as ACL patch).
R8	On SaaS `deployment_mode`, delete decrements `entitlement.used_bytes` by the sum of deleted file `size` values (directories contribute 0).
R9	After catalog commit, for each distinct `asset_cid` in the deleted subtree, if no remaining `drive_entry` references that CID (any drive for that owner under RLS), gateway removes the object(s) from the local blockstore (including chunk manifest and chunk CIDs for files above the swarm block cap).
R10	File-explorer delete panel uses real confirmation copy (permanent, folder includes descendants, quota impact on SaaS) and refreshes navigation after success (ADR 20).
R11	Idempotency: repeat delete of a missing path returns 404; receipt outbox uses idempotency key `(owner_did, entry_id, file.deleted)`.
R12	Bruno/OpenAPI document `DELETE` nodes; integration and E2E tests cover catalog removal, quota, and blockstore absence for a deleted file.
R13	When refcount for an `asset_cid` reaches zero, the originating gateway removes the block locally, then enqueues one `mesh_unpin_outbox` row per `(asset_cid, target_peer_id)` for every `PINNING_TARGETS` peer (and per chunk CID).
R14	`MeshUnpinWorker` claims pending rows, issues `UnpinRequest` over libp2p, applies exponential backoff retries, and after `max_attempts` moves rows to `failed` (DLQ) with `last_error`. HTTP delete does not fail when DLQ rows exist (NR3).
R15	Authenticated users list only outbox jobs they may access under RLS: `owner_did = session DID` OR `requested_by_did = session DID` (same rule as `receipt_sync_outbox_caller`). Applies to `receipt_sync_outbox` and `mesh_unpin_outbox`.
R16	Ingress exposes sync-failures APIs under the account/me surface (e.g. `GET /api/v1/me/sync-failures`, `POST …/{id}/retry`) implemented with `with_rls_context` on the gateway pool — never worker session or admin URL for list/replay.
R17	File-explorer provides a Sync failures page as an Account sub-route (`/account/sync-failures`), reachable via in-account sub-nav (not a top-level shell item): summary counts, failed-job table, detail, Retry and optional Discard / Retry all for peer.
R18	HTTP replay sets `status = pending`, resets `attempts` / `next_retry_at`; returns 404 when row is invisible under RLS (no cross-tenant ID oracle).
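
The fan-out in R13 can be sketched as a pure function that turns the CIDs of a zero-refcount asset into outbox rows. This is a minimal illustration, not the shipped schema: `UnpinJob` and `fan_out_unpin_jobs` are hypothetical names, and the sketch assumes the caller has already expanded chunked files into manifest and chunk CIDs.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory shape of one pending `mesh_unpin_outbox` row.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct UnpinJob {
    pub asset_cid: String,
    pub target_peer_id: String,
}

/// Builds one job per (cid, peer) pair.
/// `cids` is assumed to hold the asset CID plus any chunk manifest and
/// chunk CIDs; `pinning_targets` stands in for the `PINNING_TARGETS` peers.
/// Duplicate CIDs or peers collapse into a single job.
pub fn fan_out_unpin_jobs(cids: &[&str], pinning_targets: &[&str]) -> Vec<UnpinJob> {
    let mut jobs = BTreeSet::new();
    for cid in cids {
        for peer in pinning_targets {
            jobs.insert(UnpinJob {
                asset_cid: (*cid).to_string(),
                target_peer_id: (*peer).to_string(),
            });
        }
    }
    jobs.into_iter().collect()
}
```

With two distinct CIDs and two peers the function yields four rows; with no configured peers it yields none, so a gateway without triangle targets enqueues nothing.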

Non-functional requirements (NR1–NR7)

ID	Requirement
NR1	Delete HTTP handler must not block on owner PDS OAuth (enqueue-only, ADR 28).
NR2	Subtree selection uses indexed path-prefix queries, not full-drive scans.
NR3	Blockstore delete failure after successful catalog commit must not fail the HTTP response; log and metric for operator follow-up.
NR4	Per-owner receipt jobs remain ordered within `receipt_sync_outbox` (ADR 28).
NR5	Mesh unpin must not depend on a live `passport.receipt` on PDS (receipt may already be tombstoned); use owner-scoped unpin JWT (swarm security gaps).
NR6	Expose metrics: `mesh_unpin_outbox_pending`, `mesh_unpin_outbox_failed` (DLQ depth), claim lag, and per-peer failure rate (CC4).
NR7	Sync-failures UI and replay APIs use the same gateway DB role as ingress (`substratum_gateway`); superuser/admin pools must not bypass RLS.
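
A minimal sketch of the subtree rule behind R2 and NR2 follows. It assumes catalog paths are stored without a trailing slash and that the prefix query is expressed as a SQL `LIKE ... ESCAPE '\'` pattern; both helper names are hypothetical. The point it illustrates is that matching must be done on `{path}/`, so that deleting `docs/reports` never touches `docs/reports-old`.

```rust
/// Returns true when `candidate` is the delete target itself or lies
/// under `{target}/`. An empty target (the drive root) never matches,
/// because the root is not deletable through this endpoint (R3).
pub fn in_subtree(target: &str, candidate: &str) -> bool {
    let target = target.trim_end_matches('/');
    if target.is_empty() {
        return false;
    }
    match candidate.strip_prefix(target) {
        // Either the exact node, or a descendant separated by '/'.
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Builds a descendant pattern for `LIKE ... ESCAPE '\'`, escaping the
/// LIKE wildcards `%` and `_` (and the escape character itself) so that
/// user-chosen names cannot widen the match.
pub fn like_prefix_pattern(target: &str) -> String {
    let trimmed = target.trim_end_matches('/');
    let mut out = String::with_capacity(trimmed.len() + 2);
    for ch in trimmed.chars() {
        if matches!(ch, '%' | '_' | '\\') {
            out.push('\\');
        }
        out.push(ch);
    }
    out.push_str("/%");
    out
}
```

The pattern covers descendants only; the target row itself would be selected with an equality predicate on the same column, which keeps both lookups index-friendly.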

Constraints (C1–C7)

ID	Constraint
C1	Mesh authorization follows owner-repo receipts only; catalog delete alone must not be treated as revoking mesh access (ADR 27).
C2	Passport tombstones use `receipt_sync_outbox` / `ReceiptSyncWorker` only — not catalog-sync (ADR 30 §7).
C3	Gateway must not hold user signing keys; PDS `deleteRecord` runs in the worker with restored owner OAuth (ADR 28).
C4	API models for HTTP remain in `crates/ingress/src/models` per repository root `AGENTS.md`.
C5	Self-hosted `deployment_mode` skips SaaS quota math; node `max_bytes` policy is unchanged by this ADR (ADR 32).
C6	Unpin on the libp2p wire uses a dedicated JWT (not session JWT), analogous to replication pin (ADR 28 / swarm security gaps §3).
C7	Sync-failures dashboard is a modular monolith route under Account in file-explorer (ADR 33); DTOs live in `crates/ingress/src/models` (e.g. `sync_failures.rs`).

Cross-cutting challenges (CC1–CC6)

ID	Challenge	Mitigation
CC1	Owner PDS OAuth cold at delete time	HTTP 200 + catalog removed; `receipt_sync: pending`; worker retries/poison per ADR 28
CC2	Large folder delete (many files → many outbox rows + block GC)	Path-prefix query; batch enqueue in one txn; consider per-request job cap in ingress if needed
CC3	Chunked files (above 25 MiB) span manifest + N blocks	Shared ingress helper with upload chunk layout; delete all CIDs when refcount zero
CC4	Triangle node offline or RPC timeout during unpin	`mesh_unpin_outbox` + `MeshUnpinWorker`; terminal `failed` = DLQ; Account sync-failures UI + RLS-scoped replay APIs (not only metrics/runbook); HTTP delete still succeeds (NR3)
CC5	Unpin cannot call `is_authorized` with deleted receipt	Owner-only unpin JWT (`sub == owner_did`, binds `cid`); inbound handler verifies JWT + deletes block without PDS receipt lookup
CC6	Grantee expects to retry owner-only mesh unpin jobs	Mesh unpin rows use `owner_did` only (delete is owner-only); grantees still see `receipt_sync_outbox` rows they enqueued via `requested_by_did`
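
The CC5 mitigation can be illustrated by the claim checks an inbound unpin handler might apply after the JWT signature has been verified. This is a sketch under stated assumptions: `UnpinClaims`, `UnpinRejection`, and `check_unpin_claims` are hypothetical, signature verification is out of scope here, and expiry is modeled as a Unix timestamp in seconds.

```rust
/// Hypothetical decoded claims of an unpin JWT (signature already verified).
#[derive(Debug, Clone)]
pub struct UnpinClaims {
    pub sub: String,
    pub owner_did: String,
    pub cid: String,
    pub exp_unix: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum UnpinRejection {
    Expired,
    SubjectNotOwner,
    OwnerMismatch,
    CidMismatch,
}

/// Checks that the token is unexpired, that `sub == owner_did`, and that
/// the token is bound to the owner and CID named in the unpin request.
/// No PDS receipt lookup is involved.
pub fn check_unpin_claims(
    claims: &UnpinClaims,
    request_owner_did: &str,
    request_cid: &str,
    now_unix: u64,
) -> Result<(), UnpinRejection> {
    if claims.exp_unix <= now_unix {
        return Err(UnpinRejection::Expired);
    }
    if claims.sub != claims.owner_did {
        return Err(UnpinRejection::SubjectNotOwner);
    }
    if claims.owner_did != request_owner_did {
        return Err(UnpinRejection::OwnerMismatch);
    }
    if claims.cid != request_cid {
        return Err(UnpinRejection::CidMismatch);
    }
    Ok(())
}
```

Binding the CID into the token means a captured JWT cannot be replayed to remove a different block, and the short TTL bounds replay against the same block.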

Decision

We adopt four-layer delete for drive nodes (library → provenance → storage → mesh), plus a fifth operability layer (sync failures dashboard under Account) for DLQ visibility and replay. Ingress orchestrates catalog in one transaction, then async PDS convergence, synchronous storage reclaim, and mesh replication teardown on the Global Triangle.

Layer 1 — Library (catalog)

`delete_drive_node` validates writability, selects subtree rows, deletes `drive_entry` (cascade removes `drive_entry_access_control`), and deletes orphaned `passport` / `passport_access_control` rows tied to removed receipt/asset CIDs.
Returns `list_tree_entries` for the parent folder.

Layer 2 — Access and filesystem provenance (async)

Pipeline	Business event	PDS effect	When
Receipt sync	`file.deleted`	`deleteRecord` on `cloud.substratum.passport.receipt`	Each deleted file with receipt material
Catalog sync	`entry.deleted`	`deleteRecord` on `cloud.substratum.filesystem.driveEntry`	Each deleted row when ADR 30 worker ships

New `ReceiptSyncEvent::FileDeleted` (persisted as `file.deleted`) carries `owner_did`, `drive_id`, `entry_id`, `asset_cid`, and receipt locator fields sufficient for idempotent `deleteRecord`.

Folder delete: no receipt job for directory-only rows; one `file.deleted` per descendant file.
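
The folder rule above, combined with the R11 idempotency key, could look like the following sketch. The types are hypothetical stand-ins for catalog rows and outbox payloads, and the colon-joined key format is an assumption; the ADR only fixes the key's components `(owner_did, entry_id, file.deleted)`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntryKind {
    File,
    Directory,
}

/// Hypothetical view of a `drive_entry` row removed by the delete.
#[derive(Debug, Clone)]
pub struct DeletedEntry {
    pub entry_id: String,
    pub kind: EntryKind,
    pub asset_cid: String,
    pub has_receipt: bool,
}

/// Hypothetical payload of one `file.deleted` outbox row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FileDeletedJob {
    pub owner_did: String,
    pub entry_id: String,
    pub asset_cid: String,
    pub idempotency_key: String,
}

/// Emits one job per deleted file that has receipt material.
/// Directories and files without a receipt produce no job.
pub fn file_deleted_jobs(owner_did: &str, deleted: &[DeletedEntry]) -> Vec<FileDeletedJob> {
    deleted
        .iter()
        .filter(|e| e.kind == EntryKind::File && e.has_receipt)
        .map(|e| FileDeletedJob {
            owner_did: owner_did.to_string(),
            entry_id: e.entry_id.clone(),
            asset_cid: e.asset_cid.clone(),
            // Assumed encoding of (owner_did, entry_id, file.deleted).
            idempotency_key: format!("{}:{}:file.deleted", owner_did, e.entry_id),
        })
        .collect()
}
```

Because the key is derived only from stable identifiers, enqueueing the same delete twice would map to the same outbox row rather than a second `deleteRecord` call.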

Layer 3 — Storage and quota (sync in handler)

Quota: `UploadPolicy::reclaim_quota_on_delete` (SaaS) subtracts the summed file sizes in the same transaction as the catalog delete.
Local blockstore: after the transaction commits, for each `asset_cid` in the subtree with zero remaining `drive_entry` references, remove its blocks from the originating gateway `Blockstore` (`crates/retrieval`; add a `remove` operation if missing), including the chunk manifest and chunk CIDs.
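
The two Layer 3 computations can be sketched as pure functions over the deleted rows. The names below are hypothetical, the set of still-referenced CIDs is assumed to come from a post-commit `drive_entry` existence check, and the saturating subtraction is a defensive assumption rather than documented behavior.

```rust
use std::collections::BTreeSet;

/// Hypothetical view of a deleted `drive_entry` row.
#[derive(Debug, Clone)]
pub struct DeletedRow {
    pub is_directory: bool,
    pub size: u64,
    pub asset_cid: Option<String>,
}

/// Sum of deleted file sizes; directories contribute 0 (R8).
pub fn reclaimed_bytes(deleted: &[DeletedRow]) -> u64 {
    deleted
        .iter()
        .filter(|r| !r.is_directory)
        .map(|r| r.size)
        .sum()
}

/// New `used_bytes` after reclaim. Saturates at zero so that accounting
/// drift cannot underflow the counter.
pub fn apply_reclaim(used_bytes: u64, reclaimed: u64) -> u64 {
    used_bytes.saturating_sub(reclaimed)
}

/// Distinct CIDs from the deleted subtree that no remaining
/// `drive_entry` references; only these are eligible for block removal.
pub fn orphaned_cids(
    deleted: &[DeletedRow],
    still_referenced: &BTreeSet<String>,
) -> BTreeSet<String> {
    deleted
        .iter()
        .filter_map(|r| r.asset_cid.clone())
        .filter(|cid| !still_referenced.contains(cid))
        .collect()
}
```

A file whose CID is still referenced by another entry reduces quota by its size but keeps its blocks, which is the behavior the existence check in R9 is meant to guarantee.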

Layer 4 — Mesh replication teardown (durable outbox + DLQ)

Upload pin pushes blocks to triangle peers via `SwarmCommand::Pin` and `/substratum/replication/1.0.0` (`GatewayPinPort`). Delete must symmetrically unpin via a second async pipeline (parallel to receipt sync, not inline-only RPC):

Protocol: extend replication (prefer `/substratum/replication/1.1.0` or a CBOR enum) with `UnpinRequest { cid, owner_did, jwt }` and `UnpinResponse`. No block payload on the wire.
Auth: `create_replication_unpin_jwt` in `substratum-auth` carries the owner `sub`, is bound to `cid` + `owner_did`, and has a short TTL; the inbound side runs `validate_replication_unpin` before `blockstore.remove` (NR5).
Outbox table `mesh_unpin_outbox` (new migration), with columns aligned with `receipt_sync_outbox`: `owner_did`, `requested_by_did` (same as deleter), `target_peer_id`, `asset_cid`, `queue` (`mesh_unpin`), `payload_json`, `idempotency_key`, `status`, `attempts`, `next_retry_at`, `last_error`, timestamps. RLS policies: `mesh_unpin_outbox_caller` (`owner_did` OR `requested_by_did` = `app.current_user_did`) and `mesh_unpin_outbox_worker` (`app.mesh_unpin_worker = 'true'`). `claim_mesh_unpin_job()` is `SECURITY DEFINER` and reserved for worker polling.
Ingress: in the delete transaction (or immediately after commit), after local blockstore remove (Layer 3), enqueue one row per (`cid`, `PINNING_TARGETS` peer). Do not block HTTP on libp2p round-trips.
`MeshUnpinWorker` (gateway background task, `MESH_UNPIN_ENABLED` / `MESH_UNPIN_POLL_MS`): claims jobs, mints the unpin JWT, and sends `SwarmCommand::Unpin` + `send_request`; on success → `synced`; on transient error → backoff retry; on `max_attempts` → `failed` (DLQ) for operator visibility (same poison semantics as `ReceiptSyncWorker`).
DLQ operations: metrics (NR6) plus Layer 5 sync-failures UI/API (R15–R18). Runbook remains for break-glass; primary replay is in-app under RLS. Policy TBD: TTL on stuck failed rows (mirror ADR 28 orphan guidance, e.g. alert after 30 days).
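
The worker's status transitions described above can be sketched as a small state machine. This is an illustration only: `RetryPolicy`, its doubling schedule, and the cap are assumptions (the ADR requires exponential backoff and `max_attempts` but does not fix the constants), and `next_retry_in` stands in for the `next_retry_at` column.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JobStatus {
    Pending,
    Synced,
    Failed, // terminal: the DLQ
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct JobState {
    pub status: JobStatus,
    pub attempts: u32,
    pub next_retry_in: Option<Duration>,
    pub last_error: Option<String>,
}

impl JobState {
    pub fn new() -> Self {
        JobState { status: JobStatus::Pending, attempts: 0, next_retry_in: None, last_error: None }
    }
}

/// Assumed policy: delay doubles per attempt, bounded by `cap`.
pub struct RetryPolicy {
    pub base: Duration,
    pub cap: Duration,
    pub max_attempts: u32,
}

impl RetryPolicy {
    /// Delay after the given number of failed attempts: base * 2^(attempts-1).
    pub fn backoff(&self, attempts: u32) -> Duration {
        let exp = attempts.saturating_sub(1).min(31);
        let factor = 1u32 << exp;
        self.base.checked_mul(factor).unwrap_or(self.cap).min(self.cap)
    }
}

/// Applies the outcome of one unpin attempt to the job.
pub fn record_attempt(policy: &RetryPolicy, mut job: JobState, outcome: Result<(), String>) -> JobState {
    job.attempts += 1;
    match outcome {
        Ok(()) => {
            job.status = JobStatus::Synced;
            job.next_retry_in = None;
            job.last_error = None;
        }
        Err(e) => {
            job.last_error = Some(e);
            if job.attempts >= policy.max_attempts {
                job.status = JobStatus::Failed;
                job.next_retry_in = None;
            } else {
                job.status = JobStatus::Pending;
                job.next_retry_in = Some(policy.backoff(job.attempts));
            }
        }
    }
    job
}
```

Keeping `last_error` on the terminal row is what lets the sync-failures page show a reason next to each failed job without consulting logs.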

Home-base / sidecar do not host this worker until replication listener parity exists.

Layer 5 — Sync failures dashboard (Account, RLS-scoped)

Production delete requires users to see and retry failed background sync jobs without psql or log scraping.

Not a separate application (ADR 33): one Vite bundle, one OAuth origin, one installer staging path. Add `/account/sync-failures` in `apps/file-explorer` under the Account bounded context: secondary nav on account pages (Overview + Sync failures); the existing shell Account item stays the only top-level entry (`route.startsWith('/account')`). The sub-link is gated by deployment bootstrap (v1 self-hosted: show for authenticated users; SaaS may add `sync_failures_enabled` later).

Ingress (`crates/ingress/src/models/sync_failures.rs`), namespace `/api/v1/me/sync-failures` (same me-scoped surface as `GET /api/v1/me/limits`, not a generic `/ops` prefix):

Endpoint	Behavior
`GET /api/v1/me/sync-failures?queue=&status=failed`	Paginated list; `queue` = `receipt_sync` | `mesh_unpin`
`GET /api/v1/me/sync-failures/{id}`	Detail + optional catalog context (path) via RLS-visible `drive_entry` join
`POST /api/v1/me/sync-failures/{id}/retry`	`failed` → `pending`, reset attempts (R18)
`POST /api/v1/me/sync-failures/retry-bulk`	Optional: filter by `target_peer_id` (mesh) or `owner_did`

All handlers: session JWT → `AuthenticatedDid` → `with_rls_context(db, &did)` only. Never `app.receipt_sync_worker`, `app.mesh_unpin_worker`, or `DATABASE_ADMIN_URL` on these routes (resolution AGENTS).

Who sees what:

Session	Visible jobs
Owner	`owner_did = me` (upload/delete/receipt + mesh unpin)
Grantee	`requested_by_did = me` (e.g. self-removal receipt jobs); not owner’s mesh unpin
Other tenant	None (RLS)
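
The visibility rule (R15) and replay semantics (R18) can be illustrated in memory as follows. In the design the filtering is enforced by Postgres RLS, not by application code; this sketch only mirrors the rule so its consequences are easy to see. All names are hypothetical, and rejecting a retry on a row that is not `failed` is an assumption the ADR does not spell out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Pending,
    Synced,
    Failed,
}

/// Hypothetical projection of an outbox row.
#[derive(Debug, Clone)]
pub struct OutboxRow {
    pub id: u64,
    pub owner_did: String,
    pub requested_by_did: String,
    pub status: Status,
    pub attempts: u32,
}

/// Mirrors the caller policy: owner OR requester may see the row.
pub fn visible_to(row: &OutboxRow, session_did: &str) -> bool {
    row.owner_did == session_did || row.requested_by_did == session_did
}

#[derive(Debug, PartialEq, Eq)]
pub enum RetryError {
    /// Row is missing or invisible; both map to 404 so IDs leak nothing.
    NotFound,
    /// Assumed behavior: only `failed` rows can be replayed.
    NotFailed,
}

/// Replays a failed job: `failed` → `pending`, attempts reset.
pub fn retry(rows: &mut [OutboxRow], id: u64, session_did: &str) -> Result<(), RetryError> {
    let row = rows
        .iter_mut()
        .find(|r| r.id == id && visible_to(r, session_did))
        .ok_or(RetryError::NotFound)?;
    if row.status != Status::Failed {
        return Err(RetryError::NotFailed);
    }
    row.status = Status::Pending;
    row.attempts = 0;
    Ok(())
}
```

Returning the same `NotFound` for a nonexistent ID and for another tenant's ID is what prevents the retry endpoint from acting as a cross-tenant ID oracle.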

Platform cross-tenant console (SaaS staff viewing all tenants) is out of scope — requires explicit break-glass auth, not caller RLS.

HTTP and UI

OpenAPI: existing `delete_drive_node`; extend the response with optional `receipt_sync` when jobs are enqueued; add `me/sync-failures` routes (R16).
Explorer: remove the alpha stub; confirmation panel per R10; Account → Sync failures (`/account/sync-failures`) per R17.

Out of scope (this ADR)

Deleting entire drives (no API).
Grantee delete of shared content (ACL self-removal only).
Home-base / sidecar mesh unpin (until replication listener parity with gateway).
Guaranteed removal from every historical peer that ever saw a block (only configured `PINNING_TARGETS` receive an explicit unpin RPC; there is no DHT-wide GC).
Separate sync-failures SPA or micro-frontend (ADR 33).
Platform-wide cross-tenant DLQ console (Substratum PDS staff SSO per ADR 38); this ADR covers only the tenant RLS-scoped console (Layer 5).

Rejected alternatives

Alternative	Why rejected
Catalog-only delete (status quo)	Leaves PDS receipts and block bytes; mesh and quota remain wrong.
Synchronous owner `deleteRecord` in HTTP handler	Reintroduces OAuth coupling ADR 28 removed.
`file.deleted` on catalog-sync outbox	Passport collections are receipt-sync only (C2).
Delete drive root via same endpoint	Drive is the tenancy boundary; requires separate lifecycle ADR.
Soft-delete / trash retention	Deferred; v1 is hard delete with four-layer semantics.
Separate sync-failures web application	Second OAuth origin, second installer bundle; ADR 33 monolith is sufficient.
DLQ replay only via runbook/SQL	Does not meet operator UX; R15–R18 require in-app retry under RLS.
Reference-count blocks in Postgres	Extra schema; v1 uses post-delete `drive_entry` existence check per `asset_cid`.
Local-only blockstore GC (no triangle unpin)	Triangle nodes retain copies; undermines “delete frees storage” and data-sovereignty expectations.
Inline-only unpin RPC without outbox	Offline peers lose teardown permanently; no DLQ for operators.

Consequences

Positive

Delete matches user mental model: library, access, and storage align.
Receipt tombstones restore mesh truth on owner PDS without blocking HTTP.
Quota and disk usage track uploads (ADR 32).
Clear split between receipt-sync and catalog-sync events.

Negative

Folder deletes fan out N receipt jobs, block GC, and N × triangle unpin RPCs.
Eventual PDS convergence: UI may show `receipt_sync: pending` briefly.
Triangle unpin DLQ depth must be monitored; stuck failed rows mean replicas may still hold bytes until user retry from sync-failures UI (CC4).
Sync-failures dashboard adds API surface area; must stay behind session + RLS (NR7).

Neutral

ADR 30 catalog-sync can ship after receipt delete; `entry.deleted` is idempotent when filesystem records were never written.
Operations doc catalog-vs-blockstore-storage must describe post-delete lifecycle.

Verification

Scenario	Expected
Owner deletes uploaded file	200; entry absent from tree; `used_bytes` decreased (SaaS); local + triangle blockstore `get(asset_cid)` empty; receipt job enqueued
Mesh unpin worker (integration)	Pending rows drain to `synced`; simulated peer down → retries → `failed` DLQ
Metrics / DLQ alert	Nonzero `mesh_unpin_outbox_failed` count triggers dashboard or log alert (NR6)
Worker processes `file.deleted`	Owner repo record gone; catalog `receipt_sync` → `synced`
Delete folder with 3 files	3 outbox rows; all paths removed; quota sum of 3 sizes
Grantee calls DELETE on shared path	403 or 404
Delete same path twice	Second call 404
Self-hosted delete	Catalog + blocks reclaimed; no quota mutation
RLS isolation (two owners)	Owner A cannot list/retry Owner B failed `mesh_unpin` / `receipt_sync` rows
Sync-failures retry after peer recovery	`failed` → `pending` via `POST /me/sync-failures/{id}/retry`; worker drains to `synced`

Related

Glossary
ADR 33: Frontend Modular Monolith — sync-failures UI stays in file-explorer under Account
ADR 27: Zero Trust PDS-Based Provenance
ADR 28: Receipt Sync Queue and Grantee Access Removal
ADR 30: Catalog–PDS Dual-Write — entry.deleted
ADR 32: Account entitlements and hosting policy
Catalog vs blockstore storage
Swarm command security gaps — pin JWT model; add unpin JWT + inbound handler
Ingress AGENTS — handlers that enqueue vs do not
Passport sync AGENTS
Implementation plan: provenance-aware delete
