
Architecture Decision Record (ADR) 35: Drive Node Delete (Four-Layer Removal)

Status: Proposed
Date: 2026-06-04
Last Updated: 2026-06-05

Terms (this ADR)

| ID | Term | Meaning |
|---|---|---|
| Rn | Functional requirement | Numbered obligation the system must meet (R1, R2, …). |
| NRn | Non-functional requirement | Quality attribute: security, performance, operability (NR1, NR2, …). |
| Cn | Constraint | Non-negotiable boundary; violating it invalidates the decision (C1, C2, …). |
| CCn | Cross-cutting challenge | Risk or tension that spans components, with a documented mitigation (CC1, …). |
| Catalog | PostgreSQL plane | drive, drive_entry, ACL junction tables, passport index; fast listing and RLS (ADR 09). |
| Blockstore | Content plane | FlatFS or S3-compatible object storage for file bytes (catalog vs blockstore). |
| Receipt sync | Async pipeline | receipt_sync_outbox + ReceiptSyncWorker; owner-repo cloud.substratum.passport.receipt (ADR 28). |
| Catalog sync | Async pipeline (planned) | catalog_sync_outbox + CatalogSyncWorker; owner-repo cloud.substratum.filesystem.* (ADR 30). |
| Subtree delete | HTTP semantics | DELETE /api/v1/drives/{drive_id}/nodes?path= removes the node at path and all descendants. |
| Global Triangle | Cloud mesh nodes | Configured gateway replicas (PINNING_TARGETS, bootstrap peers) that received blocks via /substratum/replication/1.0.0 (ADR 17). |
| Mesh unpin | Replication teardown | Owner-initiated removal of a CID from local and triangle blockstores after catalog refcount is zero. |
| Mesh unpin outbox | Durable retry queue | Postgres mesh_unpin_outbox + MeshUnpinWorker; terminal failed rows are the DLQ, mirroring receipt_sync_outbox (ADR 28). |
| Sync failures dashboard | Account DLQ UI | /account/sync-failures in file-explorer (not a separate SPA per ADR 33); lists/replays failed receipt_sync / mesh_unpin outbox rows under Postgres RLS. |

Canonical product vocabulary: Glossary. Use Drive for the user-scoped root (ADR 16).

Context

The file-explorer exposes Delete… for owned files and folders (ADR 15). Ingress already implements delete_drive_node, and the UI calls DELETE /api/v1/drives/{drive_id}/nodes, but the delete panel still states the action is unavailable in alpha.

Shipped behavior is incomplete. delete_subtree removes matching drive_entry rows only. It does not:

  • Remove passport / passport_access_control catalog rows.
  • Enqueue receipt sync to tombstone owner-repo receipts (mesh still consults PDS per ADR 27).
  • Enqueue catalog sync for filesystem.driveEntry (ADR 30 — pipeline not shipped).
  • Decrement SaaS entitlement.used_bytes (ADR 32) despite commit_quota on upload.
  • Reclaim blockstore bytes when the last catalog reference to an asset_cid is gone.
  • Unpin replicated blocks on all configured Global Triangle nodes (not only the originating gateway).

Users reasonably expect delete to mean gone from the library, access revoked, and storage freed on every node that held a copy. Catalog-only removal satisfies only the first—and poorly when passport rows, quota, block bytes, or triangle replicas remain.

Grantee shared entries must not use owner delete; grantees use self-removal via ACL patch (ADR 28 §3). Drive roots are not deletable via this endpoint.

Requirements

Functional requirements (R1–R18)

| ID | Requirement |
|---|---|
| R1 | Owner (or sole writer per RLS) may delete a file or folder at a non-empty path via DELETE /api/v1/drives/{drive_id}/nodes?path=. |
| R2 | Subtree delete removes the target path and every descendant path under {path}/. |
| R3 | Delete is forbidden for the drive root, for entries the caller does not own, and for grantee-shared browse (product: context menu disabled; API: 403/404). |
| R4 | In one database transaction, ingress removes catalog rows (drive_entry, related passport / ACL index rows) for the subtree. |
| R5 | For each deleted file with a passport receipt, ingress enqueues file.deleted on receipt_sync_outbox; the worker calls com.atproto.repo.deleteRecord on cloud.substratum.passport.receipt at the deterministic rkey (receipt_record_rkey(owner_did, asset_cid)). |
| R6 | When ADR 30 catalog-sync is enabled, ingress enqueues entry.deleted per removed drive_entry row; CatalogSyncWorker issues deleteRecord on cloud.substratum.filesystem.driveEntry. |
| R7 | HTTP response returns refreshed parent listing (NodeMutationResponse) and optional receipt_sync: pending when receipt jobs were enqueued (same honesty model as ACL patch). |
| R8 | On SaaS deployment_mode, delete decrements entitlement.used_bytes by the sum of deleted file size values (directories contribute 0). |
| R9 | After catalog commit, for each distinct asset_cid in the deleted subtree, if no remaining drive_entry references that CID (any drive for that owner under RLS), gateway removes the object(s) from the local blockstore (including chunk manifest and chunk CIDs for files above the swarm block cap). |
| R10 | File-explorer delete panel uses real confirmation copy (permanent, folder includes descendants, quota impact on SaaS) and refreshes navigation after success (ADR 20). |
| R11 | Idempotency: repeat delete of a missing path returns 404; receipt outbox uses idempotency key (owner_did, entry_id, file.deleted). |
| R12 | Bruno/OpenAPI document DELETE nodes; integration and E2E tests cover catalog removal, quota, and blockstore absence for a deleted file. |
| R13 | When refcount for an asset_cid reaches zero, the originating gateway removes the block locally, then enqueues one mesh_unpin_outbox row per (asset_cid, target_peer_id) for every PINNING_TARGETS peer (and per chunk CID). |
| R14 | MeshUnpinWorker claims pending rows, issues UnpinRequest over libp2p, applies exponential backoff retries, and after max_attempts moves rows to failed (DLQ) with last_error. HTTP delete does not fail when DLQ rows exist (NR3). |
| R15 | Authenticated users list only outbox jobs they may access under RLS: owner_did = session DID OR requested_by_did = session DID (same rule as receipt_sync_outbox_caller). Applies to receipt_sync_outbox and mesh_unpin_outbox. |
| R16 | Ingress exposes sync-failures APIs under the account/me surface (e.g. GET /api/v1/me/sync-failures, POST …/{id}/retry) implemented with with_rls_context on the gateway pool; never worker session or admin URL for list/replay. |
| R17 | File-explorer provides a Sync failures page as an Account sub-route (/account/sync-failures), reachable via in-account sub-nav (not a top-level shell item): summary counts, failed-job table, detail, Retry and optional Discard / Retry all for peer. |
| R18 | HTTP replay sets status = pending, resets attempts / next_retry_at; returns 404 when row is invisible under RLS (no cross-tenant ID oracle). |
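
The subtree rule in R2 and the root exclusion in R3 can be illustrated with a small predicate. This is a minimal sketch, not the ingress implementation: `DeleteTarget`, `classify_target`, and `in_subtree` are hypothetical names, and it assumes catalog paths are stored without leading or trailing slashes. A production query would express the same predicate as an indexed prefix match (NR2).

```rust
/// Result of classifying the `path` query parameter (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum DeleteTarget {
    /// Empty path or "/" addresses the drive root, which is not deletable (R3).
    Root,
    /// Normalized, non-empty path without surrounding slashes.
    Node(String),
}

/// Normalize the requested path. Assumes stored paths look like "docs/reports".
pub fn classify_target(path: &str) -> DeleteTarget {
    let trimmed = path.trim_matches('/');
    if trimmed.is_empty() {
        DeleteTarget::Root
    } else {
        DeleteTarget::Node(trimmed.to_string())
    }
}

/// True when `entry_path` is the target itself or lies under `{target}/` (R2).
/// Requiring the slash keeps "docs2/a.txt" out of a delete of "docs".
pub fn in_subtree(target: &str, entry_path: &str) -> bool {
    entry_path == target
        || entry_path
            .strip_prefix(target)
            .map_or(false, |rest| rest.starts_with('/'))
}
```

Matching on `{target}/` rather than on the bare prefix is the detail that matters here: a plain prefix test would also select sibling folders whose names merely start with the target name.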

Non-functional requirements (NR1–NR7)

| ID | Requirement |
|---|---|
| NR1 | Delete HTTP handler must not block on owner PDS OAuth (enqueue-only, ADR 28). |
| NR2 | Subtree selection uses indexed path-prefix queries, not full-drive scans. |
| NR3 | Blockstore delete failure after successful catalog commit must not fail the HTTP response; log and metric for operator follow-up. |
| NR4 | Per-owner receipt jobs remain ordered within receipt_sync_outbox (ADR 28). |
| NR5 | Mesh unpin must not depend on a live passport.receipt on PDS (receipt may already be tombstoned); use owner-scoped unpin JWT (swarm security gaps). |
| NR6 | Expose metrics: mesh_unpin_outbox_pending, mesh_unpin_outbox_failed (DLQ depth), claim lag, and per-peer failure rate (CC4). |
| NR7 | Sync-failures UI and replay APIs use the same gateway DB role as ingress (substratum_gateway); superuser/admin pools must not bypass RLS. |

Constraints (C1–C7)

| ID | Constraint |
|---|---|
| C1 | Mesh authorization follows owner-repo receipts only; catalog delete alone must not be treated as revoking mesh access (ADR 27). |
| C2 | Passport tombstones use receipt_sync_outbox / ReceiptSyncWorker only, not catalog-sync (ADR 30 §7). |
| C3 | Gateway must not hold user signing keys; PDS deleteRecord runs in the worker with restored owner OAuth (ADR 28). |
| C4 | API models for HTTP remain in crates/ingress/src/models per repository root AGENTS.md. |
| C5 | Self-hosted deployment_mode skips SaaS quota math; node max_bytes policy is unchanged by this ADR (ADR 32). |
| C6 | Unpin on the libp2p wire uses a dedicated JWT (not session JWT), analogous to replication pin (ADR 28 / swarm security gaps §3). |
| C7 | Sync-failures dashboard is a modular monolith route under Account in file-explorer (ADR 33); DTOs live in crates/ingress/src/models (e.g. sync_failures.rs). |

Cross-cutting challenges (CC1–CC6)

| ID | Challenge | Mitigation |
|---|---|---|
| CC1 | Owner PDS OAuth cold at delete time | HTTP 200 + catalog removed; receipt_sync: pending; worker retries/poison per ADR 28 |
| CC2 | Large folder delete (many files → many outbox rows + block GC) | Path-prefix query; batch enqueue in one txn; consider per-request job cap in ingress if needed |
| CC3 | Chunked files (above 25 MiB) span manifest + N blocks | Shared ingress helper with upload chunk layout; delete all CIDs when refcount zero |
| CC4 | Triangle node offline or RPC timeout during unpin | mesh_unpin_outbox + MeshUnpinWorker; terminal failed = DLQ; Account sync-failures UI + RLS-scoped replay APIs (not only metrics/runbook); HTTP delete still succeeds (NR3) |
| CC5 | Unpin cannot call is_authorized with deleted receipt | Owner-only unpin JWT (sub == owner_did, binds cid); inbound handler verifies JWT + deletes block without PDS receipt lookup |
| CC6 | Grantee expects to retry owner-only mesh unpin jobs | Mesh unpin rows use owner_did only (delete is owner-only); grantees still see receipt_sync_outbox rows they enqueued via requested_by_did |
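
The claim checks implied by CC5 (sub == owner_did, token bound to the CID, short TTL) might look like the sketch below. It is an illustration only: `UnpinClaims`, `UnpinAuthError`, and `check_unpin_claims` are hypothetical names, the claim layout is assumed, and signature verification is assumed to have happened before this function is called.

```rust
/// Claims of the unpin JWT after its signature has been verified elsewhere.
/// Field names are assumptions for illustration.
#[derive(Debug, Clone)]
pub struct UnpinClaims {
    pub sub: String,
    pub owner_did: String,
    pub cid: String,
    /// Expiry as Unix seconds.
    pub exp: u64,
}

#[derive(Debug, PartialEq)]
pub enum UnpinAuthError {
    Expired,
    SubjectNotOwner,
    OwnerMismatch,
    CidMismatch,
}

/// Check the claims against the inbound UnpinRequest fields.
/// No PDS receipt lookup is involved (NR5): the receipt may already be tombstoned.
pub fn check_unpin_claims(
    claims: &UnpinClaims,
    request_owner_did: &str,
    request_cid: &str,
    now_unix: u64,
) -> Result<(), UnpinAuthError> {
    if now_unix >= claims.exp {
        return Err(UnpinAuthError::Expired);
    }
    if claims.sub != claims.owner_did {
        return Err(UnpinAuthError::SubjectNotOwner);
    }
    if claims.owner_did != request_owner_did {
        return Err(UnpinAuthError::OwnerMismatch);
    }
    if claims.cid != request_cid {
        return Err(UnpinAuthError::CidMismatch);
    }
    Ok(())
}
```

Binding the token to one CID means a captured token cannot be replayed to unpin a different block, and the short expiry limits how long it can be replayed at all.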

Decision

We adopt four-layer delete for drive nodes (library → provenance → storage → mesh), plus a fifth operability layer (sync failures dashboard under Account) for DLQ visibility and replay. Ingress orchestrates catalog in one transaction, then async PDS convergence, synchronous storage reclaim, and mesh replication teardown on the Global Triangle.

Layer 1 — Library (catalog)

  • delete_drive_node validates writability, selects subtree rows, deletes drive_entry (cascade removes drive_entry_access_control), and deletes orphaned passport / passport_access_control rows tied to removed receipt/asset CIDs.
  • Returns list_tree_entries for the parent folder.

Layer 2 — Access and filesystem provenance (async)

| Pipeline | Business event | PDS effect | When |
|---|---|---|---|
| Receipt sync | file.deleted | deleteRecord on cloud.substratum.passport.receipt | Each deleted file with receipt material |
| Catalog sync | entry.deleted | deleteRecord on cloud.substratum.filesystem.driveEntry | Each deleted row when ADR 30 worker ships |

New ReceiptSyncEvent::FileDeleted (persisted as file.deleted) carries owner_did, drive_id, entry_id, asset_cid, and receipt locator fields sufficient for idempotent deleteRecord.

Folder delete: no receipt job for directory-only rows; one file.deleted per descendant file.
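
The folder fan-out rule (no job for directory rows, one file.deleted per descendant file) and the R11 idempotency key can be sketched as follows. The struct and enum shapes are assumptions for illustration; the real `ReceiptSyncEvent::FileDeleted` also carries receipt locator fields, and the `|`-joined key encoding is hypothetical.

```rust
/// One catalog row selected for deletion (illustrative fields only).
#[derive(Debug, Clone)]
pub struct DeletedEntry {
    pub entry_id: String,
    pub is_dir: bool,
    /// Present only for files that have passport receipt material.
    pub asset_cid: Option<String>,
}

/// Assumed shape of the new event variant; receipt locator fields are omitted.
#[derive(Debug, Clone, PartialEq)]
pub enum ReceiptSyncEvent {
    FileDeleted {
        owner_did: String,
        drive_id: String,
        entry_id: String,
        asset_cid: String,
    },
}

impl ReceiptSyncEvent {
    /// Name under which the event is persisted.
    pub fn name(&self) -> &'static str {
        match self {
            ReceiptSyncEvent::FileDeleted { .. } => "file.deleted",
        }
    }

    /// Key over (owner_did, entry_id, event name) per R11.
    /// '|' is used as a separator because DIDs themselves contain ':'.
    pub fn idempotency_key(&self) -> String {
        match self {
            ReceiptSyncEvent::FileDeleted { owner_did, entry_id, .. } => {
                format!("{}|{}|{}", owner_did, entry_id, self.name())
            }
        }
    }
}

/// Build one event per deleted file with receipt material; directories are skipped.
pub fn events_for_subtree(
    owner_did: &str,
    drive_id: &str,
    rows: &[DeletedEntry],
) -> Vec<ReceiptSyncEvent> {
    rows.iter()
        .filter(|row| !row.is_dir)
        .filter_map(|row| {
            row.asset_cid.as_ref().map(|cid| ReceiptSyncEvent::FileDeleted {
                owner_did: owner_did.to_string(),
                drive_id: drive_id.to_string(),
                entry_id: row.entry_id.clone(),
                asset_cid: cid.clone(),
            })
        })
        .collect()
}
```

For a folder holding two files, this yields two events and none for the folder row itself.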

Layer 3 — Storage and quota (sync in handler)

  • Quota: UploadPolicy::reclaim_quota_on_delete (SaaS) subtracts summed file sizes in the same txn as catalog delete.
  • Local blockstore: after txn commit, for each asset_cid in the subtree with zero remaining drive_entry references, remove blocks from the originating gateway Blockstore (crates/retrieval — add remove if missing), including chunk manifest and chunk CIDs.
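
The two Layer 3 computations, summing reclaimable bytes and finding CIDs with no remaining catalog reference, can be sketched in memory as below. `EntryRow` and the function names are hypothetical, and the saturating subtraction is an assumption about how an underflow should be handled; the ADR does not specify it.

```rust
use std::collections::BTreeSet;

/// Minimal view of a drive_entry row (illustrative fields only).
#[derive(Debug, Clone)]
pub struct EntryRow {
    pub is_dir: bool,
    pub size: u64,
    pub asset_cid: Option<String>,
}

/// Bytes to subtract from entitlement.used_bytes: file sizes only,
/// directories contribute 0 (R8).
pub fn reclaimable_bytes(deleted: &[EntryRow]) -> u64 {
    deleted.iter().filter(|r| !r.is_dir).map(|r| r.size).sum()
}

/// Apply the reclaim without underflowing (assumed policy: clamp at zero).
pub fn apply_reclaim(used_bytes: u64, reclaimed: u64) -> u64 {
    used_bytes.saturating_sub(reclaimed)
}

/// Distinct CIDs from the deleted subtree that no remaining row references (R9).
/// `remaining` stands in for the post-commit drive_entry existence check.
pub fn unreferenced_cids(deleted: &[EntryRow], remaining: &[EntryRow]) -> BTreeSet<String> {
    let still_used: BTreeSet<&str> = remaining
        .iter()
        .filter_map(|r| r.asset_cid.as_deref())
        .collect();
    deleted
        .iter()
        .filter_map(|r| r.asset_cid.as_deref())
        .filter(|cid| !still_used.contains(cid))
        .map(str::to_string)
        .collect()
}
```

A CID that is still referenced by another entry of the same owner stays in the blockstore, which is why the check runs against the rows that survive the commit rather than against the deleted set alone.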

Layer 4 — Mesh replication teardown (durable outbox + DLQ)

Upload pin pushes blocks to triangle peers via SwarmCommand::Pin and /substratum/replication/1.0.0 (GatewayPinPort). Delete must symmetrically unpin via a second async pipeline (parallel to receipt sync, not inline-only RPC):

  1. Protocol: extend replication (prefer /substratum/replication/1.1.0 or CBOR enum) with UnpinRequest { cid, owner_did, jwt } and UnpinResponse. No block payload on the wire.
  2. Auth: create_replication_unpin_jwt in substratum-auth — owner sub, bound cid + owner_did, short TTL; inbound validate_replication_unpin before blockstore.remove (NR5).
  3. Outbox table mesh_unpin_outbox (new migration): columns aligned with receipt_sync_outbox: owner_did, requested_by_did (same as deleter), target_peer_id, asset_cid, queue (mesh_unpin), payload_json, idempotency_key, status, attempts, next_retry_at, last_error, timestamps. RLS policies: mesh_unpin_outbox_caller (owner_did OR requested_by_did = app.current_user_did) and mesh_unpin_outbox_worker (app.mesh_unpin_worker = 'true'). claim_mesh_unpin_job() SECURITY DEFINER for worker poll only.
  4. Ingress: in the delete transaction (or immediately after commit), after local blockstore remove (Layer 3), enqueue one row per (cid, PINNING_TARGETS peer) — do not block HTTP on libp2p round-trips.
  5. MeshUnpinWorker (gateway background task, MESH_UNPIN_ENABLED / MESH_UNPIN_POLL_MS): claims jobs, mints unpin JWT, SwarmCommand::Unpin + send_request; on success → synced; on transient error → backoff retry; on max_attempts → failed (DLQ) for operator visibility (same poison semantics as ReceiptSyncWorker).
  6. DLQ operations: metrics (NR6) plus Layer 5 sync-failures UI/API (R15–R18). Runbook remains for break-glass; primary replay is in-app under RLS. Policy TBD: TTL on stuck failed rows (mirror ADR 28 orphan guidance, e.g. alert after 30 days).
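
Steps 4 and 5 above can be sketched as a fan-out plus a status transition function. This is a minimal in-memory model under stated assumptions: `MeshUnpinJob`, `fan_out`, `backoff`, and `record_attempt` are hypothetical names, the idempotency key format is invented for illustration, and the backoff is a plain doubling schedule with a cap, since the ADR does not fix the exact curve.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JobStatus {
    Pending,
    Synced,
    /// Terminal state; these rows form the DLQ.
    Failed,
}

#[derive(Debug, Clone)]
pub struct MeshUnpinJob {
    pub asset_cid: String,
    pub target_peer_id: String,
    pub idempotency_key: String,
    pub status: JobStatus,
    pub attempts: u32,
    /// Delay before the next claim; a real row would store an absolute next_retry_at.
    pub retry_in: Option<Duration>,
    pub last_error: Option<String>,
}

/// Enqueue one pending row per (cid, peer) pair (step 4).
pub fn fan_out(cids: &[&str], peers: &[&str]) -> Vec<MeshUnpinJob> {
    let mut jobs = Vec::with_capacity(cids.len() * peers.len());
    for cid in cids {
        for peer in peers {
            jobs.push(MeshUnpinJob {
                asset_cid: cid.to_string(),
                target_peer_id: peer.to_string(),
                idempotency_key: format!("{}|{}|mesh_unpin", cid, peer),
                status: JobStatus::Pending,
                attempts: 0,
                retry_in: None,
                last_error: None,
            });
        }
    }
    jobs
}

/// Doubling backoff: base * 2^(attempts - 1), capped at `cap`.
pub fn backoff(base: Duration, attempts: u32, cap: Duration) -> Duration {
    let exp = attempts.saturating_sub(1).min(16);
    base.saturating_mul(1u32 << exp).min(cap)
}

/// Record the outcome of one unpin RPC and move the job to its next state (step 5).
pub fn record_attempt(
    job: &mut MeshUnpinJob,
    result: Result<(), String>,
    max_attempts: u32,
    base: Duration,
    cap: Duration,
) {
    job.attempts += 1;
    match result {
        Ok(()) => {
            job.status = JobStatus::Synced;
            job.retry_in = None;
            job.last_error = None;
        }
        Err(err) => {
            job.last_error = Some(err);
            if job.attempts >= max_attempts {
                job.status = JobStatus::Failed;
                job.retry_in = None;
            } else {
                job.status = JobStatus::Pending;
                job.retry_in = Some(backoff(base, job.attempts, cap));
            }
        }
    }
}
```

With two CIDs and three peers the fan-out produces six rows, and each row fails or succeeds independently, so one offline peer does not hold back teardown on the others.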

Home-base / sidecar do not host this worker until replication listener parity exists.

Layer 5 — Sync failures dashboard (Account, RLS-scoped)

Production delete requires users to see and retry failed background sync jobs without psql or log scraping.

Not a separate application (ADR 33): one Vite bundle, one OAuth origin, one installer staging path. Add /account/sync-failures in apps/file-explorer under the Account bounded context: secondary nav on account pages (Overview + Sync failures); existing shell Account item stays the only top-level entry (route.startsWith('/account')). Sub-link gated by deployment bootstrap (v1 self-hosted: show for authenticated users; SaaS may add sync_failures_enabled later).

Ingress (crates/ingress/src/models/sync_failures.rs) — namespace /api/v1/me/sync-failures (same me-scoped surface as GET /api/v1/me/limits, not a generic /ops prefix):

| Endpoint | Behavior |
|---|---|
| GET /api/v1/me/sync-failures?queue=&status=failed | Paginated list; queue = receipt_sync or mesh_unpin |
| GET /api/v1/me/sync-failures/{id} | Detail + optional catalog context (path) via RLS-visible drive_entry join |
| POST /api/v1/me/sync-failures/{id}/retry | failed → pending, reset attempts (R18) |
| POST /api/v1/me/sync-failures/retry-bulk | Optional: filter by target_peer_id (mesh) or owner_did |

All handlers: session JWT → AuthenticatedDid → with_rls_context(db, &did) only. Never app.receipt_sync_worker, app.mesh_unpin_worker, or DATABASE_ADMIN_URL on these routes (resolution AGENTS).

Who sees what:

| Session | Visible jobs |
|---|---|
| Owner | owner_did = me (upload/delete/receipt + mesh unpin) |
| Grantee | requested_by_did = me (e.g. self-removal receipt jobs); not owner's mesh unpin |
| Other tenant | None (RLS) |
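
The visibility rule and the replay behavior in R18 can be modelled in a few lines. This is an in-memory sketch, not the handler: in the real system the filter is enforced by Postgres RLS rather than application code, and `OutboxRow`, `RetryError`, `visible_to`, and `retry` are hypothetical names. The `NotFailed` outcome is an assumption, since the ADR does not say how a retry on a non-failed row is reported.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Pending,
    Synced,
    Failed,
}

#[derive(Debug, Clone)]
pub struct OutboxRow {
    pub id: u64,
    pub owner_did: String,
    pub requested_by_did: String,
    pub status: Status,
    pub attempts: u32,
}

#[derive(Debug, PartialEq)]
pub enum RetryError {
    /// Row is missing or hidden from this session; both map to HTTP 404 (R18).
    NotFound,
    /// Row is visible but not in the failed state (assumed outcome).
    NotFailed,
}

/// Caller rule from R15: owner or requester may see the row.
pub fn visible_to(row: &OutboxRow, session_did: &str) -> bool {
    row.owner_did == session_did || row.requested_by_did == session_did
}

/// Replay a failed row: failed -> pending with attempts reset.
/// Invisible rows are indistinguishable from missing ones.
pub fn retry(rows: &mut [OutboxRow], id: u64, session_did: &str) -> Result<(), RetryError> {
    let idx = rows
        .iter()
        .position(|r| r.id == id && visible_to(r, session_did))
        .ok_or(RetryError::NotFound)?;
    let row = &mut rows[idx];
    if row.status != Status::Failed {
        return Err(RetryError::NotFailed);
    }
    row.status = Status::Pending;
    row.attempts = 0;
    Ok(())
}
```

Returning the same error for a missing id and for another tenant's id is what prevents the endpoint from acting as a cross-tenant ID oracle.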

Platform cross-tenant console (SaaS staff viewing all tenants) is out of scope — requires explicit break-glass auth, not caller RLS.

HTTP and UI

  • OpenAPI: existing delete_drive_node; extend response with optional receipt_sync when jobs enqueued; add me/sync-failures routes (R16).
  • Explorer: remove alpha stub; confirmation panel per R10; Account → Sync failures (/account/sync-failures) per R17.

Out of scope (this ADR)

  • Deleting entire drives (no API).
  • Grantee delete of shared content (ACL self-removal only).
  • Home-base / sidecar mesh unpin (until replication listener parity with gateway).
  • Guaranteed every historical peer that ever saw a block (only configured PINNING_TARGETS + explicit unpin RPC, not DHT-wide GC).
  • Separate sync-failures SPA or micro-frontend (ADR 33).
  • Platform-wide cross-tenant DLQ console (Substratum PDS staff SSO per ADR 38); tenant RLS console only (Layer 5).

Rejected alternatives

| Alternative | Why rejected |
|---|---|
| Catalog-only delete (status quo) | Leaves PDS receipts and block bytes; mesh and quota remain wrong. |
| Synchronous owner deleteRecord in HTTP handler | Reintroduces OAuth coupling ADR 28 removed. |
| file.deleted on catalog-sync outbox | Passport collections are receipt-sync only (C2). |
| Delete drive root via same endpoint | Drive is the tenancy boundary; requires separate lifecycle ADR. |
| Soft-delete / trash retention | Deferred; v1 is hard delete with four-layer semantics. |
| Separate sync-failures web application | Second OAuth origin, second installer bundle; ADR 33 monolith is sufficient. |
| DLQ replay only via runbook/SQL | Does not meet operator UX; R15–R18 require in-app retry under RLS. |
| Reference-count blocks in Postgres | Extra schema; v1 uses post-delete drive_entry existence check per asset_cid. |
| Local-only blockstore GC (no triangle unpin) | Triangle nodes retain copies; undermines “delete frees storage” and data-sovereignty expectations. |
| Inline-only unpin RPC without outbox | Offline peers lose teardown permanently; no DLQ for operators. |

Consequences

Positive

  • Delete matches user mental model: library, access, and storage align.
  • Receipt tombstones restore mesh truth on owner PDS without blocking HTTP.
  • Quota and disk usage track uploads (ADR 32).
  • Clear split between receipt-sync and catalog-sync events.

Negative

  • Folder deletes fan out N receipt jobs, block GC, and N × triangle unpin RPCs.
  • Eventual PDS convergence: UI may show receipt_sync: pending briefly.
  • Triangle unpin DLQ depth must be monitored; stuck failed rows mean replicas may still hold bytes until user retry from sync-failures UI (CC4).
  • Sync-failures dashboard adds API surface area; must stay behind session + RLS (NR7).

Neutral

  • ADR 30 catalog-sync can ship after receipt delete; entry.deleted is idempotent when filesystem records were never written.
  • Operations doc catalog-vs-blockstore-storage must describe post-delete lifecycle.

Verification

| Scenario | Expected |
|---|---|
| Owner deletes uploaded file | 200; entry absent from tree; used_bytes decreased (SaaS); local + triangle blockstore get(asset_cid) empty; receipt job enqueued |
| Mesh unpin worker (integration) | Pending rows drain to synced; simulated peer down → retries → failed DLQ |
| Metrics / DLQ alert | Nonzero mesh_unpin_outbox_failed count triggers dashboard or log alert (NR6) |
| Worker processes file.deleted | Owner repo record gone; catalog receipt_sync → synced |
| Delete folder with 3 files | 3 outbox rows; all paths removed; quota sum of 3 sizes |
| Grantee calls DELETE on shared path | 403 or 404 |
| Delete same path twice | Second call 404 |
| Self-hosted delete | Catalog + blocks reclaimed; no quota mutation |
| RLS isolation (two owners) | Owner A cannot list/retry Owner B failed mesh_unpin / receipt_sync rows |
| Sync-failures retry after peer recovery | failed → pending via POST /me/sync-failures/{id}/retry; worker drains to synced |