Migration runner is fragile under parallel-merge — replace single-counter tracker with applied-set #44

New Issue

mAi · 2026-05-20T08:33:19Z

mAi commented

2026-05-20 08:33:19 +00:00

What happened (2026-05-20)

paliad/main landed three migration-bearing merges in close succession:

6c7e9ef — fermi's t-paliad-207 follow-ups, added mig 104 + mig 105.
6a20241 — hertz's t-paliad-216 Slice A, added mig 103.
e035512 — Madrid, added mig 106.

Each push triggered a Dokploy redeploy. The deploys ran sequentially with the binary embedded at that revision. Result on production paliad.de:

Deploy 1 (after fermi merge): binary has migs 104+105. golang-migrate sees current=102 → applies 104+105 → counter is now 105.
Deploy 2 (after hertz Slice A merge): binary has migs 103, 104, 105. golang-migrate sees current=105, migrate.Up() only runs versions strictly greater than current → mig 103 silently skipped. Counter stays 105.
Deploy 3 (after Madrid merge): binary has migs 103-106. Counter=105 → applies 106 only. Counter is now 106.

Production DB state after all three deploys: schema_migrations.version=106, dirty=false, but paliad.approval_requests has neither counter_payload nor previous_request_id columns and the status CHECK constraint never got 'changes_requested' added. The Slice A backend code was deployed against a schema that didn't support it. Recovered by manually applying mig 103 via Supabase SQL, but the recovery is invisible to migrate — schema_migrations still says 106, so a future migrate down would skip 103's down too.

Root cause

golang-migrate/migrate/v4 tracks applied state as a single integer in paliad.paliad_schema_migrations(version int, dirty bool). The semantics are "every version ≤ current is considered applied." There is no way to express "version 103 was skipped, please come back to it." In a parallel-merge workflow where two workers' migration numbers race, whichever one lands first claims the counter and the other gets permanently skipped.

The rebase-and-renumber guidance in the project CLAUDE.md and the head SKILL.md mitigates this when the head catches it pre-merge — but it relies on every head reading the migrations directory before every merge. That's brittle. The runner itself should be the safety net, not the convention.

Goal

Replace (or wrap) the migration runner with a gap-tolerant version. Specifically: track applied migrations as a set, not a counter. Every deploy scans the embedded FS, compares against the applied-set, and runs anything missing — regardless of when it was authored or which version is currently "highest."

Proposed shape

New table paliad.applied_migrations(version int PRIMARY KEY, name text NOT NULL, applied_at timestamptz NOT NULL DEFAULT now(), checksum text NULL). One row per applied migration. The checksum column is forward-looking — drift detection if we ever care.
New runner in internal/db/migrate.go (or wherever the current one lives):
- Open the embedded migrations FS.
- List files matching \d+_*.up.sql, parse the version prefix.
- For each version not present in applied_migrations, run the SQL in a transaction together with INSERT INTO applied_migrations(version, name, ...) VALUES (...). All-or-nothing per migration.
- Run versions in ascending order (a 103 followed by 104 with referential dependency must still apply in order).
- On error, abort the deploy with a clear message (same fail-fast posture as today).
Backfill: a one-shot migration 107_backfill_applied_migrations.up.sql that populates applied_migrations from the existing schema (every embedded mig version + mig 103, which was manually applied). Idempotent. After this lands, the new runner takes over.
Keep paliad_schema_migrations for read-only display (or drop after burn-in if nothing reads it). Don't write to it from the new runner — the source of truth is applied_migrations. Optionally a periodic consistency check that flags divergence.
Drift detection (forward-looking, optional v1): store the SHA-256 of each migration's SQL in checksum on apply. On every deploy, the runner re-hashes the embedded version and compares; if a previously-applied migration's content has changed, that's a programmer error (you mutated a file that was already shipped). Hard-fail with a clear message.

Acceptance

After this issue ships, the scenario above (race between two workers' migration numbers) doesn't lose any migration. Any embedded migration not yet in applied_migrations runs on the next deploy.
Mig 103 is recorded as applied in applied_migrations (via the backfill).
The down-migration path still works (call sites: future rollbacks). Downs are tracked by removing from applied_migrations on successful down-run.
Existing migration files (.up.sql / .down.sql) need no changes — the runner is what's being replaced.
A clear failure when two migrations on the same version number get embedded (collision) — runner refuses to start.

Out of scope

Switching to a different migration library entirely (goose, dbmate). The fix can be a thin wrapper around the existing migrate library or a hand-rolled scanner over embed.FS — both are smaller than a library swap.
A retroactive audit trail of the production drift (the manual apply on 2026-05-20). That's captured in this issue body; no further write-up needed.
Lock contention across concurrent migration runs (rolling Dokploy deploys racing). The current runner takes an advisory lock; the new one should too. Standard pattern.

Role recommendation

inventor — schema + runner contract + backfill plan deserve a design pass before any code. Then coder shift for the build. Branch convention: mai/<inventor>/migration-runner-applied-set. The fix is small but sensitive (touches every future migration), so design-first is worth the round-trip.

## What happened (2026-05-20) `paliad/main` landed three migration-bearing merges in close succession: 1. `6c7e9ef` — fermi's t-paliad-207 follow-ups, added mig **104** + mig **105**. 2. `6a20241` — hertz's t-paliad-216 Slice A, added mig **103**. 3. `e035512` — Madrid, added mig **106**. Each push triggered a Dokploy redeploy. The deploys ran sequentially with the binary embedded at that revision. Result on production paliad.de: - Deploy 1 (after fermi merge): binary has migs 104+105. golang-migrate sees current=102 → applies 104+105 → counter is now 105. - Deploy 2 (after hertz Slice A merge): binary has migs 103, 104, 105. golang-migrate sees current=105, `migrate.Up()` only runs versions **strictly greater than current** → mig 103 **silently skipped**. Counter stays 105. - Deploy 3 (after Madrid merge): binary has migs 103-106. Counter=105 → applies 106 only. Counter is now 106. Production DB state after all three deploys: schema_migrations.version=106, dirty=false, but `paliad.approval_requests` has neither `counter_payload` nor `previous_request_id` columns and the status CHECK constraint never got `'changes_requested'` added. The Slice A backend code was deployed against a schema that didn't support it. Recovered by manually applying mig 103 via Supabase SQL, but the recovery is **invisible** to migrate — schema_migrations still says 106, so a future `migrate down` would skip 103's down too. ## Root cause `golang-migrate/migrate/v4` tracks applied state as a **single integer** in `paliad.paliad_schema_migrations(version int, dirty bool)`. The semantics are "every version ≤ current is considered applied." There is no way to express "version 103 was skipped, please come back to it." In a parallel-merge workflow where two workers' migration numbers race, whichever one lands first claims the counter and the other gets permanently skipped. The rebase-and-renumber guidance in the project CLAUDE.md and the head SKILL.md mitigates this when the head catches it pre-merge — but it relies on every head reading the migrations directory before every merge. That's brittle. The runner itself should be the safety net, not the convention. ## Goal Replace (or wrap) the migration runner with a gap-tolerant version. Specifically: track applied migrations as a **set**, not a counter. Every deploy scans the embedded FS, compares against the applied-set, and runs anything missing — regardless of when it was authored or which version is currently "highest." ## Proposed shape 1. **New table** `paliad.applied_migrations(version int PRIMARY KEY, name text NOT NULL, applied_at timestamptz NOT NULL DEFAULT now(), checksum text NULL)`. One row per applied migration. The `checksum` column is forward-looking — drift detection if we ever care. 2. **New runner** in `internal/db/migrate.go` (or wherever the current one lives): - Open the embedded migrations FS. - List files matching `\d+_*.up.sql`, parse the version prefix. - For each version not present in `applied_migrations`, run the SQL in a transaction together with `INSERT INTO applied_migrations(version, name, ...) VALUES (...)`. All-or-nothing per migration. - Run versions in ascending order (a 103 followed by 104 with referential dependency must still apply in order). - On error, abort the deploy with a clear message (same fail-fast posture as today). 3. **Backfill**: a one-shot migration `107_backfill_applied_migrations.up.sql` that populates `applied_migrations` from the existing schema (every embedded mig version + mig 103, which was manually applied). Idempotent. After this lands, the new runner takes over. 4. **Keep `paliad_schema_migrations` for read-only display** (or drop after burn-in if nothing reads it). Don't write to it from the new runner — the source of truth is `applied_migrations`. Optionally a periodic consistency check that flags divergence. 5. **Drift detection (forward-looking, optional v1)**: store the SHA-256 of each migration's SQL in `checksum` on apply. On every deploy, the runner re-hashes the embedded version and compares; if a previously-applied migration's content has changed, that's a programmer error (you mutated a file that was already shipped). Hard-fail with a clear message. ## Acceptance - After this issue ships, the scenario above (race between two workers' migration numbers) doesn't lose any migration. Any embedded migration not yet in `applied_migrations` runs on the next deploy. - Mig 103 is recorded as applied in `applied_migrations` (via the backfill). - The down-migration path still works (call sites: future rollbacks). Downs are tracked by removing from `applied_migrations` on successful down-run. - Existing migration files (.up.sql / .down.sql) need no changes — the runner is what's being replaced. - A clear failure when two migrations on the same version number get embedded (collision) — runner refuses to start. ## Out of scope - Switching to a different migration library entirely (goose, dbmate). The fix can be a thin wrapper around the existing migrate library or a hand-rolled scanner over `embed.FS` — both are smaller than a library swap. - A retroactive audit trail of the production drift (the manual apply on 2026-05-20). That's captured in this issue body; no further write-up needed. - Lock contention across concurrent migration runs (rolling Dokploy deploys racing). The current runner takes an advisory lock; the new one should too. Standard pattern. ## Role recommendation **inventor** — schema + runner contract + backfill plan deserve a design pass before any code. Then coder shift for the build. Branch convention: `mai/<inventor>/migration-runner-applied-set`. The fix is small but sensitive (touches every future migration), so design-first is worth the round-trip.

mAi self-assigned this 2026-05-20 08:33:19 +00:00

m referenced this issue from a commit

2026-05-20 10:58:27 +00:00

feat(db): t-paliad-218 — gap-tolerant migration runner with applied-set tracker

m referenced this issue from a commit

2026-05-20 10:59:37 +00:00

feat(db): t-paliad-218 — gap-tolerant migration runner with applied-set tracker

m referenced this issue from a commit

2026-07-06 10:33:20 +00:00

docs(claude): correct migration runbook — custom applied-set runner + paliad.applied_migrations tracker