CI/CD: pre-deploy test + migration gate so paliad.de stays online through failed deploys (inventor) #114

Open
opened 2026-05-25 14:56:54 +00:00 by mAi · 0 comments

m's report (2026-05-25 16:55)

can we fix our cicd in a way that paliad does not go 'offline' as often as it happens now? I don't really want the .dev variant for it all - with a branch and a separate dokploy app. but we need test workers included. We have gitea, we have mlake, mriver... all our infra. Where is it smartest to run? We should keep the site online but also keep developing it without ruining everything during a failed deploy

Observed pain (today's session)

Two outages today, both from migration failures at container boot:

  1. ~13:20 — brunel's in-process testing wrote a row to paliad.applied_migrations claiming version=123 with the wrong name; cronus's actual 123_backups merged in parallel; migrator bailed on name mismatch; container crashlooped ~30 min.
  2. ~16:05 — hermes's 125_cross_cutting_filter_legal_source referenced columns (is_mandatory, is_optional, condition_flag) dropped in mig 091; container crashlooped through two hotfix iterations.

In both cases, go build ./... was clean at merge time; the failure was in the migration content. The container then went into a Docker restart loop and Traefik served its default 404. The site was fully offline until head debugged.

Constraint envelope from m

  • No "paliad.dev" duplicate: don't want a separate Dokploy app + branch + DB. One source of truth, one prod surface.
  • Test workers included: build + tests must run somewhere before the deploy.
  • Existing infra only: gitea (mgit.msbls.de) · mlake (Dokploy + Docker Swarm) · mriver (workers, Tailscale-attached).
  • Site stays online through failed deploys: a broken migration must NOT take the running container down.

Phase: inventor design (READ-ONLY)

Inventor → coder gate per project CLAUDE.md.

Open design questions

Q1 — Where do tests run?

A. Gitea Actions (gitea has built-in CI runners). Runner on mriver or mlake. Pre-merge to main: every push fires .gitea/workflows/test.yaml which runs go build, go test ./internal/..., cd frontend && bun run build. Deploy webhook fires ONLY on green.
B. Custom proxy on mriver: receives the gitea push webhook, runs tests in a transient docker container (using the same Dockerfile as prod), forwards to Dokploy only on green. More work to maintain.
C. Dokploy pre-deploy hook: if Dokploy supports a pre-deploy script (per skill docs — verify), run tests inline before container swap. Tightest integration but Dokploy may not expose this.

(R) = A — Gitea Actions. The runner is the cheapest infra; gitea's workflow file is in the repo; pipeline status visible in PR view. mriver hosts the runner since it has the worker pool resources.

Q2 — Where does the migration get tested?

Go unit tests don't catch migration failures because they run against a clean schema. A real test for migration safety has to:

  • Take a snapshot of prod schema (or a representative one)
  • Apply the new migration
  • Run a smoke query that hits the new shape
  • Roll back

Options:
A. Migration smoke test in the CI job: pull the prod schema dump nightly into a runner-local Postgres, then on every push, spin up an ephemeral copy, apply all migrations through the new one, run a smoke query.
B. Dry-run migration mode: Go binary has a flag that runs migrations against a target DB-URL inside a transaction and rolls back. Run as a CI step against a clone DB.
C. Defer to prod with auto-rollback: container's startup migrator rolls back on failure + flags the deploy as broken; old container stays running. Dokploy / Docker Swarm support this via health checks + zero-downtime swap.

(R) = A + C: belt-and-suspenders. A catches column-mismatch + ownership traps before they ever reach prod. C handles the rare case where production data shape differs from the snapshot.

Q3 — Blue/green or canary deploy for the container itself?

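Docker Swarm (Dokploy's runtime) supports rolling updates with health checks. Expected behaviour: the new container starts, fails its health check, and Swarm does not replace the running one. Verify this is actually configured. Note that update_config.failure_action: pause on its own would not have kept the site online: pause is already the default, and Swarm's default update order is stop-first, so the old task is stopped before the new one starts. A health-gated swap needs update_config.order: start-first, a healthcheck on the service, and a failure_action of pause or rollback.

Looking at the symptom (Traefik default 404), it seems Swarm DID replace the old container with the failing one. The restart: always or restart_policy.condition: any setting may also be too aggressive. In Swarm mode only deploy.restart_policy applies; docker stack deploy ignores a service-level restart: key.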

(R) = inventor researches the current compose's deploy: block (or update_config:) and proposes a config change that makes container-swap health-gated.
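
A health-gated swap only helps if the health check reflects migration state. One way this could look, as a minimal sketch: the binary exposes a probe endpoint that answers 503 until the startup migrator has succeeded. Readiness, MarkMigrated, probe and the /healthz path are hypothetical names, not taken from the paliad codebase, and the sketch assumes the process stays up and reports unhealthy after a failed migration instead of exiting.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Readiness records whether the startup migrations finished cleanly.
type Readiness struct {
	migrated atomic.Bool
}

// MarkMigrated is called once the startup migrator returns without error.
func (r *Readiness) MarkMigrated() { r.migrated.Store(true) }

// ServeHTTP answers the health probe: 200 only after migrations succeeded,
// 503 otherwise, so the orchestrator's health check can reject the new task.
func (r *Readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if !r.migrated.Load() {
		http.Error(w, "migrations not applied", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "ok")
}

// probe returns the status code the handler gives for one request.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	ready := &Readiness{}
	fmt.Println("before migrations:", probe(ready)) // 503
	ready.MarkMigrated()
	fmt.Println("after migrations:", probe(ready)) // 200
}
```

In the real service the handler would be mounted on the existing router and the compose healthcheck would call it. Whether the current binary already has such an endpoint is something the inventor has to verify.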

Q4 — How do test workers (existing mai workers) fit in?

m mentioned "we need test workers included." Two readings:

  1. CI runners ARE the test workers (per Q1 A) — gitea runs the tests via its workflow.
  2. mai workers run tests as part of their shift — every gitster/coder runs go test before pushing, AND a dedicated mai-test worker runs a broader smoke suite on every main update.

(R) = both. Per-worker pre-push tests stay (already-required convention — see Hard rules in every issue brief). The CI runner is a SAFETY NET that catches what individual workers might skip + catches integration issues between worker branches.

A dedicated mai-test shift (per the existing mai-test skill) can also kick off post-merge as a checker that runs the broader smoke + integration suite + reports back to gitea.

Q5 — Migration coordination (the ROOT CAUSE of today's first outage)

A process-level fix that's cheaper than infra:

  • Migration slot reservation: when head spawns a worker that will write a migration, head tells them the EXACT slot to use (e.g. "slot 124"). Inventor of this design also recommends checking paliad.applied_migrations slot availability BEFORE writing the file.
  • Pre-flight check in CI: if any migration file's slot already exists in applied_migrations with a different name → FAIL the build.

(R) = yes, head's slot-reservation behavior is already in flight (today's session). Codify as a Go check in CI + a heads-up to head when filing a task that touches deadline_rules / submission_drafts / projects.
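
The pre-flight check could be sketched as a pure function, so it can run in CI against whatever rows were read from paliad.applied_migrations. This is a sketch under assumptions: the <slot>_<name>.sql file layout, the Applied type and the CheckSlots name are hypothetical, and the stray row in main is invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Applied maps a migration slot to the name recorded for it,
// as it would be read from paliad.applied_migrations.
type Applied map[int]string

// parseSlot splits "123_backups.sql" into (123, "backups").
// The <slot>_<name>.sql layout is an assumption.
func parseSlot(file string) (int, string, error) {
	base := strings.TrimSuffix(file, ".sql")
	num, name, ok := strings.Cut(base, "_")
	if !ok || name == "" {
		return 0, "", fmt.Errorf("%s: expected <slot>_<name>.sql", file)
	}
	slot, err := strconv.Atoi(num)
	if err != nil {
		return 0, "", fmt.Errorf("%s: slot %q is not a number", file, num)
	}
	return slot, name, nil
}

// CheckSlots returns one message per problem; an empty result means green.
func CheckSlots(files []string, applied Applied) []string {
	var problems []string
	seen := map[int]string{}
	for _, f := range files {
		slot, name, err := parseSlot(f)
		if err != nil {
			problems = append(problems, err.Error())
			continue
		}
		// Two files in the repo claiming the same slot.
		if prev, ok := seen[slot]; ok && prev != name {
			problems = append(problems,
				fmt.Sprintf("slot %d claimed by both %q and %q in the repo", slot, prev, name))
		}
		seen[slot] = name
		// Slot already applied under a different name.
		if got, ok := applied[slot]; ok && got != name {
			problems = append(problems,
				fmt.Sprintf("slot %d is applied as %q but the repo file is %q", slot, got, name))
		}
	}
	sort.Strings(problems)
	return problems
}

func main() {
	files := []string{"123_backups.sql", "124_example.sql"}
	applied := Applied{123: "scratch_test"} // hypothetical stray row
	for _, p := range CheckSlots(files, applied) {
		fmt.Println("FAIL:", p)
	}
}
```

The same function also flags two branches that picked the same slot before either is applied, which is the integration case the slot reservation is meant to prevent.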

Q6 — Existing prod traffic during deploy

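Dokploy + Swarm should keep the previous container serving traffic during the new container's startup window. If the new container fails health checks, the old one keeps serving. Verify + document. If broken: this is the BIGGEST single win. Set update_config.failure_action: rollback in the compose, together with update_config.order: start-first and a working health check, since rollback alone does not keep the old task running under the default stop-first order.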

Deliverable

docs/design-cicd-pre-deploy-gate-2026-05-25.md on branch mai/<inventor>/cicd-design. Sections:

  • §0 TL;DR
  • §1 Today's outages — root cause analysis (2x migration failures)
  • §2 m's constraints (no .dev clone, existing infra, online during failed deploy)
  • §3 Decision matrix (Q1-Q6 with R + cost + maintenance + time-to-ship)
  • §4 Recommended pipeline (visual: push → gitea actions → tests + migration smoke → on green Dokploy webhook → on red, PR/issue comment)
  • §5 Compose changes (deploy: update_config.* for health-gated swap)
  • §6 Migration smoke harness (how to clone schema, how to detect ownership trap, etc.)
  • §7 Existing infra resource map (mlake / mriver capacity)
  • §8 Slice plan (Slice A: gitea actions running build + test; Slice B: migration smoke test in CI; Slice C: compose update_config rollback)
  • §9 Risk + rollback
  • §10 Out of scope (full .dev environment; multi-region deploys)
  • §11 Open questions for m

Hard rules

  • READ-ONLY design phase. No code, no compose edits, no Dokploy changes.
  • Head answers questions — NO AskUserQuestion. Inventor uses mai instruct head. Defaults to (R) recommendations.
  • Verify the current compose's deploy: block live (ssh mlake docker stack config or read the compose from m/paliad/docker-compose.yml).
  • Check gitea actions support live (the mgit.msbls.de Gitea version + runners installed).

When done

Push design doc + mai report completed with "DESIGN READY FOR REVIEW". Inventor stays parked. Head gates coder shift.

Out of scope

  • Multi-region / DR.
  • Database backup or rollback strategy (#77 Backup Mode covers backups).
  • Migrating off Dokploy.
  • Adding a SECOND Dokploy app or branch (per m's explicit constraint).
mAi self-assigned this 2026-05-25 14:56:54 +00:00
Reference: m/paliad#114