docs/design-cicd-pre-deploy-gate-2026-05-25.md
@@ -0,0 +1,631 @@
|
||||
# Design — CI/CD pre-deploy test + migration gate
|
||||
|
||||
**Author:** cronus (inventor)
|
||||
**Date:** 2026-05-25
|
||||
**Task:** t-paliad-282 — m/paliad#114
|
||||
**Branch:** `mai/cronus/inventor-ci-cd-pre`
|
||||
**Status:** DESIGN READY FOR REVIEW. No code, no compose edits, no Dokploy changes. Awaiting head go/no-go on §3 R-picks + §11 open questions before any coder shift.
|
||||
|
||||
---
|
||||
|
||||
## 0. TL;DR
|
||||
|
||||
Paliad has been offline ≈ 90 min of the last 4 h today (three independent migration crash-loops). The repo has a `TestMigrations_DryRun` (`internal/db/migrate_test.go`) that already would have caught two of the three; it has never been run by anything except local laptops because there is **no CI**. mendel's `docs/design-paliad-test-strategy-2026-05-19.md` Slice 7 deferred CI wiring on the Q2 question to m; today's outages make Slice 7 the highest-leverage paliad change on the board.
|
||||
|
||||
The site goes offline on a failed deploy because the compose file has `restart: unless-stopped`, no healthcheck, and no Swarm `deploy:` block. Dokploy compose mode (verified live on `compose-transmit-multi-byte-driver-v7jth9`, paliad's auto-named project) runs `docker compose up -d --force-recreate`: the old container is killed BEFORE the new one starts. When the new one's migrator panics, the old one is already gone, so Traefik has nothing to route to and paliad.de serves 404. The current container is at 14 restart attempts and counting.
|
||||
|
||||
**Two-pronged fix:**
|
||||
|
||||
1. **Pre-deploy gate (PRIMARY) — Gitea Actions runs `go build` + `bun run build` + `go test ./...` + a beefed-up migration smoke test against an ephemeral Postgres BEFORE Dokploy's webhook fires.** If red, the deploy never happens. If green, Dokploy gets called and replaces the container. Per m's constraint: no `.dev` clone, no second Dokploy app. Single source of truth, single prod surface. The gate is in front of the existing deploy, not parallel to it.
|
||||
2. **Runtime safety net (SECONDARY) — tighten the migrator to fail-fast loudly + cap the restart loop so a bad deploy doesn't keep flailing for hours.** The compose change is small (add `healthcheck:` + `restart: on-failure:3`); the actual win is preventing the deploy with prong 1.
|
||||
|
||||
**Both prongs together close the failure mode. Prong 1 alone is enough for the three outages today.** Prong 2 is defense in depth for the rare case where prod data shape diverges from the CI Postgres in a way the smoke harness can't predict.
|
||||
|
||||
Six open questions for m at §11. The R-picks let m sign off in one chip-round instead of negotiating each one. Coder shift after head's go/no-go.
|
||||
|
||||
---
|
||||
|
||||
## 1. Today's outages — root cause analysis
|
||||
|
||||
Three migration crash-loops, three distinct mechanisms. All three would have been caught by CI:
|
||||
|
||||
### 1.1 ~13:20 — brunel slot collision (mig 123)
|
||||
|
||||
**Mechanism.** brunel's in-process tests wrote a row to `paliad.applied_migrations` claiming `version=123` with the wrong name. cronus's actual `123_backups.up.sql` shipped in parallel. On next deploy, the runner saw version 123 already "applied" with the wrong name and `checkNameAgreement` hard-failed.
|
||||
|
||||
**Why CI catches it.** Two workers writing to slot 123 means at PR-time both branches have a file named `123_*.up.sql`. A pre-merge CI gate that runs `scanEmbeddedMigrations()` (which already hard-fails on duplicate slots — `internal/db/migrate.go:148`) flags the second PR before it's merged. The merger has to coordinate (rename their migration to 124 or wait for the first to land).
|
||||
|
||||
**Currently caught by:** local `go test` of the duplicate-slot path, which always failed; the safeguard is in the runner. CI just enforces the safeguard before merge.
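The duplicate-slot rule can be illustrated without a database. The following is a minimal sketch, not the repo's `scanEmbeddedMigrations`: the helper name `duplicateSlots` and the file names other than `123_backups.up.sql` are hypothetical, and it assumes migration files follow the `NNN_name.up.sql` pattern seen in this document.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// duplicateSlots groups ".up.sql" file names by their numeric prefix and
// returns only the slots claimed by more than one file.
// Hypothetical helper for illustration; not the repo's scanEmbeddedMigrations.
func duplicateSlots(names []string) map[int][]string {
	bySlot := map[int][]string{}
	for _, name := range names {
		if !strings.HasSuffix(name, ".up.sql") {
			continue // ignore .down.sql and unrelated files
		}
		prefix, _, ok := strings.Cut(name, "_")
		if !ok {
			continue
		}
		slot, err := strconv.Atoi(prefix)
		if err != nil {
			continue // prefix is not a slot number
		}
		bySlot[slot] = append(bySlot[slot], name)
	}
	dups := map[int][]string{}
	for slot, files := range bySlot {
		if len(files) > 1 {
			sort.Strings(files)
			dups[slot] = files
		}
	}
	return dups
}

func main() {
	merged := []string{
		"122_example.up.sql",       // hypothetical name
		"123_backups.up.sql",       // from this document
		"123_other_feature.up.sql", // hypothetical second claimant of slot 123
	}
	for slot, files := range duplicateSlots(merged) {
		fmt.Printf("slot %d claimed by %v\n", slot, files)
	}
}
```

A check of this shape only sees a collision once both files are present in the same tree, which is why it has to run on the merge candidate rather than on each branch in isolation.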
|
||||
|
||||
### 1.2 ~16:05 — hermes dropped-column refs (mig 125)
|
||||
|
||||
**Mechanism.** `125_cross_cutting_filter_legal_source.up.sql` referenced columns (`is_mandatory`, `is_optional`, `condition_flag`) that had been dropped in migration 091. mig 125 compiled fine; the failure only surfaced when the runner applied it against a DB that had already run mig 091.
|
||||
|
||||
**Why CI catches it.** `TestMigrations_DryRun` (live in `internal/db/migrate_test.go:47`) applies every pending migration in order against a scratch DB. On a fresh DB walked from 001 → 125, mig 091 drops the columns; by the time mig 125 runs, those columns don't exist, mig 125 errors out, the test fails. Today this test silently skips because no machine in CI sets `TEST_DATABASE_URL`.
|
||||
|
||||
**Currently caught by:** nothing in CI. Manually catchable by running `make verify-migrations` on a developer laptop with `TEST_DATABASE_URL` set — but that's "if the worker remembers."
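The silent skip described above is what lets this class through. One way to close it, sketched here under assumptions, is to treat a missing `TEST_DATABASE_URL` as a failure when running in CI and as a skip only on a laptop. The helper `scratchDSN` is hypothetical, and the sketch assumes the runner exports a `CI` environment variable, which should be verified on the Gitea runner.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errNoDatabase = errors.New("TEST_DATABASE_URL is required in CI")

// scratchDSN decides what a DB-backed test should do when TEST_DATABASE_URL
// is missing: skip on a laptop, fail in CI. getenv is injected so the
// decision can be exercised without touching the real environment.
func scratchDSN(getenv func(string) string) (dsn string, skip bool, err error) {
	dsn = getenv("TEST_DATABASE_URL")
	if dsn != "" {
		return dsn, false, nil
	}
	if getenv("CI") != "" { // assumption: the runner sets CI to a non-empty value
		return "", false, errNoDatabase
	}
	return "", true, nil
}

func main() {
	_, skip, err := scratchDSN(os.Getenv)
	switch {
	case err != nil:
		fmt.Println("fail:", err)
	case skip:
		fmt.Println("skip: no scratch database configured")
	default:
		fmt.Println("run: scratch database configured")
	}
}
```

In a real test the three outcomes would map to `t.Fatal`, `t.Skip`, and running the migration walk.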
|
||||
|
||||
### 1.3 ~14:56 → still failing — mig 129 ownership error (LIVE OFFLINE NOW)
|
||||
|
||||
**Mechanism.** `129_project_event_choices.up.sql` does something on `paliad.project_event_choices` that the DB role lacks the OWNER privilege for. Live container `compose-transmit-multi-byte-driver-v7jth9-web-1` is on RestartCount=14 with:
|
||||
|
||||
```
|
||||
migration failed: apply 129_project_event_choices.up.sql:
|
||||
exec sql: pq: must be owner of table project_event_choices (42501)
|
||||
```
|
||||
|
||||
Paliad.de returns 404 from Traefik (no healthy backend) as of 16:57 UTC.
|
||||
|
||||
**Why CI catches it — if and only if the CI Postgres is set up correctly.** The dry-run test runs as a role that owns the scratch DB it created → it WILL be owner of every table. The current CI proposal must run the migrations AS THE NON-OWNER ROLE that prod uses (or a CI role that mirrors prod's grants). Postgres error code 42501 surfaces only when the apply role isn't the table owner.
|
||||
|
||||
**Concrete CI requirement:** the smoke harness creates two roles in the scratch DB — one as table-owner (matching the original mig 001 schema-creator role), one as the application-deployer role (the one that runs `ApplyMigrations` in prod). Mig 091 → 129 are applied as the deployer role. Any migration that assumes implicit ownership will fail in CI exactly as it fails in prod.
|
||||
|
||||
**Currently caught by:** nothing. The dry-run test, even when run with `TEST_DATABASE_URL`, uses a single role that is implicitly owner of every table it touches; it does not simulate the role split that exists in prod (`youpc-supabase` DB, paliad app connects as a non-owner role).
|
||||
|
||||
### 1.4 Common failure path — why all three knock the site offline
|
||||
|
||||
Independent of WHICH migration fails, the moment the new container panics:
|
||||
|
||||
```
|
||||
docker compose up -d --force-recreate ← Dokploy runs this on webhook
|
||||
↓
|
||||
old container stopped + removed
|
||||
↓
|
||||
new container created, starts /app/paliad
|
||||
↓
|
||||
ApplyMigrations(databaseURL) panics
|
||||
↓
|
||||
container exits 1
|
||||
↓
|
||||
restart: unless-stopped triggers restart
|
||||
↓
|
||||
restart loop forever — Traefik has no healthy backend
|
||||
↓
|
||||
paliad.de returns 404 indefinitely
|
||||
```
|
||||
|
||||
The old container is GONE between `stop` and the new container's first health-check (there is no health-check). There is no rolling deploy, no `--no-recreate`, no swap-on-healthy. The compose `restart: unless-stopped` only ensures the failing container keeps trying — it does not preserve the old one.
|
||||
|
||||
---
|
||||
|
||||
## 2. m's constraints (verbatim from issue body)
|
||||
|
||||
- **No `paliad.dev` duplicate:** don't want a separate Dokploy app + branch + DB. One source of truth, one prod surface.
|
||||
- **Test workers included:** build + tests must run somewhere before the deploy.
|
||||
- **Existing infra only:** gitea (`mgit.msbls.de`) · mlake (Dokploy + Docker Swarm + Compose) · mriver (Tailscale-attached worker fleet).
|
||||
- **Site stays online through failed deploys:** a broken migration must NOT take the running container down.
|
||||
|
||||
Restating the implicit constraint: paliad on Dokploy uses Dokploy's "Compose" deployment type (not "Application"), so we do not get Docker Swarm's `deploy.update_config.failure_action: rollback` for free. Verified live: compose YAML at `/etc/dokploy/compose/compose-transmit-multi-byte-driver-v7jth9/code/docker-compose.yml` has no `deploy:` block; project's docker-compose.yml has none either. Dokploy "Compose" runs `docker compose up -d` which is not Swarm-aware.
|
||||
|
||||
---
|
||||
|
||||
## 3. Decision matrix
|
||||
|
||||
For each Q, this section gives the **R-pick** (the recommended option, marked **R** in the tables), the cost (LoC / new infra), the maintenance footprint, and the time-to-ship estimate (rough complexity bands: small / medium / large; no hours, per project CLAUDE.md).
|
||||
|
||||
### Q1 — Where do tests run?
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| **A. Gitea Actions (R)** | 1 workflow YAML (~80 lines) + Postgres service container | Low — m's stack already runs gitea | mendel's Slice 7 picked this on Q2. Verified live: Gitea 1.24.4, `mgit.msbls.de`, `has_actions: true` on `m/paliad`, ≥2 admin runners registered. |
| B. Custom mriver proxy | New Go service + webhook forwarder + container | Medium — paliad-specific glue | Reinvents Gitea Actions; only justified if Actions can't be made to work, which it can. |
| C. Dokploy pre-deploy hook | Unknown — Dokploy compose mode may not expose this | Unknown | Tighter integration but no documented hook for compose mode. Skip. |
|
||||
|
||||
**R = A.** Gitea Actions runner on the existing infra (mriver workers can host a runner if mlake's load is a concern — see §7).
|
||||
|
||||
### Q2 — Where does the migration get tested?
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| **A. CI smoke against ephemeral scratch Postgres (R)** | Postgres service container in Gitea workflow + extension to `TestMigrations_DryRun` to cover role-ownership | Low — runs on every PR | Catches all three of today's outages if the role-split (§1.3) is wired correctly. |
| B. Dry-run-mode CLI flag | Add `--migrate-dry-run` to the Go binary; CI step `./paliad --migrate-dry-run` against scratch DB | Low | Equivalent to A but with an entry-point that's also useful for manual ops. Nice-to-have, not blocking. |
| **C. Runtime fail-fast + restart cap (R, defense in depth)** | Edit docker-compose.yml `restart: on-failure:3` + add `healthcheck:` | Trivial | Doesn't prevent the outage but caps the crash-loop blast radius and gives Dokploy/Traefik a signal to fall back. |
|
||||
|
||||
**R = A + C.** Belt-and-suspenders. A catches every shape-error before it reaches prod; C ensures the rare unknown-unknown doesn't crash-loop for hours.
|
||||
|
||||
### Q3 — Blue/green or canary for the container itself?
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| A. Switch to Dokploy "Application" type (Swarm-backed) | Re-create the Dokploy deployment as Application; migrate `paliad_exports` volume; reconfigure SSH-multi-line secrets | Medium — new deployment shape | Gives proper `deploy.update_config.failure_action: rollback`. But m has explicitly excluded multi-app/multi-branch setups. |
| **B. Stay on Compose + tighten healthcheck + restart cap (R)** | `healthcheck:` block (~6 lines) + `restart: on-failure:3` (~1 line) | Trivial | Does NOT give us rolling deploy. The "stay online during failed deploy" property is delivered by **Q1 + Q2** (the gate prevents the broken deploy from happening at all). The compose changes are for the residual case. |
| C. Do nothing | 0 | 0 | Today's outages recur. |
|
||||
|
||||
**R = B.** Stay on Compose. Real online-during-failure protection comes from the CI gate (Q1+Q2). The compose changes are damage limitation, not the primary mechanism.
|
||||
|
||||
### Q4 — How do test workers (existing mai workers) fit in?
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| **A. Per-worker pre-push tests stay (already-convention) + CI is the safety net (R)** | 0 (convention exists) + Slice A | Low | Workers run `go build / go vet / go test / bun run build` by convention. CI ensures the convention isn't accidentally skipped. |
| B. Replace per-worker tests with CI-only | 0 in CI; but workers waste cycles pushing red diffs | Higher feedback latency | Workers find out their work is broken from CI instead of locally — slower loop. |
| **C. Add a `mai-test` post-merge shift on main (R, optional polish)** | Existing skill, just wire it to the merge webhook | Low | Per `mai-test` skill — broader smoke + integration suite, reports back to gitea as a check status. Nice-to-have, can be Slice D. |
|
||||
|
||||
**R = A + C.** Per-worker discipline at push, Gitea Actions at PR, `mai-test` polish post-merge.
|
||||
|
||||
### Q5 — Migration coordination (root cause of outage 1)
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| **A. Head reserves migration slots when assigning tasks that need a migration (R, in flight)** | 0 in code — process discipline | Low — head already trending this way | Today's session: head already does this in flight. Codify as a skill rule in `mai-head` SKILL.md. |
| **B. CI pre-flight check: fail build if any migration's slot exists in `applied_migrations` with a different name (R)** | ~20 LoC Go test reading `applied_migrations` from prod-snapshot | Low | Belt and suspenders for A. Catches the brunel case before merge. |
| C. Branch-time check in `mai hire` | ~20 LoC shell in mai CLI | Medium — paliad-specific in the cross-project CLI | Wrong place. The check belongs in CI, not in worker spawning. |
|
||||
|
||||
**R = A + B.** Head coordination as the primary; CI flag as the safety net.
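The name-mismatch check in option B boils down to comparing two version-to-name maps. A minimal sketch follows, assuming the applied rows come from some prod snapshot and the embedded names from the branch. The function `nameConflicts` and all sample data except `123_backups` are hypothetical, and this is not the runner's `checkNameAgreement`.

```go
package main

import (
	"fmt"
	"sort"
)

// nameConflicts compares migration names recorded as applied (version -> name)
// with the names embedded in the branch and lists every version where both
// sides exist but disagree.
// Hypothetical sketch; not the repo's checkNameAgreement.
func nameConflicts(applied, embedded map[int]string) []string {
	var conflicts []string
	for version, appliedName := range applied {
		embeddedName, ok := embedded[version]
		if ok && embeddedName != appliedName {
			conflicts = append(conflicts, fmt.Sprintf(
				"slot %d: applied as %q, branch ships %q",
				version, appliedName, embeddedName))
		}
	}
	sort.Strings(conflicts) // stable output regardless of map order
	return conflicts
}

func main() {
	// Hypothetical snapshot rows from paliad.applied_migrations.
	applied := map[int]string{122: "122_example", 123: "123_test_fixture"}
	// Names embedded in the branch under test.
	embedded := map[int]string{122: "122_example", 123: "123_backups", 124: "124_next"}
	for _, c := range nameConflicts(applied, embedded) {
		fmt.Println(c)
	}
}
```

Versions that exist only in the branch (pending migrations) are deliberately not reported; only a slot that both sides claim under different names is a conflict.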
|
||||
|
||||
### Q6 — Existing prod traffic during deploy
|
||||
|
||||
| Option | Cost | Maintenance | Notes |
|---|---|---|---|
| **A. Verify Dokploy "Compose" deploy behavior live + document in CLAUDE.md (R)** | 1 SSH session + write-up | 0 | Verified above (§1.4): Dokploy compose mode does `--force-recreate`, no rolling deploy. The protection comes from the CI gate, not from this property. |
| B. Investigate Dokploy "Application" migration | Medium — a separate proposal | Medium | Out of scope per m's constraint (#3). |
|
||||
|
||||
**R = A.** Document the limitation; the CI gate (Q1+Q2) is the primary online-during-failure mechanism.
|
||||
|
||||
### Summary of R-picks
|
||||
|
||||
| Q | Pick | Slice |
|---|---|---|
| Q1 | A — Gitea Actions on mriver/mlake runner | A (workflow) |
| Q2 | A + C — CI smoke + runtime fail-cap | A (smoke) + B (compose) |
| Q3 | B — Stay on Compose, add healthcheck + restart-cap | B |
| Q4 | A + C — Per-worker tests + CI + mai-test polish | A + D |
| Q5 | A + B — Head reservation + CI duplicate-slot check | A (CI) + head SKILL.md |
| Q6 | A — Document compose mode behavior | (doc only) |
|
||||
|
||||
---
|
||||
|
||||
## 4. Recommended pipeline
|
||||
|
||||
```
┌──────────────────────────────────────────────────────────────────┐
│  Worker shift                                                    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  mai/<worker>/<task> branch                                │  │
│  │  go build / go vet / go test ./internal/... / bun build    │  │  pre-push (convention, exists)
│  │  push to gitea                                             │  │
│  └────────────────────────────────────────────────────────────┘  │
└───────────────────────────────┬──────────────────────────────────┘
                                │ git push
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  mgit.msbls.de — Gitea                                           │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  .gitea/workflows/test.yml fires on:                       │  │
│  │    push to any branch (gate tier)                          │  │  Slice A
│  │    push to main (gate + full + deploy step)                │  │
│  │                                                            │  │
│  │  Jobs (all on a single runner; parallel where independent):│  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │ job: build                                           │  │  │
│  │  │   bun install + bun run build                        │  │  │
│  │  │   go build ./...                                     │  │  │
│  │  ├──────────────────────────────────────────────────────┤  │  │
│  │  │ job: test-go                                         │  │  │
│  │  │   services: postgres:16 (ephemeral)                  │  │  │
│  │  │   step: psql -c "CREATE ROLE paliad_app …"           │  │  │
│  │  │   step: TEST_DATABASE_URL=…@…/scratch go test ./...  │  │  │
│  │  │   step: TEST_DATABASE_URL=…@…/scratch                │  │  │
│  │  │         (extended) TestMigrations_DryRun             │  │  │
│  │  ├──────────────────────────────────────────────────────┤  │  │
│  │  │ job: test-frontend (optional Slice C)                │  │  │
│  │  │   bun test                                           │  │  │
│  │  ├──────────────────────────────────────────────────────┤  │  │
│  │  │ job: migration-coordination-check                    │  │  │  Slice A.4 — duplicate-slot, name-mismatch
│  │  │ go test -run TestMigrations_NoDuplicate ./internal/db│  │  │
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └─────────────────────┬──────────────────────────────────────┘  │
└────────────────────────┼─────────────────────────────────────────┘
                         │ all green
                         │ AND ref == refs/heads/main
                         ▼
┌──────────────────────────────────────────────────────────────────┐
│  Gitea workflow step: deploy                                     │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  curl POST https://<dokploy>/api/compose/                  │  │
│  │            <Zx147ycurfYagKRl_Zzyo>/deploy                  │  │
│  │  Authorization: Bearer ${{ secrets.DOKPLOY_TOKEN }}        │  │
│  └────────────────────────────────────────────────────────────┘  │
└───────────────────────────────┬──────────────────────────────────┘
                                │ Dokploy webhook (instead of gitea push webhook)
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  mlake — Dokploy                                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  docker compose pull + up -d --force-recreate              │  │
│  │  ┌──────────────────────────────────────────────────────┐  │  │
│  │  │ web container: /app/paliad                           │  │  │
│  │  │   ApplyMigrations(DATABASE_URL)                      │  │  │
│  │  │     ✓ all green from CI ⇒ this will succeed          │  │  │
│  │  │   bind :8080                                         │  │  │
│  │  │   healthcheck GET /health/ready every 10s            │  │  │  Slice B
│  │  │   restart: on-failure:3 (was: unless-stopped)        │  │  │  Slice B
│  │  └──────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
```
|
||||
|
||||
**Critical change vs. today:** the gitea push webhook → Dokploy is REPLACED by gitea workflow → CI → if green, the workflow itself calls Dokploy. The webhook is removed (or pointed at a no-op endpoint) so Dokploy can't be triggered any way except via the workflow's final step. **One way in, gated by CI.**
|
||||
|
||||
If the CI fails:
|
||||
- The deploy step never runs.
|
||||
- Dokploy never fires.
|
||||
- The old container keeps serving paliad.de.
|
||||
- Gitea workflow status goes red; gitea posts a check status on the commit; head sees it on the project page or via `mai status`.
|
||||
|
||||
If a worker pushes red to a feature branch (not main):
|
||||
- The gate-tier subset of jobs runs (build + go test + migration smoke).
|
||||
- Red status on the branch surfaces in PR view.
|
||||
- Deploy never even attempted (only fires on `main`).
|
||||
- Worker fixes locally and pushes again.
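The gate logic above (all jobs green AND ref is main, then one authenticated POST) can be sketched in Go. This is an illustration only: in the actual workflow the gating would be expressed in the workflow YAML, the function names `shouldDeploy` and `newDeployRequest` are hypothetical, and the endpoint shape is copied from the diagram and is not verified against Dokploy's API.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldDeploy mirrors the gate in the diagram: every job must be green
// and the pushed ref must be main.
func shouldDeploy(ref string, jobs map[string]bool) bool {
	if ref != "refs/heads/main" {
		return false
	}
	for _, green := range jobs {
		if !green {
			return false
		}
	}
	return len(jobs) > 0 // an empty result set never opens the gate
}

// newDeployRequest builds the Dokploy call.
// Assumption: the URL shape is taken from the diagram and is unverified.
func newDeployRequest(baseURL, composeID, token string) (*http.Request, error) {
	url := baseURL + "/api/compose/" + composeID + "/deploy"
	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	jobs := map[string]bool{
		"build":                        true,
		"test-go":                      true,
		"migration-coordination-check": true,
	}
	if !shouldDeploy("refs/heads/main", jobs) {
		fmt.Println("gate closed: deploy skipped")
		return
	}
	// Placeholder host, compose ID, and token; nothing is sent here.
	req, err := newDeployRequest("https://dokploy.example", "compose-id", "token")
	if err != nil {
		fmt.Println("bad request:", err)
		return
	}
	fmt.Println("would send", req.Method, req.URL.Path)
}
```

The point of the sketch is the ordering: the request is only constructed after the gate returns true, so a red job or a feature-branch ref never reaches Dokploy.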
|
||||
|
||||
---
|
||||
|
||||
## 5. Compose changes (Slice B)
|
||||
|
||||
Minimal, targeted. The compose stays Docker-Compose-mode (Dokploy "Compose" type). No Swarm migration.
|
||||
|
||||
**Diff (conceptual; coder produces real diff):**
|
||||
|
||||
```yaml
services:
  web:
    build: .
    expose:
      - "8080"
    environment:
      # …(unchanged)…
    volumes:
      - paliad_exports:/var/lib/paliad/exports
    restart: on-failure:3      # was: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/health/ready"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 30s        # boot + migrate window
```
|
||||
|
||||
**What `restart: on-failure:3` buys:**
|
||||
- After 3 failed restart attempts, Docker stops auto-restarting (`on-failure:3` caps the retry count).
|
||||
- The container enters `exited` state.
|
||||
- Dokploy can surface "deploy failed" in its UI.
|
||||
- We don't burn CPU + Postgres connections on an infinite crash-loop.
|
||||
|
||||
**What `healthcheck:` buys:**
|
||||
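- Traefik (Dokploy's reverse proxy) reads the container's Docker health status through its Docker provider and does not route to a container that is still `starting` or is `unhealthy`. (`condition: service_healthy` is a Compose `depends_on` option, not a Traefik setting.)
- If the container is `unhealthy`, Traefik stops sending it traffic, so requests get a Traefik error response instead of reaching a broken backend (one extra retry path). Whether that response is a 404 or a 503 should be confirmed live.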
|
||||
- More importantly: when CI is wired to call Dokploy's API, the API call can poll `/health/ready` after the deploy and report success/failure in the workflow.
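The post-deploy poll mentioned in the last bullet could look like the following minimal sketch. The helper `waitReady` is hypothetical, and the check function is a stand-in for an HTTP GET against `/health/ready`; attempt count and interval are illustrative, not tuned values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotReady = errors.New("not ready")

// waitReady calls check up to attempts times, sleeping interval between
// tries. It returns nil on the first success, otherwise the last error.
func waitReady(check func() error, attempts int, interval time.Duration) error {
	err := errors.New("no readiness attempts made")
	for i := 0; i < attempts; i++ {
		if err = check(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return err
}

func main() {
	calls := 0
	// Stand-in for GET /health/ready: fails twice, then succeeds.
	check := func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("attempt %d: %w", calls, errNotReady)
		}
		return nil
	}
	if err := waitReady(check, 5, 10*time.Millisecond); err != nil {
		fmt.Println("deploy failed:", err)
		return
	}
	fmt.Println("ready after", calls, "checks")
}
```

A workflow step built on this shape would exit non-zero when the poll gives up, which turns a failed prod boot into a red workflow run.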
|
||||
|
||||
**What this does NOT buy (per Q3):**
|
||||
- It does NOT keep the old container alive while the new one starts. Compose mode kills the old container first. **The CI gate (Slice A) is what keeps the old container alive — by preventing the broken deploy from firing at all.**
|
||||
|
||||
**Implementation gotcha — `/health/ready` doesn't exist yet:**
|
||||
- `internal/handlers/` has no `health` handler. The endpoint must be added (cheap, ~20 LoC: open `internal/db` pool ping + return 200 / 503). Slice A adds it (§8); Slice B's healthcheck depends on it.
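A minimal sketch of such a readiness handler, assuming the pool exposes `PingContext` the way `*sql.DB` does. The names `pinger`, `readyHandler`, `fakeDB`, and `statusFor` are hypothetical, and the 2-second ping timeout is an illustrative value chosen to stay under the compose healthcheck's 3-second timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// pinger is the one method the handler needs; *sql.DB satisfies it.
type pinger interface {
	PingContext(ctx context.Context) error
}

// readyHandler answers 200 when the database ping succeeds, 503 otherwise.
func readyHandler(db pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ready")
	}
}

// fakeDB stands in for the real pool in this sketch.
type fakeDB struct{ err error }

func (f fakeDB) PingContext(context.Context) error { return f.err }

var errDown = errors.New("connection refused")

// statusFor runs the handler once in-process and returns the HTTP status.
func statusFor(db pinger) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health/ready", nil)
	readyHandler(db)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("healthy pool:", statusFor(fakeDB{}))
	fmt.Println("broken pool: ", statusFor(fakeDB{err: errDown}))
}
```

Taking an interface rather than the concrete pool keeps the handler testable without a database, which matters because the handler itself has to pass the CI gate it supports.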
|
||||
|
||||
---
|
||||
|
||||
## 6. Migration smoke harness (Slice A.2)
|
||||
|
||||
Extending `internal/db/migrate_test.go` to catch today's three outage classes:
|
||||
|
||||
### 6.1 What exists today (working)
|
||||
|
||||
`TestMigrations_DryRun` (migrate_test.go:47) walks pending migrations and applies each in BEGIN/ROLLBACK against a scratch DB. It catches:
|
||||
- SQL syntax errors (rare; `go build` doesn't see SQL).
|
||||
- Statements that reference columns that genuinely don't exist (mig 125 case — IF the test runs after the prior migrations have applied to the scratch DB).
|
||||
|
||||
It does NOT catch ownership errors (mig 129 case) because the test role implicitly owns every table it creates in BEGIN/ROLLBACK.
|
||||
|
||||
### 6.2 Extensions needed (Slice A.2)
|
||||
|
||||
**(a) End-to-end apply pass with role split.**
|
||||
|
||||
Add `TestMigrations_EndToEndAsAppRole` to `internal/db/migrate_test.go`. Setup:
|
||||
|
||||
```sql
-- run as superuser (postgres) once per CI job
CREATE ROLE paliad_owner LOGIN PASSWORD 'ci';
CREATE ROLE paliad_app LOGIN PASSWORD 'ci';
CREATE DATABASE paliad_scratch OWNER paliad_owner;
\c paliad_scratch paliad_owner
-- assumption: the paliad schema must exist before the GRANT on it can succeed;
-- drop this line and that GRANT if the migrations create the schema themselves
CREATE SCHEMA IF NOT EXISTS paliad;
GRANT USAGE ON SCHEMA public TO paliad_app;
GRANT ALL ON SCHEMA paliad TO paliad_app;
-- mirror prod: paliad_app is the deploy role, paliad_owner created the schema
```
|
||||
|
||||
Test body runs `ApplyMigrations(<paliad_app DSN>)` end-to-end (no rollback between migrations). If migration N assumes ownership it doesn't have, it fails here with the exact same `42501 must be owner of table X` that we see in prod. Catches mig 129.
|
||||
|
||||
**Open Q:** the exact role split is something m + head must look at against the youpc-supabase setup. The CI role names don't have to match prod exactly — they just have to model the same OWNER vs. APP-CONNECT split. Q11.2 below asks m to confirm.
|
||||
|
||||
**(b) Duplicate-slot pre-flight check.**
|
||||
|
||||
Add `TestMigrations_NoDuplicateSlot` to `internal/db/migrate_test.go`. The `scanEmbeddedMigrations` runner-side check is already there but only runs when the runner runs (i.e. at prod boot). Hoisting it into a unit test:
|
||||
|
||||
```go
// Fails the build when two embedded migrations claim the same slot.
func TestMigrations_NoDuplicateSlot(t *testing.T) {
	_, err := scanEmbeddedMigrations()
	if err != nil {
		t.Fatalf("duplicate slot: %v", err)
	}
}
```
|
||||
|
||||
Catches the brunel case at CI time (before merge to main). Cheap, no DB needed, runs in every PR.
|
||||
|
||||
**(c) Down-script smoke (optional, Slice A.5).**
|
||||
|
||||
For every applied `.up.sql` in CI, apply the matching `.down.sql` immediately after and assert it doesn't error. Catches "down script forgot to revert one of the up's actions." Cheap-ish, ~50 LoC of test code, adds ~30s to CI. Not blocking for the outage-prevention goal; nice-to-have.
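One cheap precondition for that smoke is that every `.up.sql` actually has a `.down.sql` partner. A minimal sketch under assumptions follows: the helper `missingDownScripts` and all file names are hypothetical, and it assumes down scripts share the up script's base name. It does not apply any SQL; it only checks the pairing.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// missingDownScripts returns every ".up.sql" file that has no matching
// ".down.sql" file with the same base name.
func missingDownScripts(names []string) []string {
	have := map[string]bool{}
	for _, n := range names {
		have[n] = true
	}
	var missing []string
	for _, n := range names {
		if !strings.HasSuffix(n, ".up.sql") {
			continue
		}
		base := strings.TrimSuffix(n, ".up.sql")
		if !have[base+".down.sql"] {
			missing = append(missing, n)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	files := []string{ // hypothetical file names
		"001_init.up.sql", "001_init.down.sql",
		"002_add_index.up.sql",
	}
	for _, n := range missingDownScripts(files) {
		fmt.Println("no down script for", n)
	}
}
```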
|
||||
|
||||
### 6.3 Scratch DB topology in CI
|
||||
|
||||
Per mendel's design Slice 7 + this design: **Postgres service container** in the Gitea workflow YAML:
|
||||
|
||||
```yaml
services:
  postgres:
    image: postgres:16
    env:
      POSTGRES_PASSWORD: ci
      POSTGRES_DB: paliad_scratch
    options: >-
      --health-cmd "pg_isready"
      --health-interval 5s
```
|
||||
|
||||
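The job reaches Postgres at `postgres:5432` when it runs inside a container, or at `localhost:5432` when it runs directly on the runner host and the service publishes the port (`ports: ["5432:5432"]`, not shown above). Each CI invocation gets a clean DB. No coupling to YouPC (per mendel's Q3 R-pick). No cleanup needed: the container dies with the job.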
|
||||
|
||||
---
|
||||
|
||||
## 7. Existing infra resource map
|
||||
|
||||
Verified live (this session):
|
||||
|
||||
### 7.1 mlake (Dokploy host, Docker Swarm active)
|
||||
|
||||
- Docker version 29.3.0, Swarm `active`.
|
||||
- ~50 running containers (Dokploy services + 40+ compose projects).
|
||||
- Hosts paliad's `compose-transmit-multi-byte-driver-v7jth9` compose project (currently crash-looping).
|
||||
- Could host a Gitea Actions runner, but at risk of contention — paliad's CI Postgres + build would compete with everything else on the box.
|
||||
|
||||
### 7.2 mriver (worker fleet, Tailscale-attached)
|
||||
|
||||
- Runs the mai worker pool (cronus, brunel, hermes, dirac, mendel, …).
|
||||
- Hosts the aichat backend on `:8765`.
|
||||
- Has more idle CPU than mlake (workers spend most of their time waiting on `claude` API).
|
||||
- **R: register the Gitea Actions runner here.** Lower contention, same gitea reachability (Tailscale).
|
||||
|
||||
### 7.3 mgit.msbls.de (Gitea)
|
||||
|
||||
- Version 1.24.4, `has_actions: true` on `m/paliad` (verified).
|
||||
- `/admin/actions/runners` → 2 runners already registered (verified live: `curl /api/v1/admin/actions/runners | jq length` → `2`).
|
||||
- No workflow runs on `m/paliad` yet (verified: `workflow_runs:[], total_count:0`).
|
||||
- The actions infrastructure is fully present; paliad just hasn't authored a workflow YAML.
|
||||
|
||||
### 7.4 youpc-supabase (paliad's prod DB)
|
||||
|
||||
- Postgres on port 11833, paliad uses the `paliad` schema.
|
||||
- Out of scope for CI — CI uses its own ephemeral Postgres in the runner. Prod DB is touched ONLY by Dokploy's deploy step (post-CI-green).
|
||||
|
||||
---
|
||||
|
||||
## 8. Slice plan — tracer-bullet roll-out
|
||||
|
||||
Each slice is independently shippable. Slice A is the load-bearing one.
|
||||
|
||||
### Slice A — Gitea Actions workflow + extended migration smoke (LOAD-BEARING)
|
||||
|
||||
**Branch:** `mai/<coder>/cicd-slice-a-actions`
|
||||
|
||||
**Files added:**
|
||||
- `.gitea/workflows/test.yml` — single workflow, fires on `push` to any branch.
  - Jobs: `build`, `test-go`, `migration-coordination-check`.
  - On `push` to `main`: additional `deploy` job that POSTs to Dokploy compose deploy API.
- `internal/db/migrate_test.go` — extend with:
  - `TestMigrations_EndToEndAsAppRole` (catches mig 129 ownership case).
  - `TestMigrations_NoDuplicateSlot` (catches brunel slot collision).
- `Makefile` — add `test-go`, `test-frontend`, `verify-migrations` targets. (mendel's design Slice 1 punted this; pull it in here so workers can repro the CI gate locally.)
- `internal/handlers/health.go` — `/health/ready` endpoint (pool ping + 200/503). ~25 LoC.
- Migration `132_*.up.sql` — IF the existing test exposes that we need to backfill role grants for some tables. Verify against prod schema before merging.
|
||||
|
||||
**Files modified:**
|
||||
- `cmd/server/main.go` — register `/health/ready` handler.
|
||||
|
||||
**Gitea-side action items (one-time, head or m runs):**
|
||||
1. Set `secrets.DOKPLOY_TOKEN` in the `m/paliad` repo secrets (Dokploy API token with deploy permission on the paliad compose).
|
||||
2. Verify ≥1 Gitea Actions runner is online and tagged appropriately (`ubuntu-latest` or a custom tag).
|
||||
3. Optionally: remove the Dokploy gitea-push webhook (so the only path to deploy is the workflow's deploy step). Discussed in Q11.4 below.
|
||||
|
||||
**Catches:** All three of today's outages, plus future shape-/ownership-/duplicate-slot regressions.
|
||||
|
||||
**Cost:** Small (one workflow YAML, two test functions, one Makefile, one health handler).
|
||||
|
||||
### Slice B — Compose hardening (DEFENSE IN DEPTH)
|
||||
|
||||
**Branch:** `mai/<coder>/cicd-slice-b-compose`
|
||||
|
||||
**Files modified:**
|
||||
- `docker-compose.yml` — change `restart: unless-stopped` → `restart: on-failure:3`; add `healthcheck:` block targeting `/health/ready`.
|
||||
|
||||
**Depends on:** Slice A (health endpoint must exist before the healthcheck can use it).
|
||||
|
||||
**Catches:** Caps the crash-loop blast radius. Does not prevent outages — Slice A does that.
|
||||
|
||||
**Cost:** Trivial (~10 lines).
|
||||
|
||||
### Slice C — Frontend test wiring (OPTIONAL POLISH)
|
||||
|
||||
**Branch:** `mai/<coder>/cicd-slice-c-frontend`
|
||||
|
||||
**Files added/modified:**
|
||||
- `frontend/package.json` — add `"test": "bun test"` script.
|
||||
- `.gitea/workflows/test.yml` — add `test-frontend` job calling `cd frontend && bun test`.
|
||||
|
||||
**Depends on:** Slice A (workflow exists).
|
||||
|
||||
**Catches:** The 4 existing frontend tests run on every PR. Future bun:test additions (per mendel Slice 3) get exercised automatically.
|
||||
|
||||
**Cost:** Trivial (~5 lines).
|
||||
|
||||
### Slice D — mai-test post-merge shift (OPTIONAL POLISH)
|
||||
|
||||
**Branch:** `mai/<coder>/cicd-slice-d-mai-test`
|
||||
|
||||
**Wiring:** Gitea webhook on `m/paliad` "push to main" → notifies a queue → triggers a `mai-test` shift to run the broader smoke suite + post results as a Gitea commit status.
|
||||
|
||||
**Depends on:** Slice A (CI green is a prerequisite for the deploy step which precedes the merge-to-main signal). Could land in parallel with Slice A.
|
||||
|
||||
**Catches:** Integration issues between worker branches that pass CI individually but break on main. The post-merge layer is a follow-up safety net, not a gate.
|
||||
|
||||
**Cost:** Small (config; `mai-test` skill already exists).
|
||||
|
||||
### Slice E — Documentation (REQUIRED, lands with Slice A)
|
||||
|
||||
**Branch:** combined with Slice A's branch.
|
||||
|
||||
**Files modified:**
|
||||
- `docs/project-status.md` — note CI gate is live + how to interpret red CI.
|
||||
- `.claude/CLAUDE.md` — note that pushing to main now requires CI green; workers must verify their branch passes locally before pushing.
|
||||
- `docs/design-paliad-test-strategy-2026-05-19.md` — link to this doc; mark Slice 7 of mendel's design as implemented.
|
||||
|
||||
**Catches:** Workers reading CLAUDE.md learn the new convention without head having to broadcast.
|
||||
|
||||
### Slice ordering rationale

- **Slice A ships first.** Until Slice A is on main, paliad has no CI gate; any merge to main can crash-loop the site. Slice A is single-PR-mergeable, doesn't touch the compose, and exercises only test code plus a tiny handler addition.
- **Slice B ships second** (same day if possible). Health-gated restart is meaningless without `/health/ready` (Slice A provides it). Once both land, the runtime safety net is in place.
- **Slices C and D** are independent; they can land in any order after A. **Slice E** lands together with Slice A.

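
The A-before-B ordering hinges on `/health/ready`. A minimal sketch of such a readiness probe, assuming it should answer 503 until the startup migrator has finished and 200 afterwards. The `Readiness` type, its method names, and the status-code choice are hypothetical illustrations, not the handler Slice A specifies. `main` only exercises the handler in memory and prints 503, then 200.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Readiness tracks whether startup work (e.g. migrations) has completed.
// Hypothetical type, for illustration only.
type Readiness struct {
	ready atomic.Bool
}

// MarkReady flips the probe to "ready"; call it after migrations succeed.
func (r *Readiness) MarkReady() { r.ready.Store(true) }

// StatusCode returns the HTTP status the probe would report right now.
func (r *Readiness) StatusCode() int {
	if r.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP implements http.Handler so the type can be mounted at /health/ready.
func (r *Readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(r.StatusCode())
}

func main() {
	probe := &Readiness{}
	mux := http.NewServeMux()
	mux.Handle("/health/ready", probe)

	// check issues an in-memory GET against the mux and returns the status code.
	check := func() int {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health/ready", nil))
		return rec.Code
	}

	fmt.Println(check()) // before migrations finish: 503
	probe.MarkReady()
	fmt.Println(check()) // after migrations finish: 200
}
```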
---

## 9. Risk + rollback

| Risk | Mitigation | Rollback |
|---|---|---|
| CI workflow blocks legitimate emergency deploys | Slice A's `.gitea/workflows/test.yml` always passes for head's emergency-deploy commits marked `[skip ci]`. As a last resort, trigger Dokploy manually from the UI. | Re-enable the gitea push webhook to Dokploy as a fallback path. |
| Gitea Actions runner is overloaded / offline | mendel's Q2 R-pick prefers Gitea Actions; if the runner dies, deploys are blocked. Mitigation: register a second runner on mlake (passive) so one failure doesn't lock the queue. | Temporarily switch the workflow's `deploy` job to depend on nothing (drop its `needs:`); restore the gate after runner recovery. |
| The end-to-end migration test against an ephemeral DB diverges from prod role grants in ways we don't anticipate | The CI role split is a model of prod, not a copy. Real divergence (e.g. role X granted privilege Y on table Z) will not be caught. | Slice B's runtime fail-cap prevents the crash-loop from running for hours; head triages. Update CI role grants when divergence is discovered. |
| The Dokploy compose deploy API call signature is wrong | Verify against `mai-dokploy` skill docs and try one manual invocation before merging Slice A. | Re-enable the gitea push webhook as the deploy path; CI green is then advisory, not enforcing. |
| Removing the gitea push webhook to Dokploy is a one-way door (re-enabling requires Dokploy UI action) | Don't remove the webhook in Slice A's PR. Keep both paths live during a "soft launch" period. Cut the webhook only after Slice A has gated 5+ green deploys. | Re-enable the webhook in the Dokploy UI (a single toggle). |

### Online-during-failure invariant

The design guarantees the site stays online iff **at least one of** the following holds:
- CI catches the bad migration (smoke test, duplicate-slot, end-to-end role apply) before the deploy step runs. ← Primary, expected to catch all three known classes.
- The healthcheck on the new container fails AND the old container hasn't been removed yet AND Traefik's `condition: service_healthy` is honored.

The second path is fragile (Compose mode kills the old container before the new one is healthy; see §1.4). The design therefore relies on **CI being the gate**, with the compose changes as a residual safety net for unknown failure classes.

---

## 10. Out of scope

- **Multi-region / DR.** Not asked for; not implied.
- **Database backup or rollback strategy.** m/paliad#77 Backup Mode covers backups. This design does not duplicate that work.
- **Migrating off Dokploy.** Not asked for; explicitly excluded by m's constraint.
- **A second Dokploy app or branch.** Explicitly excluded by m's constraint ("no .dev").
- **Full E2E browser smoke (Playwright).** mendel's Slice 4 covers this; out of scope for outage-prevention. May land later as a follow-up slice to this design.
- **Coverage % gating.** Per mendel Q4: coverage as visibility, not as gate.
- **mai-tester full E2E in CI.** Slice D mentions `mai-test` as a post-merge polish; the full browser fleet is its own design.
- **Migrations that drop columns currently used by code.** Compile-time `go build` covers some of this; the broader question of "does the live frontend reference DB columns we just dropped" is mendel Slice 4 territory.

---

## 11. Open questions for m

Six picks. Recommended answers in **bold**. Mostly small, but each one shapes a real load-bearing choice. m can answer in one chip-round.

### Q11.1 — Where does the Gitea Actions runner live?

**A. (R) Register a new runner on mriver.**
B. Use existing mlake runners.
C. Spin up a dedicated mini-VM.

mriver has idle cycles; mlake is contended. (A) is cheapest.

### Q11.2 — How closely should CI's role split mirror prod?

**A. (R) Two-role model (owner + app-connect) generic to Postgres.**
B. Exact mirror: recreate the actual youpc-supabase role names + grants in CI.

(A) catches today's `42501` class without coupling CI to youpc-supabase changes. (B) is brittle but exhaustive. Recommend (A) and tighten if a future outage slips through.

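
A sketch of what option (A) could look like as bootstrap SQL for the ephemeral CI database, emitted from Go so it can sit next to the migration tests. The role names `ci_owner` and `ci_app`, the helper names, and the exact grants are assumptions for illustration; nothing here is taken from the prod role setup. The idea: objects are seeded as the owner role and migrations are applied as the app role, so a migration that needs table ownership should fail in CI with SQLSTATE `42501`, as it did in prod.

```go
package main

import "fmt"

// insufficientPrivilege is the SQLSTATE Postgres reports for
// "must be owner of table ..." and similar permission failures.
const insufficientPrivilege = "42501"

// roleSplitSQL returns bootstrap statements for a two-role ephemeral database.
// Names are interpolated unquoted, so pass plain lowercase identifiers only.
// All names are caller-supplied placeholders (assumption, not prod config).
func roleSplitSQL(owner, app, schema string) []string {
	return []string{
		fmt.Sprintf("CREATE ROLE %s LOGIN", owner),
		fmt.Sprintf("CREATE ROLE %s LOGIN", app),
		// The owner role owns the schema and whatever is seeded into it.
		fmt.Sprintf("CREATE SCHEMA %s AUTHORIZATION %s", schema, owner),
		// The app role may use the schema and create new objects,
		// but it does not own objects the owner role created.
		fmt.Sprintf("GRANT USAGE, CREATE ON SCHEMA %s TO %s", schema, app),
	}
}

// isInsufficientPrivilege reports whether a SQLSTATE is the 42501 class.
func isInsufficientPrivilege(sqlstate string) bool {
	return sqlstate == insufficientPrivilege
}

func main() {
	for _, stmt := range roleSplitSQL("ci_owner", "ci_app", "paliad") {
		fmt.Println(stmt + ";")
	}
}
```

A migration smoke test built on this would connect as the app role and treat any `42501` from the migrator as a red gate.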
### Q11.3 — How does the workflow call Dokploy?

**A. (R) Direct API call via `mai-dokploy` skill conventions, with the token in `secrets.DOKPLOY_TOKEN`.**
B. SSH to mlake and run `docker compose pull && docker compose up -d` directly from the runner.

(A) keeps Dokploy as the single deploy authority. (B) bypasses Dokploy and removes its observability.

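
A sketch of option (A) as a request builder. The endpoint path `/api/compose.deploy`, the `composeId` JSON field, and the `x-api-key` header reflect one reading of Dokploy's HTTP API and are assumptions here; confirm them against the `mai-dokploy` skill docs before wiring anything (§9 lists a wrong call signature as a risk). `newDeployRequest` and the example host are hypothetical. The sketch only builds the request and never sends it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newDeployRequest builds (but does not send) a deploy request for one
// compose project. ASSUMPTION: path, JSON field name, and auth header are
// unverified and must be checked against the mai-dokploy skill docs.
func newDeployRequest(baseURL, token, composeID string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"composeId": composeID})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/compose.deploy", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-api-key", token) // in CI this would come from secrets.DOKPLOY_TOKEN
	return req, nil
}

func main() {
	// Placeholder host, token, and compose ID; nothing is sent over the network.
	req, err := newDeployRequest("https://dokploy.example.invalid", "token-from-secrets", "compose-id")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

In the workflow itself the same call would more likely be a single `curl` step that runs only after the test jobs are green.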
### Q11.4 — Do we remove the existing gitea push → Dokploy webhook?

**A. (R) Keep both paths live for one week of soft-launch; remove webhook once Slice A has gated ≥5 successful green deploys.**
B. Remove immediately when Slice A lands.
C. Keep both forever (CI as advisory, webhook as enforcing).

(A) is the cautious rollout. (C) defeats the purpose of the gate.

### Q11.5 — Backwards-compat for in-flight worker branches that don't yet have `.gitea/workflows/test.yml`?

**A. (R) Slice A's workflow lives on `main`. Worker branches inherit it when they merge from main. No backfill needed: feature branches that haven't merged from main yet just don't get CI until their next sync.**
B. Force every worker to rebase onto Slice A's commit before pushing again.

(A) is zero-coordination. (B) is paranoid.

### Q11.6 — Should CI block on red gate-tier or warn only?

**A. (R) Block.** Red gate-tier → no deploy. This is the entire point.
B. Warn: surface red status, but let head override and deploy anyway.

(A) is the brief. (B) recreates today's outages.

---

## 12. Verification checklist (head to confirm before greenlighting)

- [ ] Q1-Q6 picks above match head's read.
- [ ] Q11.1-Q11.6 answered (chip round).
- [ ] Slice A is sized for one coder shift (not multiple).
- [ ] No Slice creates a second source of truth (single Dokploy compose `Zx147ycurfYagKRl_Zzyo` remains the only paliad deploy).
- [ ] The `mai-dokploy` skill has a documented "deploy compose by ID" API call.
- [ ] paliad.de current outage (mig 129) gets a manual recovery path (see Appendix A). Slice A doesn't fix the live failure on its own; m or head must reset `paliad.applied_migrations` and grant ownership.

---

## Appendix A — Recovering the live outage (mig 129)

Independent of this design, the live paliad.de outage needs operator action:

1. SSH to youpc-supabase Postgres as superuser.
2. `ALTER TABLE paliad.project_event_choices OWNER TO <paliad-app-role>` (or whichever role does the connect).
3. OR: hand-apply mig 129's body as superuser; `INSERT INTO paliad.applied_migrations(version, name, applied_at, checksum) VALUES (129, 'project_event_choices', now(), '<sha256 of file>')`.
4. Restart `compose-transmit-multi-byte-driver-v7jth9`.
5. Verify paliad.de returns 200.

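
Step 3 needs a checksum value for the hand-inserted row. A small sketch for producing one, assuming the migrator stores the lowercase hex SHA-256 of the raw migration file bytes; that encoding is a guess from the `<sha256 of file>` placeholder and should be confirmed against the migrator code before use. `checksumHex` is a hypothetical helper. Run without arguments it hashes a stand-in value; run with a file path it hashes that file.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// checksumHex returns the lowercase hex SHA-256 digest of data.
func checksumHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	// Stand-in input so the sketch runs without arguments.
	data := []byte("abc")

	// If a path is given (e.g. the mig 129 .up.sql file), hash that file instead.
	if len(os.Args) > 1 {
		fileData, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "read migration file:", err)
			os.Exit(1)
		}
		data = fileData
	}

	fmt.Println(checksumHex(data))
}
```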

This recovery is OUT OF SCOPE for the design but is the immediate-action follow-up. mai head or m to handle when this design lands.

---

## Appendix B — Why not Docker Swarm?

m's constraint #3 explicitly excludes a `.dev` clone. Swarm's `deploy.update_config.failure_action: rollback` requires the deployment to be a Docker Swarm service (Dokploy "Application" type), which is a SECOND deployment surface alongside the existing "Compose" project. That's a duplicate Dokploy deployment in everything but name, which is exactly what m rejected.

The Compose-mode workaround (the CI gate) achieves the same online-during-failure invariant with less infrastructure. It's the right trade-off for paliad's scale.

---

## Appendix C — Today's restart count

For posterity (one-shot snapshot from live mlake):

```
compose-transmit-multi-byte-driver-v7jth9-web-1
state: restarting
RestartCount: 14
restart policy: unless-stopped
health: <nil> ← no healthcheck configured
last error: migration failed: apply 129_project_event_choices.up.sql:
exec sql: pq: must be owner of table project_event_choices (42501)
```

Slice A would have caught this at the worker's pre-push step (the same `go test ./...` would have surfaced the 42501 if the CI role split were modeled locally). Slice A's CI run would have caught it at the gitea push. Either gate prevents the deploy. The site stays online.