Adds .gitea/workflows/test.yaml that gates every push on `go build`, `bun run build`, `go vet`, the migration coordination check, and the role-split end-to-end migration smoke. On push to main + green, calls Dokploy's compose.deploy API and polls /health/ready until 200.

t-paliad-282 / m/paliad#114. Design: docs/design-cicd-pre-deploy-gate-2026-05-25.md (inventor shift on mai/cronus/inventor-ci-cd-pre).

Catches all three of today's outage classes:

- brunel (~13:20) slot collision -> TestMigrations_NoDuplicateSlot
- hermes (~16:05) dropped-col refs -> TestBootSmoke
- mig 129 (~14:56) 42501 ownership -> TestMigrations_EndToEndAsAppRole

Snapshot approach. internal/db/testdata/prod-snapshot.sql is a pg_dump of youpc-supabase paliad schema + applied_migrations rows. CI restores this into a fresh `supabase/postgres:15.8.1.060` (same image, same role topology as prod) and runs ApplyMigrations as the `postgres` role (which is NOT a superuser on supabase/postgres, matching prod). Existing migrations are skipped (already in applied_migrations); only NEW migs from the PR run end-to-end. This sidesteps the fresh-DB idempotence debt in some historical migrations (mig 037 missing pg_trgm, mig 051 inner COMMIT); those are tracked separately and don't block the gate.

Sub-changes:

- internal/handlers/handlers.go: new /health/ready endpoint distinct from /healthz. /healthz stays liveness (process alive, no DB); /health/ready is readiness (DB pool pings within 2 s). Returns 503 when svc or pool is nil (DB-less deploys are intentionally not-ready). svc.Pool added to handlers.Services, wired in cmd/server/main.go.
- internal/db/migrate_test.go: TestMigrations_NoDuplicateSlot (pure unit, catches brunel) and TestMigrations_EndToEndAsAppRole (snapshot-gated, catches the 42501 class).
- cmd/server/main_smoke_test.go: TestBootSmoke now also asserts /health/ready returns 503 with a nil svc. New TestHealthReady_Live asserts 200 against a live pool.
- internal/db/migrations/024_rename_department_columns.up.sql and 027_rename_to_partner_units.up.sql: ALTER INDEX / ALTER POLICY exception handlers now catch undefined_object OR undefined_table OR duplicate_object. Old handler only caught undefined_object; Postgres raises undefined_table when source object never existed, and duplicate_object when destination already exists. The expanded handlers make these migrations truly idempotent across all plausible starting states.
- Makefile: verify-mig-app, test-frontend, refresh-snapshot targets. refresh-snapshot pg_dumps youpc-supabase prod (needs PALIAD_PROD_DATABASE_URL), strips pg16 \restrict commands for pg15 restore compat, and filters applied_migrations rows to this branch's max on-disk version.
- internal/db/testdata/README.md: explains the snapshot's purpose, refresh procedure, and how to verify locally.
- docs/cicd-runner-setup-2026-05-25.md: one-time admin steps for registering a Gitea Actions runner on mriver and wiring DOKPLOY_TOKEN as a repo secret. Documents soft-launch plan per m's Q11.4 (keep Dokploy's autoDeploy=true webhook alive for one week, disable after the workflow has gated 5 successful deploys).

Build clean. Full go test ./internal/... ./cmd/... green without TEST_DATABASE_URL. With TEST_DATABASE_URL + TEST_APP_DATABASE_URL set to a supabase/postgres scratch + snapshot restored: TestMigrations_NoDuplicateSlot, TestMigrations_EndToEndAsAppRole, TestBootSmoke, TestHealthReady_Live all pass. Live-DB service tests in internal/services/* fail under supabase/postgres 15.8 with a 42P08 parameter-binding error (unrelated to Slice A; tracked as a follow-up).
CI/CD runner setup — paliad
Companion to: docs/design-cicd-pre-deploy-gate-2026-05-25.md (Slice A, t-paliad-282 / m/paliad#114)
Date: 2026-05-25
Audience: mlake / mriver admin (m or head)
Slice A's .gitea/workflows/test.yaml requires (a) at least one online Gitea Actions runner and (b) a Dokploy API token wired as a repo secret. Both are one-time setup actions that paliad's source tree cannot perform itself; they live on the infra side. This doc lists them so the workflow can go green on its first run.
0. Pre-flight: what already exists
Verified live (2026-05-25 cronus inventor shift):
- Gitea 1.24.4 on mgit.msbls.de, has_actions: true on m/paliad.
- /api/v1/admin/actions/runners reports 2 runners registered. They are likely the shared runners used by m/mGreen and m/mGeo (both have .gitea/workflows/deploy.yml with runs-on: self-hosted).
- m/paliad/actions/tasks reports total_count=0: paliad has never run a workflow yet.
The existing runners may already be capable of running paliad's workflow without further setup. The verification step (§3) below tells you whether they are.
1. Runner placement decision (m's Q11.1)
m's pick: mriver.
Rationale: mriver hosts the mai worker fleet but workers spend most of their time waiting on Anthropic. mlake's Dokploy + Swarm workload is more contended. A new runner on mriver adds the least pressure to either box.
If mriver is offline or saturated when CI first fires, fall back to the existing mlake-side runners (they're already registered; no provisioning needed).
2. One-time setup (admin steps)
2.1 Register a new Gitea Actions runner on mriver
# On mriver, as m:
# 1. Download the act_runner binary (matching Gitea 1.24.x)
curl -L -o /usr/local/bin/act_runner \
https://gitea.com/gitea/act_runner/releases/download/v0.2.13/act_runner-0.2.13-linux-amd64
chmod +x /usr/local/bin/act_runner
# 2. Get a runner registration token. In the Gitea UI:
# /admin → Actions → Runners → "Create new Runner"
# (or org-scope: /m/paliad/settings/actions/runners)
# Copy the token.
# 3. Register
mkdir -p ~/act_runner && cd ~/act_runner
act_runner register --no-interactive \
--instance https://mgit.msbls.de \
--token <REGISTRATION_TOKEN> \
--name mriver-paliad-1 \
--labels ubuntu-latest:docker://node:20-bookworm
# 4. Run as a systemd unit (preferred) or as a session daemon
# Systemd unit example: /etc/systemd/system/act_runner.service
# [Unit]
# Description=Gitea Actions runner
# After=network.target
# [Service]
# User=m
# WorkingDirectory=/home/m/act_runner
# ExecStart=/usr/local/bin/act_runner daemon
# Restart=on-failure
# [Install]
# WantedBy=multi-user.target
sudo systemctl enable --now act_runner
sudo systemctl status act_runner
Why ubuntu-latest:docker://node:20-bookworm for the label? Gitea Actions' runs-on: ubuntu-latest resolves via the runner's label map. Mapping the label to a Docker image makes the runner execute each job in a container on the host's Docker daemon, which the Postgres service container in test.yaml also requires. mriver should already have Docker (for paliadin-shim); if not, install it before registering the runner.
2.2 Register the Dokploy API token as a repo secret
The workflow's deploy job needs secrets.DOKPLOY_TOKEN. Use the existing project-wide Dokploy API key (the one stored in ~/.claude/skills/mai-dokploy/SKILL.md).
In the Gitea UI:
- Navigate to https://mgit.msbls.de/m/paliad/settings/actions/secrets
- Click "Add secret"
- Name: DOKPLOY_TOKEN
- Value:
mai-ottosSyRHMhmLhhhXaCbKzbqKBuSqzqEtmKDOPelPCeimTaYsbmaVslVyEgJZGCIxVdz
Or via API (mAi identity):
curl --netrc-file ~/.netrc-mai -sS -X PUT \
-H "Content-Type: application/json" \
https://mgit.msbls.de/api/v1/repos/m/paliad/actions/secrets/DOKPLOY_TOKEN \
-d '{"data":"mai-ottosSyRHMhmLhhhXaCbKzbqKBuSqzqEtmKDOPelPCeimTaYsbmaVslVyEgJZGCIxVdz"}'
(Requires repo-owner permission. If mAi lacks it, m runs it.)
3. Verify the runner sees the workflow
After (2.1) + (2.2):
# Push the Slice A branch (the one this doc lives on)
git push origin mai/cronus/coder-cicd-slice-a
# Confirm the runner picked up the job
curl --netrc-file ~/.netrc-mai -sS \
"https://mgit.msbls.de/api/v1/repos/m/paliad/actions/tasks?limit=5" | jq '.'
A new task per job should appear (build, test-go). If total_count stays 0, the runner labels don't match the workflow's runs-on. Re-register with --labels ubuntu-latest (no docker:// suffix) and the existing runners on mlake will pick it up via shell mode.
4. Soft-launch (m's Q11.4)
m's pick: keep both Dokploy auto-deploy and the workflow's deploy step alive for ~1 week. After ≥5 successful green deploys via the workflow, disable Dokploy's autoDeploy in the Dokploy UI for the paliad compose.
While both are live, every push to main fires:
- Dokploy webhook (existing path) → deploys immediately, no gate.
- Gitea workflow → on green, ALSO calls compose.deploy.
The second call is idempotent: if Dokploy already deployed the same commit, it is a no-op. During soft-launch the workflow's value is the gate signal. A red workflow on main means a bad migration has already shipped via the unguarded webhook and broken prod, and the workflow is shouting about it.
After confidence builds:
- In the Dokploy UI, navigate to the paliad compose → Settings.
- Toggle "Auto Deploy" off.
- Save.
From this point, the only path to deploy is the workflow's deploy job. Red workflow = no deploy.
5. What Slice A catches today — and what it doesn't
After this branch (mai/cronus/coder-cicd-slice-a) merges to main:
Catches (active in CI)
- Build breakage: go build, go vet, bun run build. Red gate, no deploy.
- Slot collisions: TestMigrations_NoDuplicateSlot runs without a DB. A PR adding migration N when version N already exists fails at gate time. This is the brunel-class catch (m/paliad#114 ~13:20 outage).
- New-migration shape errors (hermes class): TestBootSmoke runs ApplyMigrations against the snapshot-restored DB. New migs from this PR get applied for real; any column/relation/syntax error fails the gate before merge.
- New-migration ownership errors (mig 129 42501 class): TestMigrations_EndToEndAsAppRole runs ApplyMigrations connected as postgres (NON-superuser on supabase/postgres:15.8.1.060, same role topology as youpc-supabase prod). Any migration that assumes supabase_admin privilege fails with the same 42501 must be owner error class that took paliad.de offline on 2026-05-25.
- Readiness probe regressions: TestHealthReady_Live confirms /health/ready returns 200 against a live pool; TestBootSmoke covers the 503 case with a nil svc.
- Pure-Go test regressions: go test ./internal/... ./cmd/... runs without TEST_DATABASE_URL (live-DB service tests skip the same way they do on a developer laptop without a scratch DB).
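The slot-collision catch needs no database because it only has to look at file names. A minimal sketch of that idea follows, assuming the NNN_name.up.sql naming seen in this doc; `duplicateSlots` and the 129_* file names are hypothetical and are not taken from internal/db/migrate_test.go.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// duplicateSlots returns every migration version claimed by more than one
// *.up.sql file, mapped to the offending file names.
// Hypothetical helper: the real TestMigrations_NoDuplicateSlot may differ.
func duplicateSlots(files []string) map[int][]string {
	bySlot := make(map[int][]string)
	for _, f := range files {
		if !strings.HasSuffix(f, ".up.sql") {
			continue // ignore down migrations and unrelated files
		}
		prefix, _, ok := strings.Cut(f, "_")
		if !ok {
			continue // no "NNN_" prefix
		}
		n, err := strconv.Atoi(prefix)
		if err != nil {
			continue // prefix is not a number
		}
		bySlot[n] = append(bySlot[n], f)
	}
	dups := make(map[int][]string)
	for n, names := range bySlot {
		if len(names) > 1 {
			sort.Strings(names)
			dups[n] = names
		}
	}
	return dups
}

func main() {
	files := []string{
		"024_rename_department_columns.up.sql",
		"027_rename_to_partner_units.up.sql",
		"129_first.up.sql",  // hypothetical name
		"129_second.up.sql", // hypothetical name, same slot as the line above
	}
	for slot, names := range duplicateSlots(files) {
		fmt.Printf("slot %d claimed by %v\n", slot, names)
	}
}
```

A test built on this would fail as soon as the returned map is non-empty, which is why a PR that reuses an existing version number can be rejected before any database is started.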
Mechanism — the snapshot approach
CI's scratch DB starts from a pg_dump of youpc-supabase paliad schema + paliad.applied_migrations rows, committed to internal/db/testdata/prod-snapshot.sql. After restore, the scratch DB is at "paliad HEAD of snapshot" and ApplyMigrations sees only this PR's new migrations as pending.
This sidesteps the fresh-DB idempotence problem: several historical migrations (notably mig 037's missing CREATE EXTENSION pg_trgm, mig 051's inner COMMIT;) can't be replayed from scratch against supabase/postgres:15.8.1.060. The snapshot pins everything that's already applied in prod and lets CI focus on what's new — which is what we actually care about for outage prevention.
Snapshot refresh: make refresh-snapshot with PALIAD_PROD_DATABASE_URL set (see internal/db/testdata/README.md).
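Why the snapshot leaves only the PR's migrations pending can be shown as a set difference between on-disk versions and the restored applied_migrations rows. This is a sketch under assumptions: `pendingMigrations` and the version numbers are illustrative, and ApplyMigrations itself may track state differently.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingMigrations returns the on-disk versions that are not recorded as
// applied, in ascending order. Hypothetical helper, not the real ApplyMigrations.
func pendingMigrations(onDisk []int, applied map[int]bool) []int {
	var pending []int
	for _, v := range onDisk {
		if !applied[v] {
			pending = append(pending, v)
		}
	}
	sort.Ints(pending)
	return pending
}

func main() {
	// Rows restored from the snapshot (illustrative values only).
	applied := map[int]bool{128: true, 129: true, 130: true}
	// Versions present in the PR's working tree; 131 is the new one.
	onDisk := []int{128, 129, 130, 131}
	fmt.Println(pendingMigrations(onDisk, applied))
}
```

Historical migrations that cannot be replayed from scratch never enter the pending set, so only the new file is executed against the scratch DB.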
Known gap — live-DB service tests don't run in CI
internal/services/*_test.go tests with TEST_DATABASE_URL set fail against supabase/postgres:15.8.1.060 with 42P08 inconsistent types deduced for parameter errors on some INSERT bind paths. The same tests pass against youpc-supabase prod. Cause is unconfirmed — likely subtle differences in type inference between the dockerized image and the prod cluster's configuration. CI today runs go test ./... without TEST_DATABASE_URL so these tests skip. Not blocking outage prevention; tracked as a follow-up for the post-Slice-A coder.
Migration cleanup also bundled in this PR
Two surgical migration improvements that surfaced during snapshot debugging — kept here because they're small and harmless:
- mig 024 + 027: ALTER INDEX / ALTER POLICY exception handlers now catch undefined_object OR undefined_table OR duplicate_object. Old handler caught only undefined_object; Postgres raises undefined_table when the source object never existed and duplicate_object when the destination already exists. The expanded handler makes the migrations truly idempotent across the three plausible states: source-still-German (rename succeeds), already-renamed (catches duplicate_object), and fresh-DB-never-had-German (catches undefined_table).
Other migration history bugs (mig 037 missing pg_trgm, mig 051 inner COMMIT) are tracked as a separate cleanup task — not blocking, because the snapshot bypasses them.
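The real handlers for mig 024 and 027 live in the migration SQL itself. The decision they encode can still be written down as a small table in Go, using the SQLSTATE values PostgreSQL assigns to the three condition names; `renameAlreadySettled` is a hypothetical name used only for this illustration.

```go
package main

import "fmt"

// SQLSTATE codes for the condition names the expanded handlers catch.
const (
	undefinedObject = "42704" // undefined_object
	undefinedTable  = "42P01" // undefined_table
	duplicateObject = "42710" // duplicate_object
)

// renameAlreadySettled reports whether a failed rename can be treated as
// "nothing left to do". Any other error must still abort the migration.
// Hypothetical helper mirroring the SQL handlers, not code from the repo.
func renameAlreadySettled(sqlstate string) bool {
	switch sqlstate {
	case undefinedObject, undefinedTable, duplicateObject:
		return true
	}
	return false
}

func main() {
	// 42501 (insufficient_privilege) is included to show it is NOT swallowed.
	for _, code := range []string{"42704", "42P01", "42710", "42501"} {
		fmt.Printf("%s swallowed=%v\n", code, renameAlreadySettled(code))
	}
}
```

Keeping the swallowed set this narrow matters: an ownership failure such as 42501 has to surface, since that is the error class the role-split test exists to catch.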
Verification checklist (after Slice A merges)
- Workflow green on its first PR run? Check /m/paliad/actions. If not, fix before merging.
- Dokploy compose.deploy call succeeds? The workflow's deploy job logs the POST response. A successful response is a Dokploy job ID; a 4xx is an auth or compose-id problem.
- /health/ready returns 200 within 5 minutes after a green deploy? The workflow polls this. If it times out, the migration may have failed silently inside the new container; check docker logs --tail 50 compose-transmit-multi-byte-driver-v7jth9-web-1 on mlake.
- Reproduce the slot-collision catch locally: rename 131_…up.sql to 129_… (duplicate slot) → workflow MUST fail at Migration coordination check. Revert before pushing.
- Reproduce the role-split catch locally: add a no-op migration 132_test_supersedes.up.sql containing REINDEX SYSTEM paliad_scratch; (requires superuser). Workflow MUST fail at Migration end-to-end (deploy role). Revert before pushing.
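The readiness contract and the polling the deploy job performs against it can be sketched together. Everything below is an assumption made for illustration: the `pinger` interface, `readyHandler`, `waitReady` and `flakyPool` are hypothetical names, the demo uses a 5 second budget instead of the workflow's 5 minutes, and the real handler in internal/handlers/handlers.go may be structured differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// pinger is the single pool method this sketch relies on (hypothetical).
type pinger interface {
	Ping(ctx context.Context) error
}

// readyHandler answers 200 only if the pool pings within 2 s; a nil pool
// or a failed ping yields 503.
func readyHandler(p pinger) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if p == nil {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := p.Ping(ctx); err != nil {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// waitReady polls url until it returns 200 or ctx expires, and reports how
// many requests it made.
func waitReady(ctx context.Context, url string, every time.Duration) (int, error) {
	attempts := 0
	for {
		attempts++
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return attempts, err
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return attempts, nil
			}
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-time.After(every):
		}
	}
}

// flakyPool fails a fixed number of pings, standing in for a pool that is
// still unreachable while the new container boots.
type flakyPool struct{ failures atomic.Int32 }

func (f *flakyPool) Ping(context.Context) error {
	if f.failures.Add(-1) >= 0 {
		return errors.New("db not reachable yet")
	}
	return nil
}

// runDemo serves readyHandler from a local test server and polls it.
func runDemo() (int, error) {
	pool := &flakyPool{}
	pool.failures.Store(2) // first two pings fail, the third succeeds
	srv := httptest.NewServer(readyHandler(pool))
	defer srv.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return waitReady(ctx, srv.URL, 10*time.Millisecond)
}

func main() {
	attempts, err := runDemo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

The poller treats transport errors and non-200 answers the same way, so a container that is still starting and one that is up but cannot reach its database both keep the deploy job waiting until the deadline.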
6. Future polish (Slice D, m's Q4 R-pick)
mai-test post-merge shift: once Slice A is stable, wire a Gitea webhook on push-to-main that fires /mai-test as a follow-up shift. It runs the broader smoke + integration suite and posts results as a Gitea commit status. Not blocking; the gate doesn't depend on it.
Implementation belongs in m/mAi (the mai webhook handler), not in paliad. Out of scope for Slice A.