CI/CD: pre-deploy test + migration gate so paliad.de stays online through failed deploys (inventor) #114

mAi · 2026-05-25T14:56:54Z

mAi commented

2026-05-25 14:56:54 +00:00

m's report (2026-05-25 16:55)

can we fix our cicd in a way that paliad does not go 'offline' as often as it happens now? I don't really want the .dev variant for it all - with a branch and a separate dokploy app. But we need test workers included. We have gitea, we have mlake, mriver... all our infra. Where is it smartest to run? We should keep the site online but also keep developing it without ruining everything during a failed deploy

Observed pain (today's session)

2x outages today both from migration failures at container boot:

1. ~13:20: brunel's in-process testing wrote a row to `paliad.applied_migrations` claiming version=123 with the wrong name; cronus's actual 123_backups merged in parallel; migrator bailed on name mismatch; container crashlooped ~30 min.
2. ~16:05: hermes's `125_cross_cutting_filter_legal_source` referenced columns (`is_mandatory`, `is_optional`, `condition_flag`) dropped in mig 091; container crashlooped through two hotfix iterations.

In both cases, go build ./... was clean at merge time — the failure was migration-content. Container then went into Docker restart loop, Traefik served default 404. Site fully offline until head debugged.

Constraint envelope from m

- **No "paliad.dev" duplicate**: don't want a separate Dokploy app + branch + DB. One source of truth, one prod surface.
- **Test workers included**: build + tests must run somewhere before the deploy.
- **Existing infra only**: gitea (mgit.msbls.de) · mlake (Dokploy + Docker Swarm) · mriver (workers, Tailscale-attached).
- **Site stays online through failed deploys**: a broken migration must NOT take the running container down.

Phase: inventor design (READ-ONLY)

Inventor → coder gate per project CLAUDE.md.

Open design questions

Q1 — Where do tests run?

A. Gitea Actions (gitea has built-in CI runners). Runner on mriver or mlake. Pre-merge to main: every push fires .gitea/workflows/test.yaml which runs go build, go test ./internal/..., cd frontend && bun run build. Deploy webhook fires ONLY on green.
B. Custom proxy on mriver: receives the gitea push webhook, runs tests in a transient docker container (using the same Dockerfile as prod), forwards to Dokploy only on green. More work to maintain.
C. Dokploy pre-deploy hook: if Dokploy supports a pre-deploy script (per skill docs — verify), run tests inline before container swap. Tightest integration but Dokploy may not expose this.

(R) = A — Gitea Actions. The runner is the cheapest infra; gitea's workflow file is in the repo; pipeline status visible in PR view. mriver hosts the runner since it has the worker pool resources.
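The "deploy webhook fires ONLY on green" rule in options A and B comes down to a small piece of control flow. The sketch below is illustrative only: `Step`, `gate`, `alwaysGreen` and `alwaysRed` are hypothetical names, the checks are stand-ins rather than real command invocations, and the deploy call is a placeholder for whatever triggers Dokploy.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is one named CI check. Run returns nil when the check is green.
type Step struct {
	Name string
	Run  func() error
}

// gate runs the steps in order and calls deploy only if every step is green.
// The first red step stops the pipeline and deploy is never invoked.
func gate(steps []Step, deploy func() error) error {
	for _, s := range steps {
		if err := s.Run(); err != nil {
			return fmt.Errorf("%s failed, deploy skipped: %w", s.Name, err)
		}
	}
	return deploy()
}

// alwaysGreen and alwaysRed are stand-ins for real commands (assumption:
// a real runner would execute the command and return its exit error).
func alwaysGreen() error { return nil }
func alwaysRed() error   { return errors.New("exit status 1") }

func main() {
	deploy := func() error {
		fmt.Println("would trigger the deploy webhook here")
		return nil
	}
	steps := []Step{
		{Name: "go build ./...", Run: alwaysGreen},
		{Name: "go test ./internal/...", Run: alwaysRed},
		{Name: "bun run build", Run: alwaysGreen},
	}
	if err := gate(steps, deploy); err != nil {
		fmt.Println(err)
	}
}
```

The point of the shape is that the deploy trigger lives behind the checks instead of next to them, which is the difference between a pipeline that reports a bad deploy and one that prevents it.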

Q2 — Where does the migration get tested?

Go unit tests don't catch migration failures because they run against a clean schema. A real test for migration safety has to:

- Take a snapshot of prod schema (or a representative one)
- Apply the new migration
- Run a smoke query that hits the new shape
- Roll back

Options:
A. Migration smoke test in the CI job: pull the prod schema dump nightly into a runner-local Postgres, then on every push, spin up an ephemeral copy, apply all migrations through the new one, run a smoke query.
B. Dry-run migration mode: Go binary has a flag that runs migrations against a target DB-URL inside a transaction and rolls back. Run as a CI step against a clone DB.
C. Defer to prod with auto-rollback: container's startup migrator rolls back on failure + flags the deploy as broken; old container stays running. Dokploy / Docker Swarm support this via health checks + zero-downtime swap.

(R) = A + C: belt-and-suspenders. A catches column-mismatch + ownership traps before they ever reach prod. C handles the rare case where production data shape differs from the snapshot.
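The apply, smoke-query, roll-back sequence listed above can be sketched as one transaction that is always rolled back. This is a minimal sketch under stated assumptions: `DB`, `Tx`, `Migration`, `dryRun` and the fake database are hypothetical names, the interfaces are deliberately tiny so the example runs without a Postgres driver, and the SQL strings are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Tx and DB are hypothetical minimal interfaces; a real harness would wrap
// database/sql behind them.
type Tx interface {
	Exec(query string) error
	Rollback() error
}

type DB interface {
	Begin() (Tx, error)
}

type Migration struct {
	Name string
	SQL  string
}

// dryRun applies the pending migrations and a smoke query inside a single
// transaction and always rolls back, so the target database is unchanged.
func dryRun(db DB, pending []Migration, smoke string) (err error) {
	tx, err := db.Begin()
	if err != nil {
		return fmt.Errorf("begin: %w", err)
	}
	defer func() {
		// Roll back on success and on failure alike.
		if rbErr := tx.Rollback(); rbErr != nil && err == nil {
			err = fmt.Errorf("rollback: %w", rbErr)
		}
	}()
	for _, m := range pending {
		if err := tx.Exec(m.SQL); err != nil {
			return fmt.Errorf("migration %s: %w", m.Name, err)
		}
	}
	if err := tx.Exec(smoke); err != nil {
		return fmt.Errorf("smoke query: %w", err)
	}
	return nil
}

// fakeDB stands in for a schema snapshot: any statement that mentions a
// dropped column fails, mimicking the 16:05 outage.
type fakeDB struct {
	dropped    []string
	rolledBack int
}

type fakeTx struct{ db *fakeDB }

func (d *fakeDB) Begin() (Tx, error) { return &fakeTx{db: d}, nil }

func (t *fakeTx) Exec(q string) error {
	for _, col := range t.db.dropped {
		if strings.Contains(q, col) {
			return errors.New("column " + col + " does not exist")
		}
	}
	return nil
}

func (t *fakeTx) Rollback() error { t.db.rolledBack++; return nil }

func main() {
	db := &fakeDB{dropped: []string{"is_mandatory", "is_optional", "condition_flag"}}
	pending := []Migration{{
		Name: "125_cross_cutting_filter_legal_source",
		SQL:  "SELECT is_mandatory FROM some_table", // invented statement
	}}
	fmt.Println("dry-run result:", dryRun(db, pending, "SELECT 1"))
}
```

One caveat for a real harness: some Postgres statements, such as `CREATE INDEX CONCURRENTLY`, cannot run inside a transaction block, so migrations using them would need a different path than this one.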

Q3 — Blue/green or canary deploy for the container itself?

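Docker Swarm (Dokploy's runtime) supports rolling updates with health checks. Expected behavior: container starts → fails health check → Swarm doesn't replace the running one. Verify this is actually configured: if `update_config.failure_action: pause` is set in the compose and the swap is health-gated, today's crashloops would NOT have taken the site offline. Note that `pause` is already Swarm's default `failure_action` and the default `update_config.order` is `stop-first`, so `failure_action` alone is not enough; a single-replica service also needs `order: start-first` and a container health check for the old task to keep serving.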

Looking at the symptom (Traefik default 404), it seems Swarm DID replace the old container with the failing one. The restart: always or restart_policy.condition: any may be too aggressive.

(R) = inventor researches the current compose's deploy: block (or update_config:) and proposes a config change that makes container-swap health-gated.

Q4 — How do test workers (existing mai workers) fit in?

m mentioned "we need test workers included." Two readings:

1. **CI runners ARE the test workers** (per Q1 A): gitea runs the tests via its workflow.
2. **mai workers run tests as part of their shift**: every gitster/coder runs `go test` before pushing, AND a dedicated mai-test worker runs a broader smoke suite on every main update.

(R) = both. Per-worker pre-push tests stay (already-required convention — see Hard rules in every issue brief). The CI runner is a SAFETY NET that catches what individual workers might skip + catches integration issues between worker branches.

A dedicated mai-test shift (per the existing mai-test skill) can also kick off post-merge as a checker that runs the broader smoke + integration suite + reports back to gitea.

Q5 — Migration coordination (the ROOT CAUSE of today's first outage)

A process-level fix that's cheaper than infra:

- Migration slot reservation: when head spawns a worker that will write a migration, head tells them the EXACT slot to use (e.g. "slot 124"). Inventor of this design also recommends checking `paliad.applied_migrations` slot availability BEFORE writing the file.
- Pre-flight check in CI: if any migration file's slot already exists in applied_migrations with a different name → FAIL the build.

(R) = yes, head's slot-reservation behavior is already in flight (today's session). Codify as a Go check in CI + a heads-up to head when filing a task that touches deadline_rules / submission_drafts / projects.
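The pre-flight check described above could look roughly like the following Go sketch. Everything here is an assumption made for illustration: the `NNN_name.sql` file naming, the shape of the applied set (slot number mapped to the recorded name without its numeric prefix), and the names `slotCollisions` and `migrationFile`. A real check would read the file list from the repo and the applied set from `paliad.applied_migrations`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// migrationFile matches names like "123_backups.sql": a numeric slot, an
// underscore, then the migration name. The ".sql" suffix is an assumption.
var migrationFile = regexp.MustCompile(`^(\d+)_(.+)\.sql$`)

// slotCollisions returns one message per migration file whose slot is
// already recorded in applied (slot -> name) under a different name.
func slotCollisions(files []string, applied map[int]string) []string {
	var out []string
	for _, f := range files {
		m := migrationFile.FindStringSubmatch(f)
		if m == nil {
			continue // not a migration file
		}
		slot, err := strconv.Atoi(m[1])
		if err != nil {
			continue // slot number too large to be a real migration
		}
		if name, ok := applied[slot]; ok && name != m[2] {
			out = append(out, fmt.Sprintf("slot %d: file %q vs applied %q", slot, m[2], name))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Invented data shaped like the ~13:20 outage: slot 123 was recorded
	// under one name while a different 123_* file merged in parallel.
	applied := map[int]string{122: "earlier", 123: "wrong_name"}
	files := []string{"122_earlier.sql", "123_backups.sql", "124_next.sql"}

	problems := slotCollisions(files, applied)
	for _, p := range problems {
		fmt.Println("FAIL:", p)
	}
	if len(problems) == 0 {
		fmt.Println("ok: no slot collisions")
	}
	// A CI step would exit non-zero here when problems is non-empty.
}
```

This only catches a collision against what is already recorded as applied. Two unmerged branches that both claim the same free slot need the slot-reservation step or a branch-vs-branch comparison.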

Q6 — Existing prod traffic during deploy

Dokploy + Swarm should keep the previous container serving traffic during the new container's startup window. If the new container fails health checks, the old one keeps serving. Verify + document. If broken: this is the BIGGEST single win — fix update_config.failure_action: rollback in the compose.

Deliverable

docs/design-cicd-pre-deploy-gate-2026-05-25.md on branch mai/<inventor>/cicd-design. Sections:

- §0 TL;DR
- §1 Today's outages: root cause analysis (2x migration failures)
- §2 m's constraints (no .dev clone, existing infra, online during failed deploy)
- §3 Decision matrix (Q1-Q6 with R + cost + maintenance + time-to-ship)
- §4 Recommended pipeline (visual: push → gitea actions → tests + migration smoke → on green Dokploy webhook → on red, PR/issue comment)
- §5 Compose changes (deploy: update_config.* for health-gated swap)
- §6 Migration smoke harness (how to clone schema, how to detect ownership trap, etc.)
- §7 Existing infra resource map (mlake / mriver capacity)
- §8 Slice plan (Slice A: gitea actions running build + test; Slice B: migration smoke test in CI; Slice C: compose update_config rollback)
- §9 Risk + rollback
- §10 Out of scope (full .dev environment; multi-region deploys)
- §11 Open questions for m

Hard rules

- READ-ONLY design phase. No code, no compose edits, no Dokploy changes.
- **Head answers questions**: NO AskUserQuestion. Inventor uses `mai instruct head`. Defaults to (R) recommendations.
- Verify the current compose's `deploy:` block live (`ssh mlake docker stack config` or read the compose from `m/paliad/docker-compose.yml`).
- Check gitea actions support live (the `mgit.msbls.de` Gitea version + runners installed).

When done

Push design doc + mai report completed with "DESIGN READY FOR REVIEW". Inventor stays parked. Head gates coder shift.

Out of scope

- Multi-region / DR.
- Database backup or rollback strategy (#77 Backup Mode covers backups).
- Migrating off Dokploy.
- Adding a SECOND Dokploy app or branch (per m's explicit constraint).

mAi self-assigned this 2026-05-25 14:56:54 +00:00

m referenced this issue from a commit

2026-05-25 15:07:47 +00:00

docs(cicd): inventor design — pre-deploy gate + migration smoke (t-paliad-282)

mAi referenced this issue

2026-05-25 15:15:48 +00:00

UPC Damages proceeding: missing post-submission court followup (oral hearing, interim conference, appeal route) #117

m referenced this issue from a commit

2026-05-25 15:42:12 +00:00

feat(cicd): Slice A — pre-deploy gate + role-split migration smoke

m referenced this issue from a commit

2026-05-25 15:42:53 +00:00

Merge: t-paliad-282 Slice A — CI/CD pre-deploy gate + snapshot-based migration smoke (m/paliad#114)

mAi referenced this issue from a commit

2026-07-24 08:58:32 +00:00

docs(test): test-gap audit — coverage map, dormant-test problem, dead tests, ordered backlog

mAi referenced this issue from a commit

2026-07-24 09:00:50 +00:00

Merge branch 'mai/athena/test-gap-audit-map' (test-gap audit)

mAi referenced this issue from a commit

2026-07-24 09:37:37 +00:00

build(test): one-command test Postgres + PII-free reference seed (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:37:37 +00:00

test(services): fix 42P08 latent seed bug in dormant live tests (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:37:37 +00:00

test(services): fix dropped-column + NOT-NULL seed bugs in dormant tests (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:37:37 +00:00

test(services): fix reminder_log $3 timestamptz/date 42P08 reuse (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:52:59 +00:00

fix(services): LookupEvents ambiguous-column bug caught by dormant test (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:52:59 +00:00

test(services): fixture fixes — matview populate, auth.users FK, reopen admin team (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:52:59 +00:00

test(services): reopen fixture — seed creator as responsibility='lead' (B1, #114)

mAi referenced this issue from a commit

2026-07-24 09:52:59 +00:00

ci(gate): Gitea Actions Stage-1 workflow + B1 design doc (#114)

mAi commented

2026-07-24 09:55:07 +00:00

B1 — dormant test gate made runnable (shift-1, patton/lead)

Branch mai/patton/b1-make-the-dormant-test (6 commits, pushed). This is the test-side slice of #114.

Delivered (all locally green):

One-command test Postgres: make db-test-up → supabase/postgres:15.8 + schema snapshot + PII-free reference seed + matview populate, prints TEST_DATABASE_URL. docker-compose.test.yml + scripts/db-test-setup.sh (shared by dev + CI). make db-test-down, make refresh-reference-seed.
Reference-data decision — (a)+(b): committed internal/db/testdata/reference-seed.sql = 19 PII-free catalogs (courts, offices, proceeding_types, procedural_events, scenario_flag_catalog, submission_bases/blocks, …), sourced from prod. Fixes TestSearchCourtsDB (41 courts), the condition-expr validators, all catalog readers, and the mig-177-class FK-insert gap in verify-migrations. Per-test hermetic seeding for PII-linked rows.
Gitea Actions Stage-1 workflow (.gitea/workflows/ci.yml): build + verify-migrations + verify-mig-app + make test + bun test. Does NOT block the deploy webhook. Fatal core = migration gate (green). Full live suite runs visible but non-fatal (continue-on-error) — see residual below.
Migration gate is GREEN (verify-migrations + verify-mig-app) — the outage-prevention core of this issue.

Load-bearing finding: the audit's "~55 dormant tests pass instantly" was empirically false — they never ran, so they carried latent bugs. Fixed the big shared classes (42P08 reused-$1-as-text ×17 files, dropped users.role column, pin project_teams NOT-NULL role, approval users.profession fixture ×16 tests, matview/FK/reopen fixtures). ~34 of ~55 now pass for real; 550 service asserts green. Also caught + fixed a genuine latent service bug: LookupEvents selected ambiguous bare id in a JOIN (42702) — exactly what the gate is for.

Residual (21 red, made VISIBLE not skipped → B1b): ~13 dead tests vs dropped paliad.deadline_rules (mig 140) → t-paliad-dead-migration-tests (audit undercounted "2"); 3 data-expectation drift → B4; ~7 genuine bugs/isolation → B1b. Full triage in docs/design-cicd-b1-migration-gate-2026-07-24.md §6.

Runner status: workflow fires on push (run #220 — CI wiring confirmed) but concluded failure: there is no act_runner on mriver (the cicd-runner-setup-2026-05-25.md §2 registration was never actually done); the job landed on the shared mlake shell-executor which lacks go/bun/docker/checkout. Going green on-runner needs the one-time docker-capable runner registration (head/m infra step, documented). Local proof: eval $(make -s db-test-up | tail -2); make verify-migrations verify-mig-app → green; bun test → 256 green.

Stage 2 (blocking the deploy) is a separate commit — head + m decide, only once reliably green.

mAi referenced this issue from a commit

2026-07-24 10:05:21 +00:00

ci(gate): listed known-failing gate instead of silent skips (B1, #114)

mAi referenced this issue from a commit

2026-07-24 10:08:24 +00:00

Merge branch 'mai/patton/b1-make-the-dormant-test' (B1: the dormant test gate runs)

mAi commented

2026-07-28 19:20:52 +00:00

paliad's CI executed for the first time — and it is green

Branch mai/knuth/give-paliad-its-own (t-paliad-ci-runner-own-stack). Not merged; head's call.

Runs 640 and 642 both success. Before today: 274 runs, 274 failures, zero successes.

Root cause

Gitea Actions registration tokens are repo-scoped, and a runner only claims jobs for the repo it registered against. m/paliad had none. Every run was created, claimed by nobody, marked failed. The workflow YAML was never at fault — an earlier theory blamed actions/checkout@v4 and was wrong.

The diagnostic is started_at, not the status badge: a build that fails has a start time; a run no runner claimed does not, and its job-log endpoint answers {"errors":["job not started"]}.

Measured, so nobody re-derives it: m/mGreen has a repo-scoped runner and goes green; m/mGeo has none and never starts. The user-level runner mdock-native serves neither.
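The `started_at` diagnostic can be written down as a small classifier. This is a sketch under assumptions: the JSON shape (a `status` string and a `started_at` timestamp that is null or zero when nothing ran) is hypothetical and not taken from the Gitea API reference, and `classify` is an illustrative name. Only the `started_at` field and the `success` / `failure` values come from the thread.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// run is a hypothetical, trimmed view of one workflow run.
type run struct {
	Status    string     `json:"status"`
	StartedAt *time.Time `json:"started_at"`
}

// classify separates a build that really failed from a run that no runner
// ever claimed. Any status other than "failure" is returned unchanged.
func classify(raw []byte) (string, error) {
	var r run
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", err
	}
	// A missing, null or zero timestamp all mean "never started".
	started := r.StartedAt != nil && !r.StartedAt.IsZero()
	switch {
	case r.Status != "failure":
		return r.Status, nil
	case !started:
		return "unclaimed", nil
	default:
		return "failed", nil
	}
}

func main() {
	samples := []string{
		`{"status":"failure","started_at":null}`,
		`{"status":"failure","started_at":"2026-07-28T19:00:00Z"}`,
		`{"status":"success","started_at":"2026-07-28T19:05:00Z"}`,
	}
	for _, s := range samples {
		verdict, err := classify([]byte(s))
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Println(verdict)
	}
}
```

Run against a repo's history, a long streak of "unclaimed" verdicts points at runner registration rather than at the workflow file.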

What shipped

infra/gitea-act-runner/ — act_runner 0.2.13 + docker:29-dind, paliad's own Dokploy stack (paliad-act-runner, project patholo), deliberately not a third service on mgit-actrunner-lthrhz, which serves yoUPC/youpc.org's working CI. Runner mlake-paliad, Gitea id 13. The file is a reviewable mirror; the authoritative copy is the Dokploy raw compose, and the README carries the deploy, edit and rotation recipes.
runs-on: paliad-ci, not self-hosted. Run 637 was claimed by mdock-native, which also advertises self-hosted, and died on Cannot find: node in PATH — it is a host executor with no container. Two runners racing means half of paliad's runs fail for reasons unrelated to paliad. A label only paliad's runner carries settles it.
Toolchain by install step, not a custom image. The job image was measured, not assumed: catthehacker/ubuntu:act-22.04 carries node 24.18, docker 29.6.1, compose 5.3.1, git 2.54, make, jq, gcc — but not go, bun or psql. A custom image would need a registry mlake does not have, a rebuild owner nobody would be, and a second place pinning the Go version. actions/setup-go@v5 with go-version-file: go.mod keeps go.mod the only Go pin, so the runner cannot drift from a developer's toolchain. Verified: the job resolves go1.24.0, exactly go.mod. Same pattern is already green on this Gitea in yoUPC/youpc.org.
Two divergences from the youpc stack, both forced by make db-test-up, both commented where they occur: dind speaks plaintext on the stack-private network (the job needs a docker client and the TLS certs cannot reach it), and job containers run in dind's network namespace — which is what makes this workflow's hardcoded localhost:15455 DSNs correct. The latter pins runner capacity to 1; two concurrent jobs would collide on that port.

The gate ran for real

Not a green-by-skip. From run 640's log:

==> done: applied_migrations HEAD=197, courts seeded=41, non-postgres-owned tables=5
TEST_DATABASE_URL is set — live-DB gate will run for real.
origin/main resolves (d3657a7) and 89 sibling branches are present — both collision checks will run for real.
==> migration dry-run (per-mig BEGIN..ROLLBACK)                    ok
==> migration number collisions (branch vs origin/main vs applied set)  ok
==> boot smoke (apply + tracker + /healthz)                        ok
==> known-failing tolerated: 0 (see scripts/ci-known-failing.txt)
==> OK: no new failures. Only tracked known-failing tests are red.
Ran 390 tests across 26 files.

ci-known-failing.txt tolerated 0 entries — the ~22 reds that list was built for are gone, so §5's "known gap: live-DB service tests don't run in CI" in docs/cicd-runner-setup-2026-05-25.md is closed. Whole job: ~2.5 min.
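The known-failing mechanism behind those last log lines reduces to a set difference: red tests that are on the tracked list are tolerated, anything else fails the gate. The actual gate is `ci-test-gate.sh`; the Go below is only a sketch of the same idea, with `gateFailures` and the test name `TestLegacyDeadlineRules` invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// gateFailures splits the failed tests into new failures (not on the
// known-failing list) and a count of tolerated, already-tracked ones.
func gateFailures(failed []string, known map[string]bool) (fresh []string, tolerated int) {
	for _, name := range failed {
		if known[name] {
			tolerated++
			continue
		}
		fresh = append(fresh, name)
	}
	sort.Strings(fresh)
	return fresh, tolerated
}

func main() {
	// Invented sample: one tracked red test and one new red test.
	known := map[string]bool{"TestLegacyDeadlineRules": true}
	failed := []string{"TestLegacyDeadlineRules", "TestSearchCourtsDB"}

	fresh, tolerated := gateFailures(failed, known)
	fmt.Printf("known-failing tolerated: %d\n", tolerated)
	if len(fresh) > 0 {
		fmt.Println("new failures:", fresh)
		return // a CI step would exit non-zero here
	}
	fmt.Println("OK: no new failures")
}
```

A tolerated count of 0, as in run 640, means the list is no longer hiding anything. A stricter variant could also flag entries on the list that now pass, so stale entries get removed.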

Still open, deliberately not done here

1. **Stage 2 (making the gate block the Dokploy deploy) is untouched.** The workflow is still non-blocking by design; that flip is head's and m's call.
2. **A green CI checks nothing retroactively.** Every branch merged between 2026-07-24 and 2026-07-28 was merged with `verify-migrations`, `verify-mig-app`, both collision checks and `ci-test-gate.sh` never having fired. That window stays unverified, and it is why all six migration-number collisions were caught by humans reading branches by hand.
3. **`m/mAi` still has zero runners**, and only escapes visibly failing because it has no workflow files. Any repo under `m/` that adds one hits this identically.
4. Now that a green run exists, the gofmt / `go vet` gates that were correctly deferred (a gate on a CI that cannot run looks like coverage and is worse than none) become addable.

Commits

- `0434027`: own Dokploy act_runner stack + workflow toolchain
- `6cbdbea`: target `paliad-ci`, not `self-hosted`
- `ef3bda3`: correct CLAUDE.md and the 2026-05-25 runner doc, which said this was impossible

mAi commented

2026-07-29 16:26:26 +00:00

Sweep verdict (2026-07-29): PARTLY.

Stage 1 exists and — since 2026-07-28 — genuinely runs: .gitea/workflows/ci.yml on runner mlake-paliad, carrying the migration dry-run, both collision checks, boot smoke, the full Go suite and the frontend build. Before that day every run was created and claimed by nobody (274 of 274 failed unstarted), so treat the 2026-07-24 → 2026-07-28 window as unverified.

Stage 2 — the part this title asks for, gating the deploy — is not wired, deliberately. ci.yml's header: "It does NOT block the Dokploy deploy webhook — there is deliberately no deploy job here." So paliad.de is told about a bad deploy, not protected from one.

Full sweep: docs/findings-issue-sweep-2026-07-29.md (commit 4c39886).


Reference: m/paliad#114