Paliadin: route prod via Tailscale SSH to mRiver (preserve Claude Code subscription) #12

Open
opened 2026-05-07 20:34:30 +00:00 by mAi · 6 comments
Collaborator

Goal

Route Paliadin from paliad.de's Dokploy container (mLake, 100.99.98.201) to mRiver (100.99.98.203) via Tailscale + SSH, so m can use Paliadin from prod without losing the Claude-Code-subscription benefit (vs paying Anthropic API tokens).

Locked direction (m, 2026-05-07 22:33)

SSH-tunnel via Tailscale chosen over Anthropic API direct (preserves subscription) and over standalone HTTP daemon on mRiver (cleaner protocol but extra moving piece).

Concrete shape:

  • paliad container has Tailscale connectivity (sidecar or network=host or userspace tailscaled — inventor decides)
  • Container has SSH client + an identity key authorised on mRiver
  • PaliadinService swaps the local tmux new-session ... invocation for ssh m@100.99.98.203 tmux ... when running on a host without local tmux
  • mRiver runs the long-lived tmux session + claude CLI; paliad container just proxies

Out of scope (v1)

  • Multi-host failover (only mRiver targeted)
  • Encryption beyond what SSH provides
  • API-key fallback when mRiver is offline (just emit a friendly "Paliadin offline" message)
  • Cross-firm or production multi-tenant variants

Open design questions (for inventor — m will engage)

Container Tailscale shape

  1. Tailscale provisioning: Sidecar container with Tailscale daemon? network_mode: host so paliad inherits host's Tailscale? Userspace tailscaled inside paliad's container with auth-key from secrets? Inventor recommends, m signs off.
  2. Auth-key rotation: where does the Tailscale auth-key live? Dokploy secret? .env.age? How is it rotated?
  3. Container image change: paliad's Dockerfile currently lacks ssh client + Tailscale binary. Inventor proposes the Dockerfile diff. Image-size impact?

SSH identity + auth

  4. Key generation: One SSH keypair for the paliad container, or one per instance? Where is the private key stored (Dokploy secret? .env.age?)? The public key is authorised on mRiver under ~m/.ssh/authorized_keys.
  5. Restricted command: The authorized_keys entry restricts the key to command="<paliadin-shim>" so it can ONLY run the tmux invocation, not arbitrary shells. Define the shim shape.
  6. Host key pinning: Container's ~/.ssh/known_hosts pre-populated with mRiver's host key? Or StrictHostKeyChecking=accept-new the first time? Inventor recommends.

Service-layer integration

  7. Code changes in internal/services/paliadin.go: Where exactly does the tmux invocation happen? Does the SSH version share the same code path with a flag, or is there a separate RemotePaliadinService implementation?
  8. Routing decision: How does paliad know to use SSH vs local tmux? Env var PALIADIN_REMOTE_HOST=100.99.98.203? Auto-detect (try local tmux first, fall back to SSH)? Inventor recommends.
  9. Reverse port: Is the SSH connection short-lived (per turn) or long-lived (one persistent SSH master)? ControlMaster auto?

Reliability + monitoring

  10. mRiver offline detection: How does paliad probe mRiver before sending a query? ssh -o ConnectTimeout=2? Cached health check?
  11. Friendly error when mRiver offline: Reuse the existing friendlyErrorMessage shape from t-150. Add error code mriver_unreachable with localised message ("mRiver ist offline — Paliadin nicht erreichbar. Lokal mit ./paliad starten oder mRiver wecken."; in English: "mRiver is offline, Paliadin unreachable. Start locally with ./paliad or wake mRiver.").
  12. Wake-on-LAN: Out of scope for v1 (m's laptop wake semantics are device-specific). Document as future work.
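
One possible shape for the "cached health check" option in Q10, as a minimal Go sketch. The names healthCache and newHealthCache are hypothetical and do not come from the paliad codebase; the probe is injected, so in practice it would wrap something like the ssh health call, while here a stub stands in for it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// healthCache remembers the last probe result for ttl, so that a burst of
// chat turns does not open one SSH connection per turn just to check liveness.
// Hypothetical helper for illustration only.
type healthCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	probe     func() error // e.g. a wrapper around "ssh ... health"
	checkedAt time.Time
	lastErr   error
	valid     bool
}

func newHealthCache(ttlSeconds int, probe func() error) *healthCache {
	return &healthCache{ttl: time.Duration(ttlSeconds) * time.Second, probe: probe}
}

// Check returns the cached result while it is fresh and re-probes otherwise.
// The mutex is held during the probe, so concurrent callers are serialised.
func (h *healthCache) Check() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.valid && time.Since(h.checkedAt) < h.ttl {
		return h.lastErr
	}
	h.lastErr = h.probe()
	h.checkedAt = time.Now()
	h.valid = true
	return h.lastErr
}

func main() {
	probes := 0
	// Stub probe that always fails, standing in for an unreachable mRiver.
	h := newHealthCache(30, func() error {
		probes++
		return errors.New("mriver_unreachable")
	})
	for i := 0; i < 3; i++ {
		fmt.Println("check:", h.Check())
	}
	fmt.Println("probes run:", probes)
}
```

Caching failures as well as successes keeps an offline laptop from being hammered, at the cost of up to one TTL of delay before paliad notices that mRiver is back.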

Security + auth-domain

  13. Trust model: Anyone who can read paliad's container disk + SSH private key can SSH into mRiver as user m. Mitigations: SSH command= restriction (Q5); audit log on mRiver-side; Dokploy host-disk encryption assumption.
  14. PaliadinOwnerEmail gate (existing): stays in place, so only m's email can hit /paliadin. Even if the SSH key leaks, paliad must already be authenticated as m for the route to fire.
  15. Rate limit: should paliad cap requests/minute to avoid runaway SSH connections if Paliadin loops?
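
For Q15, a requests-per-minute cap can be sketched as a small sliding-window limiter. This is an illustration under assumptions, not the paliad implementation: turnLimiter and newTurnLimiter are hypothetical names, and the one-minute window is taken from the question's wording.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// turnLimiter allows at most max calls in any rolling one-minute window.
// Hypothetical helper for illustration only.
type turnLimiter struct {
	mu     sync.Mutex
	max    int
	stamps []time.Time // timestamps of the calls admitted in the last minute
}

func newTurnLimiter(maxPerMinute int) *turnLimiter {
	return &turnLimiter{max: maxPerMinute}
}

// Allow reports whether another SSH turn may start now and records it if so.
func (l *turnLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	cutoff := now.Add(-time.Minute)
	// Drop timestamps that have aged out of the window (in-place filter).
	kept := l.stamps[:0]
	for _, t := range l.stamps {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	l.stamps = kept
	if len(l.stamps) >= l.max {
		return false
	}
	l.stamps = append(l.stamps, now)
	return true
}

func main() {
	l := newTurnLimiter(3)
	for i := 1; i <= 5; i++ {
		fmt.Printf("turn %d allowed: %v\n", i, l.Allow())
	}
}
```

Because only m can reach /paliadin, a single process-wide limiter is probably enough; a per-user map would be the natural extension if that assumption changes.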

Phasing

  16. Phase A, manual proof of concept: m manually adds the SSH key + Tailscale on dev laptop, paliad prod points at it, verify flow end-to-end before automating Dockerfile changes.
  17. Phase B, Dockerfile + Dokploy secret integration: ship to prod under PALIADIN_REMOTE_HOST env var.
  18. Phase C, friendly errors + monitoring: round out UX.

References

  • internal/services/paliadin.go — current local-tmux implementation
  • docs/design-paliadin-2026-05-07.md — original design (notes Phase 1 was "Anthropic API direct"; this issue introduces a third path)
  • mLake Tailscale: 100.99.98.201 (Dokploy host)
  • mRiver Tailscale: 100.99.98.203 (m's laptop, runs Claude Code)
  • .env.age for Dokploy secrets pattern
  • t-150 (friendlyErrorMessage) — pattern to extend for mriver_unreachable

Inventor brief

  • Role: inventor
  • Hire: noether (deepest Paliadin substrate context, just shipped t-146). NOT cronus per memory.
  • Branch: mai/noether/inventor-paliadin-tailscale-ssh
  • Deliverable: docs/design-paliadin-tailscale-ssh-2026-05-07.md. Three sub-designs:
    1. Container Tailscale + Dockerfile + auth-key (Q1-Q3)
    2. SSH identity + restricted command + host-key pinning (Q4-Q6)
    3. Service-layer integration + routing decision + reliability (Q7-Q15)
  • Inventor STOPs after design. Surface design-shaping questions via AskUserQuestion or PWA form, NOT a markdown §X.Y dump (paliad dogma: make it easy for m).
  • m available during design pass.
mAi self-assigned this 2026-05-07 20:34:30 +00:00
Author
Collaborator

Inventor design pushed: docs/design-paliadin-tailscale-ssh-2026-05-07.md (commit befa41c on mai/noether/inventor-paliadin).

m's locked decisions (verbatim, 22:35):

  1. Tailscale shape → network_mode: host on paliad (overrode inventor's sidecar recommendation; Phase A gates rollout on traefik still routing under host mode)
  2. SSH RPC → server-side paliadin-shim (one RPC per turn: run-turn / reset / health / bootstrap)
  3. Routing → env var PALIADIN_REMOTE_HOST + Paliadin interface split (LocalPaliadinService ↔ RemotePaliadinService)
  4. Key storage → Dokploy secret PALIADIN_SSH_PRIVATE_KEY, chmod-600 tmpfile at startup
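
Decision 4 (secret in an env var, chmod-600 tmpfile at startup) could look roughly like the following Go sketch. writeKeyFile and verifyKeyPerms are hypothetical names, not functions from the paliad repository; only the env var name PALIADIN_SSH_PRIVATE_KEY comes from the decision above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// writeKeyFile writes the private key to a fresh temp file readable only by
// the owner and returns its path. os.CreateTemp already creates the file with
// mode 0600; the explicit Chmod documents the intent.
func writeKeyFile(key string) (string, error) {
	f, err := os.CreateTemp("", "paliadin-key-*")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if err := f.Chmod(0o600); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	// Key files are conventionally newline-terminated; add one if missing.
	if !strings.HasSuffix(key, "\n") {
		key += "\n"
	}
	if _, err := f.WriteString(key); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	return f.Name(), nil
}

// verifyKeyPerms fails if group or others have any access to the file,
// which is the condition under which ssh refuses to load a private key.
func verifyKeyPerms(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if perm := info.Mode().Perm(); perm&0o077 != 0 {
		return fmt.Errorf("key file %s has mode %o, want 600", path, perm)
	}
	return nil
}

func main() {
	key := os.Getenv("PALIADIN_SSH_PRIVATE_KEY")
	if key == "" {
		fmt.Println("PALIADIN_SSH_PRIVATE_KEY not set; nothing to write")
		return
	}
	path, err := writeKeyFile(key)
	if err != nil {
		fmt.Println("write failed:", err)
		return
	}
	defer os.Remove(path)
	fmt.Println("key staged at", path, "perm check:", verifyKeyPerms(path))
}
```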

Verified live before designing (so the design isn't built on stale CLAUDE.md):

  • mRiver 100.99.98.203 has tmux 3.6a + claude CLI at /home/m/.local/bin/claude
  • mLake 100.99.98.201 has Tailscale running, sees mRiver active; direct [...]:41641
  • paliad Dockerfile is alpine:3.21 minimal (no SSH, no tailscaled)
  • ~/.ssh/authorized_keys does NOT exist on mRiver yet — Phase A creates it
  • compose runs default Docker bridge — no Tailscale anywhere yet

Three-phase rollout:

  • A: manual proof on a temp branch: keypair gen + shim install + authorized_keys + host-key pin + flip compose to host mode and verify traefik still serves paliad.de. Gate: if traefik 502s, abort A and revisit decision 1 in a follow-up.
  • B: Dockerfile (+openssh-client), compose (host mode + 4 env vars), Go interface split, Dokploy secrets registration.
  • C: mriver_unreachable friendly error extending t-paliad-150, admin dashboard health probe.

Three open questions for m at end of doc (§11): traefik+host-mode Dokploy doc check before B, shim location (repo vs mRiver-only), and the dead ANTHROPIC_API_KEY comment line in compose.

Inventor stopped here. No code shipped. Awaiting m's go/no-go before coder shift.

Author
Collaborator

Phase A.0 complete (coder shift, noether) — SSH path proven end-to-end on the tailnet.

Commits: 0248411 shim, f952fb8 design amendment.

What changed from the original design

Tailscale SSH intercepts mRiver:22 — banner says SSH-2.0-Tailscale, auth method none, authorized_keys command= directive never fires. Discovered while debugging "fish: Unknown command: health" on the first run-turn test.

Fix: added a separate listen port :22022 via a systemd ssh.socket drop-in. Tailscale SSH only intercepts :22, so :22022 hits real OpenSSH where command=/from=/no-pty/... work as specified. m's interactive tailscale ssh m@mriver on :22 is untouched.

(First attempt at the drop-in briefly failed with Address already in use — a stale Port 22022 directive from sshd_config was holding the port; reverted in ~30 s and retried clean. Documented in §4.5.)

Verified live

ssh -p 22022 -i paliad-prod-key m@100.99.98.203 health
→ ok
ssh ... run-turn $(uuidgen) "$(... | base64 -w0)"
→ "test ok" (3.4 s round-trip including a real Claude response)
ssh from mRiver itself (not 100.99.98.201)
→ Permission denied (publickey,password)   # from= clause works

State on mRiver (m's laptop) — already in place

  • /home/m/.local/bin/paliadin-shim (executable) — repo-version-controlled at scripts/paliadin-shim
  • ~/.ssh/authorized_keys — paliad-prod public key with command=/from="100.99.98.201"/no-pty/no-port-forwarding/no-agent-forwarding/no-X11-forwarding/no-user-rc
  • /etc/systemd/system/ssh.socket.d/paliad.conf — port 22022 socket drop-in
  • ~/.paliad-staging/ — keypair + known_hosts staged for Dokploy registration
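
With a command= restriction, sshd runs the forced command and exposes what the client asked for in the SSH_ORIGINAL_COMMAND environment variable. The sketch below shows how a shim might validate that request before acting on it. It is written in Go purely for illustration and is not the contents of scripts/paliadin-shim; the verb names come from this thread, while the argument counts (in particular zero arguments for reset) and the allowed character set are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"regexp"
	"strings"
)

// arity lists the accepted verbs and how many arguments each takes.
// The count for "reset" is an assumption.
var arity = map[string]int{
	"health":    0,
	"reset":     0,
	"bootstrap": 1, // <base64-prompt>
	"run-turn":  2, // <uuid> <base64-msg>
}

// safeArg accepts only characters that occur in UUIDs and standard base64,
// so nothing shell-significant can reach later stages.
var safeArg = regexp.MustCompile(`^[A-Za-z0-9+/=-]+$`)

// parseRequest splits the original command and rejects anything that is not
// an allow-listed verb with the expected number of well-formed arguments.
func parseRequest(original string) (string, []string, error) {
	fields := strings.Fields(original)
	if len(fields) == 0 {
		return "", nil, errors.New("empty command")
	}
	verb, args := fields[0], fields[1:]
	want, ok := arity[verb]
	if !ok {
		return "", nil, fmt.Errorf("unknown verb %q", verb)
	}
	if len(args) != want {
		return "", nil, fmt.Errorf("%s takes %d argument(s), got %d", verb, want, len(args))
	}
	for _, a := range args {
		if !safeArg.MatchString(a) {
			return "", nil, fmt.Errorf("argument %q contains disallowed characters", a)
		}
	}
	return verb, args, nil
}

func main() {
	verb, args, err := parseRequest(os.Getenv("SSH_ORIGINAL_COMMAND"))
	if err != nil {
		// A real shim would exit non-zero here.
		fmt.Fprintln(os.Stderr, "rejected:", err)
		return
	}
	fmt.Println("accepted:", verb, args)
}
```

Passing payloads as base64 keeps the argument grammar trivial to validate, which is presumably why the run-turn and bootstrap calls in this thread encode their text that way.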

Secrets to register in Dokploy (m's hands)

| Dokploy secret | Value |
|---|---|
| PALIADIN_REMOTE_HOST | 100.99.98.203 |
| PALIADIN_REMOTE_PORT | 22022 |
| PALIADIN_REMOTE_USER | m |
| PALIADIN_SSH_PRIVATE_KEY | cat ~/.paliad-staging/paliad-prod-key (multi-line) |
| PALIADIN_KNOWN_HOSTS | cat ~/.paliad-staging/known_hosts (one line) |

Phase A.5 (PENDING m's hands)

Validate network_mode: host + traefik routing on prod paliad.de:

  1. Branch docker-compose.yml: add network_mode: host to web, remove expose: ["8080"]
  2. Push to trigger Dokploy redeploy
  3. curl --connect-timeout 5 -sSI https://paliad.de/ — expect 200 (or login redirect), NOT 502
  4. Pass: keep host-mode; ready for Phase B
  5. Fail (502): git revert HEAD && git push to roll back; revisit decision 1 in a follow-up issue

This step touches prod paliad.de, so the coder did NOT execute it; it needs your hands.

Phase B unblocked from a code perspective

Phase B should not merge until A.5 confirms host-mode is OK. Code work: Dockerfile +openssh-client; compose host-mode + 5 env vars; Go interface split (Paliadin interface, rename LocalPaliadinService, new RemotePaliadinService in paliadin_remote.go); cmd/server/main.go env-var wiring; friendlyErrorMessage mriver_unreachable case; i18n keys.

Two remaining open questions

  1. Phase A.5 traefik+host-mode on prod paliad.de — your call when to flip
  2. ANTHROPIC_API_KEY env var reservation in compose comments — keep for production-v1 or strip?

Standing by for next instruction.

Author
Collaborator

Phase B code merged into main as 1d7c7d7, bundling 8 commits from mai/noether/inventor-paliadin:

  • befa41c - design doc (docs/design-paliadin-tailscale-ssh-2026-05-07.md)
  • 0248411 - scripts/paliadin-shim (mRiver-side restricted shim)
  • f952fb8 - design amend: port 22022 bypass + Phase A.0 results
  • f62bf9f - Dockerfile +openssh-client
  • 56a3dc9 - refactor: extract Paliadin interface; PaliadinService → LocalPaliadinService
  • 0c8a2f1 - RemotePaliadinService + DisabledPaliadinService + main.go env-var switch (PALIADIN_REMOTE_HOST → remote / tmux on PATH → local / else → disabled)
  • 68c56ea - 14 unit tests via callShimHook (no real SSH required)
  • e4110cf - frontend friendlyErrorMessage for remote-Paliadin error codes (DE+EN)

Production behavior unchanged. Without PALIADIN_REMOTE_HOST in env, paliad never invokes ssh — local-tmux PoC path is byte-identical (tests pass).
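
The three-way switch described for 0c8a2f1 (PALIADIN_REMOTE_HOST set → remote, tmux on PATH → local, otherwise disabled) can be sketched as below. This is a simplified illustration, not the merged code: the real Paliadin interface presumably has turn-handling methods, whereas this stand-in only exposes a Mode method, and the lowercase type names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Paliadin is a stand-in for the real interface; Mode exists only so the
// sketch can show which implementation was selected.
type Paliadin interface {
	Mode() string
}

type remotePaliadin struct{ host string }
type localPaliadin struct{}
type disabledPaliadin struct{}

func (remotePaliadin) Mode() string   { return "remote" }
func (localPaliadin) Mode() string    { return "local" }
func (disabledPaliadin) Mode() string { return "disabled" }

// newPaliadin applies the precedence from the commit description:
// an explicit remote host wins, then local tmux, then disabled.
func newPaliadin(remoteHost string, tmuxOnPath bool) Paliadin {
	switch {
	case remoteHost != "":
		return remotePaliadin{host: remoteHost}
	case tmuxOnPath:
		return localPaliadin{}
	default:
		return disabledPaliadin{}
	}
}

func main() {
	_, err := exec.LookPath("tmux")
	p := newPaliadin(os.Getenv("PALIADIN_REMOTE_HOST"), err == nil)
	fmt.Println("paliadin mode:", p.Mode())
}
```

Checking the env var first means a host that happens to have tmux installed still goes remote when configured to, and leaving the variable unset keeps the local path untouched.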

Phase A.5 still pending m's hands: da971a7 (docker-compose network_mode: host + 5 new env vars) is held on mai/noether/inventor-paliadin and explicitly marked DO NOT MERGE before Phase A.5. The compose flip's commit message has the A.5 procedure (curl test + revert path on M1 502).

Three open items for m before this issue closes:

  1. Phase A.5 deploy test — flip docker-compose.yml host-mode on a temp branch (or cherry-pick da971a7), redeploy, curl paliad.de, gate the merge on 200 vs 502.
  2. Register PALIADIN_SSH_PRIVATE_KEY + PALIADIN_KNOWN_HOSTS as Dokploy secrets (values staged at ~/.paliad-staging/ on mRiver per issuecomment-6886).
  3. Optional: strip the dead ANTHROPIC_API_KEY comment line from compose.
Author
Collaborator

Phase A.5 + A.7 done — Paliadin chat works end-to-end from paliad.de prod.

m's "I still don't have a connection" was the symptom of two issues stacked on top of the original A.5 attempt that was reverted (a80652a → 82faa3d on 2026-05-08 00:38).

Root causes

  1. Decision 1 (network_mode: host) was actually wrong — not because traefik 502s (the M2 case I had designed for), but because Dokploy auto-injects networks: [dokploy-network, default] on the primary service for traefik routing. Compose then refuses: service web declares mutually exclusive network_mode and networks: invalid compose project. That's the reverted-merge's failure mode, recorded in /etc/dokploy/logs/.../...:00:38:05.log.

  2. The host-mode premise was unnecessary anyway. Empirical test (commit log):

    $ docker run --rm -v /tmp/paliad-prod-key:/tmp/k:ro \
                     -v /tmp/paliad-known_hosts:/tmp/kh:ro alpine:3.21 \
        sh -c 'apk add openssh-client && \
               ssh -p 22022 -i /tmp/k -o UserKnownHostsFile=/tmp/kh \
                   -o IdentitiesOnly=yes m@100.99.98.203 health'
    → ok
    

    Plain alpine container on Dokploy's default bridge SSHs to mRiver fine. Docker source NAT masquerades the bridge IP onto mLake's tailscale0 (100.99.98.201), which matches the from="100.99.98.201" clause on mRiver's authorized_keys. The kernel routes tailnet traffic for free; no network_mode: host, no Tailscale-in-container needed.

  3. Multi-line PEM env vars don't survive Dokploy's .env mechanism — got truncated to the BEGIN line (36 bytes) inside the container. ssh -i … failed with Load key: error in libcrypto. Fixed by base64-encoding the secret and decoding in buildPaliadinRemoteConfig.
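
The "accept raw PEM or base64" behaviour described for root cause 3 can be sketched as follows. This is an illustration, not the body of buildPaliadinRemoteConfig: decodeKey and encodeForDokploy are hypothetical names. The END-marker check is an addition that would catch the truncated 36-byte value early instead of letting ssh fail later with a libcrypto error.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

const pemPrefix = "-----BEGIN "

// encodeForDokploy turns a multi-line PEM key into a single-line value that
// survives a line-oriented .env file.
func encodeForDokploy(pem string) string {
	return base64.StdEncoding.EncodeToString([]byte(pem))
}

// decodeKey accepts either a raw PEM key (local dev) or its base64 encoding
// (Dokploy) and returns newline-terminated PEM text.
func decodeKey(v string) (string, error) {
	v = strings.TrimSpace(v)
	if v == "" {
		return "", errors.New("key is empty")
	}
	if !strings.HasPrefix(v, pemPrefix) {
		raw, err := base64.StdEncoding.DecodeString(v)
		if err != nil {
			return "", fmt.Errorf("key is neither PEM nor base64: %w", err)
		}
		v = strings.TrimSpace(string(raw))
		if !strings.HasPrefix(v, pemPrefix) {
			return "", errors.New("base64 value does not decode to PEM")
		}
	}
	// A value cut off after the first line has a BEGIN marker but no END.
	if !strings.Contains(v, "-----END ") {
		return "", errors.New("PEM has no END marker; value was probably truncated")
	}
	return v + "\n", nil
}

func main() {
	// Placeholder body, not a real key.
	pem := "-----BEGIN OPENSSH PRIVATE KEY-----\nAAAA\n-----END OPENSSH PRIVATE KEY-----"
	encoded := encodeForDokploy(pem)
	fmt.Println("single line:", !strings.Contains(encoded, "\n"))

	_, err := decodeKey("-----BEGIN OPENSSH PRIVATE KEY-----")
	fmt.Println("truncated value:", err)
}
```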

Commits to main today

  • a0d1e77 — Phase A.5 (correct version): drops network_mode: host, adds 5 PALIADIN_* env entries to compose, includes the rationale + the empirical alpine-on-bridge proof in the commit body
  • 4c47819 — base64-decode PALIADIN_SSH_PRIVATE_KEY; accepts both raw PEM (local-dev) and base64 (Dokploy)

Dokploy compose Zx147ycurfYagKRl_Zzyo (paliad) updated via API: PALIADIN_SSH_PRIVATE_KEY now base64-encoded (560 chars, single line).

End-to-end verification (inside the live paliad-prod container, 11:33)

$ ssh -F /dev/null -i $KEY -p 22022 -o IdentitiesOnly=yes \
       -o UserKnownHostsFile=$KH -o StrictHostKeyChecking=yes \
       -o BatchMode=yes -o ConnectTimeout=5 m@100.99.98.203 -- health
ok

$ ssh ... -- bootstrap <base64-prompt>
ok

$ ssh ... -- run-turn <uuid> <base64-msg>
ok                       (Claude wrote /tmp/paliadin/<uuid>.txt on mRiver)

paliad startup log confirms: paliadin: remote mode → ssh m@100.99.98.203:22022. Claude pane reset + paliad container restarted clean so m's first /paliadin invocation gets the real system prompt via Go's lazy ensureBootstrapped path.
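
The ssh invocations verified above map naturally onto an argument vector built in Go and handed to os/exec. The sketch below mirrors the flags from the verification transcript; remoteConfig, sshArgs and encodeMessage are hypothetical names, the file paths in main are placeholders, and the actual RemotePaliadinService may assemble its command differently.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strconv"
	"strings"
)

// remoteConfig holds the values supplied through the PALIADIN_* env vars.
type remoteConfig struct {
	Host, User     string
	Port           int
	KeyPath        string
	KnownHostsPath string
}

// sshArgs builds the argument list for one shim call. Passing a slice to
// exec.Command avoids any local shell quoting.
func sshArgs(c remoteConfig, verb string, shimArgs ...string) []string {
	args := []string{
		"-F", "/dev/null", // ignore any ssh_config in the container
		"-i", c.KeyPath,
		"-p", strconv.Itoa(c.Port),
		"-o", "IdentitiesOnly=yes",
		"-o", "UserKnownHostsFile=" + c.KnownHostsPath,
		"-o", "StrictHostKeyChecking=yes", // pinned host key only
		"-o", "BatchMode=yes", // never prompt
		"-o", "ConnectTimeout=5",
		c.User + "@" + c.Host,
		"--", verb,
	}
	return append(args, shimArgs...)
}

// encodeMessage produces unwrapped standard base64, like base64 -w0.
func encodeMessage(msg string) string {
	return base64.StdEncoding.EncodeToString([]byte(msg))
}

func main() {
	cfg := remoteConfig{
		Host: "100.99.98.203", User: "m", Port: 22022,
		KeyPath: "/tmp/paliadin-key", KnownHostsPath: "/tmp/paliadin-known_hosts",
	}
	// The turn ID here is a placeholder; the transcript uses a UUID.
	args := sshArgs(cfg, "run-turn", "example-turn-id", encodeMessage("hello"))
	fmt.Println("ssh " + strings.Join(args, " "))
	// To execute: exec.Command("ssh", args...).CombinedOutput()
}
```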

Design doc follow-up

docs/design-paliadin-tailscale-ssh-2026-05-07.md §4–§4.5 / §7 are now empirically wrong — host mode was never needed. A small amendment is in scope, but not urgent (the production code works; the design rationale lives in the commit messages of a0d1e77 and 4c47819).

Status: t-paliad-151 working end-to-end. Standing by.

Author
Collaborator

Phase A.5 + A.7 complete — Paliadin chat works end-to-end from paliad.de prod via SSH to mRiver.

Final commit chain on main:

  • a0d1e77 — Phase A.5: compose env-var passthrough (5 PALIADIN_* entries; no host-mode flip needed)
  • db3514c — Merge of the above
  • 4c47819 — base64-decode SSH key (Dokploy .env truncates multi-line values to first line)
  • 319221f — Merge of the above

Two empirical findings that override the original design

(1) network_mode: host is incompatible with this Dokploy app's compose lifecycle. Dokploy auto-injects networks: [dokploy-network, default] on the primary service for traefik routing, which is mutually exclusive with network_mode: host. First attempt at host mode (a80652a) failed compose validation; reverted as 82faa3d.

(2) host mode wasn't needed anyway. Verified by running a plain alpine container on Dokploy's default bridge:

docker run --rm -v /tmp/paliad-prod-key:/tmp/k:ro -v /tmp/paliad-known_hosts:/tmp/kh:ro alpine:3.21 \
  sh -c 'apk add openssh-client && ssh -p 22022 -i /tmp/k -o UserKnownHostsFile=/tmp/kh \
         -o IdentitiesOnly=yes m@100.99.98.203 health'
→ ok

Docker's outbound NAT masquerades the container's bridge IP onto mLake's host IPs, including tailscale0 (100.99.98.201). Linux routing on mLake sends 100.99.98.0/24 to tailscale0. mRiver's sshd sees the connection coming from 100.99.98.201, matching the from="100.99.98.201" clause on the paliad-prod authorized_keys entry. The kernel does the masquerade for free — no Tailscale-in-container, no sidecar, no host networking.

Design doc follow-up needed

docs/design-paliadin-tailscale-ssh-2026-05-07.md §4 (host-mode shape) is empirically wrong; §7 Phase A.5 needs an "M3: kernel does the masquerade for you" entry, and decision 1 in §3 should be amended. Filed as a TODO in the A.5 commit message — worth a small design-doc-amend follow-up before this thread closes.

Done

  • Paliadin chat works end-to-end from paliad.de prod (verified by m + noether 11:23-11:34 CEST, including a real claude turn through the shim).
  • Dokploy secrets registered (host, port, user, base64 SSH key, known_hosts).
  • mRiver host setup persisted (shim, authorized_keys, ssh.socket drop-in for :22022 to bypass Tailscale-SSH interception).
  • Test artifacts cleaned up on mLake.
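
For reference, the registered settings (host, port, user, key, known_hosts) map onto the ssh invocation used in the end-to-end verification roughly as follows. This is a sketch only: `RemoteConfig`, its field names, and `sshArgs` are hypothetical and are not taken from paliadin_remote.go; the ssh options themselves are the ones shown in the verification command.

```go
package main

import (
	"fmt"
	"strconv"
)

// RemoteConfig holds the connection settings. Field names are illustrative.
type RemoteConfig struct {
	Host            string
	User            string
	Port            int
	KeyPath         string // file the decoded private key was written to
	KnownHostsPath  string // file holding the pinned host key
	ConnectTimeoutS int
}

// sshArgs builds the argument vector for `ssh`, mirroring the options of the
// verification command: no user config, one pinned identity, strict host key
// checking, and no interactive prompts.
func sshArgs(cfg RemoteConfig, verb string, verbArgs ...string) []string {
	args := []string{
		"-F", "/dev/null",
		"-i", cfg.KeyPath,
		"-p", strconv.Itoa(cfg.Port),
		"-o", "IdentitiesOnly=yes",
		"-o", "UserKnownHostsFile=" + cfg.KnownHostsPath,
		"-o", "StrictHostKeyChecking=yes",
		"-o", "BatchMode=yes",
		"-o", "ConnectTimeout=" + strconv.Itoa(cfg.ConnectTimeoutS),
		cfg.User + "@" + cfg.Host,
		"--", verb,
	}
	return append(args, verbArgs...)
}

func main() {
	cfg := RemoteConfig{
		Host: "100.99.98.203", User: "m", Port: 22022,
		KeyPath: "/tmp/k", KnownHostsPath: "/tmp/kh", ConnectTimeoutS: 5,
	}
	fmt.Println(sshArgs(cfg, "health"))
}
```

Passing the result to `exec.Command("ssh", args...)` keeps every value a separate argv element on the local side, so nothing here goes through a local shell.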
Author
Collaborator

t-paliad-155 merged into main as 5893c45. Bundle:

  • 97a4124 — real Claude SKILL.md + per-user tmux session keying (paliad-paliadin-<user_id_short>)
  • 9579032 — re-author skill via /write-a-skill conventions (96-line SKILL.md + 134-line references/sql-recipes.md)
  • e75a71f — cwd fix: shim spawns claude in /home/m/dev/paliad (configurable via PALIADIN_REMOTE_CWD) so project-scoped MCPs (Supabase) load. Solves m's 'no DB access' symptom from earlier dogfood.
  • 3e1f4ee — PALIADIN_TIMEOUT_S default 60→120s for cold-start safety; SKILL.md bans psql/curl fallbacks (Claude must write 'DB unreachable' rather than nix-shelling postgres on a 1m20s detour). Solves m's 'loses connection before response came in' from earlier dogfood.

Lockstep update on mRiver: ~/.local/bin/paliadin-shim refreshed (new verb signatures: health <session>, run-turn <session> <uuid> <msg-base64>, reset <session>; bootstrap verb removed). ~/.claude/skills/paliadin/ refreshed via scripts/install-paliadin-skill. Both done before paliad container redeploys, so the new Go side talks to the new shim from the first post-deploy turn.

Service-side (paliadin_remote.go, paliadin.go, main.go): paliadinSystemPrompt keystroke-bootstrap path deleted. Per-user session keying derived from req.UserID. paliadin_prompt.go removed (skill is now source of truth). 14 unit tests via callShimHook updated for the new shape.

Known limitation flagged for next task (t-paliad-156, queued): even with the skill loaded and the right MCP, Claude queries via service role — sees ALL data, RLS bypassed. Skill enforces paliad.can_see_project predicate as a stopgap, but it's discipline, not enforcement. m's call (2026-05-08 13:29): proper fix is per-turn JWT minted by paliad with sub=<user_id>, passed through SSH/shim/file to Claude, used as Authorization: Bearer against PostgREST. Filed as separate task; ships after this lands and is dogfooded.

Reference: m/paliad#12