feat: Schritt 2 — mGPUmanager MVP routing + /v1/status

Go daemon listening on :8770 that fronts mvoice (8766), whisper-server (8178), ollama (11434), comfyui (8188) behind a single /v1 façade.

What this MVP does:
- Loads config/consumers.yaml: routing table, per-consumer URL + health + paths + vram_resident_mib + can_coexist_with + load/unload routes.
- Background health probe (5s) on every consumer; refuses fast with a structured 503 if the last probe failed (no Felix-Banholzer-style silent fallback).
- POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type to the routed consumer's path and streams the response back.
- GET /audio/* proxies to audio_proxy consumer (wa.sh fetches its WAV this way).
- GET /v1/status exposes live GPU sample (nvidia-smi every 2s), per-consumer health/loaded/gpu_resident_mib/active/total_requests, scheduler stats.
- GET /healthz, GET /: broker liveness.

The Scheduler interface is in place but the implementation is 'Passthrough': every job runs immediately, no lock, no queue. Schritt 4 replaces it with a serialising mutex; Schritt 5 adds VRAM-pressure eviction. The interface boundary means server.go stays unchanged.

Out of scope here:
- Schritt 3: wa.sh migration (parallel work in mAi).
- Schritt 4: queue + global GPU lock.
- Schritt 5: nvidia-smi-driven LRU eviction.

Tests: config validation (good/bad), proxy forwards body, audio proxy streams bytes, unhealthy consumer returns 503, /v1/status JSON shape.

Refs: m/mGPUmanager#1
2026-05-11 13:30:05 +02:00
parent b31b6f6580
commit c81c145163
16 changed files with 1701 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,3 +1,73 @@
 # mGPUmanager

-GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a unified /v1 façade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, later Furbotto. Written in Go.
+GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a unified `/v1` façade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, later Furbotto. Written in Go.
+
+Full design: [`docs/design.md`](docs/design.md) covers the inventory of the current setup, a survey of 10 alternatives, the eviction algorithm, and the migration path.
+
+## What it does
+
+A Go daemon sits on `mrock:8770` and:
+
+- exposes `/v1/tts`, `/v1/stt`, `/v1/llm`, `/v1/image` as a unified façade over the consumers,
+- routes every request through a global GPU scheduler (serial, queued),
+- runs LRU eviction under VRAM pressure across the coexistence groups declared in `config/consumers.yaml`,
+- shows live GPU usage, consumer health, and scheduler statistics in `/v1/status`,
+- never returns silent fallbacks: errors come back as a structured `{error,message,consumer,retryable}`.
+
+## Consumer registry
+
+`config/consumers.yaml` declares, per consumer:
+
+- `url`, `health.{method,path}` for liveness probing
+- `paths.<kind>.{method,path}`: how the broker reaches the consumer's TTS/STT/LLM/image endpoint
+- `vram_resident_mib`: input to the scheduler's VRAM arithmetic (Schritt 5)
+- `unload.{method,path,body}` and optionally `load.{method,path}`: how the broker evicts the consumer from VRAM / brings it back up
+- `can_coexist_with: [..]`: which consumers may be resident in parallel
+- `priority` (0=low, 4=urgent), `max_concurrency`
+
+## Build + Deploy
+
+```sh
+make build       # ./bin/mgpumanager
+make test        # go test ./...
+make run         # run locally against ./config/consumers.yaml
+make deploy HOST=mrock  # rsync + systemd reload + restart
+```
+
+On mRock the daemon runs as a system unit (`/etc/systemd/system/mgpumanager.service`).
+
+## Endpoints
+
+| Verb | Path | Behaviour |
+|---|---|---|
+| POST | `/v1/tts`   | Proxies to the `routing.tts` consumer (default: mvoice `/api/synthesize`) |
+| POST | `/v1/stt`   | Proxies to the `routing.stt` consumer (default: mvoice `/api/transcribe`) |
+| POST | `/v1/llm`   | Proxies to the `routing.llm` consumer (default: ollama `/api/generate`) |
+| POST | `/v1/image` | Proxies to the `routing.image` consumer (default: comfyui `/prompt`) |
+| GET  | `/audio/*`  | Proxies to the `audio_proxy` consumer (this is how wa.sh fetches generated audio) |
+| GET  | `/v1/status`| Live snapshot: GPU + consumer health + scheduler stats |
+| GET  | `/healthz`  | Broker liveness (200 OK) |
+
+## Error schema
+
+Every error that originates in the broker itself has the form:
+
+```json
+{
+  "error": "consumer_unreachable",
+  "message": "upstream mvoice last probe failed: connection refused",
+  "consumer": "mvoice",
+  "retryable": true
+}
+```
+
+Codes: `consumer_unreachable`, `no_consumer`, `scheduler_error`, `bad_consumer_url`, `bad_request`. Pass-through 4xx/5xx responses from the consumer reach the client unchanged.
+
+## Phase 1 Status (Issue #1)
+
+- ✅ Schritt 0: ComfyUI persistent (`systemd: comfyui.service`)
+- ✅ Schritt 1: `mvoice /api/admin/{load,unload}` (mai/knuth/admin-load-unload @ mVoice)
+- ✅ Schritt 2: routing façade + `/v1/status` (passthrough scheduler)
+- ☐ Schritt 3: wa.sh switched over to the broker
+- ☐ Schritt 4: queue + global GPU lock
+- ☐ Schritt 5: coexistence groups + LRU eviction