mAi c81c145163 feat: Schritt 2 — mGPUmanager MVP routing + /v1/status
Go daemon listening on :8770 that fronts mvoice (8766), whisper-server
(8178), ollama (11434), comfyui (8188) behind a single /v1 façade.

What this MVP does:
- Loads config/consumers.yaml: routing table, per-consumer URL + health +
  paths + vram_resident_mib + can_coexist_with + load/unload routes.
- Background health probe (5s) on every consumer; refuses fast with a
  structured 503 if the last probe failed (no Felix-Banholzer-style
  silent fallback).
- POST /v1/{tts,stt,llm,image} proxies the request body + Content-Type
  to the routed consumer's path and streams the response back.
- GET /audio/* proxies to audio_proxy consumer (wa.sh fetches its WAV
  this way).
- GET /v1/status exposes live GPU sample (nvidia-smi every 2s),
  per-consumer health/loaded/gpu_resident_mib/active/total_requests,
  scheduler stats.
- GET /healthz, GET / — broker liveness.
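
The probe-then-refuse behaviour described above can be sketched as follows. This is a minimal illustration, not the daemon's code: the `consumer` type, `probeOnce`, `probeLoop`, `gate`, and `statusFor` are hypothetical names, and the plain-text 503 body stands in for the structured error the broker actually returns.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// consumer is a hypothetical stand-in for one entry of config/consumers.yaml.
type consumer struct {
	name      string
	healthURL string
	healthy   atomic.Bool // outcome of the most recent probe; false until one succeeds
}

// probeOnce performs a single health check and records the outcome.
func (c *consumer) probeOnce(client *http.Client) {
	resp, err := client.Get(c.healthURL)
	if err != nil {
		c.healthy.Store(false)
		return
	}
	resp.Body.Close()
	c.healthy.Store(resp.StatusCode < 400)
}

// probeLoop probes immediately, then on a fixed interval (5s in the commit
// notes) until ctx is cancelled.
func (c *consumer) probeLoop(ctx context.Context, client *http.Client, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		c.probeOnce(client)
		select {
		case <-ctx.Done():
			return
		case <-t.C:
		}
	}
}

// gate answers 503 while the last probe is failing instead of forwarding
// the request to a consumer that is known to be down.
func gate(c *consumer, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !c.healthy.Load() {
			http.Error(w, c.name+": last health probe failed", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor runs one request through the gate and returns the status code.
func statusFor(c *consumer) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })
	rec := httptest.NewRecorder()
	gate(c, ok).ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/v1/tts", nil))
	return rec.Code
}

func main() {
	// A throwaway upstream that answers 200 to every health check.
	up := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	defer up.Close()

	c := &consumer{name: "mvoice", healthURL: up.URL}
	fmt.Println(statusFor(c)) // 503: no successful probe yet
	c.probeOnce(&http.Client{Timeout: time.Second})
	fmt.Println(statusFor(c)) // 200: last probe succeeded
}
```

Keeping the probe result in an atomic flag means the request path never blocks on a health check; it only reads the latest known state.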

The Scheduler interface is in place but the implementation is
'Passthrough' — every job runs immediately, no lock, no queue. Schritt 4
replaces it with a serialising mutex; Schritt 5 adds VRAM-pressure
eviction. The interface boundary means server.go stays unchanged.
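
To illustrate why the interface boundary keeps the server code stable, here is a small sketch. The method set of `Scheduler`, the `Job` type, and the `Serial` type are assumptions made for this example; only the name 'Passthrough' and its run-immediately behaviour come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Job is a hypothetical unit of GPU work, e.g. one proxied upstream call.
type Job func(ctx context.Context) error

// Scheduler decides when a job may run. The signature is an assumption.
type Scheduler interface {
	Run(ctx context.Context, consumer string, job Job) error
}

// Passthrough runs every job immediately: no lock, no queue.
type Passthrough struct{}

func (Passthrough) Run(ctx context.Context, _ string, job Job) error { return job(ctx) }

// Serial is one possible shape for a later serialising scheduler:
// a single mutex, so at most one job runs at a time.
type Serial struct{ mu sync.Mutex }

func (s *Serial) Run(ctx context.Context, _ string, job Job) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if err := ctx.Err(); err != nil { // the caller gave up while waiting for the lock
		return err
	}
	return job(ctx)
}

// handle stands in for the server code: it only ever sees the interface.
func handle(s Scheduler, consumer string) (string, error) {
	var out string
	err := s.Run(context.Background(), consumer, func(context.Context) error {
		out = "ran " + consumer
		return nil
	})
	return out, err
}

func main() {
	for _, s := range []Scheduler{Passthrough{}, &Serial{}} {
		out, err := handle(s, "ollama")
		fmt.Println(out, err) // ran ollama <nil>
	}
}
```

Because `handle` depends only on `Scheduler`, swapping `Passthrough{}` for `&Serial{}` is a one-line change at construction time.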

Out of scope here:
- Schritt 3: wa.sh migration (parallel work in mAi).
- Schritt 4: queue + global GPU lock.
- Schritt 5: nvidia-smi-driven LRU eviction.

Tests: config validation (good/bad), proxy forwards body, audio proxy
streams bytes, unhealthy consumer returns 503, /v1/status JSON shape.

Refs: m/mGPUmanager#1
2026-05-11 13:30:17 +02:00

mGPUmanager

GPU inference control plane for mRock: a scheduler in front of TTS/STT/LLM/image generation with a global GPU lock, LRU eviction, and a unified /v1 façade. Consumers: mVoice, whisper-server, Ollama, ComfyUI/FLUX, later Furbotto. Written in Go.

Full design: docs/design.md (inventory of the current setup, survey of 10 alternatives, eviction algorithm, migration path).

What it does

A Go daemon runs on mrock:8770 and:

  • exposes /v1/tts, /v1/stt, /v1/llm, and /v1/image as a single façade in front of the consumers,
  • routes every request through a global GPU scheduler (serial, queued),
  • under VRAM pressure, runs LRU eviction over the coexistence groups declared in config/consumers.yaml,
  • shows live GPU usage, consumer health, and scheduler statistics in /v1/status,
  • never returns silent fallbacks: errors come back as a structured {error,message,consumer,retryable}.

Consumer registry

config/consumers.yaml declares, per consumer:

  • url, health.{method,path} for liveness probing
  • paths.<kind>.{method,path}: how the broker reaches the consumer's TTS/STT/LLM/image endpoint
  • vram_resident_mib: input to the scheduler's VRAM arithmetic (Schritt 5)
  • unload.{method,path,body} and optionally load.{method,path}: how the broker evicts the consumer from VRAM and brings it back up
  • can_coexist_with: [..]: which consumers may be resident at the same time
  • priority (0=low, 4=urgent), max_concurrency
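
The fields above could map onto Go types roughly as follows, together with the kind of validation a broker might run at startup. This is a sketch under assumptions: the type and field names, the specific checks, and the loopback URLs are illustrative, and YAML decoding is left out to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Route is a method + path pair, as used by health, paths, load and unload.
type Route struct {
	Method string
	Path   string
}

// Consumer mirrors the per-consumer keys listed above (names are assumptions).
type Consumer struct {
	URL             string
	Health          Route
	Paths           map[string]Route // keyed by kind: tts, stt, llm, image
	VRAMResidentMiB int
	Unload          *Route // the unload body is omitted in this sketch
	Load            *Route // optional
	CanCoexistWith  []string
	Priority        int // 0=low .. 4=urgent
	MaxConcurrency  int
}

// Validate collects every problem instead of stopping at the first one.
func Validate(consumers map[string]Consumer) error {
	var errs []error
	for name, c := range consumers {
		u, err := url.Parse(c.URL)
		if err != nil || u.Scheme == "" || u.Host == "" {
			errs = append(errs, fmt.Errorf("%s: url %q is not an absolute URL", name, c.URL))
		}
		if c.Priority < 0 || c.Priority > 4 {
			errs = append(errs, fmt.Errorf("%s: priority %d outside 0..4", name, c.Priority))
		}
		if c.VRAMResidentMiB < 0 {
			errs = append(errs, fmt.Errorf("%s: vram_resident_mib must not be negative", name))
		}
		for _, other := range c.CanCoexistWith {
			if _, ok := consumers[other]; !ok {
				errs = append(errs, fmt.Errorf("%s: can_coexist_with names unknown consumer %q", name, other))
			}
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}

func main() {
	good := map[string]Consumer{
		"mvoice": {URL: "http://127.0.0.1:8766", Priority: 3, CanCoexistWith: []string{"ollama"}},
		"ollama": {URL: "http://127.0.0.1:11434", Priority: 2},
	}
	fmt.Println(Validate(good) == nil) // true

	bad := map[string]Consumer{
		"mvoice": {URL: "127.0.0.1:8766", Priority: 9, CanCoexistWith: []string{"nope"}},
	}
	fmt.Println(Validate(bad))
}
```

Checking that every can_coexist_with entry names a declared consumer catches typos at load time rather than during an eviction decision.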

Build + Deploy

make build       # ./bin/mgpumanager
make test        # go test ./...
make run         # run locally against ./config/consumers.yaml
make deploy HOST=mrock  # rsync + systemd reload + restart

On mRock the daemon runs as a system unit (/etc/systemd/system/mgpumanager.service).

Endpoints

Verb  Path        Behavior
POST  /v1/tts     Proxy to the routing.tts consumer (default: mvoice /api/synthesize)
POST  /v1/stt     Proxy to the routing.stt consumer (default: mvoice /api/transcribe)
POST  /v1/llm     Proxy to the routing.llm consumer (default: ollama /api/generate)
POST  /v1/image   Proxy to the routing.image consumer (default: comfyui /prompt)
GET   /audio/*    Proxy to the audio_proxy consumer (wa.sh fetches generated audio this way)
GET   /v1/status  Live snapshot: GPU + consumer health + scheduler stats
GET   /healthz    Broker liveness (200 OK)
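
The commit notes say the GPU part of the /v1/status snapshot comes from nvidia-smi. A sampler along these lines would be one way to obtain it. The `gpuSample` type, the function names, and the choice of queried fields are assumptions for this sketch; the nvidia-smi flags themselves are the standard query interface.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// gpuSample is a hypothetical shape for one GPU reading.
type gpuSample struct {
	UsedMiB, TotalMiB, UtilPct int
}

// parseSample parses one output line of:
//   nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
//              --format=csv,noheader,nounits
func parseSample(line string) (gpuSample, error) {
	fields := strings.Split(strings.TrimSpace(line), ",")
	if len(fields) != 3 {
		return gpuSample{}, fmt.Errorf("want 3 fields, got %d in %q", len(fields), line)
	}
	var n [3]int
	for i, f := range fields {
		v, err := strconv.Atoi(strings.TrimSpace(f))
		if err != nil {
			return gpuSample{}, fmt.Errorf("field %d: %w", i, err)
		}
		n[i] = v
	}
	return gpuSample{n[0], n[1], n[2]}, nil
}

// sampleGPU shells out to nvidia-smi and parses the first GPU's line.
// It fails on hosts without the NVIDIA driver tools installed.
func sampleGPU() (gpuSample, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=memory.used,memory.total,utilization.gpu",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return gpuSample{}, err
	}
	first, _, _ := strings.Cut(string(out), "\n")
	return parseSample(first)
}

func main() {
	// Parsing a canned line keeps the example runnable without a GPU.
	s, err := parseSample("1234, 24564, 17\n")
	fmt.Println(s, err) // {1234 24564 17} <nil>
}
```

Separating parsing from the subprocess call makes the parser testable on machines without a GPU; a ticker calling `sampleGPU` every 2s would supply the live values.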

Error schema

Every error that originates in the broker itself has the form:

{
  "error": "consumer_unreachable",
  "message": "upstream mvoice last probe failed: connection refused",
  "consumer": "mvoice",
  "retryable": true
}

Codes: consumer_unreachable, no_consumer, scheduler_error, bad_consumer_url, bad_request. Pass-through 4xx/5xx responses from a consumer reach the client unchanged.

Phase 1 Status (Issue #1)

  • ☑ Schritt 0: ComfyUI persistent (systemd: comfyui.service)
  • ☑ Schritt 1: mvoice /api/admin/{load,unload} (mai/knuth/admin-load-unload @ mVoice)
  • ☑ Schritt 2: routing façade + /v1/status (passthrough scheduler)
  • ☐ Schritt 3: wa.sh switched over to the broker
  • ☐ Schritt 4: queue + global GPU lock
  • ☐ Schritt 5: coexistence groups + LRU eviction