Replaces the MVP Passthrough with scheduler.Locked: a capacity-1 channel
serialises every consumer's GPU work end-to-end. main.go switches to it.
Behavioural contract:
- Jobs that arrive while another job holds the GPU block on the channel
until the holder finishes. Context cancellation aborts the wait
cleanly (no leaked tokens, queue depth decremented).
- Stats track queue_depth, in_flight, total_jobs, last_wait_ms,
last_run_ms, oldest_queued — surfaced through /v1/status.
- One lock for ALL consumers (not per-consumer): the design (§4.3) is
explicit that coarse-grained > GPU-stream-granular on single-GPU,
single-user hardware. mvoice + ollama + comfyui never run truly
concurrently any more, which is the whole point: concurrent runs
are what produced the CUDA-OOM under load.
Tests:
- 5 goroutines hammer the scheduler concurrently → max in-flight = 1.
- Cancellation while parked on the lock returns ctx.Err() and frees
the queue slot.
- Stats reflect in-flight + queue-depth transitions correctly.
- Race detector clean.
Step 5 will compose this with VRAM-pressure eviction: before
acquiring the lock, check if the target consumer's resident cost fits
under the current GPU headroom; if not, unload the LRU non-coexistent
consumer first.
Refs: m/mGPUmanager#1 (Step 4).