paperless-AI prompt: never use 'Matthias Siebels' as correspondent + allow new correspondents for genuinely new senders + reconcile prompt drift #3
Problem (live, hit by m 2026-05-16)
paperless-AI just classified a Vattenfall electricity-contract doc (#280) with correspondent "Matthias Siebels" — picked off the recipient address block on the letter. m is the recipient of nearly every doc in this DMS; he is essentially never the correspondent.
Related misclassification: docs #283 + #284 (also Vattenfall content, edited from #281/#282) got classified as Telekom because the prompt forces "Bevorzuge IMMER einen existierenden Correspondent vor einem neuen" ("ALWAYS prefer an existing correspondent over a new one"), so the AI force-matched to the closest existing name instead of creating Vattenfall. After m added Vattenfall manually, future docs should classify cleanly, but the underlying prompt bias is the cause.
Three roots
a) Prompt missing 'm is the recipient' rule
The live `.env` SYSTEM_PROMPT (read via `docker exec paperless-ai cat /app/data/.env`) has detailed correspondent fuzzy-matching rules but nothing that prevents "Matthias Siebels" (or any spelling variant) from being used as correspondent. The AI sees the recipient address and treats it as a candidate.
b) 'Always prefer existing' rule is too strict
The live prompt's `Bevorzuge IMMER einen existierenden Correspondent vor einem neuen` + the fuzzy-match catalogue makes the AI force-match to the closest existing name. Vattenfall → Telekom is the most recent example. The setting `RESTRICT_TO_EXISTING_CORRESPONDENTS=no` already allows new ones, so the bottleneck is purely the prompt.
c) Major drift between repo and live
`infra/paperless/SYSTEM_PROMPT.txt` in this repo is a much shorter, simpler version than what's running on mDock. The live prompt was expanded ad-hoc (the live `.env` even has a note: `Drift mechanism observed: AI emitted unknown tag name -> paperless-ai auto-created tag 328 ("Information") despite prompt forbidding it.`). Fixing the prompt only on mDock means the next deploy loses the fix. Reconcile.
Scope
- Take the live prompt (`docker exec paperless-ai cat /app/data/.env` on mDock → extract the `SYSTEM_PROMPT=` value) as the source of truth, copy it into `infra/paperless/SYSTEM_PROMPT.txt`, and update the deploy mechanism (or document the manual sync step) so the repo and live stay aligned.
- Add a Recipient / Empfaenger rule at the top of the correspondent section: `Matthias Siebels (alle Schreibweisen — Mathias, Siebels, MS, Herr Siebels, Empfaenger-Adresse Windscheidstr. 33) ist der EMPFAENGER. NIEMALS als Correspondent setzen. Der Correspondent ist die Organisation oder Person, die das Dokument geschrieben/gesendet hat. In den seltenen Faellen, in denen m selbst Autor ist (eigene Briefe an Behoerden), explizit als Personal Correspondence + Correspondent = die EMPFAENGENDE Organisation.` (English: Matthias Siebels, in all spellings and including the recipient address, is the RECIPIENT. NEVER set him as correspondent. The correspondent is the organisation or person that wrote/sent the document. In the rare cases where m himself is the author, e.g. his own letters to authorities, classify explicitly as Personal Correspondence with Correspondent = the RECEIVING organisation.)
- Replace `Bevorzuge IMMER einen existierenden Correspondent vor einem neuen` with something like `Bevorzuge existierende Correspondents bei klarer semantischer Aehnlichkeit (Fuzzy-Regel unten). Wenn der OCR-Absender genuinely neu ist (z.B. ein neuer Versorger, Vermieter, Arzt, Dienstleister), lege einen neuen Correspondent an statt zwanghaft zu mappen.` (English: prefer existing correspondents when the semantic similarity is clear, per the fuzzy rule below. If the OCR sender is genuinely new, e.g. a new utility, landlord, doctor, or service provider, create a new correspondent instead of forcing a mapping.)
- Update `infra/paperless/SYSTEM_PROMPT.txt` with the reconciled + improved version.
- Push the result to `.env` on mDock, restart paperless-AI container.
- Each already-misclassified doc must be removed from `processed_documents` (in paperless-AI's sqlite at `/app/data/documents.db`) AND have its correspondent unset in Paperless so the AI can reclassify cleanly. Find them with: `docker exec paperless-webserver-1 python -c "import django,os,sys; sys.path.insert(0,'/usr/src/paperless/src'); os.environ.setdefault('DJANGO_SETTINGS_MODULE','paperless.settings'); django.setup(); from documents.models import Document; [print(d.id, d.title) for d in Document.objects.filter(correspondent__name__icontains='Siebels')]"`. Verify each one before clearing; there may be legitimate cases (e.g. a letter m wrote himself).
Acceptance
- `infra/paperless/SYSTEM_PROMPT.txt` matches live (+ the two new rules).
Out of scope
- Changing the tag / document-type restrictions (`RESTRICT_TO_EXISTING_TAGS=yes`, `RESTRICT_TO_EXISTING_DOCUMENT_TYPES=yes`). Curated taxonomy is intentional.
Context for the worker
- paperless-AI: `~/paperless-ai/`-ish, container `paperless-ai`, data at `/app/data/` (volume `paperless_aidata`), config in `/app/data/.env`. SQLite tracker at `/app/data/documents.db`; relevant tables: `processed_documents`, `history_documents`, `openai_metrics`.
- Paperless: `docker exec paperless-webserver-1 python` snippet (head's session today used this; see most recent reports).
- Gitea: filed against m/mDMS
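The tracker cleanup from the Scope could be sketched roughly as below. This is a minimal sketch, not the procedure actually run: it assumes each of the three tracker tables has a `document_id` column (the schema is not shown in this issue, so check it with sqlite's `.schema` first), and `clear_tracker_rows` is a hypothetical helper name. Back up `/app/data/documents.db` before trying anything like it.

```python
import sqlite3

# Tables named in this issue. The `document_id` column is an ASSUMPTION
# about paperless-AI's schema and must be verified before use.
TRACKER_TABLES = ("processed_documents", "history_documents", "openai_metrics")


def clear_tracker_rows(conn: sqlite3.Connection, doc_ids: list[int]) -> dict[str, int]:
    """Delete tracker rows for the given Paperless document IDs.

    Returns the number of rows deleted per table.
    """
    if not doc_ids:
        return {table: 0 for table in TRACKER_TABLES}

    placeholders = ",".join("?" for _ in doc_ids)
    deleted: dict[str, int] = {}
    with conn:  # commits on success, rolls back if any DELETE fails
        for table in TRACKER_TABLES:
            cur = conn.execute(
                f"DELETE FROM {table} WHERE document_id IN ({placeholders})",
                doc_ids,
            )
            deleted[table] = cur.rowcount
    return deleted
```

The doc IDs would come from the reviewed output of the Django query above, not from the raw `icontains='Siebels'` match, so that legitimate author cases stay untouched.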
Done
Acceptance criteria met. Doc 280 (`Vattenfall Stromliefervertrag`) is now `correspondent = Vattenfall` (ID 257, head's manually-added canonical). The old wrong assignment to `Matthias Siebels` is gone, and the live SYSTEM_PROMPT now forbids the recipient-as-correspondent pattern.
What changed
Commit: `7ba5bb9` on branch `mai/hermes/issue-3-paperless-ai`.
- Drift reconciled. The live `/app/data/.env` SYSTEM_PROMPT (130 lines, with full fuzzy-matching catalogue + correspondent list + title-generation rules) is now the baseline in `infra/paperless/SYSTEM_PROMPT.txt`. The old repo version (24 lines) had been left behind during the otto#429 / otto#433 / otto#435 expansions. Repo now uses a `{{CORRESPONDENTS_LIST}}` placeholder rendered at deploy time so the names stay in sync with Paperless automatically.
- Recipient rule added at the top of the Correspondents section.
- `Bevorzuge IMMER existierenden` softened to `Bevorzuge existierende bei klarer semantischer Ähnlichkeit; lege neue an wenn der Sender wirklich neu ist`, plus explicit example: a new utility provider must not be force-mapped onto eprimo/Vodafone just because they're the nearest existing energy/telco name.
- Deploy mechanism migrated from `m/otto` to this repo as `infra/paperless/push_system_prompt.py`. New: `RECIPIENT_EXCLUDE` filter strips `Matthias / Mathias Siebels` from the rendered correspondents list, as defense in depth so the LLM never sees those names as candidate senders. The Paperless correspondent records (IDs 3, 255) are preserved for historical doc assignments.
- Live `.env` updated (backup `/app/data/.env.bak.20260516T162255`, then a second push `.bak.20260516T163039` after the consolidation step). Container restarted twice.
Reclassification batch
41 docs had `correspondent` = Matthias Siebels (27) / Mathias Siebels (1) / Paul Siebels (13). Reviewed each by OCR content:
- 39 docs: `correspondent` set to null in Paperless, plus rows deleted from paperless-AI's sqlite tracker (`processed_documents`, `history_documents`, `openai_metrics`). All reprocessed on the post-restart initial scan.
- 2 docs kept: one doc (Vollmacht, i.e. power of attorney; Paul is Vollmachtgeber, the grantor, to Jochen Janssen) and doc #130 (Schadensmeldung, i.e. damage report; Paul fills out a VHV form). Both are genuine Paul-as-author cases that match the new rule's exception.
Doc 280 verification
Reprocess output from the live scan log:
AI initially picked `Vattenfall Europe Sales GmbH` (ID 258, which it had auto-created in the original misclassification scan) instead of head's canonical `Vattenfall` (ID 257). The prompt example `"Vattenfall Europe Sales GmbH" -> "Vattenfall"` was ignored because the exact-match path in `server.js` wins over the prompt rule when both names exist. Manual cleanup:
- Doc 280: `corr=258 -> corr=257`
- Doc 283: `corr=258 -> corr=257`
- Correspondent 258 (`Vattenfall Europe Sales GmbH`) deleted
- Remaining canonical: `Vattenfall`
Final state:
- `Vattenfall` (ID 257) has 2 docs (280, 283).
- `Telekom Deutschland GmbH` (ID 256) has 1 doc (284). Doc 284 is actually a real Telekom Glasfaser letter, so that one stays.
Follow-ups surfaced (not in this issue's scope)
Reprocessing 39 docs in one batch exposed two paperless-AI architecture issues worth tracking separately:
- Duplicate correspondents were created within the same scan (`Praxis Irle`, `Irle`, `Praxis für Psychotherapie und Coaching`) because `paperlessService.listCorrespondentsNames()` is called once at scan start; correspondents created during the scan aren't visible to subsequent docs. The fuzzy rule in the prompt doesn't help because both candidates are equally OCR-derived. Fix would be either (a) re-fetch correspondents between docs, or (b) batch consolidate after each scan.
- The AI created `Hogan Lovells` (ID 267) even though `Hogan Lovells International LLP` already exists. The OCR returned the short brand, and the AI didn't apply the explicit prompt example mapping the short form to the LLP entry.
I'll leave both for a follow-up issue if m wants them addressed; they're not the Siebels misclassification this issue is about.
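For option (b), a post-scan consolidation pass could start from something like the sketch below. It is an illustration only, not part of paperless-AI: `duplicate_candidates` is a hypothetical helper, and the token-containment rule is an assumed heuristic that only proposes pairs for manual review.

```python
import re
from itertools import combinations


def _tokens(name: str) -> frozenset[str]:
    """Lowercased word tokens of a correspondent name."""
    return frozenset(re.findall(r"\w+", name.lower()))


def duplicate_candidates(names: list[str]) -> list[tuple[str, str]]:
    """Return (shorter, longer) pairs where one name's tokens are contained in the other's.

    The pairs are suggestions for manual review, not automatic merges.
    """
    pairs = []
    for a, b in combinations(names, 2):
        ta, tb = _tokens(a), _tokens(b)
        if not ta or not tb:
            continue  # skip names without any word characters
        if ta <= tb or tb <= ta:
            # Order each pair by token count so the fuller name comes second.
            shorter, longer = sorted((a, b), key=lambda n: (len(_tokens(n)), n))
            pairs.append((shorter, longer))
    return pairs
```

On the names from this issue it would pair `Irle` with `Praxis Irle` and `Hogan Lovells` with `Hogan Lovells International LLP`, but it would miss `Praxis für Psychotherapie und Coaching`, which shares no distinguishing token with `Irle`. That case still needs OCR context or a human.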
Title not overwritten (minor)
Doc 280's title stayed `Information - Matthias Siebels` even after reprocess. paperless-AI's `saveDocumentChanges` apparently preserves existing titles when `ACTIVATE_TITLE=yes` (separate from correspondent). Not blocking; m can rename manually or this can be a separate cleanup batch.
Files
- `infra/paperless/SYSTEM_PROMPT.txt`: reconciled + new rules + placeholder
- `infra/paperless/push_system_prompt.py`: migrated from m/otto, added `RECIPIENT_EXCLUDE` filter
- `infra/paperless/README.md`: documented the deploy mechanism
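For readers without the repo at hand, the `{{CORRESPONDENTS_LIST}}` rendering with the `RECIPIENT_EXCLUDE` filter might look roughly like this. It is a sketch under assumptions, not the contents of `push_system_prompt.py`: the function name, the bullet format, and the case-insensitive matching are all hypothetical, and only the placeholder and filter names come from this issue.

```python
# Placeholder and filter names are taken from this issue; everything else
# here is a hypothetical illustration, not the real push_system_prompt.py.
PLACEHOLDER = "{{CORRESPONDENTS_LIST}}"
RECIPIENT_EXCLUDE = {"matthias siebels", "mathias siebels"}


def render_prompt(template: str, correspondents: list[str]) -> str:
    """Fill the placeholder with one correspondent per line, minus excluded recipients."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template is missing {PLACEHOLDER}")

    # Deduplicate, drop blanks, and drop the recipient's own name variants.
    kept = sorted(
        {
            name.strip()
            for name in correspondents
            if name.strip() and name.strip().casefold() not in RECIPIENT_EXCLUDE
        },
        key=str.casefold,
    )
    listing = "\n".join(f"- {name}" for name in kept)
    return template.replace(PLACEHOLDER, listing)
```

Filtering at render time leaves the Paperless correspondent records themselves untouched, which matches the note above that IDs 3 and 255 are preserved for historical assignments.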