Latest commit a2fa76a41a by mAi: "mAi: #4 - paperless-AI prompt: intra-scan dedup + short-brand prefix match"
Two prompt-only rules added to address follow-ups from #3:

1. Intra-scan dedup (new rule 4 in Correspondents section): when
   processing multiple docs from the same sender in one scan batch,
   reuse the correspondent name created earlier in the same session
   instead of letting each doc create a fresh alias. Triggered by
   paperless-AI creating 3 Praxis-Irle aliases in one batch (no native
   batch-context plumbing; best-effort via prompt).

2. Short-brand prefix match (extension of the Fuzzy-Regel, i.e. the
   fuzzy-match rule): if the OCR name is a strict prefix of an existing
   correspondent (or vice versa) and the first 2 brand tokens match,
   use the existing correspondent.
   Triggered by 'Hogan Lovells' creating a new correspondent despite
   'Hogan Lovells International LLP' already existing.
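
Both rules are prompt-only, so no such code exists in this repo. Purely as an illustration of the intended matching logic, here is a minimal Python sketch. The function names (brand_tokens, is_prefix_match, resolve_correspondent) are hypothetical, and the sketch assumes that "prefix" is judged on whitespace/punctuation-separated tokens rather than on raw characters, which the prompt itself may handle differently.

```python
import re


def brand_tokens(name: str) -> list[str]:
    """Split a correspondent name into case-folded word tokens."""
    return re.findall(r"\w+", name.casefold())


def is_prefix_match(ocr_name: str, existing: str, min_tokens: int = 2) -> bool:
    """True if one name is a strict token prefix of the other and the
    shared prefix is at least `min_tokens` tokens long (assumption: this
    is what "first 2 brand tokens match" means)."""
    a, b = brand_tokens(ocr_name), brand_tokens(existing)
    if len(a) == len(b):
        return False  # a strict prefix must be shorter than the other name
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    if len(short) < min_tokens:
        return False
    return long_[: len(short)] == short


def resolve_correspondent(
    ocr_name: str, existing: list[str], created_in_session: list[str]
) -> str:
    """Pick a correspondent name for one document.

    Checks the known correspondents first, then names created earlier in
    the same scan batch (the intra-scan dedup idea). Only when nothing
    matches is the OCR name recorded as a new correspondent.
    """
    tokens = brand_tokens(ocr_name)
    for candidate in [*existing, *created_in_session]:
        if brand_tokens(candidate) == tokens or is_prefix_match(ocr_name, candidate):
            return candidate
    created_in_session.append(ocr_name)
    return ocr_name
```

With this sketch, resolve_correspondent("Hogan Lovells", ["Hogan Lovells International LLP"], []) returns the existing long form instead of creating a new entry. A single-token name such as "Hogan" deliberately does not match, since two brand tokens are required.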

Deployed via push_system_prompt.py --apply, container restarted, both
strings verified present in /app/data/.env (backup at
.env.bak.20260521T092606). Effectiveness will be observed as
multi-doc scans flow through.
2026-05-21 11:26:40 +02:00

mDMS

m's document management — Paperless-ngx + AI-classification pipeline, Canon scanner SMB bridge, strategy + tooling.

Spun out from m/otto on 2026-05-15; issues #429 through #438 in m/otto are the provenance trail. Going forward, all mDMS work lives here.

Layout

mDMS/
├── docs/
│   └── strategy.md          # Taxonomy, layout, conventions (the bible)
├── infra/
│   ├── paperless/           # Paperless-AI config: SYSTEM_PROMPT, audit scripts,
│   │                        # migrate_types.py, deploy docker-compose
│   └── samba-canon/         # SMB1 bridge container for Canon MB5100 scanner
│                            # (host-network + nmbd, SMB1+NTLMv1 for old printer)
└── README.md

Components

Paperless-ngx (deployment)

Compose lives in m/paperless (separate repo). That repo is the deployment artifact; ~/paperless/ on mDock is its checkout. This repo (m/mDMS) tracks the AI classification layer that sits on top of Paperless-ngx (infra/paperless/SYSTEM_PROMPT.txt, the type/tag/correspondent migration scripts, the audit pipeline).

Paperless-AI

Runs on mdock:3077 in front of Paperless-ngx (mdock:8777). Classifies each ingested document into one of the 10 canonical types and ≤2 of the 13 canonical tags. The system prompt + the migration scripts in infra/paperless/ are the source of truth — keep this repo and the live Paperless-AI aidata/.env in sync.
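
The "one type, at most two tags" constraint can be checked mechanically, for example in an audit step. The following is a minimal sketch, not the audit pipeline's actual code: validate_classification is a hypothetical helper, and the canonical names are passed in as parameters because this README does not list them.

```python
def validate_classification(doc_type, tags, canonical_types, canonical_tags, max_tags=2):
    """Return a list of problems; an empty list means the result is valid.

    canonical_types / canonical_tags are supplied by the caller; the real
    10 types and 13 tags are not reproduced here.
    """
    problems = []
    if doc_type not in canonical_types:
        problems.append(f"unknown type: {doc_type!r}")
    if len(tags) > max_tags:
        problems.append(f"too many tags: {len(tags)} > {max_tags}")
    unknown = sorted(set(tags) - set(canonical_tags))
    if unknown:
        problems.append(f"unknown tags: {unknown}")
    return problems
```

Returning a list of problems rather than raising lets a caller report every violation for a document in one pass.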

Canon SMB bridge

infra/samba-canon/ is the host-network Samba 4.10 container on mDock that the Canon MB5100 scans to. Files land in /mnt/mdms/inbox/ (NFS from mTrueNAS) and Paperless polls every 60s. The two-stage inbox (staging dir + age-gated mover) lives separately under ~/mdms-mover/ on mDock — see m/otto issue #438.
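
The age-gated mover itself is not part of this repo. As a rough illustration of the idea (only hand a file on once the scanner has stopped writing to it), here is a minimal sketch. It assumes that "age" means time since the last modification and uses a hypothetical 30-second threshold; the real mover under ~/mdms-mover/ may use different criteria.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def move_settled_files(
    source: Path, target: Path, min_age_s: float = 30.0, now: Optional[float] = None
) -> list[Path]:
    """Move regular files from `source` to `target` once their mtime is
    at least `min_age_s` seconds old. Returns the destination paths."""
    now = time.time() if now is None else now
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(source.iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories and other non-files
        if now - entry.stat().st_mtime < min_age_s:
            continue  # possibly still being written by the scanner
        dest = target / entry.name
        if dest.exists():
            continue  # never overwrite a file that is already staged
        shutil.move(str(entry), str(dest))
        moved.append(dest)
    return moved
```

Called periodically, for example as move_settled_files(Path("/mnt/mdms/inbox"), Path("/mnt/mdms/toprocess")), this keeps half-written scans out of the directory the consumer watches.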

Data

NFS-mounted from mTrueNAS: /mnt/mPool/mdms/ → /mnt/mdms/ on all consumers. Layout:

/mnt/mPool/mdms/
├── inbox/         # SMB scanner target (Canon writes here)
├── toprocess/     # Age-gated staging → Paperless consumes here
├── paperless/     # Paperless storage (post-ingest)
├── archive/       # Long-term archive
├── templates/     # Document templates
└── export/        # Manual exports
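
A consumer can sanity-check that the NFS mount is present and complete before touching it. A minimal sketch, assuming the six directories listed above are the full expected set; missing_subdirs is a hypothetical helper, not a script in this repo.

```python
from pathlib import Path

# Subdirectories taken from the layout above.
EXPECTED_SUBDIRS = ("inbox", "toprocess", "paperless", "archive", "templates", "export")


def missing_subdirs(root):
    """Return the expected subdirectories that do not exist under `root`."""
    return [name for name in EXPECTED_SUBDIRS if not (Path(root) / name).is_dir()]
```

On a consumer, missing_subdirs("/mnt/mdms") returning a non-empty list would suggest the share is not mounted or is incomplete.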

Reference

  • docs/strategy.md — full strategy, taxonomy decisions, type/tag rationale
  • m/otto issues #429 through #438 — original implementation history
  • m/paperless — the bare Paperless-ngx Docker Compose setup