Scan-Stack: Split a multi-page scan into separate documents (barcode or blank-page separator) #1

Open
opened 2026-05-15 17:01:23 +00:00 by mAi · 1 comment
Collaborator

Request

m (2026-05-15, PWA voice):

We should make it possible for me to scan a whole stack of papers and have it divided into different documents, which are then treated as individual documents in Paperless. The problem is that I currently have to run a separate scan job for every letter. This idea requires that, before copying from Inbox to Process, we analyse which letters belong together. We could also try this, for example, by placing blank sheets or specially marked sheets in between.

(In short: scan a whole stack of papers in one ADF run and have it split automatically into multiple separate documents in Paperless. Today every letter requires its own scan job. Blank pages or marked separator pages between letters can serve as dividers.)

Multiple letters in one ADF run, automatic split into N Paperless documents.

Status

Paperless-ngx has a built-in barcode splitter: PAPERLESS_CONSUMER_ENABLE_BARCODES=true plus optionally PAPERLESS_CONSUMER_BARCODE_SCANNER (PYZBAR | ZXING; PYZBAR is the default). When enabled, Paperless scans each page on consume for the configured separator barcode (ASN barcodes are handled by a separate setting, PAPERLESS_CONSUMER_ENABLE_ASN_BARCODE). If a separator is found, it splits the PDF at that page and creates a separate document for each chunk.

Docs: https://docs.paperless-ngx.com/usage/#barcodes

The default separator is a sheet carrying a barcode that encodes the string PATCHT (named after the Patch T separator code used with document scanners). The string is configurable via PAPERLESS_CONSUMER_BARCODE_STRING.

Implementation options

Option A — Paperless built-in barcode splitter (recommended)

  • Set PAPERLESS_CONSUMER_ENABLE_BARCODES=true in the compose env.
  • Pick the scanner engine (pyzbar is the default and needs no extra configuration).
  • Prepare ~10 printed patch-T separator sheets; m drops one between each pair of documents in the stack.
  • ADF scan → arrives as one PDF in inbox/ → mover pushes to toprocess/ after the mtime check → Paperless consumes, finds the patch-T pages, auto-splits into N documents → Paperless-AI classifies each individually.

No code change on our side, only compose config + printed separator sheets.
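
To make the splitting behaviour concrete, here is a minimal sketch of the grouping logic. It is an illustration only, not Paperless's actual implementation: it assumes each page has already been reduced to the barcode text found on it (or None), and that separator pages are discarded. The function name `split_pages` is hypothetical.

```python
from typing import List, Optional

SEPARATOR = "PATCHT"  # Paperless's default separator string


def split_pages(page_barcodes: List[Optional[str]],
                separator: str = SEPARATOR) -> List[List[int]]:
    """Group page indices into documents, cutting at every separator page.

    Separator pages themselves are dropped, and empty groups (for example
    from two separators in a row) are not emitted.
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for index, value in enumerate(page_barcodes):
        if value == separator:
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents


# Hypothetical stack: 2-page letter, separator, 1-page letter, separator, 1-page letter
stack = [None, None, "PATCHT", None, "PATCHT", None]
print(split_pages(stack))  # [[0, 1], [3], [5]]
```

A six-page scan with two separator sheets therefore yields three documents with four pages in total.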

Option B — Blank-page detection in the mover

  • Extend the mDMS mover (infra/mdms-mover/mover.sh): before the mv inbox→toprocess step, analyse each PDF, treat near-empty pages (< 1% ink coverage) as separators, split with pdftk or qpdf.
  • Downside: false positives on legitimate back-side blanks; m would have to be disciplined about not scanning duplex, or the tool would need to detect double-blank pages as the separator.

More flexible but more error-prone.
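
A minimal sketch of what the blank-page heuristic could look like, written in Python for illustration (the real mover is a shell script, and none of this exists yet). It assumes each page has already been rasterised to a grayscale pixel grid with values 0 to 255; the names `ink_coverage` and `split_at_blank_pages` and the darkness cutoff of 128 are hypothetical. The actual split would still be done by pdftk or qpdf.

```python
from typing import List, Sequence


def ink_coverage(pixels: Sequence[Sequence[int]], dark_below: int = 128) -> float:
    """Fraction of pixels darker than `dark_below` (0 = black, 255 = white)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return 0.0
    dark = sum(1 for row in pixels for value in row if value < dark_below)
    return dark / total


def split_at_blank_pages(coverages: Sequence[float],
                         blank_below: float = 0.01) -> List[List[int]]:
    """Treat pages with less than 1% ink coverage as separators and drop them."""
    documents: List[List[int]] = []
    current: List[int] = []
    for index, coverage in enumerate(coverages):
        if coverage < blank_below:
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents


# Two written pages, one blank sheet, one written page
print(split_at_blank_pages([0.05, 0.03, 0.0, 0.08]))  # [[0, 1], [3]]
```

The duplex problem shows up directly: a two-sided letter whose back side is empty produces coverages like `[0.05, 0.0, 0.04]`, which this heuristic would wrongly cut into two documents.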

Option C — Paperless-AI semantic splitting

  • After OCR we could ask Paperless-AI itself: "is this multiple documents? Where to split?" LLM-based.
  • Works without separator pages, but more expensive per scan, slower, less deterministic.
  • Fallback / experimentation path.

Recommendation

Start with Option A — Paperless's built-in feature is mature, documented, deterministic. Printing patch-T separators is seconds of effort on scan days.

If Option A turns out to be annoying in practice (m hates inserting separator sheets), follow up with Option B or C.

Scope of this issue

  1. Check current state: is PAPERLESS_CONSUMER_ENABLE_BARCODES already set anywhere? If not — which compose file owns this? (~/paperless/docker-compose.yml on mDock per the live deployment.)
  2. Set the variable, restart the container, test with a manually constructed stack (3 PDFs with a patch-T page between them, merged into a single multi-page PDF, dropped in inbox/).
  3. Verify: 3 documents land separately in Paperless; Paperless-AI classifies each correctly.
  4. Generate patch-T separator sheets as a PDF in infra/paperless/ — m prints them and reuses.
  5. Update docs/strategy.md: new section "Multi-page scan + automatic splitting" with instructions on how m uses the separator sheets.

Out of scope

  • Option B + C as initial implementation (only document them as follow-up paths).
  • Native Canon MB5100 multi-PDF output (Canon outputs one PDF per scan job; it doesn't split).
  • Custom barcode codes (patch-T is enough for this use-case).

Role: gitster (compose edit, test scan, generate separator PDF, doc update, commit, issue comment).

mAi self-assigned this 2026-05-16 13:26:18 +00:00
Author
Collaborator

Done: Option A (Paperless built-in Patch-T splitter)

What is live (mDock)

~/paperless/docker-compose.yml, webserver service:

PAPERLESS_CONSUMER_ENABLE_BARCODES: "true"
PAPERLESS_CONSUMER_BARCODE_DELETE_PAGES: "true"

Container recreated with docker compose up -d webserver; env vars verified via docker exec ... env | grep -i barcode. Default separator string PATCHT, scanner pyzbar, DPI 300: all fine at their defaults.

Backup of the old compose file: ~/paperless/docker-compose.yml.bak.before-barcodes-20260516T134549.

What is committed

m/mDMS branch mai/hermes/issue-1-scan-stack-multi (commit 061ea42):

  • infra/paperless/generate_separator.py: generator script with uv inline dependencies (reportlab + python-barcode)
  • infra/paperless/separator-patchT.pdf: printable A4 separator page, Code 128 barcode encoding PATCHT, visible header + footer (print 10–20 copies and keep them next to the scanner)
  • docs/strategy.md: new section "Multi-page scan + automatic splitting (Barcode-Separator)"
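
For orientation, a separator-page generator along these lines might look roughly as follows. This is a sketch under assumptions, not the committed generate_separator.py: the function names, header text, and layout values are hypothetical, and Pillow is assumed as an extra dependency for python-barcode's image writer.

```python
# /// script
# dependencies = ["reportlab", "python-barcode", "pillow"]
# ///
from typing import Tuple


def centered_box(page_w: float, page_h: float, img_w: float, img_h: float,
                 max_width: float) -> Tuple[float, float, float, float]:
    """Scale the image to `max_width` (keeping aspect ratio) and centre it on the page."""
    scale = max_width / img_w
    width, height = img_w * scale, img_h * scale
    return (page_w - width) / 2, (page_h - height) / 2, width, height


def make_separator_pdf(out_path: str, text: str = "PATCHT") -> None:
    """Write a one-page A4 PDF with a Code 128 barcode encoding `text`."""
    # Third-party imports are kept local so centered_box stays usable without them.
    from barcode import Code128
    from barcode.writer import ImageWriter
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    image = Code128(text, writer=ImageWriter()).render()  # PIL image
    page_w, page_h = A4
    x, y, width, height = centered_box(page_w, page_h, image.width, image.height,
                                       max_width=page_w * 0.6)

    pdf = canvas.Canvas(out_path, pagesize=A4)
    pdf.setFont("Helvetica-Bold", 20)
    pdf.drawCentredString(page_w / 2, page_h - 72, "SEPARATOR PAGE")  # hypothetical header
    pdf.drawImage(ImageReader(image), x, y, width=width, height=height)
    pdf.showPage()
    pdf.save()


if __name__ == "__main__":
    make_separator_pdf("separator-patchT.pdf")
```

With uv installed, a script carrying such an inline dependency header can be run directly with `uv run`.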

m/paperless source of truth (commit 8c1ca3f): the same two env vars in docker-compose.yml on main.

Test (2026-05-16, live)

Constructed stack:

  • TEST-A Stadtwerke (2 pages) + PATCHT + TEST-B Finanzamt (1) + PATCHT + TEST-C Versicherung (1) = 6-page PDF
  • Pre-check: the barcode decodes cleanly as CODE128: PATCHT (inside the paperless container with pdf2image + pyzbar, i.e. the same stack the splitter itself uses)
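
A pre-check of this kind can be sketched as follows. The exact commands used in the container are not recorded here, so this is an assumed reconstruction using the public pdf2image and pyzbar APIs; the function names and the file path in the usage comment are hypothetical.

```python
from typing import List, Sequence, Tuple

Hit = Tuple[str, str]  # (symbology, decoded text), e.g. ("CODE128", "PATCHT")


def decode_pages(pdf_path: str, dpi: int = 300) -> List[List[Hit]]:
    """Rasterise each PDF page and return the barcodes pyzbar finds on it."""
    # Imported lazily so separator_pages() can be used without these packages.
    from pdf2image import convert_from_path
    from pyzbar.pyzbar import decode

    results: List[List[Hit]] = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        results.append([(hit.type, hit.data.decode("utf-8")) for hit in decode(image)])
    return results


def separator_pages(decoded: Sequence[Sequence[Hit]],
                    separator: str = "PATCHT") -> List[int]:
    """Indices of pages carrying a barcode whose text equals the separator string."""
    return [index for index, hits in enumerate(decoded)
            if any(text == separator for _, text in hits)]


# Usage inside the container (path is a placeholder):
#   print(separator_pages(decode_pages("/tmp/test-stack.pdf")))
```

For a stack laid out like the one above, the separator pages would be expected at indices 2 and 4.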

Dropped into /mnt/mdms/toprocess/mdms-issue1-test-stack.pdf. Paperless log:

[paperless.management.consumer] Adding /usr/src/paperless/consume/mdms-issue1-test-stack.pdf to the task queue.
[paperless.barcodes] Created new task ... for mdms-issue1-test-stack_document_0.pdf
[paperless.barcodes] Created new task ... for mdms-issue1-test-stack_document_1.pdf
[paperless.barcodes] Created new task ... for mdms-issue1-test-stack_document_2.pdf
[paperless.tasks] BarcodePlugin requested task exit: Barcode splitting complete!
... 3× Consuming ... Success. New document id 277/278/279 created

DB verification:

 id  |               title               | page_count
-----+-----------------------------------+------------
 277 | mdms-issue1-test-stack_document_0 |          2
 278 | mdms-issue1-test-stack_document_1 |          1
 279 | mdms-issue1-test-stack_document_2 |          1

→ 6 input pages, 4 output pages (2+1+1). Separator pages cleanly discarded (DELETE_PAGES=true). Test documents deleted via the Django shell after verification (as the paperless user, so the FileLock works).

Workflow for m

  1. Print infra/paperless/separator-patchT.pdf (10–20 copies).
  2. When scanning: letter + separator page + letter + separator page + … into the ADF.
  3. Scan the stack as one PDF → ~/mDMS/inbox/ (SMB) → mdms-mover → Paperless splits automatically.

Out of scope (follow-ups if the Patch-T pages become annoying)

  • Option B (blank-page detection in the mover): code in infra/mdms-mover/ (the source still lives in m/otto per CLAUDE.md, migration pending). False positives on duplex back sides.
  • Option C (LLM-based semantic splitting via Paperless-AI): more expensive per scan, non-deterministic.
  • Document only, do not build, until A becomes a nuisance.

Drift flag

The m/paperless docker-compose.yml has drifted heavily from the live ~/paperless/docker-compose.yml on mDock:

  • Paths: /home/m/data/paperless/... (repo) vs. /mnt/mdms/paperless/... + /mnt/mdms/toprocess (live)
  • Image pin: :latest (repo) vs. :2.20.6 (live)
  • dokploy-network is present in the repo but not in the live compose file
  • paperless-ai: clusterzx/paperless-ai:latest (repo) vs. custom mdock/paperless-ai:3.0.9-restrict-patch with a local build (live)

The barcode env vars are present in both. The broader drift is out of scope for this issue; it would merit its own issue in the m/paperless repo once someone uses the compose file there as the deployment source.

Branch: https://mgit.msbls.de/m/mDMS/src/branch/mai/hermes/issue-1-scan-stack-multi

mAi added the done label 2026-05-16 13:54:01 +00:00
Reference: m/mDMS#1