mover: strip blank pages (duplex scans leave empty backside per patch-T separator) #2

Open
opened 2026-05-16 15:46:23 +00:00 by mAi · 1 comment

Problem

Now that the patch-T splitter is live (#1) and m scans duplex, every document after a separator inherits an empty first page — the backside of the separator sheet that the scanner picks up.

The Canon MB5100 LCD doesn't expose a working "skip blank page" setting in m's experience. Paperless-ngx has no native blank-page-removal env (only BARCODE_DELETE_PAGES for the separator itself, which is already on). So the fix belongs in the mover: strip blank pages from each PDF after the stability check, before promoting it to toprocess.

Scope

  1. Migrate ~/mdms-mover/mover.sh source into this repo under infra/mdms-mover/ (currently still in m/otto per docs/strategy.md + project CLAUDE.md "Live deployment touchpoints" → the otto#438 migration debt). Same script, just move + adjust paths.
  2. Add blank-page stripping as a step inside the mover, run after the stability check and before mv to toprocess.
  3. Deploy to mDock ~/mdms-mover/ (replace live).

Implementation hint (option set)

  • Python with pikepdf + pdf2image + PIL: open the PDF, render each page to a small thumbnail, drop pages whose extracted text is empty AND whose pixel histogram is >97% white. Threshold tunable via env var (MDMS_BLANK_THRESHOLD, default 0.97).
  • Or qpdf --pages + a pdf-to-text whiteness check.
  • Whatever's chosen: must be uv-inline-deps so the script stays single-file (matches the infra/paperless/generate_separator.py pattern hermes set up in #1).
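
A minimal sketch of how the single-file shape and the two env vars could fit together, assuming the PEP 723 inline-metadata header that `uv run` reads. The dependency list simply mirrors the first option above. `read_threshold` and `strip_enabled` are hypothetical helper names, and falling back to the default on a malformed or out-of-range value is an assumption, not something this issue specifies.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["pikepdf", "pdf2image", "pillow"]
# ///
"""Sketch: config parsing for a single-file blank-page stripper."""
import os

DEFAULT_THRESHOLD = 0.97


def read_threshold(env=os.environ) -> float:
    """Return MDMS_BLANK_THRESHOLD as a float in (0, 1], else the default."""
    raw = env.get("MDMS_BLANK_THRESHOLD", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD  # unset or malformed: assumed fallback
    return value if 0.0 < value <= 1.0 else DEFAULT_THRESHOLD


def strip_enabled(env=os.environ) -> bool:
    """Stripping is on unless MDMS_STRIP_BLANK is explicitly 'false'."""
    return env.get("MDMS_STRIP_BLANK", "true").strip().lower() != "false"
```

Passing the environment in as a mapping keeps both helpers testable without touching the real process environment.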

Edge cases to handle:

  • Faintly-marked pages (form lines, edge artifacts): threshold should be conservative, prefer keeping a borderline page over dropping it. False-negatives (keep too many) are recoverable in Paperless; false-positives (drop a real page) silently lose data.
  • Already-1-page PDF: skip stripping entirely.
  • All-pages-blank PDF: keep the original, log a warning, don't move (would result in empty doc).
  • Stripping should be opt-out via env var (MDMS_STRIP_BLANK=false) for emergency disable.
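
The edge cases above could be captured in one small planning step that runs before any page is removed. This is only a sketch: `Page`, `is_blank`, and `plan_strip` are hypothetical names, the per-page text and white fraction are assumed to be computed elsewhere, and exit code 2 for the all-blank case is an illustrative choice.

```python
from dataclasses import dataclass

EXIT_OK = 0
EXIT_ALL_BLANK = 2  # assumed code; the issue only asks for "keep, warn, don't move"


@dataclass
class Page:
    text: str              # text extracted from the page
    white_fraction: float  # share of near-white pixels in the thumbnail, 0..1


def is_blank(page: Page, threshold: float = 0.97) -> bool:
    """Blank only if there is no text AND the page is almost entirely white."""
    return not page.text.strip() and page.white_fraction > threshold


def plan_strip(pages, threshold=0.97):
    """Return (exit_code, indices_to_keep) for one document."""
    if len(pages) <= 1:
        return EXIT_OK, list(range(len(pages)))  # single page: never strip
    keep = [i for i, p in enumerate(pages) if not is_blank(p, threshold)]
    if not keep:
        return EXIT_ALL_BLANK, []  # caller keeps the original and logs a warning
    return EXIT_OK, keep
```

Requiring both conditions is what keeps the check conservative: a page with any extracted text survives no matter how white it renders.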

Test plan

  • Build a test PDF: 2 real pages + 1 blank + 1 real page (mimicking duplex post-separator).
  • Drop into /mnt/mdms/inbox/, wait for mover ticks.
  • Verify: the file lands in /mnt/mdms/toprocess/ with 3 pages, the blank dropped.
  • Verify: a 1-page PDF passes through unchanged.
  • Verify: an all-blank PDF stays in inbox with a warning log line (visible via journalctl --user -u mdms-mover).

Out of scope

  • Paperless-side blank handling — confirmed there's no native env for this.
  • Canon-side blank-skip — m has looked, the setting isn't accessible/working on this firmware.
  • More aggressive ML-based detection — start with histogram, only invest more if false rates are too high in practice.

Repo context

  • Live mover: mDock ~/mdms-mover/mover.sh
  • Target source location: infra/mdms-mover/ in this repo (new directory)
  • Companion: infra/paperless/generate_separator.py is the same uv-inline-deps single-file Python pattern to mirror.
  • Project CLAUDE.md spinout note mentions otto#438 as the original implementation issue — useful as historical context if needed.
mAi self-assigned this 2026-05-16 15:46:23 +00:00

Done. Mover now strips blank pages before promoting to toprocess. Migrated infra/mdms-mover/ from m/otto (otto#438 → mDMS#2) and added strip_blank_pages.py.

Commit: 90142396d8
Branch: mai/hermes/issue-2-mover-strip — open PR: https://mgit.msbls.de/m/mDMS/pulls/new/mai/hermes/issue-2-mover-strip

How it works

mover.sh calls strip_blank_pages.py <inbox/foo.pdf> <toprocess/.mdms-tmp.PID.foo.pdf> after the stability check. A page is dropped iff both:

  1. embedded text is empty / whitespace-only, AND
  2. rendered thumbnail is ≥ MDMS_BLANK_THRESHOLD near-white pixels (default 0.97).
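
The two conditions can be sketched without any PDF library once the thumbnail has been rendered. The code below assumes the renderer hands back 8-bit grayscale samples, one byte per pixel. The `cutoff` of 245 for "near-white" and both function names are illustrative assumptions, not values taken from `strip_blank_pages.py`.

```python
def near_white_fraction(samples: bytes, cutoff: int = 245) -> float:
    """Share of 8-bit grayscale samples at or above `cutoff` (255 = pure white)."""
    if not samples:
        return 0.0  # no pixels: report "not white" so nothing gets dropped
    white = sum(1 for value in samples if value >= cutoff)
    return white / len(samples)


def page_is_blank(text: str, samples: bytes, threshold: float = 0.97) -> bool:
    """Drop a page only if it has no text AND is mostly near-white."""
    return not text.strip() and near_white_fraction(samples) >= threshold
```

For example, `page_is_blank("", bytes([255]) * 90 + bytes([0]) * 10)` is false, because only 90% of the pixels count as near-white.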

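The .mdms-tmp.* dotfile prefix keeps Paperless from picking up the half-written staging file (the mover skips dotfiles too). On success (rc=0) the staged file is renamed into place. On rc=2 (all-blank) the original stays in inbox; on any other non-zero rc the mover falls back to a plain mv of the untouched original, so the source file is never modified.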
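
The staging step is done in mover.sh, but the same idea can be sketched in Python. Everything here is an illustration under assumptions: `staging_path` and `promote` are hypothetical names, `write_staged` stands in for the strip script, and removing the inbox copy after a successful rename is assumed rather than confirmed.

```python
import os
from pathlib import Path


def staging_path(dest_dir: Path, name: str) -> Path:
    """Dotfile name in the destination directory, unique per process."""
    return dest_dir / f".mdms-tmp.{os.getpid()}.{name}"


def promote(src: Path, dest_dir: Path, write_staged) -> Path:
    """Write via write_staged(src, tmp), then rename tmp into place."""
    tmp = staging_path(dest_dir, src.name)
    try:
        write_staged(src, tmp)
        final = dest_dir / src.name
        os.replace(tmp, final)  # atomic: tmp and final share a directory
        src.unlink()            # assumed: drop the inbox copy only after the rename
        return final
    finally:
        tmp.unlink(missing_ok=True)  # cleans up on failure; no-op after the rename
```

Because the staged file lives in the destination directory, the rename never crosses a filesystem boundary, so a consumer watching that directory sees either no file or the complete one.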

Stack choice — deviation from issue body

Issue suggested pikepdf + pdf2image + PIL. I went with PyMuPDF as the single Python dep because pdf2image needs system poppler-utils (apt install + sudo on mdock), and CLAUDE.md is explicit about no installs without permission. PyMuPDF ships its C libs in the wheel — uv pulls just one wheel (~24 MB) into ~/.cache/uv/ and the whole pipeline runs in user-space. Same single-file uv-inline-deps pattern as infra/paperless/generate_separator.py.

Flag if you want me to switch to the pikepdf + pdf2image + PIL stack — that would need sudo apt install poppler-utils on mdock.

Edge cases handled

| case | behavior |
|---|---|
| 1-page input | strip skipped, file copied unchanged |
| all pages would drop | script exits 2, mover keeps file in inbox, logs WARNING: <name> appears all-blank, kept in inbox |
| strip script errors out | mover falls back to plain mv, no scan blocked |
| MDMS_STRIP_BLANK=false | strip bypassed entirely (emergency disable) |
| non-PDF in inbox | strip bypassed (only *.pdf is touched) |
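
Read as a decision function, the cases above could look like the following sketch. `Action` and `mover_action` are hypothetical names. The real logic lives in mover.sh, so this only mirrors the listed behaviour and is not a copy of the implementation.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    PROMOTE_STRIPPED = "rename staged file into toprocess"
    KEEP_IN_INBOX = "leave original in inbox and log a warning"
    PLAIN_MOVE = "mv original to toprocess unchanged"


def mover_action(name: str, strip_enabled: bool, strip_rc: Optional[int]) -> Action:
    """Pick the mover's action; strip_rc is None when the script was not run."""
    if not strip_enabled or not name.endswith(".pdf") or strip_rc is None:
        return Action.PLAIN_MOVE      # bypass: disabled or non-PDF
    if strip_rc == 0:
        return Action.PROMOTE_STRIPPED  # includes the 1-page "copied unchanged" case
    if strip_rc == 2:
        return Action.KEEP_IN_INBOX   # all pages blank
    return Action.PLAIN_MOVE          # script error: do not block the scan
```

Every failure path ends in either an untouched original in inbox or an untouched original in toprocess, which matches the "no scan blocked" goal.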

Threshold note: the default is 0.97 per the issue spec, but for image-only duplex scans the genuinely blank backside is more like 99.9% white, while real scans with sparse content sit somewhere between those two values. If real pages start getting dropped, raise MDMS_BLANK_THRESHOLD toward 0.99; that is the conservative direction (keeps more). Set it in ~/.config/systemd/user/mdms-mover.service.d/override.conf.

Test results

Local (synthetic PDFs, reportlab):

=== 4-page (2 real + 1 blank + 1 real) ===
test_4page.pdf: dropped 1/4 blank page(s)
rc=0    pages out: 3   ✓

=== 1-page ===
rc=0    pages out: 1   ✓ (unchanged)

=== all-blank (3 blank pages) ===
all pages blank in test_allblank.pdf
rc=2    no output      ✓ (file stays in inbox)

=== MDMS_STRIP_BLANK=false (4-page input) ===
pages out: 4           ✓ (strip bypassed)

Live on mdock (/tmp/mover-live/ sandbox, same 4-page input):

duplex_scan.pdf: dropped 1/4 blank page(s)
pages out: 3

First-run uv cache miss took ~30 s (downloads PyMuPDF wheel + Pillow); subsequent runs reuse the cache.

Live deployment touchpoints

mdock:~/mdms-mover/mover.sh                        # updated
mdock:~/mdms-mover/strip_blank_pages.py            # new
mdock:~/.config/systemd/user/mdms-mover.service    # unchanged shape, daemon-reload'd
mdock:~/.config/systemd/user/mdms-mover.timer      # unchanged shape, daemon-reload'd
mdock:~/.local/bin/uv                              # rsync'd from local (single static binary)
mdock:~/.cache/uv/                                 # pymupdf + pillow wheels (first run)

systemctl --user list-timers mdms-mover.timer → active, next fire ~1 min.
journalctl --user -u mdms-mover -f for live trace.

Verify (m, next time you scan)

Fire a duplex stack with a patch-T separator. Expected: docs after the separator land in Paperless without the empty backside. If anything looks wrong, MDMS_STRIP_BLANK=false in an override drop-in turns it off without breaking the stability gate.

Docs: infra/mdms-mover/README.md.

mAi added the done label 2026-05-16 15:58:22 +00:00
Reference: m/mDMS#2