mover: strip blank pages (duplex scans leave empty backside per patch-T separator) #2
Problem
Now that the patch-T splitter is live (#1) and m scans duplex, every document after a separator inherits an empty first page — the backside of the separator sheet that the scanner picks up.
The Canon MB5100 LCD doesn't expose a working "skip blank page" setting in m's experience. Paperless-ngx has no native blank-page-removal env (only `BARCODE_DELETE_PAGES` for the separator itself, which is already on). So the fix belongs in the mover: strip blank pages from each PDF after the stability check, before promoting it to `toprocess`.
Scope
- `~/mdms-mover/mover.sh` source into this repo under `infra/mdms-mover/` (currently still in `m/otto` per docs/strategy.md + project CLAUDE.md "Live deployment touchpoints" → the otto#438 migration debt). Same script, just move + adjust paths.
- `mv` to `toprocess`.
- `~/mdms-mover/` (replace live).
Implementation hint (option set)
- `pikepdf` + `pdf2image` + `PIL`: open the PDF, render each page to a small thumbnail, drop pages whose extracted text is empty AND whose pixel histogram is >97% white. Threshold tunable via env var (`MDMS_BLANK_THRESHOLD`, default `0.97`).
- `qpdf --pages` + a pdf-to-text whiteness check.
- `infra/paperless/generate_separator.py` pattern hermes set up in #1).
Edge cases to handle:
- `MDMS_STRIP_BLANK=false`) for emergency disable.
Test plan
- `/mnt/mdms/inbox/`, wait for mover ticks.
- `/mnt/mdms/toprocess/` with 3 pages, the blank dropped.
- `journalctl --user -u mdms-mover`).
Out of scope
Repo context
- `~/mdms-mover/mover.sh`
- `infra/mdms-mover/` in this repo (new directory)
- `infra/paperless/generate_separator.py` is the same uv-inline-deps single-file Python pattern to mirror.
Done. Mover now strips blank pages before promoting to `toprocess`. Migrated `infra/mdms-mover/` from `m/otto` (otto#438 → mDMS#2) and added `strip_blank_pages.py`.
Commit: `90142396d8`
Branch: `mai/hermes/issue-2-mover-strip`, open PR: https://mgit.msbls.de/m/mDMS/pulls/new/mai/hermes/issue-2-mover-strip
How it works
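As a rough illustration of the drop rule this section describes, the sketch below works on plain Python values rather than a real PDF. It assumes the first condition is empty extracted text (as in the issue body), that each page has already been rendered to 8-bit grayscale samples, and that a gray level of 250 or more counts as near-white. The names `near_white_fraction`, `is_blank_page`, `pages_to_keep` and the cutoff value are hypothetical and not taken from `strip_blank_pages.py`.

```python
NEAR_WHITE_CUTOFF = 250  # assumption: 8-bit gray level counted as "near-white"


def near_white_fraction(gray_pixels: bytes, cutoff: int = NEAR_WHITE_CUTOFF) -> float:
    """Share of 8-bit grayscale samples at or above `cutoff`."""
    if not gray_pixels:
        return 0.0  # no pixel data: never treat the page as blank
    white = sum(1 for p in gray_pixels if p >= cutoff)
    return white / len(gray_pixels)


def is_blank_page(text: str, gray_pixels: bytes, threshold: float = 0.97) -> bool:
    """Blank only if there is no extracted text AND the page is mostly white."""
    if text.strip():
        return False  # any extracted text keeps the page
    # Assumption: the threshold is inclusive (>=); the issue body says ">97%".
    return near_white_fraction(gray_pixels) >= threshold


def pages_to_keep(pages, threshold: float = 0.97):
    """Indices of non-blank pages; `pages` is a list of (text, gray_pixels)."""
    return [
        i
        for i, (text, pixels) in enumerate(pages)
        if not is_blank_page(text, pixels, threshold)
    ]
```

Requiring both conditions is the cautious choice: a page with a text layer is never dropped on whiteness alone, and raising `threshold` only ever keeps more pages.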
`mover.sh` calls `strip_blank_pages.py <inbox/foo.pdf> <toprocess/.mdms-tmp.PID.foo.pdf>` after the stability check. A page is dropped iff both:
- `MDMS_BLANK_THRESHOLD` near-white pixels (default `0.97`).
The `.mdms-tmp.*` dotfile prefix keeps Paperless from picking up the half-written staging file (mover skips dotfiles too). On success the staged file is renamed into place; on rc=2 (all-blank) or any other rc, the original is preserved.
Stack choice: deviation from issue body
Issue suggested `pikepdf + pdf2image + PIL`. I went with PyMuPDF as the single Python dep because `pdf2image` needs system `poppler-utils` (apt install + sudo on mdock), and CLAUDE.md is explicit about no installs without permission. PyMuPDF ships its C libs in the wheel: `uv` pulls just one wheel (~24 MB) into `~/.cache/uv/` and the whole pipeline runs in user-space. Same single-file uv-inline-deps pattern as `infra/paperless/generate_separator.py`.
Flag if you want me to switch to the `pikepdf + pdf2image + PIL` stack; that would need `sudo apt install poppler-utils` on mdock.
Edge cases handled
- `2`, mover keeps file in inbox, logs `WARNING: <name> appears all-blank, kept in inbox`
- `mv`, no scan blocked
- `MDMS_STRIP_BLANK=false`
- `*.pdf` is touched)
Threshold note: `0.97` per issue spec, but for image-only duplex scans the genuinely-blank backside is more like 99.9% white, while real scans with sparse content sit between. If real pages start getting dropped, raise `MDMS_BLANK_THRESHOLD` toward `0.99`; that's the conservative direction (keeps more). Set in `~/.config/systemd/user/mdms-mover.service.d/override.conf`.
Test results
Local (synthetic PDFs, reportlab):
Live on mdock (`/tmp/mover-live/` sandbox, same 4-page input):
First-run `uv` cache miss took ~30 s (downloads PyMuPDF wheel + Pillow); subsequent runs reuse the cache.
Live deployment touchpoints
- `systemctl --user list-timers mdms-mover.timer` → active, next fire ~1 min.
- `journalctl --user -u mdms-mover -f` for live trace.
Verify (m, next time you scan)
Fire a duplex stack with a patch-T separator. Expected: docs after the separator land in Paperless without the empty backside. If anything looks wrong, `MDMS_STRIP_BLANK=false` in an override drop-in turns it off without breaking the stability gate.
Docs: `infra/mdms-mover/README.md`.
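
For reference, the staging and kill-switch behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the contents of `mover.sh`: `promote`, `strip_enabled`, the returned status strings and the `strip` callable (a stand-in for invoking `strip_blank_pages.py`, returning `0` on success and `2` for an all-blank PDF) are all hypothetical. For any non-zero rc the sketch leaves the original untouched and removes the staging file; what the mover does afterwards is not modelled.

```python
import os
import shutil
from pathlib import Path


def strip_enabled(env=os.environ) -> bool:
    """MDMS_STRIP_BLANK=false disables stripping; anything else leaves it on."""
    return env.get("MDMS_STRIP_BLANK", "true").strip().lower() != "false"


def promote(src: Path, dest_dir: Path, strip, env=os.environ) -> str:
    """Promote `src` into `dest_dir`, stripping blank pages first when enabled.

    `strip(src, staged)` is a hypothetical stand-in for strip_blank_pages.py.
    """
    final = dest_dir / src.name
    if not strip_enabled(env):
        shutil.move(str(src), str(final))  # kill switch: plain move, no stripping
        return "moved-unstripped"

    # Dotfile staging name, so a consumer that ignores dotfiles never sees
    # a half-written PDF in the destination directory.
    staged = dest_dir / f".mdms-tmp.{os.getpid()}.{src.name}"
    rc = strip(src, staged)
    if rc == 0:
        os.replace(staged, final)  # rename within the same directory
        src.unlink()               # original is removed only after success
        return "promoted"

    staged.unlink(missing_ok=True)  # drop any partial staging output
    return "kept-all-blank" if rc == 2 else "kept-error"
```

Staging inside the destination directory keeps the final step a same-directory rename, so the consumer sees either no file or the complete one.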