mDMS/infra/mdms-mover
mAi 90142396d8 mAi: #2 - mdms-mover: strip blank pages from duplex scans
Two changes:

1. Migrate mover from m/otto (commit 9974937, otto#438) into this repo
   at infra/mdms-mover/. mover.sh, mdms-mover.service, mdms-mover.timer,
   README.md. Matches the live deployment on mDock byte-for-byte (modulo
   the strip step below).

2. Add blank-page stripping before the inbox → toprocess promotion. A
   page is dropped iff its embedded text is empty AND its rendered
   thumbnail is >= MDMS_BLANK_THRESHOLD near-white pixels (default 0.97
   per issue #2). Detects the empty backside of patch-T separator
   sheets in duplex scans (mDMS#2).

strip_blank_pages.py uses PyMuPDF as the only Python dep — single
self-contained wheel, no `poppler-utils` apt-install on mdock. Mirrors
the uv-inline-deps single-file pattern of infra/paperless/generate_separator.py.

Edge cases:
- 1-page input: strip skipped entirely.
- All pages would drop: script exits 2, mover keeps file in inbox and
  logs WARNING (no empty doc reaches Paperless).
- Strip script errors: mover falls back to plain mv, no scan blocked.
- MDMS_STRIP_BLANK=false: bypass strip entirely (emergency disable).

Deploy: rsync uv binary to mdock ~/.local/bin/uv (single static binary,
user-space, no apt), scp script + units, systemctl --user daemon-reload.
Verified live with synthetic 4-page (2 real + 1 blank + 1 real → 3
pages), 1-page (unchanged), all-blank (kept in inbox + warning) test
PDFs. Timer fires every ~70s as before.
2026-05-16 17:57:26 +02:00

mdms-mover — age-gated inbox → toprocess promoter + blank-page stripper

Two jobs in one user-systemd timer:

  1. Stability gate (otto#438): solves the chunk-write race between the Canon MB5100 (SMB scans land in /mnt/mdms/inbox/ in pieces) and Paperless (polls /mnt/mdms/toprocess/ every 60s and consumes anything it sees). A file is only promoted when both:
    • its mtime is more than 3 minutes old (MDMS_MIN_AGE_MIN), and
    • its size is unchanged since the previous run.
  2. Blank-page strip (mDMS#2): duplex scans through patch-T separators leave a blank backside (the unprinted reverse of the separator sheet) at the front of every subsequent document. PDF files are passed through strip_blank_pages.py before promotion. Pages with no embedded text AND at least 97% near-white pixels (MDMS_BLANK_THRESHOLD, default 0.97) are dropped.
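
The two stability conditions in item 1 can be sketched as a single predicate. This is an illustrative Python sketch only: the real gate lives in mover.sh, and the function name, its arguments, and the choice to hold back a file that has no recorded size yet are assumptions, not taken from the script.

```python
import time

MIN_AGE_MIN = 3  # mirrors the documented MDMS_MIN_AGE_MIN default


def is_stable(mtime, size, prev_size, now=None, min_age_min=MIN_AGE_MIN):
    """Return True when a file passes both stability conditions.

    mtime and now are epoch seconds. prev_size is the size recorded on the
    previous run, or None if the file has not been seen before (assumption:
    an unseen file is never promoted on its first sighting).
    """
    if now is None:
        now = time.time()
    old_enough = (now - mtime) > min_age_min * 60
    size_unchanged = prev_size is not None and size == prev_size
    return old_enough and size_unchanged
```

Under these assumptions a scan that is still being written fails at least one condition on every run, so Paperless never sees a partial file.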

Layout on mDock

/home/m/mdms-mover/mover.sh                  # script, deployed copy
/home/m/mdms-mover/strip_blank_pages.py      # blank-page detector
/home/m/.config/systemd/user/mdms-mover.service  # oneshot service
/home/m/.config/systemd/user/mdms-mover.timer    # OnUnitActiveSec=1min
/home/m/.local/state/mdms-mover/state.tsv    # last-seen size per file
/home/m/.local/bin/uv                        # uv runner for the strip script

Runs as user m under user-systemd. mDock has Linger=yes for user m, so the timer keeps firing across reboots and after m logs out.

Why systemd, not cron

The original spec (otto#438) called for /etc/cron.d/mdms-mover. mDock runs Ubuntu 24.04 server, which ships with systemd timers and no cron package. Installing cron only to honour the spec wording would add a package we don't otherwise need; a user-systemd timer is the canonical Ubuntu 24.04 approach and gives better observability (systemctl --user status mdms-mover.timer, journalctl --user -u mdms-mover).

User mode (not system mode) keeps the entire install in m's home: no sudo at deploy or maintenance time, no /var/lib/... directories to chown, and the service can read and write the NFS mount because m owns it.

Configuration

| var                    | default                                       | meaning                                            |
|------------------------|-----------------------------------------------|----------------------------------------------------|
| MDMS_INBOX             | /mnt/mdms/inbox                               | source — scanner SMB target                        |
| MDMS_TOPROCESS         | /mnt/mdms/toprocess                           | destination — Paperless consume                    |
| MDMS_STATE             | $HOME/.local/state/mdms-mover/state.tsv       | per-file size memory                               |
| MDMS_MIN_AGE_MIN       | 3                                             | minimum mtime age in minutes                       |
| MDMS_STRIP_BLANK       | true                                          | run blank-page strip on PDFs (set to "false" to disable) |
| MDMS_STRIP_SCRIPT      | <mover dir>/strip_blank_pages.py              | path override for the strip script                 |
| MDMS_BLANK_THRESHOLD   | 0.97                                          | near-white pixel ratio to call a page blank (read by strip script) |
| MDMS_BLANK_NEAR_WHITE  | 240                                           | grayscale cutoff (0-255) for "near white" pixels (read by strip script) |
| MDMS_BLANK_DPI         | 50                                            | thumbnail render DPI (read by strip script)        |

To override at runtime, create the drop-in file ~/.config/systemd/user/mdms-mover.service.d/override.conf with:

[Service]
Environment=MDMS_MIN_AGE_MIN=5
Environment=MDMS_BLANK_THRESHOLD=0.99

Then run systemctl --user daemon-reload && systemctl --user restart mdms-mover.timer.
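
On the receiving end, the strip script presumably picks these values up from its environment and falls back to the defaults in the table. A minimal sketch of that lookup follows. The function name load_strip_config and the returned dictionary shape are hypothetical; only the variable names and defaults come from the table above.

```python
import os


def load_strip_config(env=None):
    """Read the three strip-script settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # near-white pixel ratio at or above which a text-free page is blank
        "threshold": float(env.get("MDMS_BLANK_THRESHOLD", "0.97")),
        # grayscale cutoff (0-255) for a "near white" pixel
        "near_white": int(env.get("MDMS_BLANK_NEAR_WHITE", "240")),
        # thumbnail render resolution
        "dpi": int(env.get("MDMS_BLANK_DPI", "50")),
    }
```

With the override.conf shown above in place, a lookup like this would return a threshold of 0.99 and leave the other two values at their defaults.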

Blank-page detection — what gets dropped

A page is dropped iff BOTH:

  1. embedded text is empty / whitespace-only (image-only scans always pass this — they have no embedded text), AND
  2. the fraction of near-white pixels in the rendered thumbnail is ≥ MDMS_BLANK_THRESHOLD (0.97 by default, i.e. at least 97% of pixels brighter than grayscale 240).
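
The two-part rule can be written as a small pure function. The sketch below works on a plain buffer of 8-bit grayscale pixel values so that it runs without PyMuPDF; in the real script the text and the pixels would come from the PDF renderer. The function names are hypothetical, and treating a pixel of exactly 240 as not near-white follows the "brighter than" wording above and is otherwise an assumption.

```python
def near_white_ratio(gray_pixels, near_white=240):
    """Fraction of 8-bit grayscale pixels strictly brighter than the cutoff."""
    if not gray_pixels:
        return 0.0  # no pixels rendered: never call the page blank
    bright = sum(1 for p in gray_pixels if p > near_white)
    return bright / len(gray_pixels)


def is_blank(text, gray_pixels, threshold=0.97, near_white=240):
    """A page is blank only if it has no text AND is almost entirely near-white."""
    if text.strip():
        return False  # any embedded text keeps the page
    return near_white_ratio(gray_pixels, near_white) >= threshold
```

For example, a text-free page whose thumbnail is 98% white is dropped, while the same thumbnail at 96% white is kept:

```python
is_blank("", bytes([255] * 98 + [0] * 2))   # True
is_blank("", bytes([255] * 96 + [0] * 4))   # False
```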

The threshold is conservative on purpose: a false negative (keeping a blank page we should have dropped) is recoverable via Paperless's UI, whereas a false positive (dropping a real page) silently loses data. If real pages get dropped in practice, raise MDMS_BLANK_THRESHOLD toward 0.99; a higher threshold makes the strip step pickier about what counts as blank, so more pages are kept.

Edge cases, handled between strip_blank_pages.py and mover.sh:

  • 1-page input: strip is skipped entirely (single-page docs never have separator-backside artefacts).
  • All pages would drop: the script exits with code 2 and writes no output. The mover keeps the file in the inbox and logs WARNING: <name> appears all-blank, kept in inbox. m can inspect via journalctl --user -u mdms-mover.
  • strip_blank_pages.py errors out: mover falls back to a plain mv (unstripped) so a transient problem in the detector never blocks a scan from reaching Paperless.
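
The mover's reaction to the strip script's exit status can be summarised as a three-way decision. The sketch below is illustrative Python rather than the shell logic in mover.sh; the Action names are hypothetical, and exit code 0 meaning "stripped output written" is an assumption (only code 2 is documented above).

```python
from enum import Enum


class Action(Enum):
    PROMOTE_STRIPPED = "move the stripped PDF to toprocess"
    KEEP_IN_INBOX = "keep the file in the inbox and log a WARNING"
    PROMOTE_ORIGINAL = "fall back to a plain mv of the unstripped file"


def decide(exit_code):
    """Map the strip script's exit code to what the mover does next."""
    if exit_code == 0:      # assumed success code
        return Action.PROMOTE_STRIPPED
    if exit_code == 2:      # documented: every page looked blank
        return Action.KEEP_IN_INBOX
    return Action.PROMOTE_ORIGINAL  # any other failure must not block the scan
```

The asymmetry is deliberate: only the all-blank case holds a file back, because promoting it would put an empty document into Paperless, whereas any other detector failure still delivers the original scan.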

The script is a uv-inline-deps single file (PyMuPDF for both rendering and text extraction — one wheel, no poppler-utils apt install on mdock). Mirrors the pattern from infra/paperless/generate_separator.py.
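
For readers unfamiliar with the pattern, a uv inline-deps script declares its dependencies in a comment block at the top of the file (PEP 723), and uv builds a matching environment on first run. The header below is a sketch of what that could look like here: the Python version bound and the argument handling are assumptions, not copied from strip_blank_pages.py, and the entry point is left out.

```python
#!/usr/bin/env -S uv run --script
# Assumption: the version bound below is illustrative; the real script may pin differently.
# /// script
# requires-python = ">=3.11"
# dependencies = ["pymupdf"]
# ///


def parse_args(argv):
    """Hypothetical CLI shape: strip_blank_pages.py IN.pdf OUT.pdf."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} IN.pdf OUT.pdf")
    return argv[1], argv[2]

# The real detector would import pymupdf and process the input file from here on.
```

Because the dependency list travels inside the file, deploying the detector is a single scp, which is what keeps mdock free of apt-installed PDF tooling.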

Deploy / sync

The live files on mDock must match this directory byte-for-byte (md5, same convention as infra/samba-canon/).
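
One way to check the byte-for-byte convention is to compare a local MD5 digest with a line of md5sum output captured from the host. The helper below is a sketch under that assumption; the function names are hypothetical, and how the remote line is obtained (for example via ssh mdock md5sum on the deployed file) is left to the caller.

```python
import hashlib


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_remote(local_path, remote_md5sum_line):
    """Compare a local file with one line of md5sum output from the host.

    md5sum prints "<digest>  <path>", so the digest is the first field.
    """
    remote_digest = remote_md5sum_line.split()[0]
    return md5_of(local_path) == remote_digest
```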

ssh mdock 'mkdir -p ~/mdms-mover ~/.config/systemd/user ~/.local/state/mdms-mover ~/.local/bin'

# uv binary (single static binary, user-space — no apt, no sudo)
rsync -av ~/.local/bin/uv mdock:/home/m/.local/bin/uv

# mover + strip script
scp infra/mdms-mover/mover.sh             mdock:/home/m/mdms-mover/mover.sh
scp infra/mdms-mover/strip_blank_pages.py mdock:/home/m/mdms-mover/strip_blank_pages.py
scp infra/mdms-mover/mdms-mover.service   mdock:/home/m/.config/systemd/user/
scp infra/mdms-mover/mdms-mover.timer     mdock:/home/m/.config/systemd/user/

ssh mdock 'chmod +x ~/mdms-mover/mover.sh ~/mdms-mover/strip_blank_pages.py && \
           systemctl --user daemon-reload && \
           systemctl --user enable --now mdms-mover.timer'

The first time the strip script runs, uv downloads Python and PyMuPDF into ~/.cache/uv/ (~30 MB). Subsequent runs reuse the cache.

Verify

ssh mdock 'systemctl --user list-timers mdms-mover.timer'
ssh mdock 'journalctl --user -u mdms-mover -n 20 --no-pager'
ssh mdock 'cat ~/.local/state/mdms-mover/state.tsv'
ssh mdock 'journalctl -t mdms-mover -n 20 --no-pager'

Emergency disable

Stop the timer entirely:

ssh mdock 'systemctl --user stop mdms-mover.timer && \
           systemctl --user disable mdms-mover.timer'

Or disable only the strip step while keeping the stability gate (run on mDock as user m):

mkdir -p ~/.config/systemd/user/mdms-mover.service.d
cat > ~/.config/systemd/user/mdms-mover.service.d/override.conf <<EOF
[Service]
Environment=MDMS_STRIP_BLANK=false
EOF
systemctl --user daemon-reload

Re-enable the timer with systemctl --user enable --now mdms-mover.timer.

If you need to drain the inbox manually while disabled, files older than a few minutes are safe to mv into toprocess/ by hand — Paperless will pick them up on its next poll.

Logs

Service logs land in the user journal under unit mdms-mover, and moved-file events also go through logger -t mdms-mover so they appear under that tag in the system journal too:

ssh mdock 'journalctl --user -u mdms-mover -f'   # service execution
ssh mdock 'journalctl -t mdms-mover -f'          # moved-file events

Refs

  • mDMS#2 — blank-page strip (this README)
  • otto#438 — original scheduler / staging-folder design
  • otto#429 — original Paperless pipeline setup
  • otto#431 — samba-canon bridge container (upstream of this mover)
  • docs/strategy.md — overall mDMS dataset layout
  • infra/paperless/generate_separator.py — sibling uv-inline-deps script