From Photo Library to Full Archive — Phase 2 Complete
What started as rescuing a chaotic family photo collection became something larger: a comprehensive deduplication pipeline covering every machine, every drive, and nearly a century of irreplaceable memories.
When I published the first posts about rescuing our family photo library, the goal was relatively contained: 116,000 photos, one server, one summer of work. What I didn’t anticipate was that solving that problem well would reveal a much larger one. Photos were only part of the picture. There was also video — over 800 GB of it, some of it irreplaceable footage from various family projects as well as Arabic language research videos from 2015 and 2016. There was audio — digitized LP records, Grandma Vaughn’s voice, and BYU radio recordings. There were documents, project files, backups of backups, and the accumulated digital debris of three decades of computing spread across five machines and a 5 TB cloud backup from iDrive.
Phase 2 was about all of it. And it is now, as of May 2026, substantially complete.
How We Got Here: The Earlier Approach
The original deduplication work described in the March posts operated on a single collection: the family pictures folder on the home server, 116,000 files, one database, one machine. The tools were purpose-built for that task — SQLite databases on a Windows laptop, Python scripts that compared relative file paths, and an AI review pipeline using Google’s Gemini to adjudicate ambiguous cases. It worked: 27,101 duplicates were identified, and 98.6% of them were resolved automatically.
But the approach had significant limitations that only became clear once the photo work was done:
Phase 1 — February 2026
- Single collection (family photos only)
- SQLite databases, Windows laptop
- Compared files by relative path within one source
- No cross-machine awareness
- Manual deletion via CSV list
- No keeper verification before deletion
- No audit trail for what was deleted
- Separate AI agent (Max/OpenClaw) executed changes
- No trash staging — deletions were immediate
Phase 2 — April–May 2026
- All sources: photos, video, audio, documents
- MySQL on Alabama (Raspberry Pi), elmore as control
- Matches by hash across all machines simultaneously
- Full cross-machine inventory: 835,816 files, 4.3 TB
- Interactive deletion with per-directory approval
- Real-time keeper verification before every deletion
- Full deletion_log audit trail, permanent record
- Claude plans and codes; Mike approves; scripts execute
- 48-hour trash staging with VerifyTrash before purge
The philosophical shift matters as much as the technical one. Phase 1 asked: which of these duplicates should we delete? Phase 2 asked a harder question: across all of our storage, which copy of each file is the most intentionally placed, and how do we ensure nothing irreplaceable is lost in the process?
The New Infrastructure
The backbone of Phase 2 is a MySQL database called MediaDatabase, running in Docker on a Raspberry Pi called Alabama. It holds a single table — file_inventory — with 835,816 records representing every file on every machine, ingested by a script called FileInventory.py that runs on each machine in turn and hashes every file it finds.
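The actual FileInventory.py isn’t reproduced in this post, but the core of an inventory pass is easy to sketch. Only the MediaDatabase name, the file_inventory table, and the hash-every-file behavior come from the post; the column names, the SHA-256 choice, the connection details, and the example paths below are assumptions for illustration.

```python
# Hypothetical sketch of a per-machine inventory pass (not the real FileInventory.py).
# Assumed schema: file_inventory(machine, source, file_path, file_hash, size_bytes).
import hashlib
import os
import socket

import mysql.connector  # MySQL Connector/Python


def hash_file(path, chunk_size=1 << 20):
    """Hash in 1 MB chunks so multi-gigabyte video files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root, source_name, conn):
    machine = socket.gethostname()
    cur = conn.cursor()
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                row = (machine, source_name, path, hash_file(path), os.path.getsize(path))
            except OSError as exc:  # unreadable file: note it and keep going
                print(f"skipped {path}: {exc}")
                continue
            cur.execute(
                "INSERT INTO file_inventory"
                " (machine, source, file_path, file_hash, size_bytes)"
                " VALUES (%s, %s, %s, %s, %s)",
                row,
            )
    conn.commit()


if __name__ == "__main__":
    connection = mysql.connector.connect(
        host="alabama", user="media", password="...", database="MediaDatabase"
    )
    inventory(r"E:\Lightroom", "e_drive", connection)  # illustrative source name and path
```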
A second table, source_priority, encodes the hierarchy of trust: master_files on the home server RAID is priority 1, the E: drive managed by Lightroom is priority 2, the iDrive cloud restore is priority 3, and so on down to disposable sources like the HP laptop backup at priority 9. When two machines have the same file, the lower-priority copy is the candidate for deletion — but only after the higher-priority copy is confirmed to actually exist on disk.
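Here is a sketch of how that trust hierarchy turns into deletion candidates. The post doesn’t reproduce the real DuplicateAnalysis.py query, so the table joins and column names below are assumptions; the idea is simply that a copy only becomes a candidate if some other copy of the same hash lives on a more-trusted source.

```python
# Hypothetical candidate-selection query (column names assumed, not the real
# DuplicateAnalysis.py). A lower priority number means a more trusted source.
FIND_CANDIDATES = """
    SELECT cand.machine   AS candidate_machine,
           cand.file_path AS candidate_path,
           keep.machine   AS keeper_machine,
           keep.file_path AS keeper_path
      FROM file_inventory  AS cand
      JOIN source_priority AS cp   ON cp.source = cand.source
      JOIN file_inventory  AS keep ON keep.file_hash = cand.file_hash
      JOIN source_priority AS kp   ON kp.source = keep.source
     WHERE kp.priority < cp.priority
"""


def find_candidates(conn):
    """Yield (candidate, keeper) pairs; the keeper is a more-trusted copy of the same hash."""
    cur = conn.cursor()
    cur.execute(FIND_CANDIDATES)
    for cand_machine, cand_path, keep_machine, keep_path in cur:
        yield (cand_machine, cand_path), (keep_machine, keep_path)
```

A real pass would also collapse multiple keepers for the same candidate down to the single highest-priority one; the sketch leaves that out for brevity.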
“The goal was never to delete files. The goal was to understand what we have, establish which copies are canonical, and then — carefully, verifiably — remove the redundant ones.”
The Scripts That Make It Work
Where Phase 1 had four small Python scripts totaling perhaps 400 lines, Phase 2 has a complete pipeline with distinct responsibilities:
- FileInventory.py runs on each machine, hashes every file it finds, and writes the results to the file_inventory table.
- DuplicateAnalysis.py matches files by hash across all sources and selects a keeper for each group using source_priority.
- DeleteExecutor runs the interactive deletion sessions: per-directory approval, real-time keeper verification, and staging to trash instead of immediate deletion.
- VerifyTrash re-checks the keepers for everything sitting in trash staging before the 48-hour purge.
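To make the safety model concrete, here is a minimal sketch of the checks that wrap a single staged deletion. The function name, paths, prompt, and trash layout are illustrative assumptions, not the actual DeleteExecutor, which approves work per directory rather than per file.

```python
# Hypothetical sketch of the Phase 2 safety checks around one deletion
# (names and paths are illustrative, not the actual DeleteExecutor).
import os
import shutil

TRASH_ROOT = "/mnt/raid/trash_staging"  # assumed staging location


def stage_deletion(candidate_path, keeper_path):
    # 1. Never trust the database alone: the keeper must exist on disk right now.
    if not os.path.isfile(keeper_path):
        return "SKIPPED: keeper missing on disk"

    # 2. Every deletion requires explicit human approval.
    answer = input(f"delete {candidate_path}\n  keeper {keeper_path}\n[y/N] ")
    if answer.strip().lower() != "y":
        return "SKIPPED: not approved"

    # 3. Stage to trash instead of deleting; the purge happens at least 48 hours
    #    later, after VerifyTrash re-confirms the keepers. The real pipeline would
    #    also record the move in deletion_log.
    dest = os.path.join(TRASH_ROOT, candidate_path.replace(":", "").lstrip("/\\"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(candidate_path, dest)
    return "STAGED"
```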
What We Learned the Hard Way
No project of this complexity runs cleanly from the start. Several hard-won lessons shaped the final pipeline:
The iDrive cloud restore used server-side deduplication when restoring the backup, silently omitting files that existed in multiple locations. The result: thousands of deletion candidates with keeper paths that pointed to files that were never actually restored. The fix was real-time disk verification before every deletion — never trust the database alone.
DuplicateAnalysis.py was accidentally run twice on April 27th, producing two complete sets of candidates for every file. Three weeks of deletion sessions later, every file was appearing twice in the interactive review. The fix was identifying and retiring the second run’s records entirely — and adding a uniqueness check to prevent recurrence.
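A check along these lines is enough to detect a double run before it reaches a deletion session, and a unique key over the same columns prevents a recurrence. The post doesn’t name the candidates table, so the table and column names here are assumptions.

```python
# Hypothetical detection of a double analysis run: any candidate file that
# appears more than once in the candidates table (names assumed).
FIND_DOUBLE_RUNS = """
    SELECT candidate_machine, candidate_path, COUNT(*) AS copies
      FROM duplicate_candidates
     GROUP BY candidate_machine, candidate_path
    HAVING COUNT(*) > 1
"""

# And the guard that keeps it from recurring: one candidate row per file.
ADD_UNIQUENESS_GUARD = """
    ALTER TABLE duplicate_candidates
      ADD UNIQUE KEY uniq_candidate (candidate_machine, candidate_path(255))
"""
```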
In some cases, the analysis selected a priority-9 disposable source as the keeper for a priority-5 source’s candidate — exactly backwards. The interactive DeleteExecutor caught most of these via the swap mechanism, but they revealed a bug in the keeper selection logic that still needs a permanent fix in DuplicateAnalysis.py.
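Until DuplicateAnalysis.py gets that permanent fix, a consistency check like the following (again with assumed table and column names) can flag inverted pairs before a deletion session starts.

```python
# Hypothetical sanity check: a keeper should never come from a less-trusted
# source than the candidate it is meant to replace (names assumed).
FIND_INVERTED_PAIRS = """
    SELECT c.candidate_path, c.keeper_path,
           cp.priority AS candidate_priority,
           kp.priority AS keeper_priority
      FROM duplicate_candidates AS c
      JOIN source_priority AS cp ON cp.source = c.candidate_source
      JOIN source_priority AS kp ON kp.source = c.keeper_source
     WHERE kp.priority > cp.priority
"""
```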
Mid-session, it became clear that deleting from the video archive with only the E: drive as the keeper wasn’t acceptable — the E: drive is on an external SSD on a Windows laptop, not on the RAID. The right strategy: ensure master_files has a copy before deleting from anywhere else. Several batches were restored from trash when this became apparent.
What Was Actually Freed
After months of work, the numbers are substantial. Just as important as the space reclaimed, though, is what was confirmed safe along the way.
The Grandma Vaughn recordings — digitized audio from 2004-2005, the actual voice of a family member captured on tape — were confirmed safe in master_files before any deletion near them occurred. The W08 Arabic Documentary (644 GB of raw footage from 2015-2016 Arabic language sessions) remains untouched pending a proper backup strategy. The LP collection — 514 digitized vinyl records — was confirmed on the RAID before the video copies were removed.
The Role Claude Played
In the earlier posts, I described a two-AI model: Claude for planning and documentation, Max (an on-server agent) for execution. Phase 2 collapsed that into a single collaborative loop. Claude wrote every script in this pipeline, debugged every error, analyzed every log file, and proposed every SQL query — while Mike ran the commands, made the judgment calls at every interactive prompt, and had final approval over anything that touched real data.
What made this work was continuity. Claude maintained a requirements document across sessions that grew to capture not just the technical specifications but the strategic decisions, the known bugs, the edge cases, and the lessons learned. When a session ended and a new one began, the document was the shared memory that allowed the work to resume without starting over.
The collaboration felt genuinely different from using an AI as a search engine or a code generator. It was closer to working with a detail-oriented colleague who had read every document, remembered every decision, and could be trusted to flag when something was about to go wrong — while still deferring to Mike on every judgment call that mattered.
What Remains
Phase 2 is substantially complete, but the work isn’t finished. The ARCLITE Lab video collection (809 GB of BYU language research footage) needs to be copied to master_files before any deletion is considered. The coosa music library needs a separate deduplication pass before it can serve Alexa reliably. The audio source still has 126 GB to process. And the BYU archive needs a reorganization that brings its scattered content under a coherent structure in master_files.
But the foundation is solid. The inventory exists. The pipeline works. The lessons have been documented. And a hundred years of family history is safer than it was when this started.
This post describes work completed between April and May 2026 in collaboration with Claude (Anthropic). All file deletions required explicit human approval at each step. The requirements document, script inventory, and session notes referenced here are maintained as project files and updated after each work session. Phase 1 scripts referenced in this post are archived at master_files/backups/Phase1Code/.