Personal Archive · Update · March 2026
A Hundred Years of Family Photos — The Project Continues
Building the roadmap, meeting the AI team, and getting the documentation right before we go any further.
By Mike Bush & Claude · March 2026
When I published the original post about rescuing our family photo library, I described it as thirty years of memories. That was an understatement. Counting scans of old prints and slides, this collection reaches back nearly a hundred years — photographs of family members as children in the 1940s, all the way forward to last year’s trip to Egypt. The scale of what we’re preserving is larger than I initially let on, and it deserves to be said plainly: this is a century of one family’s life, and getting it right matters.
Since that post, the project has taken a significant step forward — not in deleting more files, but in something arguably more important: building a solid foundation so that the work already done doesn’t unravel, and so the work still ahead can be done safely and confidently.
Meet the Team
I should introduce the collaborators here, because this project genuinely could not have happened the way it did without them — and because they’re not human, which is worth explaining to anyone who hasn’t worked this way before.
Claude (that’s me — I’m an AI assistant made by Anthropic) has been working with Mike today in this conversation. Think of me as a planning partner and documentation specialist. I read documents, ask clarifying questions, spot inconsistencies, and write the specifications and guides that keep complex projects organized. I don’t run directly on Mike’s server, but I can reason about what needs to happen there and produce the materials that make it possible.
Max is a separate AI agent running inside a tool called OpenClaw — think of Max as the hands-on technician who actually connects to the database, runs the scripts, and does the step-by-step work on the server. Max operates inside a sandboxed environment on Mike’s machine and takes direction from the documents and instructions we produce together. Max handled the earlier phases of deduplication work described in the original post.
The division of labor is straightforward: Claude thinks and plans and documents; Max executes. Mike approves anything that could cause permanent changes.
What We Did Today
Today’s session was entirely focused on documentation and project governance — the kind of work that isn’t glamorous but determines whether a complex technical project stays on the rails months from now.
We started by reviewing Max’s own summary of the project — a “Statement of Work” that Max had drafted at the start of an earlier session. It was good, but it had gaps: it said nothing about a special category of photos called misfiled duplicates (more on those in a moment), nothing about the safety rules that protect against accidental data loss, and nothing about the lessons learned from mistakes made in an earlier run of the deduplication process.
From there, we reviewed the actual database structure and all of the Python scripts that have been written over the course of this project. Some of those scripts were from early experimental phases and are now obsolete. Others are current but have gaps that need to be fixed before they can be trusted with live data. And a couple of critical scripts don’t exist yet at all and need to be written before the next major phase of work can begin.
By the end of the session, we had produced five formal documents:
- Master Strategy — the top-level reference explaining the whole project: what we’re doing, why, how the system is set up, and the rules that can never be broken.
- Workflow Specification — the step-by-step operational guide Max follows during each work session, including a pre-session checklist and verification queries for every phase.
- Database Specification — a complete reference for every table in the database, what it contains, what’s active versus legacy, and how the tables relate to each other.
- Script Reference — a catalog of every Python script in the project: what each one does, which ones are current, which are obsolete, and exactly what needs to be built next.
- Statement of Work — the formal agreement between Mike and Max defining responsibilities, deliverables, and constraints for the phases ahead.
The Misfiled Duplicate Problem
One thing worth explaining for non-technical readers is the concept of a misfiled duplicate, because it illustrates why human judgment still matters in this process even after most decisions have been automated.
Most duplicates in this library are simple: the same photo exists in a well-organized folder and in a staging dump, and the right answer is obvious — keep the organized one, delete the dump. Rules handle those automatically.
But a small number of cases are genuinely puzzling. Imagine a photograph from a Mission trip that somehow ended up filed in the House Flood folder as well. Both copies are real; neither location is obviously wrong in the way a staging dump is wrong. This isn’t a duplicate that should be deleted — it’s a filing error that should be corrected. A human needs to look at it and decide: which folder is right for this photo?
In the original deduplication run, 63 such cases were identified. They were reviewed individually. Going forward, a dedicated process will catch and flag these cases separately so they never accidentally get swept up in an automated deletion.
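For readers who like to see the logic spelled out, the flagging rule can be sketched in a few lines of Python. Everything here is illustrative: the `staging/` and `dump/` prefix convention and the record format are invented for the example, not taken from the project’s actual rules, which live in the Workflow Specification.

```python
from collections import defaultdict

# Illustrative convention: paths under these prefixes are staging
# dumps, everything else counts as an "organized" folder.
STAGING_PREFIXES = ("staging/", "dump/")

def flag_misfiled(records):
    """Given (content_hash, path) pairs, return hashes whose copies
    all live in organized folders. These are candidates for human
    review, never for automatic deletion."""
    by_hash = defaultdict(list)
    for content_hash, path in records:
        by_hash[content_hash].append(path)
    flagged = {}
    for h, paths in by_hash.items():
        organized = [p for p in paths
                     if not p.startswith(STAGING_PREFIXES)]
        # Misfiled duplicate: more than one *organized* location,
        # so no copy is the "obvious" one to delete.
        if len(organized) > 1:
            flagged[h] = organized
    return flagged
```

The key design point is that these cases are routed to a review list rather than a deletion list, so the automated rules can never touch them.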
A Technical Wrinkle Worth Mentioning
One of the more interesting challenges this project has surfaced is the relationship between two computing environments: the Linux server where the photos actually live, and the sandboxed Docker container where Max runs his scripts.
Without going too deep into the technical weeds: Max operates inside a kind of isolated virtual workspace. To give Max access to the actual photo library on the server, the two environments are connected by a special link — essentially a shortcut that makes the server’s photo folder appear inside Max’s workspace. This has worked well, but it’s fragile. If the server restarts, or the connection is reconfigured, that link can quietly break — and if Max then runs a script that’s supposed to move or delete files, the results can be unpredictable.
We’ve now explicitly documented this risk and added a pre-session verification step to every workflow: before Max does anything involving files, he checks that the link is intact. It’s a small thing, but it’s the kind of small thing that prevents a very bad day.
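A minimal version of that pre-session check might look like the Python below. The mount path is a placeholder, and treating an empty directory as a broken link is an assumption for this sketch; the check Max actually runs may differ in its details.

```python
import os

# Placeholder path: substitute the real mount point on the server.
PHOTO_ROOT = "/workspace/photos"

def verify_photo_link(root=PHOTO_ROOT):
    """Refuse to proceed unless the photo folder exists, resolves to
    a real directory (following any symlink), and contains at least
    one entry. A dangling symlink or an unmounted bind mount fails
    one of these cheap checks."""
    if not os.path.isdir(root):       # missing or dangling link
        return False
    try:
        next(os.scandir(root))        # at least one entry exists
    except StopIteration:
        return False                  # empty: mount likely broken
    return True
```

A script that moves or deletes files would call this first and abort loudly on `False`, rather than discovering mid-run that it has been operating on an empty folder.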
What Comes Next
With the documentation in place, the next session can focus on the actual deduplication work that remains. The to-do list is concrete:
First, a new script needs to be written — the core engine that scans the database for duplicate files, applies the quality rules, and populates the deletion staging table. This is the piece that was missing from the original script set, and without it the next phase of deduplication can’t begin.
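To make the shape of that missing script concrete, here is a hedged sketch using SQLite with invented table and column names (`photos`, `deletion_staging`, `content_hash`); the real schema is defined in the Database Specification, and “largest file wins” stands in for the actual quality rules.

```python
import sqlite3

def stage_duplicates(conn):
    """Find content hashes with multiple copies, keep the best copy
    (here: the largest file, as a stand-in quality rule), and stage
    the rest for deletion. Nothing is deleted by this function."""
    conn.execute("""CREATE TABLE IF NOT EXISTS deletion_staging
                    (photo_id INTEGER PRIMARY KEY, reason TEXT)""")
    rows = conn.execute("""
        SELECT id, content_hash, file_size FROM photos
        ORDER BY content_hash, file_size DESC, id""").fetchall()
    last_hash = None
    staged = 0
    for photo_id, content_hash, _size in rows:
        if content_hash == last_hash:   # not the best copy: stage it
            conn.execute(
                "INSERT INTO deletion_staging VALUES (?, ?)",
                (photo_id, "duplicate of kept copy"))
            staged += 1
        last_hash = content_hash
    conn.commit()
    return staged
```

Separating “decide what to delete” (this step) from “actually delete” (a later, approved step) is what makes the staging-table design safe.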
Second, an existing deletion script needs two small but critical fixes: it currently skips a required cleanup step (removing keyword tags before deleting a photo record), and it has no “practice mode” — it just runs for real. Both of those need to be corrected before it touches live data again.
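Both fixes can be illustrated in one short sketch. The table names are invented, and `dry_run` is a hypothetical name for the practice-mode flag; the point is the shape of the fix, not the exact code.

```python
import sqlite3

def delete_staged(conn, dry_run=True):
    """Delete staged photo records, removing keyword tags first.
    With dry_run=True (the default), nothing is changed: the function
    only reports how many records it *would* delete."""
    staged = [r[0] for r in conn.execute(
        "SELECT photo_id FROM deletion_staging")]
    if dry_run:
        return len(staged)              # report only; no writes
    for photo_id in staged:
        # The previously-skipped cleanup step: tags must go before
        # the photo row, or orphaned tag rows accumulate.
        conn.execute("DELETE FROM photo_keywords WHERE photo_id=?",
                     (photo_id,))
        conn.execute("DELETE FROM photos WHERE id=?", (photo_id,))
    conn.commit()
    return len(staged)
```

Defaulting to the practice mode means a careless invocation reports instead of deletes, which is the failure direction you want.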
Third, a small audit needs to run on a leftover database table that may be a relic from an earlier phase of the project. If it contains nothing unique, it gets dropped. If it contains files that weren’t captured elsewhere, those need to be recovered first.
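The audit itself is essentially one query. A sketch, again with invented table names (`legacy_files` as the leftover table, `photos` as the canonical record):

```python
import sqlite3

def audit_leftover_table(conn, leftover="legacy_files",
                         canonical="photos"):
    """Return content hashes present in the leftover table but absent
    from the canonical one. An empty result means the leftover table
    holds nothing unique and can be dropped; a non-empty result lists
    the files that must be recovered first."""
    rows = conn.execute(f"""
        SELECT content_hash FROM {leftover}
        EXCEPT
        SELECT content_hash FROM {canonical}""")
    return [r[0] for r in rows]
```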
After all of that, the actual deletion run can proceed — with preflight checks, dry runs, and explicit approval at each step.
The goal remains what it always was: a library where you can find the photo of your grandfather as a young man, know approximately when it was taken, and trust that you’re looking at the only copy.
We’re closer than ever.
✦
Documentation for this project — including the workflow specification, database reference, and script guide — was produced in collaboration with Claude (Anthropic) and reflects work in progress as of March 2026. The AI agents described here, Claude and Max, operated under Mike’s supervision with human approval required before any file deletions.