Deduplication for Photo Database — Part 2 (Version 2)

Personal Archive · Update · March 2026

A Hundred Years of Family Photos — The Project Continues

Building the roadmap, meeting the AI team, and getting the documentation right before we go any further.

When I published the original post about rescuing our family photo library, I described it as thirty years of memories. That was an understatement. Counting scans of old prints and slides, this collection reaches back nearly a hundred years — photographs of family members as children in the 1940s, all the way forward to last year’s trip to Egypt. The scale of what we’re preserving is larger than I initially let on, and it deserves to be said plainly: this is a century of one family’s life, and getting it right matters.

Since that post, the project has taken a significant step forward — not in deleting more files, but in something arguably more important: building a solid foundation so that the work already done doesn’t unravel, and so the work still ahead can be done safely and confidently.

Meet the Team

I should introduce the collaborators here, because this project genuinely could not have happened the way it did without them — and because they’re not human, which is worth explaining to anyone who hasn’t worked this way before.

Claude — Planning Partner & Documentation Specialist

Claude is an AI assistant made by Anthropic. In this project, Claude serves as the planning and documentation layer — reading documents, spotting inconsistencies, asking clarifying questions, and writing the specifications and guides that keep the work organized across sessions. Claude doesn’t run directly on the server, but reasons about what needs to happen there and produces the materials that make it possible.

Max — On-Server AI Agent

Max is a separate AI agent running inside a tool called OpenClaw. Think of Max as the hands-on technician who actually connects to the database, runs the scripts, and executes the step-by-step work on the server. Max operates inside a sandboxed environment on the home server and takes direction from the documents and instructions that Claude produces. Max handled the earlier phases of deduplication described in the original post.

The division of labor is straightforward: Claude thinks, plans, and documents. Max executes. Mike approves anything that could cause permanent changes.

What We Did Today

Today’s session was entirely focused on documentation and project governance — the kind of work that isn’t glamorous but determines whether a complex technical project stays on the rails months from now.

We started by reviewing Max’s own summary of the project — a “Statement of Work” that Max had drafted at the start of an earlier session. It was good, but it had gaps: it was missing the rules around misfiled duplicates (more on those shortly), the safety rules that protect against accidental data loss, and the lessons learned from mistakes made in a previous deduplication run.

From there we reviewed the actual database structure and all of the Python scripts written over the course of this project. Some were from early experimental phases and are now obsolete. Others are current but have gaps that need to be fixed before they can be trusted with live data. A couple of critical scripts don’t exist yet at all.

By the end of the session, five formal documents had been produced:

Doc 01 · Master Strategy: The top-level reference — what we’re doing, why, how the system is set up, and the rules that can never be broken.

Doc 02 · Workflow Specification: The step-by-step operational guide Max follows during each work session, including a pre-session safety checklist.

Doc 03 · Database Specification: A complete reference for every table in the database — what it contains, what’s active, what’s legacy, and how they relate.

Doc 04 · Script Reference: A catalog of every Python script — what each does, which are current, which are obsolete, and what still needs to be built.

Doc 05 · Statement of Work: The formal agreement between Mike and Max defining responsibilities, deliverables, and constraints for the phases ahead.
“Documentation isn’t the opposite of action — it’s what makes action safe when the stakes are a hundred years of family history.”

The Misfiled Duplicate Problem

One thing worth explaining for non-technical readers is the concept of a misfiled duplicate, because it illustrates why human judgment still matters even after most decisions have been automated.

Most duplicates are simple: the same photo exists in a well-organized event folder and in a staging dump, and the right answer is obvious — keep the organized one, delete the dump. Rules handle those automatically.

But a small number of cases are genuinely puzzling. Imagine a photograph from a Mission trip that somehow ended up filed in the House Flood folder as well. Both copies are real; neither location is obviously wrong in the way a staging dump is wrong. This isn’t a duplicate that should be deleted — it’s a filing error that needs a human to resolve.

In the original deduplication run, 63 such cases were identified and reviewed individually. Going forward, a dedicated process catches and flags these separately so they are never accidentally swept up in an automated deletion.
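The flagging logic can be sketched in a few lines. This is a minimal illustration, not the project’s actual code: the staging-folder markers and the function names here are hypothetical stand-ins for the real rules, which live in the project documentation.

```python
import os

# Hypothetical markers for "dump" locations; the real rules are project-specific.
STAGING_MARKERS = ("staging", "dump", "unsorted")

def is_staging(path):
    """True if any path component looks like a staging/dump folder."""
    parts = path.lower().split(os.sep)
    return any(marker in part for part in parts for marker in STAGING_MARKERS)

def classify_duplicate_group(paths):
    """Decide what to do with a group of paths sharing one checksum.

    Returns ("auto", keeper, delete_list) when the rule is obvious,
    or ("review", paths, []) when a human must resolve a filing error.
    """
    organized = [p for p in paths if not is_staging(p)]
    staged = [p for p in paths if is_staging(p)]
    if len(organized) == 1:
        # Simple case: keep the one organized copy, delete the staged ones.
        return ("auto", organized[0], staged)
    # Zero or multiple organized locations: flag for human review,
    # never for automated deletion.
    return ("review", paths, [])
```

The point of the split is that automation only ever acts on the unambiguous case; anything resembling the Mission-trip/House-Flood situation falls through to the review pile.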

A Technical Wrinkle Worth Mentioning

One of the more interesting challenges this project has surfaced is the relationship between two computing environments: the Linux server where the photos actually live, and the sandboxed container where Max runs his scripts.

Max operates inside a kind of isolated virtual workspace. To give Max access to the actual photo library, the two environments are connected by a special link — essentially a shortcut that makes the server’s photo folder appear inside Max’s workspace. This has worked well overall, but it’s fragile. If the server restarts or the connection is reconfigured, that link can quietly break. If Max then runs a script that is supposed to move or delete files, the results can be unpredictable.

This risk is now explicitly documented, and a verification step has been added to the start of every work session: before Max does anything involving files, he confirms the link is intact. It’s a small thing — but the kind of small thing that prevents a very bad day with irreplaceable photographs.
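A preflight check like this is enough to catch a broken link before any file operation runs. It’s a minimal sketch under one assumption: a small sentinel file is planted at the real photo-library root, so if the link is dead or points somewhere unexpected, the sentinel won’t be there. The paths and sentinel name are hypothetical.

```python
import os

def verify_photo_link(link_path, sentinel_name=".photo_root"):
    """Raise unless link_path resolves to a directory containing the
    sentinel file planted at the real photo-library root."""
    if not os.path.isdir(link_path):
        raise RuntimeError(f"{link_path} is missing or not a directory")
    sentinel = os.path.join(link_path, sentinel_name)
    if not os.path.isfile(sentinel):
        raise RuntimeError(
            f"sentinel {sentinel} not found; the link may be broken")
    return True
```

Run at the top of every session, this turns a silent failure mode into a loud, immediate stop.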

What Comes Next

With the documentation in place, the next session can focus on the deduplication work that remains.

Next Step

Write the Core Deduplication Engine

A new script needs to be written that scans the database for duplicate files, applies the quality-hierarchy rules, and populates the deletion staging table. This is the missing piece without which the next phase can’t begin.
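The core of that engine could look something like the sketch below: group files by checksum, keep the best copy in each group, stage the rest. The table and column names here are hypothetical placeholders, and “quality” is reduced to pixel count for illustration — the real hierarchy is defined in the project’s documentation.

```python
import sqlite3

def stage_duplicates(conn):
    """Find files sharing a checksum, keep the highest-quality copy,
    and insert the rest into the deletion staging table.

    Returns the number of files staged. Assumes hypothetical tables
    photos(checksum, path, pixel_count) and deletion_staging(checksum, path).
    """
    groups = {}
    cur = conn.execute("SELECT checksum, path, pixel_count FROM photos")
    for checksum, path, pixels in cur:
        groups.setdefault(checksum, []).append((pixels, path))

    staged = 0
    for checksum, copies in groups.items():
        if len(copies) < 2:
            continue                       # unique file, nothing to do
        copies.sort(reverse=True)          # best quality first
        for _pixels, path in copies[1:]:   # everything after the keeper
            conn.execute(
                "INSERT INTO deletion_staging (checksum, path) VALUES (?, ?)",
                (checksum, path),
            )
            staged += 1
    conn.commit()
    return staged
```

Note that the script only populates the staging table — nothing is deleted here, which keeps the decision step and the destructive step cleanly separated.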

Then

Fix Two Gaps in the Deletion Script

The existing deletion script currently skips a required cleanup step and has no “practice mode” — it runs for real immediately. Both need to be corrected before it touches live data again.
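The “practice mode” fix is conceptually simple: every destructive operation goes through a switch that defaults to reporting rather than acting. This is a generic sketch of the pattern, not the project’s actual deletion script.

```python
import os

def delete_staged(paths, dry_run=True):
    """Delete the given files, or — in dry-run mode (the default) —
    only report what would be deleted without touching anything."""
    actions = []
    for path in paths:
        if dry_run:
            actions.append(f"WOULD DELETE {path}")
        else:
            os.remove(path)
            actions.append(f"DELETED {path}")
    return actions
```

Defaulting `dry_run` to `True` means the safe behavior is the one you get by accident; running for real requires an explicit, deliberate flag.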

Then

Audit a Legacy Database Table

A leftover table from an earlier project phase needs to be checked. If it contains nothing unique, it gets dropped. If it holds files never captured elsewhere, those are recovered first.
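The audit itself is one query: which checksums appear in the legacy table but nowhere in the active one? Anything that query returns must be recovered before the table is dropped. The table and column names below are hypothetical stand-ins for the real schema.

```python
import sqlite3

def audit_legacy_table(conn):
    """Return checksums present in the legacy table but missing from
    the active photos table — files that must be recovered before
    the legacy table can safely be dropped."""
    cur = conn.execute(
        """
        SELECT l.checksum
        FROM legacy_photos AS l
        LEFT JOIN photos AS p ON p.checksum = l.checksum
        WHERE p.checksum IS NULL
        """
    )
    return [row[0] for row in cur]
```

If the list comes back empty, the table holds nothing unique and can be dropped; otherwise each returned checksum marks a file to rescue first.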

Finally

Execute the Deletion Run

With preflight checks, dry runs, and explicit approval at each step, the approved duplicates are removed and the library moves one phase closer to completion.

5 Documents produced today
~100 Years of history preserved
0 Photos lost so far

The goal remains what it always was: a library where you can find the photograph of your grandfather as a young man, know approximately when it was taken, and trust that you’re looking at the only copy.

We’re closer than ever.

Documentation for this project — including the workflow specification, database reference, and script guide — was produced in collaboration with Claude (Anthropic) and reflects work in progress as of March 2026. Claude and Max operated under Mike’s supervision; human approval was required before any file deletions.
