Personal Archive · March 2026

Rescuing Thirty Years of Family Memories

How a chaotic digital library of over 116,000 photos and videos was sorted, deduplicated, and organized — without losing a single irreplaceable moment.

Somewhere on a hard drive, there are photographs of me as a child in the 1940s. There are photos and videos from a trip to Egypt in 2023, snapshots from Normandy beaches, a grandchild’s first steps, and decades of Christmases. Spanning everything from scanned 1940s prints to three decades of digital photography, the collection had grown to more than 116,000 files, and it was a mess.

The same photo might exist in three different folders. A picture taken at the Paris Temple in 2015 could be filed under “2015 France,” “2016 France,” and a staging folder called “00 iPhone Dumps” — all at once. Nobody had done anything wrong. This is simply what happens when photos accumulate across phones, cameras, computers, and backup drives over three decades without a consistent system.

This is the story of how we fixed it.

The Scale of the Problem

Before any work began, a complete audit of the library revealed the true scope of the disorder. The collection contained 116,468 individual photo and video files stored across a Linux server. Many files appeared multiple times — not because anyone intended to keep duplicates, but because of how photos naturally accumulate: syncing a phone creates one copy, backing up a laptop creates another, and importing into a new photo app creates a third.

116,468 total files in the library
27,101 duplicate files found
23% of the library was duplicates

Nearly one in four files was a duplicate. That’s roughly 27,000 photographs and videos taking up space, cluttering searches, and making it harder to find the photos that matter.

“The same photo of a monastery in Crete existed in four different folders simultaneously — none of them labeled with the location.”

Beyond duplicates, the folder organization itself was inconsistent. Some years had their photos neatly organized — 2018 had a folder for France, a folder from our missionary assignment to the Visitors’ Center of the Paris Temple, a folder for San Francisco. Other years had events scattered at the root level with no parent folder. Staging folders with names like “00 iPhone Dumps” and “00 Camera Roll” had accumulated thousands of photos that were never properly filed. One folder, left over from a 2021 attempt to use Adobe Lightroom’s cloud sync, contained photos that had been quietly duplicated across the entire library.

How We Approached It

Rather than manually reviewing 27,000 files — a task that would take weeks — I worked with Claude from Anthropic to build a system to do most of the work automatically, with human judgment applied only where it genuinely mattered.

Step One

Finding Every Duplicate

Each file was given a unique digital fingerprint based on its contents. Two files with identical fingerprints are, for all practical purposes, guaranteed to be identical, regardless of filename or folder. This identified every true byte-for-byte duplicate in the library.
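Content fingerprinting like this is usually done with a cryptographic hash. Here is a minimal sketch in Python, assuming SHA-256 as the fingerprint (the post doesn't name the specific hash used):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks
    so even multi-gigabyte videos never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group every file under `root` by content digest; any group with
    more than one path is a set of byte-identical duplicates."""
    groups: defaultdict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Because the fingerprint depends only on file contents, renamed copies in different folders still collapse into the same group.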

Step Two

Building the Rules

Not all duplicates are equal. A photo in a well-organized event folder (“2018 France/Normandy”) is more valuable than the same photo in a staging dump (“00 iPhone Dumps”). We built a hierarchy of folder quality with eight levels — from “nested named event” at the top to “abandoned sync folder” at the bottom — and wrote rules to automatically approve deletion of the lower-quality copy in clear-cut cases. These rules alone resolved nearly 25,000 of the 27,101 duplicate pairs.
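A hierarchy like this can be sketched as an ordered list of pattern/score pairs. The post doesn't enumerate the eight real levels, so the patterns and scores below are illustrative assumptions, not the project's actual rules:

```python
import re
from pathlib import PurePosixPath

# Illustrative quality rules, highest-priority pattern first; the real
# project used an eight-level hierarchy, and these labels are assumptions.
RULES = [
    (re.compile(r"^00 "), 1),             # staging dump, e.g. "00 iPhone Dumps"
    (re.compile(r"lightroom", re.I), 0),  # abandoned cloud-sync folder
    (re.compile(r"^\d{4} .+/"), 8),       # nested named event, e.g. "2018 France/Normandy"
    (re.compile(r"^\d{4} "), 6),          # year + event name at the top level
    (re.compile(r"^\d{4}$"), 4),          # bare year folder
]

def folder_score(folder: str) -> int:
    """Score a relative folder path; higher means a better-organized home."""
    for pattern, score in RULES:
        if pattern.search(folder):
            return score
    return 2  # unclassified folder

def auto_resolve(path_a: str, path_b: str):
    """Return (keep, delete) when one copy clearly wins, else None."""
    score_a = folder_score(str(PurePosixPath(path_a).parent))
    score_b = folder_score(str(PurePosixPath(path_b).parent))
    if score_a == score_b:
        return None  # ambiguous pair: escalate to AI or human review
    return (path_a, path_b) if score_a > score_b else (path_b, path_a)
```

Pairs where the scores tie fall through to the next stage, which is how the roughly 3,000 ambiguous cases would surface.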

Step Three

AI Review for Ambiguous Cases

After the automatic rules processed the obvious cases, roughly 3,000 duplicate pairs remained where the right answer wasn’t clear from folder names alone. These were sent to an AI assistant (Google’s Gemini), which examined each pair, decided which copy to keep, and explained its reasoning. The entire AI review cost 24 cents in computing time.

Step Four

Human Review of Edge Cases

A small number of cases — 63 out of 27,101 — required a human eye. These were photos that had ended up in genuinely unrelated folders: a photo from a Mission trip that somehow appeared in the House Flood folder, or a 1940s family photo filed in both the 1940s and 1950s folders. These were reviewed individually, with full file paths provided for side-by-side comparison.

What GPS Data Revealed

Modern smartphones embed precise GPS coordinates in every photo they take. This turned out to be a powerful verification tool. When the AI flagged uncertainty about whether a photo belonged in “2023 France & Egypt” or “2023 Egypt,” we could simply check: where was the camera when this photo was taken?

Running a geographic check on over 250 disputed photos confirmed that every single one was taken within Egypt’s borders — not France. The AI’s decisions were correct in every case.
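EXIF stores GPS coordinates as degree/minute/second values plus a hemisphere reference, so a check like this reduces to converting them to decimal degrees and testing a bounding box. A minimal sketch, assuming the raw values have already been read from EXIF (e.g. with Pillow or exiftool) and using an approximate rectangle for Egypt:

```python
# Approximate bounding box for Egypt in decimal degrees (an assumption
# for illustration; real borders are not rectangular).
EGYPT = {"lat_min": 22.0, "lat_max": 31.7, "lon_min": 24.7, "lon_max": 36.9}

def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert EXIF-style degrees/minutes/seconds plus an N/S/E/W
    reference letter to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

def taken_in_egypt(lat: float, lon: float) -> bool:
    """True if a coordinate falls inside the rough Egypt box."""
    return (EGYPT["lat_min"] <= lat <= EGYPT["lat_max"]
            and EGYPT["lon_min"] <= lon <= EGYPT["lon_max"])
```

A coordinate near Giza passes the check; one near Paris does not, which is exactly the France-versus-Egypt distinction the review needed.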

This GPS verification process opens up exciting possibilities for the future. That 2014 trip to Crete? The GPS data already reveals clusters of photos taken at Preveli Monastery, near Rethymno, and near Heraklion — places that were visited but never named in the folder structure. A future phase of this project will use geographic clustering to automatically suggest subfolder names based on where the photos were actually taken, discovering the named places within a trip that were never manually labeled.
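Clustering of this kind can be prototyped with a great-circle distance and a simple greedy grouping; the 5 km radius and the greedy strategy below are assumptions for illustration, not the project's planned algorithm:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two coordinates, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_photos(points, radius_km: float = 5.0):
    """Greedy clustering: each (lat, lon) joins the first cluster whose
    founding point is within radius_km, otherwise it starts a new cluster."""
    clusters: list[list[tuple[float, float]]] = []
    for lat, lon in points:
        for cluster in clusters:
            anchor_lat, anchor_lon = cluster[0]
            if haversine_km(lat, lon, anchor_lat, anchor_lon) <= radius_km:
                cluster.append((lat, lon))
                break
        else:
            clusters.append([(lat, lon)])
    return clusters
```

Each resulting cluster could then be passed to a reverse-geocoding service to suggest a place name, such as a monastery or a town, for the subfolder.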

The Outcome

The deduplication review is now complete. Of the 27,101 duplicate files identified, 26,735 have been approved for deletion — an approval rate of 98.6%. The process took one evening of work, most of it automated.

98.6% auto-resolved by rules + AI
<$0.50 total AI processing cost
1 evening of work

The next phase will consolidate the folder structure itself — moving root-level event folders under their proper year parents, so that every photo from 2017 lives somewhere inside a “2017” folder rather than scattered at the root level. After that, Phase 3 will use GPS clustering to automatically suggest sub-locations within trip folders.

The goal is simple: a library where you can find the photo of your father as a child, know approximately when it was taken, and trust that you’re looking at the only copy.

What This Means for the Future

Everything built for this library — the duplicate detection, the folder quality rules, the AI review pipeline, the GPS verification — is reusable. The same system will run against a second photo library on a separate drive, and eventually against the full combined archive. The hard work of building and debugging the pipeline is done. Applying it to new collections is now a matter of days, not months.

Thirty years of family history, finally in order.
