Rescuing 30 Years of Family Photos
Personal Archive · March 2026

Rescuing Thirty Years of Family Memories

How a chaotic digital photo library of over 116,000 images was sorted, deduplicated, and organized — without losing a single irreplaceable moment.

Somewhere on a hard drive, there are photographs of me as a child in the 1940s. There are photos and videos from a trip to Egypt in 2023, snapshots from the Normandy beaches, a grandchild’s first steps, and decades of Christmases. Spanning more than seventy years of photographs, both born-digital and scanned, the collection had grown to more than 116,000 files — and it was a mess.

The same photo might exist in three different folders. A picture taken at the Paris Temple in 2015 could be filed under “2015 France,” “2016 France,” and a staging folder called “00 iPhone Dumps” — all at once. Nobody had done anything wrong. This is simply what happens when photos accumulate across phones, cameras, computers, and backup drives over three decades without a consistent system.

This is the story of how we fixed it.

The Scale of the Problem

Before any work began, a complete audit of the library revealed the true scope of the disorder. The collection contained 116,468 individual photo and video files stored across a Linux server. Many files appeared multiple times — not because anyone intended to keep duplicates, but because of how photos naturally accumulate: syncing a phone creates one copy, backing up a laptop creates another, and importing into a new photo app creates a third.

116,468 total files in the library
27,101 duplicate files found
23% of the library was duplicates

Nearly one in four files was a duplicate. That’s roughly 27,000 photographs and videos taking up space, cluttering searches, and making it harder to find the photos that matter.

“The same photo of a monastery in Crete existed in four different folders simultaneously — none of them labeled with the location.”

Beyond duplicates, the folder organization itself was inconsistent. Some years had their photos neatly organized — 2018 had a folder for France, a folder from our missionary assignment to the Visitors’ Center of the Paris Temple, a folder for San Francisco. Other years had events scattered at the root level with no parent folder. Staging folders with names like “00 iPhone Dumps” and “00 Camera Roll” had accumulated thousands of photos that were never properly filed. One folder, left over from a 2021 attempt to use Adobe Lightroom’s cloud sync, contained photos that had been quietly duplicated across the entire library.

How We Approached It

Rather than manually reviewing 27,000 files — a task that would take weeks — I worked with Claude from Anthropic to build a system to do most of the work automatically, with human judgment applied only where it genuinely mattered.

Step One

Finding Every Duplicate

Each file was given a unique digital fingerprint based on its contents. Two files with identical fingerprints are guaranteed to be identical, regardless of filename or folder. This identified every true duplicate in the library with complete certainty.
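A minimal sketch of this fingerprinting step, assuming SHA-256 content hashes (the article doesn't name the specific hash algorithm):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 hex digest of the file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: Path) -> dict:
    """Group files under `root` by fingerprint; any group of two or
    more files with the same digest is a set of exact duplicates."""
    groups = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            groups[file_fingerprint(p)].append(p)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Because the digest depends only on file contents, renamed or re-filed copies hash identically, which is what makes the match certain regardless of folder or filename.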

Step Two

Building the Rules

Not all duplicates are equal. A photo in a well-organized event folder (“2018 France/Normandy”) is more valuable than the same photo in a staging dump (“00 iPhone Dumps”). We built a hierarchy of folder quality with eight levels — from “nested named event” at the top to “abandoned sync folder” at the bottom — and wrote rules to automatically approve deletion of the lower-quality copy in clear-cut cases. These rules alone resolved nearly 25,000 of the 27,101 duplicate pairs.
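The tiering idea can be sketched as follows; the article quotes only the top and bottom level names, so the scoring patterns here are hypothetical stand-ins for the real eight-level hierarchy:

```python
import re

def folder_quality(path: str) -> int:
    """Score a file's folder path; higher means a more trustworthy home.
    An illustrative subset of the eight-level hierarchy; the pattern
    rules (e.g. the "00 " staging prefix) are assumptions."""
    parts = path.strip("/").split("/")[:-1]  # folder components, filename dropped
    if not parts:
        return 1
    if any(p.lower().startswith(("00 ", "lightroom")) for p in parts):
        return 0  # staging dump or abandoned sync folder
    if re.match(r"\d{4}", parts[0]) and len(parts) >= 2:
        return 3  # nested named event, e.g. "2018 France/Normandy"
    if re.match(r"\d{4}", parts[0]):
        return 2  # year-prefixed event folder
    return 1      # unstructured folder

def auto_resolve(path_a: str, path_b: str):
    """Return the copy to delete when quality clearly differs, else None
    (ties are escalated to AI or human review)."""
    qa, qb = folder_quality(path_a), folder_quality(path_b)
    if qa == qb:
        return None
    return path_a if qa < qb else path_b
```

For example, `auto_resolve("2018 France/Normandy/x.jpg", "00 iPhone Dumps/x.jpg")` would nominate the staging-dump copy for deletion, while two year-prefixed event folders would tie and escalate.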

Step Three

AI Review for Ambiguous Cases

After the automatic rules processed the obvious cases, roughly 3,000 duplicate pairs remained where the right answer wasn’t clear from folder names alone. These were sent to an AI assistant (Google’s Gemini), which examined each pair, decided which copy to keep, and explained its reasoning. The entire AI review cost 24 cents in computing time.
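A sketch of how such a review step might be wired up, with a hypothetical prompt format and no actual API call (the article does not reproduce the real prompt or response format):

```python
def review_prompt(path_a: str, path_b: str) -> str:
    """Build the per-pair prompt. Wording is hypothetical, not the
    project's actual prompt sent to Gemini."""
    return (
        "Two copies of the same photo exist. Decide which copy to KEEP.\n"
        f"A: {path_a}\nB: {path_b}\n"
        "Answer 'KEEP A' or 'KEEP B' on the first line, "
        "then one sentence of reasoning."
    )

def parse_decision(response: str, path_a: str, path_b: str):
    """Map the model's reply back to a path; None means unparseable,
    and such pairs would fall through to human review."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    first = lines[0].upper()
    if first.startswith("KEEP A"):
        return path_a
    if first.startswith("KEEP B"):
        return path_b
    return None
```

Keeping the model's answer to a rigid first-line format is what makes thousands of responses cheap to process programmatically.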

Step Four

Human Review of Edge Cases

A small number of cases — 63 out of 27,101 — required a human eye. These were photos that had ended up in genuinely unrelated folders: a photo from a Mission trip that somehow appeared in the House Flood folder, or a 1940s family photo filed in both the 1940s and 1950s folders. These were reviewed individually, with full file paths provided for side-by-side comparison.

What GPS Data Revealed

Modern smartphones embed precise GPS coordinates in every photo they take. This turned out to be a powerful verification tool. When the AI flagged uncertainty about whether a photo belonged in “2023 France & Egypt” or “2023 Egypt,” we could simply check: where was the camera when this photo was taken?

Running a geographic check on over 250 disputed photos confirmed that every single one was taken within Egypt’s borders — not France. The AI’s decisions were correct in every case.
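The check itself is simple once the EXIF coordinates are decoded. A sketch, using an approximate bounding box for Egypt; extracting the raw GPS tags from the image files would be done with an imaging library and is omitted here:

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, ref: str) -> float:
    """Convert EXIF-style degrees/minutes/seconds plus a hemisphere
    reference ('N'/'S'/'E'/'W') to a signed decimal coordinate."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref in ("S", "W") else value

# Rough bounding box for Egypt, for sanity-checking only (approximate).
EGYPT = {"lat": (22.0, 31.8), "lon": (24.5, 37.0)}

def inside(lat: float, lon: float, box=EGYPT) -> bool:
    """True if the coordinate falls within the box."""
    return (box["lat"][0] <= lat <= box["lat"][1]
            and box["lon"][0] <= lon <= box["lon"][1])
```

A coordinate near Cairo passes the check; one near Paris does not, which is exactly the disambiguation the folder names could not provide.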

This GPS verification process opens up exciting possibilities for the future. That 2014 trip to Crete? The GPS data already reveals clusters of photos taken at Preveli Monastery, near Rethymno, and near Heraklion — places that were visited but never named in the folder structure. A future phase of this project will use geographic clustering to automatically suggest subfolder names based on where the photos were actually taken, discovering the named places within a trip that were never manually labeled.
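A simple version of such clustering can be done by rounding coordinates onto a grid. This sketch is an assumption about how the future phase might work, not the project's actual method:

```python
from collections import defaultdict

def suggest_clusters(photos, precision=2, min_size=5):
    """Group (name, lat, lon) photos by rounding coordinates to
    `precision` decimal places (two decimals is roughly a 1 km cell
    in latitude). Cells holding at least `min_size` photos become
    candidate subfolders; naming each cell, e.g. via reverse
    geocoding, would be a separate step."""
    cells = defaultdict(list)
    for name, lat, lon in photos:
        cells[(round(lat, precision), round(lon, precision))].append(name)
    return {cell: names for cell, names in cells.items() if len(names) >= min_size}
```

The `min_size` threshold keeps one-off roadside snapshots from generating noise folders while dense clusters, such as a morning spent at one monastery, surface naturally.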

The Outcome

The deduplication review is now complete. Of the 27,101 duplicate files identified, 26,735 have been approved for deletion — an approval rate of 98.6%. The process took one evening of work, most of it automated.

98.6% of duplicates approved for deletion
<$0.50 total AI processing cost
1 evening of work

The next phase will consolidate the folder structure itself — moving root-level event folders under their proper year parents, so that every photo from 2017 lives somewhere inside a “2017” folder rather than scattered at the root level. After that, Phase 3 will use GPS clustering to automatically suggest sub-locations within trip folders.
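That consolidation rule can be sketched as follows, assuming root-level event folders are prefixed with their year (the `/photos` root and the naming pattern are hypothetical examples):

```python
import re

YEAR_PREFIX = re.compile(r"^(19|20)\d{2}")

def consolidation_move(folder: str, root: str = "/photos"):
    """Propose moving a root-level event folder under its year parent.
    Returns a (src, dest) pair, or None when no year can be inferred
    or the folder is already a bare year folder."""
    if not YEAR_PREFIX.match(folder):
        return None  # no year prefix: needs human judgment
    year = folder[:4]
    if folder == year:
        return None  # already a bare year folder
    return (f"{root}/{folder}", f"{root}/{year}/{folder}")
```

Emitting proposed moves rather than performing them keeps the step auditable, in keeping with the review-before-delete discipline used for the duplicates.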

The goal is simple: a library where you can find the photo of your father as a child, know approximately when it was taken, and trust that you’re looking at the only copy.

What This Means for the Future

Everything built for this library — the duplicate detection, the folder quality rules, the AI review pipeline, the GPS verification — is reusable. The same system will run against a second photo library on a separate drive, and eventually against the full combined archive. The hard work of building and debugging the pipeline is done. Applying it to new collections is now a matter of days, not months.

Thirty years of family history, finally in order.

Posted in Uncategorized

LLM Uses Metaphors to Explain Problem

I am developing software using VB.net and the OpenXML library to substitute translations and screen captures into a master Microsoft Word document in English. The purpose is to create new, target-language versions of the original Word documents. Text has its challenges (multiple runs for a single string of formatted text, to name just one). Graphics, on the other hand, move things to a whole new level of complexity. Replacing an image requires the software to understand the layout information of the original image before inserting the new one, and this gets very tricky.
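To illustrate the kind of layout information involved: in a WordprocessingML document, an inline picture's display size lives in a `wp:extent` element measured in English Metric Units (914,400 EMU per inch). The project itself is VB.net; this Python sketch parses a trimmed, illustrative XML fragment:

```python
import xml.etree.ElementTree as ET

EMU_PER_INCH = 914400  # OpenXML drawing sizes are in English Metric Units
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"

def original_extent_inches(drawing_xml: str):
    """Read the author's intended display size from a drawing fragment.
    Returns (width_in, height_in). Real document.xml carries many more
    attributes and namespaces than this trimmed example."""
    root = ET.fromstring(drawing_xml)
    extent = root.find(f".//{{{WP}}}extent")
    cx, cy = int(extent.get("cx")), int(extent.get("cy"))
    return cx / EMU_PER_INCH, cy / EMU_PER_INCH

sample = (
    '<wp:inline xmlns:wp="http://schemas.openxmlformats.org/'
    'drawingml/2006/wordprocessingDrawing">'
    '<wp:extent cx="2743200" cy="1828800"/>'
    '</wp:inline>'
)
```

Reusing these original `cx`/`cy` values when inserting a replacement image, rather than deriving a size from the new image's pixel dimensions, is one way to preserve the original author's intended page layout.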

I had been working with Gemini 2.5 Pro in developing the software, but the images were being sized incorrectly upon insertion into the new document. After one version that distorted the images upon insertion, the next version had images extending beyond the edge of the page like this:

That portion of the page should in fact look more like this:

Even that version, however, is not ideal. The screen capture is from the software being explained and does not occupy such a significant portion of the screen, which wastes space in the final document.

So, after that version, I returned to Gemini 2.5 Pro with this prompt:

That works better, but some of the graphics were captured at such a high resolution that they fill the whole page. Give me no new code, but give me some ideas to think about. I am thinking that I might also need to furnish you with the actual XML of the file so you can see what we might use. In the original version, the images were sized for a better page layout, so there must be some way to get that info.

Beyond wording in the response that was sycophantic at times, I found the use of metaphors rather intriguing:

  • “The ‘scale to page width’ logic is a sledgehammer. It’s a great fallback, but what we really need is the scalpel—the exact dimensions the original author intended”
  • “Golden Ticket”
  • “we were essentially replacing the engine of a car but leaving the original chassis, wheels, and body. The new engine was too big for the chassis.”

Here is the excerpt from the exchange that contains the use of metaphors:

Wanting to reflect more on what I was seeing there, I asked Anthropic’s Claude:

Is the use of metaphors more than clever syntax?

The full response was quite interesting and seemed to confirm what I was thinking: LLMs often appear to be doing more than simply placing the next word in a sentence according to statistical probabilities! Here is the final paragraph from Claude’s response:

So while metaphors can certainly serve as elegant rhetorical devices, their primary significance lies in their role as cognitive tools that structure thought, enable conceptual understanding, and mediate between abstract and concrete domains of experience.

Now, on to the next version of my code to replace images!

Note: I use Simtheory.ai to access all the primary LLM engines, which I highly recommend: for one reasonable fee, the subscriber has access to many models.

Posted in Uncategorized

Do Thinking Machines Actually Think?

I was intrigued as I perused this article this morning, “Thinking Machines That Don’t: Confronting AI’s Biases & System Flaws” by Markus Bernhardt. As I read, it was clear that the author’s goal was to support a claim often heard about the artificial intelligence (AI) tools that have come on the scene over the past two years. The article states, “At their architectural heart, LLMs are not thinking entities but extraordinarily sophisticated sequence processors.” Specifically, the claim is that large language models merely place one word after another, in an order derived from the probabilities of word occurrence they learned from huge amounts of training data.

I submitted the piece to xAI’s Grok and requested that it summarize the piece in a single paragraph:

Markus Bernhardt’s article, Thinking Machines That Don’t: Confronting AI’s Biases & Systemic Flaws (July 7, 2025), critiques the limitations of large language models (LLMs), arguing that their fluent, confident outputs create a deceptive illusion of understanding despite lacking genuine cognition, intentionality, or a persistent world model. It highlights the “paradox of neutrality,” where efforts to make LLMs unbiased result in a bias toward acquiescence, reinforcing dominant norms and flattening complex considerations, as seen in corporate and educational settings. The article also points to LLMs’ tendency to perpetuate prevalent discourse due to their statistical foundations and their “contextual collapse,” which applies a uniform approach across diverse knowledge domains, risking epistemological missteps. While promising a follow-up on persistent error patterns, the piece calls for critical, human-centric use of LLMs to counter their inherent flaws and prevent over-reliance in high-stakes contexts.

I also asked Grok to assess the piece and received this rather lengthy commentary that included this:

Conclusion

Bernhardt’s article is a necessary wake-up call, urging us to look beyond LLMs’ polished outputs and confront their systemic flaws: the illusion of understanding, the paradox of neutrality, prevalence reinforcement, and contextual collapse. These critiques are grounded in the technical reality of LLMs as statistical pattern-matchers, not cognitive entities. While the article slightly overgeneralizes the risks and underplays potential mitigations, its core claims are valid and urgent, especially as LLMs proliferate in critical domains. It’s a reminder that tools like me are powerful but fallible, requiring active, critical engagement from users to avoid being misled by our confident facades. The promised follow-up on error patterns should further enrich this discussion, and I look forward to reacting to it when it’s published.

It is interesting to note what some might feel is a remarkable sense of self-awareness in the statement there, “It’s a reminder that tools like me are powerful but fallible… The promised follow-up on error patterns should further enrich this discussion, and I look forward to reacting to it when it’s published.”

While I have no substantive rebuttal to the idea that these systems do not “think,” I have to say that I am often amazed at the non-trivial insights provided by any one of the several LLMs I use on a daily basis. Also interesting is the fact that my wife, a non-user of LLMs, frequently agrees with me when I share one of those responses with her.

Posted in Generative AI, Large Language Models

AI for Creating Lists of Terms for Software Manuals and Translations

User manuals for software typically use the terms that appeared during screen design for the software. Such lists of terms are especially useful if the software is to be internationalized. For a documentation project I am working on, however, I was unable to come up with such a list.

Casting about for a way to proceed without a list of terms, I contemplated simply compiling one from the documentation itself, which I did in French. The results were less than satisfactory. My thinking then turned to having the translators for each of the three languages review the first section, in hopes that the process might surface the terms to be added. That effort also met with challenges.

Then I remembered the 106 screenshots we have for the software in all four languages, and a reasonable approach began to take shape. Returning to the idea of collecting the needed terms from those screenshots, I quickly realized that doing so manually would be a daunting task, and one fraught with the possibility of many mistakes.

I recently posted about my project to do automatic handwriting recognition on a journal my wife kept for many years while our children were young. After setting that aside due to challenges encountered two years ago, I decided to try again a couple of weeks ago and was nothing short of astounded at what I found.

With that in mind last week, and wanting a solution for the list of terms I needed, it hit me that generative AI tools could perform optical character recognition on the 106 screenshots in each of the four languages to produce the needed list. After a couple of quick, successful experiments submitting screenshots via the Web interface and retrieving the text, I was impressed. That success also made clear what it would take to run the process on all 424 images involved: a program to submit the screenshots, retrieve the results, and then match terms from English to the other three languages.

I then worked with Google’s Gemini to set up access to the system’s API (Application Programming Interface). Gemini helped me create a program in Python to submit the 424 images to Gemini’s large language model and receive a total of over 2,000 terms in each language. Then came the daunting task of matching up those terms across languages. I had a couple of unsuccessful attempts, the result of asking the AI to match terms in two lists of over 2,000 items each. I suggested a divide-and-conquer approach, but that did not help. The first couple of attempts met with total failure. Indeed, one of the AI systems simply provided some code and a dozen terms. Another responded that it could do what was needed, but then abandoned the task with no further explanation. Yet another (the Chinese DeepSeek system, running on Groq hardware to accelerate processing) generated an incredibly lengthy explanation of the difficulty of what I was asking. Its response to the same prompt I had supplied the other systems began with this:

Okay, so I need to figure out how to solve this problem where I have to calculate the number of ways to arrange some objects with certain constraints. Let me try to break it down step by step. First, I remember that when dealing with permutations and combinations, the key is to understand whether the order matters and if there are any restrictions. In this case, the problem involves arranging objects where some are identical and others are different, and there are specific constraints on their positions. Hmm, so let’s say I have a total of n objects. Out of these, there are some identical objects and some that are distinct. For example, maybe I have 3 identical red balls and 2 distinct blue balls. The problem might ask how many ways I can arrange these balls in a line or a circle, considering that some positions are restricted. Wait, the user didn’t specify the exact problem, but I think I should consider a general approach. Let me think of a common scenario. Suppose we have n objects where there are duplicates and some specific constraints, like certain objects must be placed in particular positions or can’t be placed next to each other.

That was the first of FOURTEEN pages of explanation! Needless to say, more work was needed on the approach I was taking. 😊

The good news is that those failures prompted me to come up with an approach that worked. I returned to Gemini for help creating Python code to submit the English screenshots one at a time and match each screen’s terms across the other three languages. The process saved the results into an Excel file, which contained 2,382 entries. After using Excel to remove the duplicates, 368 items remained, with some cleanup still needed. Et voilà!
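The matching-and-deduplication step can be sketched in pure Python; the positional alignment strategy and the language codes here are assumptions for illustration, since the article doesn't name the three target languages:

```python
def match_and_dedupe(per_screenshot, languages=("en", "fr", "es", "de")):
    """per_screenshot maps screenshot id -> language code -> ordered term
    list. Terms are aligned by position, which assumes the model returns
    each screen's terms in the same on-screen order for every language.
    Rows are deduplicated on the English term; first occurrence wins."""
    rows = {}
    for shot, by_lang in sorted(per_screenshot.items()):
        lists = [by_lang.get(lang, []) for lang in languages]
        for terms in zip(*lists):   # one aligned row per on-screen term
            rows.setdefault(terms[0], terms)
    return list(rows.values())
```

Note that `zip` silently truncates to the shortest list, so a screenshot whose language sets are out of sync produces misaligned or missing rows, the failure mode described in the paragraph below.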

The following screenshot illustrates user error on my part: I used an up-to-date set of English screenshots, but the screenshot sets for the other languages were out of date.

This once again proves the old GIGO adage, “Garbage in, garbage out!”

Posted in Generative AI, Large Language Models