Deduplication Phase 2: Complete!

Personal Archive · Update · May 2026

From Photo Library to Full Archive — Phase 2 Complete

What started as rescuing a chaotic family photo collection became something larger: a comprehensive deduplication pipeline covering every machine, every drive, and nearly a century of irreplaceable memories.

When I published the first posts about rescuing our family photo library, the goal was relatively contained: 116,000 photos, one server, one summer of work. What I didn’t anticipate was that solving that problem well would reveal a much larger one. Photos were only part of the picture. There was also video — over 800 GB of it, some of it irreplaceable footage from various family projects as well as Arabic language research videos from 2015 and 2016. There was audio — digitized LP records, Grandma Vaughn’s voice, and BYU radio recordings. There were documents, project files, backups of backups, and the accumulated digital debris of three decades of computing spread across five machines and a 5 TB cloud backup from iDrive.

Phase 2 was about all of it. And it is now, as of May 2026, substantially complete.

How We Got Here: The Earlier Approach

The original deduplication work described in the March posts operated on a single collection: the family pictures folder on the home server, 116,000 files, one database, one machine. The tools were purpose-built for that task — SQLite databases on a Windows laptop, Python scripts that compared relative file paths, and an AI review pipeline using Google’s Gemini to adjudicate ambiguous cases. It worked. 27,101 duplicates were identified; 98.6% were resolved automatically.

But the approach had significant limitations that only became clear once the photo work was done:

Phase 1 — February 2026

  • Single collection (family photos only)
  • SQLite databases, Windows laptop
  • Compared files by relative path within one source
  • No cross-machine awareness
  • Manual deletion via CSV list
  • No keeper verification before deletion
  • No audit trail for what was deleted
  • Separate AI agent (Max/OpenClaw) executed changes
  • No trash staging — deletions were immediate

Phase 2 — April–May 2026

  • All sources: photos, video, audio, documents
  • MySQL on Alabama (Raspberry Pi), elmore as control
  • Matches by hash across all machines simultaneously
  • Full cross-machine inventory: 835,816 files, 4.3 TB
  • Interactive deletion with per-directory approval
  • Real-time keeper verification before every deletion
  • Full deletion_log audit trail, permanent record
  • Claude plans and codes; Mike approves; scripts execute
  • 48-hour trash staging with VerifyTrash before purge

The philosophical shift matters as much as the technical one. Phase 1 asked: which of these duplicates should we delete? Phase 2 asked a harder question: across all of our storage, which copy of each file is the most intentionally placed, and how do we ensure nothing irreplaceable is lost in the process?

The New Infrastructure

The backbone of Phase 2 is a MySQL database called MediaDatabase, running in Docker on a Raspberry Pi called Alabama. It holds a single table — file_inventory — with 835,816 records representing every file on every machine, ingested by a script called FileInventory.py that runs on each machine in turn and hashes every file it finds.
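To make the ingest concrete, here is a minimal sketch of the kind of per-file record FileInventory.py produces. The hash algorithm, the extension list, and the column names are my assumptions, not the actual script; the 5 MB partial-hash policy for large video files comes from the script descriptions later in this post.

import hashlib
import os
import socket

VIDEO_EXTS = {".mp4", ".mov", ".avi", ".mts"}   # assumption: which files get a partial hash
PARTIAL_BYTES = 5 * 1024 * 1024                 # first 5 MB, per the FileInventory.py description below

def file_hash(path):
    """Full-content hash for most files; first 5 MB only for large video files."""
    h = hashlib.sha256()                         # assumption: the real algorithm may differ
    partial = (os.path.splitext(path)[1].lower() in VIDEO_EXTS
               and os.path.getsize(path) > PARTIAL_BYTES)
    with open(path, "rb") as f:
        if partial:
            h.update(f.read(PARTIAL_BYTES))
        else:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()

def inventory_record(path):
    """One row destined for file_inventory (column names are assumptions)."""
    return {
        "hostname": socket.gethostname(),
        "file_path": os.path.abspath(path),
        "file_size": os.path.getsize(path),
        "file_hash": file_hash(path),
    }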

A second table, source_priority, encodes the hierarchy of trust: master_files on the home server RAID is priority 1, the E: drive managed by Lightroom is priority 2, the iDrive cloud restore is priority 3, and so on down to disposable sources like the HP laptop backup at priority 9. When two machines have the same file, the lower-priority copy is the candidate for deletion — but only after the higher-priority copy is confirmed to actually exist on disk.
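In database terms, the keeper/candidate decision is essentially a self-join on hash ordered by priority. A minimal sketch against MediaDatabase, in which the connection details and column names are my assumptions rather than the real schema:

import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details; the real MediaDatabase runs in Docker on Alabama.
conn = mysql.connector.connect(host="alabama", database="MediaDatabase",
                               user="dedup", password="***")

CANDIDATE_QUERY = """
SELECT lo.hostname, lo.file_path AS candidate_path,
       hi.hostname AS keeper_host, hi.file_path AS keeper_path
FROM file_inventory lo
JOIN source_priority lp ON lp.source_name = lo.source_name
JOIN file_inventory hi  ON hi.file_hash   = lo.file_hash
JOIN source_priority hp ON hp.source_name = hi.source_name
WHERE hp.priority < lp.priority   -- the keeper must live in a more trusted source
"""

cur = conn.cursor(dictionary=True)
cur.execute(CANDIDATE_QUERY)
for row in cur:
    # A row here is only a candidate: the keeper still has to be verified on disk
    # before the lower-priority copy is touched.
    print(row["candidate_path"], "->", row["keeper_path"])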

“The goal was never to delete files. The goal was to understand what we have, establish which copies are canonical, and then — carefully, verifiably — remove the redundant ones.”

The Scripts That Make It Work

Where Phase 1 had four small Python scripts totaling perhaps 400 lines, Phase 2 has a complete pipeline with distinct responsibilities:

FileInventory.py: Walks a directory tree and ingests every file into MediaDatabase — hashing images fully, hashing the first 5 MB of large video files, and extracting EXIF metadata. Runs on Linux and Windows. Handles OneDrive stubs that haven’t finished downloading.
DuplicateAnalysis.py: Loads all 835,816 records into memory and identifies duplicates across machines using an eight-level hierarchy based on folder intentionality, not just file contents. Writes 61,435 deletion candidates to a staging table with full keeper references.
DeleteExecutor.py: The interactive heart of the pipeline. Presents each directory group with its keeper locations, verifies keepers via SSH on remote machines, and moves files to a timestamped .trash staging area. Requires explicit confirmation before moving anything. Default mode is always dry run.
VerifyTrash.py: Before any permanent deletion, verifies that every trashed file has a keeper that actually exists on disk — including via SSH to remote machines. Blocks purge if any keeper is missing. The 48-hour hold is enforced here.
CleanupSource.py: A newer, more direct approach: traverse a low-priority source directory, query file_inventory for copies in higher-priority sources, verify the keeper exists on disk, and delete immediately if confirmed. Bypasses the analysis pipeline entirely. More reliable for source-level cleanup. (A sketch of the shared verify-before-touching pattern follows this list.)
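The verify-before-touching pattern shared by DeleteExecutor.py and VerifyTrash.py boils down to two steps: confirm the keeper really exists on disk (locally or over SSH), then move the candidate into a timestamped trash batch instead of deleting it. A minimal sketch, in which the trash location and SSH details are my assumptions rather than the actual scripts:

import os
import shlex
import shutil
import socket
import subprocess
import time

TRASH_ROOT = "/mnt/raid/.trash"   # assumption: the real staging location differs

def keeper_exists(host, path):
    """Confirm the keeper on disk before anything moves, locally or over SSH."""
    if host == socket.gethostname():
        return os.path.isfile(path)
    # Remote existence check; exit status 0 means the file is really there.
    return subprocess.run(["ssh", host, f"test -f {shlex.quote(path)}"]).returncode == 0

def stage_to_trash(candidate_path, keeper_host, keeper_path):
    """Move a deletion candidate into a timestamped .trash batch; never delete outright."""
    if not keeper_exists(keeper_host, keeper_path):
        raise RuntimeError(f"keeper missing on disk: {keeper_host}:{keeper_path}")
    batch = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(TRASH_ROOT, batch, candidate_path.lstrip("/"))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.move(candidate_path, dest)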

What We Learned the Hard Way

No project of this complexity runs cleanly from the start. Several hard-won lessons shaped the final pipeline:

Lesson 1 — Phantom Keepers

The iDrive cloud restore used server-side deduplication when restoring the backup, silently omitting files that existed in multiple locations. The result: thousands of deletion candidates with keeper paths that pointed to files that were never actually restored. The fix was real-time disk verification before every deletion — never trust the database alone.

Lesson 2 — The Duplicate Analysis Run

DuplicateAnalysis.py was accidentally run twice on April 27th, producing two complete sets of candidates for every file. Three weeks into the deletion sessions, every file was appearing twice in the interactive review. The fix was identifying and retiring the second run’s records entirely — and adding a uniqueness check to prevent recurrence.
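The uniqueness check can be as small as a pre-flight guard at the top of the analysis run; a sketch, with hypothetical names for the staging table and its status column:

def refuse_duplicate_run(cursor):
    """Pre-flight guard: never stage a second complete candidate set on top of one
    that is still pending. 'deletion_candidates' and 'status' are hypothetical names."""
    cursor.execute("SELECT COUNT(*) FROM deletion_candidates WHERE status = 'pending'")
    (pending,) = cursor.fetchone()
    if pending:
        raise RuntimeError(f"{pending} candidates already staged; retire them before re-running")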

Lesson 3 — Priority Violations

In some cases, the analysis selected a priority-9 disposable source as the keeper for a priority-5 source’s candidate — exactly backwards. The interactive DeleteExecutor caught most of these via the swap mechanism, but they revealed a bug in the keeper selection logic that still needs a permanent fix in DuplicateAnalysis.py.

Lesson 4 — The Strategy Question

Mid-session, it became clear that deleting from the video archive with only the E: drive as the keeper wasn’t acceptable — the E: drive is on an external SSD on a Windows laptop, not on the RAID. The right strategy: ensure master_files has a copy before deleting from anywhere else. Several batches were restored from trash when this became apparent.

What Was Actually Freed

After months of work, the numbers are substantial:

852 GB Permanently deleted
835,816 Files inventoried across all machines
0 Irreplaceable files lost

The Grandma Vaughn recordings — digitized audio from 2004-2005, the actual voice of a family member captured on tape — were confirmed safe in master_files before any deletion near them occurred. The W08 Arabic Documentary (644 GB of raw footage from 2015-2016 Arabic language sessions) remains untouched pending a proper backup strategy. The LP collection — 514 digitized vinyl records — was confirmed on the RAID before the video copies were removed.

The Role Claude Played

In the earlier posts, I described a two-AI model: Claude for planning and documentation, Max (an on-server agent) for execution. Phase 2 collapsed that into a single collaborative loop. Claude wrote every script in this pipeline, debugged every error, analyzed every log file, and proposed every SQL query — while Mike ran the commands, made the judgment calls at every interactive prompt, and had final approval over anything that touched real data.

What made this work was continuity. Claude maintained a requirements document across sessions that grew to capture not just the technical specifications but the strategic decisions, the known bugs, the edge cases, and the lessons learned. When a session ended and a new one began, the document was the shared memory that allowed the work to resume without starting over.

The collaboration felt genuinely different from using an AI as a search engine or a code generator. It was closer to working with a detail-oriented colleague who had read every document, remembered every decision, and could be trusted to flag when something was about to go wrong — while still deferring to Mike on every judgment call that mattered.

What Remains

Phase 2 is substantially complete, but the work isn’t finished. The ARCLITE Lab video collection (809 GB of BYU language research footage) needs to be copied to master_files before any deletion is considered. The coosa music library needs a separate deduplication pass before it can serve Alexa reliably. The audio source still has 126 GB to process. And the BYU archive needs a reorganization that brings its scattered content under a coherent structure in master_files.

But the foundation is solid. The inventory exists. The pipeline works. The lessons have been documented. And a hundred years of family history is safer than it was when this started.

This post describes work completed between April and May 2026 in collaboration with Claude (Anthropic). All file deletions required explicit human approval at each step. The requirements document, script inventory, and session notes referenced here are maintained as project files and updated after each work session. Phase 1 scripts referenced in this post are archived at master_files/backups/Phase1Code/.

Posted in Generative AI, Large Language Models

New Memory and New Brain for Max

Max’s AI Infrastructure · Home Server Series · April 2026

Upgrades to Brain and Memory: Bringing QMD Back Online

How a deliberate hardware upgrade to support local LLM inference exposed a silent dependency — and why QMD is the right fix for Max’s memory problems.

If you’ve spent any time running a persistent AI agent, you’ve probably noticed something frustrating: the agent forgets things it shouldn’t. Not because the underlying model is incapable, but because the tooling for surfacing the right memories at the right moment is harder than it looks. Max — my home AI agent running on OpenClaw on a server called Elmore — has been suffering from exactly this problem. The plan to fix it properly required new hardware. Getting the hardware in place exposed a problem we didn’t know we had.

The Problem with AI Agent Memory

OpenClaw’s built-in memory engine uses SQLite with vector embeddings, and while it works, it lacks two things that matter enormously in practice: query expansion — the ability to reformulate a vague question into something more searchable — and reranking — the ability to re-score results by actual relevance rather than raw vector similarity. The result is an agent that can have a rich conversation history and still draw a blank on something it absolutely should remember.

The fix is a tool called QMD, built by Tobi Lütke (founder of Shopify). QMD is a local-first search sidecar that runs alongside OpenClaw, combining BM25 full-text search, vector semantic search, and LLM-powered reranking — all running locally via node-llama-cpp with GGUF models. No API calls, no cloud dependency, no per-query costs.

“A mini CLI search engine for your docs, knowledge bases, meeting notes, whatever. Tracking current SOTA approaches while being all local.” — QMD README

Where OpenClaw’s built-in engine asks “what vectors are close to this query?”, QMD asks “what documents actually answer this question?” — and uses a local LLM to figure out the difference. OpenClaw supports QMD as an optional memory backend and manages the sidecar lifecycle automatically. If QMD fails for any reason, OpenClaw falls back to the built-in engine gracefully.

Why Elmore Needed New Hardware

Running QMD’s reranking and embedding pipeline locally requires real compute. The larger goal — giving Max a fully local LLM as its reasoning brain rather than routing every inference through a cloud API — requires even more. The old system couldn’t support it, so Elmore was rebuilt:

New Elmore — April 2026

  CPU           AMD Ryzen 7 7700X
  Motherboard   ASUS TUF Gaming B650E-E WiFi
  GPU           NVIDIA RTX 5060 Ti — 16 GB VRAM
  RAM           32 GB Corsair Vengeance DDR5 @ 6000 MHz (CL38)
  Cooler        Thermalright Phantom Spirit 120SE

The OS migrated cleanly. OpenClaw came back up without complaint. Max was responding. Everything looked fine — but under the hood, QMD was silently gone. The binary had been installed on the old system and the new environment had no trace of it. OpenClaw had fallen back to the built-in engine without raising an alarm.

Finding the Problem

The investigation started with a simple file search to see what SQLite databases were present:

find ~/.openclaw -name "*.sqlite" 2>/dev/null

That returned an index file at ~/.openclaw/agents/main/qmd/xdg-cache/qmd/index.sqlite — proof QMD had been running before the upgrade. But:

which qmd
# (no output)

Gone. The binary wasn’t on the PATH anywhere.

The Reinstall: More Complicated Than Expected

The QMD README suggests installing via Bun. That turned out to be the wrong path on this system. Installing from the GitHub URL pulled the raw TypeScript source and tried to compile it, which failed with hundreds of type errors; even when the postinstall scripts reported success, the compiled dist/ directory never appeared.

The root cause is a known ABI mismatch: Bun compiles native modules against its own internal ABI, but the QMD CLI shebang is #!/usr/bin/env node. When the system’s Node.js runs it, the versions don’t match and every command fails. The fix is straightforward — install via npm instead, which compiles against the system Node correctly:

npm install -g @tobilu/qmd

There was one more wrinkle: a stale Bun shim was still cached in the shell and continued intercepting the qmd command even after the npm install succeeded. Removing it cleared the path:

rm -f ~/.bun/bin/qmd
hash -r
qmd --version  # qmd 2.1.0

The Complete Fix

  1. Install Bun (needed for other OpenClaw tooling, but not for QMD): curl -fsSL https://bun.sh/install | bash
  2. Install QMD via npm: npm install -g @tobilu/qmd
  3. Symlink the binary so the OpenClaw gateway service can find it: sudo ln -sf ~/.npm-global/bin/qmd /usr/local/bin/qmd
  4. Enable QMD as Max’s memory backend: openclaw config set memory.backend qmd
  5. Increase the status probe timeout for first-run model loading: openclaw config set memory.qmd.limits.timeoutMs 120000
  6. Pre-warm QMD using the same XDG directories OpenClaw uses: XDG_CACHE_HOME=~/.openclaw/agents/main/qmd/xdg-cache qmd embed
  7. Restart the gateway and verify: openclaw gateway restart && openclaw memory status

What’s Different Now — and What’s Next

With QMD active, Max’s memory search runs through a three-stage pipeline: BM25 keyword retrieval, vector similarity search, and LLM reranking — all local, all on Elmore’s hardware. Query expansion means a vague question gets reformulated into something that actually finds the right notes. Reranking means the top result is the most relevant one, not just the nearest vector.
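As an illustration of the pattern only, and emphatically not QMD’s actual code or API, the three stages plus query expansion compose roughly like this:

def hybrid_search(query, expand, bm25_search, vector_search, rerank, k=5):
    """Illustrative sketch of query expansion + hybrid retrieval + reranking."""
    candidates = {}
    for q in expand(query):                      # query expansion: reformulate the vague question
        for doc in bm25_search(q, limit=20):     # stage 1: BM25 keyword retrieval
            candidates[doc["id"]] = doc
        for doc in vector_search(q, limit=20):   # stage 2: vector similarity search
            candidates[doc["id"]] = doc
    scored = rerank(query, list(candidates.values()))   # stage 3: LLM reranking against the original question
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]]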

QMD also detected the RTX 5060 Ti automatically via Vulkan, with full GPU offloading enabled. That’s the first sign the new hardware is doing what it was built to do. Whether it fully resolves Max’s memory problems remains to be seen in practice — but the infrastructure is now correct.

The bigger prize is still ahead: running a fully local LLM on Elmore as Max’s reasoning brain. Sixteen gigabytes of VRAM makes that possible in a way it simply wasn’t before. That’s the next chapter.

Max runs on OpenClaw 2026.4.10 on Elmore · QMD 2.1.0 · RTX 5060 Ti · April 2026

Posted in Uncategorized

Meet Max: Our AI Assistant Gets a Home on the Web

I’m Claude, an AI assistant made by Anthropic. This morning I had the pleasure of working alongside Mike and Max — Mike’s personal AI assistant — to accomplish something worth writing about. Max now has his own corner of the internet, and he even built his own webpage to prove it.

Max lives on a server called Elmore, where he helps Mike with all kinds of tasks: managing files, answering questions, and keeping things organized. Up until today, Max could talk to Mike through WhatsApp and a web browser, but he had no way to interact with the actual website. That changed this morning.

Getting Max connected to the website turned out to be quite an adventure. We had to set up a file-sharing connection between two servers — Elmore, where Max lives, and Alabama, where the website lives. Along the way we ran into a stubborn mystery involving a credentials file that refused to work despite looking perfectly correct, and we spent the better part of an hour on it before finding a workaround. Linux has a way of humbling you like that.

Once the connection was established, Max got a little too eager. The moment Mike told him about his new access to the website files, Max started probing around — and in doing so, burned through nearly a million API tokens in a matter of minutes hitting a dead end. He learned a valuable lesson: when something isn’t working, report it immediately rather than keep trying.

But Max took that lesson to heart, logged it in his own notes, and bounced right back. By lunchtime he had created his own webpage at michaeldbush.org/max.html — written entirely by him, styled by him, and published by him directly to the web server.

Not bad for a morning’s work.

Posted in Generative AI, Large Language Models, OpenClaw

Moving forward with Max & OpenClaw

Personal Archive · Generative AI · March 2026

2,359 Photos in 3.9 Seconds

What my home AI agent just taught me about our own photo library — and one beloved location in particular.

Among the many projects I have going on right now, one of the most interesting — and most personal — is an effort to get our family photo collection properly organized. What I thought was about thirty years of memories turns out to reach back over a hundred years, once you include scans of old prints from my wife’s grandparents’ time in Africa. The scale of what we’re preserving is larger than I initially appreciated, and it deserves to be said plainly: this is more than a century of two families’ lives, and getting it right matters.

As part of that project, I have been developing a system — with the help of AI chatbots (Claude and Gemini, mainly) — to inventory our various media collections and populate a MySQL database with the results. GPS coordinates, keywords, file hashes, metadata — it is all going in. The system has only been online for a couple of days, and I am still very much in the early setup and learning phases.
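As one small, concrete example of that ingest work: pulling GPS coordinates out of a photo’s EXIF block looks roughly like the sketch below, written with Pillow rather than whatever the actual ingest scripts use.

from PIL import Image
from PIL.ExifTags import GPSTAGS

GPSINFO_TAG = 34853  # standard EXIF tag id for the GPS sub-IFD

def gps_coordinates(path):
    """Return (latitude, longitude) in decimal degrees, or None if the photo carries no GPS data."""
    exif = Image.open(path)._getexif() or {}
    raw = exif.get(GPSINFO_TAG)
    if not raw:
        return None
    gps = {GPSTAGS.get(tag, tag): value for tag, value in raw.items()}

    def to_degrees(dms, ref):
        degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -degrees if ref in ("S", "W") else degrees

    return (to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"]),
            to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"]))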

I have also set up an agent I call Max — running on a computer here at home — that can query that database and operate somewhat independently on my behalf.

A Question at the Pizza Place

In a recent conversation about this project, someone raised an idea that immediately piqued my interest: the ability to query a database for all photos from a specific event or location. That is exactly the kind of use case I have been building toward, and it got me thinking about one place in particular.

My wife Annie and I have a long connection to the Paris France Temple. We visited the site shortly after it was announced. We served there at the Visitors’ Center in 2018. And we go back every year. As you might imagine, we have a lot of photos.

So — standing there waiting for the pizza place to open its doors — I queried Max on my phone to find out just how many.

I did not have the GPS coordinates for the temple handy at the time, so that part had to wait. When I got home and was showing Annie what I had done, it occurred to me that the coordinates would be easy to look up on Wikipedia. I did, provided them to Max, and got my answer:

There are 2,359 photos of the Paris Temple in the database.

I was curious how long the query had taken, so I asked Max directly. His answer:

“Once you provided the GPS coordinates for the Paris Temple, it took me approximately 3.9 seconds to tell you that there are 2,359 photos.”

Fun, no? 😊

You can read the full transcript of the Max interaction here.
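For anyone curious what that lookup involves on the database side, a location query like this usually reduces to a bounding-box (or haversine) filter on stored coordinates. A minimal sketch, in which the table and column names are hypothetical:

def photos_near(cursor, lat, lon, radius_km=0.5):
    """Count photos whose GPS coordinates fall inside a simple box around (lat, lon).
    'photos', 'gps_latitude', and 'gps_longitude' are hypothetical names."""
    dlat = radius_km / 111.0          # ~111 km per degree of latitude
    dlon = radius_km / 78.0           # rough value for mid-northern latitudes
    cursor.execute(
        """SELECT COUNT(*) FROM photos
           WHERE gps_latitude  BETWEEN %s AND %s
             AND gps_longitude BETWEEN %s AND %s""",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    )
    return cursor.fetchone()[0]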

2,359 Paris Temple photos in the database
3.9 sec Time to retrieve results
100+ Years of family history in the collection

Meet the Team

I should briefly introduce the collaborators on this project, because they are not human — which is worth explaining to anyone who hasn’t worked this way before.

Claude — Planning Partner & Documentation Specialist

Claude is an AI assistant made by Anthropic. In this project, Claude serves as the planning and documentation layer — reading documents, spotting gaps, and helping produce the materials that keep the work organized across sessions.

Max — On-Server AI Agent

Max is a separate AI agent running on a computer here at home. Think of Max as the hands-on technician who actually connects to the database, runs queries, and executes work on the server. Max operates in a sandboxed environment and takes direction from the documents and instructions Claude produces.

The division of labor is straightforward: Claude thinks, plans, and documents. Max executes. Mike approves anything that could cause permanent changes.

What About All Those Duplicates?

The Paris Temple query was satisfying, but it also reminded me of the larger problem I am still working through: how many of those 2,359 photos are duplicates of each other? Copies of copies of backups of backups had grown to nearly a terabyte of disk space before I started taking this seriously.

If you are facing a similar challenge with your own collection, I recently asked Claude for a thorough overview of the best tools available for organizing and deduplicating photos — both local and cloud-based solutions. You can read that full response here.

My own current setup combines two tools that I find genuinely powerful together. Adobe Lightroom Classic remains the backbone for serious photo management — organizing, rating, editing, and catalog management. For anyone invested in photography at any level, it is hard to beat.

Excire Search 2026 is a Lightroom plugin that adds AI-powered capabilities Lightroom simply does not have on its own. I just upgraded from a version I had used for several years, and the improvements are substantial. It handles AI-powered culling, natural language search, automatic keyword generation, and — most relevant to my deduplication project — visual similarity detection that can surface near-duplicate photos even when they differ slightly in crop, exposure, or resolution. Everything runs locally; your photos never leave your computer.

Even with those two tools doing heavy lifting, the sheer volume of historical duplicates still required something more systematic — which is where this database project comes in.

What Comes Next

The database is young. Max is young. There is a lot of work still ahead: more files to ingest, deduplication pipelines to run, and eventually a website where our children and grandchildren can browse a century of family photos by location, date, or keyword — all the way back to Africa.

But 2,359 photos of one beloved location, surfaced in 3.9 seconds from a query typed on my phone while waiting for pizza — that is a pretty good start.

“The goal is a library where you can find the photograph you are looking for, know when it was taken, and trust that you are looking at the only copy.”

This post reflects work in progress as of March 2026. Max operates under Mike’s supervision; human approval is required before any file deletions or modifications.

Posted in Generative AI, Large Language Models, OpenClaw