Life Archive

Personal information retrieval system: 278K documents across 7 sources, queryable via a RAG pipeline on a Mac Studio.

Overview

Life Archive is a full RAG (Retrieval-Augmented Generation) pipeline that indexes ~278K personal records spanning decades (Evernote notes, emails, magazine archives, Tana nodes, and Paperless-NGX documents) into a searchable knowledge base. It runs entirely on the Mac Studio using local embeddings (gte-Qwen2-7B on Apple MPS) and LanceDB for vector storage.

What it answers: “What did I write about X?”, “When did I meet Y?”, “What happened during Z trip?” In short, any question against a lifetime of personal documents.

Key paths:

Path Content
~/Sync/ED/life_archive/ Project root: all code, configs, data
~/Sync/ED/life_archive/.venv/ Python virtual environment
~/Sync/ED/life_archive/lancedb_data/ LanceDB vector database (~50 GB)
~/Sync/ED/life_archive/knowledge_graph.db SQLite knowledge graph (~356 MB)

Architecture

Data flow:

  1. Source extraction: raw documents parsed from Evernote exports, email archives, magazine PDFs, Tana JSON, and the Paperless-NGX API
  2. Enrichment: text cleaning, section splitting, paragraph chunking, QA pair generation
  3. Embedding: gte-Qwen2-7B encodes text into dense vectors (local, MPS-accelerated)
  4. Storage: LanceDB tables for docs, sections, paragraphs, and QA pairs; SQLite for the knowledge graph
  5. Query: multi-strategy retrieval with fusion and reranking

Retrieval strategies (all run in parallel per query):

Strategy What it does
Dense vectors Semantic similarity search against paragraph embeddings
SPLADE keywords Sparse keyword matching for exact terms
QA pairs Matches against pre-generated question-answer pairs
Knowledge graph Entity and relationship lookup
HyDE Hypothetical Document Embedding: generates a synthetic answer, then searches for similar real content

Results from all strategies are fused via Reciprocal Rank Fusion (RRF), then reranked with a cross-encoder model for final ordering.
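
A minimal sketch of the fusion step (illustrative only; the pipeline's actual constant and result shapes may differ, though k=60 is the conventional RRF value):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists into one ordering.

    Each document scores 1/(k + rank) for every list it appears in;
    k dampens the advantage of a single #1 placement.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists from three strategies
dense = ["d3", "d1", "d7"]
splade = ["d1", "d3", "d9"]
qa_pairs = ["d1", "d5"]
fused = reciprocal_rank_fusion([dense, splade, qa_pairs])
# d1 wins: it ranks highly in all three lists
```

In the real pipeline the fused list then goes to the cross-encoder for final ordering.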

Running Services

Four persistent services on Mac Studio, all managed via launchd:

Service Port launchd Label Purpose
Embed Server 1235 com.beedifferent.embed-server gte-Qwen2-7B on MPS; generates embeddings
Life Archive API 8900 com.beedifferent.life-archive-api FastAPI HTTP wrapper for remote queries
MCP HTTP Server 8901 com.beedifferent.life-archive-mcp-http Streamable HTTP MCP server for remote Claude clients
Paperless-NGX 8100 (manual / runserver) Document ingestion and OCR

All launchd plists are in ~/Library/LaunchAgents/.

API endpoints (port 8900):

Method Path Description
POST /search Full RAG search with all retrieval strategies
POST /entity Knowledge graph entity lookup
POST /temporal Temporal anchor search (events, dates, periods)
GET /stats Database statistics
GET /health Service health check
GET /docs Interactive Swagger UI

MCP endpoint (port 8901): http://192.168.8.180:8901/mcp (Streamable HTTP transport for Claude Desktop, Claude Code, or any MCP client).

Remote access:

Service Pangolin VPN Address
Life Archive API 100.96.128.19:8900
MCP HTTP Server 100.96.128.20:8901

Database Stats

LanceDB (as of 2026-03-12):

Table Rows
Documents 74,041
Paragraphs 2,689,330
Sections 714,451
QA pairs 289,356
Communities 0 (GraphRAG not run)
Total size ~63 GB

Knowledge Graph:

Table Count
Entities 276,348
Relationships 230,855
Doc-entity links 1,153,312
Assets 456,321
Temporal anchors 391,565
Entity aliases 167
Correspondents 18,385
DB size ~368 MB

Entity types: person (92,377) · org (85,519) · thing (52,346) · location (46,106)
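
A breakdown like this comes from a GROUP BY over the SQLite knowledge graph. Hypothetical sketch only: the real knowledge_graph.db table and column names are not documented here, so this mini-schema is invented for illustration.

```python
import sqlite3

# Invented mini-schema standing in for the real knowledge_graph.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (name TEXT, type TEXT, mention_count INTEGER)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [("thomas brown", "person", 12),
     ("colorado", "location", 40),
     ("acme corp", "org", 3)],
)
# Per-type entity counts, as in the breakdown above
counts = dict(conn.execute(
    "SELECT type, COUNT(*) FROM entities GROUP BY type"
))
```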

Source breakdown:

Source Docs in LanceDB Notes
magazine_article 28,309 ✓ loaded
paperless_doc 22,555 ✓ loaded
tana_node 14,807 ✓ loaded
evernote_pdf 5,069 ✓ loaded
evernote_note 3,301 ✓ loaded
epub_articles 0 vectors exist (17 GB), not yet loaded
emails 0 enriched but not embedded (157K records)

MCP Tools (Claude Integration)

The Life Archive is also available as MCP tools inside Claude Code and Cowork, enabling natural-language queries without the HTTP API.

Tool Purpose
life_archive_search Full RAG search; the main query interface
life_archive_entity_lookup Find people, orgs, locations in the knowledge graph
life_archive_temporal_search Search for events, dates, time periods
life_archive_stats Database health and statistics
life_archive_graph_explore Deep-dive any entity: connections, source docs, aliases
life_archive_graph_traverse Multi-hop graph walk: map the neighborhood of any entity
life_archive_graph_search Find entities by name, filter by type

Two transport modes:

Transport Server Use Case
stdio mcp_server.py Local: spawned on demand by Claude Code/Cowork on the Mac Studio
Streamable HTTP mcp_server_http.py Remote: any MCP client on the network or over Pangolin VPN

Remote MCP client config (Claude Desktop / Claude Code):

{
    "mcpServers": {
        "life-archive": {
            "url": "http://100.96.128.20:8901/mcp"
        }
    }
}

Key Scripts

All scripts live in ~/Sync/ED/life_archive/:

Script Purpose
query.py Core query engine (LifeArchiveQuery class)
http_api.py FastAPI HTTP wrapper
embed_server.py Embedding server (gte-Qwen2-7B on MPS)
load_lancedb.py Loads extracted data into LanceDB tables
load_knowledge_graph.py Builds SQLite knowledge graph from extracted entities
resolve_entities.py Fuzzy dedup of knowledge graph entities
retry_entity_resolution.py Retry failed entity resolution batches
eval_queries.py Evaluation framework for query quality
mcp_server.py MCP stdio server for Claude integration
mcp_server_http.py MCP streamable HTTP server for remote access (port 8901)

Manual Operations

Check service status:

launchctl list | grep beedifferent

Restart embed server:

launchctl kickstart -k gui/$(id -u)/com.beedifferent.embed-server

Restart Life Archive API:

launchctl kickstart -k gui/$(id -u)/com.beedifferent.life-archive-api

Test API health:

curl http://localhost:8900/health

Run a search via API:

curl -X POST http://localhost:8900/search \
  -H "Content-Type: application/json" \
  -d '{"query": "beekeeping notes from 2023"}'

View logs:

tail -f ~/Sync/ED/life_archive/http_api.stdout.log
tail -f ~/Sync/ED/life_archive/http_api.stderr.log

Load new data into LanceDB:

cd ~/Sync/ED/life_archive
.venv/bin/python load_lancedb.py --source <source_name>

Rebuild knowledge graph:

cd ~/Sync/ED/life_archive
.venv/bin/python load_knowledge_graph.py

Knowledge Graph API (Universal Access)

The knowledge graph is exposed as a live API that any client can query: Claude, Obsidian, Tana, local LLMs, browsers, scripts. Three endpoints provide entity exploration, multi-hop traversal, and search, all with source document links back to the original archive content.

Live endpoints (port 8900):

Endpoint Method Purpose
/graph/explore POST Full entity deep-dive: info, connections, source docs, aliases
/graph/traverse POST Multi-hop subgraph: walk N hops from any starting entity
/graph/search POST Find entities by name, filter by type
/docs GET Interactive Swagger UI for all endpoints

Web explorer: http://192.168.8.180:1313/kg/ (interactive D3.js force-directed graph backed by the live API).

Example: Explore an entity

curl -X POST http://192.168.8.180:8900/graph/explore \
  -H "Content-Type: application/json" \
  -d '{"entity": "thomas brown", "max_connections": 20, "max_sources": 5}'

Returns: entity info, all connections with relationship labels, source documents with titles and summaries, total document count.

Example: Traverse the graph (2 hops from Colorado)

curl -X POST http://192.168.8.180:8900/graph/traverse \
  -H "Content-Type: application/json" \
  -d '{"entity": "colorado", "depth": 2, "max_per_hop": 15}'

Returns: full subgraph of nodes and edges reachable within N hops. Each node tagged with hop distance from root.
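
Conceptually this is a bounded breadth-first search. A toy sketch over an in-memory adjacency map (the server actually works against the SQLite graph, so this is illustrative only):

```python
from collections import deque

def traverse(graph, root, depth, max_per_hop):
    """Return {node: hop distance} for nodes within `depth` hops,
    expanding at most `max_per_hop` neighbours per node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # don't expand past the hop limit
        for nbr in graph.get(node, [])[:max_per_hop]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

g = {"colorado": ["denver", "boulder"], "denver": ["thomas brown"]}
hops = traverse(g, "colorado", depth=2, max_per_hop=15)
```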

Example: Search entities

curl -X POST http://192.168.8.180:8900/graph/search \
  -H "Content-Type: application/json" \
  -d '{"query": "brown", "entity_type": "person", "limit": 10}'

MCP tools (same functionality): life_archive_graph_explore, life_archive_graph_traverse, life_archive_graph_search, available via both the stdio and HTTP MCP servers. Any Claude session or MCP-compatible LLM can call these.

Client compatibility:

Client How to connect
Claude (Code/Cowork) MCP tools (already registered; just ask in natural language)
Local LLM (LM Studio, etc.) Point MCP client at http://192.168.8.180:8901/mcp
Obsidian HTTP API via Templater/Dataview, or Obsidian notes export (export_kg_obsidian.py)
Tana API integration to /graph/explore endpoint
Browser Swagger UI at /docs or web explorer at /kg/
Scripts curl / Python requests / any HTTP client
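
A stdlib-only sketch of scripted access (the LAN address and JSON fields are taken from the curl examples above; the response schema is an assumption):

```python
import json
import urllib.request

API = "http://192.168.8.180:8900"  # LAN address from the service table

def build_request(path, payload):
    # Mirror the curl examples: JSON body, Content-Type header, POST
    return urllib.request.Request(
        API + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def graph_explore(entity, max_connections=20, max_sources=5):
    req = build_request("/graph/explore", {
        "entity": entity,
        "max_connections": max_connections,
        "max_sources": max_sources,
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# graph_explore("thomas brown") would perform the live call;
# here we only build a request without sending it.
req = build_request("/search", {"query": "beekeeping notes from 2023"})
```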

Key files:

File Purpose
graph_api.py Shared graph traversal logic (KnowledgeGraphAPI class)
http_api.py FastAPI HTTP endpoints (port 8900)
mcp_server.py MCP stdio server with graph tools
mcp_server_http.py MCP HTTP server with graph tools (port 8901)
export_kg_obsidian.py Export KG to Obsidian vault as markdown notes with wikilinks
export_kg_d3.py Export KG to JSON for D3.js visualization

Knowledge Graph Visualization

The knowledge graph can be exported to GEXF format for interactive exploration in Gephi or Cosmograph.

Export script: ~/Sync/ED/life_archive/export_kg_gexf.py

Pre-built exports (in ~/Sync/ED/life_archive/exports/):

File Nodes Edges Size Use case
life_archive_kg_full.gexf 276K 231K 173 MB Full graph (Gephi or Cosmograph)
life_archive_kg_top5000.gexf 5K 38K 13 MB Curated; best for first exploration

Color scheme:

Entity Type Color
Person Blue
Organization Red
Location Green
Thing Yellow
Concept Purple

Node sizes scale logarithmically by mention count.
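
That scaling can be sketched as follows (the constants here are illustrative, not the export script's actual values):

```python
import math

def node_size(mentions, base=4.0, scale=6.0):
    # Logarithmic sizing: an entity mentioned 10,000 times renders
    # larger, but does not dwarf one mentioned 10 times.
    return base + scale * math.log10(1 + mentions)
```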

Viewing in Gephi:

  1. Install: brew install --cask gephi
  2. File → Open → choose a .gexf export
  3. Layout → ForceAtlas 2 → Run (let it settle 30-60 sec) → Stop
  4. Appearance → Nodes → Color → Partition → entity_type
  5. Statistics → Modularity → Run → then color by modularity class to see communities
  6. Use Data Laboratory tab to search/filter entities by name

Viewing in Cosmograph:

  1. Go to cosmograph.app
  2. Drag and drop the .gexf file
  3. WebGL renders instantly and supports the full 276K-node graph

Custom exports:

cd ~/Sync/ED/life_archive

# Only people and orgs
python3 export_kg_gexf.py --types person org

# Entities mentioned 5+ times
python3 export_kg_gexf.py --min-mentions 5

# Top 10,000 by mention count
python3 export_kg_gexf.py --top 10000

Current Status & Pending Work

Last updated: 2026-03-24

Item Status
LanceDB loaded ✓ 74K docs, 2.69M paragraphs
Knowledge graph ✓ 276K entities, 231K relationships
Services running ✓ API :8900, MCP :8901, Embed :1235
Eval baseline ✓ 1.91/3.0 avg quality (2026-03-15)
epub_articles in LanceDB ✗ Vectors exist, not loaded
Emails embedded ✗ 157K records deferred
Contextual re-embedding ⚠️ Pending; RunPod run needed

Contextual re-embedding is the most important pending item. All existing embeddings were generated without document-level context prefixed to chunks. The new runpod_embed.py adds this (a 35-50% retrieval improvement). The previous RunPod run (2026-03-17 to 2026-03-21) failed at source 3/7 with OOM. Scripts were fixed 2026-03-21 and are ready for a new pod.
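
The idea can be sketched as prefixing each chunk with document-level context before encoding. The prefix format below is an assumption for illustration, not runpod_embed.py's actual template:

```python
def contextualize(chunk, doc_title, doc_summary, max_context=300):
    """Prefix a chunk with document-level context so its embedding
    reflects where the text came from, not just the local words."""
    context = f"Document: {doc_title}\nSummary: {doc_summary[:max_context]}"
    return f"{context}\n\n{chunk}"

text = contextualize(
    "We inspected the hives today.",
    doc_title="Beekeeping journal 2023",
    doc_summary="Season notes on the backyard apiary.",
)
```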

See ~/Sync/ED/TASKS.md for step-by-step next actions.

Remaining Work

Task Status Notes
Entity resolution Done 37 groups merged (7 original + 30 via Claude Sonnet), 177 aliases
Graph API + traversal Done /graph/explore, /graph/traverse, /graph/search + MCP tools
Email body embedding Deferred 157K email bodies not yet embedded (headers indexed)
Evaluation set Framework ready eval_queries.py exists, needs execution
Rule-based query routing Planned Replace LLM router with deterministic rules
New Paperless doc extraction Planned Process recently ingested 1,115 Evernote imports
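
The planned rule-based router could, purely as an illustration, replace the LLM call with fixed patterns like these (the strategy names and rules are assumptions, not the eventual implementation):

```python
import re

def route(query):
    """Deterministically map a query to retrieval strategies."""
    q = query.lower()
    if re.search(r"\b(when|what year|date|during)\b", q):
        return ["temporal", "dense"]          # time-anchored questions
    if re.search(r"\b(who|met|person|people)\b", q):
        return ["knowledge_graph", "dense"]   # entity questions
    return ["dense", "splade", "qa_pairs"]    # default: broad search
```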