Life Archive

Personal information retrieval system: 278K documents across 7 sources, queryable via a RAG pipeline on a Mac Studio.

Overview

Life Archive is a full RAG (Retrieval-Augmented Generation) pipeline that indexes ~278K personal records spanning decades (Evernote notes, emails, magazine archives, Tana nodes, and Paperless-NGX documents) into a searchable knowledge base. It runs entirely on the Mac Studio using local embeddings (gte-Qwen2-7B on Apple MPS) and LanceDB for vector storage.

What it answers: “What did I write about X?”, “When did I meet Y?”, “What happened during Z trip?” In short, any question against a lifetime of personal documents.

Key paths:

Path Content
~/Sync/ED/life_archive/ Project root: all code, configs, data
~/Sync/ED/life_archive/.venv/ Python virtual environment
~/Sync/ED/life_archive/lancedb_data/ LanceDB vector database (~50 GB)
~/Sync/ED/life_archive/knowledge_graph.db SQLite knowledge graph (~356 MB)

Architecture

Data flow:

  1. Source extraction: raw documents parsed from Evernote exports, email archives, magazine PDFs, Tana JSON, and the Paperless-NGX API
  2. Enrichment: text cleaning, section splitting, paragraph chunking, QA pair generation
  3. Embedding: gte-Qwen2-7B encodes text into dense vectors (local, MPS-accelerated)
  4. Storage: LanceDB tables for docs, sections, paragraphs, and QA pairs; SQLite for the knowledge graph
  5. Query: multi-strategy retrieval with fusion and reranking

Retrieval strategies (all run in parallel per query):

Strategy What it does
Dense vectors Semantic similarity search against paragraph embeddings
SPLADE keywords Sparse keyword matching for exact terms
QA pairs Matches against pre-generated question-answer pairs
Knowledge graph Entity and relationship lookup
HyDE Hypothetical Document Embedding: generates a synthetic answer, then searches for similar real content

Results from all strategies are fused via Reciprocal Rank Fusion (RRF), then reranked with a cross-encoder model for final ordering.
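
A minimal sketch of the fusion step (illustrative only; the pipeline's actual constant and result shapes may differ, though k=60 is the conventional RRF value):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ranked lists into one ordering.

    Each document scores 1/(k + rank) for every list it appears in;
    k dampens the advantage of a single #1 placement.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists from three strategies
dense = ["d3", "d1", "d7"]
splade = ["d1", "d3", "d9"]
qa_pairs = ["d1", "d5"]
fused = reciprocal_rank_fusion([dense, splade, qa_pairs])
# d1 wins: it ranks highly in all three lists
```

In the real pipeline the fused list then goes to the cross-encoder for final ordering.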

Running Services

Four persistent services on Mac Studio, all managed via launchd:

Service Port launchd Label Purpose
Embed Server 1235 com.beedifferent.embed-server gte-Qwen2-7B on MPS; generates embeddings
Life Archive API 8900 com.beedifferent.life-archive-api FastAPI HTTP wrapper for remote queries
MCP HTTP Server 8901 com.beedifferent.life-archive-mcp-http Streamable HTTP MCP server for remote Claude clients
Paperless-NGX 8100 (manual / runserver) Document ingestion and OCR

All launchd plists are in ~/Library/LaunchAgents/.

API endpoints (port 8900):

Method Path Description
POST /search Full RAG search with all retrieval strategies
POST /entity Knowledge graph entity lookup
POST /temporal Temporal anchor search (events, dates, periods)
GET /stats Database statistics
GET /health Service health check
GET /docs Interactive Swagger UI

MCP endpoint (port 8901): http://192.168.8.180:8901/mcp (Streamable HTTP transport for Claude Desktop, Claude Code, or any MCP client).

Remote access:

Service Pangolin VPN Address
Life Archive API 100.96.128.19:8900
MCP HTTP Server 100.96.128.20:8901

Database Stats

LanceDB (as of 2026-03-12):

Table Rows
Documents 74,041
Paragraphs 2,689,330
Sections 714,451
QA pairs 289,356
Communities 0 (GraphRAG not run)
Total size ~63 GB

Knowledge Graph:

Table Count
Entities 276,348
Relationships 230,855
Doc-entity links 1,153,312
Assets 456,321
Temporal anchors 391,565
Entity aliases 167
Correspondents 18,385
DB size ~368 MB

Entity types: person (92,377) · org (85,519) · thing (52,346) · location (46,106)
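
A breakdown like this comes from a GROUP BY over the SQLite knowledge graph. Hypothetical sketch only: the real knowledge_graph.db table and column names are not documented here, so this mini-schema is invented for illustration.

```python
import sqlite3

# Invented mini-schema standing in for the real knowledge_graph.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (name TEXT, type TEXT, mention_count INTEGER)")
conn.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [("thomas brown", "person", 12),
     ("colorado", "location", 40),
     ("acme corp", "org", 3)],
)
# Per-type entity counts, as in the breakdown above
counts = dict(conn.execute(
    "SELECT type, COUNT(*) FROM entities GROUP BY type"
))
```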

Source breakdown:

Source Docs in LanceDB Notes
magazine_article 28,309 ✓ loaded
paperless_doc 22,555 ✓ loaded
tana_node 14,807 ✓ loaded
evernote_pdf 5,069 ✓ loaded
evernote_note 3,301 ✓ loaded
epub_articles 0 vectors exist (17 GB), not yet loaded
emails 0 enriched but not embedded (157K records)

MCP Tools (Claude Integration)

The Life Archive is also available as MCP tools inside Claude Code and Cowork, enabling natural-language queries without the HTTP API.

Tool Purpose
life_archive_search Full RAG search; the main query interface
life_archive_entity_lookup Find people, orgs, locations in the knowledge graph
life_archive_temporal_search Search for events, dates, time periods
life_archive_stats Database health and statistics
life_archive_graph_explore Deep-dive any entity: connections, source docs, aliases
life_archive_graph_traverse Multi-hop graph walk: map the neighborhood of any entity
life_archive_graph_search Find entities by name, filter by type

Two transport modes:

Transport Server Use Case
stdio mcp_server.py Local: spawned on demand by Claude Code/Cowork on the Mac Studio
Streamable HTTP mcp_server_http.py Remote: any MCP client on the network or over Pangolin VPN

Remote MCP client config (Claude Desktop / Claude Code):

{
    "mcpServers": {
        "life-archive": {
            "url": "http://100.96.128.20:8901/mcp"
        }
    }
}

Key Scripts

All scripts live in ~/Sync/ED/life_archive/:

Script Purpose
query.py Core query engine (LifeArchiveQuery class)
http_api.py FastAPI HTTP wrapper
embed_server.py Embedding server (gte-Qwen2-7B on MPS)
load_lancedb.py Loads extracted data into LanceDB tables
load_knowledge_graph.py Builds SQLite knowledge graph from extracted entities
resolve_entities.py Fuzzy dedup of knowledge graph entities
retry_entity_resolution.py Retry failed entity resolution batches
eval_queries.py Evaluation framework for query quality
mcp_server.py MCP stdio server for Claude integration
mcp_server_http.py MCP streamable HTTP server for remote access (port 8901)

Manual Operations

Check service status:

launchctl list | grep beedifferent

Restart embed server:

launchctl kickstart -k gui/$(id -u)/com.beedifferent.embed-server

Restart Life Archive API:

launchctl kickstart -k gui/$(id -u)/com.beedifferent.life-archive-api

Test API health:

curl http://localhost:8900/health

Run a search via API:

curl -X POST http://localhost:8900/search \
  -H "Content-Type: application/json" \
  -d '{"query": "beekeeping notes from 2023"}'

View logs:

tail -f ~/Sync/ED/life_archive/http_api.stdout.log
tail -f ~/Sync/ED/life_archive/http_api.stderr.log

Load new data into LanceDB:

cd ~/Sync/ED/life_archive
.venv/bin/python load_lancedb.py --source <source_name>

Rebuild knowledge graph:

cd ~/Sync/ED/life_archive
.venv/bin/python load_knowledge_graph.py

Knowledge Graph API (Universal Access)

The knowledge graph is exposed as a live API that any client can query: Claude, Obsidian, Tana, local LLMs, browsers, scripts. Three endpoints provide entity exploration, multi-hop traversal, and search, all with source document links back to the original archive content.

Live endpoints (port 8900):

Endpoint Method Purpose
/graph/explore POST Full entity deep-dive: info, connections, source docs, aliases
/graph/traverse POST Multi-hop subgraph: walk N hops from any starting entity
/graph/search POST Find entities by name, filter by type
/docs GET Interactive Swagger UI for all endpoints

Web explorer: http://192.168.8.180:1313/kg/ (interactive D3.js force-directed graph backed by the live API).

Example: Explore an entity

curl -X POST http://192.168.8.180:8900/graph/explore \
  -H "Content-Type: application/json" \
  -d '{"entity": "thomas brown", "max_connections": 20, "max_sources": 5}'

Returns: entity info, all connections with relationship labels, source documents with titles and summaries, total document count.

Example: Traverse the graph (2 hops from Colorado)

curl -X POST http://192.168.8.180:8900/graph/traverse \
  -H "Content-Type: application/json" \
  -d '{"entity": "colorado", "depth": 2, "max_per_hop": 15}'

Returns: full subgraph of nodes and edges reachable within N hops. Each node tagged with hop distance from root.
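
Conceptually this is a bounded breadth-first search. A toy sketch over an in-memory adjacency map (the server actually works against the SQLite graph, so this is illustrative only):

```python
from collections import deque

def traverse(graph, root, depth, max_per_hop):
    """Return {node: hop distance} for nodes within `depth` hops,
    expanding at most `max_per_hop` neighbours per node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue  # don't expand past the hop limit
        for nbr in graph.get(node, [])[:max_per_hop]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

g = {"colorado": ["denver", "boulder"], "denver": ["thomas brown"]}
hops = traverse(g, "colorado", depth=2, max_per_hop=15)
```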

Example: Search entities

curl -X POST http://192.168.8.180:8900/graph/search \
  -H "Content-Type: application/json" \
  -d '{"query": "brown", "entity_type": "person", "limit": 10}'

MCP tools (same functionality): life_archive_graph_explore, life_archive_graph_traverse, life_archive_graph_search, available via both the stdio and HTTP MCP servers. Any Claude session or MCP-compatible LLM can call these.

Client compatibility:

Client How to connect
Claude (Code/Cowork) MCP tools (already registered; just ask in natural language)
Local LLM (LM Studio, etc.) Point MCP client at http://192.168.8.180:8901/mcp
Obsidian HTTP API via Templater/Dataview, or Obsidian notes export (export_kg_obsidian.py)
Tana API integration to /graph/explore endpoint
Browser Swagger UI at /docs or web explorer at /kg/
Scripts curl / Python requests / any HTTP client
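
A stdlib-only sketch of scripted access (the LAN address and JSON fields are taken from the curl examples above; the response schema is an assumption):

```python
import json
import urllib.request

API = "http://192.168.8.180:8900"  # LAN address from the service table

def build_request(path, payload):
    # Mirror the curl examples: JSON body, Content-Type header, POST
    return urllib.request.Request(
        API + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def graph_explore(entity, max_connections=20, max_sources=5):
    req = build_request("/graph/explore", {
        "entity": entity,
        "max_connections": max_connections,
        "max_sources": max_sources,
    })
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# graph_explore("thomas brown") would perform the live call;
# here we only build a request without sending it.
req = build_request("/search", {"query": "beekeeping notes from 2023"})
```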

Key files:

File Purpose
graph_api.py Shared graph traversal logic (KnowledgeGraphAPI class)
http_api.py FastAPI HTTP endpoints (port 8900)
mcp_server.py MCP stdio server with graph tools
mcp_server_http.py MCP HTTP server with graph tools (port 8901)
export_kg_obsidian.py Export KG to Obsidian vault as markdown notes with wikilinks
export_kg_d3.py Export KG to JSON for D3.js visualization

Knowledge Graph Visualization

The knowledge graph can be exported to GEXF format for interactive exploration in Gephi or Cosmograph.

Export script: ~/Sync/ED/life_archive/export_kg_gexf.py

Pre-built exports (in ~/Sync/ED/life_archive/exports/):

File Nodes Edges Size Use case
life_archive_kg_full.gexf 276K 231K 173 MB Full graph (Gephi or Cosmograph)
life_archive_kg_top5000.gexf 5K 38K 13 MB Curated; best for first exploration

Color scheme:

Entity Type Color
Person Blue
Organization Red
Location Green
Thing Yellow
Concept Purple

Node sizes scale logarithmically by mention count.
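
That scaling can be sketched as follows (the constants here are illustrative, not the export script's actual values):

```python
import math

def node_size(mentions, base=4.0, scale=6.0):
    # Logarithmic sizing: an entity mentioned 10,000 times renders
    # larger, but does not dwarf one mentioned 10 times.
    return base + scale * math.log10(1 + mentions)
```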

Viewing in Gephi:

  1. Install: brew install --cask gephi
  2. File → Open → choose a .gexf export
  3. Layout → ForceAtlas 2 → Run (let it settle 30-60 sec) → Stop
  4. Appearance → Nodes → Color → Partition → entity_type
  5. Statistics → Modularity → Run → then color by modularity class to see communities
  6. Use Data Laboratory tab to search/filter entities by name

Viewing in Cosmograph:

  1. Go to cosmograph.app
  2. Drag and drop the .gexf file
  3. WebGL renders instantly and supports the full 276K-node graph

Custom exports:

cd ~/Sync/ED/life_archive

# Only people and orgs
python3 export_kg_gexf.py --types person org

# Entities mentioned 5+ times
python3 export_kg_gexf.py --min-mentions 5

# Top 10,000 by mention count
python3 export_kg_gexf.py --top 10000

Current Status & Pending Work

Last updated: 2026-03-24

Item Status
LanceDB loaded ✓ 74K docs, 2.69M paragraphs
Knowledge graph ✓ 276K entities, 231K relationships
Services running ✓ API :8900, MCP :8901, Embed :1235
Eval baseline ✓ 1.91/3.0 avg quality (2026-03-15)
epub_articles in LanceDB ✗ Vectors exist, not loaded
Emails embedded ✗ 157K records deferred
Contextual re-embedding ⚠️ Pending; RunPod run needed

Contextual re-embedding is the most important pending item. All existing embeddings were generated without document-level context prefixed to chunks. The new runpod_embed.py adds this (a 35-50% retrieval improvement). The previous RunPod run (2026-03-17 to 2026-03-21) failed at source 3/7 with OOM. Scripts were fixed 2026-03-21 and are ready for a new pod.
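
The idea can be sketched as prefixing each chunk with document-level context before encoding. The prefix format below is an assumption for illustration, not runpod_embed.py's actual template:

```python
def contextualize(chunk, doc_title, doc_summary, max_context=300):
    """Prefix a chunk with document-level context so its embedding
    reflects where the text came from, not just the local words."""
    context = f"Document: {doc_title}\nSummary: {doc_summary[:max_context]}"
    return f"{context}\n\n{chunk}"

text = contextualize(
    "We inspected the hives today.",
    doc_title="Beekeeping journal 2023",
    doc_summary="Season notes on the backyard apiary.",
)
```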

See ~/Sync/ED/TASKS.md for step-by-step next actions.

Remaining Work

Task Status Notes
Entity resolution Done 37 groups merged (7 original + 30 via Claude Sonnet), 177 aliases
Graph API + traversal Done /graph/explore, /graph/traverse, /graph/search + MCP tools
Email body embedding Deferred 157K email bodies not yet embedded (headers indexed)
Evaluation set Framework ready eval_queries.py exists, needs execution
Rule-based query routing Planned Replace LLM router with deterministic rules
New Paperless doc extraction Planned Process recently ingested 1,115 Evernote imports
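
The planned rule-based router could, purely as an illustration, replace the LLM call with fixed patterns like these (the strategy names and rules are assumptions, not the eventual implementation):

```python
import re

def route(query):
    """Deterministically map a query to retrieval strategies."""
    q = query.lower()
    if re.search(r"\b(when|what year|date|during)\b", q):
        return ["temporal", "dense"]          # time-anchored questions
    if re.search(r"\b(who|met|person|people)\b", q):
        return ["knowledge_graph", "dense"]   # entity questions
    return ["dense", "splade", "qa_pairs"]    # default: broad search
```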