Personal information retrieval system: 278K documents across 7 sources, queryable via a RAG pipeline on a Mac Studio.
Life Archive is a full RAG (Retrieval-Augmented Generation) pipeline that indexes ~278K personal records spanning decades (Evernote notes, emails, magazine archives, Tana nodes, and Paperless-NGX documents) into a searchable knowledge base. It runs entirely on the Mac Studio using local embeddings (gte-Qwen2-7B on Apple MPS) and LanceDB for vector storage.
What it answers: “What did I write about X?”, “When did I meet Y?”, “What happened during Z trip?”, or any other question against a lifetime of personal documents.
Key paths:
| Path | Content |
|---|---|
| ~/Sync/ED/life_archive/ | Project root: all code, configs, data |
| ~/Sync/ED/life_archive/.venv/ | Python virtual environment |
| ~/Sync/ED/life_archive/lancedb_data/ | LanceDB vector database (~50 GB) |
| ~/Sync/ED/life_archive/knowledge_graph.db | SQLite knowledge graph (~356 MB) |
Data flow:
- Source extraction: Raw documents parsed from Evernote exports, email archives, magazine PDFs, Tana JSON, Paperless-NGX API
- Enrichment: Text cleaning, section splitting, paragraph chunking, QA pair generation
- Embedding: gte-Qwen2-7B encodes text into dense vectors (local, MPS-accelerated)
- Storage: LanceDB tables for docs, sections, paragraphs, QA pairs; SQLite for knowledge graph
- Query: Multi-strategy retrieval with fusion and reranking
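The paragraph chunking step in the enrichment stage can be pictured with a minimal sketch. The function name `chunk_paragraphs`, the blank-line splitting rule, and the `max_chars` limit are all assumptions for illustration; the actual enrichment code may split and size chunks differently.

```python
import re


def chunk_paragraphs(text, max_chars=1200):
    """Split text on blank lines, then pack paragraphs into chunks.

    Consecutive paragraphs are merged until adding the next one would exceed
    max_chars. A single paragraph longer than max_chars is kept whole.
    The default limit is an illustrative value, not taken from the project.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # The current chunk is full: close it and start a new one.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up sample text.
sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
chunks = chunk_paragraphs(sample, max_chars=40)
```

With the 40-character limit used here, the first two paragraphs fit in one chunk and the third starts a new one.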
Retrieval strategies (all run in parallel per query):
| Strategy | What it does |
|---|---|
| Dense vectors | Semantic similarity search against paragraph embeddings |
| SPLADE keywords | Sparse keyword matching for exact terms |
| QA pairs | Matches against pre-generated question-answer pairs |
| Knowledge graph | Entity and relationship lookup |
| HyDE | Hypothetical Document Embedding: generates a synthetic answer, then searches for similar real content |
Results from all strategies are fused via Reciprocal Rank Fusion (RRF), then reranked with a cross-encoder model for final ordering.
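The fusion step can be illustrated with a minimal Reciprocal Rank Fusion sketch. The function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration; the implementation in query.py may differ, and the cross-encoder reranking that follows is not shown.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical per-strategy results (IDs are made up for the example).
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
qa_pairs = ["doc_b", "doc_a"]

fused = rrf_fuse([dense, sparse, qa_pairs])
top_ids = [doc_id for doc_id, _ in fused]
```

Documents that rank well in several strategies (here `doc_b` and `doc_a`) rise above documents found by only one strategy, which is why RRF works without comparing raw scores across strategies.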
Four persistent services run on the Mac Studio; three are managed via launchd, and Paperless-NGX is started manually:
| Service | Port | launchd Label | Purpose |
|---|---|---|---|
| Embed Server | 1235 | com.beedifferent.embed-server | gte-Qwen2-7B on MPS: generates embeddings |
| Life Archive API | 8900 | com.beedifferent.life-archive-api | FastAPI HTTP wrapper for remote queries |
| MCP HTTP Server | 8901 | com.beedifferent.life-archive-mcp-http | Streamable HTTP MCP server for remote Claude clients |
| Paperless-NGX | 8100 | (manual / runserver) | Document ingestion and OCR |
All launchd plists are in ~/Library/LaunchAgents/.
API endpoints (port 8900):
| Method | Path | Description |
|---|---|---|
| POST | /search | Full RAG search with all retrieval strategies |
| POST | /entity | Knowledge graph entity lookup |
| POST | /temporal | Temporal anchor search (events, dates, periods) |
| GET | /stats | Database statistics |
| GET | /health | Service health check |
| GET | /docs | Interactive Swagger UI |
MCP endpoint (port 8901): http://192.168.8.180:8901/mcp (Streamable HTTP transport for Claude Desktop, Claude Code, or any MCP client).
Remote access:
| Service | Pangolin VPN Address |
|---|---|
| Life Archive API | 100.96.128.19:8900 |
| MCP HTTP Server | 100.96.128.20:8901 |
LanceDB (as of 2026-03-12):
| Table | Rows |
|---|---|
| Documents | 74,041 |
| Paragraphs | 2,689,330 |
| Sections | 714,451 |
| QA pairs | 289,356 |
| Communities | 0 (GraphRAG not run) |
| Total size | ~63 GB |
Knowledge Graph:
| Table | Count |
|---|---|
| Entities | 276,348 |
| Relationships | 230,855 |
| Doc-entity links | 1,153,312 |
| Assets | 456,321 |
| Temporal anchors | 391,565 |
| Entity aliases | 167 |
| Correspondents | 18,385 |
| DB size | ~368 MB |
Entity types: person (92,377) · org (85,519) · thing (52,346) · location (46,106)
Source breakdown:
| Source | Docs in LanceDB | Notes |
|---|---|---|
| magazine_article | 28,309 | loaded |
| paperless_doc | 22,555 | loaded |
| tana_node | 14,807 | loaded |
| evernote_pdf | 5,069 | loaded |
| evernote_note | 3,301 | loaded |
| epub_articles | 0 | vectors exist (17 GB), not yet loaded |
| emails | 0 | enriched but not embedded (157K records) |
The Life Archive is also available as MCP tools inside Claude Code and Cowork, enabling natural-language queries without the HTTP API.
| Tool | Purpose |
|---|---|
| life_archive_search | Full RAG search: main query interface |
| life_archive_entity_lookup | Find people, orgs, locations in the knowledge graph |
| life_archive_temporal_search | Search for events, dates, time periods |
| life_archive_stats | Database health and statistics |
| life_archive_graph_explore | Deep-dive any entity: connections, source docs, aliases |
| life_archive_graph_traverse | Multi-hop graph walk: map the neighborhood of any entity |
| life_archive_graph_search | Find entities by name, filter by type |
Two transport modes:
| Transport | Server | Use Case |
|---|---|---|
| stdio | mcp_server.py | Local: spawned on demand by Claude Code/Cowork on the Mac Studio |
| Streamable HTTP | mcp_server_http.py | Remote: any MCP client on the network or over Pangolin VPN |
Remote MCP client config (Claude Desktop / Claude Code):
{
  "mcpServers": {
    "life-archive": {
      "url": "http://100.96.128.20:8901/mcp"
    }
  }
}
All scripts live in ~/Sync/ED/life_archive/:
| Script | Purpose |
|---|---|
| query.py | Core query engine: LifeArchiveQuery class |
| http_api.py | FastAPI HTTP wrapper |
| embed_server.py | Embedding server (gte-Qwen2-7B on MPS) |
| load_lancedb.py | Loads extracted data into LanceDB tables |
| load_knowledge_graph.py | Builds SQLite knowledge graph from extracted entities |
| resolve_entities.py | Fuzzy dedup of knowledge graph entities |
| retry_entity_resolution.py | Retry failed entity resolution batches |
| eval_queries.py | Evaluation framework for query quality |
| mcp_server.py | MCP stdio server for Claude integration |
| mcp_server_http.py | MCP streamable HTTP server for remote access (port 8901) |
Check service status:
launchctl list | grep beedifferent
Restart embed server:
launchctl kickstart -k gui/$(id -u)/com.beedifferent.embed-server
Restart Life Archive API:
launchctl kickstart -k gui/$(id -u)/com.beedifferent.life-archive-api
Test API health:
curl http://localhost:8900/health
Run a search via API:
curl -X POST http://localhost:8900/search \
-H "Content-Type: application/json" \
-d '{"query": "beekeeping notes from 2023"}'
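The same search can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_search_request` and `run_search` and the 60-second timeout are assumptions, and the response schema is not assumed, so the JSON is returned as-is.

```python
import json
import urllib.request

API_BASE = "http://localhost:8900"  # same host and port as the curl example


def build_search_request(query, base_url=API_BASE):
    """Build (but do not send) a POST /search request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(query, base_url=API_BASE, timeout=60):
    """Send the request and return the decoded JSON response.

    The timeout is an illustrative value; the response is returned
    unmodified because its schema is not documented here.
    """
    request = build_search_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
request = build_search_request("beekeeping notes from 2023")
```

Calling `run_search("beekeeping notes from 2023")` on the Mac Studio would send the request; from another machine, pass the LAN or VPN address as `base_url`.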
View logs:
tail -f ~/Sync/ED/life_archive/http_api.stdout.log
tail -f ~/Sync/ED/life_archive/http_api.stderr.log
Load new data into LanceDB:
cd ~/Sync/ED/life_archive
.venv/bin/python load_lancedb.py --source <source_name>
Rebuild knowledge graph:
cd ~/Sync/ED/life_archive
.venv/bin/python load_knowledge_graph.py
The knowledge graph is exposed as a live API that any client can query: Claude, Obsidian, Tana, local LLMs, browsers, scripts. Three endpoints provide entity exploration, multi-hop traversal, and search, all with source document links back to the original archive content.
Live endpoints (port 8900):
| Endpoint | Method | Purpose |
|---|---|---|
| /graph/explore | POST | Full entity deep-dive: info, connections, source docs, aliases |
| /graph/traverse | POST | Multi-hop subgraph: walk N hops from any starting entity |
| /graph/search | POST | Find entities by name, filter by type |
| /docs | GET | Interactive Swagger UI for all endpoints |
Web explorer: http://192.168.8.180:1313/kg/ (interactive D3.js force-directed graph backed by the live API).
Example: Explore an entity
curl -X POST http://192.168.8.180:8900/graph/explore \
-H "Content-Type: application/json" \
-d '{"entity": "thomas brown", "max_connections": 20, "max_sources": 5}'
Returns: entity info, all connections with relationship labels, source documents with titles and summaries, total document count.
Example: Traverse the graph (2 hops from Colorado)
curl -X POST http://192.168.8.180:8900/graph/traverse \
-H "Content-Type: application/json" \
-d '{"entity": "colorado", "depth": 2, "max_per_hop": 15}'
Returns: the full subgraph of nodes and edges reachable within N hops. Each node is tagged with its hop distance from the root.
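A hop-limited traversal of this kind can be sketched as a breadth-first search over an in-memory edge list. The function `traverse`, the sample edges, and the reading of `max_per_hop` as a cap on new nodes per hop are assumptions for illustration; the real logic lives in graph_api.py (KnowledgeGraphAPI class), works against the SQLite knowledge graph, and may behave differently.

```python
from collections import defaultdict


def traverse(edges, root, depth=2, max_per_hop=15):
    """Breadth-first walk up to `depth` hops from `root`.

    `edges` is a list of (source, relation, target) tuples, treated as
    undirected. Returns ({node: hop_distance}, [edges between visited nodes]).
    Treating max_per_hop as a cap on new nodes added per hop is an assumption.
    """
    neighbors = defaultdict(list)
    for source, _relation, target in edges:
        neighbors[source].append(target)
        neighbors[target].append(source)

    hops = {root: 0}
    frontier = [root]
    for hop in range(1, depth + 1):
        next_frontier = []
        for node in frontier:
            for other in neighbors[node]:
                if other not in hops and len(next_frontier) < max_per_hop:
                    hops[other] = hop
                    next_frontier.append(other)
        frontier = next_frontier

    # Keep only edges whose two endpoints were both reached.
    subgraph_edges = [e for e in edges if e[0] in hops and e[2] in hops]
    return hops, subgraph_edges


# Made-up sample data, not taken from the real knowledge graph.
sample_edges = [
    ("colorado", "contains", "denver"),
    ("denver", "home_of", "acme corp"),
    ("acme corp", "employs", "jane doe"),
]
nodes, sub_edges = traverse(sample_edges, "colorado", depth=2)
```

In this sample, "jane doe" is three hops from the root and is therefore excluded at depth 2.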
Example: Search entities
curl -X POST http://192.168.8.180:8900/graph/search \
-H "Content-Type: application/json" \
-d '{"query": "brown", "entity_type": "person", "limit": 10}'
MCP tools (same functionality): life_archive_graph_explore, life_archive_graph_traverse, life_archive_graph_search, available via the stdio and HTTP MCP servers. Any Claude session or MCP-compatible LLM can call these.
Client compatibility:
| Client | How to connect |
|---|---|
| Claude (Code/Cowork) | MCP tools: already registered, just ask in natural language |
| Local LLM (LM Studio, etc.) | Point MCP client at http://192.168.8.180:8901/mcp |
| Obsidian | HTTP API via Templater/Dataview, or Obsidian notes export (export_kg_obsidian.py) |
| Tana | API integration to /graph/explore endpoint |
| Browser | Swagger UI at /docs or web explorer at /kg/ |
| Scripts | curl / Python requests / any HTTP client |
Key files:
| File | Purpose |
|---|---|
| graph_api.py | Shared graph traversal logic (KnowledgeGraphAPI class) |
| http_api.py | FastAPI HTTP endpoints (port 8900) |
| mcp_server.py | MCP stdio server with graph tools |
| mcp_server_http.py | MCP HTTP server with graph tools (port 8901) |
| export_kg_obsidian.py | Export KG to Obsidian vault as markdown notes with wikilinks |
| export_kg_d3.py | Export KG to JSON for D3.js visualization |
The knowledge graph can be exported to GEXF format for interactive exploration in Gephi or Cosmograph.
Export script: ~/Sync/ED/life_archive/export_kg_gexf.py
Pre-built exports (in ~/Sync/ED/life_archive/exports/):
| File | Nodes | Edges | Size | Use case |
|---|---|---|---|---|
| life_archive_kg_full.gexf | 276K | 231K | 173 MB | Full graph: Gephi or Cosmograph |
| life_archive_kg_top5000.gexf | 5K | 38K | 13 MB | Curated: best for first exploration |
Color scheme:
| Entity Type | Color |
|---|---|
| Person | Blue |
| Organization | Red |
| Location | Green |
| Thing | Yellow |
| Concept | Purple |
Node sizes scale logarithmically by mention count.
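Logarithmic scaling keeps heavily mentioned entities from dwarfing the rest of the graph. A minimal sketch of such a mapping follows; the function name `node_size`, the `base` and `scale` values, and the use of `log1p` are assumptions, not the values used by export_kg_gexf.py.

```python
import math


def node_size(mention_count, base=5.0, scale=3.0):
    """Map a mention count to a node size that grows logarithmically.

    log1p keeps a count of 0 at the base size; base and scale are
    illustrative values, not the ones used by the export script.
    """
    return base + scale * math.log1p(max(mention_count, 0))


# Sizes for a few made-up mention counts.
sizes = {count: round(node_size(count), 2) for count in (0, 1, 10, 100, 1000)}
```

Each tenfold increase in mentions adds a roughly constant amount to the size, so an entity with 1,000 mentions is drawn only about twice as large as one with 10.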
Viewing in Gephi:
- Install: brew install --cask gephi
- File → Open → choose a .gexf export
- Layout → ForceAtlas 2 → Run (let settle 30-60 sec) → Stop
- Appearance → Nodes → Color → Partition → entity_type
- Statistics → Modularity → Run, then color by modularity class to see communities
- Use the Data Laboratory tab to search/filter entities by name
Viewing in Cosmograph:
- Go to cosmograph.app
- Drag and drop the .gexf file
- WebGL renders instantly and supports the full 276K-node graph
Custom exports:
cd ~/Sync/ED/life_archive
# Only people and orgs
python3 export_kg_gexf.py --types person org
# Entities mentioned 5+ times
python3 export_kg_gexf.py --min-mentions 5
# Top 10,000 by mention count
python3 export_kg_gexf.py --top 10000
Last updated: 2026-03-24
| Item | Status |
|---|---|
| LanceDB loaded | 74K docs, 2.69M paragraphs |
| Knowledge graph | 276K entities, 231K relationships |
| Services running | API :8900, MCP :8901, Embed :1235 |
| Eval baseline | 1.91/3.0 avg quality (2026-03-15) |
| epub_articles in LanceDB | Vectors exist, not loaded |
| Emails embedded | 157K records deferred |
| Contextual re-embedding | ⚠️ Pending: RunPod run needed |
Contextual re-embedding is the most important pending item. All existing embeddings were generated without document-level context prefixed to chunks. The new runpod_embed.py script adds this context (an estimated 35-50% retrieval improvement). The previous RunPod run (2026-03-17 to 2026-03-21) failed at source 3 of 7 with an out-of-memory (OOM) error. The scripts were fixed on 2026-03-21 and are ready for a new pod.
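The idea of prefixing document-level context to a chunk before embedding can be sketched as follows. The function `contextualize_chunk`, its field names, and the prefix layout are hypothetical; runpod_embed.py may build the context differently.

```python
def contextualize_chunk(chunk_text, doc_title, source, doc_date=None):
    """Prefix a chunk with document-level context before embedding.

    The prefix layout is an assumption. In this sketch the combined string
    is what would be sent to the embedding model.
    """
    context_parts = [f"Source: {source}", f"Title: {doc_title}"]
    if doc_date:
        context_parts.append(f"Date: {doc_date}")
    context = " | ".join(context_parts)
    return f"{context}\n\n{chunk_text}"


# Made-up example record.
text_to_embed = contextualize_chunk(
    chunk_text="Inspected both hives; the second colony looks weak.",
    doc_title="Hive inspection log",
    source="evernote_note",
    doc_date="2023-05-14",
)
```

Without the prefix, a short chunk like the one above carries no hint of which document, source, or date it belongs to, which is the gap contextual re-embedding is meant to close.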
See ~/Sync/ED/TASKS.md for step-by-step next actions.
| Task | Status | Notes |
|---|---|---|
| Entity resolution | Done | 37 groups merged (7 original + 30 via Claude Sonnet), 177 aliases |
| Graph API + traversal | Done | /graph/explore, /graph/traverse, /graph/search + MCP tools |
| Email body embedding | Deferred | 157K email bodies not yet embedded (headers indexed) |
| Evaluation set | Framework ready | eval_queries.py exists, needs execution |
| Rule-based query routing | Planned | Replace LLM router with deterministic rules |
| New Paperless doc extraction | Planned | Process the 1,115 recently ingested Evernote imports |