🦖 VelociRAG
Lightning-fast RAG for AI agents.
Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready.
Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods — vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering — fused through reciprocal rank fusion with cross-encoder reranking. All running on ONNX Runtime, no GPU, no API keys. Comes with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.
🚀 Quick Start
MCP Server (Claude, Cursor, Windsurf)
pip install "velocirag[mcp]"
velocirag index ./my-docs
velocirag mcp
Claude Code — add to .mcp.json in your project root:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp"],
"env": { "VELOCIRAG_DB": "/path/to/data" }
}
}
}
Then open /mcp in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. .venv/bin/velocirag).
Claude Desktop — add to claude_desktop_config.json:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp", "--db", "/path/to/data"]
}
}
}
Cursor — add to .cursor/mcp.json:
{
"mcpServers": {
"velocirag": {
"command": "velocirag",
"args": ["mcp", "--db", "/path/to/data"]
}
}
}
Python API
from velocirag import Embedder, VectorStore, Searcher
embedder = Embedder()
store = VectorStore('./my-db', embedder)
store.add_directory('./my-docs')
searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)
CLI
pip install velocirag
velocirag index ./my-docs
velocirag search "your query here"
Search Daemon (warm engine for CLI users)
velocirag serve --db ./my-data # start daemon (background)
velocirag search "query" # auto-routes through daemon
velocirag status # check daemon health
velocirag stop # stop daemon
The daemon keeps the ONNX model + FAISS index warm over a Unix socket. First query loads the engine (~1s), subsequent queries return in ~180ms with full 4-layer fusion.
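Because the CLI auto-routes through the daemon and supports --format json, scripts and agents can shell out to it for warm searches. A minimal sketch (the JSON schema returned by --format json is not documented here, so the result is only loaded generically rather than assuming field names):
import json
import subprocess

# Runs a warm search through the daemon if one is serving, cold otherwise
proc = subprocess.run(
    ["velocirag", "search", "react state management", "--limit", "5", "--format", "json"],
    capture_output=True, text=True, check=True,
)
results = json.loads(proc.stdout)
print(results)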
🎯 Why VelociRAG?
| | VelociRAG | LangChain | LlamaIndex | Chroma | mcp-local-rag |
|---|---|---|---|---|---|
| Search layers | 4 | 2 | 2 | 1 | 2 |
| Cross-encoder reranking | ✅ | ❌ | ✅ | ❌ | ❌ |
| Knowledge graph | ✅ | ❌ | ✅ | ❌ | ❌ |
| Incremental updates | ✅ | ❌ | ❌ | ❌ | ❌ |
| LLM required for search | ❌ | ⚠️ | ⚠️ | ❌ | ❌ |
| MCP server | ✅ | ❌ | ❌ | ❌ | ✅ |
| GPU required | ❌ | ❌ | ❌ | ❌ | ❌ |
| PyTorch required | ❌ | ✅ | ✅ | ❌ | ❌ |
| Install size | ~80MB | ~750MB+ | ~750MB+ | ~50MB | ~30MB |
| Warm search latency | ~3ms | — | — | ~50ms | ~200ms |
🏗️ How It Works
The 4-layer pipeline:
Query → expand (acronyms, variants)
→ [Vector] FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX)
→ [Keyword] BM25 via SQLite FTS5
→ [Graph] Knowledge graph traversal
→ [Metadata] Structured SQL filters (tags, status, project)
→ RRF Fusion → Cross-encoder rerank → Results
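The fusion step is plain reciprocal rank fusion: each layer contributes 1/(k + rank) for every document it returns, and the summed scores decide the fused order. A self-contained sketch of that formula (illustrative only, not VelociRAG's internal code; k=60 is the value commonly used in the RRF literature):
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one ordering."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([
    ["doc3", "doc1", "doc7"],  # vector layer
    ["doc1", "doc9"],          # BM25 layer
    ["doc7", "doc1"],          # graph layer
])
# doc1 ranks first: it sits near the top of all three lists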
What each layer catches:
| Query type | Vector | Keyword | Graph | Metadata |
|---|---|---|---|---|
| Conceptual ("improve error handling") | ✅ | — | — | — |
| Exact match ("ERR_CONNECTION_REFUSED") | — | ✅ | — | — |
| Connected concepts | — | — | ✅ | — |
| Filtered ("#python status:active") | — | — | — | ✅ |
| Combined ("React state management") | ✅ | ✅ | ✅ | ✅ |
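After fusion, the surviving candidates are re-scored pairwise against the query before the final cut. A hedged sketch of that rerank stage: in VelociRAG the scorer is the TinyBERT cross-encoder run via ONNX Runtime, but a toy token-overlap scorer stands in here so the example runs on its own.
def toy_score(query: str, text: str) -> float:
    # Stand-in for the cross-encoder: fraction of query tokens present in the chunk
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def rerank(query, candidates, score=toy_score, top_k=5):
    # candidates: (doc_id, text) pairs in fused order
    ranked = sorted(candidates, key=lambda c: score(query, c[1]), reverse=True)
    return ranked[:top_k]

print(rerank("react state management",
             [("doc1", "Managing state in React with hooks"),
              ("doc2", "Kubernetes cluster state")]))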
✨ Features
- ONNX Runtime — 184ms cold start, 3ms cached. No PyTorch, no GPU
- Four-layer fusion — FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
- Cross-encoder reranking — TinyBERT reranker via ONNX Runtime — included in base install, no PyTorch needed. Downloads ~17MB model on first use
- Incremental graph updates — file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Cascading deletes maintain consistency across all stores (vector, graph, metadata). Multi-source support with isolated provenance per source
- MCP server — Five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
- Search daemon — Unix socket server keeps ONNX model + FAISS index warm between queries
- Knowledge graph — Analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
- Smart chunking — Header-aware splitting preserves document structure and parent context
- Query expansion — Acronym registry, casing/spacing variants, underscore-aware tokenization (see the sketch after this list)
- Runs anywhere — CPU-only, 8GB RAM, no API keys, no external services
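The query expansion step is mostly string-level work: an acronym registry plus casing, spacing, and underscore variants. A hedged sketch of the idea (names and the acronym table are illustrative, not VelociRAG's actual API):
import re

# Illustrative acronym registry
ACRONYMS = {"rag": "retrieval augmented generation", "ner": "named entity recognition"}

def expand_query(query: str) -> set[str]:
    variants = {query, query.lower()}
    # underscore-aware tokenization
    for token in query.lower().replace("_", " ").split():
        if token in ACRONYMS:
            variants.add(query.lower().replace(token, ACRONYMS[token]))
    # casing/spacing variant, e.g. "VectorStore" -> "vector store"
    variants.add(re.sub(r"(?<=[a-z])(?=[A-Z])", " ", query).lower())
    return variants

print(expand_query("VectorStore RAG"))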
🤖 MCP Server
VelociRAG exposes a Model Context Protocol server for seamless agent integration:
Available tools:
- search — 4-layer fusion search with reranking
- index — Add documents to the knowledge base
- add_document — Insert single document
- health — System diagnostics
- list_sources — Show indexed document sources
The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.
🐍 Python API
Full 4-layer unified search:
from velocirag import (
Embedder, VectorStore, Searcher,
GraphStore, MetadataStore, UnifiedSearch,
GraphPipeline
)
# Build the full stack
embedder = Embedder()
store = VectorStore('./search-db', embedder)
graph_store = GraphStore('./search-db/graph.db')
metadata_store = MetadataStore('./search-db/metadata.db')
# Index with graph + metadata
store.add_directory('./docs')
pipeline = GraphPipeline(graph_store, embedder, metadata_store)
pipeline.build('./docs', source_name='my-docs')
# Unified search across all layers
searcher = Searcher(store, embedder)
unified = UnifiedSearch(searcher, graph_store, metadata_store)
results = unified.search(
'machine learning algorithms',
limit=5,
enrich_graph=True,
filters={'tags': ['python'], 'status': 'active'}
)
Quick semantic search:
from velocirag import Embedder, VectorStore, Searcher
embedder = Embedder()
store = VectorStore('./db', embedder)
store.add_directory('./docs')
searcher = Searcher(store, embedder)
results = searcher.search('neural networks', limit=10)
Incremental graph updates:
from velocirag import Embedder, GraphStore, GraphPipeline
# First run — full build, populates provenance
gs = GraphStore('./db/graph.db')
pipeline = GraphPipeline(gs, embedder=Embedder())
pipeline.build('./docs', source_name='my-docs') # full build
# Subsequent runs — only changed files get reprocessed
pipeline.build('./docs', source_name='my-docs') # incremental (automatic)
# Force full rebuild
pipeline.build('./docs', source_name='my-docs', force_rebuild=True)
# Multi-source graphs
pipeline.build('./project-a', source_name='project-a')
pipeline.build('./project-b', source_name='project-b') # isolated provenance
# Deleted files automatically cascade across all stores
# (vector, FTS5, graph, metadata) on next build
💻 CLI Reference
# Index documents (graph + metadata built by default)
velocirag index <path> [--no-graph] [--no-metadata] [--gliner] [--full-graph] [--force]
[--source NAME] [--db PATH]
# Search across all layers (auto-routes through daemon if running)
velocirag search <query> [--limit N] [--threshold F] [--format text|json]
# Search daemon
velocirag serve [--db PATH] [-f] # start daemon (-f for foreground)
velocirag stop # stop daemon
velocirag status # check daemon health
# Metadata queries
velocirag query [--tags TAG] [--status S] [--project P] [--recent N]
# System health and status
velocirag health [--format text|json]
# Start MCP server
velocirag mcp [--db PATH] [--transport stdio|sse]
Options:
- --no-graph — Skip knowledge graph build
- --no-metadata — Skip metadata extraction
- --full-graph — Build graph WITH semantic similarity edges (~2GB extra RAM)
- --source NAME — Label for multi-source provenance isolation
- --force — Clear and rebuild from scratch
- --gliner — Use GLiNER for entity extraction (requires pip install "velocirag[ner]")
📊 Performance
Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):
| Metric | Value |
|---|---|
| Index (418 files) | 13.6s |
| Search (warm, 5 results) | 35–90ms |
| Graph build (light) | 2.1s → 2,397 nodes, 8,717 edges |
| Incremental update (1 file) | 1.3s |
| Reranker | Cross-encoder TinyBERT via ONNX |
| Install size | ~80MB (no PyTorch) |
| RAM usage | <1GB with all models loaded |
Production deployment (6,300+ chunks, 3 sources, 950 files):
| Metric | Value |
|---|---|
| Full search (warm) | 16ms avg, 2ms min |
| Full search (first run) | 22ms avg, 4ms min |
| Search P50 / P95 | 17ms / 55ms |
| Hit rate (100-query benchmark) | 99/100 |
| Graph | 3,125 nodes, 132,320 edges |
| Reranker | Cross-encoder TinyBERT via ONNX |
| RAM | <1GB with all models loaded |
⚙️ Configuration
| Environment Variable | Default | Description |
|---|---|---|
| VELOCIRAG_DB | ./.velocirag | Database directory |
| VELOCIRAG_SOCKET | /tmp/velocirag-daemon.sock | Daemon socket path |
| NO_COLOR | — | Disable colored output |
Dependencies (all included in base install):
- onnxruntime — ONNX inference (embedder + reranker)
- tokenizers + huggingface-hub — model loading
- faiss-cpu — vector similarity search
- networkx + scikit-learn — knowledge graph + topic clustering
- numpy, click, pyyaml, python-frontmatter
Optional extras:
pip install "velocirag[mcp]"— MCP server (addsfastmcp)pip install "velocirag[ner]"— GLiNER entity extraction (addsgliner, requires PyTorch)
📚 References
VelociRAG builds on these foundational works:
Core Fusion & Retrieval
Reciprocal Rank Fusion — Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." SIGIR '09.
Core fusion algorithm for merging results across retrieval layers.
BM25 — Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). "Okapi at TREC-3." TREC-3.
Keyword search foundation via SQLite FTS5.
Embeddings & Neural IR
Sentence-BERT — Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP 2019. paper
Dense embedding architecture using all-MiniLM-L6-v2.
MiniLM — Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers." NeurIPS 2020. paper
Efficient transformer distillation for production embedding models.
Reranking & Neural Models
Cross-Encoder Reranking — Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." arXiv:1901.04085. paper
Cross-attention reranking with TinyBERT on MS MARCO.
TinyBERT — Jiao, X., et al. (2020). "TinyBERT: Distilling BERT for Natural Language Understanding." Findings of EMNLP 2020. paper
Compressed BERT for fast reranking inference.
Vector Search & Systems
FAISS — Johnson, J., Douze, M., & Jégou, H. (2019). "Billion-scale similarity search with GPUs." IEEE Transactions on Big Data. paper
High-performance vector similarity search engine.
GLiNER — Zaratiana, U., Tomeh, N., Holat, P., & Charnois, T. (2023). "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer." arXiv:2311.08526. paper
Generalist NER for knowledge graph entity extraction (optional dependency).
📄 License
MIT — Use it anywhere, build anything.
Need agent integration help? Check AGENTS.md for machine-readable project context.
Built for agents who think fast and remember faster.