Stop guessing which RAG strategy actually works.
Upload your documents. Run 5 retrieval strategies in parallel. Get an AI-scored leaderboard in under 10 minutes. Open source and self-hostable.
Which strategy should I use?
Naive RAG? Hybrid search? Reranking? HyDE? Every team guesses and hopes. Nobody measures.
Why is my RAG failing?
Your system scores 68% but you have no idea if it's the chunking, the embeddings, or the model. Debugging takes weeks.
How do I prove it works?
Regulators and stakeholders want evidence. You have vibes. That gap costs contracts.
How it works
From document to decision in 4 steps
Upload your documents
PDF or any text document. Stored securely. Chunked and embedded automatically.
Auto-generate test questions
AI reads your documents and creates realistic Q&A pairs automatically. PII scrubbed before storage.
5 strategies run in parallel
LangGraph agents benchmark Naive RAG, Hybrid BM25, Cohere Rerank, HyDE, and Parent-Child simultaneously.
Know exactly what to deploy
AI judge scores each strategy. Failure attribution tells you why the others lost.
Self-hosting
Self-host in minutes
Clone the repo and run locally in under 5 minutes. No vendor lock-in. Your data stays yours.
Clone the repo
git clone https://github.com/tanmaykaushik451/rag-eval-app
cd rag-eval-app
Add your API keys
cp .env.example .env
# Add your keys:
#   OpenRouter, Cohere, AWS S3,
#   Neon PostgreSQL
Run it
pip install -r requirements.txt
uvicorn backend.main:app
cd frontend && npm run dev
5 strategies
5 strategies. One winner. No guessing.
Most teams pick one and hope. We test all five in 9 seconds.
Naive RAG
Pure vector similarity search. Fast, simple, often not enough.
Hybrid BM25 + Vector
Combines keyword and semantic search using Reciprocal Rank Fusion.
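Reciprocal Rank Fusion itself is a few lines of arithmetic: each document earns 1/(k + rank) from every ranked list it appears in, and the sums decide the fused order. A minimal sketch (the constant k=60 and the toy doc IDs are illustrative, not taken from this project's code):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc IDs into one ranking.

    rankings: list of ranked lists, best result first.
    k: smoothing constant; 60 is a common default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc3", "doc1", "doc7"]    # keyword ranking
vector = ["doc1", "doc5", "doc3"]  # semantic ranking
fused = reciprocal_rank_fusion([bm25, vector])
# doc1 ranks first: it places highly in both lists
```

A document that is merely decent in both lists can beat one that tops a single list, which is exactly the behavior hybrid search wants.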
Cohere Rerank
A cross-encoder re-scores the top 20 retrieved candidates by true relevance.
HyDE
Generates a hypothetical answer first, then searches with that embedding.
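The HyDE trick is that an imagined answer lives closer to the relevant passages in embedding space than the question does. A minimal sketch, assuming you supply your own `generate` (any LLM call) and `embed` (any text-to-vector function); both names are placeholders, not this project's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def hyde_retrieve(query, corpus, generate, embed, top_k=3):
    """HyDE: embed a hypothetical *answer*, not the raw query.

    generate: caller-supplied LLM call that drafts a plausible answer.
    embed:    caller-supplied text -> vector function.
    """
    hypothetical = generate(f"Write a short passage answering: {query}")
    qvec = embed(hypothetical)
    # Rank the corpus against the hypothetical answer's embedding
    return sorted(corpus, key=lambda doc: cosine(qvec, embed(doc)),
                  reverse=True)[:top_k]
```

In production the corpus embeddings would be precomputed and stored (e.g. in pgvector) rather than recomputed per query.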
Parent-Child
Matches small chunks but returns surrounding context for richer answers.
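The parent-child idea reduces to one data structure: an index of small chunks, each pointing back to the larger passage it came from. A minimal sketch using word overlap as a stand-in for embedding similarity (function names and chunk sizes are illustrative, not this project's code):

```python
def build_index(parents, child_size=12):
    """Split each parent passage into small child chunks, mapping child -> parent."""
    index = []  # list of (child_text, parent_id)
    for pid, text in enumerate(parents):
        words = text.split()
        for i in range(0, len(words), child_size):
            index.append((" ".join(words[i:i + child_size]), pid))
    return index

def retrieve_parent(query, parents, index):
    """Match the query against small chunks, but return the full parent passage."""
    def overlap(a, b):
        return len(set(a.lower().split()) & set(b.lower().split()))
    _, pid = max(index, key=lambda entry: overlap(query, entry[0]))
    return parents[pid]

parents = [
    "Refund policy: customers may request a refund within 30 days of purchase. "
    "Refunds are issued to the original payment method within 5 business days.",
    "Shipping policy: orders ship within 2 business days via standard carriers.",
]
index = build_index(parents)
retrieve_parent("refund within 30 days", parents, index)
# -> the full refund-policy passage, not just the matching chunk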
Use cases
Who this is built for
Any team shipping RAG in production faces the same problem. Here is how different industries use RAG Eval.
Financial Institution
OSFI-compliant RAG evaluation
A Canadian bank building an internal policy assistant for compliance documents needs to prove systematic evaluation to regulators before going live.
- โAuto-generates test questions from policy docs
- โPII scrubbed before any cloud processing
- โ5-strategy benchmark across full corpus
- โOne-click audit ZIP for OSFI review
- โCI/CD gate prevents quality regressions
AI-Native Startup
Shipping RAG without breaking production
A developer tools company building an AI assistant over their documentation needs to pick the right retrieval strategy before launch and protect quality after.
- โBenchmarks all 5 strategies in one run
- โFailure attribution shows exactly why Naive RAG misses technical terminology
- โBaseline locked after first evaluation
- โGitHub Action blocks regressions on every PR
Healthcare Platform
Patient-safe RAG with full audit trail
A healthcare platform building clinical decision support needs evaluation that never exposes patient data and produces evidence for clinical governance review.
- โSelf-hostable โ runs entirely on your servers
- โZero data leaves your network
- โPresidio PII scrubbing on all test questions
- โFull audit export for clinical governance
- โBatch evaluation across hundreds of questions
Features
Everything you need. Nothing you don't.
Built for production teams โ not for demos.
Failure Attribution
Pinpoints exactly why each strategy failed โ embedding weakness, retrieval logic, or generation layer.
CI/CD Regression Gates
GitHub Action blocks merges when RAG quality drops below your baseline. Never ship a silent regression.
Synthetic Test Generator
Auto-generates realistic Q&A pairs from your documents. No manual labeling required.
PII Scrubbing
Microsoft Presidio automatically detects and removes names, emails, and sensitive data before processing.
Audit Export
One-click ZIP with questions.csv, scrub_report.json, and a PDF summary formatted for regulatory review.
LangSmith Tracing
Every parallel agent run fully traced. See cost, latency, and token usage per strategy in real time.
Tech stack
Built on the stack you already trust
Built with async Python, LangGraph parallel agents, pgvector for vector storage, and Microsoft Presidio for PII detection. Every evaluation traced in LangSmith. Open architecture โ bring your own models, swap any component.
Open source. Self-hostable. Free.
Clone the repo, add your API keys, and run your first evaluation in under 5 minutes.