RAG Benchmarking

Article 15 · Flagship

Plug in any RAG system — LangChain, LlamaIndex, or custom — and benchmark it against classic and agentic-era metrics. Faithfulness, answer relevancy, retrieval precision, and four agentic metrics for multi-step agents. Measured faithfulness of 0.958 on the 50-sample golden dataset.

Quick Start

```bash
pip install rag-benchmarking
```
```python
from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Works with LangChain
result = my_chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["metrics"])
# {"faithfulness": 0.95, "answer_relevancy": 0.81}
```

Benchmark Results

Measured on the 50-sample golden dataset using gemini-2.5-flash as judge at temperature=0.0.

| Metric           | Score | Rating    |
|------------------|-------|-----------|
| faithfulness     | 96%   | Excellent |
| answer_relevancy | 81%   | Good      |

Features

  • Framework-agnostic — works with LangChain, LlamaIndex, or any custom RAG system
  • Classic metrics: faithfulness, answer relevancy, context precision/recall
  • Retrieval metrics: Precision@K, Recall@K, MRR, NDCG
  • Agentic metrics: agent faithfulness, tool call accuracy, source attribution, retrieval necessity
  • REST API + Python SDK with LangChain and LlamaIndex adapters
  • Run history with comparison across configurations
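The retrieval metrics listed above follow their standard information-retrieval definitions. As an illustration only (not the library's internal implementation), Precision@K and MRR can be computed like this:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant retrieved document (0.0 if none)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical document IDs for illustration
retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d5"}
print(round(precision_at_k(retrieved, relevant, 3), 3))  # 0.333
print(mrr(retrieved, relevant))  # 0.5
```

Recall@K and NDCG follow the same pattern: Recall@K divides by the number of relevant documents instead of K, and NDCG discounts relevance by log rank.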

EU AI Act Context

Article 15: Accuracy Requirements

Provides systematic accuracy testing and documentation for high-risk AI systems under Article 15.

Known Limitations

  • Benchmark datasets are English-only; no multilingual evaluation support.
  • Custom dataset integration requires manual formatting to the expected JSONL schema.
  • Accuracy metrics only — latency and throughput are not measured.
  • LLM-as-judge metrics depend on the configured judge model quality.
  • Rate limiting is in-memory and resets on server restart.
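Since custom datasets must be hand-formatted to the expected JSONL schema, a minimal conversion sketch may help. This assumes the same question / contexts / answer fields shown in the Quick Start sample; the actual schema may require additional fields (e.g. ground-truth answers), so check the project documentation:

```python
import json

# Samples in the same shape as the Quick Start dict
samples = [
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    },
]

# JSONL: one JSON object per line
with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```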

For the most current status, see GitHub issues.

Contributing

Contributions are welcome — Apache 2.0 licensed. See the contributing guide and open issues.

License

Licensed under the Apache License 2.0.

The Compound Moat

One tool is a start. The chain is the moat.

Each AiExponent tool produces structured evidence the next tool consumes. Browse the full toolchain — from Article 5 screening through Article 72 post-market monitoring.

See all tools →