RAG Benchmarking

Article 15 · Flagship

Plug in any RAG system — LangChain, LlamaIndex, or custom — and benchmark it against classic and agentic-era metrics. Faithfulness, answer relevancy, retrieval precision, and four agentic metrics for multi-step agents. Measured faithfulness of 0.958 on the 50-sample golden dataset.

Quick Start

```bash
pip install rag-benchmarking
```
```python
from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Works with LangChain
result = my_chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["metrics"])
# {"faithfulness": 0.95, "answer_relevancy": 0.81}
```

Benchmark Results

Measured on the 50-sample golden dataset using gemini-2.5-flash as judge at temperature=0.0.

| Metric           | Score | Rating    |
|------------------|-------|-----------|
| faithfulness     | 96%   | Excellent |
| answer_relevancy | 81%   | Good      |

Features

  • Framework-agnostic — works with LangChain, LlamaIndex, or any custom RAG system
  • Classic metrics: faithfulness, answer relevancy, context precision/recall
  • Retrieval metrics: Precision@K, Recall@K, MRR, NDCG
  • Agentic metrics: agent faithfulness, tool call accuracy, source attribution, retrieval necessity
  • REST API + Python SDK with LangChain and LlamaIndex adapters
  • Run history with comparison across configurations
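The retrieval metrics listed above follow their standard information-retrieval definitions. As an illustration only (not the library's internal implementation), Precision@K and MRR can be computed like this:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)


def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant retrieved document (0.0 if none)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical document IDs for illustration
retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d5"}
print(round(precision_at_k(retrieved, relevant, 3), 3))  # 0.333
print(mrr(retrieved, relevant))  # 0.5
```

Recall@K and NDCG follow the same pattern: Recall@K divides by the number of relevant documents instead of K, and NDCG discounts relevance by log rank.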

EU AI Act Context

Article 15: Accuracy Requirements

Provides systematic accuracy testing and documentation for high-risk AI systems under Article 15.

Known Limitations

  • Benchmark datasets are English-only; no multilingual evaluation support.
  • Custom dataset integration requires manual formatting to the expected JSONL schema.
  • Accuracy metrics only — latency and throughput are not measured.
  • LLM-as-judge metrics depend on the configured judge model quality.
  • Rate limiting is in-memory and resets on server restart.
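Since custom datasets must be hand-formatted to the expected JSONL schema, a minimal conversion sketch may help. This assumes the same question / contexts / answer fields shown in the Quick Start sample; the actual schema may require additional fields (e.g. ground-truth answers), so check the project documentation:

```python
import json

# Samples in the same shape as the Quick Start dict
samples = [
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    },
]

# JSONL: one JSON object per line
with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```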

For the most current status, see GitHub issues.

Contributing

Contributions are welcome — Apache 2.0 licensed. See the contributing guide and open issues.

License

Licensed under the Apache License 2.0.

The Compound Moat

One tool is a start. The chain is the moat.

Each AiExponent tool produces structured evidence the next tool consumes. Browse the full toolchain — from Article 5 screening through Article 72 post-market monitoring.

See all tools →