
Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe

Author: JV Roig
Published: December 2025


Abstract

Evaluating knowledge systems (LLMs, RAG pipelines, knowledge graphs, etc.) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion: generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring, scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; that cross-document aggregation proves substantially harder than single-document extraction; and that grounding ability and hallucination resistance are distinct capabilities: models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable, contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.
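
To make the inversion concrete, here is a minimal Python sketch of generating a corpus from structured ground truth. It is an illustration under assumed names (Fact, generate_ground_truth, render_document) and an invented attribute schema, not RIKER's actual pipeline.

import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    entity: str     # e.g., a fictional person in the simulated universe
    attribute: str  # e.g., "home_city"
    value: str      # the ground-truth value, fixed before any document exists

def generate_ground_truth(seed: int, n_entities: int = 100) -> list[Fact]:
    """Sample a fresh, internally consistent fact set. A new seed regenerates
    the entire corpus, which is what provides contamination resistance."""
    rng = random.Random(seed)
    facts = []
    for i in range(n_entities):
        name = f"Person-{i:03d}"
        facts.append(Fact(name, "birth_year", str(rng.randint(1900, 2000))))
        facts.append(Fact(name, "home_city", rng.choice(["Arden", "Belmar", "Corin"])))
    return facts

def render_document(entity: str, facts: list[Fact]) -> str:
    """Render one synthetic document embedding an entity's facts in prose.
    Because ground truth precedes the text, answers are known by construction."""
    own = {f.attribute: f.value for f in facts if f.entity == entity}
    return (f"{entity} was born in {own['birth_year']} and "
            f"currently lives in {own['home_city']}.")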

Why This Matters

If you're building or deploying RAG systems, knowledge graphs, or any AI system that needs to retrieve and reason over documents, traditional benchmarks may not reflect real-world performance. RIKER reveals critical insights:

  • Context length claims are misleading: significant degradation occurs beyond 32K tokens, regardless of the advertised context window
  • Aggregation is fundamentally harder: cross-document reasoning proves substantially more difficult than single-document extraction
  • Grounding and hallucination resistance are separate skills: models that excel at finding facts that exist may still fabricate facts that do not

The RIKER methodology also provides a replicable framework for creating your own contamination-resistant evaluations in any domain where synthetic documents can be generated from structured ground truth.
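
As a sketch of what deterministic scoring looks like under this framework, the snippet below computes a cross-document aggregation answer directly from a ground-truth table and grades a model's output by exact match. The names (ground_truth, score_exact) and the tiny fact table are illustrative assumptions, not RIKER's API.

# entity -> home_city, as a corpus generator would produce it (illustrative data)
ground_truth = {"Person-000": "Arden", "Person-001": "Belmar", "Person-002": "Arden"}

def aggregation_truth(city: str) -> int:
    """Cross-document answer computed from structure, never extracted from text."""
    return sum(1 for c in ground_truth.values() if c == city)

def score_exact(model_answer: str, truth: int) -> bool:
    """Deterministic grading: no LLM judge, no human annotator, no reference model."""
    try:
        return int(model_answer.strip()) == truth
    except ValueError:
        return False

truth = aggregation_truth("Arden")
print(score_exact("2", truth))      # True  (hypothetical model output)
print(score_exact("three", truth))  # False

Because the corpus and the answer key derive from the same structured source, a suspected benchmark leak can be addressed by simply regenerating the corpus with a new seed.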

Downloads & Resources

📄 Download PDF


Citation

If you use this work in your research, please cite:

@techreport{roig2025riker,
  title={Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe},
  author={Roig, JV},
  institution={Kamiwaza AI},
  year={2025},
  url={https://docs.kamiwaza.ai/research/papers/riker}
}