Research Datasets
Access datasets from Kamiwaza AIR research to support your agentic AI research and development.
Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems (aka "RIKER")
Generated ground truth, document corpora, test sets, and raw results for each evaluated model.
📦 Download riker_2025.zip (612.45 MB)
Citation:
@techreport{roig2025riker,
title={Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe},
author={Roig, JV},
institution={Kamiwaza AI},
year={2025},
url={https://docs.kamiwaza.ai/research/papers/riker}
}
Towards a Standard, Enterprise-Relevant Agentic AI Benchmark (aka "KAMI v0.1")
Test suite definitions and evaluation code from our enterprise-focused agentic AI benchmark study.
📦 Download KAMI_v0.1_2025-12-17.zip (999.62 MB)
Citation:
@techreport{roig2025kami,
title={Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations},
author={Roig, JV},
institution={Kamiwaza AI},
year={2025},
url={https://docs.kamiwaza.ai/research/papers/kami-v0-1}
}
How Do LLMs Fail In Agentic Scenarios?
Execution traces from our qualitative analysis of LLM failure modes in agentic simulations.
📦 Download HowDoLLMsFail_2025.zip (2.60 MB)
Citation:
@techreport{roig2025llmfailures,
title={How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations},
author={Roig, JV},
institution={Kamiwaza AI},
year={2025},
url={https://docs.kamiwaza.ai/research/papers/llm-agentic-failures}
}
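Because the archives above range from a few megabytes to roughly a gigabyte, it can help to inspect one before unpacking it. The following is a minimal sketch using only Python's standard `zipfile` module. It assumes the archive has already been downloaded to a local path; the helper names `summarize_archive` and `extract_archive` are hypothetical, and nothing is assumed about the internal layout of any archive.

```python
import zipfile
from pathlib import Path


def summarize_archive(zip_path):
    """Return (file_count, total_uncompressed_bytes, top_level_names) for a zip file."""
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries so only real files are counted.
        infos = [info for info in archive.infolist() if not info.is_dir()]
        total_bytes = sum(info.file_size for info in infos)
        # The first path component of each member gives the top-level layout.
        top_level = sorted({info.filename.split("/", 1)[0] for info in infos})
    return len(infos), total_bytes, top_level


def extract_archive(zip_path, dest_dir):
    """Check the archive's CRCs, then extract everything into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # testzip() returns the name of the first corrupt member, or None.
        corrupt_member = archive.testzip()
        if corrupt_member is not None:
            raise ValueError(f"Corrupt member in archive: {corrupt_member}")
        archive.extractall(dest)
    return dest


if __name__ == "__main__":
    # Assumption: the file was saved under its listed name in the working directory.
    local_zip = Path("riker_2025.zip")
    if local_zip.exists():
        count, size, names = summarize_archive(local_zip)
        print(f"{count} files, {size / 1e6:.1f} MB uncompressed")
        print("Top-level entries:", names)
        extract_archive(local_zip, "riker_2025")
    else:
        print(f"{local_zip} not found; download it first.")
```

Summarizing first shows the uncompressed size and top-level layout without writing anything to disk, which is useful for the larger archives.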