Research Datasets

Access datasets from Kamiwaza AIR research to support your agentic AI research and development.

How Much Do LLMs Hallucinate in Document Q&A Scenarios?

Raw model outputs from the 172-billion-token hallucination study across 35 models, three context lengths, four temperatures, and three hardware platforms. The ground truth, document corpora, and test sets are provided separately.

📦 Download RIKER2_March2026.zip (4.34 GB) — Raw model outputs

📦 Download RIKER2_corpora_groundtruth_testsets.zip (0.60 MB) — Ground truth, corpora, and test sets

🔗 Read the paper | arXiv

Citation:

@article{roig2026hallucinate,
  title={How Much Do LLMs Hallucinate in Document Q\&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms},
  author={Roig, JV},
  year={2026},
  eprint={2603.08274},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.08274}
}
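Because the raw-output archive is several gigabytes, it can help to inspect it before extracting anything. The following is a minimal sketch using Python's standard `zipfile` module. It assumes the archive has already been downloaded locally, and it makes no assumption about the archive's internal layout, which this page does not describe. The function name `summarize_archive` is hypothetical and introduced only for illustration.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Summarize a zip archive without extracting it.

    Returns (file_count, total_uncompressed_bytes, top_level), where
    top_level is a Counter mapping each top-level entry name to the
    number of files stored under it.
    """
    top_level = Counter()
    file_count = 0
    total_bytes = 0
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            file_count += 1
            total_bytes += info.file_size  # uncompressed size in bytes
            # Zip member names always use forward slashes.
            top_level[PurePosixPath(info.filename).parts[0]] += 1
    return file_count, total_bytes, top_level


# Example call (assumes the file from the download link is in the
# current directory):
# count, size, folders = summarize_archive("RIKER2_March2026.zip")
# print(count, size, dict(folders))
```

The same function works unchanged on the other archives listed on this page, since it only reads the zip central directory.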

Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems (aka "RIKER")

Generated ground truth, document corpora, test sets, and raw results for the evaluated models.

📦 Download riker_2025.zip (612.45 MB)

🔗 Read the paper

Citation:

@techreport{roig2025riker,
  title={Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe},
  author={Roig, JV},
  institution={Kamiwaza AI},
  year={2025},
  url={https://docs.kamiwaza.ai/research/papers/riker}
}

Towards a Standard, Enterprise-Relevant Agentic AI Benchmark (aka "KAMI v0.1")

Test suite definitions and evaluation code from our enterprise-focused agentic AI benchmark study.

📦 Download KAMI_v0.1_2025-12-17.zip (999.62 MB)

🔗 Read the paper

Citation:

@techreport{roig2025kami,
  title={Towards a Standard, Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens' worth of agentic AI evaluations},
  author={Roig, JV},
  institution={Kamiwaza AI},
  year={2025},
  url={https://docs.kamiwaza.ai/research/papers/kami-v0-1}
}

How Do LLMs Fail In Agentic Scenarios?

Execution traces from our qualitative analysis of LLM failure modes in agentic simulations.

📦 Download HowDoLLMsFail_2025.zip (2.60 MB)

🔗 Read the paper

Citation:

@techreport{roig2025llmfailures,
  title={How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations},
  author={Roig, JV},
  institution={Kamiwaza AI},
  year={2025},
  url={https://docs.kamiwaza.ai/research/papers/llm-agentic-failures}
}
