Featured Publications

How Do LLMs Fail In Agentic Scenarios?

JV Roig | December 2025

A qualitative analysis of 900 execution traces from three representative models (Granite 4 Small, Llama 4 Maverick, DeepSeek V3.1) revealing how LLMs fail when operating as autonomous agents. Rather than aggregate scores, this study surfaces the behavioral strategies that enable success and the recurring failure modes that undermine reliability.

Key Finding: Recovery capability—not initial correctness—best predicts overall success. Four failure archetypes emerge across all models: premature action without grounding, over-helpfulness under uncertainty, context pollution vulnerability, and fragile execution under load.

📄 Read the paper | Download PDF


KAMI v0.1: Enterprise-Relevant Agentic AI Benchmark

JV Roig | October 2025

Lessons from 5.5 billion tokens' worth of agentic AI evaluations showing traditional benchmarks fail to predict real-world performance. Through massive-scale testing of 35 model configurations using the PICARD framework, we demonstrate that models ranking high on traditional benchmarks often fail at practical enterprise tasks.

Key Finding: Traditional benchmark rankings fail to predict enterprise task performance; even tool-calling benchmarks such as BFCLv3 and TAU2-Bench, and even aggregated benchmark suites, fall short. Benchmarking is not enough; simulation is what is needed.

📄 Read the paper | Download PDF


PICARD: Testing What Models Can Do, Not What They've Seen

JV Roig | July 2025

A framework for contamination-resistant LLM evaluation through multi-layered randomization. PICARD creates over 10^80 unique test configurations—more than atoms in the observable universe—making memorization impossible while testing real-world agentic tasks like file manipulation, database operations, and multi-step workflows.

Key Innovation: Unlike static benchmarks that models can memorize, PICARD generates unique test instances every time while maintaining deterministic scoring and statistical validity. Extends beyond math to complex enterprise scenarios.
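The core idea of seeded, multi-layered randomization with deterministic scoring can be illustrated with a minimal sketch. This is a hypothetical simplification, not PICARD's actual implementation: the function names, the three randomization layers shown, and the toy "sum the orders" task are all illustrative assumptions.

```python
import random
import string

def make_test_instance(seed: int) -> dict:
    """Generate a unique test instance from a seed (illustrative sketch).

    Entity names, task parameters, and file paths are all randomized so the
    surface form of each instance is novel and cannot be memorized, while the
    expected answer is derived from the same seeded values, keeping scoring
    deterministic.
    """
    rng = random.Random(seed)
    # Layer 1: randomized entity names (memorization-resistant surface forms)
    customer = "".join(rng.choices(string.ascii_lowercase, k=8))
    # Layer 2: randomized task parameters
    amounts = [rng.randint(1, 1000) for _ in range(5)]
    # Layer 3: randomized file-system layout for the agent to navigate
    path = f"/data/{customer}/orders.csv"
    # Ground truth computed from the same random values: deterministic scoring
    expected_total = sum(amounts)
    return {
        "prompt": f"Sum the order amounts in {path}",
        "amounts": amounts,
        "expected": expected_total,
    }

def score(instance: dict, model_answer: int) -> bool:
    """Exact-match scoring against the seed-derived ground truth."""
    return model_answer == instance["expected"]
```

Because each layer multiplies the number of possible surface forms, a modest number of independently randomized layers yields an astronomically large instance space, which is the property the paper exploits for contamination resistance.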

📄 Read the paper | Download PDF | GitHub