Performance Overview
Statistical Analysis
Provider | Model | Runs | Mean Accuracy | Std Dev | RSE | 95% t-CI | Range | Pooled Accuracy
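The column names above can be illustrated with a small sketch of how such per-model statistics might be computed from repeated runs. The definitions used here are assumptions, since the page does not state them: RSE is taken as the standard error of the mean divided by the mean, the 95% t-CI as a Student's t interval on the mean per-run accuracy, and Pooled Accuracy as total correct answers over total questions across all runs. The function name `summarize_runs` and its inputs are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
# Only small run counts are covered in this sketch.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def summarize_runs(correct, total):
    """Summarize repeated benchmark runs for one model (hypothetical helper).

    correct: number of correct answers in each run
    total:   number of questions in each run
    """
    n = len(correct)
    if n != len(total):
        raise ValueError("correct and total must have the same length")
    if n < 2 or (n - 1) not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 runs")

    acc = [c / t for c, t in zip(correct, total)]   # per-run accuracy
    mean = statistics.mean(acc)
    std_dev = statistics.stdev(acc)                 # sample std dev (n - 1)
    std_err = std_dev / math.sqrt(n)                # standard error of the mean

    # Assumed definition: relative standard error = standard error / mean.
    rse = std_err / mean if mean != 0 else float("nan")

    # t-based interval on the mean; not clamped to [0, 1].
    half_width = T_CRIT_95[n - 1] * std_err

    return {
        "runs": n,
        "mean_accuracy": mean,
        "std_dev": std_dev,
        "rse": rse,
        "ci95": (mean - half_width, mean + half_width),
        "range": (min(acc), max(acc)),
        # Assumed definition: all runs pooled into one correct/total ratio.
        "pooled_accuracy": sum(correct) / sum(total),
    }


if __name__ == "__main__":
    # Made-up example counts, not results from the page.
    print(summarize_runs([24, 27, 30], [30, 30, 30]))
```

When every run has the same number of questions, the mean and pooled accuracy coincide; they differ only when run sizes vary. With few runs the t interval is wide and can extend past 1.0, which is one reason a dashboard might report the observed range alongside it.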
Performance Heatmap
Task categories: Basic Sanity Check; Filesystem Tasks; Finding Needles in Text Files; CSV Processing; Database Processing; DB Proc (Easy/Hinted); Instruction Following / Output Formatting
Performance Scale: 0-8, 9-14, 15-22, 23-29, 30 (Perfect)