Performance Overview
Statistical Analysis
| Provider | Model | Runs | Mean Accuracy | Std Dev | RSE | 95% t-CI | Range | Pooled Accuracy |
|---|---|---|---|---|---|---|---|---|
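The column names above suggest per-model summary statistics computed over repeated runs. As a rough sketch of how such columns might be derived, the function below assumes each run reports a count of solved tasks out of a fixed total, reads RSE as the relative standard error (standard error divided by the mean), and reads pooled accuracy as total solved over total attempted. These readings, the function name `summarize_runs`, its parameters, and the sample numbers are all illustrative assumptions, not the report's documented method. The Student-t critical value is passed in by the caller rather than computed.

```python
import math
import statistics


def summarize_runs(correct_per_run, tasks_per_run, t_crit):
    """Summarize per-run accuracy for one model (hypothetical helper).

    correct_per_run: tasks solved in each run
    tasks_per_run:   tasks attempted in each run (assumed constant across runs)
    t_crit:          two-sided 95% Student-t critical value for
                     len(correct_per_run) - 1 degrees of freedom
    """
    n = len(correct_per_run)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")

    accuracies = [c / tasks_per_run for c in correct_per_run]
    mean = statistics.mean(accuracies)
    std_dev = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    std_err = std_dev / math.sqrt(n)

    # Assumed reading of "RSE": standard error relative to the mean.
    rse = std_err / mean if mean else float("nan")
    half_width = t_crit * std_err

    return {
        "runs": n,
        "mean_accuracy": mean,
        "std_dev": std_dev,
        "rse": rse,
        "ci95": (mean - half_width, mean + half_width),
        "range": (min(accuracies), max(accuracies)),
        # Assumed reading of "pooled": all solved tasks over all attempted tasks.
        "pooled_accuracy": sum(correct_per_run) / (n * tasks_per_run),
    }


# Made-up example: three runs of 30 tasks; 4.303 is the t value for 2 degrees of freedom.
summary = summarize_runs([24, 27, 30], tasks_per_run=30, t_crit=4.303)
```

With equal-sized runs the pooled accuracy coincides with the mean accuracy; the two would differ only if runs attempted different numbers of tasks.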
Performance Heatmap
Categories: Basic Sanity Check; Filesystem Tasks; Finding Needles in Text Files; CSV Processing; Database Processing; DB Proc (Easy/Hinted); Instruction Following / Output Formatting
Performance Scale: 0-8, 9-14, 15-22, 23-29, 30 (Perfect)