Latest model benchmark snapshot
Weekly refreshed public benchmark (static fallback mode if API unavailable).
| Provider | Model | Capability | Cost efficiency |
|---|---|---|---|
| OpenAI | gpt-4.1-mini | 86/100 | 72/100 |
| Anthropic | claude-3.5-haiku | 82/100 | 91/100 |
| gemini-2.0-flash | 79/100 | 88/100 | |
| Groq | llama-3.3-70b | 75/100 | 94/100 |
Methodology
Capability score (0–100)
Composite of MMLU, HumanEval, HellaSwag, and internal coding benchmarks. Scores normalised to 100 against the highest-performing model.
Cost efficiency score (0–100)
Capability divided by blended input+output cost per 1M tokens, then normalised. Higher = more capability per dollar.
Refresh cadence
Static snapshot updated weekly. ModelSpend routing uses live provider pricing data updated daily.
Disclaimer
Benchmark data is provided for informational purposes. Actual performance varies by use case. Run evals on your own dataset using the ModelSpend evaluation framework.