Benchmarks

AI model cost & capability benchmarks

Compare leading models on capability and cost efficiency. ModelSpend uses these metrics, along with real routing data from millions of prompts, to make optimal routing decisions for your team.

Latest model benchmark snapshot

Public benchmark snapshot, refreshed weekly. A static fallback is shown if the API is unavailable.

Provider    Model              Capability   Cost efficiency
OpenAI      gpt-4.1-mini       86/100       72/100
Anthropic   claude-3.5-haiku   82/100       91/100
Google      gemini-2.0-flash   79/100       88/100
Groq        llama-3.3-70b      75/100       94/100

Methodology

Capability score (0–100)

Composite of MMLU, HumanEval, HellaSwag, and internal coding benchmarks. Scores normalised to 100 against the highest-performing model.
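As a rough illustration of the capability score, the sketch below averages per-benchmark scores and rescales so the top model lands at 100. The equal weighting, the 0-100 input scale, the benchmark keys, and the example numbers are all assumptions made for illustration; the actual weights and inputs are not stated here.

```python
from statistics import mean


def composite_scores(raw):
    """Map {model: {benchmark: score on 0-100}} to {model: normalised score}."""
    # Assumption: every benchmark carries equal weight.
    composites = {model: mean(scores.values()) for model, scores in raw.items()}
    best = max(composites.values())
    if best == 0:
        # Avoid dividing by zero when every model scored zero.
        return {model: 0.0 for model in composites}
    # Rescale so the highest-performing model scores exactly 100.
    return {model: round(100 * value / best, 1) for model, value in composites.items()}


# Hypothetical models and scores, not real benchmark results.
example = {
    "model-a": {"mmlu": 80.0, "humaneval": 70.0, "hellaswag": 90.0, "internal_coding": 60.0},
    "model-b": {"mmlu": 60.0, "humaneval": 50.0, "hellaswag": 80.0, "internal_coding": 50.0},
}
scores = composite_scores(example)
print(scores)
```

Because the scale is relative to the best model in the set, adding a stronger model lowers every other model's score even though their raw results are unchanged.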

Cost efficiency score (0–100)

Capability score divided by the blended input and output cost per 1M tokens, then normalised. A higher score means more capability per dollar.
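The calculation might look something like the sketch below. The blend ratio (75% input tokens, 25% output tokens), the normalisation against the best ratio, and the prices are hypothetical choices for illustration; the page does not say how the blend or the normalisation is actually done.

```python
def blended_cost(input_price, output_price, input_share=0.75):
    """Blend input and output prices (USD per 1M tokens) into one figure.

    Assumption: input_share is the fraction of tokens that are input tokens.
    """
    return input_share * input_price + (1 - input_share) * output_price


def cost_efficiency(models):
    """Map {model: (capability, input_price, output_price)} to 0-100 scores."""
    # Capability per blended dollar for each model.
    ratios = {
        name: capability / blended_cost(input_price, output_price)
        for name, (capability, input_price, output_price) in models.items()
    }
    best = max(ratios.values())
    # Assumption: normalise so the most cost-efficient model scores 100.
    return {name: round(100 * ratio / best, 1) for name, ratio in ratios.items()}


# Hypothetical capability scores and prices, not real provider pricing.
example_models = {
    "model-a": (80.0, 1.0, 5.0),  # blended cost 2.0, ratio 40
    "model-b": (60.0, 2.0, 6.0),  # blended cost 3.0, ratio 20
}
efficiency = cost_efficiency(example_models)
print(efficiency)
```

Changing input_share shifts the ranking toward models with cheap input or cheap output tokens, so the blend should reflect the prompt and response lengths typical of your workload.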

Refresh cadence

Static snapshot updated weekly. ModelSpend routing uses live provider pricing data updated daily.

Disclaimer

Benchmark data is provided for informational purposes. Actual performance varies by use case. Run evals on your own dataset using the ModelSpend evaluation framework.

Route to the best model automatically.

ModelSpend uses live pricing and benchmark data to route each prompt optimally. Setup in 4 minutes.
