AI model cost & capability benchmarks

Compare leading models on capability and cost efficiency. ModelSpend combines these metrics with real routing data from millions of prompts to make optimal routing decisions for your team.

Latest model benchmark snapshot

Public benchmark snapshot, refreshed weekly. If the API is unavailable, a static fallback copy is shown.

Provider	Model	Capability	Cost efficiency
OpenAI	gpt-4.1-mini	86/100	72/100
Anthropic	claude-3.5-haiku	82/100	91/100
Google	gemini-2.0-flash	79/100	88/100
Groq	llama-3.3-70b	75/100	94/100

Methodology

Capability score (0–100)

Composite of MMLU, HumanEval, HellaSwag, and internal coding benchmarks. Scores normalised to 100 against the highest-performing model.
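
As a rough illustration of the normalisation step, here is a minimal Python sketch. It assumes the composite is an unweighted mean of the per-benchmark scores, which this page does not state; the actual weighting may differ. The model names and numbers are made-up placeholders, not ModelSpend data.

```python
from statistics import mean

def capability_scores(raw: dict[str, dict[str, float]]) -> dict[str, float]:
    """Map each model to a 0-100 capability score.

    `raw` maps model name -> {benchmark name: score on a 0-100 scale}.
    Assumption: the composite is a plain (unweighted) mean of the benchmarks.
    """
    composites = {model: mean(scores.values()) for model, scores in raw.items()}
    best = max(composites.values())
    # Normalise so that the highest-performing model scores exactly 100.
    return {model: round(100 * c / best, 1) for model, c in composites.items()}

# Hypothetical inputs, for illustration only.
raw_scores = {
    "model-a": {"mmlu": 80.0, "humaneval": 70.0, "hellaswag": 90.0, "internal_coding": 60.0},
    "model-b": {"mmlu": 60.0, "humaneval": 50.0, "hellaswag": 80.0, "internal_coding": 50.0},
}
capability = capability_scores(raw_scores)
print(capability)
```

Because scores are relative to the best model in the comparison set, adding a stronger model to the set lowers every other model's score.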

Cost efficiency score (0–100)

Capability divided by blended input+output cost per 1M tokens, then normalised. Higher = more capability per dollar.
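
The calculation can be sketched in Python as below. Two details are assumptions, since this page does not specify them: the blend is taken as a 50/50 average of input and output price (the `input_share` parameter is hypothetical), and normalisation scales the best ratio to 100. Prices and capability values are placeholders, not real provider pricing.

```python
def blended_cost(input_usd: float, output_usd: float, input_share: float = 0.5) -> float:
    """Blended USD price per 1M tokens.

    Assumption: a weighted average of input and output price; the 50/50
    default is illustrative, not a documented ModelSpend setting.
    """
    return input_share * input_usd + (1 - input_share) * output_usd

def cost_efficiency(models: dict[str, tuple[float, float, float]]) -> dict[str, float]:
    """`models` maps name -> (capability score, input USD/1M, output USD/1M)."""
    ratios = {
        name: capability / blended_cost(input_usd, output_usd)
        for name, (capability, input_usd, output_usd) in models.items()
    }
    best = max(ratios.values())
    # Assumption: normalise so the most cost-efficient model scores 100.
    return {name: round(100 * r / best, 1) for name, r in ratios.items()}

# Hypothetical models and prices, for illustration only.
efficiency = cost_efficiency({
    "model-a": (100.0, 2.0, 6.0),  # blended cost 4.0 -> ratio 25
    "model-b": (80.0, 1.0, 3.0),   # blended cost 2.0 -> ratio 40
})
print(efficiency)
```

In this example the less capable model-b ranks higher on cost efficiency because its blended price is half that of model-a.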

Refresh cadence

Static snapshot updated weekly. ModelSpend routing uses live provider pricing data updated daily.

Disclaimer

Benchmark data is provided for informational purposes. Actual performance varies by use case. Run evals on your own dataset using the ModelSpend evaluation framework.

Route to the best model automatically.

ModelSpend uses live pricing and benchmark data to route each prompt optimally. Setup in 4 minutes.
