OpenAI-compatible proxy
Point your existing OpenAI or Anthropic SDK at api.modelspend.best/proxy/v1. Every existing SDK call — streaming, tools, vision, function calling — works identically. The model parameter becomes a routing hint.
- Streaming SSE with OpenAI delta format
- Function calling and tool use pass-through
- Cost metadata in x-modelspend-* headers
- Model-to-tier hint table (50+ model patterns)
Evaluation framework
Before you switch routing configurations, prove quality is maintained. Upload a dataset of representative prompts, run them against multiple models simultaneously, and score outputs with LLM-as-judge scoring.
Scores are tracked over time so you can see quality trends as providers update their models.
- CSV or API dataset upload (up to 1,000 items)
- Run against up to 6 models in parallel
- LLM-as-judge scoring (0.0–1.0) with reasoning
- Exact match mode for deterministic tasks
- Per-model quality × cost × latency comparison
- Link eval runs to prompt versions
Prompt registry
System prompts are code. Treat them like it. Semantic versioning, diff views, a staging workflow, and rollback — the same controls you have on your application code.
- Semantic versioning (major.minor.patch)
- draft → staging → production promotion
- Line-level diff between any two versions
- One-click rollback to any previous version
- Link to eval runs for quality validation
- Token count tracking per version
OpenTelemetry traces
Every execute call emits a distributed trace with child spans for each stage of the pipeline. Export to your existing observability stack via OTLP. Debug exactly why a specific request was expensive, slow, or blocked.
- Root span per request + child spans per stage
- Attributes: cost, tokens, tier, provider, model
- OTLP HTTP export to any collector
- Native integrations: Jaeger, Tempo, Datadog, New Relic, Honeycomb
- 30-day rolling retention in ModelSpend
- In-dashboard trace viewer with waterfall