Methodology
API-first, exception-aware.
RunWhere precomputes the boundary where the default API answer stops holding, then keeps the advanced open-weight matrix available for users who already know what they want to inspect.
1. The API-first claim
Hosted APIs are the right default for most teams because they turn model serving into a variable token bill, absorb idle capacity, and avoid model-serving operations. RunWhere only pushes users away from that default when spend, traffic shape, constraints, or substitution quality plausibly put them in the exception set.
2. Headline rate calculation
The current homepage headline is "9 out of 10 typical workloads should stay on the API." The figure comes from the boundary artifact, not marketing copy.
api_first_rate = confirmed_api_reference_weight / total_reference_weight
api_first_rate = 89 / 100 = 0.89
| Reference profile | Weight | Spend (monthly) | Traffic | Hosted model |
|---|---|---|---|---|
| Low spend, bursty small-model app | 27 | < $200 | Bursty | gpt-5-mini |
| Low spend, steady small-model app | 17 | < $200 | Steady | gemini-3-5-flash |
| Mid spend, bursty small-model product | 18 | $200–$2K | Bursty | gpt-5-mini |
| Mid spend, steady small-model product | 3 | $200–$2K | Steady | gpt-5-mini |
| Mid spend, bursty frontier-family product | 15 | $200–$2K | Bursty | gpt-5 |
| High spend, bursty frontier-family product | 12 | $2K–$10K | Bursty | claude-sonnet-5 |
| High spend, steady medium-model product | 2 | $2K–$10K | Steady | claude-haiku-4 |
| Predictable small-model batch job | 2 | $200–$2K | Predictable batches | gemini-3-5-flash |
| Very high steady frontier spend | 2 | $10K+ | Steady | gpt-5 |
| Compliance-bound workload | 2 | $200–$2K | Bursty | gpt-5-mini |
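The headline arithmetic can be reproduced from the weights in the table above. The sketch below is illustrative only: the text does not say which profiles count as confirmed API references, so the `confirmed_api` flags are an assumption, chosen because the five bursty or low-spend profiles are the ones whose weights sum to 89. The `Profile` type and function name are likewise hypothetical.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    name: str
    weight: int
    confirmed_api: bool  # ASSUMPTION: inferred so that confirmed weights sum to 89


# Weights copied from the reference-profile table; flags are assumed, not sourced.
PROFILES = [
    Profile("Low spend, bursty small-model app", 27, True),
    Profile("Low spend, steady small-model app", 17, True),
    Profile("Mid spend, bursty small-model product", 18, True),
    Profile("Mid spend, steady small-model product", 3, False),
    Profile("Mid spend, bursty frontier-family product", 15, True),
    Profile("High spend, bursty frontier-family product", 12, True),
    Profile("High spend, steady medium-model product", 2, False),
    Profile("Predictable small-model batch job", 2, False),
    Profile("Very high steady frontier spend", 2, False),
    Profile("Compliance-bound workload", 2, False),
]


def api_first_rate(profiles: list[Profile]) -> float:
    """confirmed_api_reference_weight / total_reference_weight."""
    total = sum(p.weight for p in profiles)
    confirmed = sum(p.weight for p in profiles if p.confirmed_api)
    return confirmed / total


rate = api_first_rate(PROFILES)
print(f"api_first_rate = {rate:.2f}")
```

Under these assumed flags the rate is 0.89, which rounds to the "9 out of 10" headline.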
3. Exception traits
- High sustained API spend, especially above the $10K/month band.
- Traffic steady enough to keep rented GPUs busy, or predictable enough for batch GPU jobs.
- A small or medium mapped open-weight model is acceptable for the workload.
- Hard requirements: data residency, air-gap, or latency under 100 ms at p99.
- Latency, quota, or capacity constraints that make token price alone incomplete.
4. Serving modalities
| Option | Pricing model | Best fit | v1 treatment |
|---|---|---|---|
| Hosted API: Hosted model API via provider marketplace or direct vendor. | Blended input/output token price. | Default for most workloads, especially bursty or uncertain demand. | modeled |
| Managed dedicated endpoint: Dedicated inference endpoint on Vertex AI, SageMaker, Azure ML, Bedrock custom model class, or similar. | Provisioned accelerator instance-hours plus managed-service premium and reduced ops labor. | Teams needing isolation, private networking, or deployment controls without raw VM operations. | preview |
| Serverless GPU: GPU-backed container or inference worker that can scale to zero. | Active GPU seconds plus request/cold-start overhead and lightweight ops labor. | Bursty small or medium open-weight workloads that want API-like operations. | modeled |
| Cheap always-on GPU VM: Single smaller GPU VM such as L4, A10G, T4, or L40S. | Hourly GPU VM rental times 720 monthly hours plus operational labor. | Steady small-model production where idle capacity is low. | modeled |
| Serious always-on GPU VM: A100/H100-class VM or node for frontier-size models or high throughput. | Hourly accelerator node rental times 720 monthly hours plus operational labor. | Steady frontier-model or high-throughput production workloads. | modeled |
| Scheduled/batch GPU: Scheduled GPU job for offline inference, backfills, nightly summarization, evaluations, or other latency-relaxed work. | Active job GPU-hours plus orchestration overhead and batch ops labor. | Predictable offline inference where requests can queue and capacity can be scheduled. | preview |
| Owned/on-prem hardware: On-prem, colo, or owned workstation/server GPUs. | Capex amortization plus power, cooling or colo, maintenance, and labor. | Privacy-bound, air-gapped, or very steady high-volume workloads. | advisory |
CPU/local inference is intentionally excluded from the v1 ranking. It is real for hobby and sub-7B prototype work, but outside the cost-anxious production persona.
5. Cost formulas
api_monthly = tokens_per_day × 30 × blended_input_output_price
vm_monthly = gpu_hourly × gpus_needed × 24 × 30 + operational_labor
managed_endpoint_monthly = accelerator_hourly × provisioned_instances × 24 × 30 + premium + managed_ops_labor
serverless_gpu_monthly = (active_gpu_seconds + cold_start_seconds) × gpu_second_price + overhead + serverless_ops_labor
batch_gpu_monthly = gpu_hourly × gpus_needed × active_job_hours + orchestration + batch_ops_labor
owned_hardware_monthly = capex_amortization + power × PUE + colo + maintenance + labor
Utilization affects capacity sizing; it is not applied again to the rented GPU bill. API cost uses blended input/output tokens, while capacity checks use peak output tokens/sec.
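The formulas above translate directly into small functions. This is a minimal sketch, not RunWhere's implementation: it assumes the blended price is quoted per million tokens (the text does not state the unit), and every number in the example call is hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # 720, matching the formulas above


def api_monthly(tokens_per_day: float, blended_price_per_million: float) -> float:
    # ASSUMPTION: blended input/output price is quoted in USD per million tokens.
    return tokens_per_day * 30 / 1_000_000 * blended_price_per_million


def vm_monthly(gpu_hourly: float, gpus_needed: int, operational_labor: float) -> float:
    # Utilization is deliberately absent: it sizes gpus_needed, not the bill.
    return gpu_hourly * gpus_needed * HOURS_PER_MONTH + operational_labor


def managed_endpoint_monthly(accelerator_hourly: float, provisioned_instances: int,
                             premium: float, managed_ops_labor: float) -> float:
    return (accelerator_hourly * provisioned_instances * HOURS_PER_MONTH
            + premium + managed_ops_labor)


def serverless_gpu_monthly(active_gpu_seconds: float, cold_start_seconds: float,
                           gpu_second_price: float, overhead: float,
                           serverless_ops_labor: float) -> float:
    return ((active_gpu_seconds + cold_start_seconds) * gpu_second_price
            + overhead + serverless_ops_labor)


def batch_gpu_monthly(gpu_hourly: float, gpus_needed: int, active_job_hours: float,
                      orchestration: float, batch_ops_labor: float) -> float:
    return gpu_hourly * gpus_needed * active_job_hours + orchestration + batch_ops_labor


def owned_hardware_monthly(capex_amortization: float, power: float, pue: float,
                           colo: float, maintenance: float, labor: float) -> float:
    return capex_amortization + power * pue + colo + maintenance + labor


# Hypothetical inputs: 50M tokens/day at $1.00 per million blended tokens,
# versus one $1.00/hour GPU VM with $1,250/month of operational labor.
api_cost = api_monthly(tokens_per_day=50_000_000, blended_price_per_million=1.00)
vm_cost = vm_monthly(gpu_hourly=1.00, gpus_needed=1, operational_labor=1250.0)
print(f"API: ${api_cost:,.0f}/month, VM: ${vm_cost:,.0f}/month")
```

With these made-up inputs the API bill is $1,500/month against $1,970/month for the VM, so the always-on VM loses even before checking whether one GPU covers peak traffic.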
6. Self-host throughput
throughput_fp16_tps = (memory_bandwidth_gb_per_sec / active_param_bytes_fp16) × utilization
throughput_int8_tps = throughput_fp16_tps × 1.7
throughput_int4_tps = throughput_fp16_tps × 2.5
peak_output_tps = tokens_per_day × output_share × peak_to_avg_ratio / 86,400
Cited throughput overrides can replace estimates when refresh data is available. Otherwise, the estimator labels throughput as calculated.
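As a worked example of the throughput formulas above, the sketch below uses the page's stated defaults (60% utilization, 30% output share, 3:1 peak-to-average). It assumes the FP16 weight footprint is expressed in GB so the units cancel against GB/s, computed as billions of active parameters times 2 bytes. The 1,000 GB/s bandwidth and 8B-parameter model are hypothetical inputs, not values from RunWhere's data.

```python
BYTES_PER_PARAM_FP16 = 2
QUANT_MULTIPLIERS = {"fp16": 1.0, "int8": 1.7, "int4": 2.5}


def throughput_fp16_tps(memory_bandwidth_gb_per_sec: float,
                        active_params_billions: float,
                        utilization: float = 0.60) -> float:
    # ASSUMPTION: weight footprint in GB = billions of active params x 2 bytes,
    # so GB/s divided by GB gives tokens/sec for a bandwidth-bound decode.
    active_param_gb_fp16 = active_params_billions * BYTES_PER_PARAM_FP16
    return memory_bandwidth_gb_per_sec / active_param_gb_fp16 * utilization


def quantized_tps(fp16_tps: float, precision: str) -> float:
    # INT8 and INT4 are modeled as flat multipliers on the FP16 estimate.
    return fp16_tps * QUANT_MULTIPLIERS[precision]


def peak_output_tps(tokens_per_day: float, output_share: float = 0.30,
                    peak_to_avg_ratio: float = 3.0) -> float:
    return tokens_per_day * output_share * peak_to_avg_ratio / 86_400


# Hypothetical hardware and model: 1,000 GB/s memory bandwidth, 8B active params.
fp16 = throughput_fp16_tps(memory_bandwidth_gb_per_sec=1000, active_params_billions=8)
int8 = quantized_tps(fp16, "int8")
int4 = quantized_tps(fp16, "int4")
peak = peak_output_tps(tokens_per_day=50_000_000)

print(f"fp16={fp16:.2f} int8={int8:.2f} int4={int4:.2f} tok/s, peak demand={peak:.1f} tok/s")
```

Here the estimate is about 37.5 tokens/sec at FP16, 63.75 at INT8, and 93.75 at INT4, against a peak demand of roughly 521 output tokens/sec. These are calculated values in the page's terminology; a cited measurement would replace them where one exists.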
7. API capacity nuance
If API token cost wins but estimated peak output tokens/sec exceeds measured hosted aggregate p50 throughput, an advanced artifact downgrades the regime to close-call. The point is not that self-hosting is cheaper; it is that quota, latency, capacity, and provider concentration may dominate.
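The downgrade rule above can be sketched as a single comparison. Only the `close-call` label comes from the text; the function name and the other two regime labels are hypothetical placeholders for whatever the artifact actually emits.

```python
def classify_regime(api_cost_wins: bool,
                    peak_output_tps: float,
                    hosted_p50_tps: float) -> str:
    """Apply the capacity nuance: a cost win is downgraded when peak demand
    exceeds the measured hosted aggregate p50 throughput."""
    if api_cost_wins and peak_output_tps > hosted_p50_tps:
        return "close-call"
    # ASSUMPTION: these two labels are illustrative, not RunWhere's names.
    return "api-first" if api_cost_wins else "self-host-candidate"


# Hypothetical numbers: the API is cheaper, but peak demand of 520 tok/s
# exceeds a measured hosted p50 of 400 tok/s.
print(classify_regime(api_cost_wins=True, peak_output_tps=520.0, hosted_p50_tps=400.0))
```

The downgrade does not claim self-hosting is cheaper. It flags that the token bill alone no longer settles the decision.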
8. Bounded-opinion defaults
- GPU utilization: 60%
- Input/output token split: 70% input, 30% output
- Peak-to-average traffic ratio: 3:1
- INT8 throughput multiplier: 1.7×
- INT4 throughput multiplier: 2.5×
- Operational labor: 0.1 FTE at $150,000/year for raw VMs
- Serverless ops labor: 0.03 FTE; managed endpoint ops labor: 0.05 FTE
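For reference, the defaults above can be gathered into one immutable record, with the labor fractions converted to monthly dollars. This is a sketch under two assumptions: the annual salary is spread evenly over 12 months, and the $150,000/year figure stated for raw VMs also applies to the serverless and managed-endpoint fractions. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defaults:
    gpu_utilization: float = 0.60
    input_share: float = 0.70
    output_share: float = 0.30
    peak_to_avg_ratio: float = 3.0
    int8_multiplier: float = 1.7
    int4_multiplier: float = 2.5
    vm_ops_fte: float = 0.10
    serverless_ops_fte: float = 0.03
    managed_endpoint_ops_fte: float = 0.05
    fte_annual_cost: float = 150_000.0


def monthly_labor(fte: float, annual_cost: float) -> float:
    # ASSUMPTION: annual cost is divided evenly across 12 months.
    return fte * annual_cost / 12


d = Defaults()
vm_labor = monthly_labor(d.vm_ops_fte, d.fte_annual_cost)
serverless_labor = monthly_labor(d.serverless_ops_fte, d.fte_annual_cost)
managed_labor = monthly_labor(d.managed_endpoint_ops_fte, d.fte_annual_cost)
print(f"VM: ${vm_labor:,.0f}  serverless: ${serverless_labor:,.0f}  managed: ${managed_labor:,.0f}")
```

Under these assumptions the labor terms come to about $1,250/month for raw VMs, $375/month for serverless, and $625/month for managed endpoints.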
9. Source provenance
Artifact values are labeled calculated, cited, or measured. Pricing refreshes weekly; throughput citations refresh quarterly.
- Pricing data last refreshed: April 29, 2026
- Throughput data last refreshed: April 29, 2026
- Boundary artifact: runwhere-boundaries--v1.1.0-d33b3cda
- Ruleset version: 1.1.0
10. API-to-open-weight mapping policy
Mappings are editorial candidates, not quality-equivalence claims. They exist so the short check can reason about whether a small, medium, or frontier open-weight substitute is plausible.
| Hosted model | Size class | Candidates | Caveat |
|---|---|---|---|
| Claude Haiku 4 | medium | mistral-small-4, qwen-3-5, gemma-4-26b | These open-weight options can be evaluated for support, summarization, and coding-adjacent workflows, but they are not quality-equivalent to Claude. |
| Claude Sonnet 5 | frontier | llama-4-maverick, qwen-3-coder-480b, mistral-large-3 | Frontier hosted models differ materially in reasoning, tool use, safety behavior, and latency; use these mappings only as candidate self-host investigations. |
| Gemini 3.5 Flash | small | gemma-4-26b, phi-4, mistral-small-4 | Flash-class API models optimize latency and price; mapped open weights may need task-specific evaluation before replacement. |
| GPT-5 mini | small | gemma-4-26b, mistral-small-4, phi-4 | These models are plausible efficient substitutes for narrow or cost-sensitive workloads, not quality-identical replacements for GPT-5 mini. |
| GPT-5 | frontier | llama-4-maverick, deepseek-v3-2, qwen-3-coder-480b | The mapped open-weight models are not asserted to match GPT-5 quality; they are credible candidates when cost, compliance, or capacity forces a self-host investigation. |
| Other / not listed | medium | mistral-small-4, qwen-3-5, llama-4-maverick | Because this hosted model is not mapped precisely, RunWhere cannot make a confident quality or cost-boundary claim. |
11. Open-weight curation criteria
Models must be self-hostable open weights, available on Hugging Face, compatible with common inference runtimes, hosted by at least two providers (or explicitly marked preview), and runnable on a single cloud GPU node. The launch roster stays under the soft cap of ~15 models.
12. What's excluded from v1
- User accounts, saved private scenarios, and raw prompt ingestion
- Precise quality equivalence scoring
- CPU/local hobby recommendations
- Reserved capacity, committed-use pricing, and spot/preemptible pricing
- Full multi-cloud deployment architecture
13. Escalation policy
J1 is an opt-in live analysis path for likely exceptions, hard constraints, insufficient data, and advanced close calls. The UI discloses the approximate cost before the call and sends only normalized categorical inputs plus optional short free text.
14. Refresh cadences and versioning
Every boundary and advanced composition artifact includes an ID, generation timestamp, ruleset version, and source commit hashes. Stable URLs are the only persistence mechanism.