Methodology

API-first, exception-aware.

RunWhere bakes the boundary where the default API answer stops holding, then keeps the advanced open-weight matrix available for users who already know what they want to inspect.

1. The API-first claim

Hosted APIs are the right default for most teams because they turn model serving into a variable token bill, absorb idle capacity, and avoid model-serving operations. RunWhere only pushes users away from that default when spend, traffic shape, constraints, or substitution quality plausibly put them in the exception set.

2. Headline rate calculation

The current homepage headline is 9 out of 10 typical workloads should stay on the API. It comes from the boundary artifact, not marketing copy.

api_first_rate = confirmed_api_reference_weight / total_reference_weight
api_first_rate = 89 / 100 = 0.89

Reference profile	Weight	Spend	Traffic	Hosted model
Low spend, bursty small-model app	27	< $200	Bursty	gpt-5-mini
Low spend, steady small-model app	17	< $200	Steady	gemini-3-5-flash
Mid spend, bursty small-model product	18	$200–$2K	Bursty	gpt-5-mini
Mid spend, steady small-model product	3	$200–$2K	Steady	gpt-5-mini
Mid spend, bursty frontier-family product	15	$200–$2K	Bursty	gpt-5
High spend, bursty frontier-family product	12	$2K–$10K	Bursty	claude-sonnet-5
High spend, steady medium-model product	2	$2K–$10K	Steady	claude-haiku-4
Predictable small-model batch job	2	$200–$2K	Predictable batches	gemini-3-5-flash
Very high steady frontier spend	2	$10K+	Steady	gpt-5
Compliance-bound workload	2	$200–$2K	Bursty	gpt-5-mini

3. Exception traits

High sustained API spend, especially above the $10K/month band.
Traffic steady enough to keep rented GPUs busy, or predictable enough for batch GPU jobs.
A small or medium mapped open-weight model is acceptable for the workload.
Hard requirements: Data residency, Air-gap, Latency <100ms p99.
Latency, quota, or capacity constraints that make token price alone incomplete.

4. Serving modalities

Option	Pricing model	Best fit	v1 treatment
Hosted API Hosted model API via provider marketplace or direct vendor.	Blended input/output token price.	Default for most workloads, especially bursty or uncertain demand.	modeled
Managed dedicated endpoint Dedicated inference endpoint on Vertex AI, SageMaker, Azure ML, Bedrock custom model class, or similar.	Provisioned accelerator instance-hours plus managed-service premium and reduced ops labor.	Teams needing isolation, private networking, or deployment controls without raw VM operations.	preview
Serverless GPU GPU-backed container or inference worker that can scale to zero.	Active GPU seconds plus request/cold-start overhead and lightweight ops labor.	Bursty small or medium open-weight workloads that want API-like operations.	modeled
Cheap always-on GPU VM Single smaller GPU VM such as L4, A10G, T4, or L40S.	Hourly GPU VM rental times 720 monthly hours plus operational labor.	Steady small-model production where idle capacity is low.	modeled
Serious always-on GPU VM A100/H100-class VM or node for frontier-size models or high throughput.	Hourly accelerator node rental times 720 monthly hours plus operational labor.	Steady frontier-model or high-throughput production workloads.	modeled
Scheduled/batch GPU Scheduled GPU job for offline inference, backfills, nightly summarization, evaluations, or other latency-relaxed work.	Active job GPU-hours plus orchestration overhead and batch ops labor.	Predictable offline inference where requests can queue and capacity can be scheduled.	preview
Owned/on-prem hardware On-prem, colo, or owned workstation/server GPUs.	Capex amortization plus power, cooling or colo, maintenance, and labor.	Privacy-bound, air-gapped, or very steady high-volume workloads.	advisory

CPU/local inference is intentionally excluded from the v1 ranking. It is real for hobby and sub-7B prototype work, but outside the cost-anxious production persona.

5. Cost formulas

api_monthly = tokens_per_day × 30 × blended_input_output_price
vm_monthly = gpu_hourly × gpus_needed × 24 × 30 + operational_labor
managed_endpoint_monthly = accelerator_hourly × provisioned_instances × 24 × 30 + premium + managed_ops_labor
serverless_gpu_monthly = (active_gpu_seconds + cold_start_seconds) × gpu_second_price + overhead + serverless_ops_labor
batch_gpu_monthly = gpu_hourly × gpus_needed × active_job_hours + orchestration + batch_ops_labor
owned_hardware_monthly = capex_amortization + power × PUE + colo + maintenance + labor

Utilization affects capacity sizing; it is not applied again to the rented GPU bill. API cost uses blended input/output tokens, while capacity checks use peak output tokens/sec.

6. Self-host throughput

throughput_fp16_tps = (memory_bandwidth_gb_per_sec / active_param_bytes_fp16) × utilization
throughput_int8_tps = throughput_fp16_tps × 1.7
throughput_int4_tps = throughput_fp16_tps × 2.5
peak_output_tps = tokens_per_day × output_share × peak_to_avg_ratio / 86,400

Cited throughput overrides can replace estimates when refresh data is available. Otherwise, the estimator labels throughput as calculated.

7. API capacity nuance

If API token cost wins but estimated peak output tokens/sec exceeds measured hosted aggregate p50 throughput, an advanced artifact downgrades the regime to close-call. The point is not that self-hosting is cheaper; it is that quota, latency, capacity, and provider concentration may dominate.

8. Bounded-opinion defaults

GPU utilization: 60%
Input/output token split: 70% input, 30% output
Peak-to-average traffic ratio: 3:1
INT8 throughput multiplier: 1.7×
INT4 throughput multiplier: 2.5×
Operational labor: 0.1 FTE at $150,000/year for raw VMs
Serverless ops labor: 0.03 FTE; managed endpoint ops labor: 0.05 FTE

9. Source provenance

Artifact values are labeled calculated, cited, or measured. Pricing refreshes weekly; throughput citations refresh quarterly.

Pricing data last refreshed: April 29, 2026
Throughput data last refreshed: April 29, 2026
Boundary artifact: runwhere-boundaries--v1.1.0-d33b3cda
Ruleset version: 1.1.0

10. API-to-open-weight mapping policy

Mappings are editorial candidates, not quality-equivalence claims. They exist so the short check can reason about whether a small, medium, or frontier open-weight substitute is plausible.

Hosted model	Size class	Candidates	Caveat
Claude Haiku 4	medium	mistral-small-4, qwen-3-5, gemma-4-26b	These open-weight options can be evaluated for support, summarization, and coding-adjacent workflows, but they are not quality-equivalent to Claude.
Claude Sonnet 5	frontier	llama-4-maverick, qwen-3-coder-480b, mistral-large-3	Frontier hosted models differ materially in reasoning, tool use, safety behavior, and latency; use these mappings only as candidate self-host investigations.
Gemini 3.5 Flash	small	gemma-4-26b, phi-4, mistral-small-4	Flash-class API models optimize latency and price; mapped open weights may need task-specific evaluation before replacement.
GPT-5 mini	small	gemma-4-26b, mistral-small-4, phi-4	These models are plausible efficient substitutes for narrow or cost-sensitive workloads, not quality-identical replacements for GPT-5 mini.
GPT-5	frontier	llama-4-maverick, deepseek-v3-2, qwen-3-coder-480b	The mapped open-weight models are not asserted to match GPT-5 quality; they are credible candidates when cost, compliance, or capacity forces a self-host investigation.
Other / not listed	medium	mistral-small-4, qwen-3-5, llama-4-maverick	Because this hosted model is not mapped precisely, RunWhere cannot make a confident quality or cost-boundary claim.

11. Open-weight curation criteria

Models must be self-hostable open weights, available on Hugging Face, compatible with common inference runtimes, hosted by at least two providers or explicitly preview, and runnable on a single-cloud GPU node. The launch roster stays under the soft cap of ~15 models.

12. What's excluded from v1

User accounts, saved private scenarios, and raw prompt ingestion
Precise quality equivalence scoring
CPU/local hobby recommendations
Reserved capacity, committed-use pricing, and spot/preemptible pricing
Full multi-cloud deployment architecture

13. Escalation policy

J1 is an opt-in live analysis path for likely exceptions, hard constraints, insufficient data, and advanced close calls. The UI discloses the approximate cost before the call and sends only normalized categorical inputs plus optional short free text.

14. Refresh cadences and versioning

Every boundary and advanced composition artifact includes an ID, generation timestamp, ruleset version, and source commit hashes. Stable URLs are the only persistence mechanism.