Methodology

API-first, exception-aware.

RunWhere precomputes the boundary where the default API answer stops holding, then keeps the advanced open-weight matrix available for users who already know what they want to inspect.

1. The API-first claim

Hosted APIs are the right default for most teams because they turn model serving into a variable token bill, absorb idle capacity, and avoid model-serving operations. RunWhere only pushes users away from that default when spend, traffic shape, constraints, or substitution quality plausibly put them in the exception set.

2. Headline rate calculation

The current homepage headline is "9 out of 10 typical workloads should stay on the API." The figure is the api_first_rate below, rounded, and it comes from the boundary artifact, not marketing copy.

api_first_rate = confirmed_api_reference_weight / total_reference_weight
api_first_rate = 89 / 100 = 0.89

| Reference profile | Weight | Spend (per month) | Traffic | Hosted model |
| --- | --- | --- | --- | --- |
| Low spend, bursty small-model app | 27 | < $200 | Bursty | gpt-5-mini |
| Low spend, steady small-model app | 17 | < $200 | Steady | gemini-3-5-flash |
| Mid spend, bursty small-model product | 18 | $200–$2K | Bursty | gpt-5-mini |
| Mid spend, steady small-model product | 3 | $200–$2K | Steady | gpt-5-mini |
| Mid spend, bursty frontier-family product | 15 | $200–$2K | Bursty | gpt-5 |
| High spend, bursty frontier-family product | 12 | $2K–$10K | Bursty | claude-sonnet-5 |
| High spend, steady medium-model product | 2 | $2K–$10K | Steady | claude-haiku-4 |
| Predictable small-model batch job | 2 | $200–$2K | Predictable batches | gemini-3-5-flash |
| Very high steady frontier spend | 2 | $10K+ | Steady | gpt-5 |
| Compliance-bound workload | 2 | $200–$2K | Bursty | gpt-5-mini |
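
The headline arithmetic can be reproduced from the reference weights. The following is a minimal sketch, not the artifact's actual code: the weights are copied from the table above, the confirmed weight of 89 is taken as given from the formula, and the names `REFERENCE_WEIGHTS` and `api_first_rate` are illustrative. The source does not list which individual profiles count as confirmed, so the sketch does not try to derive that split.

```python
# Reference-profile weights copied from the table in this section.
REFERENCE_WEIGHTS = {
    "Low spend, bursty small-model app": 27,
    "Low spend, steady small-model app": 17,
    "Mid spend, bursty small-model product": 18,
    "Mid spend, steady small-model product": 3,
    "Mid spend, bursty frontier-family product": 15,
    "High spend, bursty frontier-family product": 12,
    "High spend, steady medium-model product": 2,
    "Predictable small-model batch job": 2,
    "Very high steady frontier spend": 2,
    "Compliance-bound workload": 2,
}

# Stated in the formula above; which profiles make up this weight
# is not listed in the source, so it is used here as a given number.
CONFIRMED_API_REFERENCE_WEIGHT = 89


def api_first_rate(confirmed_weight: float, weights: dict[str, int]) -> float:
    """confirmed_api_reference_weight / total_reference_weight."""
    total_reference_weight = sum(weights.values())
    return confirmed_weight / total_reference_weight


total = sum(REFERENCE_WEIGHTS.values())
rate = api_first_rate(CONFIRMED_API_REFERENCE_WEIGHT, REFERENCE_WEIGHTS)
print(f"total weight = {total}, api_first_rate = {rate:.2f}")
```

A rate of 0.89 rounds to the "9 out of 10" wording used in the headline.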

3. Exception traits

  • High sustained API spend, especially above the $10K/month band.
  • Traffic steady enough to keep rented GPUs busy, or predictable enough for batch GPU jobs.
  • A small or medium mapped open-weight model is acceptable for the workload.
  • Hard requirements: data residency, air-gapped deployment, or p99 latency under 100 ms.
  • Latency, quota, or capacity constraints that make token price alone incomplete.

4. Serving modalities

| Option | Description | Pricing model | Best fit | v1 treatment |
| --- | --- | --- | --- | --- |
| Hosted API | Hosted model API via provider marketplace or direct vendor. | Blended input/output token price. | Default for most workloads, especially bursty or uncertain demand. | modeled |
| Managed dedicated endpoint | Dedicated inference endpoint on Vertex AI, SageMaker, Azure ML, Bedrock custom model class, or similar. | Provisioned accelerator instance-hours plus managed-service premium and reduced ops labor. | Teams needing isolation, private networking, or deployment controls without raw VM operations. | preview |
| Serverless GPU | GPU-backed container or inference worker that can scale to zero. | Active GPU seconds plus request/cold-start overhead and lightweight ops labor. | Bursty small or medium open-weight workloads that want API-like operations. | modeled |
| Cheap always-on GPU VM | Single smaller GPU VM such as L4, A10G, T4, or L40S. | Hourly GPU VM rental times 720 monthly hours plus operational labor. | Steady small-model production where idle capacity is low. | modeled |
| Serious always-on GPU VM | A100/H100-class VM or node for frontier-size models or high throughput. | Hourly accelerator node rental times 720 monthly hours plus operational labor. | Steady frontier-model or high-throughput production workloads. | modeled |
| Scheduled/batch GPU | Scheduled GPU job for offline inference, backfills, nightly summarization, evaluations, or other latency-relaxed work. | Active job GPU-hours plus orchestration overhead and batch ops labor. | Predictable offline inference where requests can queue and capacity can be scheduled. | preview |
| Owned/on-prem hardware | On-prem, colo, or owned workstation/server GPUs. | Capex amortization plus power, cooling or colo, maintenance, and labor. | Privacy-bound, air-gapped, or very steady high-volume workloads. | advisory |

CPU/local inference is intentionally excluded from the v1 ranking. It is real for hobby and sub-7B prototype work, but outside the cost-anxious production persona.

5. Cost formulas

api_monthly = tokens_per_day × 30 × blended_input_output_price
vm_monthly = gpu_hourly × gpus_needed × 24 × 30 + operational_labor
managed_endpoint_monthly = accelerator_hourly × provisioned_instances × 24 × 30 + premium + managed_ops_labor
serverless_gpu_monthly = (active_gpu_seconds + cold_start_seconds) × gpu_second_price + overhead + serverless_ops_labor
batch_gpu_monthly = gpu_hourly × gpus_needed × active_job_hours + orchestration + batch_ops_labor
owned_hardware_monthly = capex_amortization + power × PUE + colo + maintenance + labor

Utilization affects capacity sizing; it is not applied again to the rented GPU bill. API cost uses blended input/output tokens, while capacity checks use peak output tokens/sec.
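
The formulas above translate directly into small functions. This is a sketch under stated assumptions rather than RunWhere's implementation: function and parameter names mirror the formula terms, prices are treated as US dollars, `blended_price_per_token` is assumed to be a per-token price, and every number in the example calls is hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # 720, matching the "720 monthly hours" in section 4


def api_monthly(tokens_per_day, blended_price_per_token):
    # Assumes the blended price is expressed per single token.
    return tokens_per_day * 30 * blended_price_per_token


def vm_monthly(gpu_hourly, gpus_needed, operational_labor):
    # Always-on rental: billed for every hour, regardless of utilization.
    return gpu_hourly * gpus_needed * HOURS_PER_MONTH + operational_labor


def managed_endpoint_monthly(accelerator_hourly, provisioned_instances,
                             premium, managed_ops_labor):
    return (accelerator_hourly * provisioned_instances * HOURS_PER_MONTH
            + premium + managed_ops_labor)


def serverless_gpu_monthly(active_gpu_seconds, cold_start_seconds,
                           gpu_second_price, overhead, serverless_ops_labor):
    billed_seconds = active_gpu_seconds + cold_start_seconds
    return billed_seconds * gpu_second_price + overhead + serverless_ops_labor


def batch_gpu_monthly(gpu_hourly, gpus_needed, active_job_hours,
                      orchestration, batch_ops_labor):
    return (gpu_hourly * gpus_needed * active_job_hours
            + orchestration + batch_ops_labor)


def owned_hardware_monthly(capex_amortization, power, pue, colo,
                           maintenance, labor):
    # Power cost is scaled by PUE to include cooling and facility overhead.
    return capex_amortization + power * pue + colo + maintenance + labor


# Hypothetical inputs, for illustration only.
api = api_monthly(tokens_per_day=50_000_000, blended_price_per_token=0.5e-6)
vm = vm_monthly(gpu_hourly=1.00, gpus_needed=1, operational_labor=1250.0)
print(f"API: ${api:,.2f}/month, single GPU VM: ${vm:,.2f}/month")
```

With these made-up inputs the API bill is about $750 per month and the always-on VM about $1,970, most of which is labor rather than GPU rental. That is the pattern the API-first default relies on at low spend.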

6. Self-host throughput

throughput_fp16_tps = (memory_bandwidth_gb_per_sec / active_param_bytes_fp16) × utilization
throughput_int8_tps = throughput_fp16_tps × 1.7
throughput_int4_tps = throughput_fp16_tps × 2.5
peak_output_tps = tokens_per_day × output_share × peak_to_avg_ratio / 86,400

Cited throughput overrides can replace estimates when refresh data is available. Otherwise, the estimator labels throughput as calculated.
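
A short sketch of the calculated path follows. It assumes the active parameter size is expressed in GB so that its units cancel against bandwidth in GB/s, and it uses the section 8 defaults for utilization, output share, and peak-to-average ratio. The `gpus_needed` helper is a hypothetical sizing rule added for illustration; the source does not state how peak load is converted into a GPU count. The hardware and model numbers are invented.

```python
import math


def throughput_fp16_tps(memory_bandwidth_gb_per_sec, active_param_gb_fp16,
                        utilization=0.60):
    # Bandwidth-bound estimate: each generated token reads the active
    # weights once, scaled by the assumed utilization.
    return memory_bandwidth_gb_per_sec / active_param_gb_fp16 * utilization


def throughput_int8_tps(fp16_tps, multiplier=1.7):
    return fp16_tps * multiplier


def throughput_int4_tps(fp16_tps, multiplier=2.5):
    return fp16_tps * multiplier


def peak_output_tps(tokens_per_day, output_share=0.30, peak_to_avg_ratio=3.0):
    # 86,400 seconds per day.
    return tokens_per_day * output_share * peak_to_avg_ratio / 86_400


def gpus_needed(peak_tps, per_gpu_tps):
    # Hypothetical sizing rule (not from the source): round up to whole GPUs.
    return math.ceil(peak_tps / per_gpu_tps)


# Hypothetical accelerator (1,000 GB/s) and model (16 GB of active FP16 weights).
fp16 = throughput_fp16_tps(1000, 16)
peak = peak_output_tps(50_000_000)
print(f"fp16={fp16:.2f} int8={throughput_int8_tps(fp16):.2f} "
      f"int4={throughput_int4_tps(fp16):.2f} tok/s")
print(f"peak output demand={peak:.1f} tok/s, "
      f"fp16 GPUs (hypothetical rule)={gpus_needed(peak, fp16)}")
```

Because utilization is already applied here, in the throughput estimate that drives sizing, it is not applied a second time to the rented GPU bill.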

7. API capacity nuance

If API token cost wins but estimated peak output tokens/sec exceeds measured hosted aggregate p50 throughput, an advanced artifact downgrades the regime to close-call. The point is not that self-hosting is cheaper; it is that quota, latency, capacity, and provider concentration may dominate.
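
The downgrade rule can be sketched as a single check. Only the "close-call" label comes from the text; the function name and the other regime strings ("api-first", "self-host") are hypothetical placeholders, and the throughput figures are made up.

```python
def apply_capacity_check(regime, api_cost_wins, peak_output_tps, hosted_p50_tps):
    """Downgrade to "close-call" when the API wins on cost but the estimated
    peak output demand exceeds the measured hosted aggregate p50 throughput."""
    if api_cost_wins and peak_output_tps > hosted_p50_tps:
        return "close-call"
    # Otherwise the regime decided on cost grounds stands.
    return regime


# Hypothetical numbers: peak demand of 600 tok/s against a measured 500 tok/s.
print(apply_capacity_check("api-first", True, 600.0, 500.0))
print(apply_capacity_check("api-first", True, 400.0, 500.0))
```

The first call is downgraded and the second is left alone. The check never flips the answer to self-hosting; it only flags that token price alone may not settle the question.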

8. Bounded-opinion defaults

  • GPU utilization: 60%
  • Input/output token split: 70% input, 30% output
  • Peak-to-average traffic ratio: 3:1
  • INT8 throughput multiplier: 1.7×
  • INT4 throughput multiplier: 2.5×
  • Operational labor: 0.1 FTE at $150,000/year for raw VMs
  • Serverless ops labor: 0.03 FTE; managed endpoint ops labor: 0.05 FTE
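
For reference, the defaults above can be gathered into one structure and the labor fractions turned into monthly dollar figures. This is an illustrative sketch: the class and field names are hypothetical, and applying the $150,000/year rate to the serverless and managed-endpoint fractions is an assumption, since the list states that rate only for raw VMs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundedDefaults:
    gpu_utilization: float = 0.60
    input_share: float = 0.70
    output_share: float = 0.30
    peak_to_avg_ratio: float = 3.0
    int8_multiplier: float = 1.7
    int4_multiplier: float = 2.5
    fte_annual_cost: float = 150_000.0   # stated for raw VMs
    raw_vm_fte: float = 0.10
    serverless_fte: float = 0.03         # annual rate assumed to be the same
    managed_endpoint_fte: float = 0.05   # annual rate assumed to be the same


def monthly_labor(fte_fraction, fte_annual_cost):
    # Fraction of a full-time engineer, spread over 12 months.
    return fte_fraction * fte_annual_cost / 12


d = BoundedDefaults()
for label, fte in [("raw VM", d.raw_vm_fte),
                   ("serverless", d.serverless_fte),
                   ("managed endpoint", d.managed_endpoint_fte)]:
    print(f"{label}: ${monthly_labor(fte, d.fte_annual_cost):,.2f}/month")
```

Under these assumptions raw-VM labor comes to roughly $1,250 per month, serverless to about $375, and a managed endpoint to about $625.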

9. Source provenance

Artifact values are labeled calculated, cited, or measured. Pricing refreshes weekly; throughput citations refresh quarterly.

  • Pricing data last refreshed: April 29, 2026
  • Throughput data last refreshed: April 29, 2026
  • Boundary artifact: runwhere-boundaries--v1.1.0-d33b3cda
  • Ruleset version: 1.1.0

10. API-to-open-weight mapping policy

Mappings are editorial candidates, not quality-equivalence claims. They exist so the short check can reason about whether a small, medium, or frontier open-weight substitute is plausible.

| Hosted model | Size class | Candidates | Caveat |
| --- | --- | --- | --- |
| Claude Haiku 4 | medium | mistral-small-4, qwen-3-5, gemma-4-26b | These open-weight options can be evaluated for support, summarization, and coding-adjacent workflows, but they are not quality-equivalent to Claude. |
| Claude Sonnet 5 | frontier | llama-4-maverick, qwen-3-coder-480b, mistral-large-3 | Frontier hosted models differ materially in reasoning, tool use, safety behavior, and latency; use these mappings only as candidate self-host investigations. |
| Gemini 3.5 Flash | small | gemma-4-26b, phi-4, mistral-small-4 | Flash-class API models optimize latency and price; mapped open weights may need task-specific evaluation before replacement. |
| GPT-5 mini | small | gemma-4-26b, mistral-small-4, phi-4 | These models are plausible efficient substitutes for narrow or cost-sensitive workloads, not quality-identical replacements for GPT-5 mini. |
| GPT-5 | frontier | llama-4-maverick, deepseek-v3-2, qwen-3-coder-480b | The mapped open-weight models are not asserted to match GPT-5 quality; they are credible candidates when cost, compliance, or capacity forces a self-host investigation. |
| Other / not listed | medium | mistral-small-4, qwen-3-5, llama-4-maverick | Because this hosted model is not mapped precisely, RunWhere cannot make a confident quality or cost-boundary claim. |

11. Open-weight curation criteria

Models must be self-hostable open weights, available on Hugging Face, compatible with common inference runtimes, hosted by at least two providers or explicitly marked as preview, and runnable on a single cloud GPU node. The launch roster stays under the soft cap of ~15 models.

12. What's excluded from v1

  • User accounts, saved private scenarios, and raw prompt ingestion
  • Precise quality equivalence scoring
  • CPU/local hobby recommendations
  • Reserved capacity, committed-use pricing, and spot/preemptible pricing
  • Full multi-cloud deployment architecture

13. Escalation policy

J1 is an opt-in live analysis path for likely exceptions, hard constraints, insufficient data, and advanced close calls. The UI discloses the approximate cost before the call and sends only normalized categorical inputs plus optional short free text.

14. Refresh cadences and versioning

Every boundary and advanced composition artifact includes an ID, generation timestamp, ruleset version, and source commit hashes. Stable URLs are the only persistence mechanism.