Daniel Reyes, Principal Engineer (AI/ML), YuSMP Group · LLM systems, RAG and fine-tuning for production

TL;DR — the 2026 cost envelope

The compute side of fine-tuning has fallen sharply for two years in a row. The new bottleneck is dataset quality and evaluation, not GPU cost. A production-grade LoRA programme on a 7B–70B open-weights model now lands between USD 30,000 and USD 180,000 end-to-end (USD 30–80k at 7B–13B, USD 60–180k at 70B). Full fine-tunes on 70B+ models still routinely exceed USD 250,000 when you include the dataset, eval harness, MLOps, and the first six months of maintenance.

| Programme | Compute only | End-to-end (with data + eval + ops) |
|---|---|---|
| LoRA 7B-13B, narrow task | USD 200–1,500 | USD 30–80k |
| LoRA 70B, instruction adapt | USD 1,500–6,000 | USD 60–180k |
| Full FT 7B-13B | USD 1,500–15,000 | USD 60–200k |
| Full FT 70B | USD 25–90k | USD 180–450k |
| Continued pre-train, 70B, 50B tokens | USD 180–420k | USD 400k–1.2M |

GPU-hour pricing across H100, H200, B200, A100

GPU pricing in 2026 is unrecognisable compared to the 2023 panic-buy era. Three forces collapsed prices: H100 supply finally catching up with demand in H2 2025, B200/GB200 entering general availability in Q1 2026, and the rise of neoclouds (CoreWeave, Lambda, RunPod, Crusoe, FluidStack, Vast.ai) running at materially lower margins than hyperscalers.

| GPU (prices per GPU-hour) | Hyperscaler on-demand | Neocloud on-demand | Neocloud spot |
|---|---|---|---|
| A100 80GB | USD 2.20–3.20 | USD 1.20–1.80 | USD 0.80–1.40 |
| H100 80GB SXM | USD 2.80–4.20 | USD 1.80–2.60 | USD 1.20–1.80 |
| H200 141GB | USD 3.50–5.00 | USD 2.40–3.40 | USD 1.80–2.40 |
| B200 / GB200 (early access) | USD 5.50–8.00 | USD 4.00–6.00 | limited |
| MI300X | USD 2.90–4.00 | USD 1.90–2.80 | USD 1.30–1.90 |

Two pricing dynamics deserve calling out. First, B200 looks expensive on paper but delivers roughly 2.0–2.5x throughput over H100 on FP8 training and 3–4x on FP4 inference. The per-token cost on a 70B fine-tune is therefore usually lower on B200 than on H100 despite the higher hourly rate, provided the realised speedup exceeds the price ratio (roughly 1.9x at the hyperscaler prices above, 2.2–2.3x on neoclouds). Second, MI300X with ROCm 6.2+ has reached real production parity for LLaMA, Mistral, Qwen and Gemma fine-tuning; if your team can swallow the slightly thinner ecosystem, you save 10–25%.
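To make the B200 comparison concrete, here is a minimal sketch of the per-token arithmetic. It uses the neocloud on-demand midpoints from the table above and treats the quoted 2.0–2.5x FP8 training speedup as the only variable. The function name is illustrative, and real runs will also differ in memory headroom and utilisation.

```python
def relative_cost_per_token(price_new, price_ref, speedup):
    """Cost per token on the new GPU relative to the reference GPU.

    A value below 1.0 means the new GPU is cheaper per token.
    """
    return (price_new / price_ref) / speedup

# Neocloud on-demand midpoints from the pricing table (USD per GPU-hour).
h100 = (1.80 + 2.60) / 2
b200 = (4.00 + 6.00) / 2

for speedup in (2.0, 2.25, 2.5):
    ratio = relative_cost_per_token(b200, h100, speedup)
    print(f"speedup {speedup:.2f}x -> B200 costs {ratio:.2f}x H100 per token")
```

At these midpoints B200 breaks even at about a 2.3x speedup, so the per-token saving shows up towards the top of the quoted throughput range. At hyperscaler on-demand midpoints the break-even is closer to 1.9x.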

LoRA, QLoRA, DPO, full fine-tune — cost per method

Five methods cover 95% of 2026 fine-tuning work. Pick by the shape of the problem, not by what your team most recently read about.

  • Supervised fine-tuning (SFT) with LoRA / QLoRA. Train low-rank adapters (rank 8–64) on top of frozen base weights. 0.1–3% of parameters updated. QLoRA adds 4-bit base-model quantisation, slashing VRAM by ~4x. Cost: 1–5% of full SFT. Default choice.
  • Full SFT. Update all parameters. Required when you change tokenizer, vocabulary, or do continued pre-training. 20–50x more VRAM than LoRA — you need ZeRO-3 / FSDP across multiple nodes for anything above 13B.
  • Direct Preference Optimisation (DPO) and variants (IPO, KTO, ORPO). Aligns the model against preference pairs without a separate reward model. Cost: 1.5–3x SFT on the same dataset. Required when style, safety, or refusal behaviour matters.
  • Continued pre-training. Tens to hundreds of billions of new tokens of domain corpus. Cost dominated by data acquisition (USD 50–500k for a clean specialist corpus) and compute (USD 100–500k for 50B tokens on a 70B model).
  • Reinforcement learning from verifiable rewards (RLVR), GRPO, RLHF. 2026's hot direction for reasoning models. Cost 3–8x SFT for comparable wall-clock; the eval and reward-model infrastructure dominates total spend.
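The "0.1–3% of parameters" and "~4x VRAM" figures for LoRA and QLoRA follow from simple arithmetic. The sketch below assumes an illustrative 7B-class shape (hidden size 4096, 32 layers, four square attention projections adapted per layer); these dimensions are assumptions for the example, not taken from a specific model card, and the memory figure covers base weights only, not activations or optimiser state.

```python
def lora_trainable_params(d_in, d_out, rank):
    """A rank-r adapter on a d_in x d_out weight adds r * (d_in + d_out) parameters."""
    return rank * (d_in + d_out)

# Illustrative 7B-class shape (assumed for this example).
HIDDEN = 4096
LAYERS = 32
ADAPTED_PER_LAYER = 4            # e.g. four hidden x hidden attention projections
BASE_PARAMS = 7_000_000_000

def trainable_fraction(rank):
    """Share of parameters that LoRA trains at a given rank."""
    per_matrix = lora_trainable_params(HIDDEN, HIDDEN, rank)
    return per_matrix * ADAPTED_PER_LAYER * LAYERS / BASE_PARAMS

def weight_memory_gb(params, bits_per_param):
    """Memory needed to hold the base weights alone."""
    return params * bits_per_param / 8 / 1e9

for rank in (8, 64):
    print(f"rank {rank}: {trainable_fraction(rank):.2%} of parameters trained")

bf16 = weight_memory_gb(BASE_PARAMS, 16)   # 16-bit base weights (LoRA)
nf4 = weight_memory_gb(BASE_PARAMS, 4)     # 4-bit base weights (QLoRA)
print(f"base weights: {bf16:.1f} GB at 16-bit vs {nf4:.1f} GB at 4-bit")
```

Adapting more matrices per layer (for example the MLP projections) pushes the trainable share up towards the top of the quoted range.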

Dataset curation: the largest line item nobody budgets

In every audit we run on a stalled fine-tuning programme, the dataset is the gating issue. The internal estimate at the start is invariably 5–10x too low. A realistic 2026 cost stack for a 30,000-pair high-quality instruction dataset in a regulated domain:

| Activity | Cost range | Notes |
|---|---|---|
| Sourcing and rights clearance | USD 2–15k | Counsel review, licensing of third-party corpora, CDSM Article 4(3) opt-out checks for EU. |
| PII / PHI redaction pipeline | USD 3–8k | Presidio + custom regex + LLM-assisted review; mandatory for HIPAA, GDPR Article 5 data minimisation. |
| Annotation labour (SME) | USD 6–25k | USD 20–120/hour depending on domain; legal, medical, finance at the top. |
| Synthetic data generation | USD 1–6k | Claude Opus or GPT-4o calls + verification; cost compresses fast on Sonnet/Haiku for verification. |
| Inter-annotator agreement and adjudication | USD 1–4k | 10–20% double-labeled, third-party adjudication on disagreements. |
| Dataset eval and decontamination | USD 1–3k | n-gram overlap against held-out eval, MinHash near-duplicates, contamination against MMLU/HumanEval/etc. |

Total for a serious 30k-pair dataset: USD 14–61k. For 100k+ pairs in a regulated domain, expect USD 40–180k. This is why we tell clients during fine-tuning engagements that the dataset budget should be 3–6x the compute budget, not the other way around.
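The USD 14–61k total is just the sum of the low and high ends of each line in the cost stack. A small sketch of that roll-up, with dictionary keys that are shorthand labels invented for this example:

```python
# (low, high) in thousands of USD, copied from the cost stack above.
cost_stack_kusd = {
    "sourcing_and_rights": (2, 15),
    "pii_phi_redaction": (3, 8),
    "sme_annotation": (6, 25),
    "synthetic_data": (1, 6),
    "iaa_and_adjudication": (1, 4),
    "eval_and_decontamination": (1, 3),
}

def total_range(stack):
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in stack.values())
    high = sum(hi for _, hi in stack.values())
    return low, high

low, high = total_range(cost_stack_kusd)
pairs = 30_000
print(f"dataset total: USD {low}-{high}k")
print(f"per pair: USD {low * 1000 / pairs:.2f}-{high * 1000 / pairs:.2f}")
```

Keeping the stack as data rather than a single number makes it easy to re-run the total when one line (usually annotation labour) moves.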

Figure: MLOps cost dashboard showing GPU spend by experiment. Treat every fine-tuning experiment as a budgeted line item. Untracked experimentation is where 30–50% of programme spend leaks.

Evaluation infrastructure: don't ship blind

The fastest way to lose money on fine-tuning is to ship a model whose quality you cannot measure. Eval infrastructure for a serious programme:

  • Frozen test set — 500–2,000 examples, never seen in training, versioned, hashed in CI.
  • Production traffic replay set — 1,000–5,000 anonymised real prompts, refreshed monthly.
  • Bias slices — per-group performance to satisfy EU AI Act Article 10(2)(f) and GDPR Article 22 explanations.
  • LLM-as-judge harness — Claude or GPT-4-class judge with hand-validated rubrics; correlation against human judges measured quarterly.
  • Public benchmarks where relevant — MMLU-Pro, MATH, HumanEval+, IFEval, MT-Bench v2, plus a domain-specific benchmark you build once and reuse.
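"Versioned, hashed in CI" for the frozen test set can be as simple as pinning a digest in the repository and failing the build when it changes. A minimal sketch using only the standard library; the function names and the sample record are hypothetical, and a real pipeline would load the examples from the versioned file.

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Return a SHA-256 hex digest over a canonical JSON dump of the examples.

    Keys are sorted so dict ordering cannot change the hash; example order
    still matters, so keep the file order stable.
    """
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_frozen(examples, expected_hash):
    """Raise if the test set no longer matches the hash pinned in the repo."""
    actual = dataset_fingerprint(examples)
    if actual != expected_hash:
        raise ValueError(
            f"frozen test set changed: got {actual}, expected {expected_hash}"
        )

# Usage: pin the hash once, then call assert_frozen in CI on every run.
test_set = [{"prompt": "Extract the governing law clause.",
             "expected": "England and Wales"}]
pinned = dataset_fingerprint(test_set)
assert_frozen(test_set, pinned)
```

Any edit to the test set then requires a deliberate change to the pinned hash, which shows up in code review.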

Setup cost: USD 3–15k. Per-eval cost on a serious harness: USD 200–1,000 in LLM-judge calls. Budget USD 800–3,000/month for continuous eval against production traffic.

Worked examples: 7B, 13B, 70B end-to-end budgets

Three real programmes we ran in 2025–2026, with numbers cleaned of client specifics:

Example A — LoRA on Qwen2.5-7B for legal document extraction

  • Dataset: 14,000 hand-labeled extraction pairs from contract corpus. Annotation by paralegals at USD 45/hour blended. Dataset cost: USD 38,000.
  • Compute: 8xH100 spot for 6 hours per training run, 14 runs across hyperparameter sweep + DPO pass. USD 1,150.
  • Eval harness: USD 6,200 setup, USD 1,800/month ongoing.
  • MLOps and engineering: 6 weeks senior engineer at USD 180/hr blended. USD 43,200.
  • Total programme: USD 88,550. Replaced a USD 22k/month GPT-4o pipeline; break-even in month 5.
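Example A's figures can be re-derived from its inputs. The sketch below assumes a 40-hour billable week for the engineering line (the week length is not stated above) and treats the compute line as pure GPU rental, so the implied hourly rate is an approximation.

```python
def gpu_hours(gpus, hours_per_run, runs):
    """Total GPU-hours consumed across all training runs."""
    return gpus * hours_per_run * runs

example_a = {
    "dataset": 38_000,
    "compute": 1_150,
    "eval_setup": 6_200,
    "engineering": 6 * 40 * 180,   # 6 weeks x 40 h (assumed) x USD 180/hr
}

hours = gpu_hours(gpus=8, hours_per_run=6, runs=14)
implied_rate = example_a["compute"] / hours
total = sum(example_a.values())

print(f"GPU-hours: {hours}")
print(f"implied rate: USD {implied_rate:.2f} per GPU-hour")
print(f"programme total: USD {total:,}")
```

The implied rate lands inside the H100 neocloud spot range quoted earlier, and compute is a little over 1% of the programme total.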

Example B — QLoRA on Llama-3.3-70B for customer-support voice

  • Dataset: 22,000 historical support tickets with curated agent responses; synthetic augmentation 3x. Cost: USD 26,000.
  • Compute: 4xH200 on neocloud for 9 hours per run, 8 runs. USD 1,400.
  • Eval + ops: USD 9,800 setup, USD 2,200/month ongoing.
  • Engineering: 8 weeks. USD 57,600.
  • Total: USD 94,800. Reduced average handle time by 31%; payback in 4 months on labour savings alone.

Example C — Full FT on Mistral-Small-22B for clinical scribe

  • Dataset: 48,000 de-identified clinical dictation pairs; HIPAA-controlled pipeline. Cost: USD 142,000.
  • Compute: 32xH100 FSDP, 18 hours per run, 5 runs. USD 13,500.
  • Eval (medical SME-graded) and compliance: USD 31,000.
  • Engineering, MLOps, HIPAA review: USD 118,000.
  • Total: USD 304,500. Frontier API was not an option (BAA-blocked in this configuration); the fine-tune is the product.

Inference economics and the break-even against frontier APIs

Over a model's life, inference cost dwarfs the cost of training the fine-tune. Run the math early.

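A 13B fine-tune served on a 2xH100 vLLM instance at 80% utilisation delivers roughly 12–20 million output tokens/day at a cost of USD 95–150/day. That is USD 0.005–0.012 per 1k output tokens, or about USD 5–12 per million. Frontier-API output pricing of roughly USD 0.60–15.00 is quoted per million tokens, so compare like with like: at this throughput, self-hosting undercuts only the premium API tiers on raw per-token price, and the saving depends on which tier you replace and the throughput you actually sustain. A 70B fine-tune on 4xH100 lands at USD 0.02–0.06 per 1k tokens (USD 20–60 per million).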
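The per-token figure for the 13B deployment is daily cost divided by daily output. A short sketch of that conversion, using the ranges quoted above; the per-million figure is included because API price lists are usually quoted per million tokens, which makes unit slips easy.

```python
def cost_per_1k_tokens(daily_cost_usd, tokens_per_day):
    """Serving cost in USD per 1,000 output tokens."""
    return daily_cost_usd / tokens_per_day * 1_000

# Best case: cheapest day, highest throughput. Worst case: the reverse.
best = cost_per_1k_tokens(95, 20_000_000)
worst = cost_per_1k_tokens(150, 12_000_000)

print(f"per 1k tokens: USD {best:.5f} to {worst:.5f}")
print(f"per 1M tokens: USD {best * 1_000:.2f} to {worst * 1_000:.2f}")
```

Run the same function with your own measured throughput before committing to a deployment size; the result is very sensitive to sustained utilisation.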

Break-even rule of thumb: a USD 80–120k fine-tune programme pays back inside 3–6 months once you exceed USD 25,000/month in frontier-API inference. Below USD 5,000/month, prompting a frontier model wins on TCO; do not fine-tune.
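The rule of thumb reduces to a one-line payback calculation. A minimal sketch, assuming a flat monthly API bill and a flat monthly cost to run the fine-tuned model (serving plus continuous eval). The USD 5,000/month run cost in the usage lines is an illustrative assumption, not a figure from the programmes above.

```python
def payback_months(programme_cost, monthly_api_spend, monthly_run_cost):
    """Months until cumulative savings cover the one-off programme cost.

    Returns None when running the fine-tune saves nothing per month.
    """
    monthly_saving = monthly_api_spend - monthly_run_cost
    if monthly_saving <= 0:
        return None
    return programme_cost / monthly_saving

RUN_COST = 5_000  # assumed monthly serving + eval cost, for illustration only

print(payback_months(80_000, 25_000, RUN_COST))    # low end of programme cost
print(payback_months(120_000, 25_000, RUN_COST))   # high end of programme cost
print(payback_months(80_000, 5_000, RUN_COST))     # API bill too small to pay back
```

When the API bill is no larger than the cost of running the fine-tuned model, the function returns None, which is the "do not fine-tune" case.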

Ongoing maintenance and drift

A fine-tuned model is not a finished product. Plan USD 8–25k per quarter:

  • Re-evaluation against frozen and refreshed test sets — USD 1–3k.
  • Drift monitoring on production traffic (embedding-distance, semantic similarity, refusal-rate, hallucination-rate) — USD 1–3k.
  • Incremental dataset growth and re-labeling on hard cases — USD 3–10k.
  • One re-train cycle per quarter — USD 2–30k depending on method.
  • Base model migration when better open weights drop (2–3x per year in 2025–2026) — one-time USD 8–40k.

Compliance overhead: GDPR, EU AI Act Article 53, SOC 2

Fine-tuning interacts with three compliance frameworks more than people expect:

  • GDPR. Article 5 data-minimisation, Article 25 privacy by design, Article 28 processor agreements with annotation vendors, Article 32 security of processing, Article 35 DPIA for high-risk processing. PII in training data is a strict no — redact or synthesise.
  • EU AI Act Article 53. If you fine-tune an open-weights model and redistribute, you are a GPAI provider. You owe Annex XI technical documentation, Annex XII downstream-provider information, a copyright policy honouring CDSM Article 4(3) opt-out, and a public training-data summary on the AI Office template. We covered the detail in our EU AI Act SaaS checklist.
  • SOC 2 / ISO 27001:2022. Annex A.5.34 (privacy and protection of PII), A.8.10 (information deletion), A.8.11 (data masking), A.8.28 (secure coding) all apply to your training pipeline; auditors are catching up fast.

For HIPAA-bound work, the BAA chain (you → cloud → GPU provider) must hold all the way down. AWS, GCP and Azure offer a BAA on H100/H200 SKUs; most neoclouds do not. That cost premium is real and unavoidable for PHI fine-tunes.

Top 10 cost mistakes we see in client audits

  1. Defaulting to full fine-tune when LoRA would do — 10–30x compute waste.
  2. Hyperparameter sweeps with no early-stopping — 3–6x sweep cost.
  3. Running on-demand hyperscaler when spot or neocloud was fine — 2–4x compute cost.
  4. No eval harness — ship and pray, then re-train from scratch when it underperforms.
  5. Annotation labour booked to "engineering" budget, never tracked as data cost.
  6. No contamination check against public benchmarks — inflated eval scores, real-world failure.
  7. Training set leaks PII / PHI; counsel forces re-do.
  8. No frozen test set; eval scores drift as test set drifts.
  9. Choosing a base model going EOL in 6 weeks — re-train forced.
  10. No inference cost model before training starts — "we fine-tuned a 70B and now serving costs 4x the API we replaced".
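
Mistake 6 is cheap to guard against. As a rough sketch, assuming whitespace tokenisation and an 8-token window (both arbitrary choices here, and far simpler than a MinHash pipeline), an n-gram overlap filter against your eval sets could look like this. The function names are hypothetical.

```python
def ngrams(text, n=8):
    """Set of lowercase whitespace-token n-grams; empty if the text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts, n=8):
    """Union of all n-grams that appear in any eval or benchmark example."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index

def filter_contaminated(train_texts, eval_index, n=8):
    """Keep only training examples that share no n-gram with the eval index."""
    return [t for t in train_texts if not (ngrams(t, n) & eval_index)]

# Usage with a short window so the toy strings overlap.
eval_index = build_eval_index(["the quick brown fox jumps over the lazy dog"], n=3)
train = ["a quick brown fox appeared", "completely unrelated sentence here today"]
clean = filter_contaminated(train, eval_index, n=3)
print(clean)
```

Exact n-gram matching misses paraphrased leaks, which is why near-duplicate detection is listed alongside it in the dataset cost stack.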

Figure: Engineering team reviewing training curves and cost burndown. Fine-tuning programmes succeed on operational discipline: every run budgeted, every metric tracked, every dollar attributed.

If you are weighing a fine-tuning programme against frontier APIs or RAG, our LLM fine-tuning & MLOps team runs a fixed-price two-week feasibility study: dataset audit, method recommendation, GPU-hour estimate, ROI model, EU AI Act delta. For broader AI architecture decisions across SaaS development and custom software contexts, a fractional CTO with shipped MLOps experience usually pays for itself in the first month.

FAQ

How much does it cost to fine-tune an open-weights LLM in 2026?

LoRA on a 7B-13B model: USD 200–1,500 in compute; USD 30–80k end-to-end. LoRA on 70B: USD 1,500–6,000 compute; USD 60–180k end-to-end. Full fine-tunes 5–15x more.

LoRA vs full fine-tuning?

Default to LoRA / QLoRA. Matches full-FT quality in 85–95% of cases at 1–5% of compute and storage. Full FT only when changing tokenizer/vocabulary or doing continued pre-training.

What is the going GPU-hour price in 2026?

H100 80GB on neocloud spot USD 1.20–1.80; on-demand USD 1.80–2.60. H200 USD 2.40–3.40 on-demand. B200 USD 4.00–6.00 on neocloud but 2–2.5x throughput. A100 spot USD 0.80–1.40 still cost-optimal for small LoRA.

How big a dataset do I actually need?

LoRA instruction-tuning: 1k–10k high-quality pairs beat 100k noisy ones. Domain Q&A: 5k–30k real conversations. Classification/extraction: 2k–10k per class with strong inter-annotator agreement.

When does ROI justify a fine-tune?

Under USD 5k/month API spend — don't fine-tune. USD 5k–25k — only if narrow. Above USD 25k/month, or where latency or data residency forces it — almost always yes.

What does ongoing maintenance cost?

USD 8–25k per quarter: re-eval, drift monitoring, incremental data, one re-train. Teams that skip maintenance lose 4–9 percentage points of quality per quarter.

Build the dataset like it's the product. The model is the artefact.

The single highest-leverage change we make in fine-tuning audits is reallocating budget from compute to data. Spend 60–70% of programme dollars on dataset curation, eval, and labelling; spend 5–15% on compute; spend the rest on MLOps. Teams that flip this ratio ship models that miss; teams that respect it ship models that compound.

Last updated 26 May 2026. Prices reflect publicly observable on-demand and spot pricing across major hyperscalers and neoclouds as of May 2026 and may move sharply. Nothing in this article constitutes legal or investment advice.