Use-case discovery & ROI scoring
We interview product, ops, and support, then score 8 to 15 candidate use cases on revenue impact, build cost, feasibility, and risk. You get a ranked shortlist, a written ROI model for each of the top three, and a clear "do not build this" list with reasoning.
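As a rough illustration of how scoring on those four dimensions can produce a ranked shortlist, here is a minimal sketch. The 1-to-5 scale, the weights, and the names `UseCase`, `score`, and `rank` are assumptions made for this example, not a fixed methodology.

```python
from dataclasses import dataclass

# Hypothetical weights (sum to 1.0); a real engagement would agree these up front.
WEIGHTS = {"revenue_impact": 0.4, "feasibility": 0.3, "build_cost": 0.2, "risk": 0.1}


@dataclass
class UseCase:
    name: str
    revenue_impact: int  # 1 (low) to 5 (high)
    build_cost: int      # 1 (cheap) to 5 (expensive)
    feasibility: int     # 1 (hard) to 5 (easy)
    risk: int            # 1 (safe) to 5 (risky)


def score(uc: UseCase) -> float:
    """Weighted score on a 1-to-5 scale; cost and risk are inverted so higher is better."""
    return (
        WEIGHTS["revenue_impact"] * uc.revenue_impact
        + WEIGHTS["feasibility"] * uc.feasibility
        + WEIGHTS["build_cost"] * (6 - uc.build_cost)
        + WEIGHTS["risk"] * (6 - uc.risk)
    )


def rank(cases: list[UseCase]) -> list[UseCase]:
    """Return use cases ordered from highest to lowest score."""
    return sorted(cases, key=score, reverse=True)
```

The head of the ranked list feeds the shortlist; the low scorers are candidates for the "do not build this" list, with the written reasoning still done by hand.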
LLM provider selection
Eval-driven selection across OpenAI, Anthropic, Bedrock, and Vertex. We measure quality, p50/p95 latency, and cost on a task-specific eval set of 150 to 400 items. The output is an architecture decision record (ADR) naming the chosen model, a fallback model, and the trigger to re-evaluate.
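To make the quality, latency, and cost measurements concrete, the sketch below summarizes one model's run over an eval set. It assumes each eval item has already been executed and recorded as a dict with the hypothetical keys `correct`, `latency_ms`, and `cost_usd`, and it uses the nearest-rank definition of a percentile.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Aggregate per-item results (hypothetical keys) into the numbers compared across models."""
    latencies = [r["latency_ms"] for r in results]
    n = len(results)
    return {
        "accuracy": sum(1 for r in results if r["correct"]) / n,
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "cost_per_item_usd": sum(r["cost_usd"] for r in results) / n,
    }
```

Running `summarize` once per candidate model on the same eval set gives comparable rows for the decision record.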
RAG & data pipelines
Corpus ingestion, chunking strategy calibrated to your document distribution, embedding model selection, vector store sizing, and hybrid retrieval. We size for your actual corpus growth rate, not a default 100k-vector demo.
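Sizing for growth is mostly arithmetic. The sketch below projects the vector count under compound monthly growth and estimates raw embedding storage. The function names, the compound-growth assumption, and the float32 default are illustrative; the estimate also excludes index overhead and metadata, which depend on the vector store.

```python
def projected_vectors(current_docs, chunks_per_doc, monthly_growth, months):
    """Vector count after `months` of compound growth (monthly_growth=0.05 means 5% per month)."""
    docs = current_docs * (1 + monthly_growth) ** months
    return int(docs * chunks_per_doc)


def raw_embedding_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for the embeddings alone; 4 bytes per value assumes float32."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical corpus: 20,000 documents, 12 chunks each, 5% monthly growth, 24-month horizon.
    vectors = projected_vectors(20_000, 12, 0.05, 24)
    gib = raw_embedding_bytes(vectors, dim=1536) / 2**30
    print(f"{vectors:,} vectors, about {gib:.1f} GiB of raw embeddings")
```

For comparison, the 100k-vector demo case at 1,536 dimensions is about 0.6 GB of raw float32 data, which is why demo-sized defaults say little about a growing production corpus.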
Prompt engineering & evals
Versioned prompts in git, regression eval sets that run in CI, Ragas and DeepEval rubrics, and LLM-as-judge scoring with human spot-checks. We refuse to merge prompt changes that regress a tier-1 metric, including our own changes.
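A merge gate of this kind can be sketched as a comparison between baseline and candidate metric scores. The metric names, the tolerance, and the functions `find_regressions` and `gate` are assumptions for illustration; the scores themselves would come from whichever eval framework the project runs in CI.

```python
# Hypothetical tier-1 metric names and tolerance; a real project defines its own.
TIER1_METRICS = ("faithfulness", "answer_relevancy")
TOLERANCE = 0.01


def find_regressions(baseline, candidate, metrics=TIER1_METRICS, tolerance=TOLERANCE):
    """Return {metric: (baseline, candidate)} for each metric that dropped by more than tolerance."""
    return {
        m: (baseline[m], candidate[m])
        for m in metrics
        if candidate[m] < baseline[m] - tolerance
    }


def gate(baseline, candidate):
    """Return a process exit code: 0 to allow the merge, 1 to block it."""
    failed = find_regressions(baseline, candidate)
    for metric, (old, new) in failed.items():
        print(f"REGRESSION {metric}: {old:.3f} -> {new:.3f}")
    return 1 if failed else 0
```

In CI, a small wrapper script would load both score files and pass the result of `gate` to `sys.exit`, so a regression fails the pipeline regardless of who authored the change.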
Security & PII handling
PII stripping at ingress (Presidio or a fine-tuned classifier), zero-retention provider contracts, EU endpoints for EU data, and egress scans for hallucinated PII. DPAs and sub-processor lists are aligned to your customer contracts.
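The shape of an ingress or egress scrubber is shown below as a deliberately simplified stand-in: it uses two regular expressions rather than Presidio or a trained classifier, and it catches only email addresses and phone-like digit runs. The patterns, placeholder tokens, and the `redact` function are assumptions for illustration, not production-grade PII detection.

```python
import re

# Simplified patterns for illustration only; real detection needs far broader coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text):
    """Replace each match with a <LABEL> placeholder and return (clean_text, counts_by_label)."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"<{label}>", text)
        counts[label] = n
    return text, counts
```

The same function can run on prompts before they leave your network and on model output before it reaches a user; a non-zero count on the egress side is a signal worth logging.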
MLOps for LLMs
Observability through LangSmith, Langfuse, or Helicone. Per-request logging of prompt version, model, tokens, latency, and cost. Cost alerts, latency SLOs, automated A/B tests of prompts, and a written runbook for model upgrades and provider outages.
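The per-request record described above can be sketched as a small structured log entry. This is a tool-agnostic illustration, not the API of LangSmith, Langfuse, or Helicone; the model name, the prices in `PRICES`, and the `RequestLog` fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider and model.
PRICES = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class RequestLog:
    prompt_version: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


def build_log(prompt_version, model, input_tokens, output_tokens, latency_ms):
    """Compute cost from token counts and assemble one log record."""
    price = PRICES[model]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
    return RequestLog(prompt_version, model, input_tokens, output_tokens, latency_ms, round(cost, 6))


def to_json_line(log):
    """Serialize a record as one JSON line, suitable for a log shipper."""
    return json.dumps(asdict(log), sort_keys=True)
```

Keeping the prompt version on every record is what makes cost alerts and A/B comparisons possible later, because each metric can be grouped by the exact prompt that produced it.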