Intent design & conversation flows
Workshop with your support, sales, or ops team to map real user intents from ticket and chat data. Flow diagrams, slot-filling logic, escalation rules, and a written conversation design doc before any code ships.
Services
We design and ship LLM-powered chatbots that pass an eval bar, not a demo. GPT-4o, Claude 3.7, and Gemini 2.0 picked per workload, RAG grounding on Pinecone or pgvector, Slack/Teams/WhatsApp channels, human handoff into Intercom/Zendesk/Salesforce, and full Langfuse observability. Every project ships with a versioned golden set and Ragas regression tests so hallucination is a tracked SLO, not a worry. Discovery and flow design is a fixed 9,000 EUR, an MVP is a fixed 32,000 EUR, and production support starts at 8,500 EUR/month.
Most chatbots fail in the same three ways: they hallucinate confidently on questions outside their knowledge base, they trap users in dead-end loops instead of handing off to a human, and they ship without an eval suite so nobody can prove month two is better than month one. We build chatbots around those three failure modes. Every conversation flow has an escape hatch to a human agent with full context. Every factual answer is grounded in a retrieval citation. Every release runs against a versioned golden set with Ragas faithfulness and answer-relevance scoring. The bot ships when the numbers say it should, not when the calendar says it should.
GPT-4o, Claude 3.7, or Gemini 2.0 picked per workload on the basis of a side-by-side eval against your real data. Function calling for tool use, structured outputs for ticket creation, and routing logic that fails safe.
Ingestion pipeline for docs, help center articles, Confluence, Notion, SharePoint, and Zendesk macros. Pinecone or pgvector index with hybrid search, citation rendering, and confidence-based refusal when retrieval is weak.
Web widget, Slack, Microsoft Teams, WhatsApp Business via Twilio or Meta Cloud API, SMS, Telegram, and voice via Twilio or LiveKit. Channel-agnostic conversation engine: same flows, same RAG, same eval suite.
First-class integration with Intercom, Zendesk, Salesforce Service Cloud, Front, HubSpot. Handoff carries transcript, detected intent, citations, and confidence score. Triggers tuned against your CSAT and AHT targets.
Langfuse tracing on every conversation, Helicone cost dashboards, PostHog session replay, GA4 funnels, weekly eval regression reports, and a monthly improvement loop where low-confidence answers feed back into the golden set.
Weeks 1–3: mine your ticket and chat data, run intent workshops with support/ops, write the conversation design doc, pick the LLM via side-by-side eval, build the golden set v0. Go/no-go before MVP build.
Weeks 4–7: ingestion pipeline, vector index, hybrid retrieval, top intents wired with tool calls, structured outputs, citation rendering. Ragas eval running on every PR. Confidence thresholds tuned against the golden set.
Weeks 8–9: launch channel (web, Slack, Teams, or WhatsApp), human handoff into your support tool with full context, escalation triggers, analytics dashboards, runbooks for incidents.
Week 10 onward: canary rollout to 10 percent, then 50, then 100. Weekly eval regression review, monthly intent expansion, quarterly model upgrade ablation. Production support runs as a retainer if you want it.
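A staged canary like the 10, 50, 100 percent rollout above is often implemented with stable hash bucketing, so a user who lands in the canary at 10 percent stays in it at 50. The sketch below is one possible way to do that, not a description of our production code; `bucket`, `in_canary`, and the choice of SHA-256 are illustrative assumptions.

```python
import hashlib

# Rollout stages in percent of traffic, taken from the plan above.
ROLLOUT_STAGES = (10, 50, 100)


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99 (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user should be served by the new bot at this stage."""
    return bucket(user_id) < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for stage in ROLLOUT_STAGES:
        share = sum(in_canary(u, stage) for u in users) / len(users)
        print(f"stage {stage:>3}%: {share:.1%} of sample users routed to the canary")
```

Because the bucket depends only on the user id, raising the percentage only ever adds users to the canary; nobody flips back and forth between old and new behaviour mid-rollout.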
Three weeks fixed. Ticket and chat data audit, intent workshops, conversation design doc, LLM provider eval, golden set v0, and a written MVP plan with cost and timeline. Credit applied to MVP if you proceed. 9,000 EUR fixed.
8–10 weeks. Production chatbot on one channel with RAG grounding, human handoff into your support tool, analytics dashboards, monitoring, and 30 days post-launch support. Eval bar agreed before kickoff. 32,000 EUR fixed.
Continuous flow tuning, eval expansion, new intents, additional channels, model upgrades, vendor cost optimization, on-call for incidents. One senior engineer plus eval support, six-month minimum. From 8,500 EUR/month.
Pricing excludes LLM API consumption — we set up the providers on your accounts so you keep the cost lever and zero-retention contractual terms.
Production social platform — App Store + Google Play, live across the US and EU — with geo Radar, encrypted messaging and a virtual economy.
Native iOS and Android e-signature clients with a Symfony + React CRM for a cross-border law firm — KYC onboarding and a defensible evidence trail for US & EU matters.
Consumer WireGuard VPN app for iOS and Android with zero-log architecture, launched across the US and EU.
GDPR-aligned · ISO 27001 ready · SOC 2 Type II in progress · HIPAA-capable · CCPA-acknowledged
Faithfulness, answer relevance, and context precision are tracked in Langfuse and reviewed weekly. If a release regresses on the golden set by more than the agreed threshold, the merge is blocked rather than shipped behind a feature flag.
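As a rough illustration of that merge gate, the sketch below compares a candidate release's golden-set scores against a baseline and blocks the merge when any metric drops by more than an agreed amount. It assumes the scores have already been computed by the eval run; `EvalScores`, `regressions`, `merge_allowed`, and the 0.02 default are hypothetical names and values, not an agreed threshold.

```python
from dataclasses import dataclass
from typing import Dict

METRICS = ("faithfulness", "answer_relevance", "context_precision")


@dataclass(frozen=True)
class EvalScores:
    """Aggregate golden-set scores in [0, 1] for one release (illustrative shape)."""
    faithfulness: float
    answer_relevance: float
    context_precision: float


def regressions(baseline: EvalScores, candidate: EvalScores,
                max_drop: float = 0.02) -> Dict[str, float]:
    """Return each metric whose score dropped by more than max_drop, with the drop."""
    drops = {}
    for name in METRICS:
        drop = getattr(baseline, name) - getattr(candidate, name)
        if drop > max_drop:
            drops[name] = round(drop, 4)
    return drops


def merge_allowed(baseline: EvalScores, candidate: EvalScores,
                  max_drop: float = 0.02) -> bool:
    """A merge is allowed only when no metric regressed past the threshold."""
    return not regressions(baseline, candidate, max_drop)
```

In a CI job, a `False` result from `merge_allowed` would translate into a non-zero exit code, which is what actually blocks the pull request.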
We use Voiceflow and Botpress when they fit, but the conversation engine is code in your repo. No vendor lock-in, no surprise per-message fees, no “the platform is down” phone calls on a Tuesday afternoon.
LLM APIs run on your provider accounts, Helicone shows real-time spend per intent, and we ship cost-optimization recommendations monthly: cheaper models for high-volume intents, prompt compression, prefix caching.
For regulated workloads we sign HIPAA BAAs, route to HIPAA-eligible LLM endpoints, and integrate with your existing data governance and DLP tooling rather than building a parallel stack alongside it.
It depends on the workload, not on brand loyalty. GPT-4o leads on tool-calling reliability and structured-output adherence at low latency; we default to it for transactional support bots that hit APIs. Claude 3.7 leads on long-context grounding and refusal calibration; we default to it for legal, compliance, and policy-heavy assistants. Gemini 2.0 leads on cost per token at frontier quality for high-volume read-heavy workloads. Every engagement starts with a side-by-side eval against your real ticket data, presented as a written comparison with cost, p95 latency, and refusal-rate numbers before we pick.
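One simple way to turn such a comparison into a decision is to filter candidates by the hard constraints and then take the cheapest survivor. The sketch below shows that rule only; the field names, the thresholds, and the rows in the usage example are placeholders for illustration, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EvalResult:
    """One row of a side-by-side comparison (hypothetical schema)."""
    model: str
    quality: float               # golden-set score in [0, 1]
    p95_latency_ms: float
    refusal_rate: float          # share of answerable questions refused
    cost_per_1k_convos_eur: float


def pick_model(results: Sequence[EvalResult], min_quality: float,
               max_p95_ms: float, max_refusal_rate: float) -> Optional[EvalResult]:
    """Return the cheapest model that meets every constraint, or None if none does."""
    eligible = [
        r for r in results
        if r.quality >= min_quality
        and r.p95_latency_ms <= max_p95_ms
        and r.refusal_rate <= max_refusal_rate
    ]
    if not eligible:
        return None
    # Cheapest first; ties are broken in favour of higher quality.
    return min(eligible, key=lambda r: (r.cost_per_1k_convos_eur, -r.quality))


if __name__ == "__main__":
    # Placeholder numbers, invented purely to exercise the function.
    rows = [
        EvalResult("model_a", 0.91, 1800, 0.04, 42.0),
        EvalResult("model_b", 0.93, 2600, 0.03, 55.0),
        EvalResult("model_c", 0.86, 1200, 0.02, 18.0),
    ]
    choice = pick_model(rows, min_quality=0.90, max_p95_ms=2000, max_refusal_rate=0.05)
    print(choice.model if choice else "no model meets the bar")
```

Returning `None` when nothing qualifies is deliberate: it forces a conversation about relaxing a constraint instead of silently picking the least bad option.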
Three layers. First, RAG grounding: every factual answer cites a passage from your knowledge base via Pinecone or pgvector, and the LLM is prompted to refuse when retrieval confidence is below a tuned threshold. Second, the eval harness: a golden set of 300 to 800 real questions with labelled correct answers, scored every release with Ragas (faithfulness, answer relevance, context precision/recall) plus rubric-based LLM-as-judge. Third, monitoring in production: Langfuse traces every conversation, flags low-confidence answers for human review, and feeds them back into the golden set. Hallucination rate is a tracked SLO, not a vibe.
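The first layer can be sketched as a guard in front of the model call. Note the simplification: the text above describes prompting the LLM to refuse, whereas this sketch refuses in code before the model is called at all, which is a related but different technique. `Passage`, `BotReply`, `answer_or_refuse`, the `generate` callable, and the 0.75 default are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    score: float  # retrieval similarity in [0, 1], higher is better (assumed scale)


@dataclass(frozen=True)
class BotReply:
    text: str
    citations: List[str]
    refused: bool


REFUSAL_TEXT = "I can't find that in the knowledge base, so I'll pass you to a human agent."


def answer_or_refuse(question: str, passages: Sequence[Passage],
                     generate: Callable[[str], str],
                     min_score: float = 0.75, top_k: int = 3) -> BotReply:
    """Answer from strong passages only; refuse when retrieval is too weak."""
    strong = sorted(
        (p for p in passages if p.score >= min_score),
        key=lambda p: p.score,
        reverse=True,
    )[:top_k]
    if not strong:
        # Nothing cleared the threshold: do not call the model at all.
        return BotReply(REFUSAL_TEXT, [], True)
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in strong)
    prompt = (
        "Answer using only the sources below and cite their ids.\n"
        f"{context}\n\nQuestion: {question}"
    )
    return BotReply(generate(prompt), [p.doc_id for p in strong], False)
```

`generate` stands in for whichever provider client is used, so the guard can be unit-tested with a stub and the threshold tuned against the golden set without touching the model integration.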
Yes, and the handoff is a first-class part of the design, not an afterthought. We integrate with Intercom, Zendesk, Salesforce Service Cloud, Front, and HubSpot Service Hub via their native APIs. The handoff includes the full conversation transcript, the user intent the bot detected, retrieval citations, and a confidence score so the human agent has context. Handoff triggers are configurable: explicit user request, low confidence, sensitive intent (billing dispute, legal, complaint), or after N failed clarifications. We tune the threshold against your CSAT and AHT targets in the first month.
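The four triggers and the handoff contents listed above map naturally onto a small decision function and a payload builder. This is a minimal sketch under assumed names and defaults: `TurnState`, `handoff_reason`, `handoff_payload`, the intent labels, and the thresholds are hypothetical, and the payload is a plain dictionary, not the schema of any particular helpdesk API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Intent labels are illustrative; real labels come from the conversation design doc.
SENSITIVE_INTENTS = frozenset({"billing_dispute", "legal", "complaint"})


@dataclass
class TurnState:
    intent: str
    confidence: float
    user_requested_human: bool = False
    failed_clarifications: int = 0
    transcript: List[str] = field(default_factory=list)
    citations: List[str] = field(default_factory=list)


def handoff_reason(state: TurnState, min_confidence: float = 0.6,
                   max_failed_clarifications: int = 2) -> Optional[str]:
    """Return the first trigger that fires, or None to let the bot continue."""
    if state.user_requested_human:
        return "explicit_request"
    if state.intent in SENSITIVE_INTENTS:
        return "sensitive_intent"
    if state.confidence < min_confidence:
        return "low_confidence"
    if state.failed_clarifications >= max_failed_clarifications:
        return "failed_clarifications"
    return None


def handoff_payload(state: TurnState, reason: str) -> Dict[str, object]:
    """Bundle the context the human agent needs when taking over."""
    return {
        "reason": reason,
        "transcript": list(state.transcript),
        "intent": state.intent,
        "citations": list(state.citations),
        "confidence": state.confidence,
    }
```

The check order is a design choice: an explicit request for a human wins over everything else, so a confident bot never argues with a user who has asked to leave.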
Web chat widget (vanilla JS or React drop-in), Slack, Microsoft Teams, WhatsApp Business via Twilio or Meta Cloud API, SMS, Telegram, Intercom Messenger, Facebook Messenger, and voice via Twilio Voice or LiveKit. The conversation engine is channel-agnostic: same flows, same RAG index, same eval suite. Channel-specific work is mostly authentication and rich-message rendering. A typical second channel adds two to three weeks; a third channel adds one. WhatsApp Business takes longer because of Meta template approval, which is paperwork, not engineering.
Engagement starts with a GDPR-aligned DPA and a data flow diagram showing every place a user message lands. EU clients run on EU regions only (AWS eu-west-1, eu-central-1, GCP europe-west). PII redaction (Presidio plus custom rules) runs before any prompt hits the LLM provider. Conversation logs are retained per your policy with right-to-erasure tooling built in. For Anthropic, OpenAI, and Google we use zero-retention API endpoints where available. We are GDPR-aligned, ISO 27001 ready, SOC 2 Type II in progress, HIPAA-capable for healthtech, and CCPA-acknowledged for US consumer products.
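To make the "custom rules" part of that redaction step concrete, here is a deliberately small regex-based sketch. It is not Presidio and does not use its API; it only illustrates the idea of replacing recognisable identifiers with placeholders before a message leaves your infrastructure. The patterns, the placeholder tokens, and the `redact` function are assumptions and would miss many real-world formats.

```python
import re

# Order matters: IBANs are mostly digits, so they are masked before phone numbers.
RULES = (
    ("<EMAIL>", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("<IBAN>", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")),
    ("<PHONE>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
)


def redact(text: str) -> str:
    """Replace emails, IBAN-like strings, and phone-like numbers with placeholders."""
    for placeholder, pattern in RULES:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    message = "Mail jane.doe@example.com or call +49 30 1234567."
    print(redact(message))  # Mail <EMAIL> or call <PHONE>.
```

Simple patterns like these tend to over-match (long order numbers can look like phone numbers), which is one reason rule-based redaction is usually paired with a trained recogniser and tested against real transcripts.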
Discovery and flow design is a fixed 9,000 EUR over three weeks: intents, conversation flows, knowledge audit, eval golden set v0, and a written delivery plan. A production MVP on one channel with RAG, handoff, and analytics is fixed 32,000 EUR over 8 to 10 weeks. Production support and continuous improvement (eval expansion, flow tuning, model upgrades, vendor cost optimization, on-call) runs from 8,500 EUR/month with a six-month minimum. Pricing excludes LLM API consumption, which is billed on your accounts directly so you keep the cost lever.