Support chatbots

A support bot that reads your docs — and admits when it doesn't know.

RAG architecture grounded in your knowledge base, with human handoff, in 3–4 weeks. Monthly retainer delivery.

Available for new projects
See AI Automation

Starting at $3,000/mo · monthly retainer

Who this is for

CX or support lead with ticket volume growing faster than headcount — L1 tickets flooding the queue, generic chatbots hallucinating, and support SLAs slipping.

The pain today

  • L1 tickets (password reset, billing basics) consuming agent time
  • Generic chatbots (Intercom Fin, Zendesk AI) hallucinating your policies
  • Help center articles fresh but buried — users email instead of searching
  • No confidence signal — bot answers look confident even when wrong
  • Escalation to human painful — user repeats everything already typed

The outcome you get

  • RAG bot grounded in your docs, Notion, Zendesk, and knowledge base
  • Source citations on every answer — user verifies, trust builds
  • Confidence threshold — bot admits uncertainty, escalates to human
  • Seamless handoff to Intercom or Zendesk with full conversation context
  • Hallucination testing suite run before every knowledge-base update

RAG architecture that actually works

RAG (Retrieval-Augmented Generation) is how you keep an LLM grounded — instead of trusting the model's training data, you retrieve relevant documents from your knowledge base and give them to the model as context. Done right, this eliminates most hallucinations. Done wrong (chunk-the-docs, stuff-everything-in-context), it produces confident-looking garbage. The right pattern:

  • Semantic chunking (split docs on meaningful boundaries, not fixed character counts)
  • Hybrid retrieval (semantic + keyword, re-ranked)
  • Context window discipline (give the model 3–5 relevant chunks, not 20 marginal ones)
  • Citation enforcement (model must cite sources or refuse)

I've built this pattern across internal tools and Instill (my AI product); the mechanics are the same regardless of domain.
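A minimal sketch of the hybrid-retrieval step, with illustrative names and weights. The bag-of-words cosine stands in for real embedding similarity, and the 0.6/0.4 blend, `top_k`, and `min_score` are assumed values you would tune, not a prescription:

```python
from collections import Counter
from math import sqrt

def tokenize(text):
    return [t.lower().strip(".,!?") for t in text.split()]

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def hybrid_retrieve(query, chunks, top_k=3, min_score=0.1):
    """Blend 'semantic' similarity (bag-of-words cosine as a stand-in
    for embeddings) with exact keyword overlap, keep only the top_k
    chunks above min_score, and refuse when nothing clears the bar."""
    q = tokenize(query)
    scored = []
    for chunk in chunks:
        c = tokenize(chunk["text"])
        semantic = cosine(q, c)                       # embedding similarity in production
        keyword = len(set(q) & set(c)) / len(set(q))  # exact-match recall
        scored.append((0.6 * semantic + 0.4 * keyword, chunk))
    scored.sort(key=lambda s: s[0], reverse=True)
    hits = [(s, c) for s, c in scored[:top_k] if s >= min_score]
    if not hits:
        return None  # citation enforcement: nothing to cite -> refuse and escalate
    return hits
```

The `None` return is the important part: a retrieval that finds nothing relevant must surface as a refusal, never as a context-free guess.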

Knowledge-source ingestion

Your knowledge lives in many places:

  • Help center articles (Intercom, Zendesk, Confluence, HubSpot)
  • Internal docs (Notion, Google Docs, Slack canvases)
  • Product documentation (GitBook, Docusaurus)
  • Call recordings (Gong, Chorus transcripts)
  • Invoice and account data (your app's DB)

The bot needs access to the relevant sources per conversation — RBAC-respecting, so a non-customer never sees internal docs. I wire ingestion pipelines that refresh on a schedule (daily for fast-changing docs, weekly for stable ones), with explicit quality controls (skip low-quality drafts, prioritize published). Every source is versioned so you can diff what changed when the bot's behavior shifts.
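A sketch of the refresh-and-version logic, under stated assumptions: the source names, cadences, and `status` field are hypothetical, and content-hash versioning stands in for whatever diffing your pipeline actually uses:

```python
import hashlib
from datetime import datetime, timedelta, timezone

# Hypothetical source registry; names and cadences are illustrative.
SOURCES = {
    "help-center": {"refresh": timedelta(days=1)},    # fast-changing docs
    "product-docs": {"refresh": timedelta(weeks=1)},  # stable docs
}

def should_refresh(source, last_run, now):
    """A source is due when it has never run or its cadence has elapsed."""
    return last_run is None or now - last_run >= SOURCES[source]["refresh"]

def ingest(docs, index):
    """Skip unpublished drafts, then version each doc by content hash
    so a shift in bot behavior can be diffed back to a source change."""
    changed = []
    for doc in docs:
        if doc.get("status") != "published":
            continue  # quality control: drop low-quality drafts
        version = hashlib.sha256(doc["text"].encode()).hexdigest()[:12]
        if index.get(doc["id"]) != version:
            index[doc["id"]] = version
            changed.append(doc["id"])
    return changed
```

Re-ingesting unchanged docs returns an empty change list, which is what makes "the bot's answers shifted — what changed?" answerable from the version index alone.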

Guardrails and hallucination testing

Three guardrails every production RAG bot needs:

  • Source citations required on every answer — if the model can't point to a specific doc, it must say "I don't know, let me escalate."
  • Confidence threshold — retrieved-context quality is scored; low-quality retrievals trigger escalation rather than guesses.
  • Hallucination test suite — 50–100 adversarial questions run on every knowledge-base update, catching regressions before they hit production.

I build the test suite with your support leads: questions that have historically been misanswered or that probe policy boundaries. The suite runs in CI; bot changes that lower the score don't ship.

Human handoff

Handoff is where a bot feels respectful — or feels like a wall. Standard: when the bot escalates, full conversation context passes to the human agent, who never asks the user to start over. Routing: by intent (billing vs technical vs cancellation) or by SLA (enterprise customers skip the bot entirely). Return path: the human can route the resolved ticket back to the bot's queue ("the next time you see a question like this, here's the answer"). I integrate with Intercom, Zendesk, Front, or HelpScout — all four have standard handoff patterns. The handoff experience is what makes a bot feel like a first line of defense instead of a roadblock.
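A minimal sketch of the routing and payload shape, with hypothetical queue names and plan tiers — the real mapping lives in your helpdesk's configuration, not in code like this:

```python
def route(intent, plan):
    """Pick a queue by SLA first, then intent. Names are illustrative."""
    if plan == "enterprise":
        return "human-priority"  # enterprise customers skip the bot queue
    return {"billing": "billing-team",
            "cancellation": "retention-team"}.get(intent, "general-support")

def build_handoff(conversation, intent, plan):
    """Package everything the agent needs so the user never repeats
    themselves: full transcript, detected intent, and the target queue."""
    return {
        "queue": route(intent, plan),
        "intent": intent,
        "transcript": [f'{m["role"]}: {m["text"]}' for m in conversation],
    }
```

The transcript field is the whole point: whatever the helpdesk API looks like, the agent opens the ticket already knowing what the user typed.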

Case study: Instill AI product

Instill is my self-initiated AI product launched Q1 2026 — a prompt library that works with every AI tool, built on Next.js 16, React 19, TypeScript, PostgreSQL, Vercel, and MCP Protocol. 30+ active users, 1,000+ skills saved, 45+ projects powered. The production experience of running an AI product (prompt engineering, evaluation pipelines, hallucination handling, user expectation management) feeds directly into every chatbot engagement I scope. I know where AI products fail because I've operated one through iteration. That experience saves months of learning-in-public for my chatbot clients.

Pricing

Customer support chatbots fit the AI Automation retainer at $3,000/mo. First-version timeline: 3–4 weeks from kickoff to production deployment. Retainer continues through knowledge-base tuning, new source ingestion, and guardrail refinement — most clients stay 6–12 months through initial buildout and ongoing optimization. 14-day money-back guarantee, cancel anytime, Work Made for Hire. LLM API costs (OpenAI, Anthropic) are separate and billed at cost to you — typically $100–500/mo depending on conversation volume.

Recent proof

A comparable engagement, delivered and documented.

AI Product · Beta

A prompt library that works with every AI tool

A home for your best AI prompts. Save them once, then use them in Claude, Cursor, or any AI tool you work with. No more copy-paste.

AI Product · 30+ active users · Cross-tool workflows · Self-funded
Read the case study

Frequently asked questions

The questions prospects ask before they book.

Which LLM do you use — GPT, Claude, or open-source?
Depends on use case. Claude (Anthropic) for most support chatbots — strongest at following instructions and admitting uncertainty. GPT-4o (OpenAI) for multilingual and cost-sensitive cases. Open-source (Llama via Groq, Mistral) when data residency or cost at high volume matters. I benchmark against your actual questions in week 1 before committing.
What's the ongoing LLM cost?
Depends on conversation volume and model choice. Typical: $0.01–0.05 per conversation with Claude 3.7 Sonnet. A bot handling 10k conversations/month costs $100–500 in API fees. Higher volume or premium models can run $1k–5k/month. All billed directly to you from the LLM provider.
Will the bot replace my support team?
No. It handles L1 and well-documented L2 questions, freeing your team for complex problems. Typical deflection rate: 30–50% of incoming tickets resolved by bot, with the rest escalated. Your team still handles the hard cases; they're less crushed by the easy ones.
How do I measure if it's working?
Weekly metrics dashboard: deflection rate (percent resolved without escalation), CSAT on bot interactions, hallucination count (flagged manually or by guardrails), escalation accuracy (did the bot escalate appropriately), cost per conversation. Plus qualitative review of a sample of conversations each week.
What about privacy and data handling?
LLM providers vary. Claude (Anthropic) and enterprise OpenAI both commit to not training on your data. I configure with enterprise-grade privacy settings, route PII-sensitive conversations with additional redaction layers, and provide data processing agreements when your compliance team needs them.
Get started in 60 seconds

Ready to start?

Tell me what you need in 60 seconds. Tailored proposal in your inbox within 6 hours.

Available for new projects