Healthtech AI automation

AI features in your healthtech product with evals and audit trails

Product AI (summarisation, triage support, documentation help) plus internal ops automation. HIPAA-aware, eval-ready. $3,000/mo retainer.

Available for new projects
See AI Automation

Starting at $3,000/mo · monthly retainer

Who this is for

A healthtech founder or product lead whose market expects AI features, where compliance and evals are non-negotiable, and whose current runway does not cover an ML hire.

The pain today

  • Market expects AI features in the product
  • Compliance and evals cannot be skipped
  • No ML hire budget in current runway
  • Previous AI attempts did not ship because of HIPAA concerns
  • Need AI that clinical advisors and investors trust

The outcome you get

  • AI features inside healthtech product on $3,000/mo retainer
  • HIPAA-aware architecture with evals from day one
  • Human-in-the-loop where clinical judgement matters
  • Model choice aligned with data sensitivity and budget
  • Audit trails satisfying regulatory and investor review

AI inside healthtech products

Three patterns deliver ROI without compromising safety:

  • Summarisation — long clinical text, research papers, or patient history condensed for fast review
  • Triage support — suggesting a priority level or routing based on structured inputs, with clinician approval
  • Documentation help — drafting notes from structured data for clinician review and edit

All three keep humans in the decision loop. What AI should not do in healthtech products: diagnose autonomously, recommend treatment without clinician oversight, or calculate clinical risk scores that directly drive care decisions. The line between assistance and decision matters.
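The triage-support pattern can be sketched as a suggestion object that nothing downstream acts on until a clinician approves it. This is a minimal illustration, not production code: the rule inside `suggest_priority` stands in for an LLM call, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Priority(Enum):
    ROUTINE = "routine"
    URGENT = "urgent"

@dataclass
class TriageSuggestion:
    priority: Priority
    rationale: str
    approved: bool = False  # stays False until a clinician signs off

def suggest_priority(intake: dict) -> TriageSuggestion:
    # Placeholder rule standing in for a model call; the output is
    # still only a suggestion, never a care decision.
    if intake.get("chest_pain") and intake.get("age", 0) > 50:
        return TriageSuggestion(Priority.URGENT, "chest pain, age over 50")
    return TriageSuggestion(Priority.ROUTINE, "no red flags in structured intake")

def clinician_approve(suggestion: TriageSuggestion) -> TriageSuggestion:
    # The human step: routing only proceeds on an approved suggestion.
    suggestion.approved = True
    return suggestion
```

The point of the shape is the `approved` flag: the AI proposes, the clinician disposes, and the audit trail records both steps.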

Guardrails, evals, and audit trails

Three non-negotiables for healthtech AI:

  • Guardrails — explicit system prompts defining what the AI cannot do, output filters that block inappropriate content, and human review steps where regulation demands them
  • Evals — golden-set test cases with expected outputs, regression testing on every prompt change, and production sampling with human review
  • Audit trails — every AI interaction logged with inputs, outputs, user, patient ID, and downstream action

For regulatory review, these three are the difference between a defensible AI feature and a compliance fire.
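An audit-trail record for one AI interaction can be as simple as an append-only structured log entry. A minimal sketch, with illustrative field names; the content hash is one way to let reviewers verify a record was not edited after the fact.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user_id: str, patient_id: str, feature: str,
                 model_input: str, model_output: str,
                 downstream_action: str) -> dict:
    """One append-only record per AI interaction (field names illustrative)."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "patient_id": patient_id,
        "feature": feature,
        "input": model_input,
        "output": model_output,
        "downstream_action": downstream_action,
    }
    # Hash over the canonical JSON so later tampering is detectable.
    record["sha256"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record
```

In practice these records go to write-once storage with the retention period regulation requires, but the shape — inputs, outputs, user, patient ID, downstream action — is the part reviewers ask for first.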

Model choice (hosted vs on-prem)

Hosted — Azure OpenAI (HIPAA BAA), AWS Bedrock with Claude (HIPAA BAA), or Anthropic enterprise. Easier to operate, faster to iterate, and good for most healthtech AI. On-prem or self-hosted open source — Llama 3/4 via vLLM, Mistral, or dedicated models. Required when data sensitivity or cost dictates it: for example, clinical trial data that cannot leave your infrastructure. Self-hosting adds operational overhead but pays back at scale. I help you decide in week one based on data flow and scale.

Pricing and engagement model

$3,000/mo retainer. Covers AI integration, HIPAA-aware architecture, eval infrastructure, monitoring, and iteration. 14-day money-back guarantee; cancel anytime. 100 percent code ownership under Work Made for Hire. NDA and BAA standard. LLM and infrastructure costs pass through at $500 to $3,000/month depending on scale and model choice. For healthtech companies that are raising, this retainer often pairs with an Applications subscription for the product build itself — a bundled engagement keeps cadence aligned.

Case: Instill and Cuez

Instill: a self-initiated AI skills platform with 30+ active users, 1,000+ skills saved, and 45+ projects powered (Next.js 16, React 19, TypeScript, PostgreSQL, Vercel, MCP Protocol). Its structured-prompt library pattern applies directly to healthtech AI — tasks captured as structured prompts with clear output formats and evals. Cuez: a broadcast-SaaS API taken from 3 s to 300 ms, 10x faster. That performance discipline transfers to AI — healthtech AI features need to be fast, reliable, and cost-controlled. These are the same patterns I ship for healthtech clients.

When a regulated clinical-AI partner is required

For product features that make clinical decisions — diagnostic aid, clinical risk scoring, treatment recommendation with regulatory implications — use FDA-cleared vendors or go through the FDA clearance process yourself. I build pre-clearance features (drafts, summaries, triage support) that keep humans in the decision loop. For healthtech founders targeting FDA clearance as a product outcome, the pathway involves clinical trials, regulatory-affairs consultants, and partner strategies I do not run. I work alongside that effort, not in place of it.

Recent proof

A comparable engagement, delivered and documented.

AI Product · Beta

A prompt library that works with every AI tool

A home for your best AI prompts. Save them once, then use them in Claude, Cursor, or any AI tool you work with. No more copy-paste.

AI Product · 30+ active users · Cross-tool workflows · Self-funded
Read the case study

Frequently asked questions

The questions prospects ask before they book.

How do you handle evals for product AI?
Golden-set evals (50 to 200 hand-crafted test cases with expected outputs) for each AI feature. Evals run on every prompt change. Production sampling with human review catches drift that evals miss. For high-stakes outputs (clinician-facing content), 100 percent human review before release to users. Eval infrastructure is LangSmith, Braintrust, or custom, depending on scale. Evals are mandatory — AI features without evals drift silently and create safety risks.
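A golden-set eval loop at its smallest is a list of cases with expected outputs and a gate on the pass rate. A sketch with exact-match scoring and invented cases; real evals often grade with semantic similarity or a rubric instead.

```python
# Hand-crafted cases with expected outputs (contents are illustrative).
GOLDEN_SET = [
    {"input": "BP 180/120, severe headache", "expected": "urgent"},
    {"input": "routine refill request", "expected": "routine"},
]

def run_evals(model_fn, cases, threshold: float = 1.0):
    """Score model_fn against the golden set; gate releases on the pass rate."""
    results = [model_fn(case["input"]) == case["expected"] for case in cases]
    pass_rate = sum(results) / len(results)
    # A prompt change that drops the pass rate below threshold does not ship.
    return pass_rate, pass_rate >= threshold
```

The same loop runs in CI on every prompt change, which is what turns "we changed the prompt" from a silent risk into a tested release.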
What about latency for clinical workflows?
Clinicians do not tolerate slow AI. Targets: under 3 seconds for inline AI help, under 10 seconds for summarisation. For cases that require longer processing, output is streamed so clinicians see partial results immediately. Model routing (a smaller, faster model for simple tasks, a larger one for complex tasks) keeps latency under control, and caching covers repeated queries. At Cuez I achieved 10x performance improvements under load — the same discipline applies to AI features in healthtech products.
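Model routing can be sketched as a small decision function in front of the model call. The task names, token cutoff, and model identifiers below are all placeholders, not real model IDs or tuned thresholds.

```python
# Tasks simple enough for a small, fast model (illustrative names).
SIMPLE_TASKS = {"inline_help", "field_autocomplete"}

def route_model(task: str, token_estimate: int) -> str:
    """Pick a model tier by task type and input size (thresholds illustrative)."""
    if task in SIMPLE_TASKS and token_estimate < 500:
        return "small-fast-model"      # low latency for inline assistance
    return "large-capable-model"       # summarisation and complex drafting
```

The win is that latency-sensitive inline features never wait on the large model, while summarisation still gets the capability it needs.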
How do you handle PHI boundaries?
PHI-touching AI runs on HIPAA-compliant infrastructure with BAAs. Non-PHI AI (public knowledge questions, product help) can run on general-purpose AI. The boundaries are clear in code — PHI flows only through HIPAA-eligible providers. For hybrid features (PHI plus public knowledge), the architecture isolates PHI calls from public calls. Boundary design decisions are made in the first month of the engagement.
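The "PHI flows only through HIPAA-eligible providers" boundary is easiest to keep when it is enforced in one place in code. A minimal fail-closed sketch; the provider identifiers are illustrative, not real SKU names.

```python
# Providers covered by a signed BAA (identifiers are illustrative).
HIPAA_ELIGIBLE = {"azure-openai", "bedrock-claude"}

def select_provider(contains_phi: bool, preferred: str) -> str:
    """Fail closed: a PHI-tagged request may only reach a BAA-covered provider."""
    if contains_phi and preferred not in HIPAA_ELIGIBLE:
        raise ValueError(f"{preferred} is not HIPAA-eligible for PHI traffic")
    return preferred
```

Raising instead of silently rerouting is deliberate: a misclassified request should fail loudly in testing, not quietly leak PHI in production.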
What about data retention?
For HIPAA-compliant LLM tiers, prompts and outputs are retained under the provider's BAA terms (varies by provider but typically 30 to 90 days for abuse monitoring, with opt-out options for longer-retention scenarios). For workflows where retention creates risk, we use no-retention tiers or self-hosted models. Audit logs kept for 6 years (HIPAA requirement). Patient right-to-deletion handled through the application — deleting patient data cascades through AI-processed data.
Can AI integrate with our EHR?
Yes. EHRs with FHIR APIs (Athena, Epic via App Orchard, Cerner, DrChrono) integrate for reading clinical context. The AI reads EHR data, produces output, and writes it back to the EHR as structured or unstructured notes with clinician approval. Full EHR write integration requires careful testing — EHRs do not forgive sloppy writes. Typical integration: 4 to 8 weeks on top of standard AI feature work.
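The write-back step usually means packaging the AI draft as a standard FHIR resource. A sketch of wrapping a drafted note as a FHIR R4 DocumentReference in `preliminary` status, so the EHR treats it as pending clinician sign-off; the patient ID and note text are placeholders.

```python
import base64

def draft_note_to_fhir(patient_id: str, note_text: str) -> dict:
    """Wrap an AI-drafted note as a FHIR R4 DocumentReference awaiting review."""
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "docStatus": "preliminary",  # not "final" until a clinician signs
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [{
            "attachment": {
                "contentType": "text/plain",
                # FHIR Attachment.data is base64-encoded
                "data": base64.b64encode(note_text.encode()).decode(),
            }
        }],
    }
```

Keeping `docStatus` at `preliminary` is the EHR-side expression of the human-in-the-loop rule: the draft exists in the chart, but it is visibly unsigned until a clinician finalises it.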
Get started in 60 seconds

Ready to start?

Tell me what you need in 60 seconds. Tailored proposal in your inbox within 6 hours.

Available for new projects