10k → 100k users without the 3 am incidents.
A scaling roadmap (caching, database, queues, observability) with execution support. Proven at Cuez (10x capacity unlock) and bolttech (99.9% uptime).
Who this is for
Founders at 10k–100k users whose stack is showing cracks: incidents are increasing, 'works on my machine' code is breaking in production, and observability is thin.
The pain today
- Production incident frequency climbing month over month
- Monitoring exists but doesn't catch problems before users do
- Database CPU spiking during peak hours
- Features that worked at 1k users slow dramatically at 10k
- Team time consumed by incident response instead of feature work
The outcome you get
- Scaling roadmap with specific waypoints (1k → 10k → 100k → 1M)
- Observability-first — know about problems before users do
- Database scaling strategy (caching, read replicas, connection pooling, partitioning)
- Queue architecture for decoupling and backpressure
- Incident response runbooks so on-call doesn't feel like a crisis
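To make the "queue architecture for decoupling and backpressure" item concrete, here is a minimal sketch using only Python's standard library. It assumes an in-process `asyncio.Queue` as a stand-in for a real broker; the names `producer`, `worker`, and `run` are hypothetical and chosen for illustration. The point is the bounded queue: when it is full, the producer waits instead of piling up unbounded work.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_jobs: int) -> None:
    for job_id in range(n_jobs):
        # put() suspends while the queue is full: this is the backpressure.
        await queue.put(job_id)


async def worker(queue: asyncio.Queue, done: list) -> None:
    while True:
        job_id = await queue.get()
        await asyncio.sleep(0)  # stand-in for real work (email, export, webhook)
        done.append(job_id)
        queue.task_done()


async def run(n_jobs: int = 20, max_pending: int = 5, n_workers: int = 2) -> list:
    # maxsize bounds how much work can be waiting at any moment.
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_pending)
    done: list = []
    workers = [asyncio.create_task(worker(queue, done)) for _ in range(n_workers)]
    await producer(queue, n_jobs)
    await queue.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return done
```

Calling `asyncio.run(run())` processes all 20 jobs while never holding more than 5 in the queue. In production the same idea usually lives in a broker with bounded queues or consumer-side rate limits rather than in process memory.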
Scaling waypoints
Different user counts mean different bottlenecks:
- 1k users: a single server can handle everything; optimization is about speed, not capacity.
- 10k users: the database becomes the hot spot. Connection pool limits bite, slow queries compound, and read replicas start to matter.
- 100k users: a caching layer becomes essential (Redis or a CDN), anything that can be async moves onto a queue, and database partitioning conversations begin.
- 1M users: horizontal scale everywhere, sharding or multi-region, and a dedicated observability team.
Each waypoint has architectural implications; trying to solve 1M-user problems at 10k scale wastes engineering effort. I identify your current waypoint and plan to the next one. Incremental scaling is the pattern, not big-bang re-architecture.
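The caching layer mentioned at the 100k waypoint usually follows a cache-aside pattern. The sketch below is illustrative only: it assumes an in-memory dictionary standing in for Redis, and the `TTLCache` class and its method names are hypothetical, not part of any library.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache-aside with expiry; a dict stands in for Redis in this sketch."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # hit: the expensive read is skipped
        value = loader()  # miss or expired: go to the database
        self._store[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        # Call after a write so readers do not see stale data until the TTL ends.
        self._store.pop(key, None)
```

The TTL bounds how stale a value can get, and explicit invalidation on writes covers the cases where even that window is too long.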
What breaks first (usually DB + connection pool)
Across most startups, the database is the first bottleneck. Specific patterns: N+1 queries that were tolerable at 1k users but crush the database at 10k. Missing indexes on fields that became query filters. Connection pool exhaustion when request volume spikes. ORM-generated queries that look fine but fan out under load. Writes competing with reads on the same primary. Fixes in rough order: add indexes, fix N+1s with eager loading or batch queries, add read replicas for read-heavy workloads, add a caching layer (Redis) for expensive reads, and tune connection pooling (PgBouncer for Postgres). Most 10k-scale DB problems can be fixed without major architectural change; careful profiling beats guessing every time.
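The N+1 pattern and its batched fix can be shown with a small sketch. It assumes a hypothetical `users`/`orders` schema and uses an in-memory SQLite database purely so the example runs anywhere; the same shape applies to Postgres behind an ORM.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        -- Index on the column used as a filter and join key.
        CREATE INDEX idx_orders_user_id ON orders(user_id);
    """)
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "ana"), (2, "ben"), (3, "cy")])
    conn.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                     [(1, 10.0), (1, 5.0), (2, 7.5)])
    return conn


def totals_n_plus_one(conn: sqlite3.Connection):
    """One query for the users, then one more query per user."""
    queries = 1
    users = conn.execute("SELECT id FROM users").fetchall()
    totals = {}
    for (user_id,) in users:
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        queries += 1
        totals[user_id] = row[0]
    return totals, queries


def totals_batched(conn: sqlite3.Connection):
    """Same result from a single JOIN + GROUP BY query."""
    rows = conn.execute("""
        SELECT u.id, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows), 1
```

With three users the first version issues four queries and the second issues one. The per-user version grows linearly with the number of rows, which is why it goes unnoticed at 1k users and hurts at 10k.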
Observability-first approach
Scaling without observability is flying blind. Minimum observability stack: structured logs with request ID correlation; APM (Datadog, New Relic, Honeycomb) for request traces; metrics (Prometheus + Grafana or a cloud-native equivalent) for time series; alerting on SLO violations (not just 'server down'); error tracking (Sentry); and synthetic monitoring for critical paths. The team should know about incidents before users report them. That's the observability bar. Implementation is 2–4 weeks of focused work; without it, every incident becomes a multi-hour forensic exercise. Observability is an engineering discipline that pays back every month for the life of the product.
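As one illustration of "structured logs with request ID correlation", here is a minimal sketch using only Python's standard library. The logger name `app.sketch`, the `JsonFormatter` class, and the `handle_request` function are assumptions made for this example; a real service would typically set the ID in middleware and may take it from an incoming header.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the current request's ID; "-" means "outside any request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, always including the request ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id_var.get(),
        })


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("app.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, payload: list) -> str:
    # Set the ID once at the edge; every log line below carries it.
    token = request_id_var.set(uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.info("processed %d items", len(payload))
        return request_id_var.get()
    finally:
        request_id_var.reset(token)
```

Because every line carries the same `request_id`, a single search in the log backend reconstructs one request's path, which is what shortens the forensic work during an incident.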
Case studies: Cuez and bolttech
Cuez: API at 3 seconds, serving broadcast customers who couldn't tolerate slow responses. The 10x capacity unlock (3s → 300ms) meant the same infrastructure served 10x the users. Infrastructure cost was cut ~40% as a result. bolttech: $1B+ unicorn, 15+ markets, 40+ payment providers, 99.9% uptime. Zero post-launch critical bugs on the Payment Service I led. Scale discipline (observability, incident response, architectural patterns that survive 10x growth) isn't a different skill per stage; it's the same discipline applied consistently as scale grows. Apply unicorn-scale discipline starting at 10k users and you skip the usual 100k-scale crisis.
Retainer pricing
Scaling work fits the Fractional CTO service. Advisory ($4,500/mo) covers the strategic roadmap plus weekly check-ins with the team leads executing the work. Fractional ($8,500/mo) covers deeper operational involvement: hands-on pairing and major architectural changes led by me with team support. Typical engagement: 3–6 months through the initial roadmap and first waypoint push, then tapering as the team internalizes the practices. 14-day money-back guarantee; cancel anytime after that. Infrastructure costs (Datadog, New Relic, or whichever APM you choose) are your spend, not mine. Most successful scaling engagements see observability + DB fixes deliver 10x capacity without adding significantly to cloud spend.
When to hire a dedicated SRE or platform team
Around 100k users with growing complexity, a dedicated SRE or platform team becomes the right call. Their job: reliability, observability, developer experience, infrastructure as code, cost optimization. My job during a scaling engagement: help the company figure out when that hire is warranted and support the search. Premature SRE hires (at 10k users with a single application) waste hiring budget. Late SRE hires (at 500k users and still without an SRE) waste operational budget on repeated incidents. The right timing varies by product, team structure, and growth rate; I help with the call.
Recent proof
A comparable engagement, delivered and documented.
Rescued a slow API that was blocking user growth
Cuez is a live broadcast production tool used by TV teams on air across Europe. I inherited a backend API averaging 3 seconds per response and cut it to 300ms, while reducing infrastructure costs by 40% and leaving the system stable under real production load.
Frequently asked questions
The questions prospects ask before they book.
- How long until scale issues get fixed?
- First 2–3 weeks: observability and quick wins (indexes, N+1 queries, obvious bottlenecks) — typically delivers 2–3x capacity headroom. Weeks 4–8: structural work (caching, replicas, queue) — delivers 5–10x capacity. Full scaling engagement to next waypoint: 3–4 months typically.
- Does this require rewriting the app?
- Usually no. Scaling work is mostly infrastructure, database, and caching layer, with existing application code largely unchanged. When code changes are needed (fixing N+1 queries, adding read-through or cache-aside patterns), they're targeted. Full rewrites are the exception, scoped separately if evidence warrants.
- What about database sharding?
- Rarely needed before 1M users or very-high-write workloads. Most 10k–100k-user startups scale fine with read replicas + caching + connection pooling, no sharding. Sharding adds operational complexity that most teams underestimate. I recommend only when evidence clearly supports it, usually at 500k+ scale.
- Can you set up observability if we have none?
- Yes. When starting from zero, that's usually the first 2 weeks of the engagement. Datadog, New Relic, or Honeycomb for APM. Sentry for error tracking. Prometheus + Grafana or a cloud-native equivalent for metrics. The choice depends on budget and team familiarity. Setup + dashboards + alerting in 2–3 weeks.
- What about multi-region or active-active?
- Multi-region usually adds cost and complexity without proportional scaling benefit for most startups. It becomes right for specific reasons: data residency, latency (global user base), disaster recovery. Rarely needed at 10k–100k users. I recommend against multi-region prematurely; single-region with strong availability usually wins at this scale.
Ready to start?
Tell me what you need in 60 seconds. Tailored proposal in your inbox within 6 hours.