Database · Scaling · PostgreSQL

When Your Database Chokes: Lessons from Black Friday

Jerry Odhiambo

Backend Engineer

Nov 28, 2025 · 6 min read

The Black Friday Meltdown

It was 9:00 AM on Black Friday. We had prepared for this for months. We had scaled our Kubernetes cluster to 50 nodes. We had upgraded our load balancers.

But at 9:05 AM, our monitoring dashboard lit up like a Christmas tree.

- CPU usage: 100% on the Primary DB.
- Active connections: maxed out (5,000/5,000).
- Latency: spiked from 50ms to 5,000ms.

We were down. The app was returning 504 Gateway Timeouts. We were losing $1,000 every minute.

The Culprit: The Silent N+1 Killer

We panicked. Changing instance sizes didn't help. Adding read replicas didn't help.

We looked at the slow query log. It wasn't a complex join. It wasn't a DDoS attack.

It was a simple `SELECT *` inside a loop in our order processing code.

```javascript
// The code looked innocent:
const users = await db.getAllUsers();

users.forEach(async (user) => {
  const orders = await db.getOrdersForUser(user.id); // <--- THE KILLER
  // ... process orders
});
```

In local development with 10 test users, this runs instantly. You make 1 query for users and 10 queries for orders: 11 queries total.

In production with 100,000 active users, this triggered 100,001 SQL queries immediately. The database was being DDoSed by our own code.

The Fix (War Room Tactics)

We had to fix it live while customers were screaming on Twitter.

Step 1: Stop the Bleeding (Read Replicas)

We immediately diverted all *read* traffic (dashboards, analytics) to a read replica. This unblocked the primary writer instance so it could keep processing transactions (writes). *Impact: CPU dropped to 80%. Site still slow, but not dead.*
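Read/write splitting can be as simple as keeping two connection pools and pointing read-only queries at the replica. Here's a minimal sketch using node-postgres; the connection strings, table names, and the `getDashboardStats` query are illustrative, not our actual code.

```javascript
// Sketch: route reads to the replica, writes to the primary (node-postgres).
// Connection strings and queries below are placeholders, not real infrastructure.
const { Pool } = require('pg');

const writer = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const reader = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

// Writes (orders, payments) stay on the primary.
async function createOrder(userId, total) {
  return writer.query(
    'INSERT INTO orders (user_id, total) VALUES ($1, $2) RETURNING id',
    [userId, total]
  );
}

// Reads (dashboards, analytics) go to the replica.
async function getDashboardStats() {
  const { rows } = await reader.query(
    `SELECT COUNT(*) AS orders_today, COALESCE(SUM(total), 0) AS revenue
       FROM orders
      WHERE created_at::date = CURRENT_DATE`
  );
  return rows[0];
}
```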

Step 2: The Redis Shield

We realized 90% of the load was people refreshing the "Live Sales Dashboard". We implemented a Redis cache layer for that query with a Time-To-Live (TTL) of 60 seconds. Instead of hitting Postgres 10,000 times a second, we hit Redis. Postgres only saw 1 query per minute for that dashboard. *Impact: CPU dropped to 40%. Site usable.*
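The pattern here is plain cache-aside: check Redis first, fall back to Postgres on a miss, and write the result back with a 60-second expiry. A minimal sketch using ioredis, reusing the hypothetical `getDashboardStats` query from the sketch above; the key name and client setup are illustrative.

```javascript
// Sketch: cache-aside with a 60-second TTL (ioredis). Key name is illustrative.
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

async function getDashboardStatsCached() {
  const cached = await redis.get('dashboard:live-sales');
  if (cached) return JSON.parse(cached); // cache hit: no Postgres query at all

  const stats = await getDashboardStats(); // cache miss: one Postgres query
  await redis.set('dashboard:live-sales', JSON.stringify(stats), 'EX', 60);
  return stats;
}
```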

Step 3: Rewrite the Code (Batch Fetching)

That night, we rewrote the N+1 loop to use a single `IN` query.

```sql
SELECT * FROM orders
WHERE user_id IN (1, 2, 3, ... 1000);
```

We fetched data in batches of 1000. *Impact: Total queries dropped from 100,000 to 100.*
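In application code, the same idea looks roughly like this: fetch the users once, chunk their IDs, and run one query per chunk. This is a sketch, not our production code; it uses node-postgres's `ANY($1)` with an array parameter, which is equivalent to the `IN` list above, and the `chunk` helper and table names are hypothetical.

```javascript
// Sketch: replace the N+1 loop with batched fetches (node-postgres).
// chunk() and the table/column names mirror the article's example, not real code.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

function chunk(array, size) {
  const batches = [];
  for (let i = 0; i < array.length; i += size) {
    batches.push(array.slice(i, i + size));
  }
  return batches;
}

async function processAllOrders() {
  const { rows: users } = await pool.query('SELECT id FROM users'); // 1 query
  const userIds = users.map((u) => u.id);

  for (const batch of chunk(userIds, 1000)) {
    // One query per 1,000 users instead of one per user.
    const { rows: orders } = await pool.query(
      'SELECT * FROM orders WHERE user_id = ANY($1)',
      [batch]
    );
    // ... process orders
  }
}
```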

Lesson Learned: ORMs are Liars

Your ORM (Hibernate, Prisma, TypeORM) is lying to you. It creates an abstraction layer that hides the cost of data access. It makes `user.orders` look like a free property access, when it is actually a network call.

To scale, you must unlearn the ORM.

- Don't just trust the abstraction.
- Run `EXPLAIN ANALYZE` on every query.
- Understand the physical execution plan.
- Monitor query counts per request.
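Running `EXPLAIN ANALYZE` doesn't require leaving your application stack; you can pass it through the same client and read the plan back as rows. A quick sketch, again using node-postgres and the article's `orders` table; the specific query and user ID are just examples.

```javascript
// Sketch: inspect the physical plan for the per-user orders query (node-postgres).
// Postgres returns the plan as rows with a single "QUERY PLAN" column.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function explainOrdersQuery() {
  const { rows } = await pool.query(
    'EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42' // 42 is illustrative
  );
  rows.forEach((row) => console.log(row['QUERY PLAN']));
}
```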

If you don't understand SQL, you are not a Senior Backend Engineer. You are just a framework user.
