Lesson 6

Load Balancing & Basic Scaling

Understand how load balancers distribute traffic across multiple servers to handle more users and improve reliability.

20 min read · Beginner

When one server is not enough

Your application is growing. Response times are creeping up. The server CPU is pinned at 100%. You have two choices: buy a bigger machine (vertical scaling) or add more machines (horizontal scaling). For most web applications, horizontal scaling is the better long-term strategy — but it introduces a new question: how do you distribute incoming traffic across multiple servers?

That is the job of a load balancer. It sits in front of your servers, receives all incoming requests, and forwards each one to an available server.

Vertical vs horizontal scaling

Vertical scaling has a ceiling (biggest machine available). Horizontal scaling adds capacity by adding machines.

Approach	How	Limit	Cost curve
Vertical	Upgrade CPU, RAM, disk on one machine	Hardware maximum	Expensive at top end
Horizontal	Add more identical machines	No hard ceiling, but needs a load balancer	Roughly linear, more predictable

Vertical scaling is simpler because it usually needs no code changes, but it hits a ceiling. Horizontal scaling is how Netflix, Google, and Amazon handle billions of requests, but it requires a load balancer and servers that are stateless (or that share session storage).

Multi-layer load balancing

Load balancing happens at multiple layers in a production system:

Traffic passes through multiple balancing layers before reaching your application code.

Layer	What it balances	Example
DNS	Traffic across regions/data centers	Route 53, Cloudflare
Application	HTTP requests across app servers	NGINX, AWS ALB, HAProxy
Database	Read queries across replicas	PostgreSQL read replicas behind a proxy such as Pgpool-II

How load balancers work

A load balancer is itself a server (or a managed service) with an address that clients can reach: a public IP for internet-facing apps. Clients connect to the load balancer, not to individual application servers.

Client → Load Balancer → Server A
                       → Server B
                       → Server C

This setup gives you two major benefits:

Capacity — three servers handle roughly three times the traffic of one
Reliability — if Server B crashes, traffic flows to A and C with minimal disruption

Load balancing algorithms

Algorithm	How it works	Best when
Round-robin	Each server gets the next request in turn	Servers are identical, requests are similar
Least connections	Sends to server with fewest active connections	Requests have varying processing times
IP hash	Same client IP goes to the same server while the server list is unchanged	Sticky sessions needed
Weighted	More traffic to more powerful servers	Mixed hardware capacities
Random	Picks a server at random	Simple, surprisingly effective

For most starting applications, round-robin is perfectly adequate.
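
To make the table concrete, here is a minimal Python sketch of three of the algorithms. The class and function names (`RoundRobin`, `LeastConnections`, `ip_hash`) and the server names are made up for illustration; real load balancers such as NGINX or HAProxy implement these internally and expose them as configuration options.

```python
import hashlib
import itertools


class RoundRobin:
    """Hand out servers in a fixed, repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the server with the fewest requests currently in flight."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Ties go to the server listed first (dicts keep insertion order).
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a request finishes so the count stays accurate.
        self.active[server] -= 1


def ip_hash(client_ip, servers):
    """Map a client IP to a server; stable only while `servers` is unchanged."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["server-a", "server-b", "server-c"]

rr = RoundRobin(servers)
rr_picks = [rr.pick() for _ in range(4)]
# rr_picks == ["server-a", "server-b", "server-c", "server-a"]

lc = LeastConnections(servers)
first, second = lc.pick(), lc.pick()  # server-a, then server-b
lc.release(first)                     # server-a finishes its request
third = lc.pick()                     # server-a again, because it is idle
```

Round-robin needs no knowledge of what the servers are doing, which is why it is a reasonable default. Least connections needs the balancer to track when each request finishes, and IP hash remaps clients whenever a server is added or removed.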

Sticky sessions

When your app stores session data in server memory (stateful), a user must always hit the same server. Sticky sessions (session affinity) route the same client to the same backend.

Approach	How	Downside
Cookie-based stickiness	LB sets a cookie mapping to a server	Uneven load if users have different activity levels
IP hash	Hash client IP to pick server	Breaks when IPs change (mobile networks)
Shared session store	Store sessions in Redis	One more service to run; in exchange, any server can serve any user

Recommendation: Use stateless APIs with tokens, or store sessions in Redis. Avoid sticky sessions when possible — they complicate deployments and create uneven load.
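
As an illustration of the stateless-token option, the sketch below signs a small payload with HMAC so that any server holding the same secret can verify it without a session lookup. The function names, the payload fields (`sub`, `exp`), and the hard-coded secret are assumptions for the example; production systems usually rely on an established token format and library rather than hand-rolled code like this.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret-change-me"  # hypothetical; load from configuration in practice


def issue_token(user_id, ttl_seconds=3600, now=None):
    """Return 'payload.signature', both parts URL-safe base64."""
    issued = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": issued + ttl_seconds}).encode("utf-8")
    signature = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return ".".join(
        base64.urlsafe_b64encode(part).decode("ascii") for part in (payload, signature)
    )


def verify_token(token, now=None):
    """Return the user id, or None if the token is malformed, forged, or expired."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims["sub"] if claims["exp"] > current else None
```

Because verification depends only on the token and the shared secret, the load balancer is free to send each request to any server.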

Auto-scaling intro

Auto-scaling automatically adds or removes servers based on demand:

CPU > 70% for 5 min  →  add 2 servers
CPU < 30% for 10 min →  remove 1 server

Metric	Add servers when	Remove servers when
CPU utilization	Sustained above 70%	Sustained below 30%
Request queue depth	Queue keeps growing	Queue stays empty
Response time	p95 latency above target	p95 latency back within target

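Managed tooling such as AWS Auto Scaling groups (which add and remove instances) and the Kubernetes Horizontal Pod Autoscaler (which adds and removes pod replicas) can apply rules like these for you. Start with manual scaling until you understand your traffic patterns, then automate.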
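
The CPU threshold rules above can be sketched as a small decision function. This is an illustrative model only: `ScalingPolicy`, `desired_servers`, and the minimum and maximum server counts are assumptions made for the example, not the API of any autoscaler.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    high_cpu: float = 70.0   # percent; add servers when CPU stays above this
    low_cpu: float = 30.0    # percent; remove servers when CPU stays below this
    high_minutes: int = 5    # how long CPU must stay high
    low_minutes: int = 10    # how long CPU must stay low
    add_step: int = 2
    remove_step: int = 1
    min_servers: int = 1     # assumption: never scale to zero
    max_servers: int = 10    # assumption: upper bound as a cost guard


def desired_servers(cpu_by_minute, current, policy=None):
    """Return the target server count.

    `cpu_by_minute` holds one average CPU reading per minute, oldest first.
    """
    policy = policy or ScalingPolicy()
    recent_high = cpu_by_minute[-policy.high_minutes:]
    recent_low = cpu_by_minute[-policy.low_minutes:]

    # Require a full window of readings before acting in either direction.
    if len(recent_high) == policy.high_minutes and all(
        cpu > policy.high_cpu for cpu in recent_high
    ):
        return min(current + policy.add_step, policy.max_servers)
    if len(recent_low) == policy.low_minutes and all(
        cpu < policy.low_cpu for cpu in recent_low
    ):
        return max(current - policy.remove_step, policy.min_servers)
    return current
```

For example, `desired_servers([80] * 5, current=3)` returns 5, while a single dip below the threshold inside the window leaves the count unchanged. Scaling out faster than scaling in, as the rules above do, is a common way to avoid removing capacity during a brief lull.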

Health checks

Load balancers continuously check if backend servers are healthy — usually by sending a periodic request to a /health endpoint. Unhealthy servers are removed from the rotation until they recover.
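
A rough sketch of how a balancer might track rotation membership from check results follows. The `Rotation` class and its thresholds (three consecutive failures to remove a server, two consecutive passes to restore it) are assumptions for illustration; real load balancers expose similar settings under their own names and defaults.

```python
class Rotation:
    """Track which backends receive traffic, based on consecutive check results."""

    def __init__(self, servers, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.in_rotation = {server: True for server in servers}
        # Consecutive results that disagree with the server's current state.
        self._streak = {server: 0 for server in servers}

    def record(self, server, passed):
        """Record one health check result for `server`."""
        if passed == self.in_rotation[server]:
            self._streak[server] = 0  # result matches current state
            return
        self._streak[server] += 1
        needed = self.pass_threshold if passed else self.fail_threshold
        if self._streak[server] >= needed:
            self.in_rotation[server] = passed
            self._streak[server] = 0

    def available(self):
        return [server for server, ok in self.in_rotation.items() if ok]
```

Requiring several consecutive results before changing state keeps one slow response from ejecting a healthy server, and keeps a server that is still starting up from rejoining too early.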

A good health endpoint returns:

{ "status": "ok", "database": "connected", "version": "1.2.3" }

Check dependencies (database, cache) — a server that cannot reach its database should not receive traffic.
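
One way to build such a response is sketched below. The `health_response` function and the shape of the `checks` argument are hypothetical. The sketch returns HTTP 503 when any dependency check fails, on the assumption that the load balancer treats non-2xx responses as unhealthy, which is a common default.

```python
import json

VERSION = "1.2.3"  # matches the sample response above


def health_response(checks):
    """Run each dependency check and return (http_status, json_body).

    `checks` maps a dependency name to a zero-argument callable that returns
    True when the dependency is reachable. A check that raises counts as failed.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "connected" if check() else "unreachable"
        except Exception:
            results[name] = "unreachable"

    healthy = all(state == "connected" for state in results.values())
    body = {"status": "ok" if healthy else "degraded", **results, "version": VERSION}
    return (200 if healthy else 503), json.dumps(body)
```

Calling `health_response({"database": lambda: True})` yields status 200 and the same fields as the sample response. Keep each check fast, since the balancer calls this endpoint every few seconds.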

In practice

Before adding a load balancer, confirm you actually need one. A single well-provisioned server can handle thousands of concurrent users for many apps. Add load balancing when monitoring shows sustained high CPU or degraded response times, not because architecture diagrams look better with more boxes.

Key takeaways

Load balancers distribute traffic across multiple servers for capacity and reliability
Horizontal scaling (more servers) is usually the better long-term strategy for web applications
Round-robin is a simple, effective algorithm for even distribution
Avoid sticky sessions — use shared session storage instead
Auto-scaling adds/removes servers based on metrics
Health checks enable automatic failover

Common mistakes

Scaling before you need to — premature optimization adds operational complexity
Ignoring stateful sessions — sticky sessions or shared storage required for stateful apps
Forgetting about the database — adding app servers does not help if the database is the bottleneck
No health checks — without them, traffic goes to crashed servers

Go deeper

NGINX Load Balancing Guide — practical configuration reference
AWS Well-Architected Framework — reliability and performance best practices
High Scalability — real-world architecture case studies