Arch Tutor

Lesson 6

Load Balancing & Basic Scaling

Understand how load balancers distribute traffic across multiple servers to handle more users and improve reliability.

20 min read · Beginner

When one server is not enough

Your application is growing. Response times are creeping up. The server CPU is pinned at 100%. You have two choices: buy a bigger machine (vertical scaling) or add more machines (horizontal scaling). For most web applications, horizontal scaling is the better long-term strategy — but it introduces a new question: how do you distribute incoming traffic across multiple servers?

That is the job of a load balancer. It sits in front of your servers, receives all incoming requests, and forwards each one to an available server.

Vertical vs horizontal scaling

[Figure: Vertical vs horizontal scaling. A single server evolves to either a bigger server (vertical) or more servers (horizontal: Server 2, Server 3).]

Vertical scaling has a ceiling (biggest machine available). Horizontal scaling adds capacity by adding machines.

Approach | How | Limit | Cost curve
--- | --- | --- | ---
Vertical | Upgrade CPU, RAM, disk on one machine | Hardware maximum | Expensive at top end
Horizontal | Add more identical machines | Needs load balancer | Linear, more predictable

Vertical scaling is simpler (no code changes) but hits a ceiling. Horizontal scaling is how Netflix, Google, and Amazon handle billions of requests, but it requires a load balancer and works best when servers are stateless.

Multi-layer load balancing

Load balancing happens at multiple layers in a production system:

[Figure: Multi-layer load balancing. Users → DNS LB (resolve) → App LB (route) → App Server 1 and App Server 2 (forward) → Database (query).]

Traffic passes through multiple balancing layers before reaching your application code.

Layer | What it balances | Example
--- | --- | ---
DNS | Traffic across regions/data centers | Route 53, Cloudflare
Application | HTTP requests across app servers | NGINX, AWS ALB, HAProxy
Database | Read queries across replicas | PostgreSQL read replicas

How load balancers work

A load balancer is itself a server (or managed service) with a public IP address. Clients connect to the load balancer, not to individual application servers.

Client → Load Balancer → Server A
                       → Server B
                       → Server C

This setup gives you two major benefits:

  1. Capacity — three servers handle roughly three times the traffic of one
  2. Reliability — if Server B crashes, traffic flows to A and C with minimal disruption

Load balancing algorithms

Algorithm | How it works | Best when
--- | --- | ---
Round-robin | Each server gets the next request in turn | Servers are identical, requests are similar
Least connections | Sends to server with fewest active connections | Requests have varying processing times
IP hash | Same client IP always goes to same server | Sticky sessions needed
Weighted | More traffic to more powerful servers | Mixed hardware capacities
Random | Picks a server at random | Simple, surprisingly effective

For most starting applications, round-robin is perfectly adequate.
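To make the table concrete, here is a minimal sketch of three of these algorithms in Python. The class names, the server names, and the use of CRC32 as the hash are assumptions chosen for illustration; production load balancers such as NGINX or HAProxy implement these strategies internally and expose them through configuration rather than code like this.

```python
import itertools
import zlib


class RoundRobin:
    """Hand out servers in a fixed, repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Pick the server that currently has the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # a new connection is now open
        return server

    def release(self, server):
        self.active[server] -= 1  # call when the request finishes


def ip_hash(client_ip, servers):
    """Map a client IP to a server; stable only while the server list is unchanged."""
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


servers = ["server-a", "server-b", "server-c"]  # hypothetical backend names

rr = RoundRobin(servers)
first_six = [rr.pick() for _ in range(6)]
print(first_six)
# ['server-a', 'server-b', 'server-c', 'server-a', 'server-b', 'server-c']
```

Note that the IP hash mapping shifts for many clients whenever a server is added or removed, which is one reason it is a fragile basis for session affinity.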

Sticky sessions

When your app stores session data in server memory (stateful), a user must always hit the same server. Sticky sessions (session affinity) route the same client to the same backend.

Approach | How | Trade-off
--- | --- | ---
Cookie-based stickiness | LB sets a cookie mapping to a server | Uneven load if users have different activity levels
IP hash | Hash client IP to pick server | Breaks when IPs change (mobile networks)
Shared session store | Store sessions in Redis | Best approach: any server can serve any user

Recommendation: Use stateless APIs with tokens, or store sessions in Redis. Avoid sticky sessions when possible — they complicate deployments and create uneven load.
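The shared session store idea can be sketched in a few lines. In this illustration a plain dictionary stands in for Redis so the example runs without any external service; the `AppServer` class, its methods, and the user name are hypothetical names, not part of any real framework.

```python
import secrets

# Stand-in for an external store such as Redis (an assumption to keep the
# sketch self-contained). In a real deployment this lives outside the servers.
session_store = {}


class AppServer:
    """An app server that keeps no session data in its own memory."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared by every server

    def login(self, username):
        session_id = secrets.token_hex(16)
        self.store[session_id] = {"user": username}
        return session_id

    def whoami(self, session_id):
        session = self.store.get(session_id)
        return session["user"] if session else None


server_a = AppServer("server-a", session_store)
server_b = AppServer("server-b", session_store)

# The user logs in on one server and the next request lands on another.
session_id = server_a.login("dana")
print(server_b.whoami(session_id))  # dana
```

Because neither server holds the session itself, the load balancer is free to send each request anywhere, and a server can be restarted without logging anyone out.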

Auto-scaling intro

Auto-scaling automatically adds or removes servers based on demand:

CPU > 70% for 5 min  →  add 2 servers
CPU < 30% for 10 min →  remove 1 server

Metric | Scale up when | Scale down when
--- | --- | ---
CPU utilization | Sustained above 70% | Sustained below 30%
Request count | Queue depth growing | Queue empty
Response time | p95 latency above target | p95 latency normal

Tools such as AWS Auto Scaling Groups and the Kubernetes Horizontal Pod Autoscaler (which scales pods rather than machines) apply rules like these automatically once configured. Start with manual scaling until you understand your traffic patterns, then automate.
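The CPU rule above can be written as a small decision function. This is only a sketch of the logic, assuming one CPU reading per minute and a hypothetical `min_servers` floor; it is not how AWS or Kubernetes implement their autoscalers.

```python
def scaling_decision(cpu_samples, current_servers, min_servers=1):
    """Return how many servers to add (positive) or remove (negative).

    cpu_samples: CPU utilization percentages, one per minute, oldest first.
    """
    last_5 = cpu_samples[-5:]
    last_10 = cpu_samples[-10:]

    # CPU > 70% for 5 min -> add 2 servers
    if len(last_5) == 5 and all(cpu > 70 for cpu in last_5):
        return 2

    # CPU < 30% for 10 min -> remove 1 server, but never go below the floor
    if len(last_10) == 10 and all(cpu < 30 for cpu in last_10):
        if current_servers > min_servers:
            return -1

    return 0


print(scaling_decision([85, 90, 78, 92, 88], current_servers=3))  # 2
print(scaling_decision([20] * 10, current_servers=3))             # -1
print(scaling_decision([85, 90, 60, 92, 88], current_servers=3))  # 0
```

Requiring a longer quiet period before removing servers than before adding them makes the system quick to grow and slow to shrink, which helps avoid flapping when traffic hovers near a threshold.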

Health checks

Load balancers continuously check if backend servers are healthy — usually by sending a periodic request to a /health endpoint. Unhealthy servers are removed from the rotation until they recover.

A good health endpoint returns:

{ "status": "ok", "database": "connected", "version": "1.2.3" }

Check dependencies (database, cache) — a server that cannot reach its database should not receive traffic.
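As a rough sketch, the two halves of a health check might look like the following. The function names, the `probe` callable, and the choice of HTTP 503 for an unhealthy server are assumptions made for illustration; a real load balancer performs the probing itself over HTTP.

```python
import json


def health_response(database_ok, version="1.2.3"):
    """Build (status_code, body) for a /health endpoint.

    Returns 503 when the database is unreachable so the load balancer
    stops sending traffic to this server.
    """
    body = {
        "status": "ok" if database_ok else "unhealthy",
        "database": "connected" if database_ok else "unreachable",
        "version": version,
    }
    return (200 if database_ok else 503), json.dumps(body)


def healthy_servers(servers, probe):
    """Keep only the servers whose health probe returns HTTP 200."""
    return [server for server in servers if probe(server) == 200]


# Hypothetical probe results standing in for real HTTP requests to /health.
probe_results = {"server-a": 200, "server-b": 503, "server-c": 200}
in_rotation = healthy_servers(list(probe_results), probe_results.get)
print(in_rotation)  # ['server-a', 'server-c']
```

Running the filter on every check interval means a server that recovers is returned to the rotation as soon as its probe passes again.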

In practice

Before adding a load balancer, confirm you actually need one. A single well-provisioned server handles thousands of concurrent users for most apps. Add load balancing when monitoring shows sustained high CPU or response time degradation — not because architecture diagrams look better with more boxes.

Key takeaways

  • Load balancers distribute traffic across multiple servers for capacity and reliability
  • Horizontal scaling (more servers) is usually the better long-term strategy than vertical scaling for web applications
  • Round-robin is a simple, effective algorithm for even distribution
  • Avoid sticky sessions — use shared session storage instead
  • Auto-scaling adds/removes servers based on metrics
  • Health checks enable automatic failover

Common mistakes

  • Scaling before you need to — premature optimization adds operational complexity
  • Ignoring stateful sessions — sticky sessions or shared storage required for stateful apps
  • Forgetting about the database — adding app servers does not help if the database is the bottleneck
  • No health checks — without them, traffic goes to crashed servers
