Arch Tutor

Lesson 4

Databases & Where Data Lives

Explore how databases store persistent data, the difference between SQL and NoSQL, and how to choose the right one.

28 min read · Beginner

Data outlives your server

Here is a fact that surprises many new developers: when your server restarts, everything in its memory disappears. Variables, in-memory caches, session data — gone. If you want information to survive restarts, deployments, and crashes, you need persistent storage. That is what databases are for.

Databases are specialized systems designed to store, retrieve, and manage data reliably. They handle the hard problems — concurrent access, crash recovery, efficient lookups — so your application code can focus on business logic.

SQL vs NoSQL comparison

FeatureSQL (PostgreSQL, MySQL)NoSQL (MongoDB, DynamoDB)
Data modelTables with rows and columnsDocuments, key-value, graphs
SchemaFixed schema (with migrations)Flexible, schema-on-read
RelationshipsForeign keys, JOINsEmbedded docs or manual references
QueriesPowerful SQL with JOINsSimpler query languages
ScalingVertical + read replicasHorizontal sharding built-in
Best forStructured, relational dataFlexible schemas, high write volume
ExamplesUser accounts, orders, inventoryActivity feeds, IoT sensor data

Neither is universally better. PostgreSQL handles most startup use cases excellently. Choose based on your data shape and query patterns, not hype.

E-commerce schema: the full picture

A typical online store needs several related tables. Here is how they connect:

E-commerce table relationships

places

contains

listed in

users

int

id

PK

string

name

string

email

orders

int

id

PK

int

user_id

FK

decimal

total

order_items

int

id

PK

int

order_id

FK

int

product_id

FK

int

quantity

products

int

id

PK

string

name

decimal

price

Foreign keys link orders to users and order_items to products. Each table owns one kind of data.

In ASCII, the same relationships look like this:

  users                orders              order_items           products
+---------+         +-----------+         +-------------+       +----------+
| id (PK) |<--------| user_id   |         | order_id    |------>| id (PK)  |
| name    |    1:N  | id (PK)   |<--------| product_id  |--1:N--| name     |
| email   |         | total     |    1:N  | quantity    |       | price    |
+---------+         | created_at|         | unit_price  |       +----------+
                    +-----------+         +-------------+
  • users — one row per customer
  • orders — one row per purchase, linked to a user via user_id
  • order_items — line items (product + quantity) for each order
  • products — catalog of things you sell

This is normalized design: each fact lives in exactly one place.

Read path: how data flows out

Database read path

GET /users/1

check cache

cache miss

rows

JSON

Client

API Server

Cache

Database

The server translates API requests into database queries and shapes the response for the client.

On a read request, the server often checks a cache first. If the data is cached (a hit), it returns immediately without touching the database. On a cache miss, it queries the database, stores the result in cache, and returns it.

Write path: how data flows in

Database write path

POST /users

INSERT

invalidate

201 Created

Client

API Server

Database

Cache

Writes go to the database first, then invalidate stale cache entries.

Writes are more careful. The server validates input, writes to the database, then invalidates any cached copies of that data so the next read gets fresh information.

Interactive: database components

Database architecture components

App Server

Your application code sends queries through a connection pool — a set of reusable database connections. Without pooling, opening a new connection per request is slow and can exhaust the database's connection limit.

ACID: the bank transfer example

ACID properties guarantee reliable transactions:

PropertyMeaningBank transfer example
AtomicityAll or nothingDebit AND credit both happen, or neither does
ConsistencyRules always holdAccount balance never goes negative
IsolationConcurrent txs don’t interfereTwo transfers don’t corrupt each other’s state
DurabilityCommitted data survives crashesAfter “transfer complete,” a power outage won’t undo it
BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 'alice';
  UPDATE accounts SET balance = balance + 100 WHERE id = 'bob';
COMMIT;

If the server crashes between the two UPDATEs, the transaction rolls back — Alice is not charged without Bob receiving the money.

Normalization: why split tables?

Normalization organizes data to reduce redundancy. The goal is simple: each fact should be stored exactly once.

Before normalization: the problems

Here is a denormalized “everything in one table” design:

Denormalized orders tablesql
-- ONE giant table (bad)
orders_denormalized:
| order_id | customer_name | customer_email      | product    | price |
|----------|---------------|---------------------|------------|-------|
| 1        | Alice Chen    | alice@example.com   | Keyboard   | 79.99 |
| 2        | Alice Chen    | alice@example.com   | Mouse      | 29.99 |
| 3        | Bob Smith     | bob@example.com     | Monitor    | 349.00 |
Customer info is duplicated on every order row. This causes three classic anomalies.

Three problems emerge:

Update anomaly — Alice changes her email to alice.chen@work.com. You must find and update every row where she appears. Miss one row and your data is inconsistent.

Insert anomaly — You want to add a new customer who has not ordered yet. There is nowhere to put them without creating a fake order row.

Delete anomaly — You delete Bob’s only order. You accidentally lose his name and email too, even though he is still a customer.

After normalization: splitting tables

Split the data so each entity has its own table, linked by foreign keys:

Normalized schemasql
CREATE TABLE customers (
id    SERIAL PRIMARY KEY,
name  TEXT NOT NULL,
email TEXT UNIQUE NOT NULL
);

CREATE TABLE orders (
id          SERIAL PRIMARY KEY,
customer_id INT NOT NULL REFERENCES customers(id),
total       DECIMAL(10,2) NOT NULL,
created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE products (
id    SERIAL PRIMARY KEY,
name  TEXT NOT NULL,
price DECIMAL(10,2) NOT NULL
);

CREATE TABLE order_items (
id         SERIAL PRIMARY KEY,
order_id   INT NOT NULL REFERENCES orders(id),
product_id INT NOT NULL REFERENCES products(id),
quantity   INT NOT NULL DEFAULT 1,
unit_price DECIMAL(10,2) NOT NULL
);
Foreign keys (REFERENCES) enforce that every order belongs to a real user.

Now Alice’s email lives in one row. Change it once, every future query sees the update.

Querying normalized data with JOINs

Normalized data lives in separate tables — you combine it at query time with JOINs:

JOIN query for order detailssql
SELECT
o.id AS order_id,
c.name AS customer_name,
p.name AS product_name,
oi.quantity,
oi.unit_price
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.id = 42;
JOINs are the payoff of normalization: flexible queries without duplicated storage.

This single query assembles a complete order view from four tables. Yes, JOINs add query complexity — but they eliminate the update/insert/delete anomalies that plague denormalized designs.

Normalization journey

Normalization journey

atomic columns

no partial deps

no transitive deps

Raw Data

1NF

2NF

3NF

Each normal form removes a specific kind of redundancy. Most apps aim for 3NF.

Normal formRuleExample fix
1NFNo repeating groupsOne product per order_items row, not a comma-separated list
2NFNo partial dependenciesorder_date depends on order_id, not on product_id
3NFNo transitive dependenciesStore city on addresses, not derived from zip on customers

When denormalization makes sense

Normalization is the default, but sometimes you intentionally duplicate data for performance:

  • Read-heavy analytics — a dashboard that counts daily orders should not JOIN four tables on every page load
  • Caching aggregated counts — store total_orders on the customer row, updated by a background job
  • Materialized views — pre-computed query results refreshed periodically
Denormalized summary tablesql
CREATE TABLE customer_stats (
customer_id   INT PRIMARY KEY REFERENCES customers(id),
total_orders  INT NOT NULL DEFAULT 0,
total_spent   DECIMAL(12,2) NOT NULL DEFAULT 0,
last_order_at TIMESTAMPTZ
);

-- Refreshed by a nightly job or trigger after each order
Trade storage for read speed. Update this table when orders change, not on every dashboard view.
NormalizedDenormalized
WritesSimple, no duplicatesMust update multiple places
ReadsJOINs requiredFast single-table lookups
ConsistencyEasy to maintainRisk of stale copies
Best forOLTP (transactions)Analytics, dashboards

For most apps, aim for 3NF in your core schema, then denormalize specific hot paths only after you measure a performance problem.

Core database concepts

Indexes speed up lookups. Without an index, finding a user by email requires scanning every row. With an index, the database jumps directly to the right row.

Replication copies data to multiple machines for reliability and read performance.

Sharding splits data across machines when a single server cannot hold it all.

In practice

Start with PostgreSQL for most web apps. It handles relational data, JSON documents, full-text search, and scales further than most startups need. Add read replicas when query latency becomes a problem, not before.

Test your understanding

Database Basics Quiz

Question 1 of 6

Why should clients never connect to the database directly?

Key takeaways

  • Databases provide persistent storage that survives server restarts and crashes
  • Only servers should talk to databases — never expose database access to clients
  • SQL for structured relationships; NoSQL for flexible schemas
  • ACID transactions protect data integrity for critical operations
  • Normalization reduces redundancy — aim for 3NF unless you have a reason not to
  • Indexes, transactions, and replication are essential tools for performance and reliability

Common mistakes

  • Picking a database because it is trendy — PostgreSQL handles most use cases well
  • Skipping indexes — queries fine with 100 rows become painfully slow with 10 million
  • Storing everything in one giant table — related data belongs in related tables
  • Denormalizing too early — optimize for correctness first, performance when measured

Go deeper