PostgreSQL Inside

The longer we use PostgreSQL, the more we run into a few concepts at least once: MVCC, indexes, EXPLAIN, isolation levels, VACUUM, partitioning.

1. MVCC

Multi-Version Concurrency Control is a concurrency control technique that keeps multiple versions of the same row at once so that reads and writes do not block each other. PostgreSQL is built on this model.

A transaction sees data through a snapshot taken at its start.
An update creates a new row version and marks the old version with a transaction ID.
Old versions that no transaction looks at anymore become candidates for VACUUM.

The advantage is that it offers a consistent snapshot without read locks. The limit is bloat from accumulated dead tuples.

2. Index types

PostgreSQL ships with several index access methods.

Type	Where it fits
B-tree	Equality and range comparisons (default). Almost any equality or inequality search.
Hash	Equality only. WAL logging stabilized from 9.6+.
GIN	Multi-value columns (`tsvector`, `jsonb`, arrays). Full-text and key existence checks.
GiST	Ranges and spatial (PostGIS), part of similarity search.
SP-GiST	Space partitioning (Quad-tree, KD-tree style).
BRIN	Naturally sorted columns of very large tables (timestamps, etc.). Small index size.
BLOOM	Multi-column equality combination check (extension).

Index choice depends on data distribution and query shape. The wrong index drives up disk and write cost before it helps performance.

3. EXPLAIN and EXPLAIN ANALYZE

EXPLAIN shows the execution plan the planner produced, while EXPLAIN ANALYZE measures actual execution time and row counts.

EXPLAIN ANALYZE
SELECT id, email FROM users WHERE email = 'a@b.com';

Common nodes we see when reading.

Seq Scan: full table scan. Normal on small tables or low selectivity.
Index Scan / Index Only Scan: access through an index. Index Only is meaningful only when the visibility map is clean.
Bitmap Heap Scan: combines several index results as a bitmap and visits the heap once.
Nested Loop / Hash Join / Merge Join: join algorithms. The planner picks based on data size and distribution.
actual time vs cost: cost is estimated, actual time is measured. A large gap suggests stale statistics.

Turning on the BUFFERS option shows cache hits and disk reads.

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

4. Transaction isolation levels

The SQL standard defines four levels. Here is how PostgreSQL behaves.

Level	PostgreSQL's actual behavior
Read Uncommitted	Treated identically to Read Committed (no dirty reads occur).
Read Committed	Default. A new snapshot per statement. Non-repeatable reads and phantoms possible.
Repeatable Read	Snapshot fixed at transaction start. SSI reinforcement blocks some phantoms.
Serializable	SSI (Serializable Snapshot Isolation, introduced in 9.1). Guarantees a serializable result. On conflict, one transaction fails.

Raising the level boosts safety but increases conflict and retry cost. Most OLTP runs fine on Read Committed. Places like payments or inventory are worth reviewing for Serializable or explicit locks.

5. VACUUM and autovacuum

VACUUM reclaims dead tuples and refreshes the visibility map and statistics. The autovacuum daemon runs it automatically based on thresholds.

VACUUM (default): returns space to the free space map. Disk usage does not shrink.
VACUUM FULL: rewrites the table and shrinks disk too. It holds an ACCESS EXCLUSIVE lock during the operation, so it is risky in production.
ANALYZE: collects statistics. Affects planner accuracy.

Bloat happens when dead tuples accumulate faster than reclamation. Results are increased disk usage, index bloat, and lower cache hit rates. It is common on tables with heavy update or delete patterns. The pgstattuple extension is often mentioned as a diagnostic tool.

6. Partitioning

Declarative partitioning landed in PostgreSQL 10. A single table is split into child tables by a key.

CREATE TABLE events (
  id BIGSERIAL,
  occurred_at TIMESTAMPTZ NOT NULL,
  payload JSONB
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_04 PARTITION OF events
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

The advantage is that index and VACUUM cost on a large table can be split per partition. The limit is that each partition keeps its own statistics, and there is (so far) no global index, so PK and unique constraints must include the partition key.

7. Distributed SQL and time-series specialists

When sharding automation and distributed strong consistency take priority, candidates like Citus (PostgreSQL extension), CockroachDB, and YugabyteDB come up. We trade single-node PostgreSQL's simplicity for global scale.

timescaledb adds time-series hypertables, compression, and continuous aggregate views on top of PostgreSQL. Standalone engines like InfluxDB occupy their own place, but timescaledb is often picked when the requirement is "we want to keep using existing SQL tools."

8. Operational tools

CREATE INDEX CONCURRENTLY — reduces locking when building an index on a table with production traffic (only one such build per table at a time).
pg_stat_statements — shows accumulated execution time, count, and plan per query. The doorway to tuning.
pg_stat_activity, pg_locks — inspect current sessions and locks. Track deadlocks.
Long-running transactions — long transactions hold the vacuum horizon and accelerate bloat. Periodic checks are needed.

9. Common pitfalls

JSONB index choice — pick GIN(jsonb_path_ops) or expression indexes to match the query shape. GIN is not always the right answer.

SELECT * plus large rows — TOASTed columns get pulled along and create surprising costs.

Bad statistics — if ANALYZE has not run recently or samples are small, the planner picks a poor plan. default_statistics_target can also be tuned per column.

SERIALIZABLE's retry obligation — when a conflict happens, the client must rerun the transaction. Libraries and code must assume this.

Index bloat — frequent updates to the same row break HOT updates and inflate index entries. Periodic REINDEX CONCURRENTLY is worth considering.

Closing thoughts

The internals of PostgreSQL do not need to be known all at once. Reading EXPLAIN is a start, and isolation levels and VACUUM only become meaningful after an incident. This article sits in front of those incidents before they hit production.

three-layer-cache
redis-roles

References: Postgres MVCC docs, Postgres index types, Using EXPLAIN, Transaction isolation, Routine Vacuuming.

PostgreSQL Inside

The longer we use PostgreSQL, the more we run into a few concepts at least once: MVCC, indexes, EXPLAIN, isolation levels, VACUUM, partitioning.

1. MVCC

A transaction sees data through a snapshot taken at its start.
An update creates a new row version and marks the old version with a transaction ID.
Old versions that no transaction looks at anymore become candidates for VACUUM.

The advantage is that it offers a consistent snapshot without read locks. The limit is bloat from accumulated dead tuples.

2. Index types

PostgreSQL ships with several index access methods.

Type	Where it fits
B-tree	Equality and range comparisons (default). Almost any equality or inequality search.
Hash	Equality only. WAL logging stabilized from 9.6+.
GIN	Multi-value columns (`tsvector`, `jsonb`, arrays). Full-text and key existence checks.
GiST	Ranges and spatial (PostGIS), part of similarity search.
SP-GiST	Space partitioning (Quad-tree, KD-tree style).
BRIN	Naturally sorted columns of very large tables (timestamps, etc.). Small index size.
BLOOM	Multi-column equality combination check (extension).

Index choice depends on data distribution and query shape. The wrong index drives up disk and write cost before it helps performance.

3. EXPLAIN and EXPLAIN ANALYZE

EXPLAIN shows the execution plan the planner produced, while EXPLAIN ANALYZE measures actual execution time and row counts.

EXPLAIN ANALYZE
SELECT id, email FROM users WHERE email = 'a@b.com';

Common nodes we see when reading.

Seq Scan: full table scan. Normal on small tables or low selectivity.
Index Scan / Index Only Scan: access through an index. Index Only is meaningful only when the visibility map is clean.
Bitmap Heap Scan: combines several index results as a bitmap and visits the heap once.
Nested Loop / Hash Join / Merge Join: join algorithms. The planner picks based on data size and distribution.
actual time vs cost: cost is estimated, actual time is measured. A large gap suggests stale statistics.

Turning on the BUFFERS option shows cache hits and disk reads.

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

4. Transaction isolation levels

The SQL standard defines four levels. Here is how PostgreSQL behaves.

Level	PostgreSQL's actual behavior
Read Uncommitted	Treated identically to Read Committed (no dirty reads occur).
Read Committed	Default. A new snapshot per statement. Non-repeatable reads and phantoms possible.
Repeatable Read	Snapshot fixed at transaction start. SSI reinforcement blocks some phantoms.
Serializable	SSI (Serializable Snapshot Isolation, introduced in 9.1). Guarantees a serializable result. On conflict, one transaction fails.

Raising the level boosts safety but increases conflict and retry cost. Most OLTP runs fine on Read Committed. Places like payments or inventory are worth reviewing for Serializable or explicit locks.

5. VACUUM and autovacuum

VACUUM reclaims dead tuples and refreshes the visibility map and statistics. The autovacuum daemon runs it automatically based on thresholds.

VACUUM (default): returns space to the free space map. Disk usage does not shrink.
VACUUM FULL: rewrites the table and shrinks disk too. It holds an ACCESS EXCLUSIVE lock during the operation, so it is risky in production.
ANALYZE: collects statistics. Affects planner accuracy.

6. Partitioning

Declarative partitioning landed in PostgreSQL 10. A single table is split into child tables by a key.

CREATE TABLE events (
  id BIGSERIAL,
  occurred_at TIMESTAMPTZ NOT NULL,
  payload JSONB
) PARTITION BY RANGE (occurred_at);

CREATE TABLE events_2026_04 PARTITION OF events
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

7. Distributed SQL and time-series specialists

8. Operational tools

CREATE INDEX CONCURRENTLY — reduces locking when building an index on a table with production traffic (only one such build per table at a time).
pg_stat_statements — shows accumulated execution time, count, and plan per query. The doorway to tuning.
pg_stat_activity, pg_locks — inspect current sessions and locks. Track deadlocks.
Long-running transactions — long transactions hold the vacuum horizon and accelerate bloat. Periodic checks are needed.

9. Common pitfalls

JSONB index choice — pick GIN(jsonb_path_ops) or expression indexes to match the query shape. GIN is not always the right answer.

SELECT * plus large rows — TOASTed columns get pulled along and create surprising costs.

Bad statistics — if ANALYZE has not run recently or samples are small, the planner picks a poor plan. default_statistics_target can also be tuned per column.

SERIALIZABLE's retry obligation — when a conflict happens, the client must rerun the transaction. Libraries and code must assume this.

Index bloat — frequent updates to the same row break HOT updates and inflate index entries. Periodic REINDEX CONCURRENTLY is worth considering.

Closing thoughts

three-layer-cache
redis-roles

References: Postgres MVCC docs, Postgres index types, Using EXPLAIN, Transaction isolation, Routine Vacuuming.

PostgreSQL Inside

PostgreSQL Inside

1. MVCC

2. Index types

3. EXPLAIN and EXPLAIN ANALYZE

4. Transaction isolation levels

5. VACUUM and autovacuum

6. Partitioning

7. Distributed SQL and time-series specialists

8. Operational tools

9. Common pitfalls

Closing thoughts

Next

More in data

PostgreSQL Inside

PostgreSQL Inside

1. MVCC

2. Index types

3. EXPLAIN and EXPLAIN ANALYZE

4. Transaction isolation levels

5. VACUUM and autovacuum

6. Partitioning

7. Distributed SQL and time-series specialists

8. Operational tools

9. Common pitfalls

Closing thoughts

Next

More in data