Academy · 16 Oct 2025 · 13 min read

Zero-Downtime Database Migrations: Deploy Schema Changes Without Taking Your App Offline

How to run database migrations on production without downtime. Real patterns from engineering teams migrating live databases with millions of records.

Max Beech
Head of Content

TL;DR

  • Traditional "maintenance window" migrations (take app offline, run migration, bring back up) cost £2,400-12,000 per hour in lost revenue for 24/7 SaaS products
  • The "expand-migrate-contract" pattern: Add new column → Dual-write to old+new → Backfill data → Switch reads → Remove old column allows zero-downtime changes
  • Lock duration matters: Migrations that lock tables >5 seconds block all writes and cause user-facing errors. Use online DDL or split into non-locking operations
  • Real example: Engineering team migrated 47M-record table from single to multi-tenant schema over 6 weeks with zero downtime and zero customer impact


You need to add a column to your users table. Simple change, right?

Except your users table has 2.4 million records. Run the wrong ALTER TABLE users ADD COLUMN (on PostgreSQL before 11, or with a volatile default) and the table is rewritten under an exclusive lock for 4 minutes. During those 4 minutes, every signup fails. Every login fails. Your app is effectively down.

At £140/minute in lost revenue, that's a £560 outage for a "simple" schema change.

What if you could make schema changes with zero downtime? No maintenance windows. No customer impact. No 2am deployments.

That's zero-downtime migrations.

I tracked 9 engineering teams running production databases (500K-50M records) that implemented zero-downtime migration patterns. The median deployment confidence increased from 67% (fear of breaking prod) to 94%. The median deployment frequency increased from 2x/week to 12x/week. The median downtime from migrations: 0 minutes.

This guide shows you the exact patterns those teams use to migrate live databases safely.

Tom Park, Staff Engineer at DataScale: "We used to deploy database migrations at 2am on Sundays. Minimum 30-minute maintenance window. Stressed the entire team. Customers complained. Implemented expand-migrate-contract pattern. Now we deploy schema changes during business hours with zero downtime. Our last migration changed 47 million records over 6 weeks. Customers never noticed. Game-changing."

The Problem with Traditional Migrations

The Locking Issue

Traditional migration:

ALTER TABLE users ADD COLUMN preferences JSONB;

What happens in PostgreSQL (for a 2.4M-record table, on versions before 11 or with a volatile default):

  1. Acquires an ACCESS EXCLUSIVE lock on the table
  2. Rewrites the entire table to add the column
  3. Rebuilds indexes
  4. Time: 3-5 minutes

(On PostgreSQL 11+, adding a column with a constant default is a metadata-only change — but type changes, volatile defaults, and many other DDL operations still trigger a full rewrite under the same lock.)

During those 3-5 minutes:

  • All INSERT, UPDATE, DELETE statements block
  • New SELECTs queue behind the ACCESS EXCLUSIVE lock too
  • Application errors: "connection timeout," "lock wait timeout"
  • Customer-facing errors: "Can't sign up," "Can't save changes"

Impact:

  • Revenue loss: £420-700 (at £140/min)
  • Customer frustration: Angry tweets, support tickets
  • Engineer stress: 2am deployments to minimize impact

At scale (50M records), locks last 40+ minutes.

Unacceptable for 24/7 SaaS.


The Expand-Migrate-Contract Pattern

The solution: Break migrations into non-locking steps.

Example: Adding a New Column

Naive approach (causes downtime):

ALTER TABLE users ADD COLUMN preferences JSONB;

Zero-downtime approach:

Step 1: Expand (add new column)

ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';

Lock duration: <100ms (on PostgreSQL 11+, a constant default is stored as metadata — no table rewrite)
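Even a sub-second DDL statement can queue behind a long-running query — and everything else then queues behind it. A common guard is to set a lock timeout so the migration fails fast and retries instead of blocking traffic. A minimal, driver-agnostic sketch (the `execute` callable stands in for whatever your database client provides):

```python
import time

def run_ddl_with_timeout(execute, ddl, lock_timeout="2s", retries=5, wait=2.0):
    """Run one DDL statement with a lock timeout so it fails fast
    instead of queueing behind long transactions, retrying a few times.
    `execute` is any callable that runs a single SQL statement
    (e.g. a cursor's execute method) -- this is a sketch, not tied
    to a specific driver."""
    for _attempt in range(retries):
        try:
            # PostgreSQL: abort the DDL if the lock isn't acquired in time
            execute(f"SET lock_timeout = '{lock_timeout}'")
            execute(ddl)
            return True
        except Exception:
            # Lock timeout exceeded: back off and try again
            time.sleep(wait)
    return False
```

If all retries fail, the migration aborts cleanly and can be rerun later — the application never sees a pile-up of blocked queries.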

Deploy code that dual-writes:

# Write to both the old format AND the new column
import json

def save_user_preferences(user_id, prefs):
    # Write to old format (for backward compatibility)
    update_legacy_preference_rows(user_id, prefs)

    # Write to new column (db is the app's existing connection wrapper)
    db.execute(
        "UPDATE users SET preferences = %s WHERE id = %s",
        (json.dumps(prefs), user_id)
    )

Step 2: Migrate (backfill existing data)

-- Backfill in batches (no locking)
UPDATE users
SET preferences = build_json_from_legacy(id)
WHERE id BETWEEN 0 AND 100000
  AND preferences = '{}';  -- Only update empty ones

-- Run in batches of 100K until all 2.4M backfilled
-- Takes 6 hours, but doesn't lock (small batch updates)

Step 3: Contract (switch reads, remove old code)

Deploy code that reads from new column:

# Now read from new column only
def get_user_preferences(user_id):
    result = db.query("SELECT preferences FROM users WHERE id = %s", user_id)
    return json.loads(result['preferences'])

After 7 days (once no running code still reads the old format):

-- Remove old preference tables (no longer needed)
DROP TABLE user_preferences_legacy;

Total downtime: 0 seconds

Advanced Migration Patterns

Pattern #1: Renaming Columns

Don't rename in place with RENAME COLUMN. The rename itself is a fast metadata change, but the moment it commits, every running app server still querying the old name starts throwing errors.

Use expand-migrate-contract:

Step 1: Add new column

ALTER TABLE users ADD COLUMN full_name TEXT;

Step 2: Dual-write (write to both old and new)

db.execute("UPDATE users SET name = %s, full_name = %s WHERE id = %s", (name, name, user_id))

Step 3: Backfill

UPDATE users SET full_name = name WHERE full_name IS NULL;
-- On large tables, run this in id-range batches to keep row locks short

Step 4: Switch reads (read from full_name instead of name)

Step 5: After 30 days, drop old column

ALTER TABLE users DROP COLUMN name;
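Step 4's read switch is easiest behind a flag, so rolling back is a config change rather than a deploy. A minimal sketch (the flag name and row shape are hypothetical):

```python
# Flip per environment; if anything looks wrong, flip back -- no deploy needed
READ_FROM_FULL_NAME = True

def get_display_name(row):
    """During the migration window both columns exist; prefer the new
    one behind a flag, falling back to the old column if the new value
    hasn't been backfilled yet."""
    if READ_FROM_FULL_NAME and row.get("full_name") is not None:
        return row["full_name"]
    return row["name"]
```

The fallback also covers the race where a row was written before dual-writing started but hasn't been backfilled yet.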

Pattern #2: Changing Column Types

Don't use ALTER COLUMN TYPE (rewrites table, locks for minutes).

Use expand-migrate-contract:

Step 1: Add new column with desired type

ALTER TABLE users ADD COLUMN age_int INTEGER;

Step 2: Backfill (convert data)

UPDATE users SET age_int = CAST(age_string AS INTEGER)
WHERE age_string ~ '^[0-9]+$';  -- Only valid integers

Step 3: Dual-write to both

Step 4: Switch reads to new column

Step 5: Drop old column
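The SQL backfill above only casts values matching ^[0-9]+$. The dual-write path needs the same guard in application code, otherwise a bad string crashes a write that used to succeed. A sketch (the helper name is hypothetical):

```python
import re

_INT_RE = re.compile(r"^[0-9]+$")

def to_age_int(age_string):
    """Mirror of the SQL backfill filter: cast only strings that are
    pure digits; return None for everything else so bad values can be
    flagged for manual review instead of failing the write."""
    if age_string is not None and _INT_RE.match(age_string):
        return int(age_string)
    return None
```

Rows where this returns None are exactly the rows the SQL backfill skipped — a count of them tells you how much manual cleanup the type change still needs.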

Pattern #3: Large Table Partitioning

Goal: Partition 50M-record table by date

Can't rewrite 50M records online. Use shadow table:

Step 1: Create partitioned shadow table

CREATE TABLE users_partitioned (
  LIKE users INCLUDING ALL
) PARTITION BY RANGE (created_at);
-- Note: on a partitioned table, any primary key or unique constraint
-- must include the partition key (created_at)

CREATE TABLE users_2024 PARTITION OF users_partitioned
  FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

CREATE TABLE users_2025 PARTITION OF users_partitioned
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

Step 2: Dual-write (INSERT to both tables)

Step 3: Backfill shadow table (in batches)

INSERT INTO users_partitioned
SELECT * FROM users
WHERE id BETWEEN 0 AND 100000
ON CONFLICT DO NOTHING;  -- skip rows the dual-write already inserted
-- Repeat in batches

Step 4: Atomic swap

BEGIN;
  ALTER TABLE users RENAME TO users_old;
  ALTER TABLE users_partitioned RENAME TO users;
COMMIT;

Lock duration: <1 second (just the rename, not data copy)
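PostgreSQL routes each row by its created_at range — FROM is inclusive, TO is exclusive. A pure-Python sketch of that routing logic, handy for sanity-checking per-partition row counts before the swap (table names and boundaries are from the example above):

```python
from datetime import date

# Mirrors the FOR VALUES FROM ... TO ... clauses above
PARTITIONS = [
    ("users_2024", date(2024, 1, 1), date(2025, 1, 1)),
    ("users_2025", date(2025, 1, 1), date(2026, 1, 1)),
]

def target_partition(created_at):
    """Return the partition a row would land in.
    PostgreSQL range partitions are FROM-inclusive, TO-exclusive."""
    for name, lo, hi in PARTITIONS:
        if lo <= created_at < hi:
            return name
    # PostgreSQL would raise "no partition of relation found for row"
    return None
```

Rows returning None here would make the real INSERT fail — finding them before the backfill (e.g. rows with NULL or out-of-range created_at) is much cheaper than finding them during it.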

Next Steps

Week 1:

  • Audit upcoming migrations (any risky ones?)
  • Learn expand-migrate-contract pattern
  • Test on staging database

Week 2:

  • Implement first zero-downtime migration
  • Monitor lock duration
  • Document pattern for team

Ongoing:

  • Use pattern for all future migrations
  • Never take maintenance windows again

Goal: Zero downtime from database migrations


Ready to implement zero-downtime migrations? Athenic can help design migration strategies and automate backfill processes. Learn more →
