Validate your rewrite with real production traffic before switching over
You just rewrote your payment processing system.
But how do you KNOW it works correctly with real production traffic?
Real production data has edge cases you never imagined.
One wrong calculation in prod = lost revenue + angry customers.
Parallel Run is a migration pattern where you run both the old and new implementations against real production traffic, compare their results, but return only one answer to users.
Also called: shadow testing, dark launching, or traffic shadowing.
Core idea: Build confidence with real data before cutting over.
```typescript
async function processOrder(order: Order) {
  // 1. Call old system (source of truth)
  const oldResult = await oldOrderSystem.process(order)

  // 2. Call new system in the background (shadow), without awaiting it
  // 3. Compare results off the response path (don't block the response)
  newOrderSystem.process(order)
    .then(newResult => compareResults(oldResult, newResult))
    .catch(err => logError('new-system-error', err))

  // 4. Return ONLY old system result
  return oldResult
}
```
New system runs, but doesn't affect users yet.
Option 1: Reverse Proxy
Route at infrastructure level (nginx, Envoy, API Gateway).
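For example, nginx's mirror module can duplicate every request to a shadow upstream and silently discard the shadow's response. A minimal sketch (upstream names and addresses are illustrative):

```nginx
upstream legacy_backend { server 10.0.0.1:8080; }
upstream new_backend    { server 10.0.0.2:8080; }

server {
  listen 80;

  location / {
    mirror /shadow;                    # copy each request to the shadow location
    proxy_pass http://legacy_backend;  # users only ever see this response
  }

  location = /shadow {
    internal;                                   # not reachable from outside
    proxy_pass http://new_backend$request_uri;  # mirrored response is discarded
  }
}
```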
Option 2: Application-Level Routing
```typescript
class OrderRouter {
  async route(request: Request) {
    const legacy = await this.legacyService.handle(request)

    // Fire-and-forget shadow request; errors never reach the caller
    this.newService.handle(request)
      .then(result => this.compare(legacy, result))
      .catch(err => this.logShadowError(err))

    return legacy
  }
}
```
Legacy = source of truth. New = shadow validation.
Not all differences matter. Define what "correct" means:
```typescript
interface ResultComparator<T> {
  compare(oldResult: T, newResult: T): Comparison
}

class PaymentComparator implements ResultComparator<Payment> {
  compare(oldResult: Payment, newResult: Payment): Comparison {
    return {
      exactMatch: oldResult.amount === newResult.amount,
      statusMatch: oldResult.status === newResult.status,
      // Timestamps might differ slightly between the two systems
      timeDelta: Math.abs(oldResult.timestamp - newResult.timestamp),
      acceptable: this.isAcceptable(oldResult, newResult)
    }
  }

  // Amount and status must match exactly; timestamp drift is tolerated
  private isAcceptable(oldResult: Payment, newResult: Payment): boolean {
    return oldResult.amount === newResult.amount
      && oldResult.status === newResult.status
  }
}
```
When results differ, you need to investigate:
```typescript
async function compareResults(oldResult: unknown, newResult: unknown) {
  const match = deepEqual(oldResult, newResult)

  if (!match) {
    // Log to monitoring system
    logger.warn('shadow-mismatch', {
      oldResult,
      newResult,
      diff: calculateDiff(oldResult, newResult)
    })

    // Increment metrics
    metrics.increment('shadow.mismatch')
  } else {
    metrics.increment('shadow.match')
  }
}
```
Monitor match rate. Target: 99.9%+ before switching.
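As a rough readiness check, you can gate the cutover on a counter-backed match rate. A minimal in-process sketch (illustrative; in practice you would compute this from the shadow.match and shadow.mismatch metrics above):

```typescript
class MatchRateTracker {
  private matches = 0
  private mismatches = 0

  record(matched: boolean) {
    if (matched) this.matches++
    else this.mismatches++
  }

  // Fraction of shadow comparisons that agreed with the old system
  rate(): number {
    const total = this.matches + this.mismatches
    return total === 0 ? 0 : this.matches / total
  }

  readyToSwitch(threshold = 0.999): boolean {
    return this.rate() >= threshold
  }
}
```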
Read operations: Safe to run in parallel
```typescript
// Safe: just comparing query results, nothing is mutated
const oldData = await oldDB.query('SELECT * FROM users')
const newData = await newDB.query('SELECT * FROM users')
compareResults(oldData, newData)
```
Write operations: Dangerous without safeguards
```typescript
// Risky! Could create duplicate charges
await oldPayment.charge(card, amount) // Production charge
await newPayment.charge(card, amount) // Uh oh... the customer pays twice
```
Solutions: run the new system in dry-run mode, shadow writes to a test database, or use idempotency keys.
Option 1: Dry Run Mode
```typescript
// New system doesn't actually mutate anything
await newPayment.charge(card, amount, { dryRun: true })
```
Option 2: Separate Test Database
```typescript
// Shadow writes go to a non-production DB
await newDB.insert(record) // Test database only
```
Option 3: Idempotent Operations
```typescript
// Both systems receive the same idempotency key
const key = generateIdempotencyKey(order)
await oldSystem.process(order, key) // Creates the record
await newSystem.process(order, key) // Deduplicated by the key: no duplicate
```
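One way to derive the key is to hash the fields that identify the logical operation, so both systems send an identical key for the same order. A hypothetical helper (the order fields are assumptions):

```typescript
import { createHash } from 'node:crypto'

// Hypothetical: a stable key derived from the order itself
function generateIdempotencyKey(order: Order): string {
  return createHash('sha256')
    .update(`${order.id}:${order.amount}:${order.currency}`)
    .digest('hex')
}
```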
Challenge: Migrate Netflix Queue from SimpleDB to Cassandra
Approach:
Result: Zero-downtime migration of critical user data
Source: Netflix Tech Blog, 2013
Challenge: Migrate critical MySQL clusters to new infrastructure
Approach:
Result: Smooth migration with minimal risk
You're running TWO systems in production:
Compute costs:
Mitigation strategies:
It's expensive, but cheaper than a broken production system.
Do:
Don't:
- Shadow system crashes take down production
- Discrepancy alert fatigue
- Performance degradation
- Comparison logic is flaky
- Running the parallel run too long
Perfect for:
Not ideal for:
| Pattern | Parallel Run | Strangler Fig | Branch by Abstraction |
|---------|--------------|---------------|------------------------|
| Scope | Validation before switch | Gradual replacement | Internal dependency swap |
| Traffic | Both systems get same traffic | Routing splits traffic | Single implementation at a time |
| Duration | Short (weeks) | Long (months to years) | Medium (weeks to months) |
| Cost | High (2x infrastructure) | Medium | Low |
| Confidence | Highest | Medium | Medium |
Combine them: Use Strangler Fig to route traffic, Branch by Abstraction for internal refactor, Parallel Run to validate before final cutover.
1. Validation with Real Data
Running both systems with production traffic reveals edge cases tests can't catch.
2. Old System Stays in Control
Shadow system runs in parallel but doesn't affect users until you're confident.
3. Short-Lived and Focused
This is expensive. Validate quickly (2-4 weeks) then cut over or abort.
4. Comparison is Critical
Define what "correct" means. Log and analyze every discrepancy.
Parallel Run trades infrastructure cost for confidence in correctness.
Week 1: Choose a critical system to rewrite, build new implementation
Week 2: Deploy shadow routing (new system runs async, returns nothing)
Week 3: Implement result comparison and monitoring
Week 4: Run parallel with full traffic, analyze discrepancies
Week 5: Fix issues until 99%+ match rate achieved
Week 6: Gradually shift traffic to the new system (10% → 50% → 100%; see the routing sketch after this list)
Week 7: Deprecate old system once new is proven
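For week 6, the percentage split should be deterministic so each user consistently lands on the same implementation. A minimal sketch, assuming a string user ID (the hash and names are illustrative):

```typescript
// Deterministic split: the same user always lands on the same side
function useNewSystem(userId: string, rolloutPercent: number): boolean {
  let hash = 0
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0  // simple 32-bit rolling hash
  }
  return hash % 100 < rolloutPercent
}

// Ramp: useNewSystem(user.id, 10), then 50, then 100
```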
Resources: martinfowler.com/bliki/DarkLaunching.html
You've now learned:
Coming next:
You now have the core patterns for safe, incremental modernization.