Parallel Run Pattern

Validate your rewrite with real production traffic before switching over

The Validation Problem

You just rewrote your payment processing system.

  • Months of development
  • Thousands of test cases passed
  • Code reviews complete

But how do you KNOW it works correctly with real production traffic?

The "How Do We Know?" Problem

Real production data has edge cases you never imagined:

  • Floating point rounding differences
  • Timezone handling quirks
  • Unicode edge cases
  • Race conditions under load
  • Null handling inconsistencies

One wrong calculation in prod = lost revenue + angry customers.

What is Parallel Run?

A migration pattern where you run both the old and new implementations against real traffic and compare their results, but serve users only the old (trusted) system's answer.

Also called:

  • Dark Launching
  • Shadow Mode
  • Parallel Change
  • Shadow Testing

Core idea: Build confidence with real data before cutting over.

How It Works: The Flow

async function processOrder(order: Order) {
  // 1. Call old system (source of truth)
  const oldResult = await oldOrderSystem.process(order)

  // 2. Call new system as a shadow (fire and forget, never await it)
  newOrderSystem.process(order)
    // 3. Compare results asynchronously (doesn't block the response)
    .then(newResult => compareResults(oldResult, newResult))
    .catch(err => logError('new-system-error', err))

  // 4. Return ONLY the old system's result
  return oldResult
}

New system runs, but doesn't affect users yet.

Architecture: Traffic Routing

Option 1: Reverse Proxy

Route at infrastructure level (nginx, Envoy, API Gateway).
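
As a rough sketch, nginx can mirror every request to the new backend while only the legacy response reaches the client. The upstream names legacy_backend and new_backend are assumptions here, not part of the original system:

location / {
    mirror /shadow;                              # copy each request to the shadow location
    proxy_pass http://legacy_backend;            # only this response is returned to users
}

location = /shadow {
    internal;
    proxy_pass http://new_backend$request_uri;   # mirrored response is discarded by nginx
}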

Option 2: Application-Level Routing

class OrderRouter {
  async route(request: Request) {
    const legacy = await this.legacyService.handle(request)

    // Fire and forget shadow request
    this.newService.handle(request)
      .then(result => this.compare(legacy, result))
      .catch(err => this.logShadowError(err))

    return legacy
  }
}

Legacy = source of truth. New = shadow validation.

Comparing Results: The Hard Part

Not all differences matter. Define what "correct" means:

interface ResultComparator {
  compare(oldResult: Result, newResult: Result): Comparison
}

class PaymentComparator implements ResultComparator {
  compare(oldResult: Payment, newResult: Payment) {
    return {
      exactMatch: oldResult.amount === newResult.amount,
      statusMatch: oldResult.status === newResult.status,
      // Timestamps might differ slightly
      timeDelta: Math.abs(oldResult.timestamp - newResult.timestamp),
      acceptable: this.isAcceptable(oldResult, newResult)
    }
  }
}
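
The isAcceptable helper is left undefined above. A minimal sketch of what it might check, with purely illustrative tolerances (exact status and amount, timestamps within 5 seconds):

class PaymentComparator implements ResultComparator {
  // ...compare() as above...

  // Illustrative tolerances: status and amount must match exactly,
  // timestamps may drift by up to 5 seconds.
  private isAcceptable(oldResult: Payment, newResult: Payment): boolean {
    return (
      oldResult.status === newResult.status &&
      oldResult.amount === newResult.amount &&
      Math.abs(oldResult.timestamp - newResult.timestamp) < 5_000
    )
  }
}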

Handling Discrepancies

When results differ, you need to investigate:

async function compareResults(oldResult, newResult) {
  const match = deepEqual(oldResult, newResult)

  if (!match) {
    // Log to monitoring system
    logger.warn('shadow-mismatch', {
      oldResult,
      newResult,
      diff: calculateDiff(oldResult, newResult)
    })

    // Increment metrics
    metrics.increment('shadow.mismatch')
  } else {
    metrics.increment('shadow.match')
  }
}

Monitor match rate. Target: 99.9%+ before switching.
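
deepEqual and calculateDiff above are assumed helpers. A minimal field-level diff might look like this sketch; real comparators usually need per-field tolerances rather than strict equality:

// Report every top-level field whose values differ between the two results.
function calculateDiff(
  oldResult: Record<string, unknown>,
  newResult: Record<string, unknown>
): Record<string, { old: unknown; new: unknown }> {
  const diff: Record<string, { old: unknown; new: unknown }> = {}
  const keys = new Set([...Object.keys(oldResult), ...Object.keys(newResult)])

  for (const key of keys) {
    if (JSON.stringify(oldResult[key]) !== JSON.stringify(newResult[key])) {
      diff[key] = { old: oldResult[key], new: newResult[key] }
    }
  }
  return diff
}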

Read vs Write Operations

Read operations: Safe to run in parallel

// Safe: Just comparing query results
const oldData = await oldDB.query('SELECT * FROM users')
const newData = await newDB.query('SELECT * FROM users')

Write operations: Dangerous without safeguards

// Risky! Could create duplicate charges
await oldPayment.charge(card, amount)  // Production
await newPayment.charge(card, amount)  // Uh oh...

Solutions: run the new system in dry-run mode, send shadow writes to a test database, or use idempotency keys.

Safe Shadow Writes

Option 1: Dry Run Mode

// New system doesn't actually mutate
await newPayment.charge(card, amount, { dryRun: true })
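
How the new system honors that flag is up to you. A minimal sketch, where buildChargePlan and paymentGateway are hypothetical names:

// All business logic runs; only the final mutation is skipped in dry-run mode.
async function charge(card: Card, amount: number, opts: { dryRun?: boolean } = {}) {
  const plan = buildChargePlan(card, amount)   // full calculation, no side effects
  if (opts.dryRun) {
    return { ...plan, status: 'simulated' }    // nothing is actually charged
  }
  return paymentGateway.execute(plan)          // real charge in normal mode
}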

Option 2: Separate Test Database

// Shadow writes to non-production DB
await newDB.insert(record) // Test database only

Option 3: Idempotent Operations

// Use same idempotency key
const key = generateIdempotencyKey(order)
await oldSystem.process(order, key)  // Creates record
await newSystem.process(order, key)  // No duplicate
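
generateIdempotencyKey is not defined above. One possible approach is to derive a deterministic key from stable order fields, so both systems send the identical key for the same order. A sketch assuming Node's crypto module and an order.id field:

import { createHash } from 'crypto'

// Both systems compute the same key for the same order, so the
// downstream processor can deduplicate the second attempt.
function generateIdempotencyKey(order: Order): string {
  return createHash('sha256').update(`order:${order.id}`).digest('hex')
}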

Real Example: Netflix Queue Migration

Challenge: Migrate Netflix Queue from SimpleDB to Cassandra

Approach:

  1. Forklift existing data to Cassandra
  2. Shadow reads: Query both databases for every user request
  3. Compare results and log discrepancies
  4. Fix data consistency issues incrementally
  5. Gradually shift read traffic once confidence high
  6. Eventually migrate writes

Result: Zero-downtime migration of critical user data

Source: Netflix Tech Blog, 2013

Real Example: GitHub MySQL Migration

Challenge: Migrate critical MySQL clusters to new infrastructure

Approach:

  • Used Percona XtraBackup for initial data copy
  • Enabled binlog-based replication for incremental updates
  • Parallel reads from both clusters
  • Compared query results in real-time
  • Monitored replication lag and data consistency
  • Cut over with confidence after weeks of validation

Result: Smooth migration with minimal risk

Performance and Infrastructure Costs

You're running TWO systems in production:

Compute costs:

  • 2x server capacity
  • Increased latency (running both systems)
  • Network overhead for comparison

Mitigation strategies:

  • Sample traffic (run the shadow on, say, 10% of requests; see the sketch below)
  • Run async (don't block primary response)
  • Auto-scale shadow infrastructure
  • Time-box the parallel run (weeks, not months)

It's expensive, but cheaper than a broken production system.
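
A minimal sketch of the sampling mitigation; SHADOW_SAMPLE_RATE is an assumed config value, not from any library:

const SHADOW_SAMPLE_RATE = 0.1  // shadow 10% of requests

function shouldShadow(): boolean {
  return Math.random() < SHADOW_SAMPLE_RATE
}

async function processOrder(order: Order) {
  const oldResult = await oldOrderSystem.process(order)

  // Only a sampled fraction of requests trigger the shadow call
  if (shouldShadow()) {
    newOrderSystem.process(order)
      .then(newResult => compareResults(oldResult, newResult))
      .catch(err => logError('new-system-error', err))
  }

  return oldResult
}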

Best Practices

Do:

  • Start with read-only operations
  • Run new system async (don't block old)
  • Log ALL discrepancies for analysis
  • Use sampling for high-traffic systems
  • Set clear success criteria (99%+ match rate)
  • Time-box the parallel run period

Don't:

  • Block production traffic on shadow system
  • Run parallel indefinitely (kills momentum)
  • Ignore small discrepancies (they compound)
  • Shadow write to production databases

Common Pitfalls

Shadow system crashes take down production

  • Isolate failures with timeouts and circuit breakers
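
A small sketch of that isolation using a plain timeout wrapper; withTimeout is an assumed helper, not from any particular library:

// Bound the shadow call so a hung new system can never delay production.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`shadow timed out after ${ms}ms`)), ms)
    )
  ])
}

// Inside processOrder, the shadow call becomes:
//   withTimeout(newOrderSystem.process(order), 500)
//     .then(newResult => compareResults(oldResult, newResult))
//     .catch(err => logError('shadow-error-or-timeout', err))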

Discrepancy alert fatigue

  • Filter noise, focus on meaningful differences

Performance degradation

  • Shadow calls must be non-blocking and async

Comparison logic is flaky

  • Test your comparison code thoroughly
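
One way to do that, sketched as a Jest-style test; the Payment values and tolerances assume the PaymentComparator and isAcceptable sketch shown earlier:

import { expect, test } from '@jest/globals'

test('small timestamp drift is acceptable, a status change is not', () => {
  const comparator = new PaymentComparator()
  const oldResult = { amount: 100, status: 'captured', timestamp: 1_000 }

  // Same payment, slightly later timestamp: should still be acceptable
  expect(comparator.compare(oldResult, { ...oldResult, timestamp: 1_250 }).acceptable).toBe(true)

  // Different status: must be flagged as a real mismatch
  expect(comparator.compare(oldResult, { ...oldResult, status: 'declined' }).acceptable).toBe(false)
})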

Running the parallel run too long

  • Set deadline: 2-4 weeks for most migrations

When to Use Parallel Run

Perfect for:

  • Rewriting critical business logic
  • Database migrations (read queries)
  • Algorithm changes (pricing, recommendations)
  • High-risk refactors where correctness is paramount
  • When you can't afford production bugs

Not ideal for:

  • Greenfield features (nothing to compare against)
  • Simple dependency swaps (use Branch by Abstraction)
  • When 2x infrastructure cost is prohibitive
  • Systems with side effects you can't isolate

Parallel Run vs Other Patterns

| Pattern | Parallel Run | Strangler Fig | Branch by Abstraction |
|---------|--------------|---------------|-----------------------|
| Scope | Validation before switch | Gradual replacement | Internal dependency swap |
| Traffic | Both systems get same traffic | Routing splits traffic | Single implementation at a time |
| Duration | Short (weeks) | Long (months to years) | Medium (weeks to months) |
| Cost | High (2x infrastructure) | Medium | Low |
| Confidence | Highest | Medium | Medium |

Combine them: Use Strangler Fig to route traffic, Branch by Abstraction for internal refactor, Parallel Run to validate before final cutover.

Key Takeaways

1. Validation with Real Data

Running both systems with production traffic reveals edge cases tests can't catch.

2. Old System Stays in Control

Shadow system runs in parallel but doesn't affect users until you're confident.

3. Short-Lived and Focused

This is expensive. Validate quickly (2-4 weeks) then cut over or abort.

4. Comparison is Critical

Define what "correct" means. Log and analyze every discrepancy.

Parallel Run trades infrastructure cost for confidence in correctness.

Your First Parallel Run

Week 1: Choose a critical system to rewrite, build new implementation

Week 2: Deploy shadow routing (new system runs async, returns nothing)

Week 3: Implement result comparison and monitoring

Week 4: Run parallel with full traffic, analyze discrepancies

Week 5: Fix issues until 99%+ match rate achieved

Week 6: Gradually shift traffic to new system (10% → 50% → 100%)

Week 7: Deprecate old system once new is proven
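
Week 6's gradual shift can reuse the same router, now letting the new system actually serve a slice of users. A sketch where CUTOVER_PERCENT is an assumed config value:

const CUTOVER_PERCENT = 10  // raise to 50, then 100, as confidence grows

async function routeOrder(order: Order) {
  if (Math.random() * 100 < CUTOVER_PERCENT) {
    return newOrderSystem.process(order)   // new system serves this user
  }
  return oldOrderSystem.process(order)     // old system remains the default
}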

Resources: martinfowler.com/bliki/DarkLaunching.html

The Complete Patterns Toolkit

You've now learned:

  1. Legacy Modernization Intro - Why and when to modernize
  2. Strangler Fig - Gradual external system replacement
  3. Branch by Abstraction - Safe internal dependency swaps
  4. Parallel Run - Validate with production traffic

Coming next:

  • Feature flags and gradual rollouts
  • Database migration strategies
  • Real-world case studies and war stories

You now have the core patterns for safe, incremental modernization.
