Validate your rewrite with real production traffic before switching over
You just rewrote your payment processing system.
But how do you KNOW it works correctly with real production traffic?
Real production data has edge cases you never imagined.
One wrong calculation in prod = lost revenue + angry customers.
Parallel Run is a migration pattern where you run both the old and new implementations against real production traffic, compare their results, but return only one answer to users.
Also called: shadow testing, dark launching, or traffic shadowing.
Core idea: Build confidence with real data before cutting over.
```typescript
async function processOrder(order: Order) {
  // 1. Call old system (source of truth)
  const oldResult = await oldOrderSystem.process(order)

  // 2. Call new system in the background (shadow), without awaiting it
  // 3. Compare results off the response path (don't block the response)
  newOrderSystem.process(order)
    .then(newResult => compareResults(oldResult, newResult))
    .catch(err => logError('new-system-error', err))

  // 4. Return ONLY old system result
  return oldResult
}
```
New system runs, but doesn't affect users yet.
Option 1: Reverse Proxy
Route at infrastructure level (nginx, Envoy, API Gateway).
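For example, nginx's mirror module can duplicate every request to a shadow upstream and silently discard the shadow's response. A minimal sketch (upstream names and addresses are illustrative):

```nginx
upstream legacy_backend { server 10.0.0.1:8080; }
upstream new_backend    { server 10.0.0.2:8080; }

server {
  listen 80;

  location / {
    mirror /shadow;                    # copy each request to the shadow location
    proxy_pass http://legacy_backend;  # users only ever see this response
  }

  location = /shadow {
    internal;                                   # not reachable from outside
    proxy_pass http://new_backend$request_uri;  # mirrored response is discarded
  }
}
```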
Option 2: Application-Level Routing
```typescript
class OrderRouter {
  async route(request: Request) {
    const legacy = await this.legacyService.handle(request)

    // Fire-and-forget shadow request; errors never reach the caller
    this.newService.handle(request)
      .then(result => this.compare(legacy, result))
      .catch(err => this.logShadowError(err))

    return legacy
  }
}
```
Legacy = source of truth. New = shadow validation.
Not all differences matter. Define what "correct" means:
```typescript
interface ResultComparator<T> {
  compare(oldResult: T, newResult: T): Comparison
}

class PaymentComparator implements ResultComparator<Payment> {
  compare(oldResult: Payment, newResult: Payment): Comparison {
    return {
      exactMatch: oldResult.amount === newResult.amount,
      statusMatch: oldResult.status === newResult.status,
      // Timestamps might differ slightly between the two systems
      timeDelta: Math.abs(oldResult.timestamp - newResult.timestamp),
      acceptable: this.isAcceptable(oldResult, newResult)
    }
  }

  // Amount and status must match exactly; timestamp drift is tolerated
  private isAcceptable(oldResult: Payment, newResult: Payment): boolean {
    return oldResult.amount === newResult.amount
      && oldResult.status === newResult.status
  }
}
```
When results differ, you need to investigate:
```typescript
async function compareResults(oldResult: unknown, newResult: unknown) {
  const match = deepEqual(oldResult, newResult)

  if (!match) {
    // Log to monitoring system
    logger.warn('shadow-mismatch', {
      oldResult,
      newResult,
      diff: calculateDiff(oldResult, newResult)
    })

    // Increment metrics
    metrics.increment('shadow.mismatch')
  } else {
    metrics.increment('shadow.match')
  }
}
```
Monitor match rate. Target: 99.9%+ before switching.
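As a rough readiness check, you can gate the cutover on a counter-backed match rate. A minimal in-process sketch (illustrative; in practice you would compute this from the shadow.match and shadow.mismatch metrics above):

```typescript
class MatchRateTracker {
  private matches = 0
  private mismatches = 0

  record(matched: boolean) {
    if (matched) this.matches++
    else this.mismatches++
  }

  // Fraction of shadow comparisons that agreed with the old system
  rate(): number {
    const total = this.matches + this.mismatches
    return total === 0 ? 0 : this.matches / total
  }

  readyToSwitch(threshold = 0.999): boolean {
    return this.rate() >= threshold
  }
}
```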
Read operations: Safe to run in parallel
```typescript
// Safe: just comparing query results, nothing is mutated
const oldData = await oldDB.query('SELECT * FROM users')
const newData = await newDB.query('SELECT * FROM users')
compareResults(oldData, newData)
```
Write operations: Dangerous without safeguards
```typescript
// Risky! Could create duplicate charges
await oldPayment.charge(card, amount) // Production charge
await newPayment.charge(card, amount) // Uh oh... the customer pays twice
```
Solutions: run the new system in dry-run mode, shadow writes to a test database, or use idempotency keys.
Option 1: Dry Run Mode
```typescript
// New system doesn't actually mutate anything
await newPayment.charge(card, amount, { dryRun: true })
```
Option 2: Separate Test Database
```typescript
// Shadow writes go to a non-production DB
await newDB.insert(record) // Test database only
```
Option 3: Idempotent Operations
```typescript
// Both systems receive the same idempotency key
const key = generateIdempotencyKey(order)
await oldSystem.process(order, key) // Creates the record
await newSystem.process(order, key) // Deduplicated by the key: no duplicate
```
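One way to derive the key is to hash the fields that identify the logical operation, so both systems send an identical key for the same order. A hypothetical helper (the order fields are assumptions):

```typescript
import { createHash } from 'node:crypto'

// Hypothetical: a stable key derived from the order itself
function generateIdempotencyKey(order: Order): string {
  return createHash('sha256')
    .update(`${order.id}:${order.amount}:${order.currency}`)
    .digest('hex')
}
```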
Challenge: Migrate Netflix Queue from SimpleDB to Cassandra
Approach:
Result: Zero-downtime migration of critical user data
Source: Netflix Tech Blog, 2013
Challenge: Migrate critical MySQL clusters to new infrastructure
Approach:
Result: Smooth migration with minimal risk
You're running TWO systems in production:
Compute costs:
Mitigation strategies:
It's expensive, but cheaper than a broken production system.
Do:
Don't:
- Shadow system crashes take down production
- Discrepancy alert fatigue
- Performance degradation
- Comparison logic is flaky
- Running the parallel run too long
Perfect for:
Not ideal for:
| Pattern | Parallel Run | Strangler Fig | Branch by Abstraction |
|---------|--------------|---------------|------------------------|
| Scope | Validation before switch | Gradual replacement | Internal dependency swap |
| Traffic | Both systems get same traffic | Routing splits traffic | Single implementation at a time |
| Duration | Short (weeks) | Long (months to years) | Medium (weeks to months) |
| Cost | High (2x infrastructure) | Medium | Low |
| Confidence | Highest | Medium | Medium |
Combine them: Use Strangler Fig to route traffic, Branch by Abstraction for internal refactor, Parallel Run to validate before final cutover.
1. Validation with Real Data
Running both systems with production traffic reveals edge cases tests can't catch.
2. Old System Stays in Control
Shadow system runs in parallel but doesn't affect users until you're confident.
3. Short-Lived and Focused
This is expensive. Validate quickly (2-4 weeks) then cut over or abort.
4. Comparison is Critical
Define what "correct" means. Log and analyze every discrepancy.
Parallel Run trades infrastructure cost for confidence in correctness.
Week 1: Choose a critical system to rewrite, build new implementation
Week 2: Deploy shadow routing (new system runs async, returns nothing)
Week 3: Implement result comparison and monitoring
Week 4: Run parallel with full traffic, analyze discrepancies
Week 5: Fix issues until 99%+ match rate achieved
Week 6: Gradually shift traffic to the new system (10% → 50% → 100%; see the routing sketch after this list)
Week 7: Deprecate old system once new is proven
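For week 6, the percentage split should be deterministic so each user consistently lands on the same implementation. A minimal sketch, assuming a string user ID (the hash and names are illustrative):

```typescript
// Deterministic split: the same user always lands on the same side
function useNewSystem(userId: string, rolloutPercent: number): boolean {
  let hash = 0
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0  // simple 32-bit rolling hash
  }
  return hash % 100 < rolloutPercent
}

// Ramp: useNewSystem(user.id, 10), then 50, then 100
```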
Resources: martinfowler.com/bliki/DarkLaunching.html
You've now learned:
Coming next:
You now have the core patterns for safe, incremental modernization.