Data Migration Verification
Data migration verification is the process of ensuring data integrity and correctness when moving data between systems. While dual writes and expand-contract patterns help with the mechanics of migration, verification ensures you haven't lost, corrupted, or misinterpreted any data during the process.
Failed data migrations can result in lost revenue, compliance violations, and corrupted business records. Systematic verification is essential for high-stakes migrations.
Why Verification Matters
The Stakes
Data is permanent: Unlike code bugs that can be fixed with a new deploy, data corruption may be irreversible.
Examples of migration failures:
- E-commerce company loses order history during database migration
- Financial services firm miscalculates balances due to rounding errors
- Healthcare provider loses patient records during system upgrade
- SaaS company corrupts user data, violating privacy regulations
What Can Go Wrong
Data loss:
- Records missing in target system
- Fields not migrated
- Relationships broken
Data corruption:
- Type conversion errors
- Character encoding issues
- Precision loss in numeric fields
- Date/time timezone problems
Data inconsistency:
- Related records out of sync
- Constraints violated
- Business rules broken
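Several of the corruption classes above slip past naive `===` comparison. A minimal sketch of a tolerant value comparison, assuming an illustrative helper named `valuesMatch` (not part of any verification library), might look like:

```typescript
// Sketch: a tolerant value comparison that surfaces the corruption
// classes above. The epsilon and the rules are assumptions to tune
// per field, not a definitive implementation.
function valuesMatch(a: unknown, b: unknown): boolean {
  // Dates: compare as UTC instants, so "2024-01-01T02:00:00+02:00"
  // and its UTC equivalent are treated as the same moment
  if (a instanceof Date && b instanceof Date) {
    return a.getTime() === b.getTime();
  }
  // Numerics: allow a tiny epsilon to absorb float round-trips,
  // e.g. a DECIMAL(10,2) that passed through a double
  if (typeof a === 'number' && typeof b === 'number') {
    return Math.abs(a - b) < 1e-9;
  }
  // Strings: strict comparison surfaces encoding damage
  // (e.g. "café" vs "cafÃ©" after a latin1/utf8 mix-up)
  return a === b;
}
```

For money fields, comparing integer cents is usually safer than an epsilon on floats.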
Verification Strategies
Strategy 1: Count-Based Verification
Simplest check—ensure same number of records:
async function verifyRecordCounts() {
const counts = await Promise.all([
legacyDb.query('SELECT COUNT(*) as count FROM users'),
modernDb.query('SELECT COUNT(*) as count FROM users'),
]);
// Coerce: some drivers return COUNT(*) as a string or BigInt
const legacyCount = Number(counts[0][0].count);
const modernCount = Number(counts[1][0].count);
if (legacyCount !== modernCount) {
throw new Error(
`Record count mismatch: legacy=${legacyCount}, modern=${modernCount}`
);
}
logger.info('Record count verification passed', { count: legacyCount });
}
Limitations: Doesn't catch data corruption, only data loss.
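A single total can also mask offsetting errors (rows dropped in one date range, duplicated in another). Comparing counts per bucket, fetched with a `GROUP BY` on a date column, localizes the loss. A sketch, with illustrative names:

```typescript
// Sketch: compare per-bucket counts (e.g. rows per creation month)
// so a loss in one range can't be masked by duplicates in another.
// Assumes both maps were built from GROUP BY query results.
function diffBucketCounts(
  legacy: Map<string, number>,
  modern: Map<string, number>
): string[] {
  const mismatched: string[] = [];
  // Union of bucket keys, so a bucket missing entirely on one side
  // is reported rather than silently skipped
  const buckets = new Set([...legacy.keys(), ...modern.keys()]);
  for (const bucket of buckets) {
    if ((legacy.get(bucket) ?? 0) !== (modern.get(bucket) ?? 0)) {
      mismatched.push(bucket);
    }
  }
  return mismatched.sort();
}
```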
Strategy 2: Checksum Verification
Compare aggregated checksums for fast validation:
async function verifyChecksums() {
const [legacyChecksum, modernChecksum] = await Promise.all([
calculateTableChecksum(legacyDb, 'users'),
calculateTableChecksum(modernDb, 'users'),
]);
if (legacyChecksum !== modernChecksum) {
throw new Error('Checksum mismatch - data differs between systems');
}
logger.info('Checksum verification passed', { checksum: legacyChecksum });
}
async function calculateTableChecksum(
db: Database,
table: string
): Promise<string> {
// Get a sorted, consistent representation.
// Note: this loads the whole table into memory, and both systems must
// serialize values identically (dates, numeric types) for checksums to
// match; for large tables, checksum in id-range chunks instead
const rows = await db.query(
`SELECT * FROM ${table} ORDER BY id`
);
const data = JSON.stringify(rows);
return crypto.createHash('sha256').update(data).digest('hex');
}
Benefits: Fast, detects any data differences
Limitations: Doesn't tell you what's different, only that something is
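A useful middle ground is checksumming per id-range chunk: a whole-table mismatch narrows to a few ranges, and only those need the expensive row-by-row comparison below. This sketch assumes each system exposes a checksum per fixed id range (e.g. computed database-side or app-side); the shapes and names are illustrative:

```typescript
// Sketch: per-chunk checksums narrow a mismatch to an id range
// before falling back to row-by-row comparison within that range.
interface ChunkChecksum {
  fromId: number;
  toId: number;
  checksum: string;
}

function findMismatchedRanges(
  legacy: ChunkChecksum[],
  modern: ChunkChecksum[]
): Array<{ fromId: number; toId: number }> {
  // Index the modern chunks by their id range
  const modernByRange = new Map(
    modern.map((c) => [`${c.fromId}-${c.toId}`, c.checksum])
  );
  // Keep only the ranges whose checksums disagree (or are missing)
  return legacy
    .filter((c) => modernByRange.get(`${c.fromId}-${c.toId}`) !== c.checksum)
    .map(({ fromId, toId }) => ({ fromId, toId }));
}
```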
Strategy 3: Row-by-Row Comparison
Compare individual records for detailed validation:
async function verifyRowByRow() {
const batchSize = 1000;
let offset = 0;
const mismatches: any[] = [];
while (true) {
const legacyRows = await legacyDb.query(
'SELECT * FROM users ORDER BY id LIMIT ? OFFSET ?',
[batchSize, offset]
);
if (legacyRows.length === 0) break;
for (const legacyRow of legacyRows) {
const modernRow = await modernDb.query(
'SELECT * FROM users WHERE id = ?',
[legacyRow.id]
);
if (modernRow.length === 0) {
mismatches.push({
id: legacyRow.id,
issue: 'missing_in_modern',
legacyData: legacyRow,
});
} else if (!areRowsEqual(legacyRow, modernRow[0])) {
mismatches.push({
id: legacyRow.id,
issue: 'data_mismatch',
legacyData: legacyRow,
modernData: modernRow[0],
differences: findDifferences(legacyRow, modernRow[0]),
});
}
}
offset += batchSize;
}
if (mismatches.length > 0) {
logger.error('Row verification found mismatches', {
count: mismatches.length,
samples: mismatches.slice(0, 10),
});
throw new Error(`Found ${mismatches.length} mismatched rows`);
}
logger.info('Row-by-row verification passed', { rowsChecked: offset });
}
function areRowsEqual(a: any, b: any): boolean {
// A missing row is never equal to a present one
if (!a || !b) return a === b;
// Normalize before comparison: sort keys so column order doesn't
// affect the serialized form, and compare dates as ISO strings
const normalize = (obj: any) => {
const normalized: any = {};
for (const key of Object.keys(obj).sort()) {
const value = obj[key];
normalized[key] = value instanceof Date ? value.toISOString() : value;
}
return normalized;
};
return JSON.stringify(normalize(a)) === JSON.stringify(normalize(b));
}
function findDifferences(a: any, b: any): string[] {
const diffs: string[] = [];
// Union of keys, so fields present in only one row are reported too;
// dates are compared as ISO strings (two Date objects are never ===)
for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
const av = a[key] instanceof Date ? a[key].toISOString() : a[key];
const bv = b[key] instanceof Date ? b[key].toISOString() : b[key];
if (av !== bv) {
diffs.push(`${key}: ${av} !== ${bv}`);
}
}
return diffs;
}
Strategy 4: Sampling Verification
Verify a statistical sample for large datasets:
async function verifySample(sampleSize: number = 10000) {
// Get total count
const totalCount = await legacyDb.query('SELECT COUNT(*) as count FROM users');
// Random sample (ORDER BY RAND() is MySQL-specific and scans the whole
// table; for very large tables, sample random id ranges instead)
const sampleIds = await legacyDb.query(
'SELECT id FROM users ORDER BY RAND() LIMIT ?',
[sampleSize]
);
let mismatches = 0;
for (const { id } of sampleIds) {
const [legacyRow, modernRow] = await Promise.all([
legacyDb.query('SELECT * FROM users WHERE id = ?', [id]),
modernDb.query('SELECT * FROM users WHERE id = ?', [id]),
]);
if (!areRowsEqual(legacyRow[0], modernRow[0])) {
mismatches++;
logger.warn('Sample verification mismatch', {
id,
legacy: legacyRow[0],
modern: modernRow[0],
});
}
}
const errorRate = mismatches / sampleSize;
logger.info('Sample verification complete', {
sampleSize,
mismatches,
errorRate,
estimatedTotalErrors: Math.round(errorRate * totalCount[0].count),
});
if (errorRate > 0.001) {
// More than 0.1% error rate
throw new Error(`Error rate too high: ${(errorRate * 100).toFixed(2)}%`);
}
}
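How much confidence does a sample give? Assuming rows are sampled uniformly at random, a sample of n rows misses every bad row with probability (1 - p)^n when the true error rate is p. The back-of-envelope math, as hypothetical helpers:

```typescript
// Probability that a uniform random sample of `sampleSize` rows
// contains at least one bad row, given true error rate `errorRate`
function detectionProbability(sampleSize: number, errorRate: number): number {
  return 1 - Math.pow(1 - errorRate, sampleSize);
}

// Smallest n such that at least one bad row appears with the given
// confidence: n = ln(1 - confidence) / ln(1 - errorRate)
function requiredSampleSize(errorRate: number, confidence: number): number {
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - errorRate));
}
```

For example, catching a 0.1% error rate with 99% confidence needs roughly 4,600 sampled rows, which is why the 10,000-row default above is a reasonable floor.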
Strategy 5: Business Rule Validation
Verify business logic constraints:
async function verifyBusinessRules() {
const issues = [];
// Rule 1: All orders must have a customer
const orphanOrders = await modernDb.query(`
SELECT o.id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
WHERE c.id IS NULL
`);
if (orphanOrders.length > 0) {
issues.push({
rule: 'orders_must_have_customer',
violations: orphanOrders.length,
samples: orphanOrders.slice(0, 10),
});
}
// Rule 2: Order total must equal sum of line items
const invalidTotals = await modernDb.query(`
SELECT
o.id,
o.total,
SUM(li.quantity * li.unit_price) as calculated_total
FROM orders o
JOIN line_items li ON o.id = li.order_id
GROUP BY o.id, o.total
HAVING ABS(o.total - calculated_total) > 0.01
`);
if (invalidTotals.length > 0) {
issues.push({
rule: 'order_total_equals_line_items',
violations: invalidTotals.length,
samples: invalidTotals.slice(0, 10),
});
}
// Rule 3: Account balances must be non-negative
const negativeBalances = await modernDb.query(
'SELECT id, balance FROM accounts WHERE balance < 0'
);
if (negativeBalances.length > 0) {
issues.push({
rule: 'account_balance_non_negative',
violations: negativeBalances.length,
samples: negativeBalances.slice(0, 10),
});
}
if (issues.length > 0) {
logger.error('Business rule violations found', { issues });
throw new Error(`Found ${issues.length} business rule violations`);
}
logger.info('Business rule validation passed');
}
Comprehensive Verification Pipeline
class MigrationVerifier {
async runFullVerification(): Promise<VerificationReport> {
const report: VerificationReport = {
startTime: new Date(),
checks: [],
};
try {
// Phase 1: Quick smoke tests
await this.runCheck(report, 'record_counts', () =>
this.verifyRecordCounts()
);
await this.runCheck(report, 'basic_checksums', () =>
this.verifyChecksums()
);
// Phase 2: Detailed verification
await this.runCheck(report, 'sample_comparison', () =>
this.verifySample(10000)
);
// Phase 3: Business logic validation
await this.runCheck(report, 'business_rules', () =>
this.verifyBusinessRules()
);
// Phase 4: Relationship integrity
await this.runCheck(report, 'foreign_keys', () =>
this.verifyForeignKeys()
);
// Phase 5: Data quality
await this.runCheck(report, 'data_quality', () =>
this.verifyDataQuality()
);
report.endTime = new Date();
report.status = 'passed';
return report;
} catch (error) {
report.endTime = new Date();
report.status = 'failed';
report.error = error.message;
return report;
}
}
private async runCheck(
report: VerificationReport,
name: string,
checkFn: () => Promise<void>
) {
const start = Date.now();
try {
await checkFn();
report.checks.push({
name,
status: 'passed',
duration: Date.now() - start,
});
logger.info(`Verification check passed: ${name}`);
} catch (error) {
report.checks.push({
name,
status: 'failed',
duration: Date.now() - start,
error: error.message,
});
logger.error(`Verification check failed: ${name}`, { error });
throw error;
}
}
private async verifyForeignKeys() {
// Check all foreign key relationships are valid
const tables = await modernDb.getTables();
for (const table of tables) {
const foreignKeys = await modernDb.getForeignKeys(table);
for (const fk of foreignKeys) {
const orphans = await modernDb.query(`
SELECT child.id
FROM ${fk.childTable} child
LEFT JOIN ${fk.parentTable} parent ON child.${fk.childColumn} = parent.${fk.parentColumn}
WHERE parent.${fk.parentColumn} IS NULL
AND child.${fk.childColumn} IS NOT NULL
`);
if (orphans.length > 0) {
throw new Error(
`Foreign key violation: ${fk.childTable}.${fk.childColumn} -> ${fk.parentTable}.${fk.parentColumn} (${orphans.length} orphans)`
);
}
}
}
}
private async verifyDataQuality() {
const issues = [];
// Check for NULL values where they shouldn't be
const nullEmails = await modernDb.query(
'SELECT COUNT(*) FROM users WHERE email IS NULL'
);
if (nullEmails[0].count > 0) {
issues.push({ field: 'email', issue: 'unexpected_nulls', count: nullEmails[0].count });
}
// Check for data in expected ranges
const futureDates = await modernDb.query(
'SELECT COUNT(*) FROM orders WHERE created_at > NOW()'
);
if (futureDates[0].count > 0) {
issues.push({ field: 'created_at', issue: 'future_dates', count: futureDates[0].count });
}
if (issues.length > 0) {
throw new Error(`Data quality issues: ${JSON.stringify(issues)}`);
}
}
}
interface VerificationReport {
startTime: Date;
endTime?: Date;
status?: 'passed' | 'failed';
error?: string;
checks: Array<{
name: string;
status: 'passed' | 'failed';
duration: number;
error?: string;
}>;
}
Continuous Verification
Don't just verify once—verify continuously during migration:
class ContinuousVerifier {
private interval: NodeJS.Timeout;
start() {
// Run verification every hour
this.interval = setInterval(async () => {
try {
const report = await new MigrationVerifier().runFullVerification();
await this.storeReport(report);
metrics.gauge('migration.verification.status', report.status === 'passed' ? 1 : 0);
if (report.status === 'failed') {
await this.alertTeam(report);
}
} catch (error) {
logger.error('Continuous verification failed', { error });
}
}, 60 * 60 * 1000); // 1 hour
}
stop() {
clearInterval(this.interval);
}
private async storeReport(report: VerificationReport) {
await db.insert('migration_verification_reports', {
timestamp: report.startTime,
status: report.status,
duration: report.endTime!.getTime() - report.startTime.getTime(),
checks: JSON.stringify(report.checks),
});
}
private async alertTeam(report: VerificationReport) {
await slack.send({
channel: '#data-migration',
text: `⚠️ Migration verification failed!\n\nFailed checks:\n${report.checks
.filter((c) => c.status === 'failed')
.map((c) => `- ${c.name}: ${c.error}`)
.join('\n')}`,
});
}
}
Key Takeaways
- Verification is essential—data corruption can be irreversible
- Use multiple verification strategies: counts, checksums, row comparison, sampling
- Validate business rules and data quality, not just data presence
- Run verification continuously during migration, not just once
- Document verification results for audit trail
- Automate verification to run frequently and catch issues early
- Have rollback plans if verification fails
- Test verification logic itself before relying on it
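One way to test the verification logic itself is to seed a known corruption into a copy of the data and assert that the comparison flags exactly the seeded rows. A minimal sketch, where `Row` and the corruption rule are purely illustrative:

```typescript
// Sketch: seed known corruption, then confirm the comparison logic
// finds exactly the seeded ids and nothing else.
interface Row {
  id: number;
  email: string;
}

// Corrupt a chosen subset of rows in an obvious, detectable way
function corruptRows(rows: Row[], idsToCorrupt: Set<number>): Row[] {
  return rows.map((row) =>
    idsToCorrupt.has(row.id)
      ? { ...row, email: row.email.toUpperCase() + '.corrupted' }
      : { ...row }
  );
}

// The comparison under test: report ids whose email differs
function findCorruptedIds(original: Row[], migrated: Row[]): number[] {
  const byId = new Map(migrated.map((r) => [r.id, r]));
  return original
    .filter((r) => byId.get(r.id)?.email !== r.email)
    .map((r) => r.id);
}
```

If the check reports more or fewer ids than were seeded, the verification logic is broken, and every green run it produced is suspect.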
Further Reading
- Database Reliability Engineering - Chapter on data validation
- Data Quality Fundamentals - Comprehensive data quality guide
- Testing Data Migrations - Martin Fowler on evolutionary database design
- Refactoring Databases - Includes verification strategies