Cache Inconsistency Timeline
Understanding when and why cache inconsistencies occur, and how long they persist in distributed systems
How this simulation works
Use the interactive controls below to adjust system parameters and observe how they affect performance metrics in real-time. The charts update instantly to show the impact of your changes, helping you understand system trade-offs and optimal configurations.
Simulation Controls
Type of cache inconsistency to simulate
How many services are accessing the same data
How often each service updates data
Maximum time to wait for cache operations
Percentage of cache operations that fail or timeout
Time for cache updates to propagate across system
Ratio of read operations to write operations
How services coordinate cache updates
Current Metrics
Performance Metrics
Real-time performance metrics based on your configuration
Inconsistency Window
Average time data remains inconsistent
Stale Read Percentage
Percentage of reads returning outdated data
Data Loss Events
Number of partial failure scenarios per hour
System Throughput
Total operations processed successfully
Coordination Overhead
Performance cost of consistency mechanisms
Inconsistency Recovery Time
Time to restore consistency after failure
Configuration Summary
Current Settings
Key Insights
Optimization Tips
Experiment with different parameter combinations to understand the trade-offs. Notice how changing one parameter affects multiple metrics simultaneously.
Cache Inconsistency Timeline
This simulation explores the temporal dimension of cache inconsistency: when inconsistencies occur, how long they persist, and what factors influence recovery time. It is based on real scenarios from our Caches Lie: Consistency Isn't Free post.
The Four Horsemen of Cache Inconsistency
🏁 Multiple Writers (Race Conditions)
What happens: Service A updates the database while Service B reads from the stale cache before invalidation propagates
```
T=0:   Service A writes "premium=true" to the database
T=50:  Service B reads "premium=false" from the cache (STALE!)
T=100: Cache invalidation finally propagates
T=150: System becomes consistent again
```
Real impact: User sees old subscription status, duplicate charges possible
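To make the race concrete, here is a minimal sketch of how a cache-aside read can interleave with a write and serve stale data. It is illustrative only: the in-memory `database` and `cache` dictionaries and the sleep-based timing stand in for real services and are not the simulator's code.

```python
import threading
import time

database = {"user:42": {"premium": False}}
cache = {"user:42": {"premium": False}}        # warm cache still holds the old value

def service_a_upgrade(user_id):
    # Service A writes to the database first, then invalidates the cache
    database[f"user:{user_id}"] = {"premium": True}
    time.sleep(0.1)                            # invalidation lags behind the write
    cache.pop(f"user:{user_id}", None)

def service_b_read(user_id):
    # Service B does a classic cache-aside read
    value = cache.get(f"user:{user_id}")
    if value is None:
        value = database[f"user:{user_id}"]
        cache[f"user:{user_id}"] = value
    print("Service B sees:", value)            # premium=False if it lands in the gap

writer = threading.Thread(target=service_a_upgrade, args=(42,))
writer.start()
service_b_read(42)                             # runs inside the invalidation window
writer.join()
```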
💥 Partial Failures
What happens: The database write succeeds, but the cache update times out
```
T=0:   Database UPDATE succeeds (user.name = "Jane")
T=50:  Cache SET fails (timeout/network error)
T=100: Future reads return the old value (user.name = "John")
T=300: Manual intervention or TTL expiry fixes the state
```
Real impact: Inconsistent user experience, customer support burden
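One common defense, sketched below with the same placeholder `database` and `cache` clients used elsewhere in this post (the `timeout` argument and `repair_queue` are hypothetical), is to treat the cache write as best-effort: if it fails, delete the key or queue a repair so the next read repopulates from the database instead of serving the old value.

```python
import logging

def update_user(user_id, data):
    # The database is the source of truth; this write must succeed
    database.update(user_id, data)
    try:
        cache.set(f"user:{user_id}", data, timeout=0.05)   # hypothetical timeout argument
    except Exception:
        # Cache write failed: prefer a cache miss over a stale hit
        try:
            cache.delete(f"user:{user_id}")
        except Exception:
            # Even the delete failed; record it so a repair job can re-sync later
            logging.exception("cache repair needed for user %s", user_id)
            repair_queue.enqueue("user", user_id)          # hypothetical repair queue
```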
🌊 Propagation Delays
What happens: Updates spread slowly through the distributed cache cluster
```
T=0:    Update written to cache node A
T=500:  Update propagates to cache node B
T=1000: Update propagates to cache node C
T=2000: All nodes finally consistent
```
Real impact: Different users see different versions of data
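A toy model of that timeline, purely illustrative (the node names and per-node delays are hard-coded, not taken from the simulator), shows how reads from different nodes disagree until replication catches up:

```python
# Milliseconds after the write at which each node applies the update
apply_time = {"node_a": 0, "node_b": 500, "node_c": 1000}
old_value, new_value = "v1", "v2"

def read(node, t_ms):
    # A node serves the new value only once the update has reached it
    return new_value if t_ms >= apply_time[node] else old_value

for t in (0, 250, 750, 1000):
    print(t, {node: read(node, t) for node in apply_time})
# Until every node has applied the update, different nodes return different versions
```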
⚡ Thundering Herd
What happens: A cache key expires and every service hits the database simultaneously
```
T=0:   Cache key expires
T=1:   100 services detect the cache miss
T=2:   All 100 services query the database simultaneously
T=50:  Database becomes overloaded, timeouts begin
T=200: One query succeeds and repopulates the cache
T=300: System recovers, but the damage is done
```
Real impact: Database overload, cascade failures, user-facing errors
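A common mitigation, sketched here with a simple in-process lock (a real deployment would use a distributed lock or single-flight layer; the `cache`, `database`, and `ttl` argument are illustrative placeholders), is to let one caller rebuild the expired key while everyone else waits briefly instead of stampeding the database. Adding a small random jitter to TTLs also keeps many keys from expiring at the same instant.

```python
import threading

rebuild_lock = threading.Lock()

def get_product(product_id):
    key = f"product:{product_id}"
    value = cache.get(key)
    if value is not None:
        return value

    # Only one caller rebuilds the key; the others block briefly on the lock
    # instead of all querying the database at once
    with rebuild_lock:
        value = cache.get(key)               # re-check: another caller may have rebuilt it
        if value is None:
            value = database.get(product_id)
            cache.set(key, value, ttl=60)    # hypothetical ttl argument
        return value
```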
Key Metrics Explained
Inconsistency Window
The time between when data becomes inconsistent and when it's fixed
Factors that increase window:
- More services = more coordination complexity
- Higher update frequency = more conflicts
- Slower propagation = longer delays
- Poor coordination = manual fixes needed
Stale Read Percentage
How many reads return outdated information
Formula: (Failed Updates / Total Updates) × Read Ratio × 100
Critical for:
- User experience quality
- Business logic correctness
- Compliance requirements
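As a quick sanity check, the formula above can be expressed as a small helper (illustrative; the read ratio is given as the fraction of traffic that is reads):

```python
def stale_read_pct(failed_updates, total_updates, read_ratio):
    # (Failed Updates / Total Updates) × Read Ratio × 100
    if total_updates == 0:
        return 0.0
    return (failed_updates / total_updates) * read_ratio * 100

# Example: 5 of 200 updates failed to reach the cache, and 80% of traffic is reads
print(stale_read_pct(5, 200, 0.80))   # 2.0 -> roughly 2% of reads are likely stale
```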
Data Loss Events
Scenarios where updates are completely lost
Common causes:
- Cache accepts write, crashes before database flush
- Database rollback after cache update
- Network partitions during two-phase commits
Coordination Mechanisms Compared
No Coordination (Chaos Mode)
```python
# Everyone for themselves
def update_user(user_id, data):
    database.update(user_id, data)       # Service A
    cache.delete(f"user:{user_id}")      # Service B (maybe)
```
✅ Zero overhead ❌ Maximum inconsistency
Sequence Numbers
```python
def update_user(user_id, data):
    version = get_next_version()
    database.update(user_id, data, version=version)
    cache.set(f"user:{user_id}", data, version=version)

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached is None:
        return database.get(user_id)
    if cached.version < database.get_version(user_id):
        cache.delete(f"user:{user_id}")  # Stale!
        return database.get(user_id)
    return cached
```
✅ Detects conflicts ❌ Additional metadata overhead
Distributed Locks
```python
def update_user(user_id, data):
    with distributed_lock(f"user:{user_id}"):
        database.update(user_id, data)
        cache.set(f"user:{user_id}", data)
```
✅ Prevents race conditions ❌ High latency, deadlock risk
Event-Driven Invalidation
```python
# Service A
def update_user(user_id, data):
    database.update(user_id, data)
    event_bus.publish('user.updated', {
        'user_id': user_id,
        'timestamp': now()
    })

# All services
@event_handler('user.updated')
def on_user_updated(event):
    cache.delete(f"user:{event.user_id}")
```
✅ Decoupled, scalable ❌ Event delivery complexity
Real-World Scenarios
E-commerce Inventory
```
Scenario: Stock level updates
Services: [checkout, inventory, analytics]
Problem:  "In stock" shown after the last item sold
Impact:   Overselling, customer disappointment
Solution: Event-driven invalidation with sequence numbers
```
Social Media Likes
```
Scenario: Like count updates
Services: [mobile-app, web-app, analytics]
Problem:  Different users see different like counts
Impact:   User confusion, skewed engagement metrics
Solution: Write-behind with eventual consistency
```
Financial Balances
```
Scenario: Account balance updates
Services: [payments, statements, fraud-detection]
Problem:  Payment authorized against a stale balance
Impact:   Overdrafts, compliance violations
Solution: Distributed locks with strict consistency
```
Interactive Experiments
Experiment 1: Race Condition Chaos
- Set 10 services with no coordination
- High update frequency (80 updates/min)
- Observe stale read percentage and inconsistency windows
- Add sequence numbers and watch the improvement
Experiment 2: Partial Failure Recovery
- Choose partial failures scenario
- Set high cache failure rate (20%)
- Try different coordination mechanisms
- Compare recovery times and data loss events
Experiment 3: Propagation Impact
- Select propagation delay scenario
- Increase propagation delay to 5+ seconds
- Watch how stale read percentage grows
- Reduce the number of services and see if it helps
Experiment 4: Thundering Herd Mitigation
- Choose thundering herd scenario
- Start with no coordination
- Switch to distributed locks
- Observe dramatic improvement in data loss events
Production Insights
When to Accept Inconsistency
- Analytics data: Slight delays acceptable
- Content recommendations: Staleness won't hurt
- A/B test assignments: Eventually consistent is fine
When to Fight for Consistency
- Financial transactions: Money must be accurate
- Security permissions: Stale access = vulnerabilities
- Inventory levels: Overselling damages reputation
Monitoring Strategy
```python
import time

# Key metrics to track
class ConsistencyMonitor:
    def track_inconsistency_window(self, operation):
        start_time = time.time()
        # ... operation happens
        end_time = time.time()
        self.histogram('inconsistency.window', end_time - start_time)

    def detect_stale_reads(self, cache_value, db_value):
        if cache_value.version < db_value.version:
            self.counter('stale.reads').increment()
            self.gauge('staleness.lag', db_value.version - cache_value.version)
```
Alerting Thresholds
- Inconsistency window > 5 seconds
- Stale reads > 1% of total reads
- Data loss events > 0 per hour (for critical data)
- Recovery time > 30 seconds
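Expressed as configuration, those thresholds might look like the sketch below; the dictionary keys and `should_alert` helper are assumptions for illustration, not a specific alerting tool's schema.

```python
# Illustrative alerting thresholds, mirroring the list above
ALERT_THRESHOLDS = {
    "inconsistency_window_seconds": 5,   # alert if the window exceeds 5 seconds
    "stale_read_ratio": 0.01,            # alert if more than 1% of reads are stale
    "data_loss_events_per_hour": 0,      # any loss of critical data pages someone
    "recovery_time_seconds": 30,         # alert if recovery takes longer than 30 seconds
}

def should_alert(metric, value):
    return value > ALERT_THRESHOLDS[metric]
```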
The Timeline Perspective
Understanding cache inconsistency through time reveals:
- Most inconsistencies are temporary - systems naturally converge
- Coordination trades performance for consistency - choose wisely
- Recovery time often matters more than initial inconsistency - plan for it
- Different scenarios need different solutions - one size doesn't fit all
Key insight: The goal isn't eliminating inconsistency - it's controlling the blast radius and recovery time to match your business requirements.
This temporal view helps you make informed decisions about where to invest in consistency mechanisms and where to accept eventual consistency with proper monitoring.
Published by Anirudh Sharma