Caching · Advanced · Interactive · 12 min exploration

Cache Inconsistency Timeline

Understanding when and why cache inconsistencies occur, and how long they persist in distributed systems

consistency · distributed-systems · race-conditions · partial-failures · timeline

How this simulation works

Use the interactive controls below to adjust system parameters and observe how they affect performance metrics in real-time. The charts update instantly to show the impact of your changes, helping you understand system trade-offs and optimal configurations.

Simulation Controls

  • Inconsistency Scenario: type of cache inconsistency to simulate
  • Number of Services (3 services): how many services are accessing the same data
  • Update Frequency (20 updates/min): how often each service updates data
  • Cache Operation Timeout: maximum time to wait for cache operations
  • Cache Failure Rate (5%): percentage of cache operations that fail or time out
  • Propagation Delay (2000 ms): time for cache updates to propagate across the system
  • Read/Write Ratio: ratio of read operations to write operations
  • Coordination Mechanism: how services coordinate cache updates

Current Metrics

  • Inconsistency Window (ms): average time data remains inconsistent
  • Stale Read Percentage (%): percentage of reads returning outdated data
  • Data Loss Events (events/hr): number of partial failure scenarios per hour
  • System Throughput (ops/s): total operations processed successfully
  • Coordination Overhead (%): performance cost of consistency mechanisms
  • Inconsistency Recovery Time (s): time to restore consistency after failure

Performance Metrics

Real-time performance metrics based on your configuration

Configuration Summary

Current Settings

  • Inconsistency Scenario: multiple-writers
  • Number of Services: 3 services
  • Update Frequency: 20 updates/min
  • Cache Operation Timeout: 100

Optimization Tips

Experiment with different parameter combinations to understand the trade-offs. Notice how changing one parameter affects multiple metrics simultaneously.

Cache Inconsistency Timeline

This simulation explores the temporal dimension of cache inconsistency - when inconsistencies occur, how long they persist, and what factors influence recovery time. Based on real scenarios from our Caches Lie: Consistency Isn't Free post.

The Four Horsemen of Cache Inconsistency

🏁 Multiple Writers (Race Conditions)

What happens: Service A updates database, Service B reads from stale cache before invalidation

T=0:   Service A writes "premium=true" to database
T=50:  Service B reads "premium=false" from cache (STALE!)
T=100: Cache invalidation finally propagates
T=150: System becomes consistent again

Real impact: User sees old subscription status, duplicate charges possible
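
A minimal, self-contained sketch of this window, using plain dictionaries in place of a real database and cache and a deliberate sleep to stand in for the invalidation lag (the key names and timings are illustrative):

import threading
import time

database = {"user:42": {"premium": False}}
cache = {"user:42": {"premium": False}}

def service_a_update():
    database["user:42"] = {"premium": True}   # T=0: database write lands
    time.sleep(0.10)                          # invalidation lags by ~100 ms
    cache.pop("user:42", None)                # T=100: cache finally invalidated

def service_b_read():
    time.sleep(0.05)                          # T=50: read lands inside the window
    value = cache.get("user:42") or database["user:42"]
    print("Service B sees premium =", value["premium"])   # stale: False

a = threading.Thread(target=service_a_update)
b = threading.Thread(target=service_b_read)
a.start(); b.start(); a.join(); b.join()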

💥 Partial Failures

What happens: Database write succeeds, cache update times out

T=0:   Database UPDATE succeeds (user.name = "Jane")
T=50:  Cache SET fails (timeout/network error)
T=100: Future reads return old value (user.name = "John")
T=300: Manual intervention or TTL expiry fixes state

Real impact: Inconsistent user experience, customer support burden
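
A sketch of how this failure mode typically appears in application code; the FlakyCache class below is a stand-in for a real cache client whose set() can time out, not any particular library:

class FlakyCache:
    """Stand-in cache client whose writes can time out."""
    def __init__(self):
        self.store = {"user:7": {"name": "John"}}
        self.fail_next_set = True

    def set(self, key, value):
        if self.fail_next_set:
            raise TimeoutError("cache SET timed out")
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

database = {"user:7": {"name": "John"}}
cache = FlakyCache()

def update_name(user_id, name):
    database[f"user:{user_id}"] = {"name": name}      # T=0: DB write succeeds
    try:
        cache.set(f"user:{user_id}", {"name": name})  # T=50: cache write fails
    except TimeoutError:
        pass  # swallowed: the cache now silently disagrees with the database

update_name(7, "Jane")
print("database:", database["user:7"])   # {'name': 'Jane'}
print("cache:   ", cache.get("user:7"))  # {'name': 'John'} until TTL or manual fix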

🌊 Propagation Delays

What happens: Updates spread slowly through distributed cache cluster

T=0:    Update written to cache node A
T=500:  Update propagates to cache node B
T=1000: Update propagates to cache node C
T=2000: All nodes finally consistent

Real impact: Different users see different versions of data
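
A rough sketch of what this window looks like from the readers' side, with hard-coded per-node arrival times standing in for real replication lag:

# Seconds after T=0 at which each cache node receives the update (illustrative).
arrival_time = {"node-a": 0.0, "node-b": 0.5, "node-c": 2.0}

def read_version(node, t):
    """Value a reader routed to `node` sees at time t."""
    return "new" if t >= arrival_time[node] else "old"

for t in (0.1, 0.6, 1.5, 2.5):
    views = {node: read_version(node, t) for node in arrival_time}
    print(f"t={t}s {views}")
# Until t=2.0s, users routed to different nodes see different versions.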

⚡ Thundering Herd

What happens: Cache expires, all services simultaneously hit database

T=0:   Cache key expires
T=1:   100 services detect cache miss
T=2:   All 100 services query database simultaneously
T=50:  Database becomes overloaded, timeouts begin
T=200: One query succeeds, repopulates cache
T=300: System recovers, but damage done

Real impact: Database overload, cascade failures, user-facing errors
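
One common mitigation is to let a single caller rebuild the expired key while everyone else backs off; a minimal in-process sketch using a thread lock (a distributed lock or single-flight library would play the same role across services):

import threading

cache = {}
rebuild_lock = threading.Lock()

def load_from_database(key):
    # Expensive query that only one caller should run per expiry.
    return f"fresh value for {key}"

def get(key):
    value = cache.get(key)
    if value is not None:
        return value
    if rebuild_lock.acquire(blocking=False):
        try:
            value = load_from_database(key)   # only one thread repopulates
            cache[key] = value
            return value
        finally:
            rebuild_lock.release()
    # Everyone else backs off instead of stampeding the database.
    return cache.get(key) or "temporarily unavailable"

threads = [threading.Thread(target=lambda: print(get("user:42"))) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()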

Key Metrics Explained

Inconsistency Window

The time between when data becomes inconsistent and when it's fixed

Factors that increase window:

  • More services = more coordination complexity
  • Higher update frequency = more conflicts
  • Slower propagation = longer delays
  • Poor coordination = manual fixes needed

Stale Read Percentage

How many reads return outdated information

Formula: (Failed Updates / Total Updates) × Read Ratio × 100
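
A quick worked example of the formula, with illustrative numbers:

failed_updates = 5       # e.g. a 5% cache failure rate over 100 updates
total_updates = 100
read_ratio = 0.8         # 80% of operations are reads

stale_read_pct = (failed_updates / total_updates) * read_ratio * 100
print(f"Stale reads: {stale_read_pct:.1f}%")   # 4.0%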

Critical for:

  • User experience quality
  • Business logic correctness
  • Compliance requirements

Data Loss Events

Scenarios where updates are completely lost

Common causes:

  • Cache accepts write, crashes before database flush
  • Database rollback after cache update
  • Network partitions during two-phase commits
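
A toy illustration of the first cause, assuming a write-behind cache that acknowledges writes before flushing them to the database (WriteBehindCache is purely illustrative):

class WriteBehindCache:
    """Toy write-behind cache: acknowledges writes, flushes to the DB later."""
    def __init__(self, database):
        self.database = database
        self.store = {}
        self.dirty = set()        # keys written but not yet flushed

    def set(self, key, value):
        self.store[key] = value
        self.dirty.add(key)       # write acknowledged before the DB sees it

    def flush(self):
        for key in list(self.dirty):
            self.database[key] = self.store[key]
        self.dirty.clear()

database = {"order:1": "pending"}
cache = WriteBehindCache(database)

cache.set("order:1", "paid")      # caller believes the update succeeded
# ... cache process crashes here, before flush() ever runs ...
print(database["order:1"])        # still "pending": the update is gone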

Coordination Mechanisms Compared

No Coordination (Chaos Mode)

# Everyone for themselves
def update_user(user_id, data):
    database.update(user_id, data)    # Service A
    cache.delete(f"user:{user_id}")   # Service B (maybe)

✅ Zero overhead
❌ Maximum inconsistency

Sequence Numbers

def update_user(user_id, data):
    version = get_next_version()
    database.update(user_id, data, version=version)
    cache.set(f"user:{user_id}", data, version=version)

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached and cached.version < database.get_version(user_id):
        cache.delete(f"user:{user_id}")  # Stale!
        return database.get(user_id)
    return cached

✅ Detects conflicts
❌ Additional metadata overhead

Distributed Locks

def update_user(user_id, data):
    with distributed_lock(f"user:{user_id}"):
        database.update(user_id, data)
        cache.set(f"user:{user_id}", data)

✅ Prevents race conditions
❌ High latency, deadlock risk

Event-Driven Invalidation

# Service A
def update_user(user_id, data):
    database.update(user_id, data)
    event_bus.publish('user.updated', {
        'user_id': user_id,
        'timestamp': now()
    })

# All services
@event_handler('user.updated')
def on_user_updated(event):
    cache.delete(f"user:{event.user_id}")

✅ Decoupled, scalable
❌ Event delivery complexity

Real-World Scenarios

E-commerce Inventory

Scenario: Stock level updates
Services: [checkout, inventory, analytics]
Problem: "In stock" shown after last item sold
Impact: Overselling, customer disappointment
Solution: Event-driven with sequence numbers

Social Media Likes

Scenario: Like count updates
Services: [mobile-app, web-app, analytics]
Problem: Different users see different like counts
Impact: User confusion, engagement metrics skew
Solution: Write-behind with eventual consistency

Financial Balances

Scenario: Account balance updates
Services: [payments, statements, fraud-detection]
Problem: Payment authorized on stale balance
Impact: Overdrafts, compliance violations
Solution: Distributed locks with strict consistency

Interactive Experiments

Experiment 1: Race Condition Chaos

  1. Set 10 services with no coordination
  2. High update frequency (80 updates/min)
  3. Observe stale read percentage and inconsistency windows
  4. Add sequence numbers - watch improvement

Experiment 2: Partial Failure Recovery

  1. Choose partial failures scenario
  2. Set high cache failure rate (20%)
  3. Try different coordination mechanisms
  4. Compare recovery times and data loss events

Experiment 3: Propagation Impact

  1. Select propagation delay scenario
  2. Increase propagation delay to 5+ seconds
  3. Watch how stale read percentage grows
  4. Reduce services - see if it helps

Experiment 4: Thundering Herd Mitigation

  1. Choose thundering herd scenario
  2. Start with no coordination
  3. Switch to distributed locks
  4. Observe dramatic improvement in data loss events

Production Insights

When to Accept Inconsistency

  • Analytics data: Slight delays acceptable
  • Content recommendations: Staleness won't hurt
  • A/B test assignments: Eventually consistent is fine

When to Fight for Consistency

  • Financial transactions: Money must be accurate
  • Security permissions: Stale access = vulnerabilities
  • Inventory levels: Overselling damages reputation

Monitoring Strategy

# Key metrics to track
import time

class ConsistencyMonitor:
    def track_inconsistency_window(self, operation):
        start_time = time.monotonic()
        operation()  # ... operation happens
        end_time = time.monotonic()
        self.histogram('inconsistency.window', end_time - start_time)

    def detect_stale_reads(self, cache_value, db_value):
        if cache_value.version < db_value.version:
            self.counter('stale.reads').increment()
            self.gauge('staleness.lag', db_value.version - cache_value.version)

Alerting Thresholds

  • Inconsistency window > 5 seconds
  • Stale reads > 1% of total reads
  • Data loss events > 0 per hour (for critical data)
  • Recovery time > 30 seconds
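
A sketch of how these thresholds might be wired into a periodic check; the metric names and the alert callback are placeholders rather than a specific monitoring API:

THRESHOLDS = {
    "inconsistency_window_s": 5.0,     # seconds
    "stale_read_ratio": 0.01,          # 1% of reads
    "data_loss_events_per_hr": 0,      # any loss of critical data alerts
    "recovery_time_s": 30.0,           # seconds
}

def check_thresholds(metrics, alert):
    """Fire an alert for every metric that exceeds its threshold."""
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name, 0)
        if value > limit:
            alert(f"{name}={value} exceeds threshold {limit}")

# Example run with made-up measurements:
check_thresholds(
    {"inconsistency_window_s": 7.2, "stale_read_ratio": 0.004,
     "data_loss_events_per_hr": 1, "recovery_time_s": 12.0},
    alert=print,
)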

The Timeline Perspective

Understanding cache inconsistency through time reveals:

  1. Most inconsistencies are temporary - systems naturally converge
  2. Coordination trades performance for consistency - choose wisely
  3. Recovery time often matters more than initial inconsistency - plan for it
  4. Different scenarios need different solutions - one size doesn't fit all

Key insight: The goal isn't eliminating inconsistency - it's controlling the blast radius and recovery time to match your business requirements.

This temporal view helps you make informed decisions about where to invest in consistency mechanisms and where to accept eventual consistency with proper monitoring.

Published by Anirudh Sharma

Explore More Interactive Content

Ready to dive deeper? Check out our system blueprints for implementation guides or explore more simulations.