Cache Inconsistency Timeline
Understanding when and why cache inconsistencies occur, and how long they persist in distributed systems
How this simulation works
Use the interactive controls below to adjust system parameters and observe how they affect performance metrics in real-time. The charts update instantly to show the impact of your changes, helping you understand system trade-offs and optimal configurations.
Simulation Controls
Type of cache inconsistency to simulate
How many services are accessing the same data
How often each service updates data
Maximum time to wait for cache operations
Percentage of cache operations that fail or timeout
Time for cache updates to propagate across system
Ratio of read operations to write operations
How services coordinate cache updates
Current Metrics
Performance Metrics
Real-time performance metrics based on your configuration
Inconsistency Window
Average time data remains inconsistent
Stale Read Percentage
Percentage of reads returning outdated data
Data Loss Events
Number of partial failure scenarios per hour
System Throughput
Total operations processed successfully
Coordination Overhead
Performance cost of consistency mechanisms
Inconsistency Recovery Time
Time to restore consistency after failure
Configuration Summary
Current Settings
Key Insights
Optimization Tips
Experiment with different parameter combinations to understand the trade-offs. Notice how changing one parameter affects multiple metrics simultaneously.
Cache Inconsistency Timeline
This simulation explores the temporal dimension of cache inconsistency: when inconsistencies occur, how long they persist, and what factors influence recovery time. It is based on real scenarios from our Caches Lie: Consistency Isn't Free post.
The Four Horsemen of Cache Inconsistency
🏁 Multiple Writers (Race Conditions)
What happens: Service A updates the database while Service B reads from the stale cache before invalidation propagates
```
T=0:   Service A writes "premium=true" to the database
T=50:  Service B reads "premium=false" from the cache (STALE!)
T=100: Cache invalidation finally propagates
T=150: System becomes consistent again
```
Real impact: User sees old subscription status, duplicate charges possible
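To make the race concrete, here is a minimal sketch of how a cache-aside read can interleave with a write and serve stale data. It is illustrative only: the in-memory `database` and `cache` dictionaries and the sleep-based timing stand in for real services and are not the simulator's code.

```python
import threading
import time

database = {"user:42": {"premium": False}}
cache = {"user:42": {"premium": False}}        # warm cache still holds the old value

def service_a_upgrade(user_id):
    # Service A writes to the database first, then invalidates the cache
    database[f"user:{user_id}"] = {"premium": True}
    time.sleep(0.1)                            # invalidation lags behind the write
    cache.pop(f"user:{user_id}", None)

def service_b_read(user_id):
    # Service B does a classic cache-aside read
    value = cache.get(f"user:{user_id}")
    if value is None:
        value = database[f"user:{user_id}"]
        cache[f"user:{user_id}"] = value
    print("Service B sees:", value)            # premium=False if it lands in the gap

writer = threading.Thread(target=service_a_upgrade, args=(42,))
writer.start()
service_b_read(42)                             # runs inside the invalidation window
writer.join()
```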
💥 Partial Failures
What happens: The database write succeeds, but the cache update times out
```
T=0:   Database UPDATE succeeds (user.name = "Jane")
T=50:  Cache SET fails (timeout/network error)
T=100: Future reads return the old value (user.name = "John")
T=300: Manual intervention or TTL expiry fixes the state
```
Real impact: Inconsistent user experience, customer support burden
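One common defense, sketched below with the same placeholder `database` and `cache` clients used elsewhere in this post (the `timeout` argument and `repair_queue` are hypothetical), is to treat the cache write as best-effort: if it fails, delete the key or queue a repair so the next read repopulates from the database instead of serving the old value.

```python
import logging

def update_user(user_id, data):
    # The database is the source of truth; this write must succeed
    database.update(user_id, data)
    try:
        cache.set(f"user:{user_id}", data, timeout=0.05)   # hypothetical timeout argument
    except Exception:
        # Cache write failed: prefer a cache miss over a stale hit
        try:
            cache.delete(f"user:{user_id}")
        except Exception:
            # Even the delete failed; record it so a repair job can re-sync later
            logging.exception("cache repair needed for user %s", user_id)
            repair_queue.enqueue("user", user_id)          # hypothetical repair queue
```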
🌊 Propagation Delays
What happens: Updates spread slowly through the distributed cache cluster
```
T=0:    Update written to cache node A
T=500:  Update propagates to cache node B
T=1000: Update propagates to cache node C
T=2000: All nodes finally consistent
```
Real impact: Different users see different versions of data
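A toy model of that timeline, purely illustrative (the node names and per-node delays are hard-coded, not taken from the simulator), shows how reads from different nodes disagree until replication catches up:

```python
# Milliseconds after the write at which each node applies the update
apply_time = {"node_a": 0, "node_b": 500, "node_c": 1000}
old_value, new_value = "v1", "v2"

def read(node, t_ms):
    # A node serves the new value only once the update has reached it
    return new_value if t_ms >= apply_time[node] else old_value

for t in (0, 250, 750, 1000):
    print(t, {node: read(node, t) for node in apply_time})
# Until every node has applied the update, different nodes return different versions
```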
⚡ Thundering Herd
What happens: A cache key expires and every service hits the database simultaneously
```
T=0:   Cache key expires
T=1:   100 services detect the cache miss
T=2:   All 100 services query the database simultaneously
T=50:  Database becomes overloaded, timeouts begin
T=200: One query succeeds and repopulates the cache
T=300: System recovers, but the damage is done
```
Real impact: Database overload, cascade failures, user-facing errors
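A common mitigation, sketched here with a simple in-process lock (a real deployment would use a distributed lock or single-flight layer; the `cache`, `database`, and `ttl` argument are illustrative placeholders), is to let one caller rebuild the expired key while everyone else waits briefly instead of stampeding the database. Adding a small random jitter to TTLs also keeps many keys from expiring at the same instant.

```python
import threading

rebuild_lock = threading.Lock()

def get_product(product_id):
    key = f"product:{product_id}"
    value = cache.get(key)
    if value is not None:
        return value

    # Only one caller rebuilds the key; the others block briefly on the lock
    # instead of all querying the database at once
    with rebuild_lock:
        value = cache.get(key)               # re-check: another caller may have rebuilt it
        if value is None:
            value = database.get(product_id)
            cache.set(key, value, ttl=60)    # hypothetical ttl argument
        return value
```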
Key Metrics Explained
Inconsistency Window
The time between when data becomes inconsistent and when it's fixed
Factors that increase window:
- More services = more coordination complexity
- Higher update frequency = more conflicts
- Slower propagation = longer delays
- Poor coordination = manual fixes needed
Stale Read Percentage
How many reads return outdated information
Formula: (Failed Updates / Total Updates) × Read Ratio × 100
Critical for:
- User experience quality
- Business logic correctness
- Compliance requirements
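As a quick sanity check, the formula above can be expressed as a small helper (illustrative; the read ratio is given as the fraction of traffic that is reads):

```python
def stale_read_pct(failed_updates, total_updates, read_ratio):
    # (Failed Updates / Total Updates) × Read Ratio × 100
    if total_updates == 0:
        return 0.0
    return (failed_updates / total_updates) * read_ratio * 100

# Example: 5 of 200 updates failed to reach the cache, and 80% of traffic is reads
print(stale_read_pct(5, 200, 0.80))   # 2.0 -> roughly 2% of reads are likely stale
```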
Data Loss Events
Scenarios where updates are completely lost
Common causes:
- Cache accepts write, crashes before database flush
- Database rollback after cache update
- Network partitions during two-phase commits
Coordination Mechanisms Compared
No Coordination (Chaos Mode)
```python
# Everyone for themselves
def update_user(user_id, data):
    database.update(user_id, data)       # Service A
    cache.delete(f"user:{user_id}")      # Service B (maybe)
```
✅ Zero overhead ❌ Maximum inconsistency
Sequence Numbers
```python
def update_user(user_id, data):
    version = get_next_version()
    database.update(user_id, data, version=version)
    cache.set(f"user:{user_id}", data, version=version)

def get_user(user_id):
    cached = cache.get(f"user:{user_id}")
    if cached is None:
        return database.get(user_id)
    if cached.version < database.get_version(user_id):
        cache.delete(f"user:{user_id}")  # Stale!
        return database.get(user_id)
    return cached
```
✅ Detects conflicts ❌ Additional metadata overhead
Distributed Locks
```python
def update_user(user_id, data):
    with distributed_lock(f"user:{user_id}"):
        database.update(user_id, data)
        cache.set(f"user:{user_id}", data)
```
✅ Prevents race conditions ❌ High latency, deadlock risk
Event-Driven Invalidation
```python
# Service A
def update_user(user_id, data):
    database.update(user_id, data)
    event_bus.publish('user.updated', {
        'user_id': user_id,
        'timestamp': now()
    })

# All services
@event_handler('user.updated')
def on_user_updated(event):
    cache.delete(f"user:{event.user_id}")
```
✅ Decoupled, scalable ❌ Event delivery complexity
Real-World Scenarios
E-commerce Inventory
```
Scenario: Stock level updates
Services: [checkout, inventory, analytics]
Problem:  "In stock" shown after the last item sold
Impact:   Overselling, customer disappointment
Solution: Event-driven invalidation with sequence numbers
```
Social Media Likes
```
Scenario: Like count updates
Services: [mobile-app, web-app, analytics]
Problem:  Different users see different like counts
Impact:   User confusion, skewed engagement metrics
Solution: Write-behind with eventual consistency
```
Financial Balances
```
Scenario: Account balance updates
Services: [payments, statements, fraud-detection]
Problem:  Payment authorized against a stale balance
Impact:   Overdrafts, compliance violations
Solution: Distributed locks with strict consistency
```
Interactive Experiments
Experiment 1: Race Condition Chaos
- Set 10 services with no coordination
- High update frequency (80 updates/min)
- Observe stale read percentage and inconsistency windows
- Add sequence numbers and watch the improvement
Experiment 2: Partial Failure Recovery
- Choose partial failures scenario
- Set high cache failure rate (20%)
- Try different coordination mechanisms
- Compare recovery times and data loss events
Experiment 3: Propagation Impact
- Select propagation delay scenario
- Increase propagation delay to 5+ seconds
- Watch how stale read percentage grows
- Reduce the number of services and see if it helps
Experiment 4: Thundering Herd Mitigation
- Choose thundering herd scenario
- Start with no coordination
- Switch to distributed locks
- Observe dramatic improvement in data loss events
Production Insights
When to Accept Inconsistency
- Analytics data: Slight delays acceptable
- Content recommendations: Staleness won't hurt
- A/B test assignments: Eventually consistent is fine
When to Fight for Consistency
- Financial transactions: Money must be accurate
- Security permissions: Stale access = vulnerabilities
- Inventory levels: Overselling damages reputation
Monitoring Strategy
```python
import time

# Key metrics to track
class ConsistencyMonitor:
    def track_inconsistency_window(self, operation):
        start_time = time.time()
        # ... operation happens
        end_time = time.time()
        self.histogram('inconsistency.window', end_time - start_time)

    def detect_stale_reads(self, cache_value, db_value):
        if cache_value.version < db_value.version:
            self.counter('stale.reads').increment()
            self.gauge('staleness.lag', db_value.version - cache_value.version)
```
Alerting Thresholds
- Inconsistency window > 5 seconds
- Stale reads > 1% of total reads
- Data loss events > 0 per hour (for critical data)
- Recovery time > 30 seconds
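Expressed as configuration, those thresholds might look like the sketch below; the dictionary keys and `should_alert` helper are assumptions for illustration, not a specific alerting tool's schema.

```python
# Illustrative alerting thresholds, mirroring the list above
ALERT_THRESHOLDS = {
    "inconsistency_window_seconds": 5,   # alert if the window exceeds 5 seconds
    "stale_read_ratio": 0.01,            # alert if more than 1% of reads are stale
    "data_loss_events_per_hour": 0,      # any loss of critical data pages someone
    "recovery_time_seconds": 30,         # alert if recovery takes longer than 30 seconds
}

def should_alert(metric, value):
    return value > ALERT_THRESHOLDS[metric]
```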
The Timeline Perspective
Understanding cache inconsistency through time reveals:
- Most inconsistencies are temporary - systems naturally converge
- Coordination trades performance for consistency - choose wisely
- Recovery time often matters more than initial inconsistency - plan for it
- Different scenarios need different solutions - one size doesn't fit all
Key insight: The goal isn't eliminating inconsistency - it's controlling the blast radius and recovery time to match your business requirements.
This temporal view helps you make informed decisions about where to invest in consistency mechanisms and where to accept eventual consistency with proper monitoring.
Published by Anirudh Sharma