Thundering Herd Simulator
Understanding how cache expiration can overwhelm databases and how to prevent cascade failures
How this simulation works
Use the interactive controls below to adjust system parameters and observe how they affect performance metrics in real time. The charts update instantly to show the impact of your changes, helping you understand system trade-offs and optimal configurations.
Simulation Controls
Number of simultaneous users making requests
Time-to-live for cached data
Requests per minute from each user
Maximum concurrent requests database can handle
Percentage of requests served from cache when healthy
Mitigation strategy applied to prevent thundering herd problems
Time for database to respond to each query
Time for cache to respond to requests
Current Metrics
Performance Metrics
Real-time performance metrics based on your configuration
Database Load
Percentage of database capacity currently used
Average Response Latency
Average time to serve user requests
Error Rate
Percentage of requests failing due to overload
System Throughput
Successfully processed requests per second
Mitigation Effectiveness
How much the strategy reduces database load
User Experience Score
Combined metric of latency and reliability (1-100)
Configuration Summary
Current Settings
Key Insights
Optimization Tips
Experiment with different parameter combinations to understand the trade-offs. Notice how changing one parameter affects multiple metrics simultaneously.
This simulation demonstrates one of the most dangerous cache failure modes: the thundering herd. When a popular cache entry expires, hundreds of requests can simultaneously hit your database, causing cascade failures. Learn how to prevent this disaster from our Caches Lie: Consistency Isn't Free post.
The Anatomy of a Thundering Herd
What Triggers the Stampede?
```
T=0:   Cache entry for popular data expires
T=1:   Request #1 detects cache miss → hits database
T=2:   Request #2 detects cache miss → hits database
T=3:   Request #3 detects cache miss → hits database
...
T=50:  Request #200 detects cache miss → hits database
T=100: Database becomes overwhelmed, starts timing out
T=150: Timeouts cause more retries → even more load
T=200: Cascade failure: database goes down completely
```
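To make the timeline concrete, here is a minimal, self-contained sketch (names and numbers are illustrative, not the simulator's internals) of a naive cache where every concurrent miss goes straight to the database:

```python
import asyncio

cache = {}     # naive shared cache, no coordination between requests
db_calls = 0   # count how many requests actually hit the database


async def fetch_from_database(key: str) -> str:
    global db_calls
    db_calls += 1
    await asyncio.sleep(0.1)  # simulate a 100ms database query
    return f"value-for-{key}"


async def handle_request(key: str) -> str:
    if key in cache:          # cache hit: cheap
        return cache[key]
    value = await fetch_from_database(key)  # miss: every caller hits the DB
    cache[key] = value
    return value


async def main():
    # 200 requests arrive just after the entry expired: 200 identical queries
    await asyncio.gather(*(handle_request("hot-post") for _ in range(200)))
    print(f"database calls: {db_calls}")    # prints 200, not 1


asyncio.run(main())
```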
The Perfect Storm Conditions
- Popular data (high request rate)
- Synchronized expiration (TTL-based caching)
- Slow database queries (complex computation)
- No coordination between requests
Mitigation Strategies Deep Dive
🤝 Request Coalescing
Principle: Only allow one request per cache key to hit the database
```python
import asyncio
from typing import Any, Dict


class CoalescingCache:
    def __init__(self):
        self._cache: Dict[str, Any] = {}
        self._pending: Dict[str, asyncio.Future] = {}

    async def get(self, key: str) -> Any:
        # Check cache first
        if key in self._cache:
            return self._cache[key]

        # Check if we are already fetching this key
        if key in self._pending:
            # Wait for the ongoing request instead of creating a new one
            return await self._pending[key]

        # Start a new fetch and let later arrivals wait on it
        future = asyncio.create_task(self._fetch_from_database(key))
        self._pending[key] = future
        try:
            value = await future
            self._cache[key] = value
            return value
        finally:
            # Clean up the pending entry
            del self._pending[key]
```
✅ Pros: Eliminates duplicate database calls
❌ Cons: Waiting requests are only as fast as the single in-flight query
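To verify the behavior, a tiny harness (a hypothetical subclass supplying the missing `_fetch_from_database`) can count how many fetches actually reach the database:

```python
class DemoCache(CoalescingCache):
    def __init__(self):
        super().__init__()
        self.db_calls = 0

    async def _fetch_from_database(self, key: str):
        self.db_calls += 1
        await asyncio.sleep(0.1)  # simulate the slow query
        return f"value-for-{key}"


async def demo():
    cache = DemoCache()
    results = await asyncio.gather(*(cache.get("hot-post") for _ in range(200)))
    assert all(r == "value-for-hot-post" for r in results)
    print(cache.db_calls)  # 1 — the other 199 callers awaited the same task


asyncio.run(demo())
```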
🎲 Probabilistic Early Expiration
Principle: Randomly refresh cache before TTL expires
```python
import asyncio
import random
import time
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CacheEntry:
    value: Any
    ttl: float
    created_at: float = field(default_factory=time.time)


class ProbabilisticCache:
    def get(self, key: str) -> Any:
        entry = self._cache.get(key)
        if not entry:
            return self._fetch_and_cache(key)

        # Check whether we should refresh early
        age = time.time() - entry.created_at
        ttl = entry.ttl

        # Quadratic ramp: refresh probability climbs as expiry approaches
        refresh_probability = (age / ttl) ** 2
        if random.random() < refresh_probability:
            # Refresh in the background and return the current value
            # (assumes a running event loop)
            asyncio.create_task(self._refresh_key(key))

        return entry.value
```
✅ Pros: Spreads refresh load over time
❌ Cons: Some unnecessary database calls
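The quadratic ramp above is one choice of curve. The variant usually cited in the cache-stampede literature (Vattani et al., "Optimal Probabilistic Cache Stampede Prevention") weights an exponential random draw by a tunable `beta`, which may be what a `beta` knob like the one in the configuration guidelines below controls. A sketch:

```python
import math
import random
import time


def should_refresh_early(expiry: float, delta: float, beta: float = 1.0) -> bool:
    """XFetch-style early-expiry check.

    expiry -- absolute expiration timestamp of the cache entry
    delta  -- observed time the recompute/database fetch takes, in seconds
    beta   -- tuning knob: >1 refreshes earlier, <1 refreshes later
    """
    # -log(random()) is an exponential draw, so refreshes spread out
    # smoothly instead of all firing at the TTL boundary.
    return time.time() - delta * beta * math.log(random.random()) >= expiry
```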
🔄 Background Refresh
Principle: Proactively refresh popular cache entries
```python
from datetime import datetime, timedelta
from typing import Any

from apscheduler.schedulers.background import BackgroundScheduler


class BackgroundRefreshCache:
    def __init__(self):
        self._cache = {}         # key -> CacheEntry (see previous sketch)
        self._default_ttl = 300
        self._scheduler = BackgroundScheduler()
        self._scheduler.start()

    def set(self, key: str, value: Any, ttl: int):
        self._cache[key] = CacheEntry(value, ttl)

        # Schedule a refresh at 80% of the TTL
        refresh_time = ttl * 0.8
        self._scheduler.add_job(
            self._refresh_key,
            'date',
            run_date=datetime.now() + timedelta(seconds=refresh_time),
            args=[key],
        )

    def _refresh_key(self, key: str):
        if key in self._cache:
            new_value = self._fetch_from_database(key)
            self.set(key, new_value, self._default_ttl)
```
✅ Pros: Zero user-facing impact, prevents all stampedes
❌ Cons: Continuous background load, complexity
⚡ Circuit Breaker Pattern
Principle: Fail fast when database is overloaded
```python
import time
from typing import Any


class CircuitBreakerCache:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failure_count = 0
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
        self._cache = {}

    def get(self, key: str) -> Any:
        if self.state == 'OPEN':
            if time.time() - self.last_failure_time < self.timeout:
                # Serve stale data or raise an error
                return self._get_stale_or_error(key)
            else:
                self.state = 'HALF_OPEN'

        try:
            if key not in self._cache:
                value = self._fetch_from_database(key)
                self._cache[key] = value

            # Reset failure count on success
            if self.state == 'HALF_OPEN':
                self.state = 'CLOSED'
                self.failure_count = 0

            return self._cache[key]

        # DatabaseTimeoutException is assumed to be raised by the DB client
        except DatabaseTimeoutException:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'

            return self._get_stale_or_error(key)
```
✅ Pros: Protects database during outages
❌ Cons: May serve stale or error responses
Performance Impact Analysis
Database Load Patterns
```
No Mitigation:
  Normal Load: ████ 20%
  Herd Event:  ████████████████████ 400% (OVERLOADED!)

Request Coalescing:
  Normal Load: ████ 20%
  Herd Event:  ████ 20% (same as normal)

Background Refresh:
  Normal Load: █████ 25% (background jobs)
  Herd Event:  █████ 25% (no spikes!)
```
Latency Characteristics
| Strategy | Normal | During Herd | Recovery Time |
|----------|--------|-------------|---------------|
| None | 5ms | 2000ms+ | 5+ minutes |
| Coalescing | 5ms | 150ms | 30 seconds |
| Early Expiry | 6ms | 50ms | Immediate |
| Background | 5ms | 5ms | N/A |
| Circuit Breaker | 5ms | 10ms* | 60 seconds |
*May serve stale data
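The 400% spike in the load chart above falls out of simple arithmetic: every request that arrives while the first refetch is still in flight also misses the cache. A back-of-envelope sketch (illustrative numbers, not the simulator's internals):

```python
# Back-of-envelope herd sizing (illustrative numbers)
request_rate = 2000   # requests/second hitting the hot key
db_latency = 0.1      # seconds for one database query
db_capacity = 50      # concurrent queries the database can absorb

# Every request arriving during the first refetch also misses the cache.
herd_size = request_rate * db_latency   # 200 simultaneous database queries
overload = herd_size / db_capacity      # 4.0 -> the "400%" spike above

print(f"herd of {herd_size:.0f} queries = {overload:.0%} of capacity")
```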
Real-World Scenarios
Social Media Hot Posts
- Problem: Viral post cache expires; millions of requests hit the database
- TTL: 5 minutes
- Users: 10,000 concurrent
- Strategy: Request coalescing + probabilistic early expiry
- Result: 99.9% database load reduction
E-commerce Flash Sales
- Problem: Product details cache expires during sale
- TTL: 1 minute (fast-changing inventory)
- Users: 5,000 concurrent
- Strategy: Background refresh + circuit breaker
- Result: Zero customer-facing errors
Financial Market Data
- Problem: Stock price cache expires during market open
- TTL: 10 seconds (real-time requirements)
- Users: 1,000 concurrent
- Strategy: Probabilistic early expiry only
- Result: 50% load reduction, acceptable staleness
Interactive Experiments
Experiment 1: Witness the Stampede
- Set 200 concurrent users with no mitigation
- Use 5-minute TTL and 100ms database latency
- Watch database load spike to 200%+ and error rates climb
- Observe how user experience score plummets
Experiment 2: Request Coalescing Magic
- Keep the same settings as Experiment 1
- Switch to request coalescing
- Watch database load drop to normal levels
- Notice slightly higher latency (requests wait for each other)
Experiment 3: Background Refresh Perfection
- Use background refresh strategy
- Increase concurrent users to 500
- Observe how database load stays steady
- See perfect user experience scores
Experiment 4: Circuit Breaker Protection
- Set 1000 concurrent users (extreme load)
- Use circuit breaker pattern
- Watch how it limits database load even when overwhelmed
- Note the trade-off: lower error rates but potential stale data
Experiment 5: TTL Impact
- Choose your favorite mitigation strategy
- Compare 1-minute vs 1-hour TTL
- Observe how longer TTLs reduce herd frequency
- Consider the staleness trade-off
Production Implementation Guide
Monitoring Dashboards
```python
import time

from prometheus_client import Counter, Gauge, Histogram


# Key metrics to track (prometheus_client-style metric objects,
# which require a help string as the second constructor argument)
class ThunderingHerdMetrics:
    def __init__(self):
        self.cache_miss_rate = Histogram(
            'cache_miss_rate', 'Cache miss rate per key')
        self.database_connection_pool = Gauge(
            'db_pool_usage', 'Database connection pool utilization')
        self.request_coalescing_hits = Counter(
            'coalescing_hits', 'Requests served by an in-flight coalesced fetch')
        self.early_refresh_rate = Histogram(
            'early_refresh_rate', 'Probabilistic early refreshes')

    def track_cache_miss_burst(self, key: str, miss_count: int):
        if miss_count > 10:  # Potential thundering herd
            # alert() is assumed to be provided elsewhere (e.g. a paging hook)
            self.alert('thundering_herd_detected', {
                'key': key,
                'miss_count': miss_count,
                'timestamp': time.time(),
            })
```
Alerting Strategy
- Database connection pool > 80% → Warning
- Cache miss rate > 20% → Investigation needed
- Average response time > 500ms → Alert
- Error rate > 1% → Critical alert
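Those four thresholds translate directly into a first-pass alert rule. A minimal sketch (the metric names and `metrics` dict are hypothetical):

```python
def evaluate_alerts(metrics: dict) -> list[tuple[str, str]]:
    """Map current readings to alert levels using the thresholds above."""
    alerts = []
    if metrics['db_pool_usage'] > 0.80:
        alerts.append(('warning', 'database connection pool > 80%'))
    if metrics['cache_miss_rate'] > 0.20:
        alerts.append(('investigate', 'cache miss rate > 20%'))
    if metrics['avg_response_ms'] > 500:
        alerts.append(('alert', 'average response time > 500ms'))
    if metrics['error_rate'] > 0.01:
        alerts.append(('critical', 'error rate > 1%'))
    return alerts
```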
Configuration Guidelines
```yaml
# Production cache configuration
cache:
  default_ttl: 300   # 5 minutes
  max_ttl: 3600      # 1 hour

  # Thundering herd protection
  coalescing:
    enabled: true
    timeout: 5000    # 5 seconds max wait

  probabilistic_refresh:
    enabled: true
    beta: 1.0        # Refresh probability curve

  background_refresh:
    enabled: true
    refresh_ahead_ratio: 0.8   # Refresh at 80% of TTL

  circuit_breaker:
    failure_threshold: 5
    timeout: 60
    half_open_max_calls: 3
```
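Assuming the configuration above lives in a cache.yaml file, loading and sanity-checking it with PyYAML takes a few lines (a sketch):

```python
import yaml  # PyYAML

with open('cache.yaml') as f:
    cfg = yaml.safe_load(f)['cache']

# Fail fast on obviously inconsistent values before wiring up the cache.
assert cfg['default_ttl'] <= cfg['max_ttl']
assert 0.0 < cfg['background_refresh']['refresh_ahead_ratio'] < 1.0
assert cfg['circuit_breaker']['failure_threshold'] >= 1
```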
The Bottom Line
Thundering herds are preventable disasters. The simulation shows that:
- Without mitigation: System becomes unusable during peak loads
- With basic coalescing: 90%+ improvement in stability
- With background refresh: Near-perfect performance
- Circuit breakers: Essential for graceful degradation
Key insight: The best strategy often combines multiple techniques; a sketch of how they compose follows the list below. For example:
- Request coalescing for immediate protection
- Probabilistic early expiry for load spreading
- Circuit breakers for worst-case scenarios
- Background refresh for critical, high-traffic data
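Here is a rough sketch of how those layers compose (class and parameter names are hypothetical; the `fetch` callable stands in for a circuit-breaker-wrapped database client):

```python
import asyncio
import math
import random
import time
from typing import Any, Awaitable, Callable, Dict


class LayeredCache:
    """Illustrative composition: coalescing + probabilistic early refresh,
    with stale-serving fallback when the fetcher fails."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]],
                 ttl: float = 300.0, beta: float = 1.0, delta: float = 0.1):
        self._fetch = fetch                  # e.g. circuit-breaker-wrapped DB call
        self._ttl, self._beta, self._delta = ttl, beta, delta
        self._cache: Dict[str, tuple] = {}   # key -> (value, expiry_timestamp)
        self._pending: Dict[str, asyncio.Task] = {}

    def _expiring_soon(self, expiry: float) -> bool:
        # XFetch-style probabilistic early expiry (see earlier sketch)
        return time.time() - self._delta * self._beta * math.log(random.random()) >= expiry

    def _start_refresh(self, key: str) -> asyncio.Task:
        task = asyncio.create_task(self._refresh(key))
        self._pending[key] = task
        task.add_done_callback(lambda _t: self._pending.pop(key, None))
        return task

    async def get(self, key: str) -> Any:
        entry = self._cache.get(key)
        if entry and time.time() < entry[1]:
            # Entry is live; maybe start an early refresh in the background.
            if self._expiring_soon(entry[1]) and key not in self._pending:
                self._start_refresh(key)
            return entry[0]
        # Hard miss or expired: coalesce all callers onto one fetch.
        task = self._pending.get(key) or self._start_refresh(key)
        return await task

    async def _refresh(self, key: str) -> Any:
        try:
            value = await self._fetch(key)
            self._cache[key] = (value, time.time() + self._ttl)
            return value
        except Exception:
            stale = self._cache.get(key)
            if stale:
                return stale[0]   # graceful degradation: serve stale data
            raise
```

Background refresh fits the same shape: a scheduler can invoke `_refresh` proactively for known-hot keys instead of waiting for a request to trigger it.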
Choose your strategy based on your consistency requirements, load patterns, and failure tolerance.
Published by Anirudh Sharma
Explore More Interactive Content
Ready to dive deeper? Check out our system blueprints for implementation guides or explore more simulations.