Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

570
HEALTH_CHECKS.md Normal file
View File

@@ -0,0 +1,570 @@
# Production Health Check Strategy
## Verify Actual Scraping Works
---
## 🎯 Problem with Basic Health Checks
### What Basic Health Checks Test:
```python
@app.get("/health")
async def health():
db_ok = await ping_database() # ✅ DB responds
redis_ok = await ping_redis() # ✅ Redis responds
disk_ok = check_disk_space() < 90 # ✅ Disk not full
return {"status": "healthy"}
```
### What They DON'T Test:
- ❌ Can we actually scrape Google Maps?
- ❌ Is Chrome working?
- ❌ Are CSS selectors still valid?
- ❌ Is GDPR handling working?
- ❌ Did Google change their page structure?
- ❌ Is our proxy/network working?
### Real-World Failure Example:
```
✅ Database: healthy
✅ Redis: healthy
✅ Disk: 45% used
❌ Actual scraping: BROKEN (Google changed selectors)
→ Health check says "healthy" but all jobs fail!
```
---
## ✅ Solution: Synthetic Monitoring
### Concept: Canary Tests
Run an **actual scraping job** periodically on a known test URL:
```python
TEST_URL = "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/..."
# A stable business that always has reviews
Every 4-6 hours:
1. Run actual scrape on test URL
2. Verify we get reviews
3. Verify data structure is correct
4. Verify scrape time is reasonable
5. Alert if anything fails
```
---
## 🏗️ Implementation
### 1. Canary Scraping Endpoint
```python
from datetime import datetime, timedelta
# Store last canary result
canary_state = {
"last_run": None,
"last_success": None,
"last_result": None,
"consecutive_failures": 0
}
@app.get("/health/canary")
async def canary_health_check():
"""
Run a real scraping test to verify the scraper works.
This is the MOST IMPORTANT health check - it verifies:
- Chrome can start
- Google Maps is accessible
- Selectors still work
- GDPR handling works
- We can extract reviews
"""
# Don't run too frequently (rate limit to avoid Google detection)
if canary_state["last_run"]:
elapsed = datetime.now() - canary_state["last_run"]
if elapsed < timedelta(hours=1):
# Return cached result
return {
"status": "cached",
"last_run": canary_state["last_run"].isoformat(),
"last_result": canary_state["last_result"],
"cached_for": f"{elapsed.total_seconds():.0f}s"
}
# Run canary test
canary_state["last_run"] = datetime.now()
try:
# Use a known stable business
TEST_URL = "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
# Run actual scrape with timeout
result = await asyncio.wait_for(
fast_scrape_reviews(
url=TEST_URL,
headless=True,
max_scrolls=10 # Limited for canary
),
timeout=60 # Fail if takes > 60s
)
# Validate result
checks = {
"scrape_succeeded": result['success'],
"got_reviews": result['count'] > 0,
"reasonable_count": 10 <= result['count'] <= 500,
"reasonable_time": result['time'] < 30,
"data_structure_valid": validate_review_structure(result['reviews']),
}
all_passed = all(checks.values())
if all_passed:
canary_state["consecutive_failures"] = 0
canary_state["last_success"] = datetime.now()
canary_state["last_result"] = {
"status": "pass",
"reviews_count": result['count'],
"scrape_time": result['time'],
"checks": checks
}
status_code = 200
else:
canary_state["consecutive_failures"] += 1
canary_state["last_result"] = {
"status": "fail",
"reviews_count": result['count'],
"scrape_time": result['time'],
"checks": checks,
"consecutive_failures": canary_state["consecutive_failures"]
}
status_code = 503 # Service Unavailable
return JSONResponse(
status_code=status_code,
content={
"status": "pass" if all_passed else "fail",
"last_run": canary_state["last_run"].isoformat(),
"last_success": canary_state["last_success"].isoformat() if canary_state["last_success"] else None,
"result": canary_state["last_result"],
"details": {
"test_url": TEST_URL,
"reviews_found": result['count'],
"scrape_time_seconds": result['time'],
"checks": checks
}
}
)
except asyncio.TimeoutError:
canary_state["consecutive_failures"] += 1
canary_state["last_result"] = {
"status": "timeout",
"error": "Scrape took longer than 60 seconds"
}
return JSONResponse(
status_code=503,
content={
"status": "timeout",
"error": "Canary scrape timeout (>60s)",
"consecutive_failures": canary_state["consecutive_failures"]
}
)
except Exception as e:
canary_state["consecutive_failures"] += 1
canary_state["last_result"] = {
"status": "error",
"error": str(e)
}
return JSONResponse(
status_code=503,
content={
"status": "error",
"error": str(e),
"consecutive_failures": canary_state["consecutive_failures"]
}
)
def validate_review_structure(reviews):
"""Validate that reviews have expected structure"""
if not reviews or len(reviews) == 0:
return False
# Check first review has required fields
first_review = reviews[0]
required_fields = ['author', 'rating', 'date_text']
return all(field in first_review for field in required_fields)
```
---
### 2. Background Canary Runner
Instead of running on health check endpoint (which gets called frequently), run in background:
```python
import asyncio
from datetime import datetime, timedelta
class CanaryMonitor:
"""Background task that runs canary tests periodically"""
def __init__(self, interval_hours=4):
self.interval = timedelta(hours=interval_hours)
self.last_run = None
self.last_success = None
self.consecutive_failures = 0
self.running = False
async def start(self):
"""Start the background canary monitoring"""
self.running = True
while self.running:
try:
await self.run_canary()
except Exception as e:
log.error(f"Canary test failed: {e}")
self.consecutive_failures += 1
# Alert if multiple consecutive failures
if self.consecutive_failures >= 3:
await self.send_alert(
f"🚨 CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row!"
)
# Sleep until next run
await asyncio.sleep(self.interval.total_seconds())
async def run_canary(self):
"""Run a single canary test"""
log.info("Running canary scrape test...")
self.last_run = datetime.now()
TEST_URL = "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
result = await asyncio.wait_for(
fast_scrape_reviews(url=TEST_URL, headless=True, max_scrolls=10),
timeout=60
)
# Validate result
if result['success'] and result['count'] > 10 and result['time'] < 30:
log.info(f"✅ Canary test passed: {result['count']} reviews in {result['time']:.1f}s")
self.consecutive_failures = 0
self.last_success = datetime.now()
# Store result in database for tracking
await db.execute("""
INSERT INTO canary_results (timestamp, success, reviews_count, scrape_time)
VALUES (NOW(), true, %s, %s)
""", result['count'], result['time'])
else:
log.error(f"❌ Canary test failed: {result}")
self.consecutive_failures += 1
await db.execute("""
INSERT INTO canary_results (timestamp, success, error_message)
VALUES (NOW(), false, %s)
""", result.get('error', 'Unknown error'))
raise Exception(f"Canary validation failed: {result}")
async def send_alert(self, message):
"""Send alert via Slack/email/PagerDuty when canary fails"""
# Slack webhook
await httpx.post(
SLACK_WEBHOOK_URL,
json={"text": message}
)
# Or email
await send_email(
to="oncall@example.com",
subject="Scraper Canary Failure",
body=message
)
def stop(self):
"""Stop the background monitoring"""
self.running = False
# In api_server.py startup
canary_monitor = CanaryMonitor(interval_hours=4)
@asynccontextmanager
async def lifespan(app: FastAPI):
# Startup
asyncio.create_task(canary_monitor.start())
yield
# Shutdown
canary_monitor.stop()
```
---
### 3. Canary Health Check Endpoint (Fast)
```python
@app.get("/health/canary")
async def get_canary_status():
"""
Return the LATEST canary test result (doesn't run a new test).
Use this for health checks from load balancers / monitoring systems.
"""
if not canary_monitor.last_success:
return JSONResponse(
status_code=503,
content={
"status": "unknown",
"message": "No canary tests run yet"
}
)
# Check if last success was recent enough
age = datetime.now() - canary_monitor.last_success
max_age = timedelta(hours=6)
if age > max_age:
return JSONResponse(
status_code=503,
content={
"status": "stale",
"last_success": canary_monitor.last_success.isoformat(),
"age_hours": age.total_seconds() / 3600,
"message": f"Last successful canary was {age.total_seconds()/3600:.1f} hours ago"
}
)
# Recent success - all good!
return {
"status": "healthy",
"last_success": canary_monitor.last_success.isoformat(),
"age_minutes": age.total_seconds() / 60,
"consecutive_failures": canary_monitor.consecutive_failures
}
```
---
## 📊 Complete Health Check Hierarchy
### 1. **Liveness** (Is the app alive?)
```python
@app.get("/health/live")
async def liveness():
# Simple: can the server respond?
return {"status": "alive"}
```
**Use**: Kubernetes liveness probe (restart if fails)
---
### 2. **Readiness** (Can the app handle traffic?)
```python
@app.get("/health/ready")
async def readiness():
# Check dependencies
db_ok = await ping_database()
redis_ok = await ping_redis()
if db_ok and redis_ok:
return {"status": "ready"}
else:
raise HTTPException(status_code=503, detail="Not ready")
```
**Use**: Kubernetes readiness probe (remove from load balancer if fails)
---
### 3. **Canary** (Does scraping actually work?)
```python
@app.get("/health/canary")
async def canary():
# Return last canary test result
if canary_monitor.last_success and age < 6_hours:
return {"status": "healthy"}
else:
return JSONResponse(status_code=503, content={"status": "unhealthy"})
```
**Use**: External monitoring (PagerDuty, DataDog) - alerts if fails
---
### 4. **Detailed** (Full system status)
```python
@app.get("/health/detailed")
async def detailed_health():
return {
"status": "healthy",
"components": {
"api": {"status": "healthy", "latency_ms": 1},
"database": {"status": "healthy", "latency_ms": 5},
"redis": {"status": "healthy", "latency_ms": 2},
"workers": {"status": "healthy", "active": 4},
"canary": {
"status": "healthy",
"last_success": "2026-01-18T10:30:00Z",
"age_minutes": 45,
"consecutive_failures": 0
}
},
"timestamp": datetime.utcnow().isoformat()
}
```
**Use**: Monitoring dashboards, debugging
---
## 📈 Monitoring Strategy
### Canary Test Schedule
```
Every 4 hours:
- Run full canary test
- Store result in database
- Alert if fails
Benefits:
✅ Detects Google Maps changes within 4 hours
✅ Detects selector breakage quickly
✅ Low overhead (6 tests/day)
✅ Won't trigger Google rate limits
```
### Alert Rules
```python
# Alert on consecutive failures
if consecutive_failures >= 3:
send_pagerduty_alert("CRITICAL: Scraper broken")
# Alert on slow canary
if scrape_time > 60:
send_slack_alert("WARNING: Scraper slow")
# Alert on low review count
if reviews_count < 10:
send_slack_alert("WARNING: Low review count in canary")
```
---
## 🎯 Canary Database Tracking
```sql
CREATE TABLE canary_results (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL DEFAULT NOW(),
success BOOLEAN NOT NULL,
reviews_count INTEGER,
scrape_time REAL,
error_message TEXT,
metadata JSONB
);
CREATE INDEX idx_canary_timestamp ON canary_results(timestamp DESC);
-- Query to see canary health over time
SELECT
DATE_TRUNC('day', timestamp) as day,
COUNT(*) as total_tests,
SUM(CASE WHEN success THEN 1 ELSE 0 END) as successful,
AVG(scrape_time) as avg_scrape_time,
AVG(reviews_count) as avg_reviews
FROM canary_results
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY day
ORDER BY day DESC;
```
---
## ✅ Complete Health Check Implementation
```python
# health_checks.py
from datetime import datetime, timedelta
import asyncio
from typing import Dict, Any
class HealthCheckSystem:
"""Complete health check system for production"""
def __init__(self):
self.canary = CanaryMonitor(interval_hours=4)
async def start(self):
"""Start background health monitoring"""
asyncio.create_task(self.canary.start())
@property
def is_healthy(self) -> bool:
"""Overall system health"""
return (
self.canary.consecutive_failures < 3 and
self.canary.last_success and
(datetime.now() - self.canary.last_success) < timedelta(hours=6)
)
async def get_status(self) -> Dict[str, Any]:
"""Get complete health status"""
db_latency = await self.check_database()
redis_latency = await self.check_redis()
return {
"status": "healthy" if self.is_healthy else "degraded",
"components": {
"database": {
"healthy": db_latency is not None,
"latency_ms": db_latency
},
"redis": {
"healthy": redis_latency is not None,
"latency_ms": redis_latency
},
"canary_scraper": {
"healthy": self.canary.consecutive_failures == 0,
"last_success": self.canary.last_success.isoformat() if self.canary.last_success else None,
"consecutive_failures": self.canary.consecutive_failures
}
},
"timestamp": datetime.utcnow().isoformat()
}
```
---
## 🚀 Production Recommendations
1.**Run canary every 4-6 hours** (balanced between freshness and overhead)
2.**Alert after 3 consecutive failures** (avoid false positives)
3.**Store canary results in database** (historical tracking)
4.**Use different health checks for different purposes**:
- `/health/live` → Kubernetes liveness (restart if fails)
- `/health/ready` → Kubernetes readiness (route traffic)
- `/health/canary` → External monitoring (PagerDuty alerts)
5.**Monitor canary metrics**: scrape time, review count, success rate
---
**The canary test is your MOST IMPORTANT health check** - it's the only one that verifies your core business logic actually works!