Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)
Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews
Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)
Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging
Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Production Microservice Architecture
Google Reviews Scraper API
🎯 Recommended Communication Patterns
1. Webhooks (Primary - RECOMMENDED) ✅
Best for: Production async job processing
Client → POST /scrape (with webhook_url)
↓
Server → Starts job, returns job_id
↓
[Scraping in progress...]
↓
Server → POST to client's webhook_url when complete
{
"job_id": "...",
"status": "completed",
"reviews_count": 244,
"reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
}
Advantages:
- ✅ No polling needed (reduces server load)
- ✅ Instant notifications when job completes
- ✅ Industry standard (Stripe, GitHub, Twilio use this)
- ✅ Client can go offline and come back
- ✅ Scales to millions of jobs
Use cases:
- Batch processing systems
- Integration with other services
- When client has a public endpoint
2. Server-Sent Events (SSE) (Real-time Updates) ⚡
Best for: Real-time progress monitoring
Client → GET /jobs/{job_id}/stream (keeps connection open)
↓
Server → Sends progress updates in real-time:
data: {"stage": "scrolling", "reviews_loaded": 50}
data: {"stage": "scrolling", "reviews_loaded": 100}
data: {"stage": "extracting", "reviews_loaded": 244}
data: {"stage": "completed", "total": 244}
Advantages:
- ✅ Real-time progress updates
- ✅ HTTP-based (works through firewalls)
- ✅ Lightweight (one-way communication)
- ✅ Auto-reconnection support
- ✅ Great for dashboards/UIs
Use cases:
- Web dashboards
- Real-time monitoring
- Progress bars in UI
3. Polling (Fallback) 🔄
Best for: Simple clients, no webhook capability
Client → POST /scrape
↓
Server → Returns job_id
↓
Client → Polls GET /jobs/{job_id} every 2-5 seconds
↓
Server → Returns current status
Advantages:
- ✅ Simple to implement
- ✅ Works everywhere (no public endpoint needed)
- ✅ Firewall-friendly
Disadvantages:
- ❌ Inefficient (many wasted requests)
- ❌ Delayed notifications (polling interval)
- ❌ Higher server load
Use cases:
- Internal tools
- Clients behind firewalls
- Simple integrations
🏛️ Complete Production Architecture
┌─────────────────────────────────────────────────────────────┐
│                        LOAD BALANCER                        │
│                       (nginx/AWS ALB)                       │
└──────────┬──────────────────────────────────────┬───────────┘
           │                                      │
           ▼                                      ▼
┌──────────────────────┐               ┌──────────────────────┐
│   API Server 1       │               │   API Server 2       │
│   (FastAPI)          │               │   (FastAPI)          │
│  - REST endpoints    │               │  - REST endpoints    │
│  - Health checks     │               │  - Health checks     │
│  - Job management    │               │  - Job management    │
└──────────┬───────────┘               └──────────┬───────────┘
           │                                      │
           └───────────────────┬──────────────────┘
                               ▼
                  ┌────────────────────────┐
                  │   REDIS / RabbitMQ     │
                  │     (Job Queue)        │
                  │                        │
                  │  - Pending jobs        │
                  │  - Job distribution    │
                  │  - Pub/Sub for events  │
                  └────────────┬───────────┘
                               │
                               ▼
                ┌──────────────┴──────────────┐
                │                             │
                ▼                             ▼
         ┌─────────────┐               ┌─────────────┐
         │  Worker 1   │               │  Worker 2   │
         │             │               │             │
         │ - Scraping  │               │ - Scraping  │
         │ - Headless  │               │ - Headless  │
         │   Chrome    │               │   Chrome    │
         └──────┬──────┘               └──────┬──────┘
                │                             │
                └──────────────┬──────────────┘
                               ▼
                ┌──────────────────────────────┐
                │      PERSISTENT STORAGE      │
                │                              │
                │  ┌────────────────────────┐  │
                │  │  PostgreSQL / MongoDB  │  │
                │  │  - Job metadata        │  │
                │  │  - Status tracking     │  │
                │  │  - Webhook configs     │  │
                │  └────────────────────────┘  │
                │                              │
                │  ┌────────────────────────┐  │
                │  │  File Storage / S3     │  │
                │  │  - Review JSON files   │  │
                │  │  - Large payloads      │  │
                │  └────────────────────────┘  │
                └──────────────┬───────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │ Webhook Dispatcher  │
                    │  - Retry logic      │
                    │  - Dead letter queue│
                    └──────────┬──────────┘
                               │
                               ▼
                    [Client's webhook URL]
📦 Component Breakdown
1. API Server (FastAPI)
Responsibilities:
- Handle HTTP requests
- Validate input
- Enqueue jobs
- Serve results
- Health checks
Endpoints:
POST /scrape # Submit job
GET /jobs/{id} # Get job status
GET /jobs/{id}/reviews # Get results
GET /jobs/{id}/stream # SSE progress stream
DELETE /jobs/{id} # Cancel job
GET /health # Health check
GET /metrics # Prometheus metrics
2. Job Queue (Redis or RabbitMQ)
Why needed:
- Decouple API from scraping workers
- Distribute load across workers
- Retry failed jobs
- Handle backpressure
Options:
Option A: Redis (Recommended for simpler setups)
# Fast, simple, good for most use cases
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage
Option B: RabbitMQ (For complex workflows)
# More features, better for complex scenarios
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues
Recommendation: Start with Redis, upgrade to RabbitMQ if needed.
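The queue semantics the API and workers rely on can be shown without a running broker. This is a minimal sketch using an in-memory stand-in; `InMemoryQueue` is a hypothetical class, and it assumes the Redis list is used as a FIFO (push on the left, pop on the right). A real worker would typically use a blocking pop instead of returning `None`.

```python
from collections import deque


class InMemoryQueue:
    """Stand-in for a Redis list used as a FIFO job queue (illustration only)."""

    def __init__(self):
        self._items = deque()

    def lpush(self, job_id: str) -> None:
        # Producers (API servers) push new jobs on the left.
        self._items.appendleft(job_id)

    def rpop(self):
        # Consumers (workers) pop from the right, so the oldest job comes out first.
        return self._items.pop() if self._items else None


queue = InMemoryQueue()
for job_id in ("job-1", "job-2", "job-3"):
    queue.lpush(job_id)

processed = [queue.rpop(), queue.rpop(), queue.rpop(), queue.rpop()]
print(processed)  # ['job-1', 'job-2', 'job-3', None]
```

Because the API only pushes and the workers only pop, either side can be scaled or restarted without the other noticing.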
3. Worker Processes (Celery or Custom)
Responsibilities:
- Pull jobs from queue
- Run scraping (headless Chrome)
- Save results
- Send webhooks
- Update job status
Scaling:
# Run 4 workers on same machine
celery -A worker worker --concurrency=4
# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &
# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10
4. Database (PostgreSQL or MongoDB)
Job Metadata Schema:
PostgreSQL (Recommended):
CREATE TABLE jobs (
job_id UUID PRIMARY KEY,
status VARCHAR(20) NOT NULL,
url TEXT NOT NULL,
webhook_url TEXT,
created_at TIMESTAMP NOT NULL,
started_at TIMESTAMP,
completed_at TIMESTAMP,
reviews_count INTEGER,
reviews_file_path TEXT,
error_message TEXT,
metadata JSONB
);
CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
Why PostgreSQL:
- ✅ ACID transactions
- ✅ Good for structured data
- ✅ SQL queries
- ✅ Mature ecosystem
Alternative - MongoDB:
{
_id: ObjectId("..."),
job_id: "550e8400-...",
status: "completed",
url: "https://...",
webhook_url: "https://...",
created_at: ISODate("2026-01-18T..."),
reviews_count: 244,
reviews_file: "/data/reviews/550e8400.json",
metadata: { ... }
}
Why MongoDB:
- ✅ Flexible schema
- ✅ Good for document storage
- ✅ Built-in sharding
Recommendation: PostgreSQL for most cases (better for job queues and transactions)
5. File Storage
Options:
Option A: Local Filesystem (Development/Small scale)
/data/reviews/
├── 550e8400-e29b-41d4-a716-446655440000.json
├── 6a1f9b2c-3d4e-4f6a-8b7c-9d0e1f2a3b4c.json
└── ...
Option B: S3 / Object Storage (Production - RECOMMENDED)
s3://scraper-reviews-bucket/
├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
├── 2026/01/18/6a1f9b2c-3d4e-4f6a-8b7c-9d0e1f2a3b4c.json
└── ...
Why S3:
- ✅ Virtually unlimited storage
- ✅ No disk management
- ✅ High availability
- ✅ Versioning support
- ✅ Pre-signed URLs for direct access
- ✅ Lifecycle policies (auto-delete old files)
Recommendation: S3 (or compatible) for production
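The date-partitioned key layout shown for S3 can be produced by a small helper. This is a sketch only: `review_object_key` is a hypothetical name, and it assumes the date in the key is the job's creation date in UTC.

```python
from datetime import datetime, timezone
from uuid import UUID


def review_object_key(job_id: UUID, created_at: datetime) -> str:
    """Build a date-partitioned object key: YYYY/MM/DD/<job_id>.json."""
    # Normalize to UTC so the same job always maps to the same day folder.
    day = created_at.astimezone(timezone.utc)
    return f"{day:%Y/%m/%d}/{job_id}.json"


key = review_object_key(
    UUID("550e8400-e29b-41d4-a716-446655440000"),
    datetime(2026, 1, 18, 10, 30, tzinfo=timezone.utc),
)
print(key)  # 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
```

Date prefixes also make lifecycle rules and manual cleanup easier, since old data sits under predictable paths.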
6. Webhook Dispatcher
Features:
- ✅ Retry logic (exponential backoff)
- ✅ Dead letter queue for failed webhooks
- ✅ Webhook signatures (HMAC for security)
- ✅ Timeout handling
- ✅ Async delivery
Implementation:
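import asyncio
import hashlib
import hmac
import json

import httpx

async def send_webhook(webhook_url, payload, max_retries=3):
    # Serialize once so the signature covers the exact bytes that are sent
    body = json.dumps(payload).encode()
    # Add signature (WEBHOOK_SECRET must be bytes)
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,
    }
    for attempt in range(max_retries):
        try:
            # Send with timeout
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    webhook_url, content=body, headers=headers, timeout=10.0
                )
            if response.is_success:  # Any 2xx counts as delivered
                return True
        except httpx.HTTPError:
            pass  # Network error or timeout: fall through to the retry
        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
    # All attempts failed: move to dead letter queue
    await save_to_dead_letter_queue(webhook_url, payload)
    return False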
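For reference, the `2 ** attempt` backoff used by the dispatcher produces the waits below. A small sketch: `backoff_delays` is a hypothetical helper, and the `cap` parameter is an assumption that does not appear in the dispatcher code.

```python
def backoff_delays(max_retries: int, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Seconds to wait after each failed attempt, except the last one."""
    # No wait follows the final attempt, hence max_retries - 1 entries.
    return [min(cap, base ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays(3))  # [1.0, 2.0]
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the default of 3 attempts, a failing webhook is abandoned after roughly 3 seconds of waiting plus request timeouts, so a higher retry count may suit clients with longer outages.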
🔥 Complete Workflow Examples
Workflow 1: Webhooks (Production)
# 1. Client submits job with webhook
POST /scrape
{
"url": "https://maps.google.com/...",
"webhook_url": "https://client.com/webhook",
"webhook_secret": "secret123" # For signature verification
}
Response:
{
"job_id": "550e8400-...",
"status": "queued",
"estimated_time": "20s"
}
# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)
# 3. Worker picks up job
job_id = get_from_queue()
reviews = fast_scrape_reviews(url)
# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)
# 5. Update database
db.jobs.update(job_id, {
    "status": "completed",
    "reviews_count": 244,
    "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
})
# 6. Send webhook to client
POST https://client.com/webhook
Headers:
X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
"event": "job.completed",
"job_id": "550e8400-...",
"status": "completed",
"reviews_count": 244,
"reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
"completed_at": "2026-01-18T10:30:20Z"
}
# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
Workflow 2: SSE Streaming (Real-time Dashboard)
# 1. Client opens SSE connection
EventSource("/jobs/{job_id}/stream")
# 2. Server streams progress updates
async def stream_progress(job_id):
    while True:
        job = get_job(job_id)
        event = json.dumps({
            "stage": job.stage,
            "reviews_loaded": job.reviews_loaded,
            "progress_percent": job.progress_percent,
        })
        yield f"data: {event}\n\n"
        if job.status in ("completed", "failed"):
            break
        await asyncio.sleep(1)  # Update every second
# 3. Client receives updates
onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress_percent": 20}
onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress_percent": 40}
onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress_percent": 60}
onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress_percent": 100}
onmessage: {"stage": "completed", "total": 244}
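The `data:` framing used by the stream can be illustrated without a server. This is a minimal sketch that assumes single-line JSON payloads; `format_sse` and `parse_sse` are hypothetical helper names, and a real client would normally rely on `EventSource` or an SSE library.

```python
import json
from typing import Iterable, Iterator


def format_sse(payload: dict) -> str:
    """Encode one event: a 'data:' line terminated by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Decode events from text lines; only single-line 'data:' fields are handled."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


stream = format_sse({"stage": "scrolling", "reviews_loaded": 50}) + format_sse(
    {"stage": "completed", "total": 244}
)
events = list(parse_sse(stream.splitlines()))
print(events)
```

The blank line after each `data:` line is what tells the client that an event is complete.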
Workflow 3: Polling (Simple Clients)
# 1. Submit job (no webhook)
POST /scrape
{
"url": "https://maps.google.com/..."
}
Response:
{
"job_id": "550e8400-...",
"status": "queued"
}
# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}
    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break
    sleep(3)
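The polling loop above is pseudocode. A runnable version might look like the sketch below, which adds a timeout so a stuck job cannot block the client forever. `wait_for_job`, the `fetch_status` callable, the timeout value, and the intermediate `"running"` status are all assumptions for illustration; in practice `fetch_status` would wrap `GET /jobs/{job_id}`.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval)


# Usage with a fake status source standing in for GET /jobs/{job_id}.
responses = iter([{"status": "queued"}, {"status": "running"}, {"status": "completed"}])
result = wait_for_job(lambda: next(responses), sleep=lambda _: None)
print(result)  # {'status': 'completed'}
```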
🏥 Health Checks
1. Basic Health Check
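from datetime import datetime, timezone

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        # datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "1.0.0"
    }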
2. Detailed Health Check (Recommended)
from datetime import datetime, timezone

@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),            # Always healthy if responding
        "database": await check_database(),  # Query DB
        "redis": await check_redis(),        # Ping Redis
        "s3": await check_s3(),              # List buckets
        "workers": await check_workers(),    # Check if workers alive
        "disk": await check_disk_space(),    # Check disk usage
    }
    overall_healthy = all(c["healthy"] for c in checks.values())
    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }
# Example response:
{
"status": "healthy",
"checks": {
"api": {"healthy": true, "latency_ms": 1},
"database": {"healthy": true, "latency_ms": 5},
"redis": {"healthy": true, "latency_ms": 2},
"s3": {"healthy": true, "latency_ms": 50},
"workers": {"healthy": true, "active_workers": 4},
"disk": {"healthy": true, "usage_percent": 45}
},
"timestamp": "2026-01-18T10:30:00Z"
}
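The detailed check awaits each dependency in turn, so one slow dependency delays the whole response and one exception fails the endpoint. A possible variation, sketched under assumptions, runs the checks concurrently with a per-check timeout. `run_checks` and the two demo checks are hypothetical and not part of the original API.

```python
import asyncio


async def run_checks(checks: dict, timeout: float = 2.0) -> dict:
    """Run all checks concurrently; a slow or crashing check counts as unhealthy."""

    async def guarded(check):
        try:
            return await asyncio.wait_for(check(), timeout)
        except Exception as exc:  # Includes timeouts raised by wait_for
            return {"healthy": False, "error": type(exc).__name__}

    results = await asyncio.gather(*(guarded(c) for c in checks.values()))
    report = dict(zip(checks.keys(), results))
    status = "healthy" if all(r["healthy"] for r in report.values()) else "degraded"
    return {"status": status, "checks": report}


async def ok():
    return {"healthy": True}


async def broken():
    raise ConnectionError("redis unreachable")


report = asyncio.run(run_checks({"database": ok, "redis": broken}))
print(report["status"])  # degraded
```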
3. Readiness vs Liveness (Kubernetes)
# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()
    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")
📊 Monitoring & Metrics
Prometheus Metrics
from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)
# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])
# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')
# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
🔐 Security
1. API Keys
@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Process request...
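`validate_api_key` is not defined in this document. One possible shape is sketched below, assuming issued keys are stored as SHA-256 digests rather than in plain text. The `API_KEY_DIGESTS` store and the demo key are hypothetical; a real deployment would load the digests from the database.

```python
import hashlib
import hmac

# Hypothetical store: SHA-256 digests of issued keys (never the raw keys).
API_KEY_DIGESTS = {hashlib.sha256(b"demo-key-123").hexdigest()}


def validate_api_key(api_key: str) -> bool:
    """Return True if the key's digest matches a stored digest."""
    digest = hashlib.sha256(api_key.encode()).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return any(hmac.compare_digest(digest, known) for known in API_KEY_DIGESTS)


print(validate_api_key("demo-key-123"), validate_api_key("wrong"))  # True False
```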
2. Rate Limiting
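from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address
limiter = Limiter(key_func=get_remote_address)
# Register the limiter and its 429 handler on the FastAPI app
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
async def scrape(request: Request, body: ScrapeRequest):
    # Process request...
    ...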
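To make the "10/minute" rule concrete, here is a minimal sliding-window limiter in plain Python. It only illustrates the idea and is not how slowapi is implemented; `SlidingWindowLimiter` and the injectable `clock` are hypothetical, and the state lives in one process, so it would not be shared across several API servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each client key."""

    def __init__(self, limit: int = 10, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key: str) -> bool:
        now = self.clock()
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
demo_limiter = SlidingWindowLimiter(limit=10, window=60.0, clock=lambda: fake_now[0])
first_eleven = [demo_limiter.allow("203.0.113.7") for _ in range(11)]
fake_now[0] = 60.0
after_window = demo_limiter.allow("203.0.113.7")
print(first_eleven.count(True), after_window)  # 10 True
```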
3. Webhook Signatures
import hashlib
import hmac

def verify_webhook_signature(payload, signature, secret):
    # payload is the raw request body as a string, exactly as received
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(signature, expected)
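A round trip shows how the sender and receiver sides fit together. In this sketch `sign_payload` is a hypothetical sender-side helper, and the secret and payload are the example values used earlier in this document. The key point is that both sides must hash the exact same body string.

```python
import hashlib
import hmac
import json


def sign_payload(payload: str, secret: str) -> str:
    """Sender side: hex HMAC-SHA256 of the exact body string."""
    return hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()


def verify_webhook_signature(payload: str, signature: str, secret: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(signature, sign_payload(payload, secret))


body = json.dumps({"event": "job.completed", "job_id": "550e8400-..."})
signature = sign_payload(body, "secret123")

print(verify_webhook_signature(body, signature, "secret123"))        # True
print(verify_webhook_signature(body + " ", signature, "secret123"))  # False
```

Even a single extra space in the body invalidates the signature, which is why a receiver should verify against the raw request body rather than a re-serialized JSON object.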
🚀 Deployment Options
Option 1: Docker Compose (Development)
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      # Placeholder credentials; must match the db service below
      - DATABASE_URL=postgresql://postgres:change-me@db:5432/scraper
    depends_on:
      - redis
      - db
  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 4
  redis:
    image: redis:7-alpine
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
      # Placeholder; the postgres image will not start without a password
      - POSTGRES_PASSWORD=change-me
Option 2: Kubernetes (Production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    metadata:
      labels:
        app: scraper-api  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: scraper-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: REDIS_URL
              value: redis://redis:6379
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: worker
          image: scraper-worker:latest
📈 Scaling Considerations
Horizontal Scaling
1 Worker = 3 jobs/minute (20s per job)
10 Workers = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
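The throughput figures above follow from simple arithmetic, which can also be inverted to size a worker pool. A sketch, assuming a constant 20 s per job and workers that are never idle, so the results are upper bounds. `jobs_per_day` and `workers_needed` are hypothetical helper names.

```python
import math

SECONDS_PER_DAY = 86_400


def jobs_per_day(workers: int, seconds_per_job: float = 20.0) -> int:
    """Theoretical ceiling, assuming workers are never idle."""
    return int(workers * SECONDS_PER_DAY / seconds_per_job)


def workers_needed(target_jobs_per_day: int, seconds_per_job: float = 20.0) -> int:
    """Smallest worker count whose ceiling meets the target."""
    return math.ceil(target_jobs_per_day * seconds_per_job / SECONDS_PER_DAY)


print(jobs_per_day(100))       # 432000
print(workers_needed(10_000))  # 3
```

Real capacity will be lower because job durations vary with review count and jobs rarely arrive evenly across the day.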
Resource Requirements (per worker)
CPU: 1-2 cores (Chrome is CPU-intensive)
RAM: 2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)
Auto-scaling (Kubernetes HPA)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
        target:
          type: Value
          value: "10"  # Scale up if queue > 10
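To see how this configuration reacts to queue depth, the HPA's documented core calculation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), can be sketched in a few lines. This simplified version ignores the controller's tolerance band and stabilization window; `desired_replicas` is a hypothetical helper whose defaults mirror the manifest above.

```python
import math


def desired_replicas(
    current: int,
    queue_size: float,
    target: float = 10.0,
    min_replicas: int = 2,
    max_replicas: int = 50,
) -> int:
    """Simplified HPA rule: scale proportionally to metric/target, then clamp."""
    raw = math.ceil(current * queue_size / target)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(current=4, queue_size=35))     # 14
print(desired_replicas(current=4, queue_size=0))      # 2
print(desired_replicas(current=10, queue_size=1000))  # 50
```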
✅ Recommended Stack
For Small-Medium (< 1000 jobs/day):
✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)
For Large Scale (> 10,000 jobs/day):
✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)
🎯 Next Steps
Would you like me to implement:
1. ✅ Webhooks - Full webhook support with retries
2. ✅ Redis Queue - Job queue with Celery/RQ
3. ✅ PostgreSQL - Job metadata storage
4. ✅ S3 Storage - Reviews file storage
5. ✅ Health Checks - Detailed health endpoints
6. ✅ SSE Streaming - Real-time progress updates (optional)
7. ✅ Docker Setup - Complete docker-compose.yml
My recommendation: Start with #1-5 (core production features), add #6-7 later if needed.
Let me know which to implement first!