Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation time: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions


@@ -0,0 +1,833 @@
# Production Microservice Architecture
## Google Reviews Scraper API
---
## 🎯 Recommended Communication Patterns
### 1. **Webhooks** (Primary - RECOMMENDED) ✅
**Best for**: Production async job processing
```
Client → POST /scrape (with webhook_url)
Server → Starts job, returns job_id
[Scraping in progress...]
Server → POST to client's webhook_url when complete
         {
           "job_id": "...",
           "status": "completed",
           "reviews_count": 244,
           "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
         }
```
**Advantages**:
- ✅ No polling needed (reduces server load)
- ✅ Instant notifications when job completes
- ✅ Industry standard (Stripe, GitHub, Twilio use this)
- ✅ Client can go offline and come back
- ✅ Scales to millions of jobs
**Use cases**:
- Batch processing systems
- Integration with other services
- When client has a public endpoint
---
### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡
**Best for**: Real-time progress monitoring
```
Client → GET /jobs/{job_id}/stream (keeps connection open)
Server → Sends progress updates in real-time:
         data: {"stage": "scrolling", "reviews_loaded": 50}
         data: {"stage": "scrolling", "reviews_loaded": 100}
         data: {"stage": "extracting", "reviews_loaded": 244}
         data: {"stage": "completed", "total": 244}
```
**Advantages**:
- ✅ Real-time progress updates
- ✅ HTTP-based (works through firewalls)
- ✅ Lightweight (one-way communication)
- ✅ Auto-reconnection support
- ✅ Great for dashboards/UIs
**Use cases**:
- Web dashboards
- Real-time monitoring
- Progress bars in UI
---
### 3. **Polling** (Fallback) 🔄
**Best for**: Simple clients, no webhook capability
```
Client → POST /scrape
Server → Returns job_id
Client → Polls GET /jobs/{job_id} every 2-5 seconds
Server → Returns current status
```
**Advantages**:
- ✅ Simple to implement
- ✅ Works everywhere (no public endpoint needed)
- ✅ Firewall-friendly
**Disadvantages**:
- ❌ Inefficient (many wasted requests)
- ❌ Delayed notifications (polling interval)
- ❌ Higher server load
**Use cases**:
- Internal tools
- Clients behind firewalls
- Simple integrations
---
## 🏛️ Complete Production Architecture
```
┌────────────────────────────────────────────────────────┐
│                     LOAD BALANCER                      │
│                    (nginx/AWS ALB)                     │
└──────────┬─────────────────────────────────┬───────────┘
           │                                 │
           ▼                                 ▼
┌──────────────────────┐          ┌──────────────────────┐
│  API Server 1        │          │  API Server 2        │
│  (FastAPI)           │          │  (FastAPI)           │
│  - REST endpoints    │          │  - REST endpoints    │
│  - Health checks     │          │  - Health checks     │
│  - Job management    │          │  - Job management    │
└──────────┬───────────┘          └──────────┬───────────┘
           │                                 │
           └────────────────┬────────────────┘
                ┌───────────────────────┐
                │   REDIS / RabbitMQ    │
                │      (Job Queue)      │
                │                       │
                │  - Pending jobs       │
                │  - Job distribution   │
                │  - Pub/Sub for events │
                └───────────┬───────────┘
             ┌──────────────┴──────────────┐
             │                             │
             ▼                             ▼
       ┌─────────────┐               ┌─────────────┐
       │  Worker 1   │               │  Worker 2   │
       │             │               │             │
       │  - Scraping │               │  - Scraping │
       │  - Headless │               │  - Headless │
       │  - Chrome   │               │  - Chrome   │
       └─────┬───────┘               └─────┬───────┘
             │                             │
             └──────────────┬──────────────┘
            ┌───────────────────────────────┐
            │      PERSISTENT STORAGE       │
            │                               │
            │  ┌─────────────────────────┐  │
            │  │ PostgreSQL / MongoDB    │  │
            │  │ - Job metadata          │  │
            │  │ - Status tracking       │  │
            │  │ - Webhook configs       │  │
            │  └─────────────────────────┘  │
            │                               │
            │  ┌─────────────────────────┐  │
            │  │ File Storage / S3       │  │
            │  │ - Review JSON files     │  │
            │  │ - Large payloads        │  │
            │  └─────────────────────────┘  │
            └───────────────────────────────┘
                 ┌─────────────────────┐
                 │ Webhook Dispatcher  │
                 │ - Retry logic       │
                 │ - Dead letter queue │
                 └─────────────────────┘
                 [Client's webhook URL]
```
---
## 📦 Component Breakdown
### 1. **API Server** (FastAPI)
**Responsibilities**:
- Handle HTTP requests
- Validate input
- Enqueue jobs
- Serve results
- Health checks
**Endpoints**:
```python
POST   /scrape              # Submit job
GET    /jobs/{id}           # Get job status
GET    /jobs/{id}/reviews   # Get results
GET    /jobs/{id}/stream    # SSE progress stream
DELETE /jobs/{id}           # Cancel job
GET    /health              # Health check
GET    /metrics             # Prometheus metrics
```
---
### 2. **Job Queue** (Redis or RabbitMQ)
**Why needed**:
- Decouple API from scraping workers
- Distribute load across workers
- Retry failed jobs
- Handle backpressure
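The decoupling and backpressure points above can be sketched in-process. This is only an illustration: the standard-library `queue.Queue` stands in for Redis or RabbitMQ, and `MAX_PENDING`, `submit_job`, and `worker_loop` are hypothetical names that do not come from the scraper's code.

```python
import queue
import threading

MAX_PENDING = 2  # hypothetical cap on queued jobs; tune to worker capacity

job_queue = queue.Queue(maxsize=MAX_PENDING)


def submit_job(job_id: str) -> bool:
    """API side: enqueue without blocking; False signals backpressure."""
    try:
        job_queue.put_nowait(job_id)
        return True
    except queue.Full:
        return False  # the API could answer HTTP 429 or 503 here


def worker_loop(results: list) -> None:
    """Worker side: pull jobs until a None sentinel arrives."""
    while True:
        job_id = job_queue.get()
        if job_id is None:
            break
        results.append(f"scraped:{job_id}")  # placeholder for the real scrape


# Three submissions against a queue of size two: the third is rejected.
accepted = [submit_job(j) for j in ("job-1", "job-2", "job-3")]

done = []
worker = threading.Thread(target=worker_loop, args=(done,))
worker.start()
job_queue.put(None)  # blocks until the worker frees a slot, then stops it
worker.join()
```

The API never waits on a scrape: it either enqueues the job or reports that the system is saturated, and the worker drains the queue at its own pace.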
**Options**:
**Option A: Redis** (Recommended for simpler setups)
Fast, simple, good for most use cases:
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage
**Option B: RabbitMQ** (For complex workflows)
More features, better for complex scenarios:
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues
**Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed.
---
### 3. **Worker Processes** (Celery or Custom)
**Responsibilities**:
- Pull jobs from queue
- Run scraping (headless Chrome)
- Save results
- Send webhooks
- Update job status
**Scaling**:
```bash
# Run 4 workers on same machine
celery -A worker worker --concurrency=4
# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &
# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10
```
---
### 4. **Database** (PostgreSQL or MongoDB)
**Job Metadata Schema**:
**PostgreSQL** (Recommended):
```sql
CREATE TABLE jobs (
    job_id            UUID PRIMARY KEY,
    status            VARCHAR(20) NOT NULL,
    url               TEXT NOT NULL,
    webhook_url       TEXT,
    created_at        TIMESTAMP NOT NULL,
    started_at        TIMESTAMP,
    completed_at      TIMESTAMP,
    reviews_count     INTEGER,
    reviews_file_path TEXT,
    error_message     TEXT,
    metadata          JSONB
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
```
**Why PostgreSQL**:
- ✅ ACID transactions
- ✅ Good for structured data
- ✅ SQL queries
- ✅ Mature ecosystem
**Alternative - MongoDB**:
```javascript
{
  _id: ObjectId("..."),
  job_id: "550e8400-...",
  status: "completed",
  url: "https://...",
  webhook_url: "https://...",
  created_at: ISODate("2026-01-18T..."),
  reviews_count: 244,
  reviews_file: "/data/reviews/550e8400.json",
  metadata: { ... }
}
```
**Why MongoDB**:
- ✅ Flexible schema
- ✅ Good for document storage
- ✅ Built-in sharding
**Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions)
---
### 5. **File Storage**
**Options**:
**Option A: Local Filesystem** (Development/Small scale)
```text
/data/reviews/
    550e8400-e29b-41d4-a716-446655440000.json
    6a1f9b2c-3d4e-4f5a-8b6c-9d0e1f2a3b4c.json
    ...
```
**Option B: S3 / Object Storage** (Production - RECOMMENDED)
```text
s3://scraper-reviews-bucket/
    2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
    2026/01/18/6a1f9b2c-3d4e-4f5a-8b6c-9d0e1f2a3b4c.json
    ...
```
**Why S3**:
- ✅ Virtually unlimited storage
- ✅ No disk management
- ✅ High availability
- ✅ Versioning support
- ✅ Pre-signed URLs for direct access
- ✅ Lifecycle policies (auto-delete old files)
**Recommendation**: **S3 (or compatible)** for production
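The date-partitioned layout shown for S3 could be produced by a small helper like the one below. This is a sketch: `review_object_key` is a hypothetical name, and it assumes the completion time is a timezone-aware timestamp that gets normalized to UTC.

```python
from datetime import datetime, timezone
from uuid import UUID


def review_object_key(job_id: UUID, completed_at: datetime) -> str:
    """Build a key of the form YYYY/MM/DD/<job_id>.json (UTC date)."""
    day = completed_at.astimezone(timezone.utc)
    return f"{day:%Y/%m/%d}/{job_id}.json"


key = review_object_key(
    UUID("550e8400-e29b-41d4-a716-446655440000"),
    datetime(2026, 1, 18, 10, 30, 20, tzinfo=timezone.utc),
)
print(key)
```

Grouping objects under a date prefix keeps listings manageable and fits naturally with prefix-based lifecycle rules for expiring old files.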
---
### 6. **Webhook Dispatcher**
**Features**:
- ✅ Retry logic (exponential backoff)
- ✅ Dead letter queue for failed webhooks
- ✅ Webhook signatures (HMAC for security)
- ✅ Timeout handling
- ✅ Async delivery
**Implementation**:
```python
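import asyncio
import hashlib
import hmac
import json

import httpx

# WEBHOOK_SECRET (bytes) and save_to_dead_letter_queue() are assumed to be
# defined elsewhere in the application.


async def send_webhook(webhook_url, payload, max_retries=3):
    # Serialize once so the signed bytes are exactly the bytes that are sent
    body = json.dumps(payload).encode()
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,
    }

    async with httpx.AsyncClient(timeout=10.0) as client:
        for attempt in range(max_retries):
            try:
                response = await client.post(webhook_url, content=body, headers=headers)
                if response.is_success:  # any 2xx counts as delivered
                    return True
            except httpx.HTTPError:
                pass  # network error or timeout: fall through to the retry

            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff

    # Every attempt failed (exception or non-2xx): move to dead letter queue
    await save_to_dead_letter_queue(webhook_url, payload)
    return False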
```
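The retry loop above waits `2 ** attempt` seconds between attempts. A hedged sketch of how that schedule could be made explicit, capped, and optionally jittered follows; `backoff_delays`, `base`, `cap`, and `jitter` are illustrative names, not part of the dispatcher.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base: float = 1.0,
    cap: float = 60.0,
    jitter: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays to sleep after each failed attempt except the last one."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries - 1):
        delay = min(cap, base * 2 ** attempt)  # 1, 2, 4, ... up to the cap
        if jitter:
            # "Full jitter": pick a random wait in [0, delay] so that many
            # failing webhooks do not all retry at the same instant.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(3))   # schedule matching max_retries=3 in the loop above
print(backoff_delays(8))   # longer schedule, capped at 60 seconds
```

With the default `max_retries=3` there are only two waits, so a receiver that is down for more than a few seconds will land in the dead letter queue; a larger retry budget with a cap may suit production better.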
---
## 🔥 Complete Workflow Examples
### Workflow 1: **Webhooks** (Production)
```python
# 1. Client submits job with webhook
POST /scrape
{
    "url": "https://maps.google.com/...",
    "webhook_url": "https://client.com/webhook",
    "webhook_secret": "secret123"  # For signature verification
}

Response:
{
    "job_id": "550e8400-...",
    "status": "queued",
    "estimated_time": "20s"
}

# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)

# 3. Worker picks up job
job_id = get_from_queue()
reviews = fast_scrape_reviews(url)

# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)

# 5. Update database
db.jobs.update(job_id, {
    "status": "completed",
    "reviews_count": 244,
    "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
})

# 6. Send webhook to client
POST https://client.com/webhook
Headers:
    X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
    "event": "job.completed",
    "job_id": "550e8400-...",
    "status": "completed",
    "reviews_count": 244,
    "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
    "completed_at": "2026-01-18T10:30:20Z"
}

# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
```
---
### Workflow 2: **SSE Streaming** (Real-time Dashboard)
```python
import asyncio
import json

# 1. Client opens SSE connection (browser side):
#    new EventSource("/jobs/{job_id}/stream")

# 2. Server streams progress updates (get_job is assumed to exist)
async def stream_progress(job_id):
    while True:
        job = get_job(job_id)
        payload = json.dumps({
            "stage": job.stage,
            "reviews_loaded": job.reviews_loaded,
            "progress": job.progress_percent,
        })
        yield f"data: {payload}\n\n"
        if job.status in ("completed", "failed"):
            break
        await asyncio.sleep(1)  # Update every second

# 3. Client receives updates
# onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress": 20}
# onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress": 40}
# onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress": 60}
# onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress": 100}
# onmessage: {"stage": "completed", "total": 244}
```
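Each SSE message on the wire is a set of `field: value` lines terminated by a blank line. A minimal formatter, assuming JSON payloads, might look like this; `format_sse` is a hypothetical helper, and the `event` and `id` fields are optional parts of the SSE format that the workflow above does not use.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Event frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")    # lets a reconnecting client resume
    if event is not None:
        lines.append(f"event: {event}")    # named event type
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"       # blank line ends the frame


frame = format_sse({"stage": "scrolling", "reviews_loaded": 50},
                   event="progress", event_id="1")
print(frame, end="")
```

Sending an `id` with each frame is what makes the browser's automatic reconnection useful: the client reports the last id it saw in the `Last-Event-ID` header, so the server can resume instead of restarting the stream.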
---
### Workflow 3: **Polling** (Simple Clients)
```python
# 1. Submit job (no webhook)
POST /scrape
{
    "url": "https://maps.google.com/..."
}

Response:
{
    "job_id": "550e8400-...",
    "status": "queued"
}

# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}
    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break
    sleep(3)
```
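The polling loop above is pseudocode. A runnable sketch of the same idea, with a timeout added so a stuck job cannot make the client loop forever, is shown below. `poll_job` and its parameters are hypothetical, the HTTP call is injected as `fetch_status` so the example has no network dependency, and the intermediate `"running"` status is an assumption.

```python
import time
from typing import Callable


def poll_job(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status until the job is completed or failed."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        if clock() + interval > deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)


# Simulated server responses instead of real GET /jobs/{job_id} calls.
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "reviews_count": 244},
])
waits = []
final = poll_job(lambda: next(responses), sleep=waits.append)
print(final["status"], waits)
```

In a real client, `fetch_status` would wrap the `GET /jobs/{job_id}` request and return the parsed JSON body.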
---
## 🏥 Health Checks
### 1. **Basic Health Check**
```python
from datetime import datetime, timezone

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "1.0.0"
    }
```
### 2. **Detailed Health Check** (Recommended)
```python
from datetime import datetime, timezone

@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),            # Always healthy if responding
        "database": await check_database(),  # Query DB
        "redis": await check_redis(),        # Ping Redis
        "s3": await check_s3(),              # List buckets
        "workers": await check_workers(),    # Check if workers alive
        "disk": await check_disk_space(),    # Check disk usage
    }
    overall_healthy = all(c["healthy"] for c in checks.values())
    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }

# Example response:
# {
#     "status": "healthy",
#     "checks": {
#         "api": {"healthy": true, "latency_ms": 1},
#         "database": {"healthy": true, "latency_ms": 5},
#         "redis": {"healthy": true, "latency_ms": 2},
#         "s3": {"healthy": true, "latency_ms": 50},
#         "workers": {"healthy": true, "active_workers": 4},
#         "disk": {"healthy": true, "usage_percent": 45}
#     },
#     "timestamp": "2026-01-18T10:30:00Z"
# }
```
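The detailed check above awaits each dependency in turn, so one slow dependency delays the whole response. One possible refinement, sketched here with hypothetical helpers `run_check` and `run_all`, runs the checks concurrently and bounds each with a timeout. The `ok` and `hangs` coroutines are stand-ins for real dependency checks.

```python
import asyncio
import time


async def run_check(name, check, timeout):
    """Run one check; report health and latency, never raise."""
    started = time.perf_counter()
    try:
        await asyncio.wait_for(check(), timeout)
        result = {"healthy": True}
    except Exception as exc:  # includes the timeout raised by wait_for
        result = {"healthy": False, "error": type(exc).__name__}
    result["latency_ms"] = round((time.perf_counter() - started) * 1000)
    return name, result


async def run_all(checks, timeout=2.0):
    """Run all checks concurrently and summarize them."""
    pairs = await asyncio.gather(
        *(run_check(name, check, timeout) for name, check in checks.items())
    )
    results = dict(pairs)
    healthy = all(r["healthy"] for r in results.values())
    return {"status": "healthy" if healthy else "degraded", "checks": results}


async def ok():
    await asyncio.sleep(0)       # stand-in for a fast, passing check


async def hangs():
    await asyncio.sleep(10)      # stand-in for an unresponsive dependency


report = asyncio.run(run_all({"database": ok, "redis": hangs}, timeout=0.05))
print(report["status"])
```

The endpoint's total latency is then bounded by the timeout rather than by the sum of all check durations.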
### 3. **Readiness vs Liveness** (Kubernetes)
```python
# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()
    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")
```
---
## 📊 Monitoring & Metrics
### Prometheus Metrics
```python
from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)

# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])

# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')

# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
---
## 🔐 Security
### 1. **API Keys**
```python
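from fastapi import Header, HTTPException

# ScrapeRequest and validate_api_key() are assumed to be defined elsewhere.

@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Process request...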
```
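The endpoint above calls `validate_api_key` without defining it. One possible shape for it, offered only as a sketch, stores SHA-256 hashes of issued keys rather than the keys themselves and compares them with `hmac.compare_digest`. The in-memory `VALID_KEY_HASHES` set and the demo key are assumptions; a real service would likely load hashes from its database.

```python
import hashlib
import hmac


def hash_key(api_key: str) -> str:
    """Hash a key so the raw value never needs to be stored."""
    return hashlib.sha256(api_key.encode()).hexdigest()


# Hypothetical key store holding only hashes of issued keys.
VALID_KEY_HASHES = {hash_key("demo-key-123")}


def validate_api_key(api_key: str) -> bool:
    candidate = hash_key(api_key)
    # compare_digest avoids leaking, through timing, how much of the
    # candidate matched a stored hash.
    return any(
        hmac.compare_digest(candidate, stored) for stored in VALID_KEY_HASHES
    )


print(validate_api_key("demo-key-123"), validate_api_key("wrong-key"))
```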
### 2. **Rate Limiting**
```python
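from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
async def scrape(request: Request, body: ScrapeRequest):
    ...  # Process request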
```
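To make the `10/minute` rule concrete, here is a small sliding-window limiter. It only illustrates the idea and is not how slowapi is implemented; `SlidingWindowLimiter` and the injected `clock` are hypothetical, and state lives in process memory, so it would not be shared across several API servers.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls per client within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # client id -> timestamps of calls

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self.hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# A fake clock makes the behaviour easy to demonstrate.
fake_now = [0.0]
demo = SlidingWindowLimiter(limit=10, window_seconds=60, clock=lambda: fake_now[0])
first_eleven = [demo.allow("203.0.113.7") for _ in range(11)]
fake_now[0] = 60.0  # one full window later
after_window = demo.allow("203.0.113.7")
print(first_eleven.count(True), after_window)
```

With more than one API server behind the load balancer, the counters would need to live in shared storage such as Redis for the limit to hold globally.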
### 3. **Webhook Signatures**
```python
import hashlib
import hmac

def verify_webhook_signature(payload, signature, secret):
    # payload must be the raw request body (str), exactly as received
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(signature, expected)
```
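A round trip shows why the signature must be computed and verified over the same raw bytes. The sketch below uses hypothetical helpers `sign_payload` and `verify_signature` and the example secret from the webhook workflow; it is an illustration, not the service's actual signing code.

```python
import hashlib
import hmac
import json


def sign_payload(body: bytes, secret: str) -> str:
    """Sender side: HMAC-SHA256 over the exact bytes being sent."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    """Receiver side: recompute over the raw body and compare safely."""
    return hmac.compare_digest(sign_payload(body, secret), signature)


secret = "secret123"
body = json.dumps({"event": "job.completed", "reviews_count": 244}).encode()
signature = sign_payload(body, secret)

# Any change to the body, even one that is still valid JSON, breaks the match.
tampered = body.replace(b"244", b"999")

print(verify_signature(body, signature, secret))
print(verify_signature(tampered, signature, secret))
```

The receiver should verify against the request body as received, before parsing it: re-serializing parsed JSON can change whitespace or key order and produce a different digest.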
---
## 🚀 Deployment Options
### Option 1: **Docker Compose** (Development)
```yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://scraper:change-me@db:5432/scraper
    depends_on:
      - redis
      - db
  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 4
  redis:
    image: redis:7-alpine
  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
      - POSTGRES_USER=scraper
      # Placeholder: the official postgres image will not start without a password
      - POSTGRES_PASSWORD=change-me
```
### Option 2: **Kubernetes** (Production)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    metadata:
      labels:
        app: scraper-api  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: scraper-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: REDIS_URL
              value: redis://redis:6379
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: worker
          image: scraper-worker:latest
```
---
## 📈 Scaling Considerations
### Horizontal Scaling
```
1 Worker = 3 jobs/minute (20s per job)
10 Workers = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
```
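The throughput figures above follow from simple arithmetic, which the sketch below makes reusable for other job durations. `jobs_per_day`, `workers_needed`, and the `utilization` factor are illustrative names, and the model assumes each worker runs one job at a time with no idle gaps unless `utilization` says otherwise.

```python
import math

SECONDS_PER_DAY = 86_400


def jobs_per_day(workers: int, seconds_per_job: float = 20.0) -> int:
    """Upper bound on daily throughput for a fleet of workers."""
    return int(workers * SECONDS_PER_DAY / seconds_per_job)


def workers_needed(target_jobs_per_day: int, seconds_per_job: float = 20.0,
                   utilization: float = 1.0) -> int:
    """Workers required for a target, given the fraction of time spent busy."""
    per_worker = SECONDS_PER_DAY / seconds_per_job * utilization
    return math.ceil(target_jobs_per_day / per_worker)


print(jobs_per_day(100))                         # the 100-worker row above
print(workers_needed(432_000))                   # inverse of that row
print(workers_needed(10_000, utilization=0.7))   # with some idle headroom
```

The 20 s per job figure matters a great deal here: the test run in the commit message took about 150 s for 550 reviews, so listings with many reviews would lower these numbers considerably.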
### Resource Requirements (per worker)
```
CPU: 1-2 cores (Chrome is CPU-intensive)
RAM: 2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)
```
### Auto-scaling (Kubernetes HPA)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
        target:
          type: Value
          value: "10"  # Scale up if queue > 10
```
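For intuition about how the autoscaler reacts to queue depth, the core of the HPA calculation is `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The sketch below is a simplification: `desired_replicas` is a hypothetical helper, it assumes the default 10% tolerance, and it ignores stabilization windows and scaling policies that the real controller applies.

```python
import math


def desired_replicas(current_replicas: int, metric_value: float,
                     target_value: float, min_replicas: int = 2,
                     max_replicas: int = 50, tolerance: float = 0.1) -> int:
    """Simplified HPA rule for a `type: Value` target."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, metric_value=25, target_value=10))   # queue backlog
print(desired_replicas(40, metric_value=100, target_value=10)) # hits the cap
print(desired_replicas(4, metric_value=0, target_value=10))    # empty queue
```

Note that an `External` metric such as `redis_queue_size` is only visible to the HPA if a metrics adapter exposes it to the cluster.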
---
## ✅ Recommended Stack
### For Small-Medium (< 1000 jobs/day):
```
✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)
```
### For Large Scale (> 10,000 jobs/day):
```
✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)
```
---
## 🎯 Next Steps
Would you like me to implement:
1. **Webhooks** - Full webhook support with retries
2. **Redis Queue** - Job queue with Celery/RQ
3. **PostgreSQL** - Job metadata storage
4. **S3 Storage** - Reviews file storage
5. **Health Checks** - Detailed health endpoints
6. **SSE Streaming** - Real-time progress updates (optional)
7. **Docker Setup** - Complete docker-compose.yml
**My recommendation**: Start with **#1-5** (core production features), add #6-7 later if needed.
Let me know which to implement first!