
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00


Production Microservice Architecture

Google Reviews Scraper API

🎯 Recommended Communication Patterns

1. Webhooks (Primary - RECOMMENDED) ✅

Best for: Production async job processing

Client → POST /scrape (with webhook_url)
         ↓
Server → Starts job, returns job_id
         ↓
         [Scraping in progress...]
         ↓
Server → POST to client's webhook_url when complete
         {
           "job_id": "...",
           "status": "completed",
           "reviews_count": 244,
           "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
         }

Advantages:

✅ No polling needed (reduces server load)
✅ Instant notifications when job completes
✅ Industry standard (Stripe, GitHub, Twilio use this)
✅ Client can go offline and come back
✅ Scales to millions of jobs

Use cases:

Batch processing systems
Integration with other services
When client has a public endpoint

2. Server-Sent Events (SSE) (Real-time Updates) ⚡

Best for: Real-time progress monitoring

Client → GET /jobs/{job_id}/stream (keeps connection open)
         ↓
Server → Sends progress updates in real-time:

         data: {"stage": "scrolling", "reviews_loaded": 50}

         data: {"stage": "scrolling", "reviews_loaded": 100}

         data: {"stage": "extracting", "reviews_loaded": 244}

         data: {"stage": "completed", "total": 244}

Advantages:

✅ Real-time progress updates
✅ HTTP-based (works through firewalls)
✅ Lightweight (one-way communication)
✅ Auto-reconnection support
✅ Great for dashboards/UIs

Use cases:

Web dashboards
Real-time monitoring
Progress bars in UI

3. Polling (Fallback) 🔄

Best for: Simple clients, no webhook capability

Client → POST /scrape
         ↓
Server → Returns job_id
         ↓
Client → Polls GET /jobs/{job_id} every 2-5 seconds
         ↓
Server → Returns current status

Advantages:

✅ Simple to implement
✅ Works everywhere (no public endpoint needed)
✅ Firewall-friendly

Disadvantages:

❌ Inefficient (many wasted requests)
❌ Delayed notifications (polling interval)
❌ Higher server load

Use cases:

Internal tools
Clients behind firewalls
Simple integrations

🏛️ Complete Production Architecture

┌─────────────────────────────────────────────────────────────┐
│                        LOAD BALANCER                         │
│                     (nginx/AWS ALB)                          │
└──────────┬──────────────────────────────────┬────────────────┘
           │                                  │
           ▼                                  ▼
┌──────────────────────┐         ┌──────────────────────┐
│   API Server 1       │         │   API Server 2       │
│   (FastAPI)          │         │   (FastAPI)          │
│   - REST endpoints   │         │   - REST endpoints   │
│   - Health checks    │         │   - Health checks    │
│   - Job management   │         │   - Job management   │
└──────────┬───────────┘         └──────────┬───────────┘
           │                                │
           └────────────┬───────────────────┘
                        ▼
           ┌────────────────────────┐
           │    REDIS / RabbitMQ    │
           │    (Job Queue)         │
           │                        │
           │  - Pending jobs        │
           │  - Job distribution    │
           │  - Pub/Sub for events  │
           └────────┬───────────────┘
                    │
                    ▼
     ┌──────────────┴──────────────┐
     │                             │
     ▼                             ▼
┌─────────────┐              ┌─────────────┐
│  Worker 1   │              │  Worker 2   │
│             │              │             │
│ - Scraping  │              │ - Scraping  │
│ - Headless  │              │ - Headless  │
│ - Chrome    │              │ - Chrome    │
└─────┬───────┘              └─────┬───────┘
      │                            │
      └────────────┬───────────────┘
                   ▼
    ┌──────────────────────────────┐
    │   PERSISTENT STORAGE          │
    │                               │
    │  ┌────────────────────────┐   │
    │  │  PostgreSQL / MongoDB  │   │
    │  │  - Job metadata        │   │
    │  │  - Status tracking     │   │
    │  │  - Webhook configs     │   │
    │  └────────────────────────┘   │
    │                               │
    │  ┌────────────────────────┐   │
    │  │  File Storage / S3     │   │
    │  │  - Review JSON files   │   │
    │  │  - Large payloads      │   │
    │  └────────────────────────┘   │
    └───────────────────────────────┘
                   │
                   ▼
         ┌─────────────────────┐
         │  Webhook Dispatcher │
         │  - Retry logic      │
         │  - Dead letter queue│
         └─────────────────────┘
                   │
                   ▼
         [Client's webhook URL]

📦 Component Breakdown

1. API Server (FastAPI)

Responsibilities:

Handle HTTP requests
Validate input
Enqueue jobs
Serve results
Health checks

Endpoints:

POST   /scrape              # Submit job
GET    /jobs/{id}           # Get job status
GET    /jobs/{id}/reviews   # Get results
GET    /jobs/{id}/stream    # SSE progress stream
DELETE /jobs/{id}           # Cancel job
GET    /health              # Health check
GET    /metrics             # Prometheus metrics

2. Job Queue (Redis or RabbitMQ)

Why needed:

Decouple API from scraping workers
Distribute load across workers
Retry failed jobs
Handle backpressure

Options:

Option A: Redis (Recommended for simpler setups)

# Fast, simple, good for most use cases
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage

Option B: RabbitMQ (For complex workflows)

# More features, better for complex scenarios
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues

Recommendation: Start with Redis, upgrade to RabbitMQ if needed.

3. Worker Processes (Celery or Custom)

Responsibilities:

Pull jobs from queue
Run scraping (headless Chrome)
Save results
Send webhooks
Update job status

Scaling:

# Run 4 workers on same machine
celery -A worker worker --concurrency=4

# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &

# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10

4. Database (PostgreSQL or MongoDB)

Job Metadata Schema:

PostgreSQL (Recommended):

CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_file_path TEXT,
    error_message TEXT,
    metadata JSONB
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
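
The `status` column is a free-form string, so it can help to guard updates with an explicit transition table. The following is a minimal sketch under assumptions: only `queued`, `completed` and `failed` appear in this document, while `running` and `cancelled` (for `DELETE /jobs/{id}`) and the `can_transition` helper are illustrative names.

```python
from typing import Dict, FrozenSet

# Assumed job lifecycle. "running" and "cancelled" are hypothetical states
# added for illustration; every name fits in VARCHAR(20).
ALLOWED_TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "queued": frozenset({"running", "cancelled"}),
    "running": frozenset({"completed", "failed", "cancelled"}),
    "completed": frozenset(),   # terminal
    "failed": frozenset(),      # terminal
    "cancelled": frozenset(),   # terminal
}


def can_transition(current: str, new: str) -> bool:
    """Return True if a job may move from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS.get(current, frozenset())
```

A worker would call `can_transition(job.status, "completed")` before writing the update, so that a late result cannot overwrite a cancelled job.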

Why PostgreSQL:

✅ ACID transactions
✅ Good for structured data
✅ SQL queries
✅ Mature ecosystem

Alternative - MongoDB:

{
  _id: ObjectId("..."),
  job_id: "550e8400-...",
  status: "completed",
  url: "https://...",
  webhook_url: "https://...",
  created_at: ISODate("2026-01-18T..."),
  reviews_count: 244,
  reviews_file: "/data/reviews/550e8400.json",
  metadata: { ... }
}

Why MongoDB:

✅ Flexible schema
✅ Good for document storage
✅ Built-in sharding

Recommendation: PostgreSQL for most cases (better for job queues and transactions)

5. File Storage

Options:

Option A: Local Filesystem (Development/Small scale)

/data/reviews/
  ├── 550e8400-e29b-41d4-a716-446655440000.json
  ├── 6a1f9b2c-3d4e-4f5a-8b7c-9d0e1f2a3b4c.json
  └── ...

Option B: S3 / Object Storage (Production - RECOMMENDED)

s3://scraper-reviews-bucket/
  ├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
  ├── 2026/01/18/6a1f9b2c-3d4e-4f5a-8b7c-9d0e1f2a3b4c.json
  └── ...

Why S3:

✅ Unlimited storage
✅ No disk management
✅ High availability
✅ Versioning support
✅ Pre-signed URLs for direct access
✅ Lifecycle policies (auto-delete old files)

Recommendation: S3 (or compatible) for production
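
The date-partitioned layout shown for S3 can be produced by a small key builder. This is a sketch only: `review_object_key` is a hypothetical helper, and partitioning by the UTC completion date is an assumption about how the prefix is chosen.

```python
from datetime import datetime, timezone
from uuid import UUID


def review_object_key(job_id: UUID, completed_at: datetime) -> str:
    """Build a key such as 2026/01/18/<job_id>.json from a timezone-aware timestamp."""
    day = completed_at.astimezone(timezone.utc)  # normalize so prefixes are in UTC
    return f"{day:%Y/%m/%d}/{job_id}.json"


key = review_object_key(
    UUID("550e8400-e29b-41d4-a716-446655440000"),
    datetime(2026, 1, 18, 10, 30, 20, tzinfo=timezone.utc),
)
```

Date prefixes make lifecycle rules and manual cleanup simpler, because a whole day of results shares one prefix.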

6. Webhook Dispatcher

Features:

✅ Retry logic (exponential backoff)
✅ Dead letter queue for failed webhooks
✅ Webhook signatures (HMAC for security)
✅ Timeout handling
✅ Async delivery

Implementation:

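import asyncio
import hashlib
import hmac
import json

import httpx

async def send_webhook(webhook_url, payload, max_retries=3):
    # Serialize once so the signed bytes are exactly the bytes that are sent
    body = json.dumps(payload).encode()

    # Add signature (WEBHOOK_SECRET must be bytes)
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,
    }

    for attempt in range(max_retries):
        try:
            # Send with timeout
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    webhook_url, content=body, headers=headers, timeout=10.0
                )

            if 200 <= response.status_code < 300:
                return True

        except httpx.HTTPError:
            pass  # Network error or timeout: fall through to the retry below

        # Back off after a failed attempt, whether it raised or returned non-2xx
        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    # All attempts failed: move to dead letter queue
    await save_to_dead_letter_queue(webhook_url, payload)
    return False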
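
Plain exponential backoff can cause many failed deliveries to retry at the same moment. A common refinement is to cap the delay and add jitter. The sketch below is illustrative, not part of the dispatcher above: `backoff_bounds`, `backoff_delay`, `base` and `cap` are assumed names and defaults.

```python
import random
from typing import List, Optional


def backoff_bounds(max_retries: int, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Upper bound of the wait before each retry: base * 2**attempt, clamped to cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """'Full jitter' delay: a uniform draw between 0 and the capped bound."""
    source = rng if rng is not None else random
    bound = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, bound)
```

With the defaults, three retries have bounds of 1, 2 and 4 seconds, matching the `2 ** attempt` schedule used above, and the cap keeps later retries from growing without limit.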

🔥 Complete Workflow Examples

Workflow 1: Webhooks (Production)

# 1. Client submits job with webhook
POST /scrape
{
  "url": "https://maps.google.com/...",
  "webhook_url": "https://client.com/webhook",
  "webhook_secret": "secret123"  # For signature verification
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued",
  "estimated_time": "20s"
}

# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)

# 3. Worker picks up job
job_id = get_from_queue()
reviews = fast_scrape_reviews(url)

# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)

# 5. Update database
db.jobs.update(job_id, {
  "status": "completed",
  "reviews_count": 244,
  "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
})

# 6. Send webhook to client
POST https://client.com/webhook
Headers:
  X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
  "event": "job.completed",
  "job_id": "550e8400-...",
  "status": "completed",
  "reviews_count": 244,
  "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
  "completed_at": "2026-01-18T10:30:20Z"
}

# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...

Workflow 2: SSE Streaming (Real-time Dashboard)

# 1. Client opens SSE connection
EventSource("/jobs/{job_id}/stream")

# 2. Server streams progress updates
async def stream_progress(job_id):
    while True:
        job = get_job(job_id)

        # Build the payload first; a blank line terminates each SSE event
        payload = json.dumps({
            'stage': job.stage,
            'reviews_loaded': job.reviews_loaded,
            'progress_percent': job.progress_percent
        })
        yield f"data: {payload}\n\n"

        if job.status in ['completed', 'failed']:
            break

        await asyncio.sleep(1)  # Update every second

# 3. Client receives updates
onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress_percent": 20}
onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress_percent": 40}
onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress_percent": 60}
onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress_percent": 100}
onmessage: {"stage": "completed", "total": 244}
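
Browsers parse the stream through `EventSource`, but a non-browser client has to split the events itself. The following is a minimal sketch of that parsing, assuming the server only emits `data:` fields carrying JSON as in the stream above; `parse_sse` is a hypothetical helper and ignores `event:`, `id:` and `retry:` fields.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded JSON object per SSE event.

    Consecutive `data:` lines are joined with newlines; a blank line
    dispatches the event. Other fields and comment lines are skipped.
    """
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data_parts:
                yield json.loads("\n".join(data_parts))
                data_parts = []
        elif line.startswith("data:"):
            value = line[5:]
            # The SSE format strips a single leading space after the colon
            data_parts.append(value[1:] if value.startswith(" ") else value)


stream = [
    'data: {"stage": "scrolling", "reviews_loaded": 50}\n',
    "\n",
    'data: {"stage": "completed", "total": 244}\n',
    "\n",
]
events = list(parse_sse(stream))
```

In practice `lines` would be the line iterator of a streaming HTTP response rather than a list.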

Workflow 3: Polling (Simple Clients)

# 1. Submit job (no webhook)
POST /scrape
{
  "url": "https://maps.google.com/..."
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued"
}

# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}

    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break

    sleep(3)
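
The polling loop above can be written as a small reusable function. This is a sketch under assumptions: `poll_job` and its parameters are illustrative names, and `fetch_status` stands in for whatever HTTP call returns the decoded JSON body of `GET /jobs/{job_id}`.

```python
import time
from typing import Callable, Dict


def poll_job(fetch_status: Callable[[], Dict],
             interval: float = 3.0,
             max_attempts: int = 100,
             sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call fetch_status() until the job is completed or failed.

    Raises TimeoutError if the job is still not finished after
    max_attempts polls, so a stuck job cannot block the client forever.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval)
    raise TimeoutError(f"job not finished after {max_attempts} polls")


# Usage with canned responses instead of real HTTP calls
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "reviews_count": 244},
])
waits = []
result = poll_job(lambda: next(responses), interval=3.0, sleep=waits.append)
```

Passing the fetch and sleep functions as arguments keeps the loop independent of any HTTP library and easy to test.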

🏥 Health Checks

1. Basic Health Check

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": "1.0.0"
    }

2. Detailed Health Check (Recommended)

@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),           # Always healthy if responding
        "database": await check_database(), # Query DB
        "redis": await check_redis(),       # Ping Redis
        "s3": await check_s3(),            # List buckets
        "workers": await check_workers(),   # Check if workers alive
        "disk": await check_disk_space(),  # Check disk usage
    }

    overall_healthy = all(c["healthy"] for c in checks.values())

    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.utcnow().isoformat()
    }

# Example response:
{
  "status": "healthy",
  "checks": {
    "api": {"healthy": true, "latency_ms": 1},
    "database": {"healthy": true, "latency_ms": 5},
    "redis": {"healthy": true, "latency_ms": 2},
    "s3": {"healthy": true, "latency_ms": 50},
    "workers": {"healthy": true, "active_workers": 4},
    "disk": {"healthy": true, "usage_percent": 45}
  },
  "timestamp": "2026-01-18T10:30:00Z"
}
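
Each `check_*` helper above is left undefined. One way to produce the `healthy` and `latency_ms` fields shown in the example response is a shared wrapper around a probe coroutine. This is only a sketch: `timed_check`, the probe convention (raise on failure) and the 2 second timeout are assumptions.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict


async def timed_check(probe: Callable[[], Awaitable[None]],
                      timeout: float = 2.0) -> Dict:
    """Run one dependency probe and report health plus latency.

    A timeout or any exception marks the dependency unhealthy instead of
    failing the whole health endpoint.
    """
    start = time.perf_counter()
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
        healthy = True
    except Exception:
        healthy = False
    latency_ms = round((time.perf_counter() - start) * 1000)
    return {"healthy": healthy, "latency_ms": latency_ms}


async def probe_ok() -> None:
    return None


async def probe_down() -> None:
    raise ConnectionError("dependency unreachable")


ok_result = asyncio.run(timed_check(probe_ok))
down_result = asyncio.run(timed_check(probe_down))
```

A `check_redis()` could then be `await timed_check(ping_redis)`, with extra fields such as `active_workers` merged into the returned dictionary.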

3. Readiness vs Liveness (Kubernetes)

# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()

    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")

📊 Monitoring & Metrics

Prometheus Metrics

from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest
)

# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])

# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')

# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

🔐 Security

1. API Keys

@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Process request...
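
`validate_api_key` is not defined in this document. One possible shape, sketched under assumptions, stores only SHA-256 hashes of issued keys and compares them in constant time. The second parameter `stored_hashes` and the helper `hash_key` are illustrative additions, so the signature differs from the one-argument call above.

```python
import hashlib
import hmac
from typing import Iterable


def hash_key(api_key: str) -> str:
    """Hash a key so the plaintext never has to be stored."""
    return hashlib.sha256(api_key.encode()).hexdigest()


def validate_api_key(api_key: str, stored_hashes: Iterable[str]) -> bool:
    """Return True if the key's hash matches any stored hash."""
    candidate = hash_key(api_key)
    matched = False
    # Check every entry rather than returning early, so timing does not
    # reveal which stored key matched
    for stored in stored_hashes:
        if hmac.compare_digest(candidate, stored):
            matched = True
    return matched


issued = [hash_key("key-for-client-a")]  # example value only
```

In a real deployment the hashes would come from the database or a secrets store rather than a module-level list.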

2. Rate Limiting

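from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

# Register the limiter and its 429 handler on the FastAPI app
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
async def scrape(request: Request, body: ScrapeRequest):
    # Process request...
    ...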

3. Webhook Signatures

import hashlib
import hmac

def verify_webhook_signature(payload, signature, secret):
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(signature, expected)
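
The signature only verifies if both sides hash exactly the same bytes, so the receiver should verify the raw request body, not a re-serialized copy of the parsed JSON. The round trip below is a sketch of that property; `sign_payload` and `verify_signature` are illustrative names and the secret is the example value from the workflow section.

```python
import hashlib
import hmac
import json


def sign_payload(body: bytes, secret: bytes) -> str:
    """Hex HMAC-SHA256 of the exact bytes that will be sent."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(body: bytes, signature: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_payload(body, secret), signature)


secret = b"secret123"  # example value only
body = json.dumps({"event": "job.completed", "job_id": "550e8400-..."}).encode()
signature = sign_payload(body, secret)

accepted = verify_signature(body, signature, secret)
# A single extra byte, such as whitespace added by re-serializing, breaks the match
rejected = verify_signature(body + b" ", signature, secret)
```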

🚀 Deployment Options

Option 1: Docker Compose (Development)

version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
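    environment:
      - REDIS_URL=redis://redis:6379
      # Placeholder credentials for local development only
      - DATABASE_URL=postgresql://scraper:scraper@db:5432/scraper
    depends_on:
      - redis
      - db

  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
      # Workers update job status, so they need the database too
      - DATABASE_URL=postgresql://scraper:scraper@db:5432/scraper
    depends_on:
      - redis
      - db
    deploy:
      replicas: 4

  redis:
    image: redis:7-alpine

  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
      - POSTGRES_USER=scraper
      # The postgres image refuses to start without a password
      - POSTGRES_PASSWORD=scraper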

Option 2: Kubernetes (Production)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    metadata:
      labels:
        app: scraper-api  # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: api
        image: scraper-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: REDIS_URL
          value: redis://redis:6379
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker  # Must match spec.selector.matchLabels
    spec:
      containers:
      - name: worker
        image: scraper-worker:latest

📈 Scaling Considerations

Horizontal Scaling

1 Worker  = 3 jobs/minute (20s per job)
10 Workers = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
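
The figures above follow from simple arithmetic, which can be inverted to size a worker pool for a target volume. This sketch assumes each worker runs one job at a time with no idle gaps; `jobs_per_day` and `workers_needed` are illustrative helpers.

```python
import math

SECONDS_PER_DAY = 86_400


def jobs_per_day(workers: int, seconds_per_job: float) -> int:
    """Upper bound on daily throughput for a pool of sequential workers."""
    return int(workers * SECONDS_PER_DAY / seconds_per_job)


def workers_needed(target_jobs_per_day: int, seconds_per_job: float) -> int:
    """Smallest pool that can sustain the target volume, rounded up."""
    return math.ceil(target_jobs_per_day * seconds_per_job / SECONDS_PER_DAY)


capacity = jobs_per_day(100, 20)           # 432000, as in the table above
small_pool = workers_needed(10_000, 20)    # 3
```

Job duration dominates the result: at roughly 150 s per job, which is what a 550-review scrape took in the test run recorded for this file, `workers_needed(10_000, 150)` rises to 18.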

Resource Requirements (per worker)

CPU: 1-2 cores (Chrome is CPU-intensive)
RAM: 2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)

Auto-scaling (Kubernetes HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_queue_size
      target:
        type: Value
        value: "10"  # Scale up if queue > 10
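
To reason about how this autoscaler reacts to queue depth, the core of the documented HPA calculation is `ceil(currentReplicas * currentValue / targetValue)`, clamped to the replica limits. The sketch below is a simplification: it ignores the controller's tolerance band, stabilization windows and scaling policies, and `desired_replicas` is an illustrative helper.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int = 2,
                     max_replicas: int = 50) -> int:
    """Simplified HPA rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# Queue depth 40 against a target of 10 with 2 workers running
scaled_up = desired_replicas(2, 40, 10)       # 8
# A very deep queue is capped at maxReplicas
capped = desired_replicas(10, 1000, 10)       # 50
# An empty queue falls back to minReplicas
scaled_down = desired_replicas(4, 0, 10)      # 2
```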

✅ Recommended Stack

For Small-Medium (< 1000 jobs/day):

✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)

For Large Scale (> 10,000 jobs/day):

✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)

🎯 Next Steps

Would you like me to implement:

1. ✅ Webhooks - Full webhook support with retries
2. ✅ Redis Queue - Job queue with Celery/RQ
3. ✅ PostgreSQL - Job metadata storage
4. ✅ S3 Storage - Reviews file storage
5. ✅ Health Checks - Detailed health endpoints
6. ✅ SSE Streaming - Real-time progress updates (optional)
7. ✅ Docker Setup - Complete docker-compose.yml

My recommendation: Start with #1-5 (core production features), add #6-7 later if needed.

Let me know which to implement first!
