Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00

Production Deployment Guide

Phase 1: PostgreSQL + Webhooks + Health Checks

What's Included

Phase 1 Features:

✅ PostgreSQL Storage - Job metadata + reviews as JSONB
✅ Webhooks - Async notifications with retry logic and HMAC signatures
✅ Smart Health Checks - Canary testing every 4 hours to verify scraping works
✅ Fast Scraper - 18.9s average scraping time (8.2x faster)
✅ Docker Deployment - Easy deployment with Docker Compose
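The feature list mentions retry logic for webhooks but does not state the schedule. As a rough illustration only, exponential backoff could look like the sketch below. The attempt count, base delay, and cap are assumed values, and `send` is a hypothetical callable that returns True on a successful delivery; the actual dispatcher may behave differently.

```python
import time
from typing import Callable, List


def backoff_delays(max_attempts: int = 5, base: float = 1.0, cap: float = 60.0) -> List[float]:
    """Delays (in seconds) to wait after each failed attempt except the last."""
    return [min(cap, base * (2 ** i)) for i in range(max_attempts - 1)]


def deliver_with_retry(send: Callable[[], bool], max_attempts: int = 5,
                       base: float = 1.0, cap: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds.

    Returns the number of attempts used, or 0 if every attempt failed.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        if send():
            return attempt
        if attempt < max_attempts:
            # Wait longer after each failure: 1s, 2s, 4s, ... up to the cap.
            sleep(delays[attempt - 1])
    return 0
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.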

🚀 Quick Start (Docker)

1. Clone and Configure

# Copy environment file
cp .env.example .env

# Edit .env with your settings
nano .env

2. Start Services

# Build and start all services
docker-compose -f docker-compose.production.yml up -d

# Check logs
docker-compose -f docker-compose.production.yml logs -f api

3. Verify Health

# Check if API is running
curl http://localhost:8000/

# Check detailed health
curl http://localhost:8000/health/detailed | jq

Done! API is running on http://localhost:8000

🔧 Manual Installation

1. Install Dependencies

# Install Python dependencies
pip install -r requirements-production.txt

# Install PostgreSQL
# On macOS:
brew install postgresql@15
brew services start postgresql@15

# On Ubuntu:
sudo apt-get install postgresql-15

2. Setup Database

# Create user and database
psql postgres
CREATE USER scraper WITH PASSWORD 'scraper123';
-- On PostgreSQL 15+, the database owner can create tables in the public schema
CREATE DATABASE scraper OWNER scraper;
GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;
\q

3. Configure Environment

# Set environment variables
export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
export API_BASE_URL="http://localhost:8000"
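It can help to validate these variables at startup so that a misconfiguration fails fast. The sketch below is an assumption about how that could be done, not a description of what api_server_production.py does; `Settings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the required settings and fail early with a clear message."""
    missing = [name for name in ("DATABASE_URL", "API_BASE_URL") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    db = urlsplit(env["DATABASE_URL"])
    if db.scheme not in ("postgresql", "postgres") or not db.hostname:
        raise RuntimeError(
            "DATABASE_URL must look like postgresql://user:pass@host:port/dbname"
        )

    # Strip a trailing slash so URLs can be built as f"{api_base_url}/jobs/...".
    return Settings(env["DATABASE_URL"], env["API_BASE_URL"].rstrip("/"))
```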

4. Run Server

python api_server_production.py

Server runs on http://localhost:8000

📡 API Usage

1. Submit Job with Webhook

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your-secret-key"
  }'

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}

2. Check Status

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
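If webhooks are not an option, the status endpoint can be polled instead. A minimal sketch follows, assuming the status response is a JSON object with a `status` field and that `completed`, `failed`, and `cancelled` are the terminal states (inferred from the status names in the /stats response). `wait_for_job` and `fetch_status` are hypothetical names.

```python
import time
from typing import Callable, Dict

# Assumed terminal states, inferred from the status names in the /stats response.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], Dict], interval: float = 5.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Here `fetch_status` would typically wrap a GET request to /jobs/{job_id} and return the decoded JSON body. Keeping the HTTP call outside the loop makes the polling logic testable without a running server.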

3. Receive Webhook (When Complete)

Your webhook endpoint will receive:

POST https://your-server.com/webhook
Headers:
  X-Webhook-Signature: sha256=abc123...
  X-Webhook-Timestamp: 1705582800

Body:
{
  "event": "job.completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
  "timestamp": "2026-01-18T10:30:00Z"
}
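On the receiving side, the body above can be parsed into a typed object before use. This is a sketch based only on the example payload: the field names are copied from it, other event types may exist and are not handled, and `JobCompleted` and `parse_job_completed` are hypothetical names.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JobCompleted:
    job_id: str
    status: str
    reviews_count: int
    scrape_time: float
    reviews_url: str
    timestamp: datetime


def parse_job_completed(body: bytes) -> JobCompleted:
    """Parse a job.completed webhook body; raise ValueError for other events."""
    data = json.loads(body)
    if data.get("event") != "job.completed":
        raise ValueError(f"unexpected event: {data.get('event')!r}")
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so convert it to an explicit UTC offset first.
    sent_at = datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00"))
    return JobCompleted(
        job_id=data["job_id"],
        status=data["status"],
        reviews_count=int(data["reviews_count"]),
        scrape_time=float(data["scrape_time"]),
        reviews_url=data["reviews_url"],
        timestamp=sent_at,
    )
```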

4. Verify Webhook Signature

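import hashlib
import hmac
import os
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]

def verify_webhook(payload: bytes, signature: Optional[str], secret: str) -> bool:
    """Verify the HMAC-SHA256 signature of the raw request body."""
    # Reject requests with a missing or malformed signature header
    if not signature or not signature.startswith("sha256="):
        return False
    expected = signature.split("sha256=", 1)[1]
    computed = hmac.new(
        secret.encode(),
        payload,
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(expected, computed)

# In your webhook handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature")

    if not verify_webhook(payload, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process webhook...
    data = await request.json()
    job_id = data['job_id']

    # Download reviews (blocking call; consider an async HTTP client under load)
    reviews = requests.get(data['reviews_url'], timeout=30).json()
    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")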

5. Get Reviews

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq

🏥 Health Checks

Liveness (Is server alive?)

curl http://localhost:8000/health/live

Use: Kubernetes liveness probe (restart if fails)

Readiness (Can handle traffic?)

curl http://localhost:8000/health/ready

Use: Kubernetes readiness probe (remove from load balancer if fails)

Canary (Does scraping work?)

curl http://localhost:8000/health/canary

Use: External monitoring (PagerDuty alerts)

How it works:

- Runs a real scrape test every 4 hours against a test URL
- Verifies that Chrome, the selectors, and GDPR handling all work
- Alerts after 3 consecutive failures
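The alerting rule above can be sketched as a small counter. This is an illustration rather than the server's actual implementation: `CanaryTracker` is a hypothetical name, and firing the alert only once, when the threshold is first reached, is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CanaryTracker:
    threshold: int = 3             # alert after this many failures in a row
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True when an alert should fire."""
        if success:
            # Any success resets the streak.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the threshold is first reached.
        return self.consecutive_failures == self.threshold
```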

Detailed Health

curl http://localhost:8000/health/detailed | jq

Example response:

{
  "status": "healthy",
  "components": {
    "liveness": {
      "status": "alive"
    },
    "readiness": {
      "status": "ready",
      "checks": {
        "database": {"healthy": true}
      }
    },
    "canary": {
      "status": "healthy",
      "last_success": "2026-01-18T10:00:00Z",
      "age_minutes": 30,
      "consecutive_failures": 0
    }
  }
}
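A monitoring script can reduce this response to a list of failing components. The sketch below assumes the "good" status values are exactly those shown in the example (`alive`, `ready`, `healthy`); other values the server may return are not documented here, and `unhealthy_components` is a hypothetical helper.

```python
from typing import Dict, List

# Assumed "good" status per component, taken from the example response.
HEALTHY_STATUS = {"liveness": "alive", "readiness": "ready", "canary": "healthy"}


def unhealthy_components(health: Dict) -> List[str]:
    """Return the names of components whose status differs from the expected value."""
    components = health.get("components", {})
    return sorted(
        name for name, expected in HEALTHY_STATUS.items()
        if components.get(name, {}).get("status") != expected
    )
```

A missing component is reported as unhealthy, which errs on the side of alerting.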

📊 Monitoring

View Canary History

# Connect to database
docker-compose -f docker-compose.production.yml exec db psql -U scraper

# Query canary results
SELECT
    timestamp,
    success,
    reviews_count,
    scrape_time,
    error_message
FROM canary_results
ORDER BY timestamp DESC
LIMIT 10;

View Job Statistics

curl http://localhost:8000/stats | jq

Response:

{
  "total_jobs": 150,
  "pending": 2,
  "running": 3,
  "completed": 140,
  "failed": 5,
  "cancelled": 0,
  "avg_scrape_time": 19.2,
  "total_reviews": 34560
}
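Two derived figures are often useful on a dashboard: the success rate and the average number of reviews per completed job. The sketch below computes both from the response above. It assumes `total_reviews` counts reviews from completed jobs only, which the guide does not confirm; `summarize_stats` is a hypothetical helper.

```python
from typing import Dict


def summarize_stats(stats: Dict) -> Dict[str, float]:
    """Derive success rate and average reviews per completed job."""
    completed = stats["completed"]
    finished = completed + stats["failed"]  # cancelled jobs are left out
    return {
        "success_rate": completed / finished if finished else 0.0,
        "avg_reviews_per_job": stats["total_reviews"] / completed if completed else 0.0,
    }
```

With the example response, 140 of 145 finished jobs succeeded, a success rate of about 96.6%.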

View Webhook Delivery Stats

-- Connect to database
SELECT
    j.job_id,
    j.webhook_url,
    COUNT(w.id) as attempts,
    SUM(CASE WHEN w.success THEN 1 ELSE 0 END) as successful,
    MAX(w.timestamp) as last_attempt
FROM jobs j
LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
WHERE j.webhook_url IS NOT NULL
GROUP BY j.job_id, j.webhook_url
ORDER BY last_attempt DESC NULLS LAST
LIMIT 10;

🐳 Docker Commands

Start Services

docker-compose -f docker-compose.production.yml up -d

Stop Services

docker-compose -f docker-compose.production.yml down

View Logs

# All services
docker-compose -f docker-compose.production.yml logs -f

# Just API
docker-compose -f docker-compose.production.yml logs -f api

# Just database
docker-compose -f docker-compose.production.yml logs -f db

Restart Services

docker-compose -f docker-compose.production.yml restart api

Access Database

docker-compose -f docker-compose.production.yml exec db psql -U scraper

Backup Database

# -T disables pseudo-TTY allocation so the redirected dump is not altered
docker-compose -f docker-compose.production.yml exec -T db pg_dump -U scraper scraper > backup.sql

Restore Database

docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql

🔐 Security

Webhook Signatures

All webhooks include HMAC-SHA256 signatures:

X-Webhook-Signature: sha256=abc123def456...
X-Webhook-Timestamp: 1705582800

Always verify signatures in your webhook handler!
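The timestamp header can also be used to reject replayed requests. The sketch below makes two assumptions that this guide does not state: X-Webhook-Timestamp is Unix time in seconds, and the signature covers the raw body only. The 300-second tolerance is an arbitrary illustrative value, and `sign_payload` and `is_fresh` are hypothetical helpers. `sign_payload` is mainly useful for producing test signatures.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign_payload(payload: bytes, secret: str) -> str:
    """Produce a header value in the 'sha256=<hex digest>' format shown above."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_fresh(timestamp_header: Optional[str], tolerance: int = 300,
             now: Optional[float] = None) -> bool:
    """Return False if the timestamp is missing, malformed, or outside the tolerance."""
    try:
        sent_at = int(timestamp_header)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    return abs(current - sent_at) <= tolerance
```

A handler would check `is_fresh(...)` first and then verify the signature, returning 401 if either check fails.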

Environment Variables

Store secrets in .env file (never commit to git):

# .env
DB_PASSWORD=strong_random_password_here
WEBHOOK_SECRET=another_strong_secret_here

HTTPS in Production

Always use HTTPS URLs for:

- API_BASE_URL
- webhook_url parameters
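One way to enforce this rule is to validate URLs before accepting them. The sketch below is a hypothetical helper, not part of the documented API; the `allow_localhost` switch is an assumption added so that local development over plain HTTP keeps working.

```python
from urllib.parse import urlsplit


def require_https(url: str, allow_localhost: bool = False) -> str:
    """Return the URL unchanged if it is acceptable, otherwise raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme == "https" and parts.hostname:
        return url
    if (allow_localhost and parts.scheme == "http"
            and parts.hostname in ("localhost", "127.0.0.1")):
        return url
    raise ValueError(f"insecure or malformed URL: {url!r}")
```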

📈 Scaling

Vertical Scaling (Single Server)

# docker-compose.production.yml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G

Horizontal Scaling (Multiple Workers)

Phase 2 will add a Redis queue to distribute jobs across multiple workers:

Load Balancer
     ↓
API Servers (3 replicas)
     ↓
Redis Queue
     ↓
Workers (10 replicas)
     ↓
PostgreSQL

🚨 Alerting

Slack Alerts

Set environment variable:

export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

Canary failures will automatically post to Slack:

🚨 CRITICAL: Scraper canary failed 3 times in a row!
Last error: Timeout after 60 seconds

Email Alerts (TODO)

Future enhancement - integrate with SMTP or SendGrid.

PagerDuty (TODO)

Future enhancement - integrate with PagerDuty API.

🧪 Testing

Test Webhook Locally

Use webhook.site or ngrok:

# Start ngrok
ngrok http 8000

# Use ngrok URL as webhook
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://maps.google.com/...",
    "webhook_url": "https://your-id.ngrok.io/webhook"
  }'

Test Health Checks

# Should return 200
curl -f http://localhost:8000/health/live || echo "FAILED"

# Should return 200
curl -f http://localhost:8000/health/ready || echo "FAILED"

# May return 503 if no canary run has completed yet
curl http://localhost:8000/health/canary

📝 Database Schema

Jobs Table

CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    webhook_secret TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_data JSONB,        -- All reviews stored here!
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB
);

Canary Results Table

CREATE TABLE canary_results (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL,
    success BOOLEAN NOT NULL,
    reviews_count INTEGER,
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB
);

Webhook Attempts Table

CREATE TABLE webhook_attempts (
    id SERIAL PRIMARY KEY,
    job_id UUID NOT NULL,
    attempt_number INTEGER NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    success BOOLEAN NOT NULL,
    status_code INTEGER,
    error_message TEXT,
    response_time_ms REAL
);

🎯 Next Steps (Phase 2)

Phase 2 will add:

✅ Redis Queue - Distribute jobs across multiple workers
✅ Worker Processes - Separate API from scraping
✅ Auto-scaling - Kubernetes HPA based on queue length
✅ SSE Streaming - Real-time progress updates (optional)

🐛 Troubleshooting

Database Connection Errors

# Check database is running
docker-compose -f docker-compose.production.yml ps db

# Check connection
psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"

Canary Always Failing

Check that the canary test URL is accessible:

curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"

Try a different test URL in .env:

CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS

Webhooks Not Delivered

Check the webhook_attempts table:

SELECT * FROM webhook_attempts
WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY timestamp DESC;

Check that the webhook dispatcher is running:

docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"

Your production microservice is ready! 🚀

For questions or issues, check:

- Server logs: docker-compose -f docker-compose.production.yml logs -f api
- Database: docker-compose -f docker-compose.production.yml exec db psql -U scraper
- Health checks: curl http://localhost:8000/health/detailed
