Production Deployment Guide
Phase 1: PostgreSQL + Webhooks + Health Checks
What's Included
Phase 1 Features:
- ✅ PostgreSQL Storage - Job metadata + reviews as JSONB
- ✅ Webhooks - Async notifications with retry logic and HMAC signatures
- ✅ Smart Health Checks - Canary testing every 4 hours to verify scraping works
- ✅ Fast Scraper - 18.9s average scraping time (8.2x faster)
- ✅ Docker Deployment - Easy deployment with Docker Compose
🚀 Quick Start (Docker)
1. Clone and Configure
# Copy environment file
cp .env.example .env
# Edit .env with your settings
nano .env
2. Start Services
# Build and start all services
docker-compose -f docker-compose.production.yml up -d
# Check logs
docker-compose -f docker-compose.production.yml logs -f api
3. Verify Health
# Check if API is running
curl http://localhost:8000/
# Check detailed health
curl http://localhost:8000/health/detailed | jq
Done! API is running on http://localhost:8000
🔧 Manual Installation
1. Install Dependencies
# Install Python dependencies
pip install -r requirements-production.txt
# Install PostgreSQL
# On macOS:
brew install postgresql@15
brew services start postgresql@15
# On Ubuntu:
sudo apt-get install postgresql-15
2. Setup Database
# Create database and user
psql postgres
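CREATE USER scraper WITH PASSWORD 'scraper123';
-- Make scraper the owner: on PostgreSQL 15+, GRANT ALL ON DATABASE alone
-- does not allow creating tables in the public schema
CREATE DATABASE scraper OWNER scraper;
GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;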
\q
3. Configure Environment
# Set environment variables
export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
export API_BASE_URL="http://localhost:8000"
4. Run Server
python api_server_production.py
Server runs on http://localhost:8000
📡 API Usage
1. Submit Job with Webhook
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
"webhook_url": "https://your-server.com/webhook",
"webhook_secret": "your-secret-key"
}'
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started"
}
2. Check Status
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
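If you are not using webhooks, the status endpoint can be polled from a script. The following is a minimal sketch, not the project's client: it assumes the job response carries a `status` field and that `completed`, `failed`, and `cancelled` (the counters reported by `/stats`) are the terminal states. `fetch_job` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumption: terminal states, inferred from the counters returned by /stats.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(base_url: str, job_id: str) -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    base_url: str = "http://localhost:8000",
    interval: float = 5.0,
    timeout: float = 600.0,
    fetch: Callable[[str, str], dict] = fetch_job,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(base_url, job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

The `fetch` and `sleep` parameters are injectable so the loop can be tested without a running server.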
3. Receive Webhook (When Complete)
Your webhook endpoint will receive:
POST https://your-server.com/webhook
Headers:
X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1705582800
Body:
{
"event": "job.completed",
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"reviews_count": 244,
"scrape_time": 18.9,
"reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
"timestamp": "2026-01-18T10:30:00Z"
}
4. Verify Webhook Signature
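import hashlib
import hmac
import os
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException, Request  # assumes a FastAPI receiver

app = FastAPI()
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]  # same value you sent as webhook_secret

def verify_webhook(payload: str, signature: Optional[str], secret: str) -> bool:
    """Verify the HMAC-SHA256 signature of the request body."""
    # Reject a missing or malformed header instead of raising an exception
    if not signature or not signature.startswith("sha256="):
        return False
    expected = signature.split("sha256=", 1)[1]
    computed = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, computed)

# In your webhook handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature")
    if not verify_webhook(payload.decode(), signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")
    # Process webhook...
    data = await request.json()
    job_id = data['job_id']
    # Download reviews (blocking call; acceptable for a small example)
    reviews = requests.get(data['reviews_url'], timeout=30).json()
    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")
    return {"received": True}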
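To exercise a handler like the one above without running a real scrape, you can sign a test payload yourself. This sketch assumes the signature is an HMAC-SHA256 over the raw request body only, which matches what `verify_webhook` checks; whether the server also signs the timestamp is not stated here. `sign_payload` and `build_test_request` are hypothetical helper names.

```python
import hashlib
import hmac
import json
import time


def sign_payload(body: bytes, secret: str) -> str:
    """Return a header value in the 'sha256=<hex digest>' form shown above."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def build_test_request(payload: dict, secret: str) -> tuple[bytes, dict]:
    """Serialize the payload once and sign exactly those bytes."""
    body = json.dumps(payload, separators=(",", ":")).encode()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": sign_payload(body, secret),
        "X-Webhook-Timestamp": str(int(time.time())),
    }
    return body, headers
```

Send the returned `body` bytes unchanged. Re-serializing the JSON on the way out can change whitespace or key order, and the signature would then no longer match.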
5. Get Reviews
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq
🏥 Health Checks
Liveness (Is server alive?)
curl http://localhost:8000/health/live
Use: Kubernetes liveness probe (restart if fails)
Readiness (Can handle traffic?)
curl http://localhost:8000/health/ready
Use: Kubernetes readiness probe (remove from load balancer if fails)
Canary (Does scraping work?)
curl http://localhost:8000/health/canary
Use: External monitoring (PagerDuty alerts)
How it works:
- Runs a real scrape every 4 hours against a test URL
- Verifies that Chrome, the CSS selectors, and GDPR consent handling all work
- Alerts after 3 consecutive failures
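The "3 consecutive failures" rule can be sketched as a small counter. This is an illustration of the idea, not the server's code: `CanaryTracker` is a hypothetical name, and firing the alert only once, at the moment the threshold is reached, is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CanaryTracker:
    """Counts consecutive canary failures and signals when to alert."""

    threshold: int = 3  # "3 consecutive failures" from the list above
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True when an alert should fire."""
        if success:
            # Any successful run resets the streak
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once, exactly when the threshold is reached (assumption)
        return self.consecutive_failures == self.threshold
```

With runs every 4 hours, three consecutive failures means scraping may have been broken for 8 to 12 hours before the alert fires.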
Detailed Health
curl http://localhost:8000/health/detailed | jq
Example response:
{
"status": "healthy",
"components": {
"liveness": {
"status": "alive"
},
"readiness": {
"status": "ready",
"checks": {
"database": {"healthy": true}
}
},
"canary": {
"status": "healthy",
"last_success": "2026-01-18T10:00:00Z",
"age_minutes": 30,
"consecutive_failures": 0
}
}
}
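A monitoring script might reduce the detailed response to a list of failing components. The sketch below assumes the response always has the shape of the example above and that `alive`, `ready`, and `healthy` are the only passing status values; `unhealthy_components` is a hypothetical helper name.

```python
# Assumption: the only passing values are the ones seen in the example response.
PASSING_STATES = {"alive", "ready", "healthy"}


def unhealthy_components(health: dict) -> list[str]:
    """Return the names of components whose status is not a passing value."""
    return [
        name
        for name, component in health.get("components", {}).items()
        if component.get("status") not in PASSING_STATES
    ]
```

An empty list means every component reported a passing status.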
📊 Monitoring
View Canary History
# Connect to database
docker-compose -f docker-compose.production.yml exec db psql -U scraper
# Query canary results
SELECT
timestamp,
success,
reviews_count,
scrape_time,
error_message
FROM canary_results
ORDER BY timestamp DESC
LIMIT 10;
View Job Statistics
curl http://localhost:8000/stats | jq
Response:
{
"total_jobs": 150,
"pending": 2,
"running": 3,
"completed": 140,
"failed": 5,
"cancelled": 0,
"avg_scrape_time": 19.2,
"total_reviews": 34560
}
View Webhook Delivery Stats
-- Connect to database
SELECT
j.job_id,
j.webhook_url,
COUNT(w.id) as attempts,
SUM(CASE WHEN w.success THEN 1 ELSE 0 END) as successful,
MAX(w.timestamp) as last_attempt
FROM jobs j
LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
WHERE j.webhook_url IS NOT NULL
GROUP BY j.job_id, j.webhook_url
ORDER BY last_attempt DESC
LIMIT 10;
🐳 Docker Commands
Start Services
docker-compose -f docker-compose.production.yml up -d
Stop Services
docker-compose -f docker-compose.production.yml down
View Logs
# All services
docker-compose -f docker-compose.production.yml logs -f
# Just API
docker-compose -f docker-compose.production.yml logs -f api
# Just database
docker-compose -f docker-compose.production.yml logs -f db
Restart Services
docker-compose -f docker-compose.production.yml restart api
Access Database
docker-compose -f docker-compose.production.yml exec db psql -U scraper
Backup Database
docker-compose -f docker-compose.production.yml exec -T db pg_dump -U scraper scraper > backup.sql
Restore Database
docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql
🔐 Security
Webhook Signatures
All webhooks include HMAC-SHA256 signatures:
X-Webhook-Signature: sha256=abc123def456...
X-Webhook-Timestamp: 1705582800
Always verify signatures in your webhook handler!
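Alongside the signature check, a receiver can reject stale deliveries using the timestamp header. This is a sketch under assumptions: it treats `X-Webhook-Timestamp` as Unix seconds (the example value looks like one), and the 5-minute tolerance and the name `is_fresh` are illustrative. This guide does not say whether the signature covers the timestamp; if it does not, this check limits accidental replays but cannot stop a deliberate one.

```python
import time
from typing import Optional


def is_fresh(
    timestamp_header: Optional[str],
    tolerance_seconds: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True if the timestamp is within the tolerance of the current time."""
    if timestamp_header is None:
        return False
    try:
        sent_at = int(timestamp_header)
    except ValueError:
        # Header present but not an integer number of seconds
        return False
    current = time.time() if now is None else now
    return abs(current - sent_at) <= tolerance_seconds
```

Call it with `request.headers.get("X-Webhook-Timestamp")` and return 401 when it is False.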
Environment Variables
Store secrets in a .env file (never commit it to git):
# .env
DB_PASSWORD=strong_random_password_here
WEBHOOK_SECRET=another_strong_secret_here
HTTPS in Production
Always use HTTPS URLs for:
- API_BASE_URL
- webhook_url parameters
📈 Scaling
Vertical Scaling (Single Server)
# docker-compose.production.yml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
Horizontal Scaling (Multiple Workers)
Phase 2 will add a Redis queue for distributing jobs across multiple workers:
Load Balancer
↓
API Servers (3 replicas)
↓
Redis Queue
↓
Workers (10 replicas)
↓
PostgreSQL
🚨 Alerting
Slack Alerts
Set environment variable:
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
Canary failures will automatically post to Slack:
🚨 CRITICAL: Scraper canary failed 3 times in a row!
Last error: Timeout after 60 seconds
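For reference, posting a message like this to a Slack incoming webhook takes only the standard library. The sketch below is not the server's alerting code: `build_slack_alert` and `send_slack_alert` are hypothetical names, and the message text simply mirrors the example above. Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import os
import urllib.request
from typing import Optional


def build_slack_alert(failures: int, last_error: str) -> bytes:
    """Build the JSON body for a Slack incoming webhook."""
    text = (
        f"🚨 CRITICAL: Scraper canary failed {failures} times in a row!\n"
        f"Last error: {last_error}"
    )
    return json.dumps({"text": text}).encode()


def send_slack_alert(
    failures: int, last_error: str, webhook_url: Optional[str] = None
) -> int:
    """POST the alert and return the HTTP status code."""
    url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    request = urllib.request.Request(
        url,
        data=build_slack_alert(failures, last_error),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_slack_alert(3, "Timeout after 60 seconds")` by hand is a quick way to confirm that `SLACK_WEBHOOK_URL` is valid before relying on canary alerts.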
Email Alerts (TODO)
Future enhancement - integrate with SMTP or SendGrid.
PagerDuty (TODO)
Future enhancement - integrate with PagerDuty API.
🧪 Testing
Test Webhook Locally
Use webhook.site or ngrok:
# Start ngrok
ngrok http 8000
# Use ngrok URL as webhook
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "https://maps.google.com/...",
"webhook_url": "https://your-id.ngrok.io/webhook"
}'
Test Health Checks
# Should return 200
curl -f http://localhost:8000/health/live || echo "FAILED"
# Should return 200
curl -f http://localhost:8000/health/ready || echo "FAILED"
# May return 503 if no canary run has completed yet
curl http://localhost:8000/health/canary
📝 Database Schema
Jobs Table
CREATE TABLE jobs (
job_id UUID PRIMARY KEY,
status VARCHAR(20) NOT NULL,
url TEXT NOT NULL,
webhook_url TEXT,
webhook_secret TEXT,
created_at TIMESTAMP NOT NULL,
started_at TIMESTAMP,
completed_at TIMESTAMP,
reviews_count INTEGER,
reviews_data JSONB, -- All reviews stored here!
scrape_time REAL,
error_message TEXT,
metadata JSONB
);
Canary Results Table
CREATE TABLE canary_results (
id SERIAL PRIMARY KEY,
timestamp TIMESTAMP NOT NULL,
success BOOLEAN NOT NULL,
reviews_count INTEGER,
scrape_time REAL,
error_message TEXT,
metadata JSONB
);
Webhook Attempts Table
CREATE TABLE webhook_attempts (
id SERIAL PRIMARY KEY,
job_id UUID NOT NULL,
attempt_number INTEGER NOT NULL,
timestamp TIMESTAMP NOT NULL,
success BOOLEAN NOT NULL,
status_code INTEGER,
error_message TEXT,
response_time_ms REAL
);
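To show how retry attempts could map onto rows of this table, here is a minimal retry loop with exponential backoff. It is a sketch under assumptions, not the dispatcher's implementation: the attempt limit, the backoff schedule, and the name `deliver_with_retry` are all illustrative, and `send` stands in for whatever function performs the HTTP POST and returns a status code.

```python
import time
from typing import Callable, Optional


def deliver_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,    # assumption: not specified in this guide
    base_delay: float = 1.0,  # assumption: delays of 1s, 2s, 4s, ...
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Try to deliver a webhook; return one record per attempt."""
    attempts: list[dict] = []
    for attempt_number in range(1, max_attempts + 1):
        started = time.monotonic()
        status_code: Optional[int] = None
        error_message: Optional[str] = None
        try:
            status_code = send()
        except Exception as exc:  # network errors, timeouts, etc.
            error_message = str(exc)
        success = status_code is not None and 200 <= status_code < 300
        # Keys mirror the webhook_attempts columns above
        attempts.append({
            "attempt_number": attempt_number,
            "success": success,
            "status_code": status_code,
            "error_message": error_message,
            "response_time_ms": (time.monotonic() - started) * 1000,
        })
        if success:
            break
        if attempt_number < max_attempts:
            sleep(base_delay * 2 ** (attempt_number - 1))
    return attempts
```

Each returned dict could be inserted as one `webhook_attempts` row, with `job_id` and `timestamp` added by the caller.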
🎯 Next Steps (Phase 2)
Phase 2 will add:
- ✅ Redis Queue - Distribute jobs across multiple workers
- ✅ Worker Processes - Separate API from scraping
- ✅ Auto-scaling - Kubernetes HPA based on queue length
- ✅ SSE Streaming - Real-time progress updates (optional)
🐛 Troubleshooting
Database Connection Errors
# Check database is running
docker-compose -f docker-compose.production.yml ps db
# Check connection
psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"
Canary Always Failing
Check canary test URL is accessible:
curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
Try a different test URL in .env:
CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS
Webhooks Not Delivered
Check webhook attempts table:
SELECT * FROM webhook_attempts
WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY timestamp DESC;
Check webhook dispatcher is running:
docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"
Your production microservice is ready! 🚀
For questions or issues, check:
- Server logs: docker-compose logs -f api
- Database: docker-compose exec db psql -U scraper
- Health checks: curl http://localhost:8000/health/detailed