Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/DEPLOYMENT_GUIDE.md
+++ b/DEPLOYMENT_GUIDE.md
@@ -0,0 +1,604 @@
+# Production Deployment Guide
+## Phase 1: PostgreSQL + Webhooks + Health Checks
+
+---
+
+## What's Included
+
+### Phase 1 Features:
+- ✅ **PostgreSQL Storage** - Job metadata + reviews as JSONB
+- ✅ **Webhooks** - Async notifications with retry logic and HMAC signatures
+- ✅ **Smart Health Checks** - Canary testing every 4 hours to verify scraping works
+- ✅ **Fast Scraper** - 18.9s average scraping time (8.2x faster)
+- ✅ **Docker Deployment** - Easy deployment with Docker Compose
+
+---
+
+## 🚀 Quick Start (Docker)
+
+### 1. Clone and Configure
+
+```bash
+# Copy environment file
+cp .env.example .env
+
+# Edit .env with your settings
+nano .env
+```
+
+### 2. Start Services
+
+```bash
+# Build and start all services
+docker-compose -f docker-compose.production.yml up -d
+
+# Check logs
+docker-compose -f docker-compose.production.yml logs -f api
+```
+
+### 3. Verify Health
+
+```bash
+# Check if API is running
+curl http://localhost:8000/
+
+# Check detailed health
+curl http://localhost:8000/health/detailed | jq
+```
+
+**Done!** API is running on `http://localhost:8000`
+
+---
+
+## 🔧 Manual Installation
+
+### 1. Install Dependencies
+
+```bash
+# Install Python dependencies
+pip install -r requirements-production.txt
+
+# Install PostgreSQL
+# On macOS:
+brew install postgresql@15
+brew services start postgresql@15
+
+# On Ubuntu:
+sudo apt-get install postgresql-15
+```
+
+### 2. Setup Database
+
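+```bash
+# Create the user and a database it owns (run as a PostgreSQL superuser).
+# Owning the database lets the user create tables in the public schema on PostgreSQL 15+.
+psql postgres <<'SQL'
+CREATE USER scraper WITH PASSWORD 'scraper123';
+CREATE DATABASE scraper OWNER scraper;
+GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;
+SQL
+```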
+
+### 3. Configure Environment
+
+```bash
+# Set environment variables
+export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
+export API_BASE_URL="http://localhost:8000"
+```
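As a rough illustration of how a service might consume these variables, the sketch below splits `DATABASE_URL` into its connection parts using only the standard library. The helper names `parse_database_url` and `load_config` are hypothetical and are not taken from `api_server_production.py`.

```python
import os
from urllib.parse import urlparse

# Local default taken from the export line above.
DEFAULT_URL = "postgresql://scraper:scraper123@localhost:5432/scraper"


def parse_database_url(url: str) -> dict:
    """Split a postgresql:// URL into its connection parts."""
    parts = urlparse(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port or 5432,  # PostgreSQL's default port
        "dbname": parts.path.lstrip("/"),
    }


def load_config() -> dict:
    """Read DATABASE_URL from the environment, falling back to the local default."""
    return parse_database_url(os.environ.get("DATABASE_URL", DEFAULT_URL))
```

Failing fast on an unexpected scheme tends to surface a mistyped URL at startup rather than at the first query.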
+
+### 4. Run Server
+
+```bash
+python api_server_production.py
+```
+
+Server runs on `http://localhost:8000`
+
+---
+
+## 📡 API Usage
+
+### 1. Submit Job with Webhook
+
+```bash
+curl -X POST "http://localhost:8000/scrape" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
+    "webhook_url": "https://your-server.com/webhook",
+    "webhook_secret": "your-secret-key"
+  }'
+```
+
+**Response:**
+```json
+{
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "started"
+}
+```
+
+### 2. Check Status
+
+```bash
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
+```
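If no webhook is configured, a client can poll the status endpoint instead. The sketch below is one possible polling loop; it assumes the status response carries a `status` field like the submit response does, and that the terminal states match the counters reported by `/stats` (`completed`, `failed`, `cancelled`). The function name `wait_for_job` and its default interval and timeout are illustrative choices.

```python
import time
from typing import Callable

# Assumed terminal states, based on the counters returned by /stats.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 5.0,
                 timeout: float = 600.0) -> dict:
    """Call fetch_status() until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        time.sleep(interval)
```

With `requests` installed, `fetch_status` could be something like `lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json()`. Passing the fetcher in as a callable keeps the loop testable without a running server.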
+
+### 3. Receive Webhook (When Complete)
+
+Your webhook endpoint will receive:
+
+```text
+POST https://your-server.com/webhook
+Headers:
+  X-Webhook-Signature: sha256=abc123...
+  X-Webhook-Timestamp: 1705582800
+
+Body:
+{
+  "event": "job.completed",
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "completed",
+  "reviews_count": 244,
+  "scrape_time": 18.9,
+  "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
+  "timestamp": "2026-01-18T10:30:00Z"
+}
+```
+
+### 4. Verify Webhook Signature
+
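+```python
+import hashlib
+import hmac
+import os
+from typing import Optional
+
+import requests
+from fastapi import FastAPI, HTTPException, Request
+
+app = FastAPI()
+# Must match the webhook_secret sent with the /scrape request
+WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]
+
+def verify_webhook(payload: bytes, signature: Optional[str], secret: str) -> bool:
+    """Verify the HMAC-SHA256 signature of the raw request body"""
+    # Reject requests with a missing or malformed signature header
+    if not signature or not signature.startswith("sha256="):
+        return False
+    expected = signature.split("sha256=", 1)[1]
+    computed = hmac.new(
+        secret.encode(),
+        payload,
+        hashlib.sha256
+    ).hexdigest()
+
+    # Constant-time comparison avoids leaking timing information
+    return hmac.compare_digest(expected, computed)
+
+# In your webhook handler:
+@app.post("/webhook")
+async def handle_webhook(request: Request):
+    payload = await request.body()
+    signature = request.headers.get("X-Webhook-Signature")
+
+    if not verify_webhook(payload, signature, WEBHOOK_SECRET):
+        raise HTTPException(status_code=401, detail="Invalid signature")
+
+    # Process webhook...
+    data = await request.json()
+    job_id = data['job_id']
+
+    # Download reviews (requests is blocking; prefer an async client in production)
+    reviews = requests.get(data['reviews_url'], timeout=30).json()
+    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")
+```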
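The `X-Webhook-Timestamp` header can also be used to reject stale or replayed deliveries. The sketch below is a hedged illustration, not the service's confirmed scheme: it assumes the signature covers only the raw body (as in `verify_webhook` above), that the timestamp is Unix seconds, and that a 5-minute tolerance is acceptable. `sign_payload` mirrors what a sender would presumably compute and is handy for exercising a handler locally. Both function names are hypothetical.

```python
import hashlib
import hmac
import time


def sign_payload(payload: bytes, secret: str) -> str:
    """Build a header value in the 'sha256=<hex digest>' form shown above."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_fresh(timestamp_header, tolerance_seconds: int = 300, now=None) -> bool:
    """Reject timestamps that are missing, malformed, or too far from now."""
    try:
        sent_at = int(timestamp_header)
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    return abs(current - sent_at) <= tolerance_seconds
```

A handler could call `is_fresh(request.headers.get("X-Webhook-Timestamp"))` before checking the signature and return 401 when either check fails.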
+
+### 5. Get Reviews
+
+```bash
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq
+```
+
+---
+
+## 🏥 Health Checks
+
+### Liveness (Is server alive?)
+
+```bash
+curl http://localhost:8000/health/live
+```
+
+**Use**: Kubernetes liveness probe (restart if fails)
+
+### Readiness (Can handle traffic?)
+
+```bash
+curl http://localhost:8000/health/ready
+```
+
+**Use**: Kubernetes readiness probe (remove from load balancer if fails)
+
+### Canary (Does scraping work?)
+
+```bash
+curl http://localhost:8000/health/canary
+```
+
+**Use**: External monitoring (PagerDuty alerts)
+
+**How it works**:
+- Runs a real scrape against a test URL every 4 hours
+- Verifies that Chrome, the selectors, and GDPR handling all work
+- Alerts after 3 consecutive failures
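The alerting rule above can be sketched as a small counter. This is an illustration of the "3 consecutive failures" logic only; the class name `CanaryTracker` is hypothetical, and the real service may track this state differently (for example, in the `canary_results` table).

```python
from dataclasses import dataclass


@dataclass
class CanaryTracker:
    """Counts consecutive canary failures and flags when to alert."""
    alert_threshold: int = 3  # the guide alerts after 3 consecutive failures
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True when an alert should fire.

        Note: this returns True on every failed run at or past the
        threshold, so a caller may want to de-duplicate repeated alerts.
        """
        if success:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.alert_threshold
```

A single success resets the counter, so intermittent failures never reach the threshold.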
+
+### Detailed Health
+
+```bash
+curl http://localhost:8000/health/detailed | jq
+```
+
+**Example response:**
+```json
+{
+  "status": "healthy",
+  "components": {
+    "liveness": {
+      "status": "alive"
+    },
+    "readiness": {
+      "status": "ready",
+      "checks": {
+        "database": {"healthy": true}
+      }
+    },
+    "canary": {
+      "status": "healthy",
+      "last_success": "2026-01-18T10:00:00Z",
+      "age_minutes": 30,
+      "consecutive_failures": 0
+    }
+  }
+}
+```
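A monitoring script might reduce this response to a list of failing components. The sketch below assumes the response shape shown in the example and treats `healthy`, `alive`, and `ready` as the only passing values; any other status string is an assumption about how failures are reported. The function name `unhealthy_components` is hypothetical.

```python
def unhealthy_components(health: dict) -> list:
    """Return the names of components whose status is not a passing value."""
    ok_values = {"healthy", "alive", "ready"}  # values seen in the example response
    components = health.get("components", {})
    return [name for name, info in components.items()
            if info.get("status") not in ok_values]
```

An empty list means every component reported a passing status.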
+
+---
+
+## 📊 Monitoring
+
+### View Canary History
+
+```bash
+# Query the 10 most recent canary results inside the database container
+docker-compose -f docker-compose.production.yml exec db psql -U scraper -c "
+SELECT
+    timestamp,
+    success,
+    reviews_count,
+    scrape_time,
+    error_message
+FROM canary_results
+ORDER BY timestamp DESC
+LIMIT 10;"
+```
+
+### View Job Statistics
+
+```bash
+curl http://localhost:8000/stats | jq
+```
+
+**Response:**
+```json
+{
+  "total_jobs": 150,
+  "pending": 2,
+  "running": 3,
+  "completed": 140,
+  "failed": 5,
+  "cancelled": 0,
+  "avg_scrape_time": 19.2,
+  "total_reviews": 34560
+}
+```
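Two derived figures are often more useful than the raw counters: the success rate among finished jobs and the average number of reviews per completed job. The sketch below computes both from the response above. It assumes `total_reviews` counts reviews from completed jobs only, which the guide does not state. The function name `summarize_stats` is hypothetical.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive success rate and average reviews per completed job."""
    finished = stats["completed"] + stats["failed"] + stats["cancelled"]
    success_rate = stats["completed"] / finished if finished else 0.0
    avg_reviews = (stats["total_reviews"] / stats["completed"]
                   if stats["completed"] else 0.0)
    return {
        "success_rate": round(success_rate, 3),
        "avg_reviews_per_job": round(avg_reviews, 1),
    }
```

For the example response, 140 of 145 finished jobs completed, giving a success rate of about 0.966.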
+
+### View Webhook Delivery Stats
+
+```sql
+-- Connect to database
+SELECT
+    j.job_id,
+    j.webhook_url,
+    COUNT(w.id) as attempts,
+    SUM(CASE WHEN w.success THEN 1 ELSE 0 END) as successful,
+    MAX(w.timestamp) as last_attempt
+FROM jobs j
+LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
+WHERE j.webhook_url IS NOT NULL
+GROUP BY j.job_id, j.webhook_url
+ORDER BY last_attempt DESC NULLS LAST
+LIMIT 10;
+```
+
+---
+
+## 🐳 Docker Commands
+
+### Start Services
+
+```bash
+docker-compose -f docker-compose.production.yml up -d
+```
+
+### Stop Services
+
+```bash
+docker-compose -f docker-compose.production.yml down
+```
+
+### View Logs
+
+```bash
+# All services
+docker-compose -f docker-compose.production.yml logs -f
+
+# Just API
+docker-compose -f docker-compose.production.yml logs -f api
+
+# Just database
+docker-compose -f docker-compose.production.yml logs -f db
+```
+
+### Restart Services
+
+```bash
+docker-compose -f docker-compose.production.yml restart api
+```
+
+### Access Database
+
+```bash
+docker-compose -f docker-compose.production.yml exec db psql -U scraper
+```
+
+### Backup Database
+
+```bash
+docker-compose -f docker-compose.production.yml exec db pg_dump -U scraper scraper > backup.sql
+```
+
+### Restore Database
+
+```bash
+docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql
+```
+
+---
+
+## 🔐 Security
+
+### Webhook Signatures
+
+All webhooks include HMAC-SHA256 signatures:
+
+```
+X-Webhook-Signature: sha256=abc123def456...
+X-Webhook-Timestamp: 1705582800
+```
+
+**Always verify signatures** in your webhook handler!
+
+### Environment Variables
+
+Store secrets in a `.env` file (never commit it to git):
+
+```bash
+# .env
+DB_PASSWORD=strong_random_password_here
+WEBHOOK_SECRET=another_strong_secret_here
+```
+
+### HTTPS in Production
+
+Always use HTTPS URLs for:
+- API_BASE_URL
+- webhook_url parameters
+
+---
+
+## 📈 Scaling
+
+### Vertical Scaling (Single Server)
+
+```yaml
+# docker-compose.production.yml
+services:
+  api:
+    deploy:
+      resources:
+        limits:
+          cpus: '2'
+          memory: 4G
+```
+
+### Horizontal Scaling (Multiple Workers)
+
+Phase 2 will add a Redis queue to distribute jobs across multiple workers:
+
+```
+Load Balancer
+     ↓
+API Servers (3 replicas)
+     ↓
+Redis Queue
+     ↓
+Workers (10 replicas)
+     ↓
+PostgreSQL
+```
+
+---
+
+## 🚨 Alerting
+
+### Slack Alerts
+
+Set environment variable:
+
+```bash
+export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+```
+
+Canary failures will automatically post to Slack:
+
+```
+🚨 CRITICAL: Scraper canary failed 3 times in a row!
+Last error: Timeout after 60 seconds
+```
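For reference, a Slack incoming webhook accepts a JSON body with a `text` field. The sketch below builds the message shown above and posts it with the standard library. The function names are hypothetical, and the service's actual alerting code may differ.

```python
import json
import urllib.request


def build_slack_alert(consecutive_failures: int, last_error: str) -> dict:
    """Build the JSON payload for a Slack incoming webhook."""
    text = (f"🚨 CRITICAL: Scraper canary failed {consecutive_failures} times in a row!\n"
            f"Last error: {last_error}")
    return {"text": text}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to Slack and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the message format easy to test without contacting Slack.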
+
+### Email Alerts (TODO)
+
+Future enhancement - integrate with SMTP or SendGrid.
+
+### PagerDuty (TODO)
+
+Future enhancement - integrate with PagerDuty API.
+
+---
+
+## 🧪 Testing
+
+### Test Webhook Locally
+
+Use webhook.site or ngrok:
+
+```bash
+# Start ngrok
+ngrok http 8000
+
+# Use ngrok URL as webhook
+curl -X POST "http://localhost:8000/scrape" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://maps.google.com/...",
+    "webhook_url": "https://your-id.ngrok.io/webhook"
+  }'
+```
+
+### Test Health Checks
+
+```bash
+# Should return 200
+curl -f http://localhost:8000/health/live || echo "FAILED"
+
+# Should return 200
+curl -f http://localhost:8000/health/ready || echo "FAILED"
+
+# May return 503 if no canary has run yet
+curl http://localhost:8000/health/canary
+```
+
+---
+
+## 📝 Database Schema
+
+### Jobs Table
+
+```sql
+CREATE TABLE jobs (
+    job_id UUID PRIMARY KEY,
+    status VARCHAR(20) NOT NULL,
+    url TEXT NOT NULL,
+    webhook_url TEXT,
+    webhook_secret TEXT,
+    created_at TIMESTAMP NOT NULL,
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+    reviews_count INTEGER,
+    reviews_data JSONB,        -- All reviews stored here!
+    scrape_time REAL,
+    error_message TEXT,
+    metadata JSONB
+);
+```
+
+### Canary Results Table
+
+```sql
+CREATE TABLE canary_results (
+    id SERIAL PRIMARY KEY,
+    timestamp TIMESTAMP NOT NULL,
+    success BOOLEAN NOT NULL,
+    reviews_count INTEGER,
+    scrape_time REAL,
+    error_message TEXT,
+    metadata JSONB
+);
+```
+
+### Webhook Attempts Table
+
+```sql
+CREATE TABLE webhook_attempts (
+    id SERIAL PRIMARY KEY,
+    job_id UUID NOT NULL,
+    attempt_number INTEGER NOT NULL,
+    timestamp TIMESTAMP NOT NULL,
+    success BOOLEAN NOT NULL,
+    status_code INTEGER,
+    error_message TEXT,
+    response_time_ms REAL
+);
+```
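To show how rows in `webhook_attempts` might be produced, the sketch below retries a delivery with exponential backoff and returns one record per attempt, using the column names from the table above. The retry count, the backoff schedule, and the function name `deliver_with_retry` are all assumptions; the guide only says that webhooks have retry logic.

```python
import time
from typing import Callable, List


def deliver_with_retry(send: Callable[[], int], max_attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Call send() until it returns a 2xx status; return one record per attempt."""
    attempts = []
    for attempt_number in range(1, max_attempts + 1):
        started = time.monotonic()
        status_code, error_message = None, None
        try:
            status_code = send()
        except Exception as exc:  # network errors count as failed attempts
            error_message = str(exc)
        success = status_code is not None and 200 <= status_code < 300
        attempts.append({
            "attempt_number": attempt_number,
            "success": success,
            "status_code": status_code,
            "error_message": error_message,
            "response_time_ms": (time.monotonic() - started) * 1000,
        })
        if success:
            break
        if attempt_number < max_attempts:
            # Hypothetical schedule: base_delay, then doubling (1s, 2s, 4s, ...)
            sleep(base_delay * 2 ** (attempt_number - 1))
    return attempts
```

Each returned dictionary maps onto one row of the table, apart from `job_id` and `timestamp`, which the caller would add when inserting.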
+
+---
+
+## 🎯 Next Steps (Phase 2)
+
+Phase 2 will add:
+- **Redis Queue** - Distribute jobs across multiple workers
+- **Worker Processes** - Separate API from scraping
+- **Auto-scaling** - Kubernetes HPA based on queue length
+- **SSE Streaming** - Real-time progress updates (optional)
+
+---
+
+## 🐛 Troubleshooting
+
+### Database Connection Errors
+
+```bash
+# Check database is running
+docker-compose -f docker-compose.production.yml ps db
+
+# Check connection
+psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"
+```
+
+### Canary Always Failing
+
+Check canary test URL is accessible:
+
+```bash
+curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
+```
+
+Try a different test URL in .env:
+```
+CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS
+```
+
+### Webhooks Not Delivered
+
+Check webhook attempts table:
+
+```sql
+SELECT * FROM webhook_attempts
+WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
+ORDER BY timestamp DESC;
+```
+
+Check webhook dispatcher is running:
+
+```bash
+docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"
+```
+
+---
+
+**Your production microservice is ready!** 🚀
+
+For questions or issues, check:
+- Server logs: `docker-compose -f docker-compose.production.yml logs -f api`
+- Database: `docker-compose -f docker-compose.production.yml exec db psql -U scraper`
+- Health checks: `curl http://localhost:8000/health/detailed`