# Production Deployment Guide
## Phase 1: PostgreSQL + Webhooks + Health Checks
---
## What's Included
### Phase 1 Features:
- **PostgreSQL Storage** - Job metadata + reviews as JSONB
- **Webhooks** - Async notifications with retry logic and HMAC signatures
- **Smart Health Checks** - Canary testing every 4 hours to verify scraping works
- **Fast Scraper** - 18.9s average scraping time (8.2x faster)
- **Docker Deployment** - Easy deployment with Docker Compose
---
## 🚀 Quick Start (Docker)
### 1. Clone and Configure
```bash
# Copy environment file
cp .env.example .env
# Edit .env with your settings
nano .env
```
### 2. Start Services
```bash
# Build and start all services
docker-compose -f docker-compose.production.yml up -d
# Check logs
docker-compose -f docker-compose.production.yml logs -f api
```
### 3. Verify Health
```bash
# Check if API is running
curl http://localhost:8000/
# Check detailed health
curl http://localhost:8000/health/detailed | jq
```
**Done!** API is running on `http://localhost:8000`
---
## 🔧 Manual Installation
### 1. Install Dependencies
```bash
# Install Python dependencies
pip install -r requirements-production.txt
# Install PostgreSQL
# On macOS:
brew install postgresql@15
brew services start postgresql@15
# On Ubuntu:
sudo apt-get install postgresql-15
```
### 2. Setup Database
```bash
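# Create database and user (the heredoc feeds the SQL statements to psql)
psql postgres <<'SQL'
CREATE USER scraper WITH PASSWORD 'scraper123';
-- Owning the database lets the user create tables in the public schema on PostgreSQL 15+
CREATE DATABASE scraper OWNER scraper;
GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;
SQL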
```
### 3. Configure Environment
```bash
# Set environment variables
export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
export API_BASE_URL="http://localhost:8000"
```
### 4. Run Server
```bash
python api_server_production.py
```
Server runs on `http://localhost:8000`
---
## 📡 API Usage
### 1. Submit Job with Webhook
```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your-secret-key"
  }'
```
**Response:**
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}
```
### 2. Check Status
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
```
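If your client cannot receive webhooks, polling the status endpoint is an alternative. The sketch below is hypothetical: `fetch_job` and `wait_for_job` are illustrative names, not part of the service, and it assumes the status response carries a `status` field whose terminal values match the `completed`, `failed`, and `cancelled` counters reported by `/stats`.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed terminal states, taken from the counters that /stats reports
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.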
### 3. Receive Webhook (When Complete)
Your webhook endpoint will receive:
```http
POST /webhook HTTP/1.1
Host: your-server.com
X-Webhook-Signature: sha256=abc123...
X-Webhook-Timestamp: 1705582800

{
  "event": "job.completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
  "timestamp": "2026-01-18T10:30:00Z"
}
```
### 4. Verify Webhook Signature
```python
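import hashlib
import hmac
import os
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
# Must match the webhook_secret you sent with the /scrape request
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]


def verify_webhook(payload: bytes, signature: Optional[str], secret: str) -> bool:
    """Verify the HMAC-SHA256 signature from X-Webhook-Signature."""
    # Reject a missing or malformed header instead of raising
    if not signature or not signature.startswith("sha256="):
        return False
    expected = signature.split("sha256=", 1)[1]
    # Sign the raw request bytes, exactly as received
    computed = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison
    return hmac.compare_digest(expected, computed)


# In your webhook handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature")
    if not verify_webhook(payload, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process webhook...
    data = await request.json()
    job_id = data["job_id"]

    # Download reviews (blocking call; acceptable for a simple handler)
    reviews = requests.get(data["reviews_url"], timeout=30).json()
    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")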
```
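To exercise your handler without waiting for a real scrape, you can sign a test payload yourself. This is a minimal sketch: `sign_payload` is a hypothetical helper, and it assumes the signature is the hex HMAC-SHA256 of the raw request body only, which is what `verify_webhook` above implies. Whether the server also includes the timestamp in the signed data is not stated in this guide.

```python
import hashlib
import hmac
import time


def sign_payload(body: bytes, secret: str) -> dict:
    """Build the two webhook headers for a raw request body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return {
        "X-Webhook-Signature": f"sha256={digest}",
        # Unix seconds, matching the format shown in the example request
        "X-Webhook-Timestamp": str(int(time.time())),
    }
```

Send the returned headers together with the exact same `body` bytes; re-serializing the JSON on the way changes the digest.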
### 5. Get Reviews
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq
```
---
## 🏥 Health Checks
### Liveness (Is server alive?)
```bash
curl http://localhost:8000/health/live
```
**Use**: Kubernetes liveness probe (restart if fails)
### Readiness (Can handle traffic?)
```bash
curl http://localhost:8000/health/ready
```
**Use**: Kubernetes readiness probe (remove from load balancer if fails)
### Canary (Does scraping work?)
```bash
curl http://localhost:8000/health/canary
```
**Use**: External monitoring (PagerDuty alerts)
**How it works**:
- Runs a real scrape against a test URL every 4 hours
- Verifies that Chrome, the selectors, and GDPR handling all work
- Alerts after 3 consecutive failures
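The "3 consecutive failures" rule in the list above can be sketched as a small counter. `CanaryTracker` is a hypothetical name, not the service's actual class, and alerting only once when the threshold is first reached is an assumption about the intended behavior.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 3  # consecutive failures before alerting


@dataclass
class CanaryTracker:
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True when an alert should fire."""
        if success:
            # Any success resets the streak
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once, on the run that reaches the threshold (assumed policy)
        return self.consecutive_failures == ALERT_THRESHOLD
```

A single success between failures resets the streak, so two failures, one success, and two more failures would not alert.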
### Detailed Health
```bash
curl http://localhost:8000/health/detailed | jq
```
**Example response:**
```json
{
  "status": "healthy",
  "components": {
    "liveness": {
      "status": "alive"
    },
    "readiness": {
      "status": "ready",
      "checks": {
        "database": {"healthy": true}
      }
    },
    "canary": {
      "status": "healthy",
      "last_success": "2026-01-18T10:00:00Z",
      "age_minutes": 30,
      "consecutive_failures": 0
    }
  }
}
```
```
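An external monitor may also want to notice when the canary has stopped running altogether. The sketch below reads the example response above and flags a `last_success` older than one 4-hour interval plus a grace period. `canary_is_stale` and `grace_minutes` are hypothetical, and the field names are assumed to match the example.

```python
from datetime import datetime, timezone

CANARY_INTERVAL_MINUTES = 4 * 60  # the canary runs every 4 hours


def canary_is_stale(detail: dict, now: datetime, grace_minutes: int = 30) -> bool:
    """Return True if the last successful canary is older than interval + grace."""
    canary = detail["components"]["canary"]
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalize it
    last = datetime.fromisoformat(canary["last_success"].replace("Z", "+00:00"))
    age_minutes = (now - last).total_seconds() / 60
    return age_minutes > CANARY_INTERVAL_MINUTES + grace_minutes
```

Pass a timezone-aware `now`, for example `datetime.now(timezone.utc)`, since the timestamp in the response is in UTC.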
---
## 📊 Monitoring
### View Canary History
```bash
# Connect to database
docker-compose -f docker-compose.production.yml exec db psql -U scraper

# Then, at the psql prompt, query canary results:
SELECT
  timestamp,
  success,
  reviews_count,
  scrape_time,
  error_message
FROM canary_results
ORDER BY timestamp DESC
LIMIT 10;
```
### View Job Statistics
```bash
curl http://localhost:8000/stats | jq
```
**Response:**
```json
{
  "total_jobs": 150,
  "pending": 2,
  "running": 3,
  "completed": 140,
  "failed": 5,
  "cancelled": 0,
  "avg_scrape_time": 19.2,
  "total_reviews": 34560
}
```
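The counters in this response are enough to derive a failure rate for dashboards or alerts. A minimal sketch, assuming the field names shown above; `failure_rate` is a hypothetical helper, and counting only `completed` and `failed` jobs as finished is a design choice.

```python
def failure_rate(stats: dict) -> float:
    """Fraction of finished jobs (completed + failed) that failed."""
    finished = stats["completed"] + stats["failed"]
    # Avoid dividing by zero on a fresh install with no finished jobs
    return stats["failed"] / finished if finished else 0.0
```

For the example response this gives 5 / 145, roughly 3.4%.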
### View Webhook Delivery Stats
```sql
-- Run in psql: delivery attempts per job, most recent first
SELECT
  j.job_id,
  j.webhook_url,
  COUNT(w.id) AS attempts,
  SUM(CASE WHEN w.success THEN 1 ELSE 0 END) AS successful,
  MAX(w.timestamp) AS last_attempt
FROM jobs j
LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
WHERE j.webhook_url IS NOT NULL
GROUP BY j.job_id, j.webhook_url
-- NULLS LAST keeps jobs with no attempts from sorting to the top
ORDER BY last_attempt DESC NULLS LAST
LIMIT 10;
```
---
## 🐳 Docker Commands
### Start Services
```bash
docker-compose -f docker-compose.production.yml up -d
```
### Stop Services
```bash
docker-compose -f docker-compose.production.yml down
```
### View Logs
```bash
# All services
docker-compose -f docker-compose.production.yml logs -f
# Just API
docker-compose -f docker-compose.production.yml logs -f api
# Just database
docker-compose -f docker-compose.production.yml logs -f db
```
### Restart Services
```bash
docker-compose -f docker-compose.production.yml restart api
```
### Access Database
```bash
docker-compose -f docker-compose.production.yml exec db psql -U scraper
```
### Backup Database
```bash
# -T disables pseudo-TTY allocation so the redirected dump is not altered by terminal handling
docker-compose -f docker-compose.production.yml exec -T db pg_dump -U scraper scraper > backup.sql
```
### Restore Database
```bash
docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql
```
---
## 🔐 Security
### Webhook Signatures
All webhooks include HMAC-SHA256 signatures:
```
X-Webhook-Signature: sha256=abc123def456...
X-Webhook-Timestamp: 1705582800
```
**Always verify signatures** in your webhook handler!
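The `X-Webhook-Timestamp` header can additionally be used to reject old deliveries. The sketch below assumes the value is Unix seconds (as the example `1705582800` suggests). `timestamp_is_fresh` and the 5-minute tolerance are hypothetical. This guide does not say whether the timestamp is covered by the signature, so treat this check as a supplement to signature verification, not a replacement.

```python
import time
from typing import Optional

MAX_SKEW_SECONDS = 300  # assumed tolerance; tune for your deployment


def timestamp_is_fresh(header_value: Optional[str], now: Optional[float] = None) -> bool:
    """Return True if X-Webhook-Timestamp is within MAX_SKEW_SECONDS of now."""
    if header_value is None:
        return False
    try:
        sent_at = int(header_value)
    except ValueError:
        # Header present but not an integer
        return False
    current = time.time() if now is None else now
    return abs(current - sent_at) <= MAX_SKEW_SECONDS
```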
### Environment Variables
Store secrets in a `.env` file and never commit it to git:
```bash
# .env
DB_PASSWORD=strong_random_password_here
WEBHOOK_SECRET=another_strong_secret_here
```
### HTTPS in Production
Always use HTTPS URLs for:
- API_BASE_URL
- webhook_url parameters
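One way to enforce this is to validate URLs at startup or before submitting a job. A minimal sketch using only the standard library; `require_https` is a hypothetical helper, not part of the service.

```python
from urllib.parse import urlparse


def require_https(name: str, url: str) -> str:
    """Return url unchanged, or raise ValueError if it is not an https URL with a host."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"{name} must be an https:// URL, got {url!r}")
    return url
```

For example, `require_https("API_BASE_URL", os.environ["API_BASE_URL"])` would fail fast on the `http://localhost:8000` value used for local development, so apply it only in production configuration.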
---
## 📈 Scaling
### Vertical Scaling (Single Server)
```yaml
# docker-compose.production.yml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
```
### Horizontal Scaling (Multiple Workers)
Phase 2 will add Redis queue for distributing jobs across multiple workers:
```
Load Balancer
      ↓
API Servers (3 replicas)
      ↓
Redis Queue
      ↓
Workers (10 replicas)
      ↓
PostgreSQL
```
---
## 🚨 Alerting
### Slack Alerts
Set environment variable:
```bash
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
Canary failures will automatically post to Slack:
```
🚨 CRITICAL: Scraper canary failed 3 times in a row!
Last error: Timeout after 60 seconds
```
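If you want to send a test alert to confirm the Slack webhook works, something like the following sketch would do. `build_alert` and `post_to_slack` are hypothetical names and not the service's own code. The only Slack behavior relied on is that incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def build_alert(consecutive_failures: int, last_error: str) -> dict:
    """Build a Slack incoming-webhook payload; Slack displays the 'text' field."""
    return {
        "text": (
            f"🚨 CRITICAL: Scraper canary failed {consecutive_failures} times in a row!\n"
            f"Last error: {last_error}"
        )
    }


def post_to_slack(payload: dict, webhook_url: str) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Call `post_to_slack(build_alert(3, "Timeout after 60 seconds"), webhook_url)` with the value of `SLACK_WEBHOOK_URL` to reproduce the message shown above.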
### Email Alerts (TODO)
Future enhancement - integrate with SMTP or SendGrid.
### PagerDuty (TODO)
Future enhancement - integrate with PagerDuty API.
---
## 🧪 Testing
### Test Webhook Locally
Use webhook.site or ngrok:
```bash
# Start ngrok
ngrok http 8000
# Use ngrok URL as webhook
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://maps.google.com/...",
    "webhook_url": "https://your-id.ngrok.io/webhook"
  }'
```
### Test Health Checks
```bash
# Should return 200
curl -f http://localhost:8000/health/live || echo "FAILED"
# Should return 200
curl -f http://localhost:8000/health/ready || echo "FAILED"
# May return 503 if no canary has run yet
curl http://localhost:8000/health/canary
```
---
## 📝 Database Schema
### Jobs Table
```sql
CREATE TABLE jobs (
  job_id UUID PRIMARY KEY,
  status VARCHAR(20) NOT NULL,
  url TEXT NOT NULL,
  webhook_url TEXT,
  webhook_secret TEXT,
  created_at TIMESTAMP NOT NULL,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  reviews_count INTEGER,
  reviews_data JSONB,  -- All reviews stored here!
  scrape_time REAL,
  error_message TEXT,
  metadata JSONB
);
```
### Canary Results Table
```sql
CREATE TABLE canary_results (
  id SERIAL PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL,
  success BOOLEAN NOT NULL,
  reviews_count INTEGER,
  scrape_time REAL,
  error_message TEXT,
  metadata JSONB
);
```
### Webhook Attempts Table
```sql
CREATE TABLE webhook_attempts (
  id SERIAL PRIMARY KEY,
  job_id UUID NOT NULL,
  attempt_number INTEGER NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  success BOOLEAN NOT NULL,
  status_code INTEGER,
  error_message TEXT,
  response_time_ms REAL
);
```
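To make the columns concrete, here is a sketch of how a retrying sender could produce one record per delivery attempt, shaped like a `webhook_attempts` row. Everything here is an assumption for illustration: `deliver_with_retries` is a hypothetical name, and the attempt limit and exponential backoff delays are not documented values of the real dispatcher.

```python
import time
from typing import Callable, List, Optional


def deliver_with_retries(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Call send() until it returns a 2xx status; return one record per attempt."""
    attempts = []
    for attempt_number in range(1, max_attempts + 1):
        started = time.monotonic()
        status_code: Optional[int] = None
        error_message: Optional[str] = None
        try:
            status_code = send()
        except Exception as exc:  # network errors, timeouts, and similar
            error_message = str(exc)
        success = status_code is not None and 200 <= status_code < 300
        attempts.append({
            "attempt_number": attempt_number,
            "success": success,
            "status_code": status_code,
            "error_message": error_message,
            "response_time_ms": (time.monotonic() - started) * 1000,
        })
        if success:
            break
        if attempt_number < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt_number - 1))
    return attempts
```

Here `send` is any callable that performs the HTTP POST and returns the status code, which keeps the retry logic independent of the HTTP library.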
---
## 🎯 Next Steps (Phase 2)
Phase 2 will add:
- **Redis Queue** - Distribute jobs across multiple workers
- **Worker Processes** - Separate API from scraping
- **Auto-scaling** - Kubernetes HPA based on queue length
- **SSE Streaming** - Real-time progress updates (optional)
---
## 🐛 Troubleshooting
### Database Connection Errors
```bash
# Check database is running
docker-compose -f docker-compose.production.yml ps db
# Check connection
psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"
```
### Canary Always Failing
Check that the canary test URL is accessible:
```bash
curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
```
Try a different test URL in .env:
```
CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS
```
### Webhooks Not Delivered
Check the `webhook_attempts` table:
```sql
SELECT * FROM webhook_attempts
WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY timestamp DESC;
```
Check that the webhook dispatcher is running:
```bash
docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"
```
---
**Your production microservice is ready!** 🚀
For questions or issues, check:
- Server logs: `docker-compose -f docker-compose.production.yml logs -f api`
- Database: `docker-compose -f docker-compose.production.yml exec db psql -U scraper`
- Health checks: `curl http://localhost:8000/health/detailed`