Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

DEPLOYMENT_GUIDE.md Normal file

@@ -0,0 +1,604 @@
# Production Deployment Guide
## Phase 1: PostgreSQL + Webhooks + Health Checks
---
## What's Included
### Phase 1 Features:
- **PostgreSQL Storage** - Job metadata + reviews as JSONB
- **Webhooks** - Async notifications with retry logic and HMAC signatures
- **Smart Health Checks** - Canary testing every 4 hours to verify scraping works
- **Fast Scraper** - 18.9s average scraping time (8.2x faster)
- **Docker Deployment** - Easy deployment with Docker Compose
---
## 🚀 Quick Start (Docker)
### 1. Clone and Configure
```bash
# Copy environment file
cp .env.example .env
# Edit .env with your settings
nano .env
```
### 2. Start Services
```bash
# Build and start all services
docker-compose -f docker-compose.production.yml up -d
# Check logs
docker-compose -f docker-compose.production.yml logs -f api
```
### 3. Verify Health
```bash
# Check if API is running
curl http://localhost:8000/
# Check detailed health
curl http://localhost:8000/health/detailed | jq
```
**Done!** API is running on `http://localhost:8000`
---
## 🔧 Manual Installation
### 1. Install Dependencies
```bash
# Install Python dependencies
pip install -r requirements-production.txt
# Install PostgreSQL
# On macOS:
brew install postgresql@15
brew services start postgresql@15
# On Ubuntu:
sudo apt-get install postgresql-15
```
### 2. Setup Database
```bash
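# Open a psql session, then run the SQL statements below at the psql prompt
psql postgres
CREATE DATABASE scraper;
CREATE USER scraper WITH PASSWORD 'scraper123';
GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;
-- PostgreSQL 15+ no longer lets non-owners create tables in the public schema,
-- so make the scraper role the owner of its database
ALTER DATABASE scraper OWNER TO scraper;
\q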
```
### 3. Configure Environment
```bash
# Set environment variables
export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
export API_BASE_URL="http://localhost:8000"
```
### 4. Run Server
```bash
python api_server_production.py
```
Server runs on `http://localhost:8000`
---
## 📡 API Usage
### 1. Submit Job with Webhook
```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your-secret-key"
  }'
```
**Response:**
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}
```
### 2. Check Status
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
```
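If you are not using webhooks, the status endpoint can be polled until the job finishes. The sketch below is one way to do that. It assumes the terminal status names match those in the `/stats` response (`completed`, `failed`, `cancelled`); `wait_for_job` and its parameters are hypothetical names, not part of the API.

```python
import time
from typing import Callable

# Terminal states, assumed from the status names in the /stats response.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout}s")
        sleep(interval)
```

`fetch_status` would typically wrap a GET on `/jobs/<job_id>` and return the `status` field of the JSON response, assuming the job document exposes one as the submit response does.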
### 3. Receive Webhook (When Complete)
Your webhook endpoint will receive:
```
POST https://your-server.com/webhook
Headers:
  X-Webhook-Signature: sha256=abc123...
  X-Webhook-Timestamp: 1705582800
Body:
{
  "event": "job.completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
  "timestamp": "2026-01-18T10:30:00Z"
}
```
### 4. Verify Webhook Signature
```python
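import hashlib
import hmac
import os
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
# Assumption: the secret you sent as "webhook_secret" is stored in the environment
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]


def verify_webhook(payload: str, signature: Optional[str], secret: str) -> bool:
    """Verify the HMAC-SHA256 signature of a webhook payload."""
    # Reject requests with a missing or malformed signature header
    if not signature or not signature.startswith("sha256="):
        return False
    expected = signature.split("sha256=", 1)[1]
    computed = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking information through timing
    return hmac.compare_digest(expected, computed)


# In your webhook handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature")
    if not verify_webhook(payload.decode(), signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process webhook...
    data = await request.json()
    job_id = data['job_id']

    # Download reviews
    reviews = requests.get(data['reviews_url'], timeout=30).json()
    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")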
```
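To exercise your handler without waiting for a real job, you can generate a signed test payload yourself. This is a minimal sketch that mirrors `verify_webhook` above. It assumes the signature is an HMAC-SHA256 hex digest of the raw request body; `sign_payload` is a hypothetical helper, and the server's actual signing code may differ.

```python
import hashlib
import hmac
import json


def sign_payload(body: bytes, secret: str) -> str:
    """Return a header value in the 'sha256=<hex digest>' format shown above."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


# Example test payload; the field values are placeholders.
body = json.dumps({"event": "job.completed", "job_id": "test-job"}).encode()
header = sign_payload(body, "your-secret-key")
print(header)
```

Send `body` to your endpoint with `header` as the `X-Webhook-Signature` value. If the assumption holds, `verify_webhook(body.decode(), header, "your-secret-key")` returns `True`.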
### 5. Get Reviews
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq
```
---
## 🏥 Health Checks
### Liveness (Is server alive?)
```bash
curl http://localhost:8000/health/live
```
**Use**: Kubernetes liveness probe (restart if fails)
### Readiness (Can handle traffic?)
```bash
curl http://localhost:8000/health/ready
```
**Use**: Kubernetes readiness probe (remove from load balancer if fails)
### Canary (Does scraping work?)
```bash
curl http://localhost:8000/health/canary
```
**Use**: External monitoring (PagerDuty alerts)
**How it works**:
- Runs a real scrape against the test URL every 4 hours
- Verifies that Chrome, the selectors, and GDPR handling all work
- Alerts after 3 consecutive failures
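
The consecutive-failure rule can be sketched as a small counter. This illustrates the idea and is not the service's actual code: `CanaryTracker` is a hypothetical name, and firing the alert only once when the threshold is first reached is an assumed design choice.

```python
from dataclasses import dataclass


@dataclass
class CanaryTracker:
    """Counts consecutive canary failures and signals when to alert."""

    threshold: int = 3  # the "3 consecutive failures" rule from the list above
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True if an alert should fire."""
        if success:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the threshold is first reached.
        return self.consecutive_failures == self.threshold


tracker = CanaryTracker()
alerts = [tracker.record(ok) for ok in (False, False, True, False, False, False)]
print(alerts)  # [False, False, False, False, False, True]
```

A single success resets the counter, so isolated failures never trigger an alert.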
### Detailed Health
```bash
curl http://localhost:8000/health/detailed | jq
```
**Example response:**
```json
{
  "status": "healthy",
  "components": {
    "liveness": {
      "status": "alive"
    },
    "readiness": {
      "status": "ready",
      "checks": {
        "database": {"healthy": true}
      }
    },
    "canary": {
      "status": "healthy",
      "last_success": "2026-01-18T10:00:00Z",
      "age_minutes": 30,
      "consecutive_failures": 0
    }
  }
}
```
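For automated monitoring, the detailed response can be reduced to a list of failing components. The sketch below assumes the field layout of the example response. The function name `unhealthy_components` and the set of status strings treated as healthy are illustrative, not part of the API.

```python
# Status strings treated as "OK", taken from the example response above.
OK_STATUSES = {"healthy", "alive", "ready"}


def unhealthy_components(health: dict) -> list:
    """Return the names of components whose status is not an OK value."""
    components = health.get("components", {})
    return [
        name
        for name, info in components.items()
        if info.get("status") not in OK_STATUSES
    ]


example = {
    "status": "healthy",
    "components": {
        "liveness": {"status": "alive"},
        "readiness": {"status": "ready"},
        "canary": {"status": "failing", "consecutive_failures": 3},
    },
}
print(unhealthy_components(example))  # ['canary']
```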
---
## 📊 Monitoring
### View Canary History
```bash
# Run the query through psql inside the db container
docker-compose -f docker-compose.production.yml exec db psql -U scraper -c "
  SELECT
    timestamp,
    success,
    reviews_count,
    scrape_time,
    error_message
  FROM canary_results
  ORDER BY timestamp DESC
  LIMIT 10;"
```
### View Job Statistics
```bash
curl http://localhost:8000/stats | jq
```
**Response:**
```json
{
  "total_jobs": 150,
  "pending": 2,
  "running": 3,
  "completed": 140,
  "failed": 5,
  "cancelled": 0,
  "avg_scrape_time": 19.2,
  "total_reviews": 34560
}
```
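A success rate is often more useful on a dashboard than raw counts. The sketch below derives one from the fields shown above; `success_rate` is a hypothetical helper, and counting only completed and failed jobs as "finished" is an assumption.

```python
def success_rate(stats: dict) -> float:
    """Fraction of finished jobs (completed + failed) that completed."""
    finished = stats["completed"] + stats["failed"]
    return stats["completed"] / finished if finished else 0.0


# Values copied from the example response above.
stats = {"completed": 140, "failed": 5, "cancelled": 0}
print(f"{success_rate(stats):.1%}")  # 96.6%
```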
### View Webhook Delivery Stats
```sql
-- Run inside a psql session (see "Access Database")
SELECT
  j.job_id,
  j.webhook_url,
  COUNT(w.id) AS attempts,
  SUM(CASE WHEN w.success THEN 1 ELSE 0 END) AS successful,
  MAX(w.timestamp) AS last_attempt
FROM jobs j
LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
WHERE j.webhook_url IS NOT NULL
GROUP BY j.job_id, j.webhook_url
-- NULLS LAST keeps jobs with no delivery attempts from sorting to the top
ORDER BY last_attempt DESC NULLS LAST
LIMIT 10;
```
---
## 🐳 Docker Commands
### Start Services
```bash
docker-compose -f docker-compose.production.yml up -d
```
### Stop Services
```bash
docker-compose -f docker-compose.production.yml down
```
### View Logs
```bash
# All services
docker-compose -f docker-compose.production.yml logs -f
# Just API
docker-compose -f docker-compose.production.yml logs -f api
# Just database
docker-compose -f docker-compose.production.yml logs -f db
```
### Restart Services
```bash
docker-compose -f docker-compose.production.yml restart api
```
### Access Database
```bash
docker-compose -f docker-compose.production.yml exec db psql -U scraper
```
### Backup Database
```bash
docker-compose -f docker-compose.production.yml exec db pg_dump -U scraper scraper > backup.sql
```
### Restore Database
```bash
docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql
```
---
## 🔐 Security
### Webhook Signatures
All webhooks include HMAC-SHA256 signatures:
```
X-Webhook-Signature: sha256=abc123def456...
X-Webhook-Timestamp: 1705582800
```
**Always verify signatures** in your webhook handler!
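
The timestamp header can also be used to reject stale or replayed deliveries. This is a minimal sketch, assuming `X-Webhook-Timestamp` is a Unix time in seconds, as the example value suggests. `is_fresh` and the 5-minute tolerance are illustrative choices, and the guide does not say whether the timestamp is covered by the signature.

```python
import time
from typing import Optional


def is_fresh(
    timestamp_header: Optional[str],
    max_age_seconds: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True if the timestamp is within max_age_seconds of the current time."""
    if timestamp_header is None:
        return False
    try:
        sent_at = int(timestamp_header)
    except ValueError:
        return False  # header is not an integer
    current = time.time() if now is None else now
    return abs(current - sent_at) <= max_age_seconds


# 100 seconds after the example timestamp: still accepted.
print(is_fresh("1705582800", now=1705582900))  # True
```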
### Environment Variables
Store secrets in a `.env` file and never commit it to git:
```bash
# .env
DB_PASSWORD=strong_random_password_here
WEBHOOK_SECRET=another_strong_secret_here
```
### HTTPS in Production
Always use HTTPS URLs for:
- API_BASE_URL
- webhook_url parameters
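
A client can enforce this before submitting a job. The check below is a small sketch using only the standard library; `is_https_url` is a hypothetical helper and not something the API provides.

```python
from urllib.parse import urlparse


def is_https_url(url: str) -> bool:
    """Return True only for absolute URLs that use the https scheme."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and bool(parsed.netloc)


print(is_https_url("https://your-server.com/webhook"))  # True
print(is_https_url("http://localhost:8000"))            # False
```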
---
## 📈 Scaling
### Vertical Scaling (Single Server)
```yaml
# docker-compose.production.yml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
```
### Horizontal Scaling (Multiple Workers)
Phase 2 will add a Redis queue to distribute jobs across multiple workers:
```
Load Balancer
      ↓
API Servers (3 replicas)
      ↓
Redis Queue
      ↓
Workers (10 replicas)
      ↓
PostgreSQL
```
---
## 🚨 Alerting
### Slack Alerts
Set environment variable:
```bash
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```
Canary failures will automatically post to Slack:
```
🚨 CRITICAL: Scraper canary failed 3 times in a row!
Last error: Timeout after 60 seconds
```
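To check your Slack webhook URL independently of the canary, you can post a similar message yourself. The sketch below relies on Slack incoming webhooks accepting a JSON body with a `text` field. `build_alert` and `send_alert` are hypothetical helpers, and the service's own alert code may build its message differently.

```python
import json
import urllib.request


def build_alert(failures: int, last_error: str) -> dict:
    """Build a Slack payload that mirrors the alert text shown above."""
    text = (
        f"🚨 CRITICAL: Scraper canary failed {failures} times in a row!\n"
        f"Last error: {last_error}"
    )
    return {"text": text}


def send_alert(payload: dict, webhook_url: str) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_alert(3, "Timeout after 60 seconds")
print(payload["text"])
```

Call `send_alert(payload, os.environ["SLACK_WEBHOOK_URL"])` (after `import os`) to deliver the message.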
### Email Alerts (TODO)
Future enhancement - integrate with SMTP or SendGrid.
### PagerDuty (TODO)
Future enhancement - integrate with PagerDuty API.
---
## 🧪 Testing
### Test Webhook Locally
Use webhook.site or ngrok:
```bash
# Start ngrok
ngrok http 8000
# Use ngrok URL as webhook
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://maps.google.com/...",
    "webhook_url": "https://your-id.ngrok.io/webhook"
  }'
```
### Test Health Checks
```bash
# Should return 200
curl -f http://localhost:8000/health/live || echo "FAILED"
# Should return 200
curl -f http://localhost:8000/health/ready || echo "FAILED"
# May return 503 if no canary run yet
curl http://localhost:8000/health/canary
```
---
## 📝 Database Schema
### Jobs Table
```sql
CREATE TABLE jobs (
  job_id UUID PRIMARY KEY,
  status VARCHAR(20) NOT NULL,
  url TEXT NOT NULL,
  webhook_url TEXT,
  webhook_secret TEXT,
  created_at TIMESTAMP NOT NULL,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  reviews_count INTEGER,
  reviews_data JSONB, -- All reviews stored here!
  scrape_time REAL,
  error_message TEXT,
  metadata JSONB
);
```
### Canary Results Table
```sql
CREATE TABLE canary_results (
  id SERIAL PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL,
  success BOOLEAN NOT NULL,
  reviews_count INTEGER,
  scrape_time REAL,
  error_message TEXT,
  metadata JSONB
);
```
### Webhook Attempts Table
```sql
CREATE TABLE webhook_attempts (
  id SERIAL PRIMARY KEY,
  job_id UUID NOT NULL,
  attempt_number INTEGER NOT NULL,
  timestamp TIMESTAMP NOT NULL,
  success BOOLEAN NOT NULL,
  status_code INTEGER,
  error_message TEXT,
  response_time_ms REAL
);
```
---
## 🎯 Next Steps (Phase 2)
Phase 2 will add:
- **Redis Queue** - Distribute jobs across multiple workers
- **Worker Processes** - Separate API from scraping
- **Auto-scaling** - Kubernetes HPA based on queue length
- **SSE Streaming** - Real-time progress updates (optional)
---
## 🐛 Troubleshooting
### Database Connection Errors
```bash
# Check database is running
docker-compose -f docker-compose.production.yml ps db
# Check connection
psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"
```
### Canary Always Failing
Check canary test URL is accessible:
```bash
curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
```
Try a different test URL in .env:
```
CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS
```
### Webhooks Not Delivered
Check webhook attempts table:
```sql
SELECT * FROM webhook_attempts
WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY timestamp DESC;
```
Check webhook dispatcher is running:
```bash
docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"
```
---
**Your production microservice is ready!** 🚀
For questions or issues, check:
- Server logs: `docker-compose -f docker-compose.production.yml logs -f api`
- Database: `docker-compose -f docker-compose.production.yml exec db psql -U scraper`
- Health checks: `curl http://localhost:8000/health/detailed`