# Production Deployment Guide

## Phase 1: PostgreSQL + Webhooks + Health Checks

---

## What's Included

### Phase 1 Features:

- ✅ **PostgreSQL Storage** - Job metadata + reviews as JSONB
- ✅ **Webhooks** - Async notifications with retry logic and HMAC signatures
- ✅ **Smart Health Checks** - Canary testing every 4 hours to verify scraping works
- ✅ **Fast Scraper** - 18.9s average scraping time (8.2x faster)
- ✅ **Docker Deployment** - Easy deployment with Docker Compose

---

## 🚀 Quick Start (Docker)

### 1. Clone and Configure

```bash
# Copy environment file
cp .env.example .env

# Edit .env with your settings
nano .env
```

### 2. Start Services

```bash
# Build and start all services
docker-compose -f docker-compose.production.yml up -d

# Check logs
docker-compose -f docker-compose.production.yml logs -f api
```

### 3. Verify Health

```bash
# Check if API is running
curl http://localhost:8000/

# Check detailed health
curl http://localhost:8000/health/detailed | jq
```

**Done!** API is running on `http://localhost:8000`

---

## 🔧 Manual Installation

### 1. Install Dependencies

```bash
# Install Python dependencies
pip install -r requirements-production.txt

# Install PostgreSQL
# On macOS:
brew install postgresql@15
brew services start postgresql@15

# On Ubuntu:
sudo apt-get install postgresql-15
```

### 2. Setup Database

```bash
# Open a psql session, then run the statements below inside it
psql postgres
CREATE USER scraper WITH PASSWORD 'scraper123';  -- example password, change it outside local testing
-- Make scraper the owner: on PostgreSQL 15+, database-level privileges alone
-- no longer allow creating tables in the public schema
CREATE DATABASE scraper OWNER scraper;
GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper;
\q
```

### 3. Configure Environment

```bash
# Set environment variables
export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper"
export API_BASE_URL="http://localhost:8000"
```

### 4. Run Server

```bash
python api_server_production.py
```

Server runs on `http://localhost:8000`

---

## 📡 API Usage

### 1. Submit Job with Webhook

```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "webhook_url": "https://your-server.com/webhook",
    "webhook_secret": "your-secret-key"
  }'
```

**Response:**

```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}
```

### 2. Check Status

```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq
```
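
A client that cannot receive webhooks can poll the status endpoint instead. The following is a minimal sketch, assuming the status response is JSON with a `status` field and that `completed`, `failed`, and `cancelled` are terminal values (those names appear in the `/stats` output in this guide, but the exact shape of the `/jobs/{job_id}` response is an assumption). `fetch_job` and `wait_for_job` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

# Assumed terminal states; adjust to match the real API.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or attempts run out."""
    for _ in range(max_attempts):
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_attempts} polls")
```

Passing `fetch` and `sleep` as parameters keeps the loop easy to test without a running server; with the defaults, `wait_for_job("550e8400-e29b-41d4-a716-446655440000")` would poll the local API every 5 seconds.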

### 3. Receive Webhook (When Complete)

Your webhook endpoint will receive:

```
POST https://your-server.com/webhook
Headers:
  X-Webhook-Signature: sha256=abc123...
  X-Webhook-Timestamp: 1705582800

Body:
{
  "event": "job.completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews",
  "timestamp": "2026-01-18T10:30:00Z"
}
```

### 4. Verify Webhook Signature

```python
import hashlib
import hmac
import json
import os
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
# Should match the "webhook_secret" sent when the job was submitted
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]


def verify_webhook(payload: bytes, signature: Optional[str], secret: str) -> bool:
    """Verify the HMAC-SHA256 signature of the raw webhook body."""
    # Reject requests with a missing or malformed signature header
    if not signature or not signature.startswith("sha256="):
        return False
    expected = signature.split("sha256=", 1)[1]
    computed = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking information through timing
    return hmac.compare_digest(expected, computed)


# In your webhook handler:
@app.post("/webhook")
async def handle_webhook(request: Request):
    payload = await request.body()
    signature = request.headers.get("X-Webhook-Signature")

    if not verify_webhook(payload, signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    # Process webhook...
    data = json.loads(payload)
    job_id = data["job_id"]

    # Download reviews (requests blocks the event loop; acceptable for a demo)
    reviews = requests.get(data["reviews_url"], timeout=30).json()
    print(f"Got {len(reviews['reviews'])} reviews for job {job_id}")
```

### 5. Get Reviews

```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq
```

---

## 🏥 Health Checks

### Liveness (Is server alive?)

```bash
curl http://localhost:8000/health/live
```

**Use**: Kubernetes liveness probe (restart if fails)

### Readiness (Can handle traffic?)

```bash
curl http://localhost:8000/health/ready
```

**Use**: Kubernetes readiness probe (remove from load balancer if fails)

### Canary (Does scraping work?)

```bash
curl http://localhost:8000/health/canary
```

**Use**: External monitoring (e.g., PagerDuty alerts)

**How it works**:
- Runs a real scrape test every 4 hours against the test URL
- Verifies that Chrome, the selectors, and GDPR handling all work
- Alerts after 3 consecutive failures
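
The alerting rule in the last bullet can be pictured as a counter that resets on every success. This is a minimal sketch of that idea, not the service's actual implementation; `CanaryTracker` and its `record` method are hypothetical names, and the threshold of 3 is taken from the list above.

```python
from dataclasses import dataclass


@dataclass
class CanaryTracker:
    """Counts consecutive canary failures and flags when to alert."""

    threshold: int = 3  # alert after this many failures in a row
    consecutive_failures: int = 0

    def record(self, success: bool) -> bool:
        """Record one canary run; return True if an alert should fire."""
        if success:
            # Any success resets the streak
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

This version keeps returning `True` for every further failure once the threshold is reached; an alerter that should fire only once per incident would compare with `==` instead of `>=`.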

### Detailed Health

```bash
curl http://localhost:8000/health/detailed | jq
```

**Example response:**

```json
{
  "status": "healthy",
  "components": {
    "liveness": {
      "status": "alive"
    },
    "readiness": {
      "status": "ready",
      "checks": {
        "database": {"healthy": true}
      }
    },
    "canary": {
      "status": "healthy",
      "last_success": "2026-01-18T10:00:00Z",
      "age_minutes": 30,
      "consecutive_failures": 0
    }
  }
}
```

---

## 📊 Monitoring

### View Canary History

```bash
# Connect to database
docker-compose -f docker-compose.production.yml exec db psql -U scraper

# Then run this query inside the psql session
SELECT
  timestamp,
  success,
  reviews_count,
  scrape_time,
  error_message
FROM canary_results
ORDER BY timestamp DESC
LIMIT 10;
```

### View Job Statistics

```bash
curl http://localhost:8000/stats | jq
```

**Response:**

```json
{
  "total_jobs": 150,
  "pending": 2,
  "running": 3,
  "completed": 140,
  "failed": 5,
  "cancelled": 0,
  "avg_scrape_time": 19.2,
  "total_reviews": 34560
}
```

### View Webhook Delivery Stats

```sql
-- Run inside a psql session connected to the database
SELECT
  j.job_id,
  j.webhook_url,
  COUNT(w.id) as attempts,
  SUM(CASE WHEN w.success THEN 1 ELSE 0 END) as successful,
  MAX(w.timestamp) as last_attempt
FROM jobs j
LEFT JOIN webhook_attempts w ON j.job_id = w.job_id
WHERE j.webhook_url IS NOT NULL
GROUP BY j.job_id, j.webhook_url
ORDER BY last_attempt DESC
LIMIT 10;
```

---

## 🐳 Docker Commands

### Start Services

```bash
docker-compose -f docker-compose.production.yml up -d
```

### Stop Services

```bash
docker-compose -f docker-compose.production.yml down
```

### View Logs

```bash
# All services
docker-compose -f docker-compose.production.yml logs -f

# Just API
docker-compose -f docker-compose.production.yml logs -f api

# Just database
docker-compose -f docker-compose.production.yml logs -f db
```

### Restart Services

```bash
docker-compose -f docker-compose.production.yml restart api
```

### Access Database

```bash
docker-compose -f docker-compose.production.yml exec db psql -U scraper
```

### Backup Database

```bash
# -T disables the pseudo-TTY so the redirected dump is not altered
docker-compose -f docker-compose.production.yml exec -T db pg_dump -U scraper scraper > backup.sql
```

### Restore Database

```bash
docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql
```

---

## 🔐 Security

### Webhook Signatures

All webhooks include HMAC-SHA256 signatures:

```
X-Webhook-Signature: sha256=abc123def456...
X-Webhook-Timestamp: 1705582800
```

**Always verify signatures** in your webhook handler!
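
Alongside the signature, the `X-Webhook-Timestamp` header can help a receiver reject replayed deliveries. The sketch below assumes the header carries Unix seconds (as the example value suggests) and that a 5-minute tolerance is acceptable; `is_fresh` and `max_age_seconds` are hypothetical names. This guide does not state whether the timestamp is covered by the signature, so treat the check as a complement to signature verification, not a replacement.

```python
import time
from typing import Optional


def is_fresh(
    timestamp_header: Optional[str],
    max_age_seconds: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True if the webhook timestamp is within the allowed window."""
    if timestamp_header is None:
        return False
    try:
        sent_at = int(timestamp_header)
    except ValueError:
        return False  # header is not an integer number of seconds
    current = time.time() if now is None else now
    # abs() also rejects timestamps too far in the future (clock skew)
    return abs(current - sent_at) <= max_age_seconds
```

In a handler this would run next to the signature check, for example `is_fresh(request.headers.get("X-Webhook-Timestamp"))`, returning a 4xx response when it is `False`.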

### Environment Variables

Store secrets in a `.env` file (never commit it to git):

```bash
# .env
DB_PASSWORD=strong_random_password_here
WEBHOOK_SECRET=another_strong_secret_here
```
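
At startup, an application can read these values from the environment and fail fast when one is missing. This is a minimal sketch; `load_secrets` is a hypothetical helper, and how `api_server_production.py` actually reads its configuration is not shown in this guide.

```python
import os
from typing import Mapping

# Variable names taken from the .env example above
REQUIRED_VARS = ("DB_PASSWORD", "WEBHOOK_SECRET")


def load_secrets(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required secrets, raising if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing at startup makes a misconfigured deployment visible immediately, before the first webhook has to be signed.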

### HTTPS in Production

Always use HTTPS URLs for:
- API_BASE_URL
- webhook_url parameters

---

## 📈 Scaling

### Vertical Scaling (Single Server)

```yaml
# docker-compose.production.yml
services:
  api:
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4G
```

### Horizontal Scaling (Multiple Workers)

Phase 2 will add a Redis queue for distributing jobs across multiple workers:

```
Load Balancer
    ↓
API Servers (3 replicas)
    ↓
Redis Queue
    ↓
Workers (10 replicas)
    ↓
PostgreSQL
```

---

## 🚨 Alerting

### Slack Alerts

Set environment variable:

```bash
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

Canary failures will automatically post to Slack:

```
🚨 CRITICAL: Scraper canary failed 3 times in a row!
Last error: Timeout after 60 seconds
```
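
For reference, a Slack incoming webhook accepts a JSON body with a `text` field. The sketch below shows how such an alert could be built and sent; `build_alert` and `post_to_slack` are hypothetical names, not the service's actual code, and the message format simply mirrors the example above.

```python
import json
import urllib.request


def build_alert(consecutive_failures: int, last_error: str) -> dict:
    """Build the Slack message payload for a canary failure."""
    text = (
        f"🚨 CRITICAL: Scraper canary failed {consecutive_failures} times in a row!\n"
        f"Last error: {last_error}"
    )
    return {"text": text}


def post_to_slack(payload: dict, webhook_url: str) -> int:
    """POST the payload to a Slack incoming webhook; return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A caller would read the URL from `SLACK_WEBHOOK_URL` and run `post_to_slack(build_alert(3, "Timeout after 60 seconds"), url)`.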

### Email Alerts (TODO)

Future enhancement - integrate with SMTP or SendGrid.

### PagerDuty (TODO)

Future enhancement - integrate with PagerDuty API.

---

## 🧪 Testing

### Test Webhook Locally

Use webhook.site or ngrok:

```bash
# Start ngrok
ngrok http 8000

# Use ngrok URL as webhook
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://maps.google.com/...",
    "webhook_url": "https://your-id.ngrok.io/webhook"
  }'
```

### Test Health Checks

```bash
# Should return 200
curl -f http://localhost:8000/health/live || echo "FAILED"

# Should return 200
curl -f http://localhost:8000/health/ready || echo "FAILED"

# May return 503 if no canary run yet
curl http://localhost:8000/health/canary
```

---

## 📝 Database Schema

### Jobs Table

```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    webhook_secret TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_data JSONB,  -- All reviews stored here!
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB
);
```

### Canary Results Table

```sql
CREATE TABLE canary_results (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMP NOT NULL,
    success BOOLEAN NOT NULL,
    reviews_count INTEGER,
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB
);
```

### Webhook Attempts Table

```sql
CREATE TABLE webhook_attempts (
    id SERIAL PRIMARY KEY,
    job_id UUID NOT NULL,
    attempt_number INTEGER NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    success BOOLEAN NOT NULL,
    status_code INTEGER,
    error_message TEXT,
    response_time_ms REAL
);
```

---

## 🎯 Next Steps (Phase 2)

Phase 2 will add (planned, not yet implemented):
- **Redis Queue** - Distribute jobs across multiple workers
- **Worker Processes** - Separate API from scraping
- **Auto-scaling** - Kubernetes HPA based on queue length
- **SSE Streaming** - Real-time progress updates (optional)

---

## 🐛 Troubleshooting

### Database Connection Errors

```bash
# Check database is running
docker-compose -f docker-compose.production.yml ps db

# Check connection
psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1"
```

### Canary Always Failing

Check canary test URL is accessible:

```bash
curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/"
```

Try a different test URL in .env:

```
CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS
```

### Webhooks Not Delivered

Check webhook attempts table:

```sql
SELECT * FROM webhook_attempts
WHERE job_id = '550e8400-e29b-41d4-a716-446655440000'
ORDER BY timestamp DESC;
```

Check webhook dispatcher is running:

```bash
docker-compose -f docker-compose.production.yml logs -f api | grep "webhook"
```

---

**Your production microservice is ready!** 🚀

For questions or issues, check:
- Server logs: `docker-compose -f docker-compose.production.yml logs -f api`
- Database: `docker-compose -f docker-compose.production.yml exec db psql -U scraper`
- Health checks: `curl http://localhost:8000/health/detailed`