# Production Deployment Guide ## Phase 1: PostgreSQL + Webhooks + Health Checks --- ## ��️ What's Included ### Phase 1 Features: - ✅ **PostgreSQL Storage** - Job metadata + reviews as JSONB - ✅ **Webhooks** - Async notifications with retry logic and HMAC signatures - ✅ **Smart Health Checks** - Canary testing every 4 hours to verify scraping works - ✅ **Fast Scraper** - 18.9s average scraping time (8.2x faster) - ✅ **Docker Deployment** - Easy deployment with Docker Compose --- ## 🚀 Quick Start (Docker) ### 1. Clone and Configure ```bash # Copy environment file cp .env.example .env # Edit .env with your settings nano .env ``` ### 2. Start Services ```bash # Build and start all services docker-compose -f docker-compose.production.yml up -d # Check logs docker-compose -f docker-compose.production.yml logs -f api ``` ### 3. Verify Health ```bash # Check if API is running curl http://localhost:8000/ # Check detailed health curl http://localhost:8000/health/detailed | jq ``` **Done!** API is running on `http://localhost:8000` --- ## 🔧 Manual Installation ### 1. Install Dependencies ```bash # Install Python dependencies pip install -r requirements-production.txt # Install PostgreSQL # On macOS: brew install postgresql@15 brew services start postgresql@15 # On Ubuntu: sudo apt-get install postgresql-15 ``` ### 2. Setup Database ```bash # Create database and user psql postgres CREATE DATABASE scraper; CREATE USER scraper WITH PASSWORD 'scraper123'; GRANT ALL PRIVILEGES ON DATABASE scraper TO scraper; \q ``` ### 3. Configure Environment ```bash # Set environment variables export DATABASE_URL="postgresql://scraper:scraper123@localhost:5432/scraper" export API_BASE_URL="http://localhost:8000" ``` ### 4. Run Server ```bash python api_server_production.py ``` Server runs on `http://localhost:8000` --- ## 📡 API Usage ### 1. Submit Job with Webhook ```bash curl -X POST "http://localhost:8000/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL", "webhook_url": "https://your-server.com/webhook", "webhook_secret": "your-secret-key" }' ``` **Response:** ```json { "job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "started" } ``` ### 2. Check Status ```bash curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000" | jq ``` ### 3. Receive Webhook (When Complete) Your webhook endpoint will receive: ```json POST https://your-server.com/webhook Headers: X-Webhook-Signature: sha256=abc123... X-Webhook-Timestamp: 1705582800 Body: { "event": "job.completed", "job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed", "reviews_count": 244, "scrape_time": 18.9, "reviews_url": "http://localhost:8000/jobs/550e8400-.../reviews", "timestamp": "2026-01-18T10:30:00Z" } ``` ### 4. Verify Webhook Signature ```python import hmac import hashlib def verify_webhook(payload: str, signature: str, secret: str) -> bool: """Verify webhook signature""" expected = signature.split("sha256=", 1)[1] computed = hmac.new( secret.encode(), payload.encode(), hashlib.sha256 ).hexdigest() return hmac.compare_digest(expected, computed) # In your webhook handler: @app.post("/webhook") async def handle_webhook(request: Request): payload = await request.body() signature = request.headers.get("X-Webhook-Signature") if not verify_webhook(payload.decode(), signature, WEBHOOK_SECRET): raise HTTPException(status_code=401, detail="Invalid signature") # Process webhook... data = await request.json() job_id = data['job_id'] # Download reviews reviews = requests.get(data['reviews_url']).json() print(f"Got {len(reviews['reviews'])} reviews for job {job_id}") ``` ### 5. Get Reviews ```bash curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" | jq ``` --- ## 🏥 Health Checks ### Liveness (Is server alive?) ```bash curl http://localhost:8000/health/live ``` **Use**: Kubernetes liveness probe (restart if fails) ### Readiness (Can handle traffic?) ```bash curl http://localhost:8000/health/ready ``` **Use**: Kubernetes readiness probe (remove from load balancer if fails) ### Canary (Does scraping work?) ```bash curl http://localhost:8000/health/canary ``` **Use**: External monitoring (PagerDuty alerts) **How it works**: - Runs real scrape test every 4 hours on test URL - Verifies Chrome, selectors, GDPR handling all work - Alerts if 3 consecutive failures ### Detailed Health ```bash curl http://localhost:8000/health/detailed | jq ``` **Example response:** ```json { "status": "healthy", "components": { "liveness": { "status": "alive" }, "readiness": { "status": "ready", "checks": { "database": {"healthy": true} } }, "canary": { "status": "healthy", "last_success": "2026-01-18T10:00:00Z", "age_minutes": 30, "consecutive_failures": 0 } } } ``` --- ## 📊 Monitoring ### View Canary History ```bash # Connect to database docker-compose -f docker-compose.production.yml exec db psql -U scraper # Query canary results SELECT timestamp, success, reviews_count, scrape_time, error_message FROM canary_results ORDER BY timestamp DESC LIMIT 10; ``` ### View Job Statistics ```bash curl http://localhost:8000/stats | jq ``` **Response:** ```json { "total_jobs": 150, "pending": 2, "running": 3, "completed": 140, "failed": 5, "cancelled": 0, "avg_scrape_time": 19.2, "total_reviews": 34560 } ``` ### View Webhook Delivery Stats ```sql -- Connect to database SELECT j.job_id, j.webhook_url, COUNT(w.id) as attempts, SUM(CASE WHEN w.success THEN 1 ELSE 0 END) as successful, MAX(w.timestamp) as last_attempt FROM jobs j LEFT JOIN webhook_attempts w ON j.job_id = w.job_id WHERE j.webhook_url IS NOT NULL GROUP BY j.job_id, j.webhook_url ORDER BY last_attempt DESC LIMIT 10; ``` --- ## 🐳 Docker Commands ### Start Services ```bash docker-compose -f docker-compose.production.yml up -d ``` ### Stop Services ```bash docker-compose -f docker-compose.production.yml down ``` ### View Logs ```bash # All services docker-compose -f docker-compose.production.yml logs -f # Just API docker-compose -f docker-compose.production.yml logs -f api # Just database docker-compose -f docker-compose.production.yml logs -f db ``` ### Restart Services ```bash docker-compose -f docker-compose.production.yml restart api ``` ### Access Database ```bash docker-compose -f docker-compose.production.yml exec db psql -U scraper ``` ### Backup Database ```bash docker-compose -f docker-compose.production.yml exec db pg_dump -U scraper scraper > backup.sql ``` ### Restore Database ```bash docker-compose -f docker-compose.production.yml exec -T db psql -U scraper scraper < backup.sql ``` --- ## 🔐 Security ### Webhook Signatures All webhooks include HMAC-SHA256 signatures: ``` X-Webhook-Signature: sha256=abc123def456... X-Webhook-Timestamp: 1705582800 ``` **Always verify signatures** in your webhook handler! ### Environment Variables Store secrets in `.env` file (never commit to git): ```bash # .env DB_PASSWORD=strong_random_password_here WEBHOOK_SECRET=another_strong_secret_here ``` ### HTTPS in Production Always use HTTPS URLs for: - API_BASE_URL - webhook_url parameters --- ## 📈 Scaling ### Vertical Scaling (Single Server) ```yaml # docker-compose.production.yml services: api: deploy: resources: limits: cpus: '2' memory: 4G ``` ### Horizontal Scaling (Multiple Workers) Phase 2 will add Redis queue for distributing jobs across multiple workers: ``` Load Balancer ↓ API Servers (3 replicas) ↓ Redis Queue ↓ Workers (10 replicas) ↓ PostgreSQL ``` --- ## 🚨 Alerting ### Slack Alerts Set environment variable: ```bash export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" ``` Canary failures will automatically post to Slack: ``` 🚨 CRITICAL: Scraper canary failed 3 times in a row! Last error: Timeout after 60 seconds ``` ### Email Alerts (TODO) Future enhancement - integrate with SMTP or SendGrid. ### PagerDuty (TODO) Future enhancement - integrate with PagerDuty API. --- ## 🧪 Testing ### Test Webhook Locally Use webhook.site or ngrok: ```bash # Start ngrok ngrok http 8000 # Use ngrok URL as webhook curl -X POST "http://localhost:8000/scrape" \ -H "Content-Type: application/json" \ -d '{ "url": "https://maps.google.com/...", "webhook_url": "https://your-id.ngrok.io/webhook" }' ``` ### Test Health Checks ```bash # Should return 200 curl -f http://localhost:8000/health/live || echo "FAILED" # Should return 200 curl -f http://localhost:8000/health/ready || echo "FAILED" # May return 503 if no canary run yet curl http://localhost:8000/health/canary ``` --- ## 📝 Database Schema ### Jobs Table ```sql CREATE TABLE jobs ( job_id UUID PRIMARY KEY, status VARCHAR(20) NOT NULL, url TEXT NOT NULL, webhook_url TEXT, webhook_secret TEXT, created_at TIMESTAMP NOT NULL, started_at TIMESTAMP, completed_at TIMESTAMP, reviews_count INTEGER, reviews_data JSONB, -- All reviews stored here! scrape_time REAL, error_message TEXT, metadata JSONB ); ``` ### Canary Results Table ```sql CREATE TABLE canary_results ( id SERIAL PRIMARY KEY, timestamp TIMESTAMP NOT NULL, success BOOLEAN NOT NULL, reviews_count INTEGER, scrape_time REAL, error_message TEXT, metadata JSONB ); ``` ### Webhook Attempts Table ```sql CREATE TABLE webhook_attempts ( id SERIAL PRIMARY KEY, job_id UUID NOT NULL, attempt_number INTEGER NOT NULL, timestamp TIMESTAMP NOT NULL, success BOOLEAN NOT NULL, status_code INTEGER, error_message TEXT, response_time_ms REAL ); ``` --- ## 🎯 Next Steps (Phase 2) Phase 2 will add: - ✅ **Redis Queue** - Distribute jobs across multiple workers - ✅ **Worker Processes** - Separate API from scraping - ✅ **Auto-scaling** - Kubernetes HPA based on queue length - ✅ **SSE Streaming** - Real-time progress updates (optional) --- ## 🐛 Troubleshooting ### Database Connection Errors ```bash # Check database is running docker-compose -f docker-compose.production.yml ps db # Check connection psql postgresql://scraper:scraper123@localhost:5432/scraper -c "SELECT 1" ``` ### Canary Always Failing Check canary test URL is accessible: ```bash curl -I "https://www.google.com/maps/place/Soho+Factory/@54.6738155,25.2595844,17z/" ``` Try a different test URL in .env: ``` CANARY_TEST_URL=https://www.google.com/maps/place/YOUR_STABLE_BUSINESS ``` ### Webhooks Not Delivered Check webhook attempts table: ```sql SELECT * FROM webhook_attempts WHERE job_id = '550e8400-e29b-41d4-a716-446655440000' ORDER BY timestamp DESC; ``` Check webhook dispatcher is running: ```bash docker-compose -f docker-compose.production.yml logs -f api | grep "webhook" ``` --- **Your production microservice is ready!** 🚀 For questions or issues, check: - Server logs: `docker-compose logs -f api` - Database: `docker-compose exec db psql -U scraper` - Health checks: `curl http://localhost:8000/health/detailed`