# Production Microservice Architecture ## Google Reviews Scraper API --- ## 🎯 Recommended Communication Patterns ### 1. **Webhooks** (Primary - RECOMMENDED) ✅ **Best for**: Production async job processing ``` Client → POST /scrape (with webhook_url) ↓ Server → Starts job, returns job_id ↓ [Scraping in progress...] ↓ Server → POST to client's webhook_url when complete { "job_id": "...", "status": "completed", "reviews_count": 244, "reviews_url": "https://api.example.com/jobs/{job_id}/reviews" } ``` **Advantages**: - ✅ No polling needed (reduces server load) - ✅ Instant notifications when job completes - ✅ Industry standard (Stripe, GitHub, Twilio use this) - ✅ Client can go offline and come back - ✅ Scales to millions of jobs **Use cases**: - Batch processing systems - Integration with other services - When client has a public endpoint --- ### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡ **Best for**: Real-time progress monitoring ``` Client → GET /jobs/{job_id}/stream (keeps connection open) ↓ Server → Sends progress updates in real-time: data: {"stage": "scrolling", "reviews_loaded": 50} data: {"stage": "scrolling", "reviews_loaded": 100} data: {"stage": "extracting", "reviews_loaded": 244} data: {"stage": "completed", "total": 244} ``` **Advantages**: - ✅ Real-time progress updates - ✅ HTTP-based (works through firewalls) - ✅ Lightweight (one-way communication) - ✅ Auto-reconnection support - ✅ Great for dashboards/UIs **Use cases**: - Web dashboards - Real-time monitoring - Progress bars in UI --- ### 3. **Polling** (Fallback) 🔄 **Best for**: Simple clients, no webhook capability ``` Client → POST /scrape ↓ Server → Returns job_id ↓ Client → Polls GET /jobs/{job_id} every 2-5 seconds ↓ Server → Returns current status ``` **Advantages**: - ✅ Simple to implement - ✅ Works everywhere (no public endpoint needed) - ✅ Firewall-friendly **Disadvantages**: - ❌ Inefficient (many wasted requests) - ❌ Delayed notifications (polling interval) - ❌ Higher server load **Use cases**: - Internal tools - Clients behind firewalls - Simple integrations --- ## 🏛️ Complete Production Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ LOAD BALANCER │ │ (nginx/AWS ALB) │ └──────────┬──────────────────────────────────┬────────────────┘ │ │ ▼ ▼ ┌──────────────────────┐ ┌──────────────────────┐ │ API Server 1 │ │ API Server 2 │ │ (FastAPI) │ │ (FastAPI) │ │ - REST endpoints │ │ - REST endpoints │ │ - Health checks │ │ - Health checks │ │ - Job management │ │ - Job management │ └──────────┬───────────┘ └──────────┬───────────┘ │ │ └────────────┬───────────────────┘ ▼ ┌────────────────────────┐ │ REDIS / RabbitMQ │ │ (Job Queue) │ │ │ │ - Pending jobs │ │ - Job distribution │ │ - Pub/Sub for events │ └────────┬───────────────┘ │ ▼ ┌──────────────┴──────────────┐ │ │ ▼ ▼ ┌─────────────┐ ┌─────────────┐ │ Worker 1 │ │ Worker 2 │ │ │ │ │ │ - Scraping │ │ - Scraping │ │ - Headless │ │ - Headless │ │ - Chrome │ │ - Chrome │ └─────┬───────┘ └─────┬───────┘ │ │ └────────────┬───────────────┘ ▼ ┌──────────────────────────────┐ │ PERSISTENT STORAGE │ │ │ │ ┌────────────────────────┐ │ │ │ PostgreSQL / MongoDB │ │ │ │ - Job metadata │ │ │ │ - Status tracking │ │ │ │ - Webhook configs │ │ │ └────────────────────────┘ │ │ │ │ ┌────────────────────────┐ │ │ │ File Storage / S3 │ │ │ │ - Review JSON files │ │ │ │ - Large payloads │ │ │ └────────────────────────┘ │ └───────────────────────────────┘ │ ▼ ┌─────────────────────┐ │ Webhook Dispatcher │ │ - Retry logic │ │ - Dead letter queue│ └─────────────────────┘ │ ▼ [Client's webhook URL] ``` --- ## 📦 Component Breakdown ### 1. **API Server** (FastAPI) **Responsibilities**: - Handle HTTP requests - Validate input - Enqueue jobs - Serve results - Health checks **Endpoints**: ```python POST /scrape # Submit job GET /jobs/{id} # Get job status GET /jobs/{id}/reviews # Get results GET /jobs/{id}/stream # SSE progress stream DELETE /jobs/{id} # Cancel job GET /health # Health check GET /metrics # Prometheus metrics ``` --- ### 2. **Job Queue** (Redis or RabbitMQ) **Why needed**: - Decouple API from scraping workers - Distribute load across workers - Retry failed jobs - Handle backpressure **Options**: **Option A: Redis** (Recommended for simpler setups) ```python # Fast, simple, good for most use cases - In-memory queue - Pub/Sub for events - Job state storage - Session storage ``` **Option B: RabbitMQ** (For complex workflows) ```python # More features, better for complex scenarios - Guaranteed delivery - Advanced routing - Dead letter queues - Priority queues ``` **Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed. --- ### 3. **Worker Processes** (Celery or Custom) **Responsibilities**: - Pull jobs from queue - Run scraping (headless Chrome) - Save results - Send webhooks - Update job status **Scaling**: ```bash # Run 4 workers on same machine celery -A worker worker --concurrency=4 # Or 4 separate processes python worker.py & python worker.py & python worker.py & python worker.py & # Or Kubernetes deployment kubectl scale deployment scraper-worker --replicas=10 ``` --- ### 4. **Database** (PostgreSQL or MongoDB) **Job Metadata Schema**: **PostgreSQL** (Recommended): ```sql CREATE TABLE jobs ( job_id UUID PRIMARY KEY, status VARCHAR(20) NOT NULL, url TEXT NOT NULL, webhook_url TEXT, created_at TIMESTAMP NOT NULL, started_at TIMESTAMP, completed_at TIMESTAMP, reviews_count INTEGER, reviews_file_path TEXT, error_message TEXT, metadata JSONB ); CREATE INDEX idx_jobs_status ON jobs(status); CREATE INDEX idx_jobs_created_at ON jobs(created_at); ``` **Why PostgreSQL**: - ✅ ACID transactions - ✅ Good for structured data - ✅ SQL queries - ✅ Mature ecosystem **Alternative - MongoDB**: ```javascript { _id: ObjectId("..."), job_id: "550e8400-...", status: "completed", url: "https://...", webhook_url: "https://...", created_at: ISODate("2026-01-18T..."), reviews_count: 244, reviews_file: "/data/reviews/550e8400.json", metadata: { ... } } ``` **Why MongoDB**: - ✅ Flexible schema - ✅ Good for document storage - ✅ Built-in sharding **Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions) --- ### 5. **File Storage** **Options**: **Option A: Local Filesystem** (Development/Small scale) ```python /data/reviews/ ├── 550e8400-e29b-41d4-a716-446655440000.json ├── 6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json └── ... ``` **Option B: S3 / Object Storage** (Production - RECOMMENDED) ```python s3://scraper-reviews-bucket/ ├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json ├── 2026/01/18/6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json └── ... ``` **Why S3**: - ✅ Unlimited storage - ✅ No disk management - ✅ High availability - ✅ Versioning support - ✅ Pre-signed URLs for direct access - ✅ Lifecycle policies (auto-delete old files) **Recommendation**: **S3 (or compatible)** for production --- ### 6. **Webhook Dispatcher** **Features**: - ✅ Retry logic (exponential backoff) - ✅ Dead letter queue for failed webhooks - ✅ Webhook signatures (HMAC for security) - ✅ Timeout handling - ✅ Async delivery **Implementation**: ```python async def send_webhook(webhook_url, payload, max_retries=3): for attempt in range(max_retries): try: # Add signature signature = hmac.new( WEBHOOK_SECRET, json.dumps(payload).encode(), hashlib.sha256 ).hexdigest() # Send with timeout async with httpx.AsyncClient() as client: response = await client.post( webhook_url, json=payload, headers={"X-Webhook-Signature": signature}, timeout=10.0 ) if response.status_code == 200: return True except Exception as e: if attempt < max_retries - 1: await asyncio.sleep(2 ** attempt) # Exponential backoff else: # Move to dead letter queue await save_to_dead_letter_queue(webhook_url, payload) return False ``` --- ## 🔥 Complete Workflow Examples ### Workflow 1: **Webhooks** (Production) ```python # 1. Client submits job with webhook POST /scrape { "url": "https://maps.google.com/...", "webhook_url": "https://client.com/webhook", "webhook_secret": "secret123" # For signature verification } Response: { "job_id": "550e8400-...", "status": "queued", "estimated_time": "20s" } # 2. Server enqueues job redis.lpush("scraper:queue", job_id) # 3. Worker picks up job worker = get_from_queue() result = fast_scrape_reviews(url) # 4. Save to S3 s3.upload(f"reviews/{job_id}.json", reviews) # 5. Update database db.jobs.update(job_id, { status: "completed", reviews_count: 244, reviews_url: f"https://api.example.com/jobs/{job_id}/reviews" }) # 6. Send webhook to client POST https://client.com/webhook Headers: X-Webhook-Signature: hmac_sha256(payload, secret) Body: { "event": "job.completed", "job_id": "550e8400-...", "status": "completed", "reviews_count": 244, "reviews_url": "https://api.example.com/jobs/{job_id}/reviews", "completed_at": "2026-01-18T10:30:20Z" } # 7. Client downloads reviews GET https://api.example.com/jobs/{job_id}/reviews # Or direct S3 pre-signed URL GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=... ``` --- ### Workflow 2: **SSE Streaming** (Real-time Dashboard) ```python # 1. Client opens SSE connection EventSource("/jobs/{job_id}/stream") # 2. Server streams progress updates def stream_progress(job_id): while True: job = get_job(job_id) yield f"data: {json.dumps({ 'stage': job.stage, 'reviews_loaded': job.reviews_loaded, 'progress_percent': job.progress_percent })}\n\n" if job.status in ['completed', 'failed']: break await asyncio.sleep(1) # Update every second # 3. Client receives updates onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress": 20} onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress": 40} onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress": 60} onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress": 100} onmessage: {"stage": "completed", "total": 244} ``` --- ### Workflow 3: **Polling** (Simple Clients) ```python # 1. Submit job (no webhook) POST /scrape { "url": "https://maps.google.com/..." } Response: { "job_id": "550e8400-...", "status": "queued" } # 2. Poll every 3 seconds while True: response = GET /jobs/{job_id} if response.status == "completed": reviews = GET /jobs/{job_id}/reviews break elif response.status == "failed": handle_error(response.error_message) break sleep(3) ``` --- ## 🏥 Health Checks ### 1. **Basic Health Check** ```python @app.get("/health") async def health_check(): return { "status": "healthy", "timestamp": datetime.utcnow().isoformat(), "version": "1.0.0" } ``` ### 2. **Detailed Health Check** (Recommended) ```python @app.get("/health/detailed") async def detailed_health(): checks = { "api": await check_api(), # Always healthy if responding "database": await check_database(), # Query DB "redis": await check_redis(), # Ping Redis "s3": await check_s3(), # List buckets "workers": await check_workers(), # Check if workers alive "disk": await check_disk_space(), # Check disk usage } overall_healthy = all(c["healthy"] for c in checks.values()) return { "status": "healthy" if overall_healthy else "degraded", "checks": checks, "timestamp": datetime.utcnow().isoformat() } # Example response: { "status": "healthy", "checks": { "api": {"healthy": true, "latency_ms": 1}, "database": {"healthy": true, "latency_ms": 5}, "redis": {"healthy": true, "latency_ms": 2}, "s3": {"healthy": true, "latency_ms": 50}, "workers": {"healthy": true, "active_workers": 4}, "disk": {"healthy": true, "usage_percent": 45} }, "timestamp": "2026-01-18T10:30:00Z" } ``` ### 3. **Readiness vs Liveness** (Kubernetes) ```python # Liveness: Is the app alive? (restart if false) @app.get("/health/live") async def liveness(): # Simple check - is the server running? return {"status": "alive"} # Readiness: Can the app handle traffic? (remove from load balancer if false) @app.get("/health/ready") async def readiness(): # Check dependencies db_ok = await ping_database() redis_ok = await ping_redis() if db_ok and redis_ok: return {"status": "ready"} else: raise HTTPException(status_code=503, detail="Not ready") ``` --- ## 📊 Monitoring & Metrics ### Prometheus Metrics ```python from prometheus_client import Counter, Histogram, Gauge # Counters jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status']) webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success']) # Histograms scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration') reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job') # Gauges active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs') queue_size = Gauge('scraper_queue_size', 'Jobs in queue') @app.get("/metrics") async def metrics(): # Prometheus scrapes this endpoint return Response(generate_latest(), media_type="text/plain") ``` --- ## 🔐 Security ### 1. **API Keys** ```python @app.post("/scrape") async def scrape( request: ScrapeRequest, api_key: str = Header(..., alias="X-API-Key") ): if not validate_api_key(api_key): raise HTTPException(status_code=401, detail="Invalid API key") # Process request... ``` ### 2. **Rate Limiting** ```python from slowapi import Limiter, _rate_limit_exceeded_handler from slowapi.util import get_remote_address limiter = Limiter(key_func=get_remote_address) @app.post("/scrape") @limiter.limit("10/minute") # Max 10 jobs per minute async def scrape(request: Request, ...): # Process request... ``` ### 3. **Webhook Signatures** ```python import hmac def verify_webhook_signature(payload, signature, secret): expected = hmac.new( secret.encode(), payload.encode(), hashlib.sha256 ).hexdigest() return hmac.compare_digest(signature, expected) ``` --- ## 🚀 Deployment Options ### Option 1: **Docker Compose** (Development) ```yaml version: '3.8' services: api: build: . ports: - "8000:8000" environment: - REDIS_URL=redis://redis:6379 - DATABASE_URL=postgresql://db:5432/scraper depends_on: - redis - db worker: build: . command: python worker.py environment: - REDIS_URL=redis://redis:6379 depends_on: - redis deploy: replicas: 4 redis: image: redis:7-alpine db: image: postgres:15-alpine environment: - POSTGRES_DB=scraper ``` ### Option 2: **Kubernetes** (Production) ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: scraper-api spec: replicas: 3 selector: matchLabels: app: scraper-api template: spec: containers: - name: api image: scraper-api:latest ports: - containerPort: 8000 env: - name: REDIS_URL value: redis://redis:6379 livenessProbe: httpGet: path: /health/live port: 8000 readinessProbe: httpGet: path: /health/ready port: 8000 --- apiVersion: apps/v1 kind: Deployment metadata: name: scraper-worker spec: replicas: 10 selector: matchLabels: app: scraper-worker template: spec: containers: - name: worker image: scraper-worker:latest ``` --- ## 📈 Scaling Considerations ### Horizontal Scaling ``` 1 Worker = 3 jobs/minute (20s per job) 10 Workers = 30 jobs/minute 100 Workers = 300 jobs/minute = 432,000 jobs/day ``` ### Resource Requirements (per worker) ``` CPU: 1-2 cores (Chrome is CPU-intensive) RAM: 2-4 GB (headless Chrome + data) Disk: Minimal (results go to S3) ``` ### Auto-scaling (Kubernetes HPA) ```yaml apiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: scraper-worker-hpa spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: scraper-worker minReplicas: 2 maxReplicas: 50 metrics: - type: External external: metric: name: redis_queue_size target: type: Value value: "10" # Scale up if queue > 10 ``` --- ## ✅ Recommended Stack ### For Small-Medium (< 1000 jobs/day): ``` ✅ FastAPI (API Server) ✅ Redis (Queue + Cache) ✅ PostgreSQL (Job metadata) ✅ Local files or S3 (Reviews storage) ✅ Webhooks (Primary) ✅ Polling (Fallback) ✅ Docker Compose (Deployment) ``` ### For Large Scale (> 10,000 jobs/day): ``` ✅ FastAPI (API Server) ✅ RabbitMQ (Queue) ✅ PostgreSQL (Job metadata) ✅ S3 (Reviews storage) ✅ Webhooks (Primary) ✅ SSE (Real-time updates) ✅ Kubernetes (Orchestration) ✅ Prometheus + Grafana (Monitoring) ✅ ELK Stack (Logging) ``` --- ## 🎯 Next Steps Would you like me to implement: 1. ✅ **Webhooks** - Full webhook support with retries 2. ✅ **Redis Queue** - Job queue with Celery/RQ 3. ✅ **PostgreSQL** - Job metadata storage 4. ✅ **S3 Storage** - Reviews file storage 5. ✅ **Health Checks** - Detailed health endpoints 6. ✅ **SSE Streaming** - Real-time progress updates (optional) 7. ✅ **Docker Setup** - Complete docker-compose.yml **My recommendation**: Start with **#1-5** (core production features), add #6-7 later if needed. Let me know which to implement first!