whyrating-engine-legacy/MICROSERVICE_ARCHITECTURE.md

# Production Microservice Architecture
## Google Reviews Scraper API

---

## 🎯 Recommended Communication Patterns

### 1. **Webhooks** (Primary - RECOMMENDED) ✅

**Best for**: Production async job processing

```
Client → POST /scrape (with webhook_url)
         ↓
Server → Starts job, returns job_id
         ↓
         [Scraping in progress...]
         ↓
Server → POST to client's webhook_url when complete
         {
           "job_id": "...",
           "status": "completed",
           "reviews_count": 244,
           "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
         }
```

**Advantages**:
- ✅ No polling needed (reduces server load)
- ✅ Instant notifications when job completes
- ✅ Industry standard (Stripe, GitHub, Twilio use this)
- ✅ Client can go offline and come back
- ✅ Scales to millions of jobs

**Use cases**:
- Batch processing systems
- Integration with other services
- When client has a public endpoint

---

### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡

**Best for**: Real-time progress monitoring

```
Client → GET /jobs/{job_id}/stream (keeps connection open)
         ↓
Server → Sends progress updates in real-time:

         data: {"stage": "scrolling", "reviews_loaded": 50}

         data: {"stage": "scrolling", "reviews_loaded": 100}

         data: {"stage": "extracting", "reviews_loaded": 244}

         data: {"stage": "completed", "total": 244}
```

**Advantages**:
- ✅ Real-time progress updates
- ✅ HTTP-based (works through firewalls)
- ✅ Lightweight (one-way communication)
- ✅ Auto-reconnection support
- ✅ Great for dashboards/UIs

**Use cases**:
- Web dashboards
- Real-time monitoring
- Progress bars in UI

---

### 3. **Polling** (Fallback) 🔄

**Best for**: Simple clients, no webhook capability

```
Client → POST /scrape
         ↓
Server → Returns job_id
         ↓
Client → Polls GET /jobs/{job_id} every 2-5 seconds
         ↓
Server → Returns current status
```

**Advantages**:
- ✅ Simple to implement
- ✅ Works everywhere (no public endpoint needed)
- ✅ Firewall-friendly

**Disadvantages**:
- ❌ Inefficient (many wasted requests)
- ❌ Delayed notifications (polling interval)
- ❌ Higher server load

**Use cases**:
- Internal tools
- Clients behind firewalls
- Simple integrations

---

## 🏛️ Complete Production Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        LOAD BALANCER                         │
│                     (nginx/AWS ALB)                          │
└──────────┬──────────────────────────────────┬────────────────┘
           │                                  │
           ▼                                  ▼
┌──────────────────────┐         ┌──────────────────────┐
│   API Server 1       │         │   API Server 2       │
│   (FastAPI)          │         │   (FastAPI)          │
│   - REST endpoints   │         │   - REST endpoints   │
│   - Health checks    │         │   - Health checks    │
│   - Job management   │         │   - Job management   │
└──────────┬───────────┘         └──────────┬───────────┘
           │                                │
           └────────────┬───────────────────┘
                        ▼
           ┌────────────────────────┐
           │    REDIS / RabbitMQ    │
           │    (Job Queue)         │
           │                        │
           │  - Pending jobs        │
           │  - Job distribution    │
           │  - Pub/Sub for events  │
           └────────┬───────────────┘
                    │
                    ▼
     ┌──────────────┴──────────────┐
     │                             │
     ▼                             ▼
┌─────────────┐              ┌─────────────┐
│  Worker 1   │              │  Worker 2   │
│             │              │             │
│ - Scraping  │              │ - Scraping  │
│ - Headless  │              │ - Headless  │
│ - Chrome    │              │ - Chrome    │
└─────┬───────┘              └─────┬───────┘
      │                            │
      └────────────┬───────────────┘
                   ▼
    ┌──────────────────────────────┐
    │   PERSISTENT STORAGE          │
    │                               │
    │  ┌────────────────────────┐   │
    │  │  PostgreSQL / MongoDB  │   │
    │  │  - Job metadata        │   │
    │  │  - Status tracking     │   │
    │  │  - Webhook configs     │   │
    │  └────────────────────────┘   │
    │                               │
    │  ┌────────────────────────┐   │
    │  │  File Storage / S3     │   │
    │  │  - Review JSON files   │   │
    │  │  - Large payloads      │   │
    │  └────────────────────────┘   │
    └───────────────────────────────┘
                   │
                   ▼
         ┌─────────────────────┐
         │  Webhook Dispatcher │
         │  - Retry logic      │
         │  - Dead letter queue│
         └─────────────────────┘
                   │
                   ▼
         [Client's webhook URL]
```

---

## 📦 Component Breakdown

### 1. **API Server** (FastAPI)

**Responsibilities**:
- Handle HTTP requests
- Validate input
- Enqueue jobs
- Serve results
- Health checks

**Endpoints**:
```python
POST   /scrape              # Submit job
GET    /jobs/{id}           # Get job status
GET    /jobs/{id}/reviews   # Get results
GET    /jobs/{id}/stream    # SSE progress stream
DELETE /jobs/{id}           # Cancel job
GET    /health              # Health check
GET    /metrics             # Prometheus metrics
```

---

### 2. **Job Queue** (Redis or RabbitMQ)

**Why needed**:
- Decouple API from scraping workers
- Distribute load across workers
- Retry failed jobs
- Handle backpressure

**Options**:

**Option A: Redis** (Recommended for simpler setups)
```python
# Fast, simple, good for most use cases
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage
```

**Option B: RabbitMQ** (For complex workflows)
```python
# More features, better for complex scenarios
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues
```

**Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed.

---

### 3. **Worker Processes** (Celery or Custom)

**Responsibilities**:
- Pull jobs from queue
- Run scraping (headless Chrome)
- Save results
- Send webhooks
- Update job status

**Scaling**:
```bash
# Run 4 workers on same machine
celery -A worker worker --concurrency=4

# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &

# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10
```

---

### 4. **Database** (PostgreSQL or MongoDB)

**Job Metadata Schema**:

**PostgreSQL** (Recommended):
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_file_path TEXT,
    error_message TEXT,
    metadata JSONB
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
```

**Why PostgreSQL**:
- ✅ ACID transactions
- ✅ Good for structured data
- ✅ SQL queries
- ✅ Mature ecosystem

**Alternative - MongoDB**:
```javascript
{
  _id: ObjectId("..."),
  job_id: "550e8400-...",
  status: "completed",
  url: "https://...",
  webhook_url: "https://...",
  created_at: ISODate("2026-01-18T..."),
  reviews_count: 244,
  reviews_file: "/data/reviews/550e8400.json",
  metadata: { ... }
}
```

**Why MongoDB**:
- ✅ Flexible schema
- ✅ Good for document storage
- ✅ Built-in sharding

**Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions)

---

### 5. **File Storage**

**Options**:

**Option A: Local Filesystem** (Development/Small scale)
```python
/data/reviews/
  ├── 550e8400-e29b-41d4-a716-446655440000.json
  ├── 6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
  └── ...
```

**Option B: S3 / Object Storage** (Production - RECOMMENDED)
```python
s3://scraper-reviews-bucket/
  ├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
  ├── 2026/01/18/6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
  └── ...
```

**Why S3**:
- ✅ Unlimited storage
- ✅ No disk management
- ✅ High availability
- ✅ Versioning support
- ✅ Pre-signed URLs for direct access
- ✅ Lifecycle policies (auto-delete old files)

**Recommendation**: **S3 (or compatible)** for production

---

### 6. **Webhook Dispatcher**

**Features**:
- ✅ Retry logic (exponential backoff)
- ✅ Dead letter queue for failed webhooks
- ✅ Webhook signatures (HMAC for security)
- ✅ Timeout handling
- ✅ Async delivery

**Implementation**:
```python
async def send_webhook(webhook_url, payload, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Add signature
            signature = hmac.new(
                WEBHOOK_SECRET,
                json.dumps(payload).encode(),
                hashlib.sha256
            ).hexdigest()

            # Send with timeout
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    webhook_url,
                    json=payload,
                    headers={"X-Webhook-Signature": signature},
                    timeout=10.0
                )

                if response.status_code == 200:
                    return True

        except Exception as e:
            if attempt < max_retries - 1:
                await asyncio.sleep(2 ** attempt)  # Exponential backoff
            else:
                # Move to dead letter queue
                await save_to_dead_letter_queue(webhook_url, payload)

    return False
```

---

## 🔥 Complete Workflow Examples

### Workflow 1: **Webhooks** (Production)

```python
# 1. Client submits job with webhook
POST /scrape
{
  "url": "https://maps.google.com/...",
  "webhook_url": "https://client.com/webhook",
  "webhook_secret": "secret123"  # For signature verification
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued",
  "estimated_time": "20s"
}

# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)

# 3. Worker picks up job
worker = get_from_queue()
result = fast_scrape_reviews(url)

# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)

# 5. Update database
db.jobs.update(job_id, {
  status: "completed",
  reviews_count: 244,
  reviews_url: f"https://api.example.com/jobs/{job_id}/reviews"
})

# 6. Send webhook to client
POST https://client.com/webhook
Headers:
  X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
  "event": "job.completed",
  "job_id": "550e8400-...",
  "status": "completed",
  "reviews_count": 244,
  "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
  "completed_at": "2026-01-18T10:30:20Z"
}

# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
```

---

### Workflow 2: **SSE Streaming** (Real-time Dashboard)

```python
# 1. Client opens SSE connection
EventSource("/jobs/{job_id}/stream")

# 2. Server streams progress updates
def stream_progress(job_id):
    while True:
        job = get_job(job_id)

        yield f"data: {json.dumps({
            'stage': job.stage,
            'reviews_loaded': job.reviews_loaded,
            'progress_percent': job.progress_percent
        })}\n\n"

        if job.status in ['completed', 'failed']:
            break

        await asyncio.sleep(1)  # Update every second

# 3. Client receives updates
onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress": 20}
onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress": 40}
onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress": 60}
onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress": 100}
onmessage: {"stage": "completed", "total": 244}
```

---

### Workflow 3: **Polling** (Simple Clients)

```python
# 1. Submit job (no webhook)
POST /scrape
{
  "url": "https://maps.google.com/..."
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued"
}

# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}

    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break

    sleep(3)
```

---

## 🏥 Health Checks

### 1. **Basic Health Check**

```python
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "timestamp": datetime.utcnow().isoformat(),
        "version": "1.0.0"
    }
```

### 2. **Detailed Health Check** (Recommended)

```python
@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),           # Always healthy if responding
        "database": await check_database(), # Query DB
        "redis": await check_redis(),       # Ping Redis
        "s3": await check_s3(),            # List buckets
        "workers": await check_workers(),   # Check if workers alive
        "disk": await check_disk_space(),  # Check disk usage
    }

    overall_healthy = all(c["healthy"] for c in checks.values())

    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.utcnow().isoformat()
    }

# Example response:
{
  "status": "healthy",
  "checks": {
    "api": {"healthy": true, "latency_ms": 1},
    "database": {"healthy": true, "latency_ms": 5},
    "redis": {"healthy": true, "latency_ms": 2},
    "s3": {"healthy": true, "latency_ms": 50},
    "workers": {"healthy": true, "active_workers": 4},
    "disk": {"healthy": true, "usage_percent": 45}
  },
  "timestamp": "2026-01-18T10:30:00Z"
}
```

### 3. **Readiness vs Liveness** (Kubernetes)

```python
# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()

    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")
```

---

## 📊 Monitoring & Metrics

### Prometheus Metrics

```python
from prometheus_client import Counter, Histogram, Gauge

# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])

# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')

# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type="text/plain")
```

---

## 🔐 Security

### 1. **API Keys**

```python
@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Process request...
```

### 2. **Rate Limiting**

```python
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute
async def scrape(request: Request, ...):
    # Process request...
```

### 3. **Webhook Signatures**

```python
import hmac

def verify_webhook_signature(payload, signature, secret):
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(signature, expected)
```

---

## 🚀 Deployment Options

### Option 1: **Docker Compose** (Development)

```yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://db:5432/scraper
    depends_on:
      - redis
      - db

  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 4

  redis:
    image: redis:7-alpine

  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
```

### Option 2: **Kubernetes** (Production)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    spec:
      containers:
      - name: api
        image: scraper-api:latest
        ports:
        - containerPort: 8000
        env:
        - name: REDIS_URL
          value: redis://redis:6379
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    spec:
      containers:
      - name: worker
        image: scraper-worker:latest
```

---

## 📈 Scaling Considerations

### Horizontal Scaling

```
1 Worker  = 3 jobs/minute (20s per job)
10 Workers = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
```

### Resource Requirements (per worker)

```
CPU: 1-2 cores (Chrome is CPU-intensive)
RAM: 2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)
```

### Auto-scaling (Kubernetes HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: redis_queue_size
      target:
        type: Value
        value: "10"  # Scale up if queue > 10
```

---

## ✅ Recommended Stack

### For Small-Medium (< 1000 jobs/day):
```
✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)
```

### For Large Scale (> 10,000 jobs/day):
```
✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)
```

---

## 🎯 Next Steps

Would you like me to implement:

1. ✅ **Webhooks** - Full webhook support with retries
2. ✅ **Redis Queue** - Job queue with Celery/RQ
3. ✅ **PostgreSQL** - Job metadata storage
4. ✅ **S3 Storage** - Reviews file storage
5. ✅ **Health Checks** - Detailed health endpoints
6. ✅ **SSE Streaming** - Real-time progress updates (optional)
7. ✅ **Docker Setup** - Complete docker-compose.yml

**My recommendation**: Start with **#1-5** (core production features), add #6-7 later if needed.

Let me know which to implement first!