Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/MICROSERVICE_ARCHITECTURE.md
+++ b/MICROSERVICE_ARCHITECTURE.md
@@ -0,0 +1,833 @@
+# Production Microservice Architecture
+## Google Reviews Scraper API
+
+---
+
+## 🎯 Recommended Communication Patterns
+
+### 1. **Webhooks** (Primary - RECOMMENDED) ✅
+
+**Best for**: Production async job processing
+
+```
+Client → POST /scrape (with webhook_url)
+         ↓
+Server → Starts job, returns job_id
+         ↓
+         [Scraping in progress...]
+         ↓
+Server → POST to client's webhook_url when complete
+         {
+           "job_id": "...",
+           "status": "completed",
+           "reviews_count": 244,
+           "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
+         }
+```
+
+**Advantages**:
+- ✅ No polling needed (reduces server load)
+- ✅ Instant notifications when job completes
+- ✅ Industry standard (Stripe, GitHub, Twilio use this)
+- ✅ Client can go offline and come back
+- ✅ Scales to millions of jobs
+
+**Use cases**:
+- Batch processing systems
+- Integration with other services
+- When client has a public endpoint
+
+---
+
+### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡
+
+**Best for**: Real-time progress monitoring
+
+```
+Client → GET /jobs/{job_id}/stream (keeps connection open)
+         ↓
+Server → Sends progress updates in real-time:
+
+         data: {"stage": "scrolling", "reviews_loaded": 50}
+
+         data: {"stage": "scrolling", "reviews_loaded": 100}
+
+         data: {"stage": "extracting", "reviews_loaded": 244}
+
+         data: {"stage": "completed", "total": 244}
+```
+
+**Advantages**:
+- ✅ Real-time progress updates
+- ✅ HTTP-based (works through firewalls)
+- ✅ Lightweight (one-way communication)
+- ✅ Auto-reconnection support
+- ✅ Great for dashboards/UIs
+
+**Use cases**:
+- Web dashboards
+- Real-time monitoring
+- Progress bars in UI
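+
+The wire format shown above is simple enough to build by hand. The following is a minimal sketch of a frame serializer, assuming JSON payloads; the helper name `format_sse` is hypothetical and not part of this codebase.
+
+```python
+import json
+from typing import Optional
+
+
+def format_sse(data: dict, event: Optional[str] = None) -> str:
+    """Serialize one Server-Sent Events frame (hypothetical helper)."""
+    lines = []
+    if event is not None:
+        # Optional named event; clients subscribe with addEventListener(event, ...)
+        lines.append(f"event: {event}")
+    # json.dumps emits no raw newlines by default, so one data: line is enough
+    lines.append(f"data: {json.dumps(data)}")
+    # A blank line terminates the frame
+    return "\n".join(lines) + "\n\n"
+
+
+frame = format_sse({"stage": "scrolling", "reviews_loaded": 50})
+named = format_sse({"total": 244}, event="completed")
+```
+
+Keeping the serializer separate from the endpoint makes the framing testable without opening a connection.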
+
+---
+
+### 3. **Polling** (Fallback) 🔄
+
+**Best for**: Simple clients, no webhook capability
+
+```
+Client → POST /scrape
+         ↓
+Server → Returns job_id
+         ↓
+Client → Polls GET /jobs/{job_id} every 2-5 seconds
+         ↓
+Server → Returns current status
+```
+
+**Advantages**:
+- ✅ Simple to implement
+- ✅ Works everywhere (no public endpoint needed)
+- ✅ Firewall-friendly
+
+**Disadvantages**:
+- ❌ Inefficient (many wasted requests)
+- ❌ Delayed notifications (polling interval)
+- ❌ Higher server load
+
+**Use cases**:
+- Internal tools
+- Clients behind firewalls
+- Simple integrations
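+
+A polling client can soften the "many wasted requests" problem by widening its interval over time. This is a sketch under stated assumptions: `poll_job`, its parameters, and the 1.5x growth factor are illustrative choices, and `fetch_status` stands in for whatever performs `GET /jobs/{job_id}`.
+
+```python
+import time
+from typing import Callable
+
+
+def poll_job(
+    fetch_status: Callable[[], dict],
+    interval: float = 3.0,
+    max_interval: float = 30.0,
+    timeout: float = 600.0,
+    sleep: Callable[[float], None] = time.sleep,
+) -> dict:
+    """Poll until the job reaches a terminal state or the timeout elapses."""
+    waited = 0.0
+    while True:
+        job = fetch_status()
+        if job["status"] in ("completed", "failed"):
+            return job
+        if waited >= timeout:
+            raise TimeoutError("job did not finish in time")
+        sleep(interval)
+        waited += interval
+        # Grow the interval gradually, up to a ceiling
+        interval = min(interval * 1.5, max_interval)
+
+
+# Usage with canned responses instead of real HTTP calls
+responses = iter([
+    {"status": "queued"},
+    {"status": "running"},
+    {"status": "completed", "reviews_count": 244},
+])
+delays = []
+result = poll_job(lambda: next(responses), sleep=delays.append)
+```
+
+Injecting `sleep` keeps the example fast to run and makes the delay schedule easy to inspect.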
+
+---
+
+## 🏛️ Complete Production Architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                        LOAD BALANCER                         │
+│                     (nginx/AWS ALB)                          │
+└──────────┬──────────────────────────────────┬────────────────┘
+           │                                  │
+           ▼                                  ▼
+┌──────────────────────┐         ┌──────────────────────┐
+│   API Server 1       │         │   API Server 2       │
+│   (FastAPI)          │         │   (FastAPI)          │
+│   - REST endpoints   │         │   - REST endpoints   │
+│   - Health checks    │         │   - Health checks    │
+│   - Job management   │         │   - Job management   │
+└──────────┬───────────┘         └──────────┬───────────┘
+           │                                │
+           └────────────┬───────────────────┘
+                        ▼
+           ┌────────────────────────┐
+           │    REDIS / RabbitMQ    │
+           │    (Job Queue)         │
+           │                        │
+           │  - Pending jobs        │
+           │  - Job distribution    │
+           │  - Pub/Sub for events  │
+           └────────┬───────────────┘
+                    │
+                    ▼
+     ┌──────────────┴──────────────┐
+     │                             │
+     ▼                             ▼
+┌─────────────┐              ┌─────────────┐
+│  Worker 1   │              │  Worker 2   │
+│             │              │             │
+│ - Scraping  │              │ - Scraping  │
+│ - Headless  │              │ - Headless  │
+│ - Chrome    │              │ - Chrome    │
+└─────┬───────┘              └─────┬───────┘
+      │                            │
+      └────────────┬───────────────┘
+                   ▼
+    ┌──────────────────────────────┐
+    │   PERSISTENT STORAGE          │
+    │                               │
+    │  ┌────────────────────────┐   │
+    │  │  PostgreSQL / MongoDB  │   │
+    │  │  - Job metadata        │   │
+    │  │  - Status tracking     │   │
+    │  │  - Webhook configs     │   │
+    │  └────────────────────────┘   │
+    │                               │
+    │  ┌────────────────────────┐   │
+    │  │  File Storage / S3     │   │
+    │  │  - Review JSON files   │   │
+    │  │  - Large payloads      │   │
+    │  └────────────────────────┘   │
+    └───────────────────────────────┘
+                   │
+                   ▼
+         ┌─────────────────────┐
+         │  Webhook Dispatcher │
+         │  - Retry logic      │
+         │  - Dead letter queue│
+         └─────────────────────┘
+                   │
+                   ▼
+         [Client's webhook URL]
+```
+
+---
+
+## 📦 Component Breakdown
+
+### 1. **API Server** (FastAPI)
+
+**Responsibilities**:
+- Handle HTTP requests
+- Validate input
+- Enqueue jobs
+- Serve results
+- Health checks
+
+**Endpoints**:
+```
+POST   /scrape              # Submit job
+GET    /jobs/{id}           # Get job status
+GET    /jobs/{id}/reviews   # Get results
+GET    /jobs/{id}/stream    # SSE progress stream
+DELETE /jobs/{id}           # Cancel job
+GET    /health              # Health check
+GET    /metrics             # Prometheus metrics
+```
+
+---
+
+### 2. **Job Queue** (Redis or RabbitMQ)
+
+**Why needed**:
+- Decouple API from scraping workers
+- Distribute load across workers
+- Retry failed jobs
+- Handle backpressure
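+
+Backpressure is the least obvious item in this list, so here is a minimal in-process sketch of the idea. It uses the standard library `queue.Queue` as a stand-in for Redis or RabbitMQ; `MAX_PENDING` and `try_enqueue` are hypothetical names.
+
+```python
+import queue
+
+MAX_PENDING = 2  # hypothetical cap on queued jobs
+
+pending: "queue.Queue[str]" = queue.Queue(maxsize=MAX_PENDING)
+
+
+def try_enqueue(job_id: str) -> bool:
+    """Accept the job if there is room, otherwise signal the caller to back off."""
+    try:
+        pending.put_nowait(job_id)
+        return True
+    except queue.Full:
+        return False
+
+
+accepted = [try_enqueue(f"job-{i}") for i in range(3)]
+```
+
+When `try_enqueue` returns `False`, the API server could answer with HTTP 429 or 503 rather than letting the backlog grow without bound.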
+
+**Options**:
+
+**Option A: Redis** (Recommended for simpler setups)
+Fast, simple, good for most use cases:
+- In-memory queue
+- Pub/Sub for events
+- Job state storage
+- Session storage
+
+**Option B: RabbitMQ** (For complex workflows)
+More features, better for complex scenarios:
+- Guaranteed delivery
+- Advanced routing
+- Dead letter queues
+- Priority queues
+
+**Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed.
+
+---
+
+### 3. **Worker Processes** (Celery or Custom)
+
+**Responsibilities**:
+- Pull jobs from queue
+- Run scraping (headless Chrome)
+- Save results
+- Send webhooks
+- Update job status
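+
+The responsibilities above map onto a short loop. The sketch below is illustrative only: `run_worker` and its injected callables (`scrape`, `store`, `notify`) are hypothetical, and an in-memory queue replaces the real broker so the example runs on its own.
+
+```python
+import json
+import queue
+
+
+def run_worker(jobs, scrape, store, notify):
+    """Drain the queue: scrape, persist, record status, then send the webhook."""
+    statuses = {}
+    while True:
+        try:
+            job = jobs.get_nowait()
+        except queue.Empty:
+            return statuses
+        job_id = job["job_id"]
+        try:
+            reviews = scrape(job["url"])
+            store(job_id, json.dumps(reviews))
+            statuses[job_id] = "completed"
+        except Exception as exc:
+            # A failed scrape must not take the whole worker down
+            statuses[job_id] = f"failed: {exc}"
+        if job.get("webhook_url"):
+            notify(job["webhook_url"], {"job_id": job_id, "status": statuses[job_id]})
+
+
+# Usage with stubs in place of Chrome, S3 and HTTP
+jobs = queue.Queue()
+jobs.put({
+    "job_id": "a",
+    "url": "https://example.com/place",
+    "webhook_url": "https://client.example/hook",
+})
+saved, sent = {}, []
+statuses = run_worker(
+    jobs,
+    scrape=lambda url: [{"rating": 5}],
+    store=saved.__setitem__,
+    notify=lambda url, body: sent.append((url, body)),
+)
+```
+
+A production worker would block on the queue instead of returning when it is empty; the early return here only keeps the example finite.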
+
+**Scaling**:
+```bash
+# Run 4 workers on same machine
+celery -A worker worker --concurrency=4
+
+# Or 4 separate processes
+python worker.py &
+python worker.py &
+python worker.py &
+python worker.py &
+
+# Or Kubernetes deployment
+kubectl scale deployment scraper-worker --replicas=10
+```
+
+---
+
+### 4. **Database** (PostgreSQL or MongoDB)
+
+**Job Metadata Schema**:
+
+**PostgreSQL** (Recommended):
+```sql
+CREATE TABLE jobs (
+    job_id UUID PRIMARY KEY,
+    status VARCHAR(20) NOT NULL,
+    url TEXT NOT NULL,
+    webhook_url TEXT,
+    created_at TIMESTAMP NOT NULL,
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+    reviews_count INTEGER,
+    reviews_file_path TEXT,
+    error_message TEXT,
+    metadata JSONB
+);
+
+CREATE INDEX idx_jobs_status ON jobs(status);
+CREATE INDEX idx_jobs_created_at ON jobs(created_at);
+```
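+
+Because `status` is a free-form `VARCHAR`, nothing in the schema stops a job from moving backwards. One option is to guard updates in application code. The sketch below assumes a status set of `queued`, `running`, `completed`, `failed` and `cancelled`; only `queued`, `completed` and `failed` appear elsewhere in this document, so the other two are assumptions.
+
+```python
+# Assumed lifecycle; adjust to the statuses the service actually uses
+ALLOWED_TRANSITIONS = {
+    "queued": {"running", "cancelled"},
+    "running": {"completed", "failed", "cancelled"},
+    "completed": set(),
+    "failed": set(),
+    "cancelled": set(),
+}
+
+
+def can_transition(current: str, new: str) -> bool:
+    """Return True if a job may move from `current` to `new`."""
+    return new in ALLOWED_TRANSITIONS.get(current, set())
+
+
+checks = {
+    "start": can_transition("queued", "running"),
+    "finish": can_transition("running", "completed"),
+    "reopen": can_transition("completed", "running"),
+}
+```
+
+The same rule could instead live in the database as a `CHECK` constraint or trigger; the application-level version is simply easier to show here.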
+
+**Why PostgreSQL**:
+- ✅ ACID transactions
+- ✅ Good for structured data
+- ✅ SQL queries
+- ✅ Mature ecosystem
+
+**Alternative - MongoDB**:
+```javascript
+{
+  _id: ObjectId("..."),
+  job_id: "550e8400-...",
+  status: "completed",
+  url: "https://...",
+  webhook_url: "https://...",
+  created_at: ISODate("2026-01-18T..."),
+  reviews_count: 244,
+  reviews_file: "/data/reviews/550e8400.json",
+  metadata: { ... }
+}
+```
+
+**Why MongoDB**:
+- ✅ Flexible schema
+- ✅ Good for document storage
+- ✅ Built-in sharding
+
+**Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions)
+
+---
+
+### 5. **File Storage**
+
+**Options**:
+
+**Option A: Local Filesystem** (Development/Small scale)
+```
+/data/reviews/
+  ├── 550e8400-e29b-41d4-a716-446655440000.json
+  ├── 6a1f9b2c-3d4e-4f6a-8b7c-9d0e1f2a3b4c.json
+  └── ...
+```
+
+**Option B: S3 / Object Storage** (Production - RECOMMENDED)
+```
+s3://scraper-reviews-bucket/
+  ├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
+  ├── 2026/01/18/6a1f9b2c-3d4e-4f6a-8b7c-9d0e1f2a3b4c.json
+  └── ...
+```
+
+**Why S3**:
+- ✅ Unlimited storage
+- ✅ No disk management
+- ✅ High availability
+- ✅ Versioning support
+- ✅ Pre-signed URLs for direct access
+- ✅ Lifecycle policies (auto-delete old files)
+
+**Recommendation**: **S3 (or compatible)** for production
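+
+The date-partitioned layout shown for S3 can be produced by a small key builder. This is a sketch: `review_object_key` is a hypothetical helper, and partitioning by the UTC completion date is an assumption about how the prefix is chosen.
+
+```python
+from datetime import datetime, timezone
+from uuid import UUID
+
+
+def review_object_key(job_id: UUID, completed_at: datetime) -> str:
+    """Build a YYYY/MM/DD/<job_id>.json object key."""
+    # Normalize to UTC so the prefix does not depend on the worker's timezone
+    day = completed_at.astimezone(timezone.utc)
+    return f"{day:%Y/%m/%d}/{job_id}.json"
+
+
+key = review_object_key(
+    UUID("550e8400-e29b-41d4-a716-446655440000"),
+    datetime(2026, 1, 18, 10, 30, tzinfo=timezone.utc),
+)
+```
+
+Date prefixes make lifecycle rules and manual cleanup straightforward, since a whole day of results shares one prefix.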
+
+---
+
+### 6. **Webhook Dispatcher**
+
+**Features**:
+- ✅ Retry logic (exponential backoff)
+- ✅ Dead letter queue for failed webhooks
+- ✅ Webhook signatures (HMAC for security)
+- ✅ Timeout handling
+- ✅ Async delivery
+
+**Implementation**:
+```python
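+import asyncio
+import hashlib
+import hmac
+import json
+
+import httpx
+
+WEBHOOK_SECRET = b"change-me"  # Placeholder: hmac needs bytes; load the real value from config
+
+async def send_webhook(webhook_url, payload, max_retries=3):
+    # Serialize once so the signed bytes are exactly the bytes that get sent
+    body = json.dumps(payload).encode()
+    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
+    headers = {
+        "Content-Type": "application/json",
+        "X-Webhook-Signature": signature,
+    }
+
+    async with httpx.AsyncClient(timeout=10.0) as client:
+        for attempt in range(max_retries):
+            try:
+                response = await client.post(webhook_url, content=body, headers=headers)
+                if response.is_success:  # Any 2xx counts as delivered
+                    return True
+            except httpx.HTTPError:
+                pass  # Network error or timeout: fall through to the retry
+
+            # Back off after both exceptions and non-2xx responses
+            if attempt < max_retries - 1:
+                await asyncio.sleep(2 ** attempt)  # Exponential backoff
+
+    # All attempts failed: move to dead letter queue
+    await save_to_dead_letter_queue(webhook_url, payload)
+    return False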
+```
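+
+Plain exponential backoff makes every failed delivery retry on the same schedule. A common refinement, not part of the implementation above, is to cap the delay and add random jitter. The sketch below shows one way to compute such a schedule; `backoff_delays` and its parameters are hypothetical.
+
+```python
+import random
+from typing import Callable, List
+
+
+def backoff_delays(
+    max_retries: int,
+    base: float = 1.0,
+    cap: float = 60.0,
+    rng: Callable[[], float] = random.random,
+) -> List[float]:
+    """Return the sleep before each retry, using capped 'full jitter' backoff."""
+    delays = []
+    # There is one sleep between consecutive attempts, so max_retries - 1 in total
+    for attempt in range(max_retries - 1):
+        ceiling = min(cap, base * 2 ** attempt)
+        delays.append(ceiling * rng())
+    return delays
+
+
+# With the jitter pinned to its upper bound, the schedule is the plain 1s, 2s series
+upper_bound = backoff_delays(3, rng=lambda: 1.0)
+jittered = backoff_delays(5)
+```
+
+Each jittered delay falls somewhere between zero and the exponential ceiling for that attempt, which spreads retries out when many webhooks fail at once.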
+
+---
+
+## 🔥 Complete Workflow Examples
+
+### Workflow 1: **Webhooks** (Production)
+
+```python
+# 1. Client submits job with webhook
+POST /scrape
+{
+  "url": "https://maps.google.com/...",
+  "webhook_url": "https://client.com/webhook",
+  "webhook_secret": "secret123"  # For signature verification
+}
+
+Response:
+{
+  "job_id": "550e8400-...",
+  "status": "queued",
+  "estimated_time": "20s"
+}
+
+# 2. Server enqueues job
+redis.lpush("scraper:queue", job_id)
+
+# 3. Worker picks up job
+job = get_from_queue()
+reviews = fast_scrape_reviews(job.url)
+
+# 4. Save to S3
+s3.upload(f"reviews/{job_id}.json", reviews)
+
+# 5. Update database
+db.jobs.update(job_id, {
+  "status": "completed",
+  "reviews_count": 244,
+  "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
+})
+
+# 6. Send webhook to client
+POST https://client.com/webhook
+Headers:
+  X-Webhook-Signature: hmac_sha256(payload, secret)
+Body:
+{
+  "event": "job.completed",
+  "job_id": "550e8400-...",
+  "status": "completed",
+  "reviews_count": 244,
+  "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
+  "completed_at": "2026-01-18T10:30:20Z"
+}
+
+# 7. Client downloads reviews
+GET https://api.example.com/jobs/{job_id}/reviews
+# Or direct S3 pre-signed URL
+GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
+```
+
+---
+
+### Workflow 2: **SSE Streaming** (Real-time Dashboard)
+
+```python
+# 1. Client opens SSE connection
+EventSource("/jobs/{job_id}/stream")
+
+# 2. Server streams progress updates
+async def stream_progress(job_id):  # Async generator: it awaits between updates
+    while True:
+        job = get_job(job_id)
+
+        # Build the JSON first; nesting it inside the f-string fails before Python 3.12
+        payload = json.dumps({
+            'stage': job.stage,
+            'reviews_loaded': job.reviews_loaded,
+            'progress_percent': job.progress_percent
+        })
+        yield f"data: {payload}\n\n"
+
+        if job.status in ['completed', 'failed']:
+            break
+
+        await asyncio.sleep(1)  # Update every second
+
+# 3. Client receives updates
+onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress_percent": 20}
+onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress_percent": 40}
+onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress_percent": 60}
+onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress_percent": 100}
+onmessage: {"stage": "completed", "reviews_loaded": 244, "progress_percent": 100}
+```
+
+---
+
+### Workflow 3: **Polling** (Simple Clients)
+
+```python
+# 1. Submit job (no webhook)
+POST /scrape
+{
+  "url": "https://maps.google.com/..."
+}
+
+Response:
+{
+  "job_id": "550e8400-...",
+  "status": "queued"
+}
+
+# 2. Poll every 3 seconds
+while True:
+    response = GET /jobs/{job_id}
+
+    if response.status == "completed":
+        reviews = GET /jobs/{job_id}/reviews
+        break
+    elif response.status == "failed":
+        handle_error(response.error_message)
+        break
+
+    sleep(3)
+```
+
+---
+
+## 🏥 Health Checks
+
+### 1. **Basic Health Check**
+
+```python
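+from datetime import datetime, timezone
+
+@app.get("/health")
+async def health_check():
+    return {
+        "status": "healthy",
+        # datetime.utcnow() is deprecated since Python 3.12
+        "timestamp": datetime.now(timezone.utc).isoformat(),
+        "version": "1.0.0"
+    }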
+```
+
+### 2. **Detailed Health Check** (Recommended)
+
+```python
+from datetime import datetime, timezone
+
+@app.get("/health/detailed")
+async def detailed_health():
+    checks = {
+        "api": await check_api(),           # Always healthy if responding
+        "database": await check_database(), # Query DB
+        "redis": await check_redis(),       # Ping Redis
+        "s3": await check_s3(),             # List buckets
+        "workers": await check_workers(),   # Check if workers alive
+        "disk": await check_disk_space(),   # Check disk usage
+    }
+
+    overall_healthy = all(c["healthy"] for c in checks.values())
+
+    return {
+        "status": "healthy" if overall_healthy else "degraded",
+        "checks": checks,
+        # datetime.utcnow() is deprecated since Python 3.12
+        "timestamp": datetime.now(timezone.utc).isoformat()
+    }
+
+# Example response:
+{
+  "status": "healthy",
+  "checks": {
+    "api": {"healthy": true, "latency_ms": 1},
+    "database": {"healthy": true, "latency_ms": 5},
+    "redis": {"healthy": true, "latency_ms": 2},
+    "s3": {"healthy": true, "latency_ms": 50},
+    "workers": {"healthy": true, "active_workers": 4},
+    "disk": {"healthy": true, "usage_percent": 45}
+  },
+  "timestamp": "2026-01-18T10:30:00Z"
+}
+```
+
+### 3. **Readiness vs Liveness** (Kubernetes)
+
+```python
+# Liveness: Is the app alive? (restart if false)
+@app.get("/health/live")
+async def liveness():
+    # Simple check - is the server running?
+    return {"status": "alive"}
+
+# Readiness: Can the app handle traffic? (remove from load balancer if false)
+@app.get("/health/ready")
+async def readiness():
+    # Check dependencies
+    db_ok = await ping_database()
+    redis_ok = await ping_redis()
+
+    if db_ok and redis_ok:
+        return {"status": "ready"}
+    else:
+        raise HTTPException(status_code=503, detail="Not ready")
+```
+
+---
+
+## 📊 Monitoring & Metrics
+
+### Prometheus Metrics
+
+```python
+from fastapi import Response
+from prometheus_client import Counter, Gauge, Histogram, generate_latest
+
+# Counters
+jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
+webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])
+
+# Histograms
+scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
+reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')
+
+# Gauges
+active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
+queue_size = Gauge('scraper_queue_size', 'Jobs in queue')
+
+@app.get("/metrics")
+async def metrics():
+    # Prometheus scrapes this endpoint
+    return Response(generate_latest(), media_type="text/plain")
+```
+
+---
+
+## 🔐 Security
+
+### 1. **API Keys**
+
+```python
+@app.post("/scrape")
+async def scrape(
+    request: ScrapeRequest,
+    api_key: str = Header(..., alias="X-API-Key")
+):
+    if not validate_api_key(api_key):
+        raise HTTPException(status_code=401, detail="Invalid API key")
+
+    # Process request...
+```
+
+### 2. **Rate Limiting**
+
+```python
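+from fastapi import Request
+from slowapi import Limiter, _rate_limit_exceeded_handler
+from slowapi.errors import RateLimitExceeded
+from slowapi.util import get_remote_address
+
+limiter = Limiter(key_func=get_remote_address)
+app.state.limiter = limiter  # slowapi looks the limiter up on app.state
+app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)  # Responds with HTTP 429
+
+@app.post("/scrape")
+@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
+async def scrape(request: Request, body: ScrapeRequest):  # slowapi requires the `request` argument
+    # Process request...
+    ...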
+```
+
+### 3. **Webhook Signatures**
+
+```python
+import hashlib
+import hmac
+
+def verify_webhook_signature(payload, signature, secret):
+    expected = hmac.new(
+        secret.encode(),
+        payload.encode(),
+        hashlib.sha256
+    ).hexdigest()
+
+    return hmac.compare_digest(signature, expected)
+```
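+
+Signature checks only work if both sides hash exactly the same bytes. The round trip below is a minimal sketch of that property; `sign` and `verify` are hypothetical helper names, and the secret and payload are sample values.
+
+```python
+import hashlib
+import hmac
+import json
+
+
+def sign(body: bytes, secret: bytes) -> str:
+    """Hex HMAC-SHA256 of the raw request body."""
+    return hmac.new(secret, body, hashlib.sha256).hexdigest()
+
+
+def verify(body: bytes, signature: str, secret: bytes) -> bool:
+    """Constant-time comparison against the expected signature."""
+    return hmac.compare_digest(sign(body, secret), signature)
+
+
+secret = b"secret123"  # sample value only
+body = json.dumps({"event": "job.completed", "status": "completed"}).encode()
+signature = sign(body, secret)
+
+ok = verify(body, signature, secret)
+# Any change to the body, even one byte of whitespace, invalidates the signature
+tampered = verify(body + b" ", signature, secret)
+```
+
+For this reason the receiver should verify against the raw request body as received, not against a re-serialized copy of the parsed JSON.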
+
+---
+
+## 🚀 Deployment Options
+
+### Option 1: **Docker Compose** (Development)
+
+```yaml
+version: '3.8'
+services:
+  api:
+    build: .
+    ports:
+      - "8000:8000"
+    environment:
+      - REDIS_URL=redis://redis:6379
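+      # Dev-only placeholder credentials; they must match the db service below
+      - DATABASE_URL=postgresql://scraper:scraper@db:5432/scraper
+    depends_on:
+      - redis
+      - db
+
+  worker:
+    build: .
+    command: python worker.py
+    environment:
+      - REDIS_URL=redis://redis:6379
+    depends_on:
+      - redis
+    deploy:
+      replicas: 4
+
+  redis:
+    image: redis:7-alpine
+
+  db:
+    image: postgres:15-alpine
+    environment:
+      - POSTGRES_DB=scraper
+      - POSTGRES_USER=scraper
+      # The postgres image refuses to start without a password
+      - POSTGRES_PASSWORD=scraper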
+```
+
+### Option 2: **Kubernetes** (Production)
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: scraper-api
+spec:
+  replicas: 3
+  selector:
+    matchLabels:
+      app: scraper-api
+  template:
+    metadata:
+      labels:
+        app: scraper-api  # Must match spec.selector.matchLabels
+    spec:
+      containers:
+      - name: api
+        image: scraper-api:latest
+        ports:
+        - containerPort: 8000
+        env:
+        - name: REDIS_URL
+          value: redis://redis:6379
+        livenessProbe:
+          httpGet:
+            path: /health/live
+            port: 8000
+        readinessProbe:
+          httpGet:
+            path: /health/ready
+            port: 8000
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: scraper-worker
+spec:
+  replicas: 10
+  selector:
+    matchLabels:
+      app: scraper-worker
+  template:
+    metadata:
+      labels:
+        app: scraper-worker  # Must match spec.selector.matchLabels
+    spec:
+      containers:
+      - name: worker
+        image: scraper-worker:latest
+```
+
+---
+
+## 📈 Scaling Considerations
+
+### Horizontal Scaling
+
+```
+1 Worker  = 3 jobs/minute (20s per job)
+10 Workers = 30 jobs/minute
+100 Workers = 300 jobs/minute = 432,000 jobs/day
+```
+
+### Resource Requirements (per worker)
+
+```
+CPU: 1-2 cores (Chrome is CPU-intensive)
+RAM: 2-4 GB (headless Chrome + data)
+Disk: Minimal (results go to S3)
+```
+
+### Auto-scaling (Kubernetes HPA)
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: scraper-worker-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: scraper-worker
+  minReplicas: 2
+  maxReplicas: 50
+  metrics:
+  - type: External
+    external:
+      metric:
+        name: redis_queue_size
+      target:
+        type: Value
+        value: "10"  # Scale up if queue > 10
+```
+
+---
+
+## ✅ Recommended Stack
+
+### For Small-Medium (< 1000 jobs/day):
+```
+✅ FastAPI (API Server)
+✅ Redis (Queue + Cache)
+✅ PostgreSQL (Job metadata)
+✅ Local files or S3 (Reviews storage)
+✅ Webhooks (Primary)
+✅ Polling (Fallback)
+✅ Docker Compose (Deployment)
+```
+
+### For Large Scale (> 10,000 jobs/day):
+```
+✅ FastAPI (API Server)
+✅ RabbitMQ (Queue)
+✅ PostgreSQL (Job metadata)
+✅ S3 (Reviews storage)
+✅ Webhooks (Primary)
+✅ SSE (Real-time updates)
+✅ Kubernetes (Orchestration)
+✅ Prometheus + Grafana (Monitoring)
+✅ ELK Stack (Logging)
+```
+
+---
+
+## 🎯 Next Steps
+
+Candidate features to implement next:
+
+1. ✅ **Webhooks** - Full webhook support with retries
+2. ✅ **Redis Queue** - Job queue with Celery/RQ
+3. ✅ **PostgreSQL** - Job metadata storage
+4. ✅ **S3 Storage** - Reviews file storage
+5. ✅ **Health Checks** - Detailed health endpoints
+6. ✅ **SSE Streaming** - Real-time progress updates (optional)
+7. ✅ **Docker Setup** - Complete docker-compose.yml
+
+**Recommendation**: Start with **#1-5** (core production features); add #6-7 later if needed.