# Production Microservice Architecture
## Google Reviews Scraper API

---

## 🎯 Recommended Communication Patterns

### 1. **Webhooks** (Primary - RECOMMENDED) ✅

**Best for**: Production async job processing

```
Client → POST /scrape (with webhook_url)
   ↓
Server → Starts job, returns job_id
   ↓
[Scraping in progress...]
   ↓
Server → POST to client's webhook_url when complete
   {
     "job_id": "...",
     "status": "completed",
     "reviews_count": 244,
     "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
   }
```

**Advantages**:
- ✅ No polling needed (reduces server load)
- ✅ Instant notification when the job completes
- ✅ Industry standard (Stripe, GitHub, Twilio use this)
- ✅ Client can go offline and come back
- ✅ Scales to millions of jobs

**Use cases**:
- Batch processing systems
- Integration with other services
- When client has a public endpoint

---

### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡

**Best for**: Real-time progress monitoring

```
Client → GET /jobs/{job_id}/stream (keeps connection open)
   ↓
Server → Sends progress updates in real-time:

   data: {"stage": "scrolling", "reviews_loaded": 50}

   data: {"stage": "scrolling", "reviews_loaded": 100}

   data: {"stage": "extracting", "reviews_loaded": 244}

   data: {"stage": "completed", "total": 244}
```

**Advantages**:
- ✅ Real-time progress updates
- ✅ HTTP-based (works through firewalls)
- ✅ Lightweight (one-way communication)
- ✅ Auto-reconnection support
- ✅ Great for dashboards/UIs

**Use cases**:
- Web dashboards
- Real-time monitoring
- Progress bars in UI

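
On the client side, each `data:` line in the stream above carries one JSON document. A minimal sketch of decoding such a stream in Python, assuming one single-line `data:` field per event as in the example. `parse_sse_events` is a hypothetical helper name; a real client would normally rely on the browser's `EventSource` or an SSE library, which also handle multi-line data fields and reconnection:

```python
import json


def parse_sse_events(raw_stream: str) -> list[dict]:
    """Decode the JSON payload of every `data:` line in an SSE text stream."""
    events = []
    for line in raw_stream.splitlines():
        if line.startswith("data:"):
            # Everything after the "data:" prefix is the JSON payload
            events.append(json.loads(line[len("data:"):].strip()))
    return events


sample = (
    'data: {"stage": "scrolling", "reviews_loaded": 50}\n\n'
    'data: {"stage": "completed", "total": 244}\n\n'
)
events = parse_sse_events(sample)
print(events[-1]["stage"])  # completed
```
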
---

### 3. **Polling** (Fallback) 🔄

**Best for**: Simple clients, no webhook capability

```
Client → POST /scrape
   ↓
Server → Returns job_id
   ↓
Client → Polls GET /jobs/{job_id} every 2-5 seconds
   ↓
Server → Returns current status
```

**Advantages**:
- ✅ Simple to implement
- ✅ Works everywhere (no public endpoint needed)
- ✅ Firewall-friendly

**Disadvantages**:
- ❌ Inefficient (many wasted requests)
- ❌ Delayed notifications (polling interval)
- ❌ Higher server load

**Use cases**:
- Internal tools
- Clients behind firewalls
- Simple integrations

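
The polling loop above can be sketched as a small helper with a timeout, so a stuck job does not keep a client polling forever. This is an illustration only: `poll_until_done`, its parameters, and the injected `fetch_status` callable (standing in for `GET /jobs/{job_id}`) are assumptions, not part of the API described here:

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reaches a terminal state or times out."""
    waited = 0.0
    while True:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job['status']} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Usage with a fake status source standing in for GET /jobs/{job_id}
responses = iter([{"status": "queued"}, {"status": "running"}, {"status": "completed"}])
final = poll_until_done(lambda: next(responses), sleep=lambda _: None)
print(final["status"])  # completed
```
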
---

## 🏛️ Complete Production Architecture

```
┌────────────────────────────────────────────────────────────┐
│                       LOAD BALANCER                        │
│                      (nginx/AWS ALB)                       │
└──────────┬──────────────────────────────────────┬──────────┘
           │                                      │
           ▼                                      ▼
┌─────────────────────┐                ┌─────────────────────┐
│   API Server 1      │                │   API Server 2      │
│   (FastAPI)         │                │   (FastAPI)         │
│   - REST endpoints  │                │   - REST endpoints  │
│   - Health checks   │                │   - Health checks   │
│   - Job management  │                │   - Job management  │
└──────────┬──────────┘                └──────────┬──────────┘
           │                                      │
           └──────────────────┬───────────────────┘
                              ▼
                 ┌─────────────────────────┐
                 │   REDIS / RabbitMQ      │
                 │   (Job Queue)           │
                 │                         │
                 │   - Pending jobs        │
                 │   - Job distribution    │
                 │   - Pub/Sub for events  │
                 └────────────┬────────────┘
                              │
                              ▼
               ┌──────────────┴──────────────┐
               │                             │
               ▼                             ▼
        ┌─────────────┐               ┌─────────────┐
        │  Worker 1   │               │  Worker 2   │
        │             │               │             │
        │  - Scraping │               │  - Scraping │
        │  - Headless │               │  - Headless │
        │  - Chrome   │               │  - Chrome   │
        └──────┬──────┘               └──────┬──────┘
               │                             │
               └──────────────┬──────────────┘
                              ▼
              ┌───────────────────────────────┐
              │   PERSISTENT STORAGE          │
              │                               │
              │  ┌─────────────────────────┐  │
              │  │ PostgreSQL / MongoDB    │  │
              │  │ - Job metadata          │  │
              │  │ - Status tracking       │  │
              │  │ - Webhook configs       │  │
              │  └─────────────────────────┘  │
              │                               │
              │  ┌─────────────────────────┐  │
              │  │ File Storage / S3       │  │
              │  │ - Review JSON files     │  │
              │  │ - Large payloads        │  │
              │  └─────────────────────────┘  │
              └───────────────────────────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │ Webhook Dispatcher  │
                   │ - Retry logic       │
                   │ - Dead letter queue │
                   └─────────────────────┘
                              │
                              ▼
                   [Client's webhook URL]
```

---

## 📦 Component Breakdown

### 1. **API Server** (FastAPI)

**Responsibilities**:
- Handle HTTP requests
- Validate input
- Enqueue jobs
- Serve results
- Health checks

**Endpoints**:
```
POST   /scrape              # Submit job
GET    /jobs/{id}           # Get job status
GET    /jobs/{id}/reviews   # Get results
GET    /jobs/{id}/stream    # SSE progress stream
DELETE /jobs/{id}           # Cancel job
GET    /health              # Health check
GET    /metrics             # Prometheus metrics
```

---

### 2. **Job Queue** (Redis or RabbitMQ)

**Why needed**:
- Decouple API from scraping workers
- Distribute load across workers
- Retry failed jobs
- Handle backpressure

**Options**:

**Option A: Redis** (Recommended for simpler setups)

Fast, simple, good for most use cases:
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage

**Option B: RabbitMQ** (For complex workflows)

More features, better for complex scenarios:
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues

**Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed.

---

### 3. **Worker Processes** (Celery or Custom)

**Responsibilities**:
- Pull jobs from queue
- Run scraping (headless Chrome)
- Save results
- Send webhooks
- Update job status

**Scaling**:
```bash
# Run 4 workers on same machine
celery -A worker worker --concurrency=4

# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &

# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10
```

---

### 4. **Database** (PostgreSQL or MongoDB)

**Job Metadata Schema**:

**PostgreSQL** (Recommended):
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_file_path TEXT,
    error_message TEXT,
    metadata JSONB
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
```

**Why PostgreSQL**:
- ✅ ACID transactions
- ✅ Good for structured data
- ✅ SQL queries
- ✅ Mature ecosystem

**Alternative - MongoDB**:
```javascript
{
  _id: ObjectId("..."),
  job_id: "550e8400-...",
  status: "completed",
  url: "https://...",
  webhook_url: "https://...",
  created_at: ISODate("2026-01-18T..."),
  reviews_count: 244,
  reviews_file: "/data/reviews/550e8400.json",
  metadata: { ... }
}
```

**Why MongoDB**:
- ✅ Flexible schema
- ✅ Good for document storage
- ✅ Built-in sharding

**Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions)

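
To illustrate why transactional status updates matter for job tracking, here is a minimal sketch of a guarded status transition. It uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL so the example runs without a server; the reduced column set and the helper names `create_job` and `complete_job` are assumptions for illustration:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE jobs (
        job_id TEXT PRIMARY KEY,
        status TEXT NOT NULL,
        url TEXT NOT NULL,
        created_at TEXT NOT NULL,
        completed_at TEXT,
        reviews_count INTEGER
    )"""
)


def now() -> str:
    return datetime.now(timezone.utc).isoformat()


def create_job(job_id: str, url: str) -> None:
    conn.execute(
        "INSERT INTO jobs (job_id, status, url, created_at) VALUES (?, 'queued', ?, ?)",
        (job_id, url, now()),
    )


def complete_job(job_id: str, reviews_count: int) -> bool:
    """Mark a job completed; returns False if it was already in a terminal state."""
    cur = conn.execute(
        "UPDATE jobs SET status = 'completed', completed_at = ?, reviews_count = ? "
        "WHERE job_id = ? AND status IN ('queued', 'running')",
        (now(), reviews_count, job_id),
    )
    return cur.rowcount == 1


create_job("550e8400", "https://maps.google.com/...")
print(complete_job("550e8400", 244))  # True
print(complete_job("550e8400", 244))  # False: already completed
```

The `WHERE ... AND status IN (...)` guard makes the update safe to repeat, so a worker that retries after a crash cannot overwrite a finished job.
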
---

### 5. **File Storage**

**Options**:

**Option A: Local Filesystem** (Development/Small scale)
```
/data/reviews/
├── 550e8400-e29b-41d4-a716-446655440000.json
├── 6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
└── ...
```

**Option B: S3 / Object Storage** (Production - RECOMMENDED)
```
s3://scraper-reviews-bucket/
├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
├── 2026/01/18/6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
└── ...
```

**Why S3**:
- ✅ Effectively unlimited storage
- ✅ No disk management
- ✅ High availability
- ✅ Versioning support
- ✅ Pre-signed URLs for direct access
- ✅ Lifecycle policies (auto-delete old files)

**Recommendation**: **S3 (or compatible)** for production

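
The date-partitioned layout shown for Option B can be produced by a small key builder. This is a sketch; `review_object_key` is a hypothetical helper name, and the `YYYY/MM/DD/<job_id>.json` layout is taken from the listing above:

```python
from datetime import date


def review_object_key(job_id: str, day: date) -> str:
    """Build a date-partitioned key such as 2026/01/18/<job_id>.json."""
    return f"{day:%Y/%m/%d}/{job_id}.json"


print(review_object_key("550e8400-e29b-41d4-a716-446655440000", date(2026, 1, 18)))
# 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
```

Prefixing keys by date keeps lifecycle rules and manual cleanups simple, because a whole day of results shares one prefix.
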
---

### 6. **Webhook Dispatcher**

**Features**:
- ✅ Retry logic (exponential backoff)
- ✅ Dead letter queue for failed webhooks
- ✅ Webhook signatures (HMAC for security)
- ✅ Timeout handling
- ✅ Async delivery

**Implementation**:
```python
import asyncio
import hashlib
import hmac
import json

import httpx

WEBHOOK_SECRET = b"change-me"  # Placeholder; load the real secret from configuration


async def send_webhook(webhook_url, payload, max_retries=3):
    # Serialize once so the signature covers the exact bytes that are sent
    body = json.dumps(payload).encode()
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,
    }

    for attempt in range(max_retries):
        try:
            # Send with timeout
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    webhook_url, content=body, headers=headers, timeout=10.0
                )
            if response.is_success:  # Any 2xx counts as delivered
                return True
        except httpx.HTTPError:
            pass  # Network error or timeout: fall through to the retry logic

        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    # All attempts failed: move to dead letter queue (helper defined elsewhere)
    await save_to_dead_letter_queue(webhook_url, payload)
    return False
```

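
For reference, the retry schedule implied by `2 ** attempt` can be computed on its own. The sketch below is illustrative; `backoff_delays` and its `base` and `cap` parameters are assumptions (the dispatcher above has no cap), added to show how the schedule grows with more retries:

```python
def backoff_delays(max_retries: int = 3, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays slept between attempts: base * 2**attempt, capped, no sleep after the last try."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays(3))  # [1.0, 2.0]
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the default of three attempts, a failing webhook is retried after 1 second and again after 2 seconds before it is moved to the dead letter queue.
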
---

## 🔥 Complete Workflow Examples

### Workflow 1: **Webhooks** (Production)

```python
# 1. Client submits job with webhook
POST /scrape
{
  "url": "https://maps.google.com/...",
  "webhook_url": "https://client.com/webhook",
  "webhook_secret": "secret123"  # For signature verification
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued",
  "estimated_time": "20s"
}

# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)

# 3. Worker picks up job
job_id = get_from_queue()
reviews = fast_scrape_reviews(url)

# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)

# 5. Update database
db.jobs.update(job_id, {
    "status": "completed",
    "reviews_count": 244,
    "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
})

# 6. Send webhook to client
POST https://client.com/webhook
Headers:
  X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
  "event": "job.completed",
  "job_id": "550e8400-...",
  "status": "completed",
  "reviews_count": 244,
  "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
  "completed_at": "2026-01-18T10:30:20Z"
}

# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
```

---

### Workflow 2: **SSE Streaming** (Real-time Dashboard)

```python
# 1. Client opens SSE connection (browser JavaScript):
#    new EventSource("/jobs/{job_id}/stream")

# 2. Server streams progress updates
async def stream_progress(job_id):
    while True:
        job = get_job(job_id)

        # Build the payload first; each SSE event ends with a blank line
        payload = json.dumps({
            'stage': job.stage,
            'reviews_loaded': job.reviews_loaded,
            'progress_percent': job.progress_percent
        })
        yield f"data: {payload}\n\n"

        if job.status in ['completed', 'failed']:
            break

        await asyncio.sleep(1)  # Update every second

# 3. Client receives updates
# onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress_percent": 20}
# onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress_percent": 40}
# onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress_percent": 60}
# onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress_percent": 100}
# onmessage: {"stage": "completed", "total": 244}
```

---

### Workflow 3: **Polling** (Simple Clients)

```python
# 1. Submit job (no webhook)
POST /scrape
{
  "url": "https://maps.google.com/..."
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued"
}

# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}

    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break

    sleep(3)
```

---

## 🏥 Health Checks

### 1. **Basic Health Check**

```python
from datetime import datetime, timezone

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "1.0.0"
    }
```

### 2. **Detailed Health Check** (Recommended)

```python
from datetime import datetime, timezone

@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),            # Always healthy if responding
        "database": await check_database(),  # Query DB
        "redis": await check_redis(),        # Ping Redis
        "s3": await check_s3(),              # List buckets
        "workers": await check_workers(),    # Check if workers alive
        "disk": await check_disk_space(),    # Check disk usage
    }

    overall_healthy = all(c["healthy"] for c in checks.values())

    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }

# Example response (JSON):
# {
#   "status": "healthy",
#   "checks": {
#     "api": {"healthy": true, "latency_ms": 1},
#     "database": {"healthy": true, "latency_ms": 5},
#     "redis": {"healthy": true, "latency_ms": 2},
#     "s3": {"healthy": true, "latency_ms": 50},
#     "workers": {"healthy": true, "active_workers": 4},
#     "disk": {"healthy": true, "usage_percent": 45}
#   },
#   "timestamp": "2026-01-18T10:30:00Z"
# }
```

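
The checks above run one after another, so the endpoint is as slow as the sum of all dependencies. A possible variation, sketched here under assumptions, runs them concurrently with `asyncio.gather` and treats an exception in any check as "unhealthy" rather than a failed request. `timed_check`, `run_checks`, and the fake `ok`/`broken` checks are hypothetical names used only for this illustration:

```python
import asyncio
import time


async def timed_check(check) -> dict:
    """Run one async check and report health plus latency; exceptions count as unhealthy."""
    start = time.perf_counter()
    try:
        healthy = bool(await check())
    except Exception as exc:  # A failing dependency must not crash the endpoint
        return {"healthy": False, "error": str(exc)}
    return {"healthy": healthy, "latency_ms": round((time.perf_counter() - start) * 1000)}


async def run_checks(checks: dict) -> dict:
    names = list(checks)
    # gather() preserves argument order, so results line up with names
    results = await asyncio.gather(*(timed_check(checks[n]) for n in names))
    status = "healthy" if all(r["healthy"] for r in results) else "degraded"
    return {"status": status, "checks": dict(zip(names, results))}


# Fake checks standing in for real database / Redis pings
async def ok():
    return True


async def broken():
    raise ConnectionError("redis unreachable")


report = asyncio.run(run_checks({"database": ok, "redis": broken}))
print(report["status"])  # degraded
```
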
### 3. **Readiness vs Liveness** (Kubernetes)

```python
# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()

    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")
```

---

## 📊 Monitoring & Metrics

### Prometheus Metrics

```python
from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest
)

# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])

# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')

# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

---

## 🔐 Security

### 1. **API Keys**

```python
@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Process request...
```

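### 2. **Rate Limiting**

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter  # slowapi looks the limiter up on app.state
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
async def scrape(request: Request, body: ScrapeRequest):
    ...  # Process request
```
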
### 3. **Webhook Signatures**

```python
import hashlib
import hmac

def verify_webhook_signature(payload, signature, secret):
    # payload must be the raw request body string, exactly as received
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(signature, expected)
```

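
A short round trip shows how signing and verification fit together and why the receiver must verify the raw body it received. This is a sketch that works on bytes; `sign_payload` and `is_valid_signature` are hypothetical helper names, and `secret123` is the sample secret from the workflow example:

```python
import hashlib
import hmac
import json


def sign_payload(body: bytes, secret: str) -> str:
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def is_valid_signature(body: bytes, signature: str, secret: str) -> bool:
    # compare_digest runs in constant time, which avoids timing side channels
    return hmac.compare_digest(sign_payload(body, secret), signature)


body = json.dumps({"event": "job.completed", "reviews_count": 244}).encode()
signature = sign_payload(body, "secret123")

print(is_valid_signature(body, signature, "secret123"))         # True
print(is_valid_signature(body + b" ", signature, "secret123"))  # False: body was altered
```

Even a single added space invalidates the signature, so re-serializing the parsed JSON before verifying is likely to fail.
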
---

## 🚀 Deployment Options

### Option 1: **Docker Compose** (Development)

```yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:change-me@db:5432/scraper
    depends_on:
      - redis
      - db

  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 4

  redis:
    image: redis:7-alpine

  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
      - POSTGRES_PASSWORD=change-me  # Placeholder; the postgres image will not start without a password
```

### Option 2: **Kubernetes** (Production)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    metadata:
      labels:
        app: scraper-api  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: scraper-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: REDIS_URL
              value: redis://redis:6379
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: worker
          image: scraper-worker:latest
```

---

## 📈 Scaling Considerations

### Horizontal Scaling

```
1 Worker = 3 jobs/minute (20s per job)
10 Workers = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
```

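
The figures above follow from simple arithmetic and can be reproduced, or inverted to size a worker pool. A sketch, assuming the 20 seconds per job used above and workers that are busy around the clock; `jobs_per_day` and `workers_needed` are hypothetical helper names:

```python
import math


def jobs_per_day(workers: int, seconds_per_job: float = 20.0) -> int:
    """Upper bound assuming every worker is busy around the clock."""
    return int(workers * 86_400 / seconds_per_job)


def workers_needed(target_jobs_per_day: int, seconds_per_job: float = 20.0) -> int:
    """Smallest worker count whose daily capacity covers the target."""
    return math.ceil(target_jobs_per_day * seconds_per_job / 86_400)


print(jobs_per_day(100))       # 432000
print(workers_needed(10_000))  # 3
```

Real capacity will be lower, since queue idle time, retries, and longer scrapes for places with many reviews all reduce throughput.
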
### Resource Requirements (per worker)

```
CPU: 1-2 cores (Chrome is CPU-intensive)
RAM: 2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)
```

### Auto-scaling (Kubernetes HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
        target:
          type: Value
          value: "10"  # Scale up if queue > 10
```

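
To get a feel for how this configuration reacts to queue depth, the core HPA rule `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetValue)` can be evaluated by hand. The sketch below is a simplification: `desired_replicas` is a hypothetical helper, and it ignores the tolerance band and stabilization windows that the real controller applies:

```python
import math


def desired_replicas(current: int, metric_value: float, target: float,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """Core HPA rule: ceil(current * metric / target), clamped to the configured bounds."""
    desired = math.ceil(current * metric_value / target)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(current=4, metric_value=35, target=10))   # 14
print(desired_replicas(current=4, metric_value=0, target=10))    # 2 (minReplicas)
print(desired_replicas(current=40, metric_value=50, target=10))  # 50 (maxReplicas)
```
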
---

## ✅ Recommended Stack

### For Small-Medium (< 1000 jobs/day):
```
✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)
```

### For Large Scale (> 10,000 jobs/day):
```
✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)
```

---

## 🎯 Next Steps

Would you like me to implement:

1. ✅ **Webhooks** - Full webhook support with retries
2. ✅ **Redis Queue** - Job queue with Celery/RQ
3. ✅ **PostgreSQL** - Job metadata storage
4. ✅ **S3 Storage** - Reviews file storage
5. ✅ **Health Checks** - Detailed health endpoints
6. ✅ **SSE Streaming** - Real-time progress updates (optional)
7. ✅ **Docker Setup** - Complete docker-compose.yml

**My recommendation**: Start with **#1-5** (core production features), add #6-7 later if needed.

Let me know which to implement first!