Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
833
MICROSERVICE_ARCHITECTURE.md
Normal file
@@ -0,0 +1,833 @@
# Production Microservice Architecture
## Google Reviews Scraper API

---

## 🎯 Recommended Communication Patterns

### 1. **Webhooks** (Primary - RECOMMENDED) ✅

**Best for**: Production async job processing

```
Client → POST /scrape (with webhook_url)
    ↓
Server → Starts job, returns job_id
    ↓
[Scraping in progress...]
    ↓
Server → POST to client's webhook_url when complete
    {
      "job_id": "...",
      "status": "completed",
      "reviews_count": 244,
      "reviews_url": "https://api.example.com/jobs/{job_id}/reviews"
    }
```

**Advantages**:
- ✅ No polling needed (reduces server load)
- ✅ Instant notifications when job completes
- ✅ Industry standard (Stripe, GitHub, Twilio use this)
- ✅ Client can go offline and come back
- ✅ Scales to millions of jobs

**Use cases**:
- Batch processing systems
- Integration with other services
- When client has a public endpoint

---

### 2. **Server-Sent Events (SSE)** (Real-time Updates) ⚡

**Best for**: Real-time progress monitoring

```
Client → GET /jobs/{job_id}/stream (keeps connection open)
    ↓
Server → Sends progress updates in real-time:

data: {"stage": "scrolling", "reviews_loaded": 50}

data: {"stage": "scrolling", "reviews_loaded": 100}

data: {"stage": "extracting", "reviews_loaded": 244}

data: {"stage": "completed", "total": 244}
```

**Advantages**:
- ✅ Real-time progress updates
- ✅ HTTP-based (works through firewalls)
- ✅ Lightweight (one-way communication)
- ✅ Auto-reconnection support
- ✅ Great for dashboards/UIs

**Use cases**:
- Web dashboards
- Real-time monitoring
- Progress bars in UI
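
The `data:` frames shown above can also be consumed outside a browser. The following is a minimal sketch, assuming the stream is already available as an iterable of text lines (opening the HTTP connection is left out). `parse_sse_events` is a hypothetical helper name, and the sketch only handles `data:` fields, ignoring comments and other SSE fields such as `event:` or `id:`.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON object per SSE event.

    An event is one or more `data:` lines terminated by a blank line.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # A single leading space after the colon is not part of the value
            buffer.append(line[5:].removeprefix(" "))
        elif line == "" and buffer:
            # Blank line: the event is complete, so decode and emit it
            yield json.loads("\n".join(buffer))
            buffer = []


stream = [
    'data: {"stage": "scrolling", "reviews_loaded": 50}\n',
    "\n",
    'data: {"stage": "completed", "total": 244}\n',
    "\n",
]
events = list(parse_sse_events(stream))
print(events[-1]["stage"])  # completed
```

A client would typically stop reading once it sees a `completed` or `failed` stage, since the server closes the stream at that point.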

---

### 3. **Polling** (Fallback) 🔄

**Best for**: Simple clients, no webhook capability

```
Client → POST /scrape
    ↓
Server → Returns job_id
    ↓
Client → Polls GET /jobs/{job_id} every 2-5 seconds
    ↓
Server → Returns current status
```

**Advantages**:
- ✅ Simple to implement
- ✅ Works everywhere (no public endpoint needed)
- ✅ Firewall-friendly

**Disadvantages**:
- ❌ Inefficient (many wasted requests)
- ❌ Delayed notifications (polling interval)
- ❌ Higher server load

**Use cases**:
- Internal tools
- Clients behind firewalls
- Simple integrations

---

## 🏛️ Complete Production Architecture

```
┌───────────────────────────────────────────────────────────┐
│                       LOAD BALANCER                       │
│                      (nginx/AWS ALB)                      │
└──────────┬────────────────────────────────────┬───────────┘
           │                                    │
           ▼                                    ▼
┌──────────────────────┐             ┌──────────────────────┐
│  API Server 1        │             │  API Server 2        │
│  (FastAPI)           │             │  (FastAPI)           │
│  - REST endpoints    │             │  - REST endpoints    │
│  - Health checks     │             │  - Health checks     │
│  - Job management    │             │  - Job management    │
└──────────┬───────────┘             └──────────┬───────────┘
           │                                    │
           └──────────────────┬─────────────────┘
                              ▼
                  ┌────────────────────────┐
                  │   REDIS / RabbitMQ     │
                  │     (Job Queue)        │
                  │                        │
                  │  - Pending jobs        │
                  │  - Job distribution    │
                  │  - Pub/Sub for events  │
                  └───────────┬────────────┘
                              │
               ┌──────────────┴──────────────┐
               │                             │
               ▼                             ▼
        ┌─────────────┐               ┌─────────────┐
        │  Worker 1   │               │  Worker 2   │
        │             │               │             │
        │  - Scraping │               │  - Scraping │
        │  - Headless │               │  - Headless │
        │    Chrome   │               │    Chrome   │
        └──────┬──────┘               └──────┬──────┘
               │                             │
               └──────────────┬──────────────┘
                              ▼
              ┌──────────────────────────────┐
              │      PERSISTENT STORAGE      │
              │                              │
              │  ┌────────────────────────┐  │
              │  │  PostgreSQL / MongoDB  │  │
              │  │  - Job metadata        │  │
              │  │  - Status tracking     │  │
              │  │  - Webhook configs     │  │
              │  └────────────────────────┘  │
              │                              │
              │  ┌────────────────────────┐  │
              │  │  File Storage / S3     │  │
              │  │  - Review JSON files   │  │
              │  │  - Large payloads      │  │
              │  └────────────────────────┘  │
              └───────────────┬──────────────┘
                              │
                              ▼
                  ┌────────────────────────┐
                  │  Webhook Dispatcher    │
                  │  - Retry logic         │
                  │  - Dead letter queue   │
                  └───────────┬────────────┘
                              │
                              ▼
                   [Client's webhook URL]
```

---

## 📦 Component Breakdown

### 1. **API Server** (FastAPI)

**Responsibilities**:
- Handle HTTP requests
- Validate input
- Enqueue jobs
- Serve results
- Health checks

**Endpoints**:
```
POST   /scrape              # Submit job
GET    /jobs/{id}           # Get job status
GET    /jobs/{id}/reviews   # Get results
GET    /jobs/{id}/stream    # SSE progress stream
DELETE /jobs/{id}           # Cancel job
GET    /health              # Health check
GET    /metrics             # Prometheus metrics
```

---

### 2. **Job Queue** (Redis or RabbitMQ)

**Why needed**:
- Decouple API from scraping workers
- Distribute load across workers
- Retry failed jobs
- Handle backpressure
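
The decoupling and backpressure points above can be illustrated without any broker. This is a minimal sketch that uses the standard library's `queue.Queue` as a stand-in for Redis or RabbitMQ; `run_jobs`, `worker_count`, and `max_pending` are hypothetical names, and appending to a list stands in for the actual scrape.

```python
import queue
import threading


def run_jobs(job_ids, worker_count=2, max_pending=4):
    """Process job_ids through a bounded in-memory queue and worker threads."""
    # put() blocks once max_pending jobs are waiting: that is the backpressure
    pending = queue.Queue(maxsize=max_pending)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            job_id = pending.get()
            if job_id is None:  # sentinel: no more work for this worker
                return
            with lock:
                done.append(job_id)  # stand-in for running the scrape

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for job_id in job_ids:
        pending.put(job_id)  # the producer (API side) waits here if workers fall behind
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(done)


print(run_jobs([f"job-{i}" for i in range(10)]))
```

With a real broker the producer and the workers run in separate processes, but the shape is the same: the API only enqueues, and the queue bound decides how much unfinished work may pile up.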

**Options**:

**Option A: Redis** (Recommended for simpler setups)

Fast, simple, good for most use cases:
- In-memory queue
- Pub/Sub for events
- Job state storage
- Session storage

**Option B: RabbitMQ** (For complex workflows)

More features, better for complex scenarios:
- Guaranteed delivery
- Advanced routing
- Dead letter queues
- Priority queues

**Recommendation**: Start with **Redis**, upgrade to RabbitMQ if needed.

---

### 3. **Worker Processes** (Celery or Custom)

**Responsibilities**:
- Pull jobs from queue
- Run scraping (headless Chrome)
- Save results
- Send webhooks
- Update job status

**Scaling**:
```bash
# Run 4 workers on same machine
celery -A worker worker --concurrency=4

# Or 4 separate processes
python worker.py &
python worker.py &
python worker.py &
python worker.py &

# Or Kubernetes deployment
kubectl scale deployment scraper-worker --replicas=10
```

---

### 4. **Database** (PostgreSQL or MongoDB)

**Job Metadata Schema**:

**PostgreSQL** (Recommended):
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_file_path TEXT,
    error_message TEXT,
    metadata JSONB
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at);
```

**Why PostgreSQL**:
- ✅ ACID transactions
- ✅ Good for structured data
- ✅ SQL queries
- ✅ Mature ecosystem
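
To show how job metadata moves through such a table, here is a minimal sketch that uses the standard library's `sqlite3` as a stand-in for PostgreSQL, with a reduced version of the schema above. The helpers `create_job` and `complete_job` are hypothetical names, not part of the project.

```python
import sqlite3
import uuid
from datetime import datetime, timezone


def now() -> str:
    return datetime.now(timezone.utc).isoformat()


conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute(
    """
    CREATE TABLE jobs (
        job_id TEXT PRIMARY KEY,
        status TEXT NOT NULL,
        url TEXT NOT NULL,
        created_at TEXT NOT NULL,
        completed_at TEXT,
        reviews_count INTEGER
    )
    """
)


def create_job(url: str) -> str:
    job_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO jobs (job_id, status, url, created_at) VALUES (?, 'queued', ?, ?)",
            (job_id, url, now()),
        )
    return job_id


def complete_job(job_id: str, reviews_count: int) -> None:
    with conn:
        conn.execute(
            "UPDATE jobs SET status = 'completed', completed_at = ?, reviews_count = ? "
            "WHERE job_id = ?",
            (now(), reviews_count, job_id),
        )


job_id = create_job("https://maps.google.com/...")
complete_job(job_id, 244)
print(conn.execute("SELECT status, reviews_count FROM jobs").fetchone())
```

Each status change is a single transaction, so a worker crash between steps leaves the row in its previous, consistent state.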

**Alternative - MongoDB**:
```javascript
{
  _id: ObjectId("..."),
  job_id: "550e8400-...",
  status: "completed",
  url: "https://...",
  webhook_url: "https://...",
  created_at: ISODate("2026-01-18T..."),
  reviews_count: 244,
  reviews_file: "/data/reviews/550e8400.json",
  metadata: { ... }
}
```

**Why MongoDB**:
- ✅ Flexible schema
- ✅ Good for document storage
- ✅ Built-in sharding

**Recommendation**: **PostgreSQL** for most cases (better for job queues and transactions)

---

### 5. **File Storage**

**Options**:

**Option A: Local Filesystem** (Development/Small scale)
```
/data/reviews/
├── 550e8400-e29b-41d4-a716-446655440000.json
├── 6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
└── ...
```

**Option B: S3 / Object Storage** (Production - RECOMMENDED)
```
s3://scraper-reviews-bucket/
├── 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
├── 2026/01/18/6a1f9b2c-3d4e-5f6g-7h8i-9j0k1l2m3n4o.json
└── ...
```

**Why S3**:
- ✅ Unlimited storage
- ✅ No disk management
- ✅ High availability
- ✅ Versioning support
- ✅ Pre-signed URLs for direct access
- ✅ Lifecycle policies (auto-delete old files)
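
The date-prefixed layout in the S3 listing above can be produced by a small key builder. This is a sketch only; `reviews_key` is a hypothetical helper, and it assumes keys are partitioned by the job's UTC creation date.

```python
from datetime import datetime, timezone


def reviews_key(job_id: str, created_at: datetime) -> str:
    """Build a date-partitioned object key such as 2026/01/18/<job_id>.json."""
    day = created_at.astimezone(timezone.utc)  # normalise so all workers agree on the date
    return f"{day:%Y/%m/%d}/{job_id}.json"


key = reviews_key(
    "550e8400-e29b-41d4-a716-446655440000",
    datetime(2026, 1, 18, 10, 30, tzinfo=timezone.utc),
)
print(key)  # 2026/01/18/550e8400-e29b-41d4-a716-446655440000.json
```

Grouping objects under a date prefix keeps one day's output listable with a single prefix query.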

**Recommendation**: **S3 (or compatible)** for production

---

### 6. **Webhook Dispatcher**

**Features**:
- ✅ Retry logic (exponential backoff)
- ✅ Dead letter queue for failed webhooks
- ✅ Webhook signatures (HMAC for security)
- ✅ Timeout handling
- ✅ Async delivery

**Implementation**:
```python
import asyncio
import hashlib
import hmac
import json

import httpx


# WEBHOOK_SECRET must be bytes (for example, loaded from the environment)
async def send_webhook(webhook_url, payload, max_retries=3):
    # Serialize once so the signature covers the exact bytes that are sent
    body = json.dumps(payload).encode()
    signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,
    }

    for attempt in range(max_retries):
        try:
            # Send with timeout
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    webhook_url, content=body, headers=headers, timeout=10.0
                )

            if response.is_success:  # any 2xx counts as delivered
                return True
        except httpx.HTTPError:
            pass  # network error or timeout: fall through to the retry

        if attempt < max_retries - 1:
            await asyncio.sleep(2 ** attempt)  # Exponential backoff

    # All attempts failed: move to dead letter queue
    await save_to_dead_letter_queue(webhook_url, payload)
    return False
```
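
The retry loop above sleeps `2 ** attempt` seconds between attempts. The sketch below makes that schedule explicit; `backoff_delays` is a hypothetical helper, and the `cap` parameter is an added assumption to keep delays bounded when `max_retries` is raised.

```python
def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays slept between attempts: base * 2**attempt, capped; none after the last try."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays(3))  # [1.0, 2.0]
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the default `max_retries=3`, a webhook is therefore given up on after roughly three seconds of waiting plus the request timeouts, which is short for a receiver that is being redeployed; a production dispatcher would likely use more attempts.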

---

## 🔥 Complete Workflow Examples

### Workflow 1: **Webhooks** (Production)

```python
# 1. Client submits job with webhook
POST /scrape
{
  "url": "https://maps.google.com/...",
  "webhook_url": "https://client.com/webhook",
  "webhook_secret": "secret123"  # For signature verification
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued",
  "estimated_time": "20s"
}

# 2. Server enqueues job
redis.lpush("scraper:queue", job_id)

# 3. Worker picks up job
job = get_from_queue()
reviews = fast_scrape_reviews(job.url)

# 4. Save to S3
s3.upload(f"reviews/{job_id}.json", reviews)

# 5. Update database
db.jobs.update(job_id, {
    "status": "completed",
    "reviews_count": 244,
    "reviews_url": f"https://api.example.com/jobs/{job_id}/reviews"
})

# 6. Send webhook to client
POST https://client.com/webhook
Headers:
  X-Webhook-Signature: hmac_sha256(payload, secret)
Body:
{
  "event": "job.completed",
  "job_id": "550e8400-...",
  "status": "completed",
  "reviews_count": 244,
  "reviews_url": "https://api.example.com/jobs/{job_id}/reviews",
  "completed_at": "2026-01-18T10:30:20Z"
}

# 7. Client downloads reviews
GET https://api.example.com/jobs/{job_id}/reviews
# Or direct S3 pre-signed URL
GET https://s3.amazonaws.com/bucket/reviews/{job_id}.json?signature=...
```

---

### Workflow 2: **SSE Streaming** (Real-time Dashboard)

```python
import asyncio
import json

# 1. Client opens SSE connection (browser JavaScript)
#    new EventSource("/jobs/{job_id}/stream")

# 2. Server streams progress updates
async def stream_progress(job_id):
    while True:
        job = get_job(job_id)

        payload = json.dumps({
            'stage': job.stage,
            'reviews_loaded': job.reviews_loaded,
            'progress_percent': job.progress_percent
        })
        yield f"data: {payload}\n\n"

        if job.status in ['completed', 'failed']:
            break

        await asyncio.sleep(1)  # Update every second

# 3. Client receives updates
# onmessage: {"stage": "scrolling", "reviews_loaded": 50, "progress_percent": 20}
# onmessage: {"stage": "scrolling", "reviews_loaded": 100, "progress_percent": 40}
# onmessage: {"stage": "scrolling", "reviews_loaded": 150, "progress_percent": 60}
# onmessage: {"stage": "extracting", "reviews_loaded": 244, "progress_percent": 100}
# onmessage: {"stage": "completed", "total": 244}
```

---

### Workflow 3: **Polling** (Simple Clients)

```python
# 1. Submit job (no webhook)
POST /scrape
{
  "url": "https://maps.google.com/..."
}

Response:
{
  "job_id": "550e8400-...",
  "status": "queued"
}

# 2. Poll every 3 seconds
while True:
    response = GET /jobs/{job_id}

    if response.status == "completed":
        reviews = GET /jobs/{job_id}/reviews
        break
    elif response.status == "failed":
        handle_error(response.error_message)
        break

    sleep(3)
```
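
The polling loop above is pseudocode. A runnable version might look like the sketch below, which assumes the HTTP call is wrapped in a caller-supplied `fetch_status` function returning the job as a dict; `wait_for_job` and its `timeout` parameter are hypothetical additions, the latter so a stuck job cannot make the client poll forever.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout}s")
        sleep(interval)


# Stubbed responses stand in for GET /jobs/{job_id}
responses = iter([{"status": "queued"}, {"status": "queued"}, {"status": "completed"}])
result = wait_for_job(lambda: next(responses), interval=0.01)
print(result["status"])  # completed
```

Passing `sleep` and `fetch_status` as parameters keeps the loop testable without a running server.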

---

## 🏥 Health Checks

### 1. **Basic Health Check**

```python
from datetime import datetime, timezone

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        # datetime.utcnow() is deprecated since Python 3.12
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": "1.0.0"
    }
```

### 2. **Detailed Health Check** (Recommended)

```python
@app.get("/health/detailed")
async def detailed_health():
    checks = {
        "api": await check_api(),            # Always healthy if responding
        "database": await check_database(),  # Query DB
        "redis": await check_redis(),        # Ping Redis
        "s3": await check_s3(),              # List buckets
        "workers": await check_workers(),    # Check if workers alive
        "disk": await check_disk_space(),    # Check disk usage
    }

    overall_healthy = all(c["healthy"] for c in checks.values())

    return {
        "status": "healthy" if overall_healthy else "degraded",
        "checks": checks,
        "timestamp": datetime.now(timezone.utc).isoformat()
    }

# Example response:
{
  "status": "healthy",
  "checks": {
    "api": {"healthy": true, "latency_ms": 1},
    "database": {"healthy": true, "latency_ms": 5},
    "redis": {"healthy": true, "latency_ms": 2},
    "s3": {"healthy": true, "latency_ms": 50},
    "workers": {"healthy": true, "active_workers": 4},
    "disk": {"healthy": true, "usage_percent": 45}
  },
  "timestamp": "2026-01-18T10:30:00Z"
}
```

### 3. **Readiness vs Liveness** (Kubernetes)

```python
# Liveness: Is the app alive? (restart if false)
@app.get("/health/live")
async def liveness():
    # Simple check - is the server running?
    return {"status": "alive"}

# Readiness: Can the app handle traffic? (remove from load balancer if false)
@app.get("/health/ready")
async def readiness():
    # Check dependencies
    db_ok = await ping_database()
    redis_ok = await ping_redis()

    if db_ok and redis_ok:
        return {"status": "ready"}
    else:
        raise HTTPException(status_code=503, detail="Not ready")
```

---

## 📊 Monitoring & Metrics

### Prometheus Metrics

```python
from fastapi import Response
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest
)

# Counters
jobs_total = Counter('scraper_jobs_total', 'Total jobs created', ['status'])
webhooks_sent = Counter('scraper_webhooks_sent_total', 'Webhooks sent', ['success'])

# Histograms
scrape_duration = Histogram('scraper_duration_seconds', 'Scraping duration')
reviews_scraped = Histogram('scraper_reviews_count', 'Reviews per job')

# Gauges
active_jobs = Gauge('scraper_active_jobs', 'Currently running jobs')
queue_size = Gauge('scraper_queue_size', 'Jobs in queue')

@app.get("/metrics")
async def metrics():
    # Prometheus scrapes this endpoint
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

---

## 🔐 Security

### 1. **API Keys**

```python
from fastapi import Header, HTTPException

@app.post("/scrape")
async def scrape(
    request: ScrapeRequest,
    api_key: str = Header(..., alias="X-API-Key")
):
    if not validate_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API key")

    # Process request...
```

### 2. **Rate Limiting**

```python
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
# Without this handler, exceeding the limit surfaces as an unhandled error
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/scrape")
@limiter.limit("10/minute")  # Max 10 jobs per minute per client IP
async def scrape(request: Request, body: ScrapeRequest):
    # Process request...
    ...
```
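
To make the meaning of a `10/minute` limit concrete, here is a small standard-library sketch of a sliding-window limiter keyed by client. It is an illustration of the idea only, not how slowapi is implemented, and `SlidingWindowLimiter` is a hypothetical class name.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `limit` calls per `window` seconds for each key."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self.hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop requests that have left the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


demo = SlidingWindowLimiter(limit=10, window=60.0)
results = [demo.allow("203.0.113.7") for _ in range(11)]
print(results.count(True), results[-1])  # 10 False
```

This keeps state in process memory, so it only works for a single API server; with several servers behind the load balancer the counters would need to live in shared storage such as Redis.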

### 3. **Webhook Signatures**

```python
import hashlib
import hmac

def verify_webhook_signature(payload, signature, secret):
    # payload: raw request body as str; secret: shared secret as str
    expected = hmac.new(
        secret.encode(),
        payload.encode(),
        hashlib.sha256
    ).hexdigest()

    # Constant-time comparison
    return hmac.compare_digest(signature, expected)
```
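
A round trip shows how the sender and receiver sides fit together. This sketch assumes the sender signs the exact request body bytes and the receiver verifies those same raw bytes before parsing any JSON; `sign` and `is_authentic` are hypothetical helper names, and `secret123` is the placeholder secret from the workflow example.

```python
import hashlib
import hmac
import json


def sign(body: bytes, secret: bytes) -> str:
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(body: bytes, signature: str, secret: bytes) -> bool:
    # compare_digest runs in constant time, so it does not leak where the mismatch is
    return hmac.compare_digest(sign(body, secret), signature)


secret = b"secret123"
body = json.dumps({"event": "job.completed", "reviews_count": 244}).encode()
header = sign(body, secret)  # value the sender puts in X-Webhook-Signature

print(is_authentic(body, header, secret))         # True
print(is_authentic(body + b" ", header, secret))  # False
```

The second call fails because even one extra byte changes the digest, which is why a receiver should verify the body as received rather than a re-serialized copy.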

---

## 🚀 Deployment Options

### Option 1: **Docker Compose** (Development)

```yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:change-me@db:5432/scraper
    depends_on:
      - redis
      - db

  worker:
    build: .
    command: python worker.py
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 4

  redis:
    image: redis:7-alpine

  db:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=scraper
      # Placeholder value: the postgres image refuses to start without a password
      - POSTGRES_PASSWORD=change-me
```

### Option 2: **Kubernetes** (Production)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: scraper-api
  template:
    metadata:
      labels:
        app: scraper-api  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: scraper-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: REDIS_URL
              value: redis://redis:6379
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8000
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scraper-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: scraper-worker
  template:
    metadata:
      labels:
        app: scraper-worker  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: worker
          image: scraper-worker:latest
```

---

## 📈 Scaling Considerations

### Horizontal Scaling

```
1 Worker    = 3 jobs/minute (20s per job)
10 Workers  = 30 jobs/minute
100 Workers = 300 jobs/minute = 432,000 jobs/day
```
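
The figures above follow from simple arithmetic, sketched here so other job durations can be plugged in. `jobs_per_day` is a hypothetical helper, and the result is an upper bound that assumes every worker is busy around the clock with no queueing or startup overhead.

```python
def jobs_per_day(workers: int, seconds_per_job: float) -> int:
    """Upper bound on daily throughput if every worker is always busy."""
    jobs_per_minute = workers * 60 / seconds_per_job
    return int(jobs_per_minute * 60 * 24)


for workers in (1, 10, 100):
    print(workers, jobs_per_day(workers, seconds_per_job=20))
# 1 4320
# 10 43200
# 100 432000
```

The 20 s figure matters a great deal: the test run in the commit message took about 150 s for 550 reviews, which is 7.5 times longer and would cut these numbers by the same factor.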

### Resource Requirements (per worker)

```
CPU:  1-2 cores (Chrome is CPU-intensive)
RAM:  2-4 GB (headless Chrome + data)
Disk: Minimal (results go to S3)
```

### Auto-scaling (Kubernetes HPA)

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scraper-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scraper-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
        target:
          type: Value
          value: "10"  # Scale up if queue > 10
```
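
For intuition about how such an autoscaler reacts, the core of the documented HPA calculation is `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds. The sketch below applies that formula with the bounds from the manifest above; `desired_replicas` is a hypothetical helper, and it deliberately ignores the tolerance band and stabilization windows that the real controller applies.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    """ceil(current * metric / target), clamped to the HPA's replica bounds."""
    wanted = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(current=4, metric=35, target=10))    # 14
print(desired_replicas(current=4, metric=0, target=10))     # 2
print(desired_replicas(current=20, metric=500, target=10))  # 50
```

A queue of 35 jobs against a target of 10 asks for 3.5 times the current replica count, an empty queue falls back to `minReplicas`, and a large backlog is limited by `maxReplicas`.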

---

## ✅ Recommended Stack

### For Small-Medium (< 1000 jobs/day):
```
✅ FastAPI (API Server)
✅ Redis (Queue + Cache)
✅ PostgreSQL (Job metadata)
✅ Local files or S3 (Reviews storage)
✅ Webhooks (Primary)
✅ Polling (Fallback)
✅ Docker Compose (Deployment)
```

### For Large Scale (> 10,000 jobs/day):
```
✅ FastAPI (API Server)
✅ RabbitMQ (Queue)
✅ PostgreSQL (Job metadata)
✅ S3 (Reviews storage)
✅ Webhooks (Primary)
✅ SSE (Real-time updates)
✅ Kubernetes (Orchestration)
✅ Prometheus + Grafana (Monitoring)
✅ ELK Stack (Logging)
```

---

## 🎯 Next Steps

Would you like me to implement:

1. ✅ **Webhooks** - Full webhook support with retries
2. ✅ **Redis Queue** - Job queue with Celery/RQ
3. ✅ **PostgreSQL** - Job metadata storage
4. ✅ **S3 Storage** - Reviews file storage
5. ✅ **Health Checks** - Detailed health endpoints
6. ✅ **SSE Streaming** - Real-time progress updates (optional)
7. ✅ **Docker Setup** - Complete docker-compose.yml

**My recommendation**: Start with **#1-5** (core production features), add #6-7 later if needed.

Let me know which to implement first!