# Storage Strategy Comparison
## PostgreSQL JSONB vs S3 for Review Data

---

## 🎯 Recommendation: Start with PostgreSQL JSONB

### Why PostgreSQL is Better for Most Cases:

```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    reviews_count INTEGER,

    -- Store reviews directly as JSONB!
    reviews_data JSONB,  -- All 244 reviews in one column

    error_message TEXT
);

-- You can even query INSIDE the JSON!
SELECT
    job_id,
    jsonb_array_length(reviews_data) as review_count,
    reviews_data->0->>'author' as first_reviewer
FROM jobs
WHERE reviews_data @> '[{"rating": 5}]';  -- Find jobs with 5-star reviews
```

### Advantages:

✅ **Simpler Architecture**
- One service instead of two
- No S3 credentials/SDK to manage
- Easier local development

✅ **Transactional**
- Atomic updates (job status + reviews in one transaction)
- ACID guarantees
- No risk of the job row and its review payload drifting out of sync across two stores

✅ **Queryable**
```sql
-- Find all jobs with >200 reviews
SELECT job_id, reviews_count
FROM jobs
WHERE jsonb_array_length(reviews_data) > 200;

-- Extract specific review data
SELECT
    job_id,
    review->>'author' as author,
    review->>'rating' as rating
FROM jobs, jsonb_array_elements(reviews_data) as review
WHERE review->>'rating' = '5';
```

✅ **Cost-Effective (Small-Medium Scale)**
```
244 reviews × 0.6 KB = ~150 KB per job
1,000 jobs/month = 150 MB/month
10,000 jobs/month = 1.5 GB/month

PostgreSQL:
- $0/month (self-hosted on an existing server) or ~$15/month (managed)
- Handles 10,000 jobs easily

S3:
- Storage: ~$0.03/month for 1.5 GB (cheap!)
- But need to manage: credentials, SDK, buckets
```

✅ **Built-in Backup**
- Standard PostgreSQL backup tools
- Point-in-time recovery
- Replication included

✅ **Fast Retrieval**
```python
# `db` is a placeholder for your database client; `get_job` is an illustrative name.
def get_job(db, job_id):
    # Single query gets everything
    job = db.query("""
        SELECT job_id, status, reviews_data
        FROM jobs
        WHERE job_id = %s
    """, job_id)

    return {
        "job_id": job.job_id,
        "reviews": job.reviews_data  # JSONB typically arrives already parsed by the driver
    }
```

---

## When to Use S3 Instead

### Use S3 if:

❌ **Very High Volume**
```
> 100,000 jobs/month
> 100 GB of review data
Database backup/restore becomes slow
```

❌ **Long-Term Retention**
```
Need to keep reviews for years
Want lifecycle policies (auto-delete after 1 year)
Cold storage for compliance
```

❌ **Direct Client Access**
```python
import boto3

s3 = boto3.client("s3")

# `get_reviews_url` is an illustrative handler name.
def get_reviews_url(job_id):
    # Pre-signed URLs let clients download directly
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'reviews', 'Key': f'{job_id}.json'},
        ExpiresIn=3600  # URL stays valid for one hour
    )

    # Client downloads directly from S3 (saves bandwidth)
    return {"reviews_url": url}
```

❌ **Multi-Region**
```
S3 replication across regions
CDN integration (CloudFront)
Global low-latency access
```

---

## 📊 Performance Comparison

### PostgreSQL JSONB

```sql
-- %s marks a driver-side parameter placeholder; timings are rough estimates.

-- Store reviews (single INSERT; other NOT NULL columns omitted for brevity)
INSERT INTO jobs (job_id, reviews_data)
VALUES (%s, %s::jsonb);
-- 244 reviews: ~5ms

-- Retrieve reviews (single SELECT)
SELECT reviews_data FROM jobs WHERE job_id = %s;
-- 244 reviews: ~2ms
```

**Total**: ~7ms for store + retrieve

### S3

```python
import json

# Assumes `s3` is a boto3 S3 client; timings are rough estimates.

# Store reviews (HTTP PUT)
s3.put_object(
    Bucket='reviews',
    Key=f'{job_id}.json',
    Body=json.dumps(reviews)
)
# 244 reviews: ~50-200ms (network latency)

# Retrieve reviews (HTTP GET)
response = s3.get_object(
    Bucket='reviews',
    Key=f'{job_id}.json'
)
# 244 reviews: ~50-200ms
```

**Total**: ~100-400ms for store + retrieve

**By these estimates, PostgreSQL is roughly 14-57x faster** (100 ms / 7 ms ≈ 14, 400 ms / 7 ms ≈ 57).

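The latencies quoted in this section are rough estimates, not measurements. Below is a minimal sketch of how a store-plus-retrieve round trip could be timed. The `store` and `retrieve` callables are hypothetical wrappers around whichever backend is being tested; the in-memory dict is only a stand-in so the snippet runs on its own.

```python
import json
import statistics
import time


def time_round_trip(store, retrieve, job_id, reviews, runs=20):
    """Return the median store+retrieve latency in milliseconds.

    `store` and `retrieve` are hypothetical callables wrapping the backend
    under test (a PostgreSQL query, an S3 request, ...).
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        store(job_id, reviews)
        result = retrieve(job_id)
        samples.append((time.perf_counter() - start) * 1000)
        if result != reviews:
            raise ValueError("backend returned different data than was stored")
    return statistics.median(samples)


# Stand-in backend: a dict holding serialized JSON, so the sketch is runnable.
_fake_backend = {}


def fake_store(job_id, reviews):
    _fake_backend[job_id] = json.dumps(reviews)


def fake_retrieve(job_id):
    return json.loads(_fake_backend[job_id])


sample_reviews = [{"author": f"user{i}", "rating": 5} for i in range(244)]
median_ms = time_round_trip(fake_store, fake_retrieve, "job-1", sample_reviews)
print(f"median round trip: {median_ms:.3f} ms")
```

Swapping the two stand-in functions for real PostgreSQL and S3 calls would give figures specific to your network and hardware, which may differ noticeably from the estimates above.
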
---

## 💾 Size Limits

### PostgreSQL JSONB
```
Max field size: 1 GB (a single JSONB array or object is capped lower, at about 256 MB)
Practical limit: ~100 MB per row

Our use case:
244 reviews × 0.6 KB = 150 KB ✅ Perfect!
10,000 reviews × 0.6 KB = 6 MB ✅ Still great
100,000 reviews × 0.6 KB = 60 MB ✅ OK, but consider splitting
```

### When to worry:
```
> 50,000 reviews per job → Consider S3
> 100 MB per job → Definitely use S3
```

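Every estimate here rests on the assumed 0.6 KB per review, so it may be worth checking that figure against real scraped data. The sketch below measures the serialized size of a sample and applies the two thresholds listed above. The function names and the sample review are illustrative, not part of the project.

```python
import json


def measure_payload(sample_reviews, projected_count):
    """Estimate the serialized size of `projected_count` reviews.

    Returns (avg_bytes_per_review, projected_megabytes), based on the
    UTF-8 JSON size of the sample. Megabytes are decimal (10^6 bytes).
    """
    if not sample_reviews:
        raise ValueError("need at least one sample review")
    total_bytes = len(json.dumps(sample_reviews).encode("utf-8"))
    avg_bytes = total_bytes / len(sample_reviews)
    projected_mb = avg_bytes * projected_count / 1_000_000
    return avg_bytes, projected_mb


def storage_advice(review_count, payload_mb):
    """Apply the 'when to worry' thresholds: >100 MB, then >50,000 reviews."""
    if payload_mb > 100:
        return "use S3"
    if review_count > 50_000:
        return "consider S3"
    return "PostgreSQL JSONB"


# Made-up sample review of roughly 0.6 KB.
sample = [{"author": "A. Reviewer", "rating": 5, "text": "Great place! " * 40}]
avg, mb = measure_payload(sample, projected_count=10_000)
print(f"{avg:.0f} bytes/review, ~{mb:.1f} MB for 10,000 reviews")
print(storage_advice(10_000, mb))
```
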
---

## 🏗️ Hybrid Approach (Best of Both Worlds)

For maximum flexibility:

```python
import json

# PostgreSQL, S3Client and NotFound are placeholders for your own wrappers.
# Assumes the jobs table has an extra `reviews_s3_key TEXT` column.
class JobStorage:
    def __init__(self):
        self.db = PostgreSQL()
        self.s3 = S3Client()  # Optional

    async def save_reviews(self, job_id, reviews):
        reviews_json = json.dumps(reviews)
        # Measure encoded bytes, not characters
        size_mb = len(reviews_json.encode("utf-8")) / 1024 / 1024

        if size_mb < 10:  # Small job: use PostgreSQL
            await self.db.execute("""
                UPDATE jobs
                SET reviews_data = %s::jsonb
                WHERE job_id = %s
            """, reviews_json, job_id)

        else:  # Large job: use S3
            await self.s3.upload(
                f'reviews/{job_id}.json',
                reviews_json
            )
            await self.db.execute("""
                UPDATE jobs
                SET reviews_s3_key = %s
                WHERE job_id = %s
            """, f'reviews/{job_id}.json', job_id)

    async def get_reviews(self, job_id):
        job = await self.db.fetch_one("""
            SELECT reviews_data, reviews_s3_key
            FROM jobs
            WHERE job_id = %s
        """, job_id)

        if job is None:
            raise NotFound()
        # `is not None` so a job with zero reviews ([]) is still served from PostgreSQL
        if job.reviews_data is not None:
            return job.reviews_data  # From PostgreSQL
        elif job.reviews_s3_key:
            return await self.s3.download(job.reviews_s3_key)  # From S3
        else:
            raise NotFound()
```

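The routing logic of the hybrid approach can be tried out without a database or an S3 bucket. The following is a toy, synchronous model in which plain dicts stand in for PostgreSQL and S3. The class name and the tiny demo threshold are assumptions made for illustration only.

```python
import json

SIZE_THRESHOLD_BYTES = 10 * 1024 * 1024  # 10 MB cut-off, matching the hybrid approach


class InMemoryHybridStorage:
    """Toy model of the hybrid routing; dicts stand in for PostgreSQL and S3."""

    def __init__(self, threshold_bytes=SIZE_THRESHOLD_BYTES):
        self.threshold_bytes = threshold_bytes
        self.db_rows = {}     # job_id -> {"reviews_data": ..., "reviews_s3_key": ...}
        self.s3_objects = {}  # key -> serialized JSON

    def save_reviews(self, job_id, reviews):
        payload = json.dumps(reviews)
        if len(payload.encode("utf-8")) < self.threshold_bytes:
            # Small job: keep the reviews inline in the "database" row.
            self.db_rows[job_id] = {"reviews_data": reviews, "reviews_s3_key": None}
        else:
            # Large job: store the payload in "S3" and keep only the key.
            key = f"reviews/{job_id}.json"
            self.s3_objects[key] = payload
            self.db_rows[job_id] = {"reviews_data": None, "reviews_s3_key": key}

    def get_reviews(self, job_id):
        row = self.db_rows.get(job_id)
        if row is None:
            raise KeyError(job_id)
        if row["reviews_data"] is not None:
            return row["reviews_data"]
        return json.loads(self.s3_objects[row["reviews_s3_key"]])


storage = InMemoryHybridStorage(threshold_bytes=1024)  # tiny threshold for the demo
storage.save_reviews("small", [{"rating": 5}])
storage.save_reviews("large", [{"rating": 4, "text": "x" * 2000}])
print(storage.db_rows["small"]["reviews_s3_key"])  # None: stored inline
print(storage.db_rows["large"]["reviews_s3_key"])  # reviews/large.json
```

Callers only ever use `save_reviews` and `get_reviews`, so the threshold can be tuned later without touching them.
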
---

## ✅ Final Recommendation

### For Your Use Case:

**Use PostgreSQL JSONB** because:

1. ✅ Simpler (one service, not two)
2. ✅ Faster (~2ms vs ~50-200ms per retrieval)
3. ✅ Cheaper (for typical volumes)
4. ✅ Queryable (can analyze reviews in SQL)
5. ✅ Transactional (atomic updates)
6. ✅ Easier backups

**Schema**:
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- built in since PostgreSQL 13; older versions need pgcrypto
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    url TEXT NOT NULL,
    webhook_url TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,

    reviews_count INTEGER,
    reviews_data JSONB,  -- All reviews here!
    scrape_time REAL,

    error_message TEXT,
    metadata JSONB,

    CONSTRAINT valid_status CHECK (status IN ('pending', 'running', 'completed', 'failed', 'cancelled'))
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
CREATE INDEX idx_jobs_webhook ON jobs(webhook_url) WHERE webhook_url IS NOT NULL;
```

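With this schema, marking a job as finished and storing its reviews can happen in a single `UPDATE`, which is what makes the write atomic. The sketch below only builds the statement and its parameters. It assumes a DB-API driver with `%s` placeholders (psycopg2 style), and `build_complete_job_params` is a hypothetical helper name.

```python
import json

# One statement updates status, counters and the JSONB payload together.
COMPLETE_JOB_SQL = """
    UPDATE jobs
    SET status = 'completed',
        completed_at = NOW(),
        reviews_count = %s,
        reviews_data = %s::jsonb,
        scrape_time = %s
    WHERE job_id = %s
"""


def build_complete_job_params(job_id, reviews, scrape_time):
    """Return the parameter tuple for COMPLETE_JOB_SQL, in placeholder order."""
    return (len(reviews), json.dumps(reviews), float(scrape_time), str(job_id))


params = build_complete_job_params(
    "123e4567-e89b-12d3-a456-426614174000",  # example UUID
    [{"author": "A. Reviewer", "rating": 5}],
    scrape_time=12.5,
)
print(params[0])  # 1
# With a real connection this would be: cursor.execute(COMPLETE_JOB_SQL, params)
```
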
**Migration Path to S3**:
- Start with PostgreSQL
- If you reach 100 GB+ of data, migrate to S3
- Keep PostgreSQL for metadata only
- Use the hybrid approach above

---

## 📈 Scale Projections

```
Small:
  1,000 jobs/month × 150 KB = 150 MB/month
  → PostgreSQL ✅

Medium:
  10,000 jobs/month × 150 KB = 1.5 GB/month
  → PostgreSQL ✅

Large:
  100,000 jobs/month × 150 KB = 15 GB/month
  → PostgreSQL ✅ (but consider S3)

Very Large:
  1,000,000 jobs/month × 150 KB = 150 GB/month
  → S3 ✅

Enterprise:
  Need multi-year retention
  Multi-region replication
  Compliance requirements
  → S3 ✅
```

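The projections above follow from the same per-job arithmetic, which can be reproduced for other volumes. In the sketch below, the 10 GB and 100 GB cut-offs are assumptions chosen to match the tiers listed; they are not hard limits.

```python
def monthly_storage_gb(jobs_per_month, reviews_per_job=244, kb_per_review=0.6):
    """Projected new review data per month, in decimal GB (10^6 KB)."""
    kb = jobs_per_month * reviews_per_job * kb_per_review
    return kb / 1_000_000


def suggested_backend(jobs_per_month):
    """Map a monthly job volume to the tiers above (cut-offs are assumptions)."""
    gb = monthly_storage_gb(jobs_per_month)
    if gb >= 100:
        return "S3"
    if gb >= 10:
        return "PostgreSQL (consider S3)"
    return "PostgreSQL"


for jobs in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{jobs:>9,} jobs/month: {monthly_storage_gb(jobs):8.2f} GB -> {suggested_backend(jobs)}")
```

Passing a measured `kb_per_review` instead of the assumed 0.6 keeps the projection in line with real data.
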
---

**Bottom Line**: Start with **PostgreSQL JSONB**. It's simpler, faster, and cheaper at the small-to-large volumes projected above. Move to S3 only when volume, retention, or multi-region requirements call for it.