whyrating-engine-legacy/STORAGE_COMPARISON.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
2026-01-18 19:49:24 +00:00

# Storage Strategy Comparison
## PostgreSQL JSONB vs S3 for Review Data
---
## 🎯 Recommendation: Start with PostgreSQL JSONB
### Why PostgreSQL is Better for Most Cases:
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    -- Store reviews directly as JSONB!
    reviews_data JSONB,  -- All 244 reviews in one column
    error_message TEXT
);

-- You can even query INSIDE the JSON!
SELECT
    job_id,
    jsonb_array_length(reviews_data) AS review_count,
    reviews_data->0->>'author' AS first_reviewer
FROM jobs
WHERE reviews_data @> '[{"rating": 5}]';  -- Find jobs with 5-star reviews
```
### Advantages:
**Simpler Architecture**
- One service instead of two
- No S3 credentials/SDK to manage
- Easier local development
**Transactional**
- Atomic updates (job status + reviews in one transaction)
- ACID guarantees
- No eventual consistency issues
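A minimal sketch of the atomic update described above, assuming a psycopg2-style DB-API connection (used as a context manager, it commits on success and rolls back on an exception). The function name `complete_job` and the set of columns updated are illustrative choices, not taken from the project code.

```python
import json


def complete_job(conn, job_id, reviews):
    """Mark a job completed and store its reviews in one transaction.

    Assumes `conn` behaves like a psycopg2 connection: `with conn:`
    commits if the block succeeds and rolls back if it raises.
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE jobs
                SET status = 'completed',
                    reviews_data = %s::jsonb,
                    reviews_count = %s,
                    completed_at = NOW()
                WHERE job_id = %s
                """,
                (json.dumps(reviews), len(reviews), job_id),
            )
```

Because status and reviews change in the same statement, a reader never sees a job marked `completed` without its reviews.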
**Queryable**
```sql
-- Find all jobs with >200 reviews
SELECT job_id, reviews_count
FROM jobs
WHERE jsonb_array_length(reviews_data) > 200;

-- Extract specific review data
SELECT
    job_id,
    review->>'author' AS author,
    review->>'rating' AS rating
FROM jobs, jsonb_array_elements(reviews_data) AS review
WHERE review->>'rating' = '5';
```
**Cost-Effective (Small-Medium Scale)**
```
244 reviews × 0.6 KB = ~150 KB per job
1,000 jobs/month = 150 MB/month
10,000 jobs/month = 1.5 GB/month
PostgreSQL:
- $0/month (self-hosted) or $15/month (managed)
- Handles 10,000 jobs easily
S3:
- Storage: $0.03/month (cheap!)
- But need to manage: credentials, SDK, buckets
```
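The storage figures above can be reproduced with a few lines of arithmetic. This is a rough sketch: the helper name `monthly_storage_mb` is made up for illustration, and the defaults (244 reviews per job, 0.6 KB per review) are the document's own estimates rather than measured values.

```python
def monthly_storage_mb(jobs_per_month, reviews_per_job=244, kb_per_review=0.6):
    """Estimated new review data per month, in MB (decimal: 1 MB = 1000 KB)."""
    return jobs_per_month * reviews_per_job * kb_per_review / 1000


for jobs in (1_000, 10_000):
    print(f"{jobs:>6} jobs/month -> {monthly_storage_mb(jobs):,.1f} MB/month")
```

The exact per-job figure is 146.4 KB; the tables in this document round it to 150 KB.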
**Built-in Backup**
- Standard PostgreSQL backup tools
- Point-in-time recovery
- Replication included
**Fast Retrieval**
```python
def get_job(db, job_id):
    # `db` is a placeholder database handle
    # Single query gets everything
    job = db.query("""
        SELECT job_id, status, reviews_data
        FROM jobs
        WHERE job_id = %s
    """, job_id)
    return {
        "job_id": job.job_id,
        "reviews": job.reviews_data,  # Already parsed JSON
    }
```
---
## When to Use S3 Instead
### Use S3 if:
**Very High Volume**
```
> 100,000 jobs/month
> 100 GB of review data
Database backup/restore becomes slow
```
**Long-Term Retention**
```
Need to keep reviews for years
Want lifecycle policies (auto-delete after 1 year)
Cold storage for compliance
```
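As a sketch of the lifecycle policy mentioned above, the rule below expires objects one year after creation. It uses boto3's `put_bucket_lifecycle_configuration`; the helper names, the rule ID, and the idea of passing the bucket name in as an argument are assumptions made for illustration.

```python
def expire_after_days_rule(days=365, prefix=""):
    """Build an S3 lifecycle configuration that deletes objects after `days`."""
    return {
        "Rules": [
            {
                "ID": f"expire-after-{days}-days",
                "Filter": {"Prefix": prefix},  # "" matches every object
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_lifecycle(bucket, days=365, prefix=""):
    """Attach the expiry rule to `bucket` (needs AWS credentials)."""
    import boto3  # imported lazily so building the rule needs no AWS SDK

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=expire_after_days_rule(days, prefix),
    )
```

Calling `apply_lifecycle("reviews")` would apply the one-year expiry to the `reviews` bucket used in the examples in this document.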
**Direct Client Access**
```python
import boto3

s3 = boto3.client('s3')

def get_reviews_url(job_id):
    # Pre-signed URLs let clients download directly
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'reviews', 'Key': f'{job_id}.json'},
        ExpiresIn=3600  # URL stays valid for one hour
    )
    # Client downloads directly from S3 (saves bandwidth)
    return {"reviews_url": url}
```
**Multi-Region**
```
S3 replication across regions
CDN integration (CloudFront)
Global low-latency access
```
---
## 📊 Performance Comparison
### PostgreSQL JSONB
```sql
-- Store reviews (single INSERT)
INSERT INTO jobs (job_id, reviews_data)
VALUES (%s, %s::jsonb);
-- 244 reviews: ~5ms

-- Retrieve reviews (single SELECT)
SELECT reviews_data FROM jobs WHERE job_id = %s;
-- 244 reviews: ~2ms
```
**Total**: ~7ms for store + retrieve
### S3
```python
import json
import boto3

s3 = boto3.client('s3')

# Store reviews (HTTP PUT)
s3.put_object(
    Bucket='reviews',
    Key=f'{job_id}.json',
    Body=json.dumps(reviews).encode('utf-8')
)
# 244 reviews: ~50-200ms (network latency)

# Retrieve reviews (HTTP GET)
response = s3.get_object(
    Bucket='reviews',
    Key=f'{job_id}.json'
)
reviews = json.loads(response['Body'].read())  # Body is a stream of bytes
# 244 reviews: ~50-200ms
```
**Total**: ~100-400ms for store + retrieve
**By these estimates, PostgreSQL is roughly 14-57x faster (~7ms vs ~100-400ms).**
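The latencies above are estimates, so it is worth measuring them on your own stack. Below is a small timing harness, offered as a sketch: `median_latency_ms` is a hypothetical helper, and the operation you pass in (a store or retrieve call against PostgreSQL or S3) is up to you.

```python
import statistics
import time


def median_latency_ms(operation, runs=20):
    """Call `operation` `runs` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

For example, `median_latency_ms(lambda: fetch_reviews(job_id))` would time a retrieval, where `fetch_reviews` stands in for whichever read path you are testing. The median is used so that a single slow outlier, such as a cold connection, does not skew the result.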
---
## 💾 Size Limits
### PostgreSQL JSONB
```
Max field size: 1 GB (a single JSONB array or object is capped at ~256 MB)
Practical limit: ~100 MB per row
Our use case:
244 reviews × 0.6 KB = 150 KB ✅ Perfect!
10,000 reviews × 0.6 KB = 6 MB ✅ Still great
100,000 reviews × 0.6 KB = 60 MB ✅ OK, but consider splitting
```
### When to worry:
```
> 50,000 reviews per job → Consider S3
> 100 MB per job → Definitely use S3
```
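Rather than relying on the 0.6 KB per review estimate, you can measure a job's serialized size directly. The sketch below maps the thresholds above onto a suggestion; the function names and return labels are made up for illustration.

```python
import json


def payload_size_mb(reviews):
    """Size of the reviews list once serialized to UTF-8 JSON, in MiB."""
    return len(json.dumps(reviews).encode("utf-8")) / (1024 * 1024)


def suggested_backend(reviews, max_reviews=50_000, max_mb=100):
    """Map the 'when to worry' thresholds onto a storage suggestion."""
    if payload_size_mb(reviews) > max_mb:
        return "s3"  # over the size limit: definitely use S3
    if len(reviews) > max_reviews:
        return "consider-s3"  # many reviews, but still under the size limit
    return "postgres"
```

The size check runs first because it is the stricter signal: a job with few but very long reviews can still exceed the size threshold.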
---
## 🏗️ Hybrid Approach (Best of Both Worlds)
For maximum flexibility:
```python
import json


class NotFound(Exception):
    """Raised when a job has no stored reviews."""


class JobStorage:
    # Assumes the jobs table has an extra reviews_s3_key TEXT column.
    # `db` and `s3` are placeholder async wrappers, not real library classes.
    def __init__(self, db, s3=None):
        self.db = db
        self.s3 = s3  # Optional

    async def save_reviews(self, job_id, reviews):
        reviews_json = json.dumps(reviews)
        size_mb = len(reviews_json.encode('utf-8')) / 1024 / 1024

        # Small job (or no S3 configured): use PostgreSQL
        if size_mb < 10 or self.s3 is None:
            await self.db.execute("""
                UPDATE jobs
                SET reviews_data = %s::jsonb, reviews_s3_key = NULL
                WHERE job_id = %s
            """, reviews_json, job_id)
        else:  # Large job: use S3
            key = f'reviews/{job_id}.json'
            await self.s3.upload(key, reviews_json)
            await self.db.execute("""
                UPDATE jobs
                SET reviews_s3_key = %s, reviews_data = NULL
                WHERE job_id = %s
            """, key, job_id)

    async def get_reviews(self, job_id):
        job = await self.db.fetch_one("""
            SELECT reviews_data, reviews_s3_key
            FROM jobs
            WHERE job_id = %s
        """, job_id)

        if job is None:
            raise NotFound()
        if job.reviews_data is not None:
            return job.reviews_data  # From PostgreSQL
        elif job.reviews_s3_key:
            # From S3: stored as a JSON string, so parse it
            return json.loads(await self.s3.download(job.reviews_s3_key))
        else:
            raise NotFound()
```
---
## ✅ Final Recommendation
### For Your Use Case:
**Use PostgreSQL JSONB** because:
1. ✅ Simpler (one service, not two)
2. ✅ Faster (~2ms vs ~50-200ms per read)
3. ✅ Cheaper (for typical volumes)
4. ✅ Queryable (can analyze reviews in SQL)
5. ✅ Transactional (atomic updates)
6. ✅ Easier backups
**Schema**:
```sql
-- gen_random_uuid() is built in from PostgreSQL 13; older versions need the pgcrypto extension
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_data JSONB,  -- All reviews here!
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB,
    CONSTRAINT valid_status CHECK (status IN ('pending', 'running', 'completed', 'failed', 'cancelled'))
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
CREATE INDEX idx_jobs_webhook ON jobs(webhook_url) WHERE webhook_url IS NOT NULL;
```
**Migration Path to S3**:
- Start with PostgreSQL
- If you reach 100 GB+ of review data, move the review payloads to S3
- Keep PostgreSQL for metadata only
- Use the hybrid approach above
---
## 📈 Scale Projections
```
Small:
1,000 jobs/month × 150 KB = 150 MB/month
→ PostgreSQL ✅
Medium:
10,000 jobs/month × 150 KB = 1.5 GB/month
→ PostgreSQL ✅
Large:
100,000 jobs/month × 150 KB = 15 GB/month
→ PostgreSQL ✅ (but consider S3)
Very Large:
1,000,000 jobs/month × 150 KB = 150 GB/month
→ S3 ✅
Enterprise:
Need multi-year retention
Multi-region replication
Compliance requirements
→ S3 ✅
```
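To relate these monthly rates to the 100 GB migration threshold, the sketch below estimates how long each tier takes to get there. It assumes steady growth with no deletions, and `months_until` is a hypothetical helper name.

```python
import math


def months_until(threshold_gb, gb_per_month):
    """Whole months of steady growth before stored data reaches threshold_gb."""
    return math.ceil(threshold_gb / gb_per_month)


tiers = [("Small", 0.15), ("Medium", 1.5), ("Large", 15), ("Very Large", 150)]
for label, gb_per_month in tiers:
    print(f"{label}: {months_until(100, gb_per_month)} months to 100 GB")
```

Under these assumptions the Large tier crosses 100 GB in its seventh month, which is why it carries the "consider S3" note, while the Medium tier takes more than five years.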
---
**Bottom Line**: Start with **PostgreSQL JSONB**. It's simpler, faster, and cheaper at the small-to-large volumes projected above. Move to S3 only when volume, retention, or multi-region requirements call for it.