Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

STORAGE_COMPARISON.md Normal file

@@ -0,0 +1,328 @@
# Storage Strategy Comparison
## PostgreSQL JSONB vs S3 for Review Data
---
## 🎯 Recommendation: Start with PostgreSQL JSONB
### Why PostgreSQL is Better for Most Cases:
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    -- Store reviews directly as JSONB!
    reviews_data JSONB,  -- All 244 reviews in one column
    error_message TEXT
);

-- You can even query INSIDE the JSON!
SELECT
    job_id,
    jsonb_array_length(reviews_data) AS review_count,
    reviews_data->0->>'author' AS first_reviewer
FROM jobs
WHERE reviews_data @> '[{"rating": 5}]';  -- Find jobs with 5-star reviews
```
### Advantages:
**Simpler Architecture**
- One service instead of two
- No S3 credentials/SDK to manage
- Easier local development
**Transactional**
- Atomic updates (job status + reviews in one transaction)
- ACID guarantees
- No eventual consistency issues
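
To make the atomicity point concrete, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, purely so the example runs without a server; the table is a cut-down version of `jobs`, and `save_result` is a hypothetical helper. The review payload and the status change are written in one transaction, so a failure in between leaves neither applied.

```python
import json
import sqlite3


def save_result(conn, job_id, reviews, fail=False):
    """Write reviews and status in one transaction (hypothetical helper)."""
    # `with conn` commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "UPDATE jobs SET reviews_data = ?, reviews_count = ? WHERE job_id = ?",
            (json.dumps(reviews), len(reviews), job_id),
        )
        if fail:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "UPDATE jobs SET status = 'completed' WHERE job_id = ?", (job_id,)
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL,"
    " reviews_data TEXT, reviews_count INTEGER)"
)
conn.execute("INSERT INTO jobs (job_id, status) VALUES ('job-1', 'running')")
conn.commit()

# A failure halfway through must not leave reviews stored on a 'running' job.
try:
    save_result(conn, "job-1", [{"author": "A", "rating": 5}], fail=True)
except RuntimeError:
    pass
row_after_failure = conn.execute(
    "SELECT status, reviews_data FROM jobs WHERE job_id = 'job-1'"
).fetchone()

# The successful path applies both writes together.
save_result(conn, "job-1", [{"author": "A", "rating": 5}])
row_after_success = conn.execute(
    "SELECT status, reviews_count FROM jobs WHERE job_id = 'job-1'"
).fetchone()
```

With S3 in the picture, the object upload and the status update are two separate systems, so this guarantee would have to be rebuilt in application code.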
**Queryable**
```sql
-- Find all jobs with >200 reviews
SELECT job_id, reviews_count
FROM jobs
WHERE jsonb_array_length(reviews_data) > 200;

-- Extract specific review data
SELECT
    job_id,
    review->>'author' AS author,
    review->>'rating' AS rating
FROM jobs, jsonb_array_elements(reviews_data) AS review
WHERE review->>'rating' = '5';
```
**Cost-Effective (Small-Medium Scale)**
```
244 reviews × 0.6 KB = ~150 KB per job
1,000 jobs/month = 150 MB/month
10,000 jobs/month = 1.5 GB/month
PostgreSQL:
- $0/month (self-hosted) or $15/month (managed)
- Handles 10,000 jobs easily
S3:
- Storage: ~$0.03/month for 1.5 GB (cheap!)
- But need to manage: credentials, SDK, buckets
```
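
The arithmetic behind these estimates can be reproduced with a short sketch. The per-review size and job size come from the figures above; the S3 price constant is an assumption (roughly the S3 Standard list price per GB-month) and should be checked against current pricing. The function names are illustrative only.

```python
KB_PER_REVIEW = 0.6          # average review size quoted above
REVIEWS_PER_JOB = 244        # sample job size used in this document
S3_USD_PER_GB_MONTH = 0.023  # assumption: S3 Standard list price, verify before use


def monthly_volume_mb(jobs_per_month):
    """Approximate review data written per month, in MB (decimal units)."""
    return jobs_per_month * REVIEWS_PER_JOB * KB_PER_REVIEW / 1000


def s3_storage_cost_usd(volume_mb):
    """Storage-only S3 cost for holding `volume_mb` for one month."""
    return volume_mb / 1000 * S3_USD_PER_GB_MONTH


for jobs in (1_000, 10_000):
    volume = monthly_volume_mb(jobs)
    cost = s3_storage_cost_usd(volume)
    print(f"{jobs:>6} jobs/month: {volume:,.0f} MB, S3 storage ${cost:.2f}")
```

The exact product is 146.4 KB per job, which the text rounds up to 150 KB. Request and transfer charges are left out, so the S3 figure is a lower bound.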
**Built-in Backup**
- Standard PostgreSQL backup tools
- Point-in-time recovery
- Replication included
**Fast Retrieval**
```python
# Single query gets everything (`db` is a placeholder database helper)
def get_job(db, job_id):
    job = db.query("""
        SELECT job_id, status, reviews_data
        FROM jobs
        WHERE job_id = %s
    """, job_id)
    return {
        "job_id": job.job_id,
        "reviews": job.reviews_data,  # Already parsed JSON
    }
```
---
## When to Use S3 Instead
### Use S3 if:
**Very High Volume**
```
> 100,000 jobs/month
> 100 GB of review data
Database backup/restore becomes slow
```
**Long-Term Retention**
```
Need to keep reviews for years
Want lifecycle policies (auto-delete after 1 year)
Cold storage for compliance
```
**Direct Client Access**
```python
import boto3

s3 = boto3.client("s3")

def get_reviews_url(job_id):
    # Pre-signed URLs let clients download directly
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "reviews", "Key": f"{job_id}.json"},
        ExpiresIn=3600,  # URL stays valid for one hour
    )
    # Client downloads directly from S3 (saves bandwidth)
    return {"reviews_url": url}
```
**Multi-Region**
```
S3 replication across regions
CDN integration (CloudFront)
Global low-latency access
```
---
## 📊 Performance Comparison
### PostgreSQL JSONB
```sql
-- Store reviews (single INSERT)
INSERT INTO jobs (job_id, reviews_data)
VALUES (%s, %s::jsonb);
-- 244 reviews: ~5ms

-- Retrieve reviews (single SELECT)
SELECT reviews_data FROM jobs WHERE job_id = %s;
-- 244 reviews: ~2ms
```
**Total**: ~7ms for store + retrieve
### S3
```python
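import json
import boto3

s3 = boto3.client("s3")
# `job_id` and `reviews` come from the calling code

# Store reviews (HTTP PUT)
s3.put_object(
    Bucket="reviews",
    Key=f"{job_id}.json",
    Body=json.dumps(reviews),
)
# 244 reviews: ~50-200ms (network latency)

# Retrieve reviews (HTTP GET)
response = s3.get_object(
    Bucket="reviews",
    Key=f"{job_id}.json",
)
reviews = json.loads(response["Body"].read())
# 244 reviews: ~50-200ms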
```
**Total**: ~100-400ms for store + retrieve
**By these estimates, PostgreSQL is roughly 14-57x faster** (100 ms / 7 ms ≈ 14, 400 ms / 7 ms ≈ 57).
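
The latencies above are estimates, so it is worth measuring them in your own environment. The following is a minimal sketch of a timing harness; `median_latency_ms` and `speedup` are hypothetical helper names, and the operation you pass in would be your own store or retrieve call.

```python
import statistics
import time


def median_latency_ms(operation, repeats=20):
    """Run `operation` repeatedly and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def speedup(slow_ms, fast_ms):
    """How many times faster the fast path is than the slow path."""
    return slow_ms / fast_ms


# The ratios quoted above, from the estimated totals (7 ms vs 100-400 ms).
low = speedup(100, 7)
high = speedup(400, 7)
print(f"Estimated speedup: {low:.0f}x to {high:.0f}x")

# Usage: replace the lambda with a real call, e.g. a SELECT or an S3 GET.
baseline = median_latency_ms(lambda: None, repeats=5)
```

The median is used instead of the mean so that a single slow network round trip does not dominate the result.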
---
## 💾 Size Limits
### PostgreSQL JSONB
```
Max field size: 1 GB (JSONB arrays and objects are capped lower, at about 256 MB)
Practical limit: ~100 MB per row
Our use case:
  244 reviews × 0.6 KB ≈ 150 KB ✅ Perfect!
  10,000 reviews × 0.6 KB = 6 MB ✅ Still great
  100,000 reviews × 0.6 KB = 60 MB ✅ OK, but consider splitting
```
### When to worry:
```
> 50,000 reviews per job → Consider S3
> 100 MB per job → Definitely use S3
```
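
Rather than relying on the 0.6 KB average, the payload can be measured directly. This is a minimal sketch under assumptions: `payload_size_kb` and `storage_advice` are hypothetical helpers, and the thresholds simply restate the ones listed above.

```python
import json


def payload_size_kb(reviews):
    """Size of the serialized review list in KB, counting UTF-8 bytes."""
    encoded = json.dumps(reviews, ensure_ascii=False).encode("utf-8")
    return len(encoded) / 1024


def storage_advice(review_count, size_mb):
    """Map a job onto the thresholds above: >100 MB or >50,000 reviews."""
    if size_mb > 100:
        return "s3"
    if review_count > 50_000:
        return "consider-s3"
    return "postgresql"


# Usage with an illustrative review shape (field names are assumptions).
reviews = [{"author": "Ana", "rating": 5, "text": "Great place"}] * 244
size_mb = payload_size_kb(reviews) / 1024
print(storage_advice(len(reviews), size_mb))
```

Counting encoded bytes matters for reviews with non-ASCII text, where the character count understates the stored size.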
---
## 🏗️ Hybrid Approach (Best of Both Worlds)
For maximum flexibility:
```python
import json

# PostgreSQL, S3Client and NotFound are placeholders for your own
# database wrapper, S3 wrapper and exception class.
# The jobs table also needs a reviews_s3_key TEXT column for this to work.

class JobStorage:
    def __init__(self):
        self.db = PostgreSQL()
        self.s3 = S3Client()  # Optional

    async def save_reviews(self, job_id, reviews):
        reviews_json = json.dumps(reviews)
        # Measure encoded bytes, not characters
        size_mb = len(reviews_json.encode("utf-8")) / 1024 / 1024

        if size_mb < 10:  # Small job: use PostgreSQL
            await self.db.execute("""
                UPDATE jobs
                SET reviews_data = %s::jsonb, reviews_s3_key = NULL
                WHERE job_id = %s
            """, reviews_json, job_id)
        else:  # Large job: use S3
            key = f'reviews/{job_id}.json'
            await self.s3.upload(key, reviews_json)
            await self.db.execute("""
                UPDATE jobs
                SET reviews_s3_key = %s, reviews_data = NULL
                WHERE job_id = %s
            """, key, job_id)

    async def get_reviews(self, job_id):
        job = await self.db.fetch_one("""
            SELECT reviews_data, reviews_s3_key
            FROM jobs
            WHERE job_id = %s
        """, job_id)

        if job is None:
            raise NotFound()
        if job.reviews_data is not None:  # An empty list is still a valid result
            return job.reviews_data  # From PostgreSQL
        elif job.reviews_s3_key:
            return await self.s3.download(job.reviews_s3_key)  # From S3
        else:
            raise NotFound()
```
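
The routing rule can be exercised without a database or a bucket. The sketch below is a synchronous, in-memory stand-in for the class above: `InMemoryJobStorage` is hypothetical, plain dicts replace the `jobs` table and the S3 bucket, and the demo uses a deliberately tiny threshold so both branches run.

```python
import json

THRESHOLD_BYTES = 10 * 1024 * 1024  # the 10 MB cut-off used above


class InMemoryJobStorage:
    """Hypothetical stand-in: dicts replace the PostgreSQL table and S3 bucket."""

    def __init__(self, threshold_bytes=THRESHOLD_BYTES):
        self.rows = {}     # job_id -> {"reviews_data": ..., "reviews_s3_key": ...}
        self.objects = {}  # S3 key -> serialized JSON
        self.threshold_bytes = threshold_bytes

    def save_reviews(self, job_id, reviews):
        body = json.dumps(reviews)
        if len(body.encode("utf-8")) < self.threshold_bytes:
            # Small job: keep the payload inline and clear any S3 pointer.
            self.rows[job_id] = {"reviews_data": reviews, "reviews_s3_key": None}
        else:
            # Large job: store the payload as an object, keep only the key inline.
            key = f"reviews/{job_id}.json"
            self.objects[key] = body
            self.rows[job_id] = {"reviews_data": None, "reviews_s3_key": key}

    def get_reviews(self, job_id):
        row = self.rows.get(job_id)
        if row is None:
            raise KeyError(job_id)
        if row["reviews_data"] is not None:
            return row["reviews_data"]
        return json.loads(self.objects[row["reviews_s3_key"]])


storage = InMemoryJobStorage(threshold_bytes=100)  # tiny threshold for the demo
small = [{"author": "A", "rating": 5}]
large = [{"author": f"user{i}", "rating": 4} for i in range(50)]
storage.save_reviews("job-small", small)
storage.save_reviews("job-large", large)
```

Callers only ever use `get_reviews`, so the threshold can change later without touching the code that reads reviews.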
---
## ✅ Final Recommendation
### For Your Use Case:
**Use PostgreSQL JSONB** because:
1. ✅ Simpler (one service, not two)
2. ✅ Faster (~2ms vs ~50-200ms per retrieval)
3. ✅ Cheaper (for typical volumes)
4. ✅ Queryable (can analyze reviews in SQL)
5. ✅ Transactional (atomic updates)
6. ✅ Easier backups
**Schema**:
```sql
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    reviews_count INTEGER,
    reviews_data JSONB,  -- All reviews here!
    scrape_time REAL,
    error_message TEXT,
    metadata JSONB,
    CONSTRAINT valid_status CHECK (
        status IN ('pending', 'running', 'completed', 'failed', 'cancelled')
    )
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
CREATE INDEX idx_jobs_webhook ON jobs(webhook_url) WHERE webhook_url IS NOT NULL;
```
**Migration Path to S3**:
- Start with PostgreSQL
- If you reach 100GB+ of data, migrate to S3
- Keep PostgreSQL for metadata only
- Use the hybrid approach above
---
## 📈 Scale Projections
```
Small:
  1,000 jobs/month × 150 KB = 150 MB/month
  → PostgreSQL ✅
Medium:
  10,000 jobs/month × 150 KB = 1.5 GB/month
  → PostgreSQL ✅
Large:
  100,000 jobs/month × 150 KB = 15 GB/month
  → PostgreSQL ✅ (but consider S3)
Very Large:
  1,000,000 jobs/month × 150 KB = 150 GB/month
  → S3 ✅
Enterprise:
  Need multi-year retention
  Multi-region replication
  Compliance requirements
  → S3 ✅
```
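
These projections also show how long each tier takes to reach the 100 GB migration trigger mentioned earlier. The sketch below assumes that nothing is ever deleted and that every job is 150 KB; the function names are illustrative.

```python
import math

JOB_SIZE_KB = 150             # per-job estimate used in the projections
MIGRATION_THRESHOLD_GB = 100  # the "100GB+" trigger from the migration path


def monthly_gb(jobs_per_month):
    """Data added per month in GB (decimal units, 1 GB = 1,000,000 KB)."""
    return jobs_per_month * JOB_SIZE_KB / 1_000_000


def months_until_threshold(jobs_per_month, threshold_gb=MIGRATION_THRESHOLD_GB):
    """Whole months of growth until stored data reaches the threshold."""
    return math.ceil(threshold_gb / monthly_gb(jobs_per_month))


for jobs in (1_000, 10_000, 100_000, 1_000_000):
    months = months_until_threshold(jobs)
    print(f"{jobs:>9,} jobs/month: {monthly_gb(jobs):g} GB/month, ~{months} months to 100 GB")
```

At the Large tier the threshold arrives within a year, which is why that tier carries the "consider S3" note.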
---
**Bottom Line**: Start with **PostgreSQL JSONB**. It's simpler, faster, and cheaper at the volumes projected above. Move to S3 only when you hit the volume, retention, or multi-region needs described earlier.