Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/STORAGE_COMPARISON.md
+++ b/STORAGE_COMPARISON.md
@@ -0,0 +1,328 @@
+# Storage Strategy Comparison
+## PostgreSQL JSONB vs S3 for Review Data
+
+---
+
+## 🎯 Recommendation: Start with PostgreSQL JSONB
+
+### Why PostgreSQL is Better for Most Cases:
+
+```sql
+CREATE TABLE jobs (
+    job_id UUID PRIMARY KEY,
+    status VARCHAR(20) NOT NULL,
+    url TEXT NOT NULL,
+    webhook_url TEXT,
+    created_at TIMESTAMP NOT NULL,
+    completed_at TIMESTAMP,
+    reviews_count INTEGER,
+
+    -- Store reviews directly as JSONB!
+    reviews_data JSONB,  -- All 244 reviews in one column
+
+    error_message TEXT
+);
+
+-- You can even query INSIDE the JSON!
+SELECT
+    job_id,
+    jsonb_array_length(reviews_data) as review_count,
+    reviews_data->0->>'author' as first_reviewer
+FROM jobs
+WHERE reviews_data @> '[{"rating": 5}]';  -- Find jobs with 5-star reviews
+```
+
+### Advantages:
+
+✅ **Simpler Architecture**
+- One service instead of two
+- No S3 credentials/SDK to manage
+- Easier local development
+
+✅ **Transactional**
+- Atomic updates (job status + reviews in one transaction)
+- ACID guarantees
+- No eventual consistency issues
+
+✅ **Queryable**
+```sql
+-- Find all jobs with >200 reviews
+SELECT job_id, reviews_count
+FROM jobs
+WHERE jsonb_array_length(reviews_data) > 200;
+
+-- Extract specific review data
+SELECT
+    job_id,
+    review->>'author' as author,
+    review->>'rating' as rating
+FROM jobs, jsonb_array_elements(reviews_data) as review
+WHERE review->>'rating' = '5';
+```
+
+✅ **Cost-Effective (Small-Medium Scale)**
+```
+244 reviews × 0.6 KB = ~150 KB per job
+1,000 jobs/month = 150 MB/month
+10,000 jobs/month = 1.5 GB/month
+
+PostgreSQL:
+  - $0/month (self-hosted) or $15/month (managed)
+  - Handles 10,000 jobs easily
+
+S3:
+  - Storage: ~$0.03/month for 1.5 GB (cheap!)
+  - But need to manage: credentials, SDK, buckets
+```
+
+✅ **Built-in Backup**
+- Standard PostgreSQL backup tools
+- Point-in-time recovery
+- Replication included
+
+✅ **Fast Retrieval**
+```python
+# Single query gets everything (`db` is a placeholder for your database client)
+def get_job(db, job_id):
+    job = db.query("""
+        SELECT job_id, status, reviews_data
+        FROM jobs
+        WHERE job_id = %s
+    """, job_id)
+
+    return {
+        "job_id": job.job_id,
+        "reviews": job.reviews_data  # Already parsed JSON
+    }
+```
+
+---
+
+## When to Use S3 Instead
+
+### Use S3 if:
+
+❌ **Very High Volume**
+```
+> 100,000 jobs/month
+> 100 GB of review data
+Database backup/restore becomes slow
+```
+
+❌ **Long-Term Retention**
+```
+Need to keep reviews for years
+Want lifecycle policies (auto-delete after 1 year)
+Cold storage for compliance
+```
+
+❌ **Direct Client Access**
+```python
+# Pre-signed URLs let clients download directly
+url = s3.generate_presigned_url(
+    'get_object',
+    Params={'Bucket': 'reviews', 'Key': f'{job_id}.json'},
+    ExpiresIn=3600
+)
+
+# Client downloads directly from S3 (saves bandwidth)
+return {"reviews_url": url}
+```
+
+❌ **Multi-Region**
+```
+S3 replication across regions
+CDN integration (CloudFront)
+Global low-latency access
+```
+
+---
+
+## 📊 Performance Comparison
+
+### PostgreSQL JSONB
+
+```sql
+-- Store reviews (single INSERT)
+INSERT INTO jobs (job_id, reviews_data)
+VALUES (%s, %s::jsonb);
+-- 244 reviews: ~5ms
+
+-- Retrieve reviews (single SELECT)
+SELECT reviews_data FROM jobs WHERE job_id = %s;
+-- 244 reviews: ~2ms
+```
+
+**Total**: ~7ms for store + retrieve
+
+### S3
+
+```python
+# Store reviews (HTTP PUT)
+s3.put_object(
+    Bucket='reviews',
+    Key=f'{job_id}.json',
+    Body=json.dumps(reviews)
+)
+# 244 reviews: ~50-200ms (network latency)
+
+# Retrieve reviews (HTTP GET)
+response = s3.get_object(
+    Bucket='reviews',
+    Key=f'{job_id}.json'
+)
+# 244 reviews: ~50-200ms
+```
+
+**Total**: ~100-400ms for store + retrieve
+
+**By these rough estimates, PostgreSQL is about 14-57x faster** for a store-plus-retrieve round trip on a job of this size.
+
+---
+
+## 💾 Size Limits
+
+### PostgreSQL JSONB
+```
+Max field size: 1 GB (general PostgreSQL limit; a single JSONB value is capped lower, at roughly 255 MB)
+Practical limit: ~100 MB per row
+
+Our use case:
+  244 reviews × 0.6 KB = 150 KB  ✅ Perfect!
+  10,000 reviews × 0.6 KB = 6 MB  ✅ Still great
+  100,000 reviews × 0.6 KB = 60 MB  ✅ OK, but consider splitting
+```
+
+### When to worry:
+```
+> 50,000 reviews per job → Consider S3
+> 100 MB per job → Definitely use S3
+```
+
+---
+
+## 🏗️ Hybrid Approach (Best of Both Worlds)
+
+For maximum flexibility:
+
+```python
+import json
+
+
+class NotFound(Exception):
+    """Raised when a job has no stored reviews."""
+
+
+class JobStorage:
+    # Assumes the jobs table has an extra `reviews_s3_key TEXT` column;
+    # the `jobs` schema in this document does not define it.
+    def __init__(self):
+        self.db = PostgreSQL()  # Placeholder async database client
+        self.s3 = S3Client()  # Placeholder S3 wrapper (optional)
+
+    async def save_reviews(self, job_id, reviews):
+        reviews_json = json.dumps(reviews)
+        size_mb = len(reviews_json) / 1024 / 1024
+
+        if size_mb < 10:  # Small job: use PostgreSQL
+            await self.db.execute("""
+                UPDATE jobs
+                SET reviews_data = %s::jsonb
+                WHERE job_id = %s
+            """, reviews_json, job_id)
+
+        else:  # Large job: use S3
+            await self.s3.upload(
+                f'reviews/{job_id}.json',
+                reviews_json
+            )
+            await self.db.execute("""
+                UPDATE jobs
+                SET reviews_s3_key = %s
+                WHERE job_id = %s
+            """, f'reviews/{job_id}.json', job_id)
+
+    async def get_reviews(self, job_id):
+        job = await self.db.fetch_one("""
+            SELECT reviews_data, reviews_s3_key
+            FROM jobs
+            WHERE job_id = %s
+        """, job_id)
+
+        if job is None:  # No such job
+            raise NotFound()
+        if job.reviews_data is not None:  # An empty list is still a valid result
+            return job.reviews_data  # From PostgreSQL
+        elif job.reviews_s3_key:
+            return await self.s3.download(job.reviews_s3_key)  # From S3
+        else:
+            raise NotFound()
+```
+
+---
+
+## ✅ Final Recommendation
+
+### For Your Use Case:
+
+**Use PostgreSQL JSONB** because:
+
+1. ✅ Simpler (one service, not two)
+2. ✅ Faster (~2ms vs ~50-200ms per retrieval)
+3. ✅ Cheaper (for typical volumes)
+4. ✅ Queryable (can analyze reviews in SQL)
+5. ✅ Transactional (atomic updates)
+6. ✅ Easier backups
+
+**Schema**:
+```sql
+CREATE TABLE jobs (
+    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    status VARCHAR(20) NOT NULL DEFAULT 'pending',
+    url TEXT NOT NULL,
+    webhook_url TEXT,
+
+    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+
+    reviews_count INTEGER,
+    reviews_data JSONB,  -- All reviews here!
+    scrape_time REAL,
+
+    error_message TEXT,
+    metadata JSONB,
+
+    CONSTRAINT valid_status CHECK (status IN ('pending', 'running', 'completed', 'failed', 'cancelled'))
+);
+
+CREATE INDEX idx_jobs_status ON jobs(status);
+CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
+CREATE INDEX idx_jobs_webhook ON jobs(webhook_url) WHERE webhook_url IS NOT NULL;
+```
+
+**Migration Path to S3**:
+- Start with PostgreSQL
+- If you reach 100GB+ of data, migrate to S3
+- Keep PostgreSQL for metadata only
+- Use the hybrid approach above
+
+---
+
+## 📈 Scale Projections
+
+```
+Small:
+  1,000 jobs/month × 150 KB = 150 MB/month
+  → PostgreSQL ✅
+
+Medium:
+  10,000 jobs/month × 150 KB = 1.5 GB/month
+  → PostgreSQL ✅
+
+Large:
+  100,000 jobs/month × 150 KB = 15 GB/month
+  → PostgreSQL ✅ (but consider S3)
+
+Very Large:
+  1,000,000 jobs/month × 150 KB = 150 GB/month
+  → S3 ✅
+
+Enterprise:
+  Need multi-year retention
+  Multi-region replication
+  Compliance requirements
+  → S3 ✅
+```
+
+---
+
+**Bottom Line**: Start with **PostgreSQL JSONB**. It's simpler, faster, and cheaper at the small-to-medium volumes projected above. Move to S3 only when data volume, retention, or multi-region requirements call for it.
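
The sizing arithmetic and thresholds used throughout STORAGE_COMPARISON.md can be sketched as a small helper. This is a minimal illustration, not part of the commit: the function and constant names are hypothetical, the 0.6 KB per review and 150 KB per job figures are the document's own estimates, and sizes use decimal units (1 MB = 1,000 KB) to match the document's rounding.

```python
KB_PER_REVIEW = 0.6            # per-review size estimate taken from the document
CONSIDER_S3_REVIEWS = 50_000   # "> 50,000 reviews per job -> Consider S3"
REQUIRE_S3_MB = 100            # "> 100 MB per job -> Definitely use S3"


def job_size_mb(review_count: int) -> float:
    """Estimated JSON payload size of one job, in decimal megabytes."""
    return review_count * KB_PER_REVIEW / 1000


def monthly_volume_gb(jobs_per_month: int, kb_per_job: float = 150) -> float:
    """Estimated data added per month, in decimal gigabytes."""
    return jobs_per_month * kb_per_job / 1_000_000


def recommend_backend(review_count: int) -> str:
    """Apply the document's "when to worry" thresholds to a single job."""
    if job_size_mb(review_count) > REQUIRE_S3_MB:
        return "s3"
    if review_count > CONSIDER_S3_REVIEWS:
        return "postgres (consider s3)"
    return "postgres"


if __name__ == "__main__":
    # Per-job sizes from the "Size Limits" section, plus one oversized job
    for reviews in (244, 10_000, 100_000, 200_000):
        print(f"{reviews:>7} reviews: {job_size_mb(reviews):7.2f} MB "
              f"-> {recommend_backend(reviews)}")
    # Monthly volumes from the "Scale Projections" section
    for jobs in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{jobs:>9} jobs/month: {monthly_volume_gb(jobs):7.2f} GB/month")
```

Keeping the thresholds as named constants makes it easy to re-run the projection if the measured per-review size turns out to differ from the 0.6 KB estimate.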