whyrating-engine-legacy/STORAGE_COMPARISON.md


Storage Strategy Comparison

PostgreSQL JSONB vs S3 for Review Data


🎯 Recommendation: Start with PostgreSQL JSONB

Why PostgreSQL is Better for Most Cases:

CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(20) NOT NULL,
    url TEXT NOT NULL,
    webhook_url TEXT,
    created_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    reviews_count INTEGER,

    -- Store reviews directly as JSONB!
    reviews_data JSONB,  -- All 244 reviews in one column

    error_message TEXT
);

-- You can even query INSIDE the JSON!
SELECT
    job_id,
    jsonb_array_length(reviews_data) as review_count,
    reviews_data->0->>'author' as first_reviewer
FROM jobs
WHERE reviews_data @> '[{"rating": 5}]';  -- Find jobs with 5-star reviews

Advantages:

Simpler Architecture

  • One service instead of two
  • No S3 credentials/SDK to manage
  • Easier local development

Transactional

  • Atomic updates (job status + reviews in one transaction)
  • ACID guarantees
  • No eventual consistency issues
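
A minimal sketch of the atomic update described above. It assumes a psycopg2-style DB-API connection, where `with conn:` commits on success and rolls back if the block raises, and the `jobs` table from this document. The function name `complete_job` is hypothetical.

```python
import json


def complete_job(conn, job_id, reviews):
    """Mark a job completed and store its reviews in a single transaction.

    Assumes `conn` behaves like a psycopg2 connection: leaving the
    `with conn:` block commits, and an exception inside it rolls back.
    """
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE jobs
                SET status = 'completed',
                    completed_at = NOW(),
                    reviews_count = %s,
                    reviews_data = %s::jsonb
                WHERE job_id = %s
                """,
                (len(reviews), json.dumps(reviews), str(job_id)),
            )
            # True only if exactly one job row was updated
            return cur.rowcount == 1
```

Because status, count, and payload change in one statement inside one transaction, a reader never sees a job marked completed without its reviews.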

Queryable

-- Find all jobs with >200 reviews
SELECT job_id, reviews_count
FROM jobs
WHERE jsonb_array_length(reviews_data) > 200;

-- Extract specific review data
SELECT
    job_id,
    review->>'author' as author,
    review->>'rating' as rating
FROM jobs, jsonb_array_elements(reviews_data) as review
WHERE review->>'rating' = '5';

Cost-Effective (Small-Medium Scale)

244 reviews × 0.6 KB = ~150 KB per job
1,000 jobs/month = 150 MB/month
10,000 jobs/month = 1.5 GB/month

PostgreSQL:
  - $0/month (self-hosted) or $15/month (managed)
  - Handles 10,000 jobs easily

S3:
  - Storage: $0.03/month (cheap!)
  - But need to manage: credentials, SDK, buckets
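
The volume figures above can be reproduced with a short calculation. This is a sketch using the document's own inputs (244 reviews per job, about 0.6 KB per review); the function name `monthly_storage_mb` is hypothetical, and decimal units (1 MB = 1000 KB) are assumed.

```python
def monthly_storage_mb(jobs_per_month, reviews_per_job=244, kb_per_review=0.6):
    """Approximate review payload written per month, in MB (1 MB = 1000 KB)."""
    return jobs_per_month * reviews_per_job * kb_per_review / 1000


for jobs in (1_000, 10_000):
    print(f"{jobs:>6} jobs/month -> {monthly_storage_mb(jobs):,.0f} MB/month")
```

The exact results are about 146 MB and 1,464 MB, which the figures above round to 150 MB and 1.5 GB.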

Built-in Backup

  • Standard PostgreSQL backup tools
  • Point-in-time recovery
  • Replication included

Fast Retrieval

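def get_job(db, job_id):
    # Single query gets everything (db is a placeholder database client)
    job = db.query("""
        SELECT job_id, status, reviews_data
        FROM jobs
        WHERE job_id = %s
    """, job_id)

    return {
        "job_id": job.job_id,
        "status": job.status,
        "reviews": job.reviews_data  # Already parsed by drivers such as psycopg2
    }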

When to Use S3 Instead

Use S3 if:

Very High Volume

  • More than 100,000 jobs/month
  • More than 100 GB of review data
  • Database backup/restore becomes slow

Long-Term Retention

  • Need to keep reviews for years
  • Want lifecycle policies (auto-delete after 1 year)
  • Cold storage for compliance
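
As an illustration of the lifecycle policy mentioned above, here is a sketch using boto3's `put_bucket_lifecycle_configuration`. The bucket name `reviews` follows the other examples in this document; the rule ID, the 90-day transition to Glacier, and both function names are assumptions for illustration only.

```python
def review_lifecycle_rules(prefix="", cold_after_days=90, delete_after_days=365):
    """Build a lifecycle configuration: move to cold storage, then delete."""
    return {
        "Rules": [
            {
                "ID": "expire-review-payloads",  # hypothetical rule name
                "Filter": {"Prefix": prefix},    # "" applies to the whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": cold_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


def apply_review_lifecycle(s3_client, bucket="reviews"):
    """Attach the rules to a bucket; s3_client is e.g. boto3.client('s3')."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=review_lifecycle_rules(),
    )
```

Once the rule is in place, S3 deletes old objects on its own; PostgreSQL would need a scheduled `DELETE` job to achieve the same effect.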

Direct Client Access

import boto3

s3 = boto3.client('s3')

def get_reviews_url(job_id):
    # Pre-signed URLs let clients download directly
    url = s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'reviews', 'Key': f'{job_id}.json'},
        ExpiresIn=3600  # URL stays valid for one hour
    )

    # Client downloads directly from S3 (saves bandwidth)
    return {"reviews_url": url}

Multi-Region

  • S3 replication across regions
  • CDN integration (CloudFront)
  • Global low-latency access

📊 Performance Comparison

PostgreSQL JSONB

-- Store reviews (single INSERT)
INSERT INTO jobs (job_id, reviews_data)
VALUES (%s, %s::jsonb);
-- 244 reviews: ~5 ms

-- Retrieve reviews (single SELECT)
SELECT reviews_data FROM jobs WHERE job_id = %s;
-- 244 reviews: ~2 ms

Total: ~7ms for store + retrieve

S3

# Store reviews (HTTP PUT)
s3.put_object(
    Bucket='reviews',
    Key=f'{job_id}.json',
    Body=json.dumps(reviews)
)
# 244 reviews: ~50-200ms (network latency)

# Retrieve reviews (HTTP GET)
response = s3.get_object(
    Bucket='reviews',
    Key=f'{job_id}.json'
)
# 244 reviews: ~50-200ms

Total: ~100-400ms for store + retrieve

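By these estimates, PostgreSQL is roughly 14-57x faster (~7 ms vs ~100-400 ms), assuming the database sits on the same network as the application.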
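
The speedup range follows directly from the totals above. The sketch below only redoes that arithmetic; the timings are this document's rough estimates, not measurements.

```python
PG_TOTAL_MS = 5 + 2                      # store + retrieve in PostgreSQL
S3_TOTAL_MS = (50 + 50, 200 + 200)       # best and worst case for S3

# Divide each S3 total by the PostgreSQL total to get the speedup range
low, high = (total / PG_TOTAL_MS for total in S3_TOTAL_MS)
print(f"{low:.0f}x to {high:.0f}x")      # prints "14x to 57x"
```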


💾 Size Limits

PostgreSQL JSONB

Max field size: 1 GB (PostgreSQL's general per-field limit; a single JSONB value is capped lower, at roughly 256 MB)
Practical limit: ~100 MB per row

Our use case:
  244 reviews × 0.6 KB = 150 KB  ✅ Perfect!
  10,000 reviews × 0.6 KB = 6 MB  ✅ Still great
  100,000 reviews × 0.6 KB = 60 MB  ✅ OK, but consider splitting

When to worry:

  • More than 50,000 reviews per job → Consider S3
  • More than 100 MB per job → Definitely use S3
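
The thresholds above can be turned into a simple check. This is a sketch only: the function name `storage_advice` and its return labels are hypothetical, and the size is measured on the serialized JSON rather than on PostgreSQL's internal JSONB representation.

```python
import json


def storage_advice(reviews, max_mb=100, max_reviews=50_000):
    """Suggest a backend for one job's reviews using the thresholds above."""
    size_mb = len(json.dumps(reviews).encode("utf-8")) / (1024 * 1024)
    if size_mb > max_mb:
        return "s3"            # too large for a single row
    if len(reviews) > max_reviews:
        return "consider-s3"   # still fits, but getting heavy
    return "postgres"


print(storage_advice([{"author": "A", "rating": 5}] * 244))  # prints "postgres"
```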

🏗️ Hybrid Approach (Best of Both Worlds)

For maximum flexibility:

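import json

# Requires one extra column: ALTER TABLE jobs ADD COLUMN reviews_s3_key TEXT;
# PostgreSQL, S3Client, and NotFound are placeholders for your own async
# database client, S3 wrapper, and exception class.

class JobStorage:
    def __init__(self):
        self.db = PostgreSQL()
        self.s3 = S3Client()  # Optional

    async def save_reviews(self, job_id, reviews):
        reviews_json = json.dumps(reviews)
        size_mb = len(reviews_json.encode("utf-8")) / 1024 / 1024

        if size_mb < 10:  # Small job: use PostgreSQL
            await self.db.execute("""
                UPDATE jobs
                SET reviews_data = %s::jsonb,
                    reviews_s3_key = NULL
                WHERE job_id = %s
            """, reviews_json, job_id)

        else:  # Large job: use S3
            s3_key = f'reviews/{job_id}.json'
            await self.s3.upload(s3_key, reviews_json)
            # Clear reviews_data so a stale inline copy is never served
            await self.db.execute("""
                UPDATE jobs
                SET reviews_s3_key = %s,
                    reviews_data = NULL
                WHERE job_id = %s
            """, s3_key, job_id)

    async def get_reviews(self, job_id):
        job = await self.db.fetch_one("""
            SELECT reviews_data, reviews_s3_key
            FROM jobs
            WHERE job_id = %s
        """, job_id)

        if job is None:
            raise NotFound()
        if job.reviews_data is not None:  # An empty list is still a valid result
            return job.reviews_data  # From PostgreSQL
        elif job.reviews_s3_key:
            # S3 returns serialized JSON; parse it to match the PostgreSQL path
            return json.loads(await self.s3.download(job.reviews_s3_key))
        else:
            raise NotFound()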

Final Recommendation

For Your Use Case:

Use PostgreSQL JSONB because:

  1. Simpler (one service, not two)
  2. Faster (~2 ms vs ~50-200 ms per retrieval)
  3. Cheaper (for typical volumes)
  4. Queryable (can analyze reviews in SQL)
  5. Transactional (atomic updates)
  6. Easier backups

Schema:

CREATE TABLE jobs (
    job_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    status VARCHAR(20) NOT NULL DEFAULT 'pending',
    url TEXT NOT NULL,
    webhook_url TEXT,

    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    started_at TIMESTAMP,
    completed_at TIMESTAMP,

    reviews_count INTEGER,
    reviews_data JSONB,  -- All reviews here!
    scrape_time REAL,

    error_message TEXT,
    metadata JSONB,

    CONSTRAINT valid_status CHECK (status IN ('pending', 'running', 'completed', 'failed', 'cancelled'))
);

CREATE INDEX idx_jobs_status ON jobs(status);
CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC);
CREATE INDEX idx_jobs_webhook ON jobs(webhook_url) WHERE webhook_url IS NOT NULL;

Migration Path to S3:

  • Start with PostgreSQL
  • If you reach 100 GB+ of review data, move the review payloads to S3
  • Keep PostgreSQL for metadata only
  • Use the hybrid approach above
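
One way the migration step could look for a single job is sketched below. It assumes a psycopg2-style connection, a boto3-style S3 client, and a `reviews_s3_key TEXT` column on `jobs` (as used by the hybrid approach); the function name `offload_job_to_s3` is hypothetical.

```python
import json


def offload_job_to_s3(conn, s3_client, job_id, bucket="reviews"):
    """Move one job's reviews from PostgreSQL to S3, keeping only the key."""
    key = f"reviews/{job_id}.json"
    with conn:  # commit on success, roll back on error
        with conn.cursor() as cur:
            # Lock the row so a concurrent writer cannot change it mid-move
            cur.execute(
                "SELECT reviews_data FROM jobs WHERE job_id = %s FOR UPDATE",
                (str(job_id),),
            )
            row = cur.fetchone()
            if row is None or row[0] is None:
                return None  # unknown job, or already offloaded

            # Upload first: if the UPDATE fails, the database copy is intact
            s3_client.put_object(
                Bucket=bucket,
                Key=key,
                Body=json.dumps(row[0]).encode("utf-8"),
            )
            cur.execute(
                "UPDATE jobs SET reviews_s3_key = %s, reviews_data = NULL "
                "WHERE job_id = %s",
                (key, str(job_id)),
            )
    return key
```

Uploading before clearing the column means a failure can leave an orphaned S3 object, but never a job without its reviews.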

📈 Scale Projections

Small:
  1,000 jobs/month × 150 KB = 150 MB/month
  → PostgreSQL ✅

Medium:
  10,000 jobs/month × 150 KB = 1.5 GB/month
  → PostgreSQL ✅

Large:
  100,000 jobs/month × 150 KB = 15 GB/month
  → PostgreSQL ✅ (but consider S3)

Very Large:
  1,000,000 jobs/month × 150 KB = 150 GB/month
  → S3 ✅

Enterprise:
  Need multi-year retention
  Multi-region replication
  Compliance requirements
  → S3 ✅
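
To connect these projections to the 100 GB migration threshold, the sketch below estimates how long each tier takes to accumulate that much data. It assumes nothing is ever deleted and uses the ~150 KB per job figure with decimal units; the function name `months_until` is hypothetical.

```python
def months_until(threshold_gb, jobs_per_month, kb_per_job=150):
    """Months of accumulated review data before reaching threshold_gb."""
    gb_per_month = jobs_per_month * kb_per_job / 1_000_000  # KB -> GB
    return threshold_gb / gb_per_month


for jobs in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{jobs:>9,} jobs/month: {months_until(100, jobs):,.1f} months to 100 GB")
```

At 100,000 jobs/month the threshold arrives in under seven months, which is why that tier is marked "consider S3", while the small and medium tiers take years to get there.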

Bottom Line: Start with PostgreSQL JSONB. It's simpler, faster, and cheaper at the small-to-large volumes projected above. Move to S3 only when volume, retention, or multi-region requirements call for it.