
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00

4.6 KiB


API Quick Start - Fast Google Reviews Scraper

⚡ Ultra-Fast API (18.9 seconds!)

REST API for scraping Google Maps reviews using the optimized DOM-only scraper.

Performance: ~18.9 seconds for 244 reviews (8.2x faster than the original scraper)

🚀 Quick Start

1. Install & Run

# Install dependencies
pip install fastapi uvicorn seleniumbase pyyaml

# Start API server
python api_server.py

Server starts on: http://localhost:8000

2. Use the API

# Start a scraping job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "headless": true
  }'

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}

3. Check Status

# Check job status
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"

Response:

{
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9
}

4. Get Reviews

# Get the actual reviews
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json

📋 Key Endpoints

Endpoint	Method	Description
`/scrape`	POST	Start scraping job
`/jobs/{job_id}`	GET	Get job status
`/jobs/{job_id}/reviews`	GET	Get scraped reviews
`/jobs`	GET	List all jobs
`/stats`	GET	Get statistics
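
The Quick Start steps above do not exercise `/jobs` and `/stats`. The sketch below shows one way to call them using only the standard library. It assumes the server from step 1 is listening on http://localhost:8000. The response fields of these two endpoints are not described in this guide, so the helper simply returns the parsed JSON. `BASE_URL`, `endpoint_url`, and `get_json` are illustrative names, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default address from step 1


def endpoint_url(path, base_url=BASE_URL):
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path, base_url=BASE_URL, timeout=10):
    """GET an endpoint and return the decoded JSON body."""
    url = endpoint_url(path, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `get_json("/jobs")` and `get_json("/stats")` return whatever JSON the API sends back for those endpoints.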

💻 Python Example

import requests
import time

BASE_URL = "http://localhost:8000"

# 1. Start job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    },
    timeout=30,
)
response.raise_for_status()  # Fail early on HTTP errors
job_id = response.json()['job_id']

# 2. Wait for completion
while True:
    job = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=30).json()
    if job['status'] in ['completed', 'failed']:
        break
    time.sleep(2)

# A failed job has no reviews to fetch, so stop here
if job['status'] == 'failed':
    raise RuntimeError(f"Scraping job {job_id} failed")

# 3. Get reviews
reviews = requests.get(
    f"{BASE_URL}/jobs/{job_id}/reviews",
    timeout=30,
).json()['reviews']

print(f"Got {len(reviews)} reviews!")
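
The polling loop in the example runs until the job reaches a terminal status, so a stuck job would block the client indefinitely. As a hedged sketch, the same loop can be wrapped in a helper with an overall deadline. `wait_for_job` and its parameters are hypothetical names. The status values `completed` and `failed` are the ones the example checks for.

```python
import time


def wait_for_job(fetch_status, poll_interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job reaches a terminal status.

    fetch_status: callable returning the job dict from GET /jobs/{job_id}.
    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        if clock() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(poll_interval)
```

In the example above, this could be called as `wait_for_job(lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json())`. The `sleep` and `clock` parameters exist only so the helper can be exercised without real delays.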

🧪 Test It

# Run the test script
python test_fast_api.py

This will:

Start a job
Poll until complete
Save reviews to JSON
Show statistics
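
The contents of test_fast_api.py are not shown in this guide. As a rough sketch of the last two steps only (saving reviews and showing statistics), something like the following would work, assuming `reviews` is the list returned by the reviews endpoint and `scrape_time` is the value from the job status. The function names and the `reviews_per_second` figure are illustrative and may differ from what the real script reports.

```python
import json
from pathlib import Path


def save_reviews(reviews, path):
    """Write the list of reviews to a UTF-8 JSON file and return its path."""
    path = Path(path)
    path.write_text(
        json.dumps(reviews, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return path


def summarize(reviews, scrape_time):
    """Return simple statistics: review count and reviews per second."""
    count = len(reviews)
    rate = count / scrape_time if scrape_time > 0 else 0.0
    return {
        "reviews_count": count,
        "scrape_time": scrape_time,
        "reviews_per_second": round(rate, 1),
    }
```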

📚 Full Documentation

See API_DOCUMENTATION.md for:

Complete endpoint reference
Advanced examples
Error handling
Production deployment
Monitoring & troubleshooting

🎯 API Features

✅ Ultra-fast scraping (18.9s average)
✅ Background job processing (non-blocking)
✅ Concurrent jobs (up to 3 simultaneous)
✅ Job status tracking (pending/running/completed)
✅ Review data retrieval (via dedicated endpoint)
✅ Automatic cleanup (removes old jobs)
✅ GDPR auto-handling (no manual intervention)
✅ REST API (language-agnostic)
✅ OpenAPI docs (visit /docs for Swagger UI)
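
To illustrate what "up to 3 simultaneous" background jobs means in practice, here is a minimal sketch that caps concurrency with a thread pool. This is not the project's `JobManager`, whose implementation is not shown here. `run_jobs` and `MAX_CONCURRENT_JOBS` are hypothetical names, and each job is assumed to be a plain callable.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 3  # matches the documented limit


def run_jobs(job_funcs, max_concurrent=MAX_CONCURRENT_JOBS):
    """Run callables in background threads, at most max_concurrent at a time.

    Extra jobs wait in the pool's queue until a worker thread is free.
    Results are returned in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(func) for func in job_funcs]
        return [future.result() for future in futures]
```

Submitting six jobs this way runs three immediately and queues the rest, which is the behavior a client would observe as jobs moving from pending to running.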

🔧 Configuration

API Server

# In api_server.py
job_manager = JobManager(max_concurrent_jobs=3)  # Max parallel jobs

uvicorn.run(
    "api_server:app",
    host="0.0.0.0",  # Listen on all interfaces
    port=8000,        # Port number
    reload=True       # Auto-reload on code changes
)

Scraping Options

{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}

headless: run Chrome in headless mode
max_scrolls: maximum number of scrolls (default: 35)
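
On the client side, the request body can be assembled by a small helper so that the defaults live in one place. This is a sketch only: `build_scrape_request` is a hypothetical name, and the sanity checks are client-side assumptions, not a description of how the server validates input.

```python
def build_scrape_request(url, headless=True, max_scrolls=35):
    """Return the JSON body for POST /scrape with the documented options."""
    if not url:
        raise ValueError("url must not be empty")
    if max_scrolls < 1:
        raise ValueError("max_scrolls must be at least 1")
    return {
        "url": url,
        "headless": bool(headless),
        "max_scrolls": int(max_scrolls),
    }
```

The returned dict can be passed directly as `json=` to `requests.post`.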

📊 Performance

Operation                 Time      % of Total
──────────────────────────────────────────────
Scrolling (dynamic)       ~14s      74%
Setup & navigation        ~4.5s     24%
JavaScript extraction     ~0.01s    0.1%
──────────────────────────────────────────────
TOTAL                     ~18.9s    100%

8.2x faster than the original scraper! 🚀
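
The percentages and the speedup can be rechecked from the reported timings. The snippet below uses only figures from this guide: the per-phase times in the table, the 18.9 s total, and the 155 s quoted for the original scraper. The phase times are approximate, so they do not sum exactly to the total.

```python
# Reported timings in seconds (approximate values from the table)
phases = {
    "Scrolling (dynamic)": 14.0,
    "Setup & navigation": 4.5,
    "JavaScript extraction": 0.01,
}
TOTAL_SECONDS = 18.9      # reported total
ORIGINAL_SECONDS = 155.0  # reported time for the original scraper

# Share of the total for each phase, in percent
shares = {
    name: round(100 * seconds / TOTAL_SECONDS, 1)
    for name, seconds in phases.items()
}
speedup = round(ORIGINAL_SECONDS / TOTAL_SECONDS, 1)

for name, share in shares.items():
    print(f"{name:<24}{share:>6}%")
print(f"Speedup: {speedup}x")
```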

🌐 Interactive Documentation

Visit http://localhost:8000/docs for:

Interactive API testing
Request/response schemas
Try out endpoints directly in browser

⚙️ What Changed?

The API now uses the fast DOM-only scraper (modules/fast_scraper.py) instead of the old scraper:

Before: 155 seconds ❌
Now: 18.9 seconds ✅

Key optimizations:

GDPR consent auto-handling
Dynamic scroll waiting (adapts to page speed)
JavaScript extraction (40x faster than Selenium)
Universal design (no hardcoded values)
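
One way to scroll without a hardcoded count is to stop once several consecutive scrolls load nothing new, bounded by `max_scrolls`. The sketch below shows that stopping rule in isolation. It is not the code in modules/fast_scraper.py, which is not reproduced here. `scroll_until_idle`, `scroll_once`, and `count_reviews` are hypothetical names, and the default `idle_limit` of 12 follows the commit note at the top of this page.

```python
def scroll_until_idle(scroll_once, count_reviews, max_scrolls=35, idle_limit=12):
    """Scroll until no new reviews appear for idle_limit consecutive scrolls.

    scroll_once: callable that scrolls the reviews pane one step.
    count_reviews: callable returning how many reviews are currently loaded.
    Returns (scrolls_performed, reviews_loaded).
    """
    seen = count_reviews()
    idle = 0
    scrolls = 0
    while scrolls < max_scrolls and idle < idle_limit:
        scroll_once()
        scrolls += 1
        current = count_reviews()
        if current > seen:
            # New reviews arrived, so reset the idle counter
            seen = current
            idle = 0
        else:
            idle += 1
    return scrolls, seen
```

In a real scraper the two callables would wrap browser actions. Passing them in keeps the stopping logic testable without a browser.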

Ready to scrape at 8.2x speed via API! 🚀
