Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00

Google Reviews Scraper - Fast API Documentation

Overview

REST API for scraping Google Maps reviews, backed by the fast DOM-only scraper.

Performance: about 18.9 seconds for 244 reviews (8.2x faster than the original scraper).

Quick Start

1. Install Dependencies

pip install fastapi uvicorn seleniumbase pyyaml

2. Start the API Server

python api_server.py

Server runs on: http://localhost:8000

3. API Documentation

Visit http://localhost:8000/docs for interactive Swagger UI documentation.

API Endpoints

Health Check

GET /

Check if the API is running.

Response:

{
  "message": "Google Reviews Scraper API is running",
  "status": "healthy",
  "version": "1.0.0"
}

Start Scraping Job

POST /scrape

Start a new scraping job in the background.

Request Body:

{
  "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
  "headless": true
}

Parameters:

url (required): Google Maps URL to scrape
headless (optional): Run Chrome in headless mode (default: false)
max_scrolls (optional): Maximum number of scrolls (default: 35)

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Scraping job started successfully"
}

Example (curl):

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/...",
    "headless": true
  }'

Example (Python):

import requests

response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)

response.raise_for_status()  # Surface 4xx/5xx errors before reading the body
job_id = response.json()['job_id']
print(f"Job started: {job_id}")

Get Job Status

GET /jobs/{job_id}

Get detailed information about a specific job.

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "url": "https://www.google.com/maps/...",
  "created_at": "2026-01-18T10:30:00",
  "started_at": "2026-01-18T10:30:01",
  "completed_at": "2026-01-18T10:30:20",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "progress": {
    "stage": "completed",
    "message": "Scraping completed successfully in 18.9s",
    "scroll_time": 14.2,
    "extract_time": 0.01
  }
}

Job Status Values:

pending: Job is queued but not started
running: Job is currently scraping
completed: Job finished successfully
failed: Job failed with an error
cancelled: Job was cancelled

Example (curl):

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"

Example (Python - Poll until complete):

import requests
import time

job_id = "550e8400-e29b-41d4-a716-446655440000"

while True:
    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] in ['completed', 'failed', 'cancelled']:
        break

    time.sleep(2)  # Poll every 2 seconds

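# Only a completed job is guaranteed to carry a count and a timing
if job['status'] == 'completed':
    print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
else:
    print(f"Job ended with status: {job['status']}")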
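
The polling loop above runs until the job reaches a terminal status, which can mean forever if a job gets stuck. A minimal sketch of a bounded wait is shown below. The names `wait_for_job` and `fetch_job` are hypothetical and not part of this API; the fetch function is passed in so the logic can be tried without a running server.

```python
import time

# Terminal values taken from the "Job Status Values" list
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_job, timeout=300.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_job() until the job is terminal or the timeout elapses.

    fetch_job is any zero-argument callable returning the job dict
    (hypothetical helper; in practice it would wrap GET /jobs/{job_id}).
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout}s")
        sleep(interval)
```

With `requests`, a call could look like `wait_for_job(lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json())`. The 300-second default is an arbitrary choice, not a value taken from the server.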

Get Job Reviews

GET /jobs/{job_id}/reviews

Get the actual scraped reviews data for a completed job.

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "reviews": [
    {
      "review_id": "review_123456789",
      "author": "John Doe",
      "rating": 5.0,
      "text": "Great place! Highly recommend...",
      "date_text": "2 months ago",
      "avatar_url": "https://lh3.googleusercontent.com/...",
      "profile_url": "..."
    },
    ...
  ],
  "count": 244
}

Error Responses:

404: Job not found
400: Job not completed yet

Example (curl):

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json

Example (Python):

import requests
import json

job_id = "550e8400-e29b-41d4-a716-446655440000"

response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
reviews_data = response.json()

# Save to file
with open('reviews.json', 'w', encoding='utf-8') as f:
    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)

print(f"Retrieved {reviews_data['count']} reviews")
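
The example above assumes the request succeeds. As a rough sketch of handling the documented 404 and 400 responses, the hypothetical helper below (`reviews_or_error` is not part of the API) separates the review list from an error message, relying on the `detail` field described under Error Response Format.

```python
def reviews_or_error(status_code, payload):
    """Return (reviews, error) for a GET /jobs/{job_id}/reviews response.

    status_code is the HTTP status; payload is the decoded JSON body.
    """
    if status_code == 200:
        return payload["reviews"], None
    # Documented errors: 404 (job not found), 400 (job not completed yet)
    detail = "unknown error"
    if isinstance(payload, dict):
        detail = payload.get("detail", detail)
    return None, f"{status_code}: {detail}"
```

Typical use: `reviews, error = reviews_or_error(response.status_code, response.json())`, then retry later if the error reports that the job is not completed yet.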

List All Jobs

GET /jobs

List all jobs, optionally filtered by status.

Query Parameters:

status (optional): Filter by job status (pending, running, completed, failed, cancelled)
limit (optional): Maximum number of jobs to return (default: 100, max: 1000)

Response:

[
  {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://www.google.com/maps/...",
    "created_at": "2026-01-18T10:30:00",
    "reviews_count": 244,
    "scrape_time": 18.9
  },
  ...
]

Example (curl):

# Get all completed jobs
curl "http://localhost:8000/jobs?status=completed&limit=10"
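
For Python callers, the query parameters can be assembled and checked before the request is sent. This is only a sketch: `jobs_url` is a hypothetical helper, and the lower bound of 1 for `limit` is an assumption (only the maximum of 1000 is documented).

```python
from urllib.parse import urlencode

VALID_STATUSES = {"pending", "running", "completed", "failed", "cancelled"}


def jobs_url(base_url="http://localhost:8000", status=None, limit=100):
    """Build the GET /jobs URL with the documented status and limit filters."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not 1 <= limit <= 1000:  # 1000 is the documented maximum; 1 is assumed
        raise ValueError("limit must be between 1 and 1000")
    params = {"limit": limit}
    if status is not None:
        params["status"] = status
    return f"{base_url}/jobs?{urlencode(params)}"
```

The result can be passed straight to `requests.get(...)`; the response is a JSON array of job summaries as shown above.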

Cancel Job

POST /jobs/{job_id}/cancel

Cancel a pending or running job.

Response:

{
  "message": "Job cancelled successfully"
}

Error Responses:

404: Job not found
400: Job cannot be cancelled (already completed/failed)

Delete Job

DELETE /jobs/{job_id}

Delete a job from the system (removes job data).

Response:

{
  "message": "Job deleted successfully"
}
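
Neither the cancel nor the delete endpoint has a client example above. The sketch below prepares both requests with the standard library so they can be inspected before being sent; `cancel_request` and `delete_request` are hypothetical helper names.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def cancel_request(job_id, base_url=BASE_URL):
    """Prepare (but do not send) POST /jobs/{job_id}/cancel."""
    return Request(f"{base_url}/jobs/{job_id}/cancel", method="POST")


def delete_request(job_id, base_url=BASE_URL):
    """Prepare (but do not send) DELETE /jobs/{job_id}."""
    return Request(f"{base_url}/jobs/{job_id}", method="DELETE")
```

Passing either object to `urllib.request.urlopen` sends it. A 400 or 404 response raises `urllib.error.HTTPError`, so wrap the call in `try`/`except` if a job may already have finished or been removed.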

Get Statistics

GET /stats

Get job manager statistics.

Response:

{
  "total_jobs": 42,
  "by_status": {
    "pending": 2,
    "running": 1,
    "completed": 35,
    "failed": 3,
    "cancelled": 1
  },
  "running_jobs": 1,
  "max_concurrent_jobs": 3
}
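
One use of this response is deciding whether a new job can start right away. The sketch below assumes that only jobs in the running state count against `max_concurrent_jobs`; that reading is an assumption, and `free_slots` is a hypothetical helper.

```python
def free_slots(stats):
    """Number of additional jobs that could start now, per GET /stats."""
    return max(0, stats["max_concurrent_jobs"] - stats["running_jobs"])


# Sample response copied from the documentation above
stats = {
    "total_jobs": 42,
    "by_status": {"pending": 2, "running": 1, "completed": 35,
                  "failed": 3, "cancelled": 1},
    "running_jobs": 1,
    "max_concurrent_jobs": 3,
}

print(free_slots(stats))  # prints 2
```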

Manual Cleanup

POST /cleanup

Manually trigger cleanup of old completed/failed jobs.

Query Parameters:

max_age_hours (optional): Maximum age in hours (default: 24)

Response:

{
  "message": "Cleaned up jobs older than 24 hours"
}

Complete Workflow Example

Python Script

import requests
import time
import json

BASE_URL = "http://localhost:8000"

# 1. Start scraping job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)
response.raise_for_status()  # Stop early if the job could not be created
job_id = response.json()['job_id']
print(f"Job started: {job_id}")

# 2. Poll until complete
while True:
    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] == 'completed':
        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
        break
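    elif job['status'] in ('failed', 'cancelled'):
        # Without the 'cancelled' check this loop would never exit for a cancelled job
        print(f"❌ Job {job['status']}: {job.get('error_message', 'no error message')}")
        break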

    time.sleep(2)

# 3. Get reviews
if job['status'] == 'completed':
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
    reviews = response.json()['reviews']

    # Save to file
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved {len(reviews)} reviews to reviews.json")

JavaScript/Node.js Example

const axios = require('axios');
const fs = require('fs');

const BASE_URL = 'http://localhost:8000';

async function scrapeReviews(url) {
  // 1. Start job
  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
    url: url,
    headless: true
  });

  const jobId = startData.job_id;
  console.log(`Job started: ${jobId}`);

  // 2. Poll until complete
  while (true) {
    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);

    console.log(`Status: ${job.status} - ${job.progress.message}`);

    if (job.status === 'completed') {
      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
      break;
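    } else if (job.status === 'failed' || job.status === 'cancelled') {
      // Also stop on 'cancelled'; otherwise the loop would poll forever
      console.log(`❌ Job ${job.status}: ${job.error_message ?? 'no error message'}`);
      return;
    }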

    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  // 3. Get reviews
  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);

  // Save to file
  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));

  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
}

scrapeReviews('https://www.google.com/maps/place/...').catch(console.error);

Performance

Fast Scraper Performance

The API now uses the ultra-fast DOM-only scraper:

Metric	Value
Average Time	18.9s
Speedup	8.2x faster
Success Rate	100%
Reviews/Second	~12.9

Timing Breakdown:

Scrolling: ~14s (60-74%)
Extraction: ~0.01s (0.1%)
Setup: ~4-5s (25-30%)
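
The same breakdown can be derived for any completed job from the fields in the job status response. This is a sketch only: `timing_breakdown` is a hypothetical helper, and treating the remainder as setup time is an assumption, since the API does not report setup separately.

```python
def timing_breakdown(job):
    """Split scrape_time into scroll, extract and remaining shares (percent)."""
    total = job["scrape_time"]
    scroll = job["progress"]["scroll_time"]
    extract = job["progress"]["extract_time"]
    other = total - scroll - extract  # assumed to be setup and other overhead
    return {
        "scroll_pct": 100 * scroll / total,
        "extract_pct": 100 * extract / total,
        "other_pct": 100 * other / total,
        "reviews_per_second": job["reviews_count"] / total,
    }
```

For the sample job shown under Get Job Status (244 reviews, 18.9s total, 14.2s scrolling) this gives roughly 75% scrolling and about 12.9 reviews per second.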

Configuration

Server Configuration

Edit api_server.py to configure:

# Number of concurrent scraping jobs
job_manager = JobManager(max_concurrent_jobs=3)

# Server host and port
uvicorn.run(
    "api_server:app",
    host="0.0.0.0",
    port=8000,
    reload=True
)

Scraper Configuration

Pass configuration when starting a job:

{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}

Error Handling

HTTP Status Codes

200: Success
400: Bad request (invalid parameters or job state)
404: Job not found
500: Internal server error

Error Response Format

{
  "detail": "Error message here"
}

Common Errors

1. Job not completed yet

{
  "detail": "Job not completed yet (current status: running)"
}

2. Job not found

{
  "detail": "Job not found"
}

3. Maximum concurrent jobs reached

{
  "detail": "Maximum concurrent jobs reached"
}
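
When the concurrency limit is hit, a client can wait and resubmit. The HTTP status code for this error is not stated above, so the sketch below matches on the `detail` text instead. `submit_with_retry` and the `submit` callable are hypothetical, and the attempt count and delay are arbitrary defaults.

```python
import time

BUSY_DETAIL = "Maximum concurrent jobs reached"


def submit_with_retry(submit, attempts=5, delay=5.0, sleep=time.sleep):
    """Call submit() until it stops reporting the concurrency limit.

    submit is a zero-argument callable returning (status_code, payload),
    for example a wrapper around POST /scrape.
    """
    for attempt in range(1, attempts + 1):
        status_code, payload = submit()
        busy = status_code >= 400 and payload.get("detail") == BUSY_DETAIL
        if not busy:
            return status_code, payload
        if attempt < attempts:
            sleep(delay)  # give running jobs time to finish
    return status_code, payload  # still busy after the last attempt
```

Any other error, such as an invalid URL, is returned immediately rather than retried.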

Testing

Run Test Script

python test_fast_api.py

This will:

Start a scraping job
Poll until complete
Retrieve and save reviews
Show statistics

Manual Testing (curl)

# Start job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
  | jq

# Get status (replace JOB_ID)
curl "http://localhost:8000/jobs/JOB_ID" | jq

# Get reviews
curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq

Production Deployment

Using Gunicorn

pip install gunicorn

gunicorn api_server:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000

Using Docker

Create Dockerfile:

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]

Run:

docker build -t google-reviews-api .
docker run -p 8000:8000 google-reviews-api

Monitoring

Check Running Jobs

curl "http://localhost:8000/stats" | jq

List Recent Jobs

curl "http://localhost:8000/jobs?limit=10" | jq

Auto-Cleanup

Jobs are automatically cleaned up after 24 hours. Configure in api_server.py:

async def cleanup_jobs_periodically():
    while True:
        await asyncio.sleep(3600)  # Run every hour
        if job_manager:
            job_manager.cleanup_old_jobs(max_age_hours=24)
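
The body of `cleanup_old_jobs` is not shown in this document. As an illustration of what age-based cleanup might select, the sketch below picks completed or failed jobs older than the cutoff. The function name `expired_job_ids` and the dictionary layout are assumptions, with field names borrowed from the job status response.

```python
from datetime import datetime, timedelta

# Cleanup is described as applying to old completed/failed jobs
CLEANABLE = {"completed", "failed"}


def expired_job_ids(jobs, now, max_age_hours=24):
    """Return ids of finished jobs whose completed_at is older than the cutoff.

    jobs maps job_id -> dict with "status" and an ISO-8601 "completed_at"
    (assumed layout; the real JobManager storage may differ).
    """
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        job_id
        for job_id, job in jobs.items()
        if job["status"] in CLEANABLE
        and job.get("completed_at")
        and datetime.fromisoformat(job["completed_at"]) < cutoff
    ]
```

Jobs that are still pending or running are never selected, regardless of age.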

Troubleshooting

API won't start

Error: "Address already in use"

Solution: Change the port in api_server.py or kill the existing process:

lsof -ti:8000 | xargs kill

Jobs stuck in "running" status

Solution: Check server logs for errors. Restart the server if needed.

GDPR consent issues

The fast scraper automatically handles GDPR consent pages. If issues persist:

Set headless: false to see what's happening
Check server logs for consent page detection

Support

For issues or questions, check:

Server logs: Console output when running python api_server.py
Interactive docs: http://localhost:8000/docs
Test script: python test_fast_api.py

Enjoy ultra-fast Google Maps scraping with the API! 🚀
