whyrating-engine-legacy/API_DOCUMENTATION.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00

Google Reviews Scraper - Fast API Documentation

Overview

REST API for scraping Google Maps reviews using the ultra-fast DOM-only scraper (18.9s average).

Performance: ~18.9 seconds for 244 reviews (8.2x faster than the original scraper).


Quick Start

1. Install Dependencies

pip install fastapi uvicorn seleniumbase pyyaml

2. Start the API Server

python api_server.py

Server runs on: http://localhost:8000

3. API Documentation

Visit http://localhost:8000/docs for interactive Swagger UI documentation.


API Endpoints

Health Check

GET /

Check if the API is running.

Response:

{
  "message": "Google Reviews Scraper API is running",
  "status": "healthy",
  "version": "1.0.0"
}

Start Scraping Job

POST /scrape

Start a new scraping job in the background.

Request Body:

{
  "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
  "headless": true
}

Parameters:

  • url (required): Google Maps URL to scrape
  • headless (optional): Run Chrome in headless mode (default: false)
  • max_scrolls (optional): Maximum number of scrolls (default: 35)

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Scraping job started successfully"
}

Example (curl):

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/...",
    "headless": true
  }'

Example (Python):

import requests

response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)

job_id = response.json()['job_id']
print(f"Job started: {job_id}")

Get Job Status

GET /jobs/{job_id}

Get detailed information about a specific job.

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "url": "https://www.google.com/maps/...",
  "created_at": "2026-01-18T10:30:00",
  "started_at": "2026-01-18T10:30:01",
  "completed_at": "2026-01-18T10:30:20",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "progress": {
    "stage": "completed",
    "message": "Scraping completed successfully in 18.9s",
    "scroll_time": 14.2,
    "extract_time": 0.01
  }
}

Job Status Values:

  • pending: Job is queued but not started
  • running: Job is currently scraping
  • completed: Job finished successfully
  • failed: Job failed with an error
  • cancelled: Job was cancelled
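
Client code usually only needs to know whether a job can still change state. The sketch below encodes the five documented values in an enum; the names `JobStatus` and `is_terminal` are illustrative and not part of the API.

```python
from enum import Enum


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Statuses after which the job will not change again.
TERMINAL_STATUSES = frozenset(
    {JobStatus.COMPLETED, JobStatus.FAILED, JobStatus.CANCELLED}
)


def is_terminal(status: str) -> bool:
    """Return True for completed, failed, or cancelled jobs.

    Raises ValueError if the server returns a status not listed above.
    """
    return JobStatus(status) in TERMINAL_STATUSES


print(is_terminal("running"), is_terminal("cancelled"))
```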

Example (curl):

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"

Example (Python - Poll until complete):

import requests
import time

job_id = "550e8400-e29b-41d4-a716-446655440000"

while True:
    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] in ['completed', 'failed', 'cancelled']:
        break

    time.sleep(2)  # Poll every 2 seconds

print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
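
The loop above runs until the job finishes, which can hang a script if a job stays in "running". A possible variant with a timeout is sketched here. `wait_for_job` is a hypothetical helper, and the status fetcher is passed in as a function so the sketch runs without a server.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_job() until a terminal status is seen or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{job['status']}' after {timeout}s")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
_canned = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "reviews_count": 244, "scrape_time": 18.9},
])
final = wait_for_job(lambda: next(_canned), sleep=lambda seconds: None)
print(final["status"], final["reviews_count"])
```

Against a live server, the fetcher could be `lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json()`.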

Get Job Reviews

GET /jobs/{job_id}/reviews

Get the scraped reviews for a completed job.

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "reviews": [
    {
      "review_id": "review_123456789",
      "author": "John Doe",
      "rating": 5.0,
      "text": "Great place! Highly recommend...",
      "date_text": "2 months ago",
      "avatar_url": "https://lh3.googleusercontent.com/...",
      "profile_url": "..."
    },
    ...
  ],
  "count": 244
}

Error Responses:

  • 404: Job not found
  • 400: Job not completed yet

Example (curl):

curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json

Example (Python):

import requests
import json

job_id = "550e8400-e29b-41d4-a716-446655440000"

response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
reviews_data = response.json()

# Save to file
with open('reviews.json', 'w', encoding='utf-8') as f:
    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)

print(f"Retrieved {reviews_data['count']} reviews")
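
Once the reviews are retrieved, a quick summary helps confirm that the scrape looks sensible. The sketch below computes an average rating and a per-star count from the documented `rating` field. The sample reviews are invented for illustration and `summarize_reviews` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_reviews(reviews: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return review count, average rating, and number of reviews per star."""
    ratings = [float(r["rating"]) for r in reviews if r.get("rating") is not None]
    by_stars = Counter(int(round(value)) for value in ratings)
    average = round(sum(ratings) / len(ratings), 2) if ratings else None
    return {
        "count": len(reviews),
        "average_rating": average,
        "by_stars": dict(sorted(by_stars.items())),
    }


# Invented sample data in the shape of the response shown above.
sample = [
    {"review_id": "a", "author": "John Doe", "rating": 5.0, "text": "Great place!"},
    {"review_id": "b", "author": "Jane Roe", "rating": 4.0, "text": "Good."},
    {"review_id": "c", "author": "Sam Poe", "rating": 5.0, "text": ""},
]
print(summarize_reviews(sample))
```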

List All Jobs

GET /jobs

List all jobs, optionally filtered by status.

Query Parameters:

  • status (optional): Filter by job status (pending, running, completed, failed, cancelled)
  • limit (optional): Maximum number of jobs to return (default: 100, max: 1000)
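
The query string can be assembled with the standard library, as in this sketch. `build_jobs_url` is a hypothetical helper; the lower bound of 1 for limit is an assumption, since only the maximum of 1000 is documented.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

VALID_STATUSES = {"pending", "running", "completed", "failed", "cancelled"}


def build_jobs_url(
    base_url: str, status: Optional[str] = None, limit: int = 100
) -> str:
    """Build the GET /jobs URL with optional status filter and limit."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if not 1 <= limit <= 1000:  # lower bound is assumed
        raise ValueError("limit must be between 1 and 1000")
    params: Dict[str, Union[int, str]] = {"limit": limit}
    if status is not None:
        params["status"] = status
    return f"{base_url.rstrip('/')}/jobs?{urlencode(params)}"


print(build_jobs_url("http://localhost:8000", status="completed", limit=10))
```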

Response:

[
  {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://www.google.com/maps/...",
    "created_at": "2026-01-18T10:30:00",
    "reviews_count": 244,
    "scrape_time": 18.9
  },
  ...
]

Example (curl):

# Get all completed jobs
curl "http://localhost:8000/jobs?status=completed&limit=10"

Cancel Job

POST /jobs/{job_id}/cancel

Cancel a pending or running job.

Response:

{
  "message": "Job cancelled successfully"
}

Error Responses:

  • 404: Job not found
  • 400: Job cannot be cancelled (already completed/failed)

Delete Job

DELETE /jobs/{job_id}

Delete a job from the system (removes job data).

Response:

{
  "message": "Job deleted successfully"
}
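
Neither Cancel Job nor Delete Job comes with a client example, so here is a minimal standard-library sketch that builds both requests. The helper names are illustrative; only the paths and HTTP methods come from this document.

```python
from urllib.request import Request


def cancel_request(base_url: str, job_id: str) -> Request:
    """Build the POST /jobs/{job_id}/cancel request."""
    return Request(f"{base_url.rstrip('/')}/jobs/{job_id}/cancel", method="POST")


def delete_request(base_url: str, job_id: str) -> Request:
    """Build the DELETE /jobs/{job_id} request."""
    return Request(f"{base_url.rstrip('/')}/jobs/{job_id}", method="DELETE")


req = delete_request("http://localhost:8000", "550e8400-e29b-41d4-a716-446655440000")
print(req.get_method(), req.full_url)
```

To send a request, pass it to `urllib.request.urlopen(req)`; a 4xx response is raised as `urllib.error.HTTPError`.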

Get Statistics

GET /stats

Get job manager statistics.

Response:

{
  "total_jobs": 42,
  "by_status": {
    "pending": 2,
    "running": 1,
    "completed": 35,
    "failed": 3,
    "cancelled": 1
  },
  "running_jobs": 1,
  "max_concurrent_jobs": 3
}
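
A client could read these statistics before submitting more work. The sketch below derives the number of free slots and checks that the per-status counts add up to the total. It assumes that only running jobs occupy a concurrency slot, which this document does not confirm.

```python
from typing import Any, Dict


def free_slots(stats: Dict[str, Any]) -> int:
    """Jobs that could start now, assuming only running jobs occupy slots."""
    return max(0, stats["max_concurrent_jobs"] - stats["running_jobs"])


def counts_are_consistent(stats: Dict[str, Any]) -> bool:
    """Check that the per-status counts sum to total_jobs."""
    return stats["total_jobs"] == sum(stats["by_status"].values())


# The example response shown above.
stats = {
    "total_jobs": 42,
    "by_status": {
        "pending": 2, "running": 1, "completed": 35, "failed": 3, "cancelled": 1,
    },
    "running_jobs": 1,
    "max_concurrent_jobs": 3,
}
print(free_slots(stats), counts_are_consistent(stats))
```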

Manual Cleanup

POST /cleanup

Manually trigger cleanup of old completed/failed jobs.

Query Parameters:

  • max_age_hours (optional): Maximum age in hours (default: 24)

Response:

{
  "message": "Cleaned up jobs older than 24 hours"
}

Complete Workflow Example

Python Script

import requests
import time
import json

BASE_URL = "http://localhost:8000"

# 1. Start scraping job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)
job_id = response.json()['job_id']
print(f"Job started: {job_id}")

# 2. Poll until complete
while True:
    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

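    if job['status'] == 'completed':
        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
        break
    elif job['status'] in ('failed', 'cancelled'):
        # Also stop on 'cancelled', otherwise the loop never exits.
        # error_message may be absent (e.g. for a cancelled job), so use .get().
        print(f"❌ {job['status'].capitalize()}: {job.get('error_message', 'no error message')}")
        break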

    time.sleep(2)

# 3. Get reviews
if job['status'] == 'completed':
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
    reviews = response.json()['reviews']

    # Save to file
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved {len(reviews)} reviews to reviews.json")

JavaScript/Node.js Example

const axios = require('axios');
const fs = require('fs');

const BASE_URL = 'http://localhost:8000';

async function scrapeReviews(url) {
  // 1. Start job
  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
    url: url,
    headless: true
  });

  const jobId = startData.job_id;
  console.log(`Job started: ${jobId}`);

  // 2. Poll until complete
  while (true) {
    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);

    console.log(`Status: ${job.status} - ${job.progress.message}`);

    if (job.status === 'completed') {
      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
      break;
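    } else if (job.status === 'failed' || job.status === 'cancelled') {
      // Also stop on 'cancelled', otherwise the loop never exits.
      console.log(`❌ ${job.status}: ${job.error_message ?? 'no error message'}`);
      return;
    }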

    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  // 3. Get reviews
  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);

  // Save to file
  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));

  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
}

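// Report network or HTTP errors instead of leaving the promise rejection unhandled
scrapeReviews('https://www.google.com/maps/place/...').catch(console.error);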

Performance

Fast Scraper Performance

The API now uses the ultra-fast DOM-only scraper:

| Metric | Value |
|---|---|
| Average Time | 18.9s |
| Speedup | 8.2x faster |
| Success Rate | 100% |
| Reviews/Second | ~12.9 |

Timing Breakdown:

  • Scrolling: ~14s (60-74%)
  • Extraction: ~0.01s (0.1%)
  • Setup: ~4-5s (25-30%)
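
As a worked check, the throughput and time shares can be recomputed from the figures in the Get Job Status example (244 reviews, 18.9s total, 14.2s scrolling, 0.01s extraction). Attributing all remaining time to setup is an assumption made here for illustration.

```python
reviews_count = 244
scrape_time = 18.9    # seconds, total
scroll_time = 14.2    # seconds
extract_time = 0.01   # seconds


def share(part: float, total: float) -> float:
    """Percentage of total taken by part, rounded to one decimal place."""
    return round(100 * part / total, 1)


throughput = round(reviews_count / scrape_time, 1)
# Assumption: everything that is not scrolling or extraction counts as setup.
setup_time = round(scrape_time - scroll_time - extract_time, 2)

print(f"Reviews/second: {throughput}")
print(f"Scrolling:  {share(scroll_time, scrape_time)}%")
print(f"Extraction: {share(extract_time, scrape_time)}%")
print(f"Setup:      {setup_time}s ({share(setup_time, scrape_time)}%)")
```

For this single run, scrolling comes out at about 75% and setup at about 25%, close to the edges of the ranges listed above.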

Configuration

Server Configuration

Edit api_server.py to configure:

# Number of concurrent scraping jobs
job_manager = JobManager(max_concurrent_jobs=3)

# Server host and port
uvicorn.run(
    "api_server:app",
    host="0.0.0.0",  # listen on all network interfaces
    port=8000,
    reload=True      # auto-reload on code changes; intended for development
)

Scraper Configuration

Pass configuration when starting a job:

{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}

Error Handling

HTTP Status Codes

  • 200: Success
  • 400: Bad request (invalid parameters or job state)
  • 404: Job not found
  • 500: Internal server error

Error Response Format

{
  "detail": "Error message here"
}

Common Errors

1. Job not completed yet

{
  "detail": "Job not completed yet (current status: running)"
}

2. Job not found

{
  "detail": "Job not found"
}

3. Maximum concurrent jobs reached

{
  "detail": "Maximum concurrent jobs reached"
}
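
When the server reports that it is at capacity, a client may prefer to wait and resubmit. The sketch below retries only on that specific detail message. `submit_with_retry` is a hypothetical helper, and because the HTTP status code for this error is not stated here, the check matches on the message text.

```python
import time
from typing import Any, Callable, Dict, Tuple

CAPACITY_DETAIL = "Maximum concurrent jobs reached"


def submit_with_retry(
    submit: Callable[[], Tuple[int, Dict[str, Any]]],
    attempts: int = 5,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call submit() until it succeeds; retry only while the server is at capacity."""
    for attempt in range(1, attempts + 1):
        status_code, body = submit()
        if status_code == 200:
            return body
        detail = body.get("detail")
        if detail != CAPACITY_DETAIL:
            raise RuntimeError(f"scrape request rejected ({status_code}): {detail}")
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(f"server still at capacity after {attempts} attempts")


# Demo with canned responses; the 400 status here is an assumption.
_responses = iter([
    (400, {"detail": CAPACITY_DETAIL}),
    (200, {"job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "started"}),
])
started = submit_with_retry(lambda: next(_responses), sleep=lambda seconds: None)
print(started["status"])
```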

Testing

Run Test Script

python test_fast_api.py

This will:

  1. Start a scraping job
  2. Poll until complete
  3. Retrieve and save reviews
  4. Show statistics

Manual Testing (curl)

# Start job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
  | jq

# Get status (replace JOB_ID)
curl "http://localhost:8000/jobs/JOB_ID" | jq

# Get reviews
curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq

Production Deployment

Using Gunicorn

pip install gunicorn

gunicorn api_server:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000

Note: if JobManager keeps jobs in process memory (not confirmed here; check api_server.py), each of the 4 workers tracks only its own jobs, so a status request can reach a worker that never saw the job. In that case, run with --workers 1 or move job state to shared storage.

Using Docker

Create Dockerfile:

FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]

Run:

docker build -t google-reviews-api .
docker run -p 8000:8000 google-reviews-api

Monitoring

Check Running Jobs

curl "http://localhost:8000/stats" | jq

List Recent Jobs

curl "http://localhost:8000/jobs?limit=10" | jq

Auto-Cleanup

Completed and failed jobs are cleaned up automatically once they are older than 24 hours; the check runs every hour. To change either value, edit api_server.py:

async def cleanup_jobs_periodically():
    while True:
        await asyncio.sleep(3600)  # Run every hour
        if job_manager:
            job_manager.cleanup_old_jobs(max_age_hours=24)
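
The internals of `cleanup_old_jobs` are not shown in this document. As a rough sketch of the selection rule it describes (completed or failed jobs older than the cutoff), the hypothetical function below picks job IDs by their `completed_at` timestamp. Both the function and the use of `completed_at` for failed jobs are assumptions.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def expired_job_ids(
    jobs: List[Dict[str, Any]], now: datetime, max_age_hours: float = 24
) -> List[str]:
    """Return IDs of completed/failed jobs that finished before the cutoff."""
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        job["job_id"]
        for job in jobs
        if job["status"] in {"completed", "failed"}
        and datetime.fromisoformat(job["completed_at"]) < cutoff
    ]


jobs = [
    {"job_id": "old", "status": "completed", "completed_at": "2026-01-18T10:30:20"},
    {"job_id": "recent", "status": "failed", "completed_at": "2026-01-19T11:00:00"},
    {"job_id": "active", "status": "running", "completed_at": None},
]
print(expired_job_ids(jobs, now=datetime(2026, 1, 19, 12, 0, 0)))
```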

Troubleshooting

API won't start

Error: "Address already in use"

Solution: Change the port in api_server.py, or kill the process already listening on port 8000:

lsof -ti:8000 | xargs kill

Jobs stuck in "running" status

Solution: Check server logs for errors. Restart the server if needed.

The fast scraper handles GDPR consent pages automatically. If a job still appears stuck on one:

  • Set headless: false to see what's happening
  • Check server logs for consent page detection

Support

For issues or questions, check:

  • Server logs: Console output when running python api_server.py
  • Interactive docs: http://localhost:8000/docs
  • Test script: python test_fast_api.py

Enjoy ultra-fast Google Maps scraping with the API! 🚀