Google Reviews Scraper - Fast API Documentation
Overview
REST API for scraping Google Maps reviews with the fast DOM-only scraper.
Performance: about 18.9 seconds for 244 reviews (8.2x faster than the original scraper).
Quick Start
1. Install Dependencies
pip install fastapi uvicorn seleniumbase pyyaml
2. Start the API Server
python api_server.py
Server runs on: http://localhost:8000
3. API Documentation
Visit http://localhost:8000/docs for interactive Swagger UI documentation.
API Endpoints
Health Check
GET /
Check if the API is running.
Response:
{
"message": "Google Reviews Scraper API is running",
"status": "healthy",
"version": "1.0.0"
}
Start Scraping Job
POST /scrape
Start a new scraping job in the background.
Request Body:
{
"url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
"headless": true
}
Parameters:
- url (required): Google Maps URL to scrape
- headless (optional): Run Chrome in headless mode (default: false)
- max_scrolls (optional): Maximum number of scrolls (default: 35)
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started",
"message": "Scraping job started successfully"
}
Example (curl):
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{
"url": "https://www.google.com/maps/place/...",
"headless": true
}'
Example (Python):
import requests
response = requests.post(
"http://localhost:8000/scrape",
json={
"url": "https://www.google.com/maps/place/...",
"headless": True
}
)
job_id = response.json()['job_id']
print(f"Job started: {job_id}")
Get Job Status
GET /jobs/{job_id}
Get detailed information about a specific job.
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"url": "https://www.google.com/maps/...",
"created_at": "2026-01-18T10:30:00",
"started_at": "2026-01-18T10:30:01",
"completed_at": "2026-01-18T10:30:20",
"reviews_count": 244,
"scrape_time": 18.9,
"progress": {
"stage": "completed",
"message": "Scraping completed successfully in 18.9s",
"scroll_time": 14.2,
"extract_time": 0.01
}
}
Job Status Values:
- pending: Job is queued but not started
- running: Job is currently scraping
- completed: Job finished successfully
- failed: Job failed with an error
- cancelled: Job was cancelled
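The status values above imply a small state machine. The sketch below encodes one plausible reading of it, combining the descriptions here with the rule that only pending or running jobs can be cancelled. The transition table and the helper names `is_terminal` and `can_cancel` are assumptions for illustration; the server's actual rules may differ.

```python
from typing import Dict, Set

# Transitions inferred from the status descriptions (an assumption, not the
# server's documented behaviour).
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def is_terminal(status: str) -> bool:
    """A status is terminal when no further transition is possible."""
    return not ALLOWED_TRANSITIONS[status]


def can_cancel(status: str) -> bool:
    """Only statuses that may move to 'cancelled' are cancellable."""
    return "cancelled" in ALLOWED_TRANSITIONS[status]


print(sorted(s for s in ALLOWED_TRANSITIONS if is_terminal(s)))
print(can_cancel("running"), can_cancel("completed"))
```

A client can use `is_terminal` to decide when to stop polling and `can_cancel` to avoid sending a cancel request that would be rejected.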
Example (curl):
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
Example (Python - Poll until complete):
import requests
import time
job_id = "550e8400-e29b-41d4-a716-446655440000"
while True:
    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
    job = response.json()
    print(f"Status: {job['status']} - {job['progress']['message']}")
    if job['status'] in ['completed', 'failed', 'cancelled']:
        break
    time.sleep(2)  # Poll every 2 seconds
print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
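The polling loop above has no upper bound, so a job that never leaves the running state would block the client forever. A minimal sketch of a bounded poll follows. It assumes a hypothetical `fetch_job` callable that returns the parsed JSON of GET /jobs/{job_id}; the function name `wait_for_job` and the default timeout and interval are illustrative choices, not part of the API.

```python
import time
from typing import Callable, Dict

# Terminal states taken from the "Job Status Values" list.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_job: Callable[[], Dict],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll fetch_job() until a terminal status is seen or the timeout expires.

    fetch_job is a hypothetical callable returning the parsed JSON body of
    GET /jobs/{job_id}; it is injected so the loop can be exercised offline.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job['status']!r} after {timeout}s")
        sleep(interval)


# Offline demonstration with canned responses instead of HTTP calls.
_responses = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "reviews_count": 244},
])
final_job = wait_for_job(lambda: next(_responses), sleep=lambda _seconds: None)
print(final_job["status"], final_job["reviews_count"])
```

Against a live server, `fetch_job` could be something like `lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json()`.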
Get Job Reviews
GET /jobs/{job_id}/reviews
Get the actual scraped reviews data for a completed job.
Response:
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"reviews": [
{
"review_id": "review_123456789",
"author": "John Doe",
"rating": 5.0,
"text": "Great place! Highly recommend...",
"date_text": "2 months ago",
"avatar_url": "https://lh3.googleusercontent.com/...",
"profile_url": "..."
},
...
],
"count": 244
}
Error Responses:
- 404: Job not found
- 400: Job not completed yet
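A client usually wants to tell these two errors apart: a 404 means the job ID is wrong or the job was deleted, while a 400 means it is worth asking again later. The sketch below shows one way to map a response to either the reviews list or a descriptive exception. The helper name `reviews_from_response` and the choice of exception types are assumptions; the `detail` key follows the error format this API documents.

```python
from typing import Any, Dict, List


def reviews_from_response(status_code: int, body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the reviews for a 200 response, otherwise raise a descriptive error."""
    if status_code == 200:
        return body["reviews"]
    detail = body.get("detail", "unknown error")
    if status_code == 404:
        # Wrong job ID, or the job has been deleted or cleaned up.
        raise LookupError(f"job not found: {detail}")
    if status_code == 400:
        # The job exists but has not reached the completed state yet.
        raise RuntimeError(f"job not ready: {detail}")
    raise RuntimeError(f"unexpected status {status_code}: {detail}")


# Offline demonstration with hand-written response bodies.
ok = reviews_from_response(
    200,
    {"job_id": "abc", "reviews": [{"author": "John Doe", "rating": 5.0}], "count": 1},
)
print(len(ok))

try:
    reviews_from_response(400, {"detail": "Job not completed yet (current status: running)"})
except RuntimeError as exc:
    not_ready_message = str(exc)
    print(not_ready_message)
```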
Example (curl):
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
-o reviews.json
Example (Python):
import requests
import json
job_id = "550e8400-e29b-41d4-a716-446655440000"
response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
reviews_data = response.json()
# Save to file
with open('reviews.json', 'w', encoding='utf-8') as f:
    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)
print(f"Retrieved {reviews_data['count']} reviews")
List All Jobs
GET /jobs
List all jobs, optionally filtered by status.
Query Parameters:
- status (optional): Filter by job status (pending, running, completed, failed, cancelled)
- limit (optional): Maximum number of jobs to return (default: 100, max: 1000)
Response:
[
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "completed",
"url": "https://www.google.com/maps/...",
"created_at": "2026-01-18T10:30:00",
"reviews_count": 244,
"scrape_time": 18.9
},
...
]
Example (curl):
# Get all completed jobs
curl "http://localhost:8000/jobs?status=completed&limit=10"
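The same request can be built from Python with the query parameters validated on the client side first. This is a sketch under assumptions: the helper name `jobs_url` is hypothetical, and the lower bound of 1 for `limit` is a guess, since only the default (100) and maximum (1000) are documented.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

VALID_STATUSES = {"pending", "running", "completed", "failed", "cancelled"}
MAX_LIMIT = 1000  # documented upper bound for `limit`


def jobs_url(base_url: str, status: Optional[str] = None, limit: int = 100) -> str:
    """Build the GET /jobs URL, rejecting values the API would not accept."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status filter: {status!r}")
    # Assumption: limit must be at least 1; only the maximum is documented.
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}, got {limit}")
    params: Dict[str, Union[str, int]] = {}
    if status is not None:
        params["status"] = status
    params["limit"] = limit
    return f"{base_url}/jobs?{urlencode(params)}"


url = jobs_url("http://localhost:8000", status="completed", limit=10)
print(url)
```

The resulting string can be passed directly to `requests.get` or any other HTTP client.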
Cancel Job
POST /jobs/{job_id}/cancel
Cancel a pending or running job.
Response:
{
"message": "Job cancelled successfully"
}
Error Responses:
- 404: Job not found
- 400: Job cannot be cancelled (already completed/failed)
Delete Job
DELETE /jobs/{job_id}
Delete a job from the system (removes job data).
Response:
{
"message": "Job deleted successfully"
}
Get Statistics
GET /stats
Get job manager statistics.
Response:
{
"total_jobs": 42,
"by_status": {
"pending": 2,
"running": 1,
"completed": 35,
"failed": 3,
"cancelled": 1
},
"running_jobs": 1,
"max_concurrent_jobs": 3
}
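One practical use of this response is deciding whether the server has capacity before submitting more work. The sketch below derives the number of free slots from the sample response and checks that the per-status counts add up to the total. The helper name `summarize_stats` is hypothetical, and treating `max_concurrent_jobs - running_jobs` as spare capacity is an assumption about how the job manager schedules work.

```python
from typing import Any, Dict

# Sample body copied from the GET /stats response above.
stats = {
    "total_jobs": 42,
    "by_status": {"pending": 2, "running": 1, "completed": 35, "failed": 3, "cancelled": 1},
    "running_jobs": 1,
    "max_concurrent_jobs": 3,
}


def summarize_stats(stats: Dict[str, Any]) -> Dict[str, Any]:
    """Derive spare capacity and a consistency check from a /stats body."""
    counted = sum(stats["by_status"].values())
    return {
        # Assumption: each running job occupies exactly one concurrency slot.
        "free_slots": max(0, stats["max_concurrent_jobs"] - stats["running_jobs"]),
        "counts_consistent": counted == stats["total_jobs"],
    }


summary = summarize_stats(stats)
print(summary)
```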
Manual Cleanup
POST /cleanup
Manually trigger cleanup of old completed/failed jobs.
Query Parameters:
max_age_hours(optional): Maximum age in hours (default: 24)
Response:
{
"message": "Cleaned up jobs older than 24 hours"
}
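To make the `max_age_hours` semantics concrete, here is a sketch of the selection a cleanup pass might perform. It is not the server's implementation: the helper name `expired_job_ids` is hypothetical, and measuring age from `completed_at` (falling back to `created_at`) is an assumption, since the documentation does not say which timestamp is used.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Cleanup targets "old completed/failed jobs" according to the description.
CLEANABLE_STATUSES = {"completed", "failed"}


def expired_job_ids(jobs: List[Dict], now: datetime, max_age_hours: int = 24) -> List[str]:
    """Return IDs of completed/failed jobs older than max_age_hours."""
    cutoff = now - timedelta(hours=max_age_hours)
    expired = []
    for job in jobs:
        if job["status"] not in CLEANABLE_STATUSES:
            continue
        # Assumption: age is measured from completed_at, else created_at.
        finished = datetime.fromisoformat(job.get("completed_at") or job["created_at"])
        if finished < cutoff:
            expired.append(job["job_id"])
    return expired


jobs = [
    {"job_id": "old-done", "status": "completed",
     "created_at": "2026-01-16T09:00:00", "completed_at": "2026-01-16T09:00:20"},
    {"job_id": "recent-done", "status": "completed",
     "created_at": "2026-01-18T10:30:00", "completed_at": "2026-01-18T10:30:20"},
    {"job_id": "old-running", "status": "running",
     "created_at": "2026-01-16T09:00:00"},
]
expired = expired_job_ids(jobs, now=datetime(2026, 1, 18, 12, 0, 0))
print(expired)
```

Under these assumptions only the old completed job is selected; the recent job and the still-running job are left alone.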
Complete Workflow Example
Python Script
import requests
import time
import json
BASE_URL = "http://localhost:8000"
# 1. Start scraping job
response = requests.post(
f"{BASE_URL}/scrape",
json={
"url": "https://www.google.com/maps/place/...",
"headless": True
}
)
job_id = response.json()['job_id']
print(f"Job started: {job_id}")
# 2. Poll until the job reaches a terminal status
while True:
    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
    job = response.json()
    print(f"Status: {job['status']} - {job['progress']['message']}")
    if job['status'] == 'completed':
        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
        break
    elif job['status'] == 'failed':
        print(f"❌ Failed: {job.get('error_message', 'unknown error')}")
        break
    elif job['status'] == 'cancelled':
        # Without this branch a cancelled job would be polled forever
        print("Job was cancelled")
        break
    time.sleep(2)
# 3. Get reviews
if job['status'] == 'completed':
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
    reviews = response.json()['reviews']
    # Save to file
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)
    print(f"💾 Saved {len(reviews)} reviews to reviews.json")
JavaScript/Node.js Example
const axios = require('axios');
const fs = require('fs');
const BASE_URL = 'http://localhost:8000';
async function scrapeReviews(url) {
  // 1. Start job
  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
    url: url,
    headless: true
  });
  const jobId = startData.job_id;
  console.log(`Job started: ${jobId}`);
  // 2. Poll until the job reaches a terminal status
  while (true) {
    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);
    console.log(`Status: ${job.status} - ${job.progress.message}`);
    if (job.status === 'completed') {
      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
      break;
    } else if (job.status === 'failed') {
      console.log(`❌ Failed: ${job.error_message}`);
      return;
    } else if (job.status === 'cancelled') {
      // Without this branch a cancelled job would be polled forever
      console.log('Job was cancelled');
      return;
    }
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
  // 3. Get reviews
  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);
  // Save to file
  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));
  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
}
scrapeReviews('https://www.google.com/maps/place/...');
Performance
Fast Scraper Performance
The API now uses the ultra-fast DOM-only scraper:
| Metric | Value |
|---|---|
| Average Time | 18.9s |
| Speedup | 8.2x faster |
| Success Rate | 100% |
| Reviews/Second | ~12.9 |
Timing Breakdown:
- Scrolling: ~14s (60-74%)
- Extraction: ~0.01s (<0.1%)
- Setup: ~4-5s (25-30%)
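The figures in this section can be reproduced from a single job status response. The sketch below uses the sample values shown for GET /jobs/{job_id} (244 reviews, 18.9s total, 14.2s scrolling, 0.01s extraction); the helper name `timing_breakdown` is hypothetical, and attributing the untimed remainder to setup is an assumption.

```python
from typing import Any, Dict

# Values copied from the sample GET /jobs/{job_id} response.
job = {
    "reviews_count": 244,
    "scrape_time": 18.9,
    "progress": {"scroll_time": 14.2, "extract_time": 0.01},
}


def timing_breakdown(job: Dict[str, Any]) -> Dict[str, float]:
    """Compute throughput and the share of time spent in each phase."""
    total = job["scrape_time"]
    scroll = job["progress"]["scroll_time"]
    extract = job["progress"]["extract_time"]
    # Assumption: whatever is not scrolling or extraction counts as setup.
    other = total - scroll - extract
    return {
        "reviews_per_second": job["reviews_count"] / total,
        "scroll_pct": 100 * scroll / total,
        "extract_pct": 100 * extract / total,
        "other_pct": 100 * other / total,
    }


breakdown = timing_breakdown(job)
for name, value in breakdown.items():
    print(f"{name}: {value:.2f}")
```

For this sample run the throughput comes out at about 12.9 reviews per second, with roughly three quarters of the time spent scrolling and about a quarter in setup.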
Configuration
Server Configuration
Edit api_server.py to configure:
# Number of concurrent scraping jobs
job_manager = JobManager(max_concurrent_jobs=3)
# Server host and port
uvicorn.run(
"api_server:app",
host="0.0.0.0",
port=8000,
reload=True
)
Scraper Configuration
Pass configuration when starting a job:
{
"url": "https://www.google.com/maps/place/...",
"headless": true,
"max_scrolls": 35
}
Error Handling
HTTP Status Codes
- 200: Success
- 400: Bad request (invalid parameters or job state)
- 404: Job not found
- 500: Internal server error
Error Response Format
{
"detail": "Error message here"
}
Common Errors
1. Job not completed yet
{
"detail": "Job not completed yet (current status: running)"
}
2. Job not found
{
"detail": "Job not found"
}
3. Maximum concurrent jobs reached
{
"detail": "Maximum concurrent jobs reached"
}
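Of the errors above, "Maximum concurrent jobs reached" is the one most worth retrying, because a slot frees up as soon as another job finishes. The following sketch retries a submission with exponential backoff. It assumes a hypothetical `submit` callable returning a `(status_code, body)` pair for POST /scrape; the name `submit_with_backoff`, the attempt count, and the delays are illustrative. The HTTP status code used for this error is not documented, so the code matches on the `detail` text instead (the 400 in the demo data is a placeholder).

```python
import time
from typing import Any, Callable, Dict, List, Tuple

BUSY_DETAIL = "Maximum concurrent jobs reached"


def submit_with_backoff(
    submit: Callable[[], Tuple[int, Dict[str, Any]]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call submit() until it succeeds, backing off while the server is busy."""
    for attempt in range(max_attempts):
        status_code, body = submit()
        if status_code == 200:
            return body
        detail = str(body.get("detail", ""))
        if BUSY_DETAIL not in detail:
            # Any other error is not retried.
            raise RuntimeError(f"scrape request failed ({status_code}): {detail}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"server still busy after {max_attempts} attempts")


# Offline demonstration: two busy responses, then success.
_canned = iter([
    (400, {"detail": BUSY_DETAIL}),
    (400, {"detail": BUSY_DETAIL}),
    (200, {"job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "started"}),
])
delays: List[float] = []
result = submit_with_backoff(lambda: next(_canned), sleep=delays.append)
print(result["status"], delays)
```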
Testing
Run Test Script
python test_fast_api.py
This will:
- Start a scraping job
- Poll until complete
- Retrieve and save reviews
- Show statistics
Manual Testing (curl)
# Start job
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
| jq
# Get status (replace JOB_ID)
curl "http://localhost:8000/jobs/JOB_ID" | jq
# Get reviews
curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq
Production Deployment
Using Gunicorn
pip install gunicorn
gunicorn api_server:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--bind 0.0.0.0:8000
Using Docker
Create Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "api_server.py"]
Run:
docker build -t google-reviews-api .
docker run -p 8000:8000 google-reviews-api
Monitoring
Check Running Jobs
curl "http://localhost:8000/stats" | jq
List Recent Jobs
curl "http://localhost:8000/jobs?limit=10" | jq
Auto-Cleanup
Jobs are automatically cleaned up after 24 hours. Configure in api_server.py:
async def cleanup_jobs_periodically():
    while True:
        await asyncio.sleep(3600)  # Run every hour
        if job_manager:
            job_manager.cleanup_old_jobs(max_age_hours=24)
Troubleshooting
API won't start
Error: "Address already in use"
Solution: Change the port in api_server.py or kill the process already using port 8000:
lsof -ti:8000 | xargs kill
Jobs stuck in "running" status
Solution: Check server logs for errors. Restart the server if needed.
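Before restarting, it can help to confirm which jobs are actually stuck. The sketch below flags running jobs whose `started_at` is older than a threshold. The helper name `stale_running_jobs` and the 10-minute default are assumptions; a sensible threshold depends on how many reviews the target pages have.

```python
from datetime import datetime
from typing import Dict, List


def stale_running_jobs(
    jobs: List[Dict], now: datetime, max_runtime_seconds: float = 600.0
) -> List[str]:
    """Return IDs of jobs that have been 'running' longer than the threshold."""
    stale = []
    for job in jobs:
        if job["status"] != "running" or not job.get("started_at"):
            continue
        runtime = (now - datetime.fromisoformat(job["started_at"])).total_seconds()
        if runtime > max_runtime_seconds:
            stale.append(job["job_id"])
    return stale


# Job dicts shaped like the GET /jobs/{job_id} response (IDs are made up).
jobs = [
    {"job_id": "job-a", "status": "running", "started_at": "2026-01-18T10:30:01"},
    {"job_id": "job-b", "status": "running", "started_at": "2026-01-18T10:44:00"},
    {"job_id": "job-c", "status": "completed", "started_at": "2026-01-18T09:00:00"},
]
stuck = stale_running_jobs(jobs, now=datetime(2026, 1, 18, 10, 45, 0))
print(stuck)
```

Jobs flagged this way are candidates for POST /jobs/{job_id}/cancel before resorting to a server restart.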
GDPR consent issues
The fast scraper automatically handles GDPR consent pages. If issues persist:
- Set headless: false to see what's happening
- Check server logs for consent page detection
Support
For issues or questions, check:
- Server logs: Console output when running python api_server.py
- Interactive docs: http://localhost:8000/docs
- Test script: python test_fast_api.py
Enjoy ultra-fast Google Maps scraping with the API! 🚀