API Quick Start - Fast Google Reviews Scraper
⚡ Ultra-Fast API (18.9 seconds!)
REST API for scraping Google Maps reviews using the optimized DOM-only scraper.
Performance: ~18.9 seconds for 244 reviews (8.2x faster than original!)
🚀 Quick Start
1. Install & Run
# Install dependencies
pip install fastapi uvicorn seleniumbase pyyaml
# Start API server
python api_server.py
Server starts on: http://localhost:8000
2. Use the API
# Start a scraping job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "headless": true
  }'
Response:
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started"
}
3. Check Status
# Check job status
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
Response:
{
  "status": "completed",
  "reviews_count": 244,
  "scrape_time": 18.9
}
4. Get Reviews
# Get the actual reviews
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json
📋 Key Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /scrape | POST | Start scraping job |
| /jobs/{job_id} | GET | Get job status |
| /jobs/{job_id}/reviews | GET | Get scraped reviews |
| /jobs | GET | List all jobs |
| /stats | GET | Get statistics |
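The endpoints in the table all hang off one base URL, with the job ID as a path segment. A minimal sketch of a URL builder that keeps those paths in one place; `ApiUrls` is a hypothetical helper for illustration, not something shipped with the API.

```python
from urllib.parse import quote


class ApiUrls:
    """URL builder for the endpoints in the table above (hypothetical helper)."""

    def __init__(self, base_url: str = "http://localhost:8000") -> None:
        # Drop any trailing slash so paths can be appended uniformly.
        self.base_url = base_url.rstrip("/")

    def scrape(self) -> str:  # POST: start a scraping job
        return f"{self.base_url}/scrape"

    def jobs(self) -> str:  # GET: list all jobs
        return f"{self.base_url}/jobs"

    def job(self, job_id: str) -> str:  # GET: job status
        # Percent-encode the ID so unexpected characters cannot alter the path.
        return f"{self.base_url}/jobs/{quote(job_id, safe='')}"

    def reviews(self, job_id: str) -> str:  # GET: scraped reviews
        return f"{self.job(job_id)}/reviews"

    def stats(self) -> str:  # GET: statistics
        return f"{self.base_url}/stats"


urls = ApiUrls()
print(urls.reviews("550e8400-e29b-41d4-a716-446655440000"))
# http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews
```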
💻 Python Example
import requests  # client-side dependency: pip install requests
import time

# 1. Start job
response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)
response.raise_for_status()  # fail early on an HTTP error
job_id = response.json()['job_id']

# 2. Wait for completion
while True:
    job = requests.get(f"http://localhost:8000/jobs/{job_id}").json()
    if job['status'] in ['completed', 'failed']:
        break
    time.sleep(2)

# 3. Get reviews
reviews = requests.get(
    f"http://localhost:8000/jobs/{job_id}/reviews"
).json()['reviews']
print(f"Got {len(reviews)} reviews!")
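The loop above waits indefinitely if a job never reaches `completed` or `failed`. A minimal sketch of a bounded variant follows, assuming the same `status` field. The name `wait_for_job`, the injected `fetch_status` callable, and the default timeout are illustrative assumptions, not part of the API.

```python
import time
from typing import Callable, Dict

# The two terminal states the example above checks for.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    fetch_status: Callable[[], Dict],
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict:
    """Poll fetch_status() until the job is terminal or the timeout expires."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job still {job.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

With `requests`, the callable could be `lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json()`. Passing `sleep` and `clock` as parameters keeps the function testable without real delays.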
🧪 Test It
# Run the test script
python test_fast_api.py
This will:
- Start a job
- Poll until complete
- Save reviews to JSON
- Show statistics
📚 Full Documentation
See API_DOCUMENTATION.md for:
- Complete endpoint reference
- Advanced examples
- Error handling
- Production deployment
- Monitoring & troubleshooting
🎯 API Features
✅ Ultra-fast scraping (~18.9s for 244 reviews)
✅ Background job processing (non-blocking)
✅ Concurrent jobs (up to 3 simultaneous)
✅ Job status tracking (pending/running/completed/failed)
✅ Review data retrieval (via dedicated endpoint)
✅ Automatic cleanup (removes old jobs)
✅ GDPR auto-handling (no manual intervention)
✅ REST API (language-agnostic)
✅ OpenAPI docs (visit /docs for Swagger UI)
🔧 Configuration
API Server
# In api_server.py
job_manager = JobManager(max_concurrent_jobs=3)  # Max parallel jobs

uvicorn.run(
    "api_server:app",
    host="0.0.0.0",  # Listen on all interfaces
    port=8000,       # Port number
    reload=True      # Auto-reload on code changes
)
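How `JobManager` enforces `max_concurrent_jobs` is not shown in this guide. One common way to cap parallel background work is a bounded thread pool, sketched below. `MiniJobManager` and its methods are hypothetical stand-ins for illustration and should not be read as the real class in `api_server.py`; the status names are the ones used elsewhere in this document.

```python
import threading
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


class MiniJobManager:
    """Illustrative stand-in, NOT the real JobManager from api_server.py."""

    def __init__(self, max_concurrent_jobs: int = 3) -> None:
        # The pool size is the concurrency cap: extra jobs queue until a
        # worker thread is free.
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent_jobs)
        self._lock = threading.Lock()
        self.jobs: Dict[str, Dict] = {}

    def submit(self, work: Callable[[], List]) -> str:
        """Register a job and queue it; returns immediately with the job ID."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self.jobs[job_id] = {"status": "pending"}
        self._pool.submit(self._run, job_id, work)
        return job_id

    def _run(self, job_id: str, work: Callable[[], List]) -> None:
        self._update(job_id, status="running")
        try:
            reviews = work()
            self._update(job_id, status="completed", reviews_count=len(reviews))
        except Exception as exc:  # record the failure instead of losing it
            self._update(job_id, status="failed", error=str(exc))

    def _update(self, job_id: str, **fields) -> None:
        with self._lock:
            self.jobs[job_id].update(fields)

    def shutdown(self) -> None:
        """Wait for queued and running jobs to finish."""
        self._pool.shutdown(wait=True)
```

In this sketch, a fourth job submitted while three are running stays `pending` until a worker frees up, which matches the non-blocking behavior described for the API.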
Scraping Options
{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}
- headless: run Chrome in headless mode
- max_scrolls: maximum number of scrolls (default: 35)
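The options above can be assembled and sanity-checked on the client before sending. A minimal sketch, assuming only the three fields shown; the `ScrapeRequest` class and its validation rules are hypothetical, and the server may apply different checks.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ScrapeRequest:
    """Client-side model of the request body (hypothetical helper)."""

    url: str
    headless: bool = True
    max_scrolls: int = 35  # default stated in the options above

    def __post_init__(self) -> None:
        # Assumed checks for illustration; not taken from the server code.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError("url must be an absolute http(s) URL")
        if self.max_scrolls < 1:
            raise ValueError("max_scrolls must be at least 1")

    def to_json(self) -> str:
        """Serialize to strict JSON (no comments) for the POST body."""
        return json.dumps(asdict(self))


request = ScrapeRequest(url="https://www.google.com/maps/place/...", max_scrolls=50)
print(request.to_json())
```

The resulting string can be passed as the `-d` argument to `curl`, or the dictionary from `asdict(request)` as `json=` in `requests.post`.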
📊 Performance
Operation                 Time      % of Total
──────────────────────────────────────────────
Scrolling (dynamic)       ~14s      74%
Setup & navigation        ~4.5s     24%
JavaScript extraction     ~0.01s    0.1%
──────────────────────────────────────────────
TOTAL                     ~18.9s    100%
8.2x faster than the original scraper! 🚀
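The figures in the breakdown can be checked with a few lines of arithmetic. This sketch uses only numbers quoted in this document: the 18.9-second total, the three phase timings, and the 155-second baseline for the original scraper. The variable names are illustrative.

```python
before_s = 155.0  # original scraper, as quoted in this document
after_s = 18.9    # fast scraper total

# Phase timings from the breakdown (all approximate in the source).
phases = {
    "Scrolling (dynamic)": 14.0,
    "Setup & navigation": 4.5,
    "JavaScript extraction": 0.01,
}

speedup = before_s / after_s
shares = {name: 100 * seconds / after_s for name, seconds in phases.items()}
# The listed phases do not add up to the total exactly; the rest is not itemized.
unaccounted_s = after_s - sum(phases.values())

print(f"Speedup: {speedup:.1f}x")  # Speedup: 8.2x
for name, share in shares.items():
    print(f"{name}: {share:.1f}% of total")
print(f"Not itemized: {unaccounted_s:.2f}s")
```

The computed shares round to the 74%, 24%, and 0.1% shown in the table, and roughly 0.4 seconds of the total is not attributed to any listed phase.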
🌐 Interactive Documentation
Visit http://localhost:8000/docs for:
- Interactive API testing
- Request/response schemas
- Try out endpoints directly in browser
⚙️ What Changed?
The API now uses the fast DOM-only scraper (modules/fast_scraper.py) instead of the old scraper:
Before: 155 seconds ❌
Now: 18.9 seconds ✅
Key optimizations:
- GDPR consent auto-handling
- Dynamic scroll waiting (adapts to page speed)
- JavaScript extraction (40x faster than Selenium)
- Universal design (no hardcoded values)
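To illustrate the "dynamic scroll waiting" item: instead of sleeping a fixed time after each scroll, a scraper can poll until new items appear and stop once several scrolls in a row load nothing. The sketch below shows that general idea under stated assumptions. The functions `wait_for_growth` and `scroll_until_idle`, the `idle_limit` parameter, and the injected `scroll` and `count_items` callables are hypothetical; this is not the code in `modules/fast_scraper.py`.

```python
import time
from typing import Callable


def wait_for_growth(
    count_items: Callable[[], int],
    previous: int,
    max_wait: float = 2.0,
    poll: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Return as soon as the item count exceeds `previous`, or after max_wait."""
    deadline = clock() + max_wait
    current = count_items()
    while current <= previous and clock() < deadline:
        sleep(poll)
        current = count_items()
    return current


def scroll_until_idle(
    scroll: Callable[[], None],
    count_items: Callable[[], int],
    max_scrolls: int = 35,
    idle_limit: int = 3,
    **wait_kwargs,
) -> int:
    """Scroll until nothing new loads idle_limit times in a row, or max_scrolls is hit."""
    total = count_items()
    idle = 0
    for _ in range(max_scrolls):
        scroll()
        current = wait_for_growth(count_items, total, **wait_kwargs)
        if current > total:
            total, idle = current, 0  # progress: reset the idle counter
        else:
            idle += 1
            if idle >= idle_limit:
                break  # the page has stopped producing new items
    return total
```

On a fast page each wait returns almost immediately, while a slow page gets up to `max_wait` seconds per scroll, which is what lets the total time adapt to page speed. With a browser driver, `scroll` and `count_items` would wrap whatever calls scroll the review pane and count loaded review elements.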
Ready to scrape at 8.2x speed via API! 🚀