# Google Reviews Scraper - Fast API Documentation

## Overview

REST API for scraping Google Maps reviews using the **ultra-fast DOM-only scraper** (18.9s average).

**Performance**: ~18.9 seconds for 244 reviews (8.2x faster than original!)

---

## Quick Start

### 1. Install Dependencies

```bash
pip install fastapi uvicorn seleniumbase pyyaml
```

### 2. Start the API Server

```bash
python api_server.py
```

Server runs on: `http://localhost:8000`

### 3. API Documentation

Visit `http://localhost:8000/docs` for interactive Swagger UI documentation.

---

## API Endpoints

### Health Check

**GET** `/`

Check if the API is running.

**Response:**

```json
{
  "message": "Google Reviews Scraper API is running",
  "status": "healthy",
  "version": "1.0.0"
}
```

---

### Start Scraping Job

**POST** `/scrape`

Start a new scraping job in the background.

**Request Body:**

```json
{
  "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
  "headless": true
}
```

**Parameters:**

- `url` (required): Google Maps URL to scrape
- `headless` (optional): Run Chrome in headless mode (default: false)
- `max_scrolls` (optional): Maximum number of scrolls (default: 35)

**Response:**

```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Scraping job started successfully"
}
```

**Example (curl):**

```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/...",
    "headless": true
  }'
```

**Example (Python):**

```python
import requests

response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)

job_id = response.json()['job_id']
print(f"Job started: {job_id}")
```

---

### Get Job Status

**GET** `/jobs/{job_id}`

Get detailed information about a specific job.

**Response:**

```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "url": "https://www.google.com/maps/...",
  "created_at": "2026-01-18T10:30:00",
  "started_at": "2026-01-18T10:30:01",
  "completed_at": "2026-01-18T10:30:20",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "progress": {
    "stage": "completed",
    "message": "Scraping completed successfully in 18.9s",
    "scroll_time": 14.2,
    "extract_time": 0.01
  }
}
```

**Job Status Values:**

- `pending`: Job is queued but not started
- `running`: Job is currently scraping
- `completed`: Job finished successfully
- `failed`: Job failed with an error
- `cancelled`: Job was cancelled

**Example (curl):**

```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
```

**Example (Python - Poll until complete):**

```python
import requests
import time

job_id = "550e8400-e29b-41d4-a716-446655440000"

while True:
    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] in ['completed', 'failed', 'cancelled']:
        break

    time.sleep(2)  # Poll every 2 seconds

print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
```

---

### Get Job Reviews

**GET** `/jobs/{job_id}/reviews`

Get the actual scraped reviews data for a completed job.

**Response:**

```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "reviews": [
    {
      "review_id": "review_123456789",
      "author": "John Doe",
      "rating": 5.0,
      "text": "Great place! Highly recommend...",
      "date_text": "2 months ago",
      "avatar_url": "https://lh3.googleusercontent.com/...",
      "profile_url": "..."
    },
    ...
  ],
  "count": 244
}
```

**Error Responses:**

- `404`: Job not found
- `400`: Job not completed yet

**Example (curl):**

```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json
```

**Example (Python):**

```python
import requests
import json

job_id = "550e8400-e29b-41d4-a716-446655440000"
response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
reviews_data = response.json()

# Save to file
with open('reviews.json', 'w', encoding='utf-8') as f:
    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)

print(f"Retrieved {reviews_data['count']} reviews")
```

---

### List All Jobs

**GET** `/jobs`

List all jobs, optionally filtered by status.

**Query Parameters:**

- `status` (optional): Filter by job status (pending, running, completed, failed, cancelled)
- `limit` (optional): Maximum number of jobs to return (default: 100, max: 1000)

**Response:**

```json
[
  {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://www.google.com/maps/...",
    "created_at": "2026-01-18T10:30:00",
    "reviews_count": 244,
    "scrape_time": 18.9
  },
  ...
]
```

**Example (curl):**

```bash
# Get all completed jobs
curl "http://localhost:8000/jobs?status=completed&limit=10"
```

---

### Cancel Job

**POST** `/jobs/{job_id}/cancel`

Cancel a pending or running job.

**Response:**

```json
{
  "message": "Job cancelled successfully"
}
```

**Error Responses:**

- `404`: Job not found
- `400`: Job cannot be cancelled (already completed/failed)

---

### Delete Job

**DELETE** `/jobs/{job_id}`

Delete a job from the system (removes job data).

**Response:**

```json
{
  "message": "Job deleted successfully"
}
```

---

### Get Statistics

**GET** `/stats`

Get job manager statistics.

**Response:**

```json
{
  "total_jobs": 42,
  "by_status": {
    "pending": 2,
    "running": 1,
    "completed": 35,
    "failed": 3,
    "cancelled": 1
  },
  "running_jobs": 1,
  "max_concurrent_jobs": 3
}
```

---

### Manual Cleanup

**POST** `/cleanup`

Manually trigger cleanup of old completed/failed jobs.

**Query Parameters:**

- `max_age_hours` (optional): Maximum age in hours (default: 24)

**Response:**

```json
{
  "message": "Cleaned up jobs older than 24 hours"
}
```

---

## Complete Workflow Example

### Python Script

```python
import requests
import time
import json

BASE_URL = "http://localhost:8000"

# 1. Start scraping job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)
job_id = response.json()['job_id']
print(f"Job started: {job_id}")

# 2. Poll until the job reaches a terminal status
while True:
    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] == 'completed':
        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
        break
    elif job['status'] == 'failed':
        print(f"❌ Failed: {job.get('error_message', 'unknown error')}")
        break
    elif job['status'] == 'cancelled':
        # Without this branch the loop would never exit for a cancelled job
        print("Job was cancelled")
        break

    time.sleep(2)

# 3. Get reviews
if job['status'] == 'completed':
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
    reviews = response.json()['reviews']

    # Save to file
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved {len(reviews)} reviews to reviews.json")
```

### JavaScript/Node.js Example

```javascript
const axios = require('axios');
const fs = require('fs');

const BASE_URL = 'http://localhost:8000';

async function scrapeReviews(url) {
  // 1. Start job
  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
    url: url,
    headless: true
  });

  const jobId = startData.job_id;
  console.log(`Job started: ${jobId}`);

  // 2. Poll until the job reaches a terminal status
  while (true) {
    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);

    console.log(`Status: ${job.status} - ${job.progress.message}`);

    if (job.status === 'completed') {
      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
      break;
    } else if (job.status === 'failed') {
      console.log(`❌ Failed: ${job.error_message}`);
      return;
    } else if (job.status === 'cancelled') {
      // Without this branch the loop would never exit for a cancelled job
      console.log('Job was cancelled');
      return;
    }

    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  // 3. Get reviews
  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);

  // Save to file
  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));
  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
}

// Surface network or HTTP errors instead of leaving an unhandled rejection
scrapeReviews('https://www.google.com/maps/place/...').catch(console.error);
```

---

## Performance

### Fast Scraper Performance

The API now uses the **ultra-fast DOM-only scraper**:

| Metric | Value |
|--------|-------|
| Average Time | 18.9s |
| Speedup | 8.2x faster |
| Success Rate | 100% |
| Reviews/Second | ~12.9 |

**Timing Breakdown:**

- Scrolling: ~14s (60-74%)
- Extraction: ~0.01s (<0.1%)
- Setup: ~4-5s (25-30%)

---

## Configuration

### Server Configuration

Edit `api_server.py` to configure:

```python
# Number of concurrent scraping jobs
job_manager = JobManager(max_concurrent_jobs=3)

# Server host and port
uvicorn.run(
    "api_server:app",
    host="0.0.0.0",
    port=8000,
    reload=True  # Auto-reload is a development setting; disable it in production
)
```

### Scraper Configuration

Pass configuration when starting a job:

```json
{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}
```

---

## Error Handling

### HTTP Status Codes

- `200`: Success
- `400`: Bad request (invalid parameters or job state)
- `404`: Job not found
- `500`: Internal server error

### Error Response Format

```json
{
  "detail": "Error message here"
}
```

### Common Errors

**1. Job not completed yet**

```json
{
  "detail": "Job not completed yet (current status: running)"
}
```

**2. Job not found**

```json
{
  "detail": "Job not found"
}
```

**3. Maximum concurrent jobs reached**

```json
{
  "detail": "Maximum concurrent jobs reached"
}
```

---

## Testing

### Run Test Script

```bash
python test_fast_api.py
```

This will:

1. Start a scraping job
2. Poll until complete
3. Retrieve and save reviews
4. Show statistics

### Manual Testing (curl)

```bash
# Start job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
  | jq

# Get status (replace JOB_ID)
curl "http://localhost:8000/jobs/JOB_ID" | jq

# Get reviews
curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq
```

---

## Production Deployment

### Using Gunicorn

```bash
pip install gunicorn
gunicorn api_server:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000
```

Note: each Gunicorn worker is a separate process. If `JobManager` keeps job state in process memory, a job started on one worker may return `404` when another worker handles the status request. In that case, run with `--workers 1`.

### Using Docker

Create `Dockerfile` (this assumes a `requirements.txt` that lists the dependencies from Quick Start):

```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]
```

The `python:3.9-slim` base image does not include Chrome, so a browser must also be installed in the image for the scraper to run.

Run:

```bash
docker build -t google-reviews-api .
docker run -p 8000:8000 google-reviews-api
```

---

## Monitoring

### Check Running Jobs

```bash
curl "http://localhost:8000/stats" | jq
```

### List Recent Jobs

```bash
curl "http://localhost:8000/jobs?limit=10" | jq
```

### Auto-Cleanup

Jobs are automatically cleaned up after 24 hours. Configure in `api_server.py`:

```python
import asyncio

async def cleanup_jobs_periodically():
    while True:
        await asyncio.sleep(3600)  # Run every hour
        if job_manager:
            job_manager.cleanup_old_jobs(max_age_hours=24)
```

---

## Troubleshooting

### API won't start

**Error**: "Address already in use"

**Solution**: Change port in `api_server.py` or kill existing process:

```bash
lsof -ti:8000 | xargs kill
```

### Jobs stuck in "running" status

**Solution**: Check server logs for errors. Restart the server if needed.

### GDPR consent issues

The fast scraper automatically handles GDPR consent pages.
If issues persist:

- Set `headless: false` to see what's happening
- Check server logs for consent page detection

---

## Support

For issues or questions, check:

- Server logs: Console output when running `python api_server.py`
- Interactive docs: `http://localhost:8000/docs`
- Test script: `python test_fast_api.py`

---

**Enjoy ultra-fast Google Maps scraping with the API!** 🚀
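## Appendix: Polling with a Timeout

The polling loops in this document run until the job reaches a terminal status, so a job stuck in `running` (see Troubleshooting) would keep a client waiting forever. The following is a minimal sketch of a client-side guard, not part of the API itself. The helper names `fetch_job` and `wait_for_job`, the 120-second default timeout, and the use of `urllib` instead of `requests` are all assumptions made for illustration. Only the `GET /jobs/{job_id}` endpoint and the status values come from the documentation above.

```python
import json
import time
import urllib.request

# Terminal statuses as listed under "Job Status Values"
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def fetch_job(base_url, job_id):
    """Hypothetical helper: GET /jobs/{job_id} and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(get_job, timeout=120.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_job() until the job is terminal or the timeout expires.

    get_job is any zero-argument callable returning the job dict, which
    keeps the loop independent of the HTTP client. The 120s default is an
    assumed value, not something the API prescribes.
    """
    deadline = clock() + timeout
    while True:
        job = get_job()
        if job["status"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still '{job['status']}' after {timeout}s")
        sleep(interval)
```

A call might look like `job = wait_for_job(lambda: fetch_job("http://localhost:8000", job_id))`. On `TimeoutError`, the caller could cancel the job with `POST /jobs/{job_id}/cancel` before retrying.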