Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/API_DOCUMENTATION.md
+++ b/API_DOCUMENTATION.md
@@ -0,0 +1,657 @@
+# Google Reviews Scraper - Fast API Documentation
+
+## Overview
+
+REST API for scraping Google Maps reviews using the **ultra-fast DOM-only scraper** (18.9s average).
+
+**Performance**: ~18.9 seconds for 244 reviews (8.2x faster than the original scraper)
+
+---
+
+## Quick Start
+
+### 1. Install Dependencies
+
+```bash
+pip install fastapi uvicorn seleniumbase pyyaml
+```
+
+### 2. Start the API Server
+
+```bash
+python api_server.py
+```
+
+Server runs on: `http://localhost:8000`
+
+### 3. API Documentation
+
+Visit `http://localhost:8000/docs` for interactive Swagger UI documentation.
+
+---
+
+## API Endpoints
+
+### Health Check
+
+**GET** `/`
+
+Check if the API is running.
+
+**Response:**
+```json
+{
+  "message": "Google Reviews Scraper API is running",
+  "status": "healthy",
+  "version": "1.0.0"
+}
+```
+
+---
+
+### Start Scraping Job
+
+**POST** `/scrape`
+
+Start a new scraping job in the background.
+
+**Request Body:**
+```json
+{
+  "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
+  "headless": true
+}
+```
+
+**Parameters:**
+- `url` (required): Google Maps URL to scrape
+- `headless` (optional): Run Chrome in headless mode (default: false)
+- `max_scrolls` (optional): Maximum number of scrolls (default: 35)
+
+**Response:**
+```json
+{
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "started",
+  "message": "Scraping job started successfully"
+}
+```
+
+**Example (curl):**
+```bash
+curl -X POST "http://localhost:8000/scrape" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.google.com/maps/place/...",
+    "headless": true
+  }'
+```
+
+**Example (Python):**
+```python
+import requests
+
+response = requests.post(
+    "http://localhost:8000/scrape",
+    json={
+        "url": "https://www.google.com/maps/place/...",
+        "headless": True
+    }
+)
+
+job_id = response.json()['job_id']
+print(f"Job started: {job_id}")
+```
+
+---
+
+### Get Job Status
+
+**GET** `/jobs/{job_id}`
+
+Get detailed information about a specific job.
+
+**Response:**
+```json
+{
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "completed",
+  "url": "https://www.google.com/maps/...",
+  "created_at": "2026-01-18T10:30:00",
+  "started_at": "2026-01-18T10:30:01",
+  "completed_at": "2026-01-18T10:30:20",
+  "reviews_count": 244,
+  "scrape_time": 18.9,
+  "progress": {
+    "stage": "completed",
+    "message": "Scraping completed successfully in 18.9s",
+    "scroll_time": 14.2,
+    "extract_time": 0.01
+  }
+}
+```
+
+**Job Status Values:**
+- `pending`: Job is queued but not started
+- `running`: Job is currently scraping
+- `completed`: Job finished successfully
+- `failed`: Job failed with an error
+- `cancelled`: Job was cancelled
+
+**Example (curl):**
+```bash
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
+```
+
+**Example (Python - Poll until complete):**
+```python
+import requests
+import time
+
+job_id = "550e8400-e29b-41d4-a716-446655440000"
+
+while True:
+    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
+    job = response.json()
+
+    print(f"Status: {job['status']} - {job['progress']['message']}")
+
+    if job['status'] in ['completed', 'failed', 'cancelled']:
+        break
+
+    time.sleep(2)  # Poll every 2 seconds
+
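+# Only completed jobs are guaranteed to report a review count and scrape time
+if job['status'] == 'completed':
+    print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")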
+```
+
+---
+
+### Get Job Reviews
+
+**GET** `/jobs/{job_id}/reviews`
+
+Get the scraped reviews for a completed job.
+
+**Response:**
+```json
+{
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "reviews": [
+    {
+      "review_id": "review_123456789",
+      "author": "John Doe",
+      "rating": 5.0,
+      "text": "Great place! Highly recommend...",
+      "date_text": "2 months ago",
+      "avatar_url": "https://lh3.googleusercontent.com/...",
+      "profile_url": "..."
+    },
+    ...
+  ],
+  "count": 244
+}
+```
+
+**Error Responses:**
+- `404`: Job not found
+- `400`: Job not completed yet
+
+**Example (curl):**
+```bash
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
+  -o reviews.json
+```
+
+**Example (Python):**
+```python
+import requests
+import json
+
+job_id = "550e8400-e29b-41d4-a716-446655440000"
+
+response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
+reviews_data = response.json()
+
+# Save to file
+with open('reviews.json', 'w', encoding='utf-8') as f:
+    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)
+
+print(f"Retrieved {reviews_data['count']} reviews")
+```
+
+---
+
+### List All Jobs
+
+**GET** `/jobs`
+
+List all jobs, optionally filtered by status.
+
+**Query Parameters:**
+- `status` (optional): Filter by job status (pending, running, completed, failed, cancelled)
+- `limit` (optional): Maximum number of jobs to return (default: 100, max: 1000)
+
+**Response:**
+```json
+[
+  {
+    "job_id": "550e8400-e29b-41d4-a716-446655440000",
+    "status": "completed",
+    "url": "https://www.google.com/maps/...",
+    "created_at": "2026-01-18T10:30:00",
+    "reviews_count": 244,
+    "scrape_time": 18.9
+  },
+  ...
+]
+```
+
+**Example (curl):**
+```bash
+# Get all completed jobs
+curl "http://localhost:8000/jobs?status=completed&limit=10"
+```
+
+---
+
+### Cancel Job
+
+**POST** `/jobs/{job_id}/cancel`
+
+Cancel a pending or running job.
+
+**Response:**
+```json
+{
+  "message": "Job cancelled successfully"
+}
+```
+
+**Error Responses:**
+- `404`: Job not found
+- `400`: Job cannot be cancelled (already completed/failed)
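+
+**Example (Python):**
+
+A minimal sketch of cancelling a job from a client. It assumes the server is reachable at `http://localhost:8000` and that error responses carry the `detail` field described under Error Handling. The helper name `cancel_job` is illustrative and not part of the API.
+
+```python
+import requests
+
+BASE_URL = "http://localhost:8000"  # assumed local server address
+
+
+def cancel_job(job_id: str) -> bool:
+    """Return True if the job was cancelled, False if the server refused."""
+    response = requests.post(f"{BASE_URL}/jobs/{job_id}/cancel")
+    if response.status_code == 200:
+        return True
+    # 404 (unknown job) and 400 (already finished) carry a "detail" message
+    detail = response.json().get("detail")
+    print(f"Cancel failed ({response.status_code}): {detail}")
+    return False
+```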
+
+---
+
+### Delete Job
+
+**DELETE** `/jobs/{job_id}`
+
+Delete a job from the system (removes job data).
+
+**Response:**
+```json
+{
+  "message": "Job deleted successfully"
+}
+```
+
+---
+
+### Get Statistics
+
+**GET** `/stats`
+
+Get job manager statistics.
+
+**Response:**
+```json
+{
+  "total_jobs": 42,
+  "by_status": {
+    "pending": 2,
+    "running": 1,
+    "completed": 35,
+    "failed": 3,
+    "cancelled": 1
+  },
+  "running_jobs": 1,
+  "max_concurrent_jobs": 3
+}
+```
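+
+**Example (Python):**
+
+The statistics can be used to check for spare capacity before submitting a job. The sketch below works on the response shape shown above; `free_slots` is a hypothetical helper, and the `stats` dictionary is the documented example rather than live data.
+
+```python
+def free_slots(stats: dict) -> int:
+    """Number of additional jobs that could start right now (never negative)."""
+    return max(stats["max_concurrent_jobs"] - stats["running_jobs"], 0)
+
+
+# Example payload copied from the response above
+stats = {
+    "total_jobs": 42,
+    "by_status": {"pending": 2, "running": 1, "completed": 35, "failed": 3, "cancelled": 1},
+    "running_jobs": 1,
+    "max_concurrent_jobs": 3,
+}
+
+print(free_slots(stats))  # 3 allowed - 1 running = 2
+```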
+
+---
+
+### Manual Cleanup
+
+**POST** `/cleanup`
+
+Manually trigger cleanup of old completed/failed jobs.
+
+**Query Parameters:**
+- `max_age_hours` (optional): Maximum age in hours (default: 24)
+
+**Response:**
+```json
+{
+  "message": "Cleaned up jobs older than 24 hours"
+}
+```
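+
+**Example (Python):**
+
+A short sketch of triggering cleanup with a custom age, assuming the server runs at `http://localhost:8000`. The function name `trigger_cleanup` is illustrative; `max_age_hours` is sent as a query parameter, as documented above.
+
+```python
+import requests
+
+BASE_URL = "http://localhost:8000"  # assumed local server address
+
+
+def trigger_cleanup(max_age_hours: int = 24) -> str:
+    """Ask the server to remove old jobs and return its confirmation message."""
+    response = requests.post(
+        f"{BASE_URL}/cleanup",
+        params={"max_age_hours": max_age_hours},  # becomes ?max_age_hours=...
+    )
+    response.raise_for_status()  # surface 4xx/5xx responses as exceptions
+    return response.json()["message"]
+```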
+
+---
+
+## Complete Workflow Example
+
+### Python Script
+
+```python
+import requests
+import time
+import json
+
+BASE_URL = "http://localhost:8000"
+
+# 1. Start scraping job
+response = requests.post(
+    f"{BASE_URL}/scrape",
+    json={
+        "url": "https://www.google.com/maps/place/...",
+        "headless": True
+    }
+)
+job_id = response.json()['job_id']
+print(f"Job started: {job_id}")
+
+# 2. Poll until complete
+while True:
+    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
+    job = response.json()
+
+    print(f"Status: {job['status']} - {job['progress']['message']}")
+
+    if job['status'] == 'completed':
+        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
+        break
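+    elif job['status'] in ('failed', 'cancelled'):
+        # `error_message` is not part of the documented response, so read it defensively
+        print(f"❌ {job['status'].capitalize()}: {job.get('error_message', 'no details')}")
+        break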
+
+    time.sleep(2)
+
+# 3. Get reviews
+if job['status'] == 'completed':
+    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
+    reviews = response.json()['reviews']
+
+    # Save to file
+    with open('reviews.json', 'w', encoding='utf-8') as f:
+        json.dump(reviews, f, indent=2, ensure_ascii=False)
+
+    print(f"💾 Saved {len(reviews)} reviews to reviews.json")
+```
+
+### JavaScript/Node.js Example
+
+```javascript
+const axios = require('axios');
+const fs = require('fs');
+
+const BASE_URL = 'http://localhost:8000';
+
+async function scrapeReviews(url) {
+  // 1. Start job
+  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
+    url: url,
+    headless: true
+  });
+
+  const jobId = startData.job_id;
+  console.log(`Job started: ${jobId}`);
+
+  // 2. Poll until complete
+  while (true) {
+    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);
+
+    console.log(`Status: ${job.status} - ${job.progress.message}`);
+
+    if (job.status === 'completed') {
+      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
+      break;
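+    } else if (job.status === 'failed' || job.status === 'cancelled') {
+      // Stop polling on any terminal status, not only on failure
+      console.log(`❌ ${job.status}: ${job.error_message || 'no details'}`);
+      return;
+    }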
+
+    await new Promise(resolve => setTimeout(resolve, 2000));
+  }
+
+  // 3. Get reviews
+  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);
+
+  // Save to file
+  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));
+
+  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
+}
+
+scrapeReviews('https://www.google.com/maps/place/...');
+```
+
+---
+
+## Performance
+
+### Fast Scraper Performance
+
+The API now uses the **ultra-fast DOM-only scraper**:
+
+| Metric | Value |
+|--------|-------|
+| Average Time | 18.9s |
+| Speedup | 8.2x faster than the original scraper |
+| Success Rate | 100% |
+| Reviews/Second | ~12.9 |
+
+**Timing Breakdown:**
+- Scrolling: ~14s (60-74%)
+- Extraction: ~0.01s (0.1%)
+- Setup: ~4-5s (25-30%)
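+
+The throughput figure and the shares above can be reproduced from the timings in the job status example (`scrape_time`, `scroll_time`, `extract_time`). This is a small arithmetic sketch; treating setup as the remainder of the total is an assumption.
+
+```python
+reviews = 244
+total_s = 18.9     # scrape_time from the job status example
+scroll_s = 14.2    # scroll_time
+extract_s = 0.01   # extract_time
+setup_s = total_s - scroll_s - extract_s  # assumed: everything else is setup
+
+print(f"Throughput: {reviews / total_s:.1f} reviews/s")  # 244 / 18.9 ≈ 12.9
+
+for name, seconds in [("Scrolling", scroll_s), ("Extraction", extract_s), ("Setup", setup_s)]:
+    # Share of the total run time spent in each stage
+    print(f"{name}: {seconds:.2f}s ({seconds / total_s:.1%})")
+```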
+
+---
+
+## Configuration
+
+### Server Configuration
+
+Edit `api_server.py` to configure:
+
+```python
+# Number of concurrent scraping jobs
+job_manager = JobManager(max_concurrent_jobs=3)
+
+# Server host and port
+uvicorn.run(
+    "api_server:app",
+    host="0.0.0.0",
+    port=8000,
+    reload=True
+)
+```
+
+### Scraper Configuration
+
+Pass configuration when starting a job:
+
+```json
+{
+  "url": "https://www.google.com/maps/place/...",
+  "headless": true,
+  "max_scrolls": 35
+}
+```
+
+---
+
+## Error Handling
+
+### HTTP Status Codes
+
+- `200`: Success
+- `400`: Bad request (invalid parameters or job state)
+- `404`: Job not found
+- `500`: Internal server error
+
+### Error Response Format
+
+```json
+{
+  "detail": "Error message here"
+}
+```
+
+### Common Errors
+
+**1. Job not completed yet**
+```json
+{
+  "detail": "Job not completed yet (current status: running)"
+}
+```
+
+**2. Job not found**
+```json
+{
+  "detail": "Job not found"
+}
+```
+
+**3. Maximum concurrent jobs reached**
+```json
+{
+  "detail": "Maximum concurrent jobs reached"
+}
+```
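+
+**Handling errors in a client (Python):**
+
+Since every error uses the same `detail` format, a client can convert error responses into exceptions in one place. The sketch below is one possible approach; `ApiError` and `check_response` are hypothetical names, not part of the API.
+
+```python
+class ApiError(Exception):
+    """Raised when the API answers with a 4xx or 5xx status code."""
+
+    def __init__(self, status_code: int, detail: str):
+        super().__init__(f"{status_code}: {detail}")
+        self.status_code = status_code
+        self.detail = detail
+
+
+def check_response(status_code: int, payload: dict) -> dict:
+    """Return the payload on success, raise ApiError on an error status."""
+    if status_code >= 400:
+        raise ApiError(status_code, payload.get("detail", "unknown error"))
+    return payload
+
+
+# Usage with the "Job not found" error shown above
+try:
+    check_response(404, {"detail": "Job not found"})
+except ApiError as err:
+    print(f"Request failed: {err}")
+```
+
+With `requests`, this could be called as `check_response(response.status_code, response.json())`.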
+
+---
+
+## Testing
+
+### Run Test Script
+
+```bash
+python test_fast_api.py
+```
+
+This will:
+1. Start a scraping job
+2. Poll until complete
+3. Retrieve and save reviews
+4. Show statistics
+
+### Manual Testing (curl)
+
+```bash
+# Start job
+curl -X POST "http://localhost:8000/scrape" \
+  -H "Content-Type: application/json" \
+  -d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
+  | jq
+
+# Get status (replace JOB_ID)
+curl "http://localhost:8000/jobs/JOB_ID" | jq
+
+# Get reviews
+curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq
+```
+
+---
+
+## Production Deployment
+
+### Using Gunicorn
+
+```bash
+pip install gunicorn
+
+gunicorn api_server:app \
+  --workers 4 \
+  --worker-class uvicorn.workers.UvicornWorker \
+  --bind 0.0.0.0:8000
+```
+
+**Note:** Each Gunicorn worker is a separate process. If `JobManager` keeps job state in memory (an assumption; check `api_server.py`), a job started on one worker is not visible to the others, so status requests may return `404`. In that case run with `--workers 1` or move job state to shared storage.
+
+### Using Docker
+
+Create `Dockerfile`:
+
+```dockerfile
+FROM python:3.9-slim
+
+WORKDIR /app
+
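+# NOTE: the scraper drives Chrome via SeleniumBase. This slim base image ships
+# without a browser, so add a Chrome/Chromium install step before running jobs.
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt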
+
+COPY . .
+
+CMD ["python", "api_server.py"]
+```
+
+Run:
+```bash
+docker build -t google-reviews-api .
+docker run -p 8000:8000 google-reviews-api
+```
+
+---
+
+## Monitoring
+
+### Check Running Jobs
+
+```bash
+curl "http://localhost:8000/stats" | jq
+```
+
+### List Recent Jobs
+
+```bash
+curl "http://localhost:8000/jobs?limit=10" | jq
+```
+
+### Auto-Cleanup
+
+Completed and failed jobs older than 24 hours are removed automatically by a background task that runs every hour. Configure in `api_server.py`:
+
+```python
+async def cleanup_jobs_periodically():
+    while True:
+        await asyncio.sleep(3600)  # Run every hour
+        if job_manager:
+            job_manager.cleanup_old_jobs(max_age_hours=24)
+```
+
+---
+
+## Troubleshooting
+
+### API won't start
+
+**Error**: "Address already in use"
+
+**Solution**: Change port in `api_server.py` or kill existing process:
+```bash
+lsof -ti:8000 | xargs kill
+```
+
+### Jobs stuck in "running" status
+
+**Solution**: Check server logs for errors. Cancel the stuck job with `POST /jobs/{job_id}/cancel`, and restart the server if that does not help.
+
+### GDPR consent issues
+
+The fast scraper automatically handles GDPR consent pages. If issues persist:
+- Set `headless: false` to see what's happening
+- Check server logs for consent page detection
+
+---
+
+## Support
+
+For issues or questions, check:
+- Server logs: Console output when running `python api_server.py`
+- Interactive docs: `http://localhost:8000/docs`
+- Test script: `python test_fast_api.py`
+
+---
+
+**Enjoy ultra-fast Google Maps scraping with the API!** 🚀