Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/API_QUICKSTART.md
+++ b/API_QUICKSTART.md
@@ -0,0 +1,224 @@
+# API Quick Start - Fast Google Reviews Scraper
+
+## ⚡ Ultra-Fast API (18.9 seconds!)
+
+REST API for scraping Google Maps reviews using the optimized DOM-only scraper.
+
+**Performance**: ~18.9 seconds for 244 reviews (8.2x faster than the original scraper's 155 seconds)
+
+---
+
+## 🚀 Quick Start
+
+### 1. Install & Run
+
+```bash
+# Install dependencies
+pip install fastapi uvicorn seleniumbase pyyaml
+
+# Start API server
+python api_server.py
+```
+
+Server starts on: `http://localhost:8000`
+
+### 2. Use the API
+
+```bash
+# Start a scraping job
+curl -X POST "http://localhost:8000/scrape" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
+    "headless": true
+  }'
+```
+
+**Response:**
+```json
+{
+  "job_id": "550e8400-e29b-41d4-a716-446655440000",
+  "status": "started"
+}
+```
+
+### 3. Check Status
+
+```bash
+# Check job status
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
+```
+
+**Response:**
+```json
+{
+  "status": "completed",
+  "reviews_count": 244,
+  "scrape_time": 18.9
+}
+```
+
+### 4. Get Reviews
+
+```bash
+# Get the actual reviews
+curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
+  -o reviews.json
+```
+
+---
+
+## 📋 Key Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/scrape` | POST | Start scraping job |
+| `/jobs/{job_id}` | GET | Get job status |
+| `/jobs/{job_id}/reviews` | GET | Get scraped reviews |
+| `/jobs` | GET | List all jobs |
+| `/stats` | GET | Get statistics |
+
+---
+
+## 💻 Python Example
+
+```python
+import requests
+import time
+
+# 1. Start job
+response = requests.post(
+    "http://localhost:8000/scrape",
+    json={
+        "url": "https://www.google.com/maps/place/...",
+        "headless": True
+    }
+)
+job_id = response.json()['job_id']
+
+# 2. Wait for completion
+while True:
+    job = requests.get(f"http://localhost:8000/jobs/{job_id}").json()
+    if job['status'] in ['completed', 'failed']:
+        break
+    time.sleep(2)
+
+# 3. Get reviews
+reviews = requests.get(
+    f"http://localhost:8000/jobs/{job_id}/reviews"
+).json()['reviews']
+
+print(f"Got {len(reviews)} reviews!")
+```
+
+---
+
+## 🧪 Test It
+
+```bash
+# Run the test script
+python test_fast_api.py
+```
+
+This will:
+- Start a job
+- Poll until complete
+- Save reviews to JSON
+- Show statistics
+
+---
+
+## 📚 Full Documentation
+
+See [API_DOCUMENTATION.md](API_DOCUMENTATION.md) for:
+- Complete endpoint reference
+- Advanced examples
+- Error handling
+- Production deployment
+- Monitoring & troubleshooting
+
+---
+
+## 🎯 API Features
+
+✅ **Ultra-fast scraping** (~18.9s for 244 reviews)
+✅ **Background job processing** (non-blocking)
+✅ **Concurrent jobs** (up to 3 simultaneous)
+✅ **Job status tracking** (pending/running/completed/failed)
+✅ **Review data retrieval** (via dedicated endpoint)
+✅ **Automatic cleanup** (removes old jobs)
+✅ **GDPR auto-handling** (no manual intervention)
+✅ **REST API** (language-agnostic)
+✅ **OpenAPI docs** (visit `/docs` for Swagger UI)
+
+---
+
+## 🔧 Configuration
+
+### API Server
+
+```python
+# In api_server.py
+job_manager = JobManager(max_concurrent_jobs=3)  # Max parallel jobs
+
+uvicorn.run(
+    "api_server:app",
+    host="0.0.0.0",  # Listen on all interfaces
+    port=8000,        # Port number
+    reload=True       # Auto-reload on code changes
+)
+```
+
+### Scraping Options
+
+```jsonc
+{
+  "url": "https://www.google.com/maps/place/...",
+  "headless": true,     // Run Chrome in headless mode
+  "max_scrolls": 35     // Maximum scrolls (default: 35)
+}
+```
+
+---
+
+## 📊 Performance
+
+```
+Operation                 Time      % of Total
+──────────────────────────────────────────────
+Scrolling (dynamic)       ~14s      74%
+Setup & navigation        ~4.5s     24%
+JavaScript extraction     ~0.01s    <0.1%
+──────────────────────────────────────────────
+TOTAL                     ~18.9s    100%
+```
+
+**8.2x faster than the original scraper!** 🚀
+
+---
+
+## 🌐 Interactive Documentation
+
+Visit `http://localhost:8000/docs` for:
+- Interactive API testing
+- Request/response schemas
+- Direct endpoint calls from the browser
+
+---
+
+## ⚙️ What Changed?
+
+The API now uses the **fast DOM-only scraper** (`modules/fast_scraper.py`) instead of the old scraper:
+
+**Before**: 155 seconds ❌
+**Now**: 18.9 seconds ✅
+
+**Key optimizations**:
+1. GDPR consent auto-handling
+2. Dynamic scroll waiting (adapts to page speed)
+3. JavaScript extraction (40x faster than Selenium)
+4. Universal design (no hardcoded values)
+
+---
+
+**Ready to scrape at 8.2x speed via API!** 🚀
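
The polling loop in the Python example above runs until the job reports `completed` or `failed`, so a job that stalls in another state would block the caller indefinitely. Below is a minimal sketch of a bounded wait against the same `/jobs/{job_id}` endpoint. The helper names (`fetch_job`, `wait_for_job`), the 300-second default timeout, and the use of `urllib` instead of `requests` are assumptions made for illustration and are not part of this commit.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the quick start; adjust as needed

# Statuses the quick-start example treats as final.
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_job(job_id, base_url=BASE_URL):
    """Hypothetical helper: GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}", timeout=10) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_job, timeout=300.0, interval=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job is final, or raise TimeoutError after `timeout` seconds.

    `fetch`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a running API server.
    """
    deadline = clock() + timeout
    while True:
        job = fetch(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout}s"
            )
        sleep(interval)
```

Calling `wait_for_job(job_id)` in place of the `while True` loop leaves the rest of the example unchanged; the caller can then check whether the returned status is `failed` before requesting `/jobs/{job_id}/reviews`.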