whyrating-engine-legacy/API_QUICKSTART.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00


# API Quick Start - Fast Google Reviews Scraper
## ⚡ Ultra-Fast API (18.9 seconds!)
REST API for scraping Google Maps reviews using the optimized DOM-only scraper.
**Performance**: ~18.9 seconds for 244 reviews (8.2x faster than original!)
---
## 🚀 Quick Start
### 1. Install & Run
```bash
# Install dependencies
pip install fastapi uvicorn seleniumbase pyyaml
# Start API server
python api_server.py
```
Server starts on: `http://localhost:8000`
### 2. Use the API
```bash
# Start a scraping job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
    "headless": true
  }'
```
**Response:**
```json
{
"job_id": "550e8400-e29b-41d4-a716-446655440000",
"status": "started"
}
```
### 3. Check Status
```bash
# Check job status
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
```
**Response:**
```json
{
"status": "completed",
"reviews_count": 244,
"scrape_time": 18.9
}
```
### 4. Get Reviews
```bash
# Get the actual reviews
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
-o reviews.json
```
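
Once `reviews.json` is saved, it can be loaded back for a quick sanity check. This is a minimal sketch, assuming the saved response is a JSON object with a `reviews` list (as the Python example in this guide implies). The helper name `load_reviews` is hypothetical, and the per-review fields are not documented here, so the sketch only returns the list.

```python
import json
from pathlib import Path


def load_reviews(path="reviews.json"):
    """Return the list stored under the 'reviews' key of a saved response."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: the endpoint returns {"reviews": [...]}; fail loudly otherwise.
    if not isinstance(data, dict) or not isinstance(data.get("reviews"), list):
        raise ValueError("expected a JSON object with a 'reviews' list")
    return data["reviews"]
```

Calling `len(load_reviews())` should then match the `reviews_count` reported by the status endpoint, if the assumption about the response shape holds.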
---
## 📋 Key Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/scrape` | POST | Start scraping job |
| `/jobs/{job_id}` | GET | Get job status |
| `/jobs/{job_id}/reviews` | GET | Get scraped reviews |
| `/jobs` | GET | List all jobs |
| `/stats` | GET | Get statistics |
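
The endpoints in the table can be wrapped in a small helper so that callers do not assemble URLs by hand. This is a rough sketch using only the standard library. The class `ReviewsApi` and its method names are hypothetical, and the response shapes of `/jobs` and `/stats` are not documented in this guide, so the parsed JSON is returned as-is.

```python
import json
import urllib.request


class ReviewsApi:
    """Hypothetical thin wrapper around the endpoints listed above."""

    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url.rstrip("/")

    def url(self, *parts):
        # Join path segments onto the base URL, e.g. url("jobs", job_id).
        return "/".join([self.base_url, *parts])

    def _get(self, *parts):
        with urllib.request.urlopen(self.url(*parts)) as resp:
            return json.load(resp)

    def start(self, maps_url, headless=True):
        # POST /scrape with the JSON body shown in the Quick Start.
        body = json.dumps({"url": maps_url, "headless": headless}).encode()
        req = urllib.request.Request(
            self.url("scrape"),
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def status(self, job_id):
        return self._get("jobs", job_id)

    def reviews(self, job_id):
        return self._get("jobs", job_id, "reviews")

    def list_jobs(self):
        return self._get("jobs")

    def stats(self):
        return self._get("stats")
```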
---
## 💻 Python Example
```python
import time

import requests

BASE_URL = "http://localhost:8000"

# 1. Start job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True,
    },
)
response.raise_for_status()  # Surface HTTP errors instead of a KeyError below
job_id = response.json()["job_id"]

# 2. Wait for completion (poll every 2 seconds)
while True:
    job = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if job["status"] in ("completed", "failed"):
        break
    time.sleep(2)

# 3. Get reviews (stop here if the job failed)
if job["status"] == "failed":
    raise SystemExit(f"Job {job_id} failed")
reviews = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews").json()["reviews"]
print(f"Got {len(reviews)} reviews!")
```
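
A polling loop like the one above never exits if a job gets stuck before reaching `completed` or `failed`. One way to bound the wait is a deadline, sketched below. The names `wait_for_job` and `fetch_status` are hypothetical, and the 300-second default is an arbitrary assumption rather than a documented limit.

```python
import time


def wait_for_job(fetch_status, timeout=300.0, interval=2.0, sleep=time.sleep):
    """Poll fetch_status() until the job is 'completed' or 'failed'.

    fetch_status is any zero-argument callable returning the job dict.
    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        sleep(interval)
```

With `requests`, the callable could be `lambda: requests.get(f"http://localhost:8000/jobs/{job_id}").json()`. Passing the fetch function in keeps the helper testable without a running server.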
---
## 🧪 Test It
```bash
# Run the test script
python test_fast_api.py
```
This will:
- Start a job
- Poll until complete
- Save reviews to JSON
- Show statistics
---
## 📚 Full Documentation
See [API_DOCUMENTATION.md](API_DOCUMENTATION.md) for:
- Complete endpoint reference
- Advanced examples
- Error handling
- Production deployment
- Monitoring & troubleshooting
---
## 🎯 API Features
- **Ultra-fast scraping** (~18.9s for 244 reviews)
- **Background job processing** (non-blocking)
- **Concurrent jobs** (up to 3 simultaneous)
- **Job status tracking** (pending/running/completed)
- **Review data retrieval** (via dedicated endpoint)
- **Automatic cleanup** (removes old jobs)
- **GDPR auto-handling** (no manual intervention)
- **REST API** (language-agnostic)
- **OpenAPI docs** (visit `/docs` for Swagger UI)
---
## 🔧 Configuration
### API Server
```python
# In api_server.py
job_manager = JobManager(max_concurrent_jobs=3)  # Max parallel jobs

uvicorn.run(
    "api_server:app",
    host="0.0.0.0",  # Listen on all interfaces
    port=8000,       # Port number
    reload=True,     # Auto-reload on code changes
)
```
### Scraping Options
```json
{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}
```
- `headless`: run Chrome in headless mode
- `max_scrolls`: maximum number of scrolls (default: 35)
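
When building the request body in code, a small helper can catch obvious mistakes before the request is sent. This is a sketch only: `build_scrape_request` is a hypothetical name, and the client-side checks are assumptions, not the server's own validation rules.

```python
import json


def build_scrape_request(url, headless=True, max_scrolls=35):
    """Return the JSON body for POST /scrape (client-side checks are assumptions)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be an absolute http(s) URL")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_scrolls, bool) or not isinstance(max_scrolls, int) or max_scrolls < 1:
        raise ValueError("max_scrolls must be a positive integer")
    return {"url": url, "headless": bool(headless), "max_scrolls": max_scrolls}


# Serialize to strict JSON; the placeholder URL is kept as written in this guide.
body = json.dumps(build_scrape_request("https://www.google.com/maps/place/...", max_scrolls=50))
```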
---
## 📊 Performance
```
Operation                Time      % of Total
──────────────────────────────────────────────
Scrolling (dynamic)      ~14s      74%
Setup & navigation       ~4.5s     24%
JavaScript extraction    ~0.01s    0.1%
──────────────────────────────────────────────
TOTAL                    ~18.9s    100%
```
**8.2x faster than the original scraper!** 🚀
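
The figures in this section can be checked with a few lines of arithmetic. The snippet below uses only timings quoted in this document and recomputes the speedup and each phase's share of the total. The variable names are illustrative.

```python
# Timings copied from this document (seconds).
before, after = 155.0, 18.9
phases = {
    "Scrolling (dynamic)": 14.0,
    "Setup & navigation": 4.5,
    "JavaScript extraction": 0.01,
}

speedup = before / after
print(f"speedup: {speedup:.1f}x")

# Share of the total for each phase.
for name, seconds in phases.items():
    print(f"{name:<24}{seconds / after:6.1%}")

# Time in the total that the listed phases do not account for.
unaccounted = after - sum(phases.values())
print(f"unaccounted: {unaccounted:.2f}s")
```

The listed phases sum to about 18.5s, so roughly 0.4s of the ~18.9s total is not attributed to any phase, which is why the percentages add up to slightly less than 100%.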
---
## 🌐 Interactive Documentation
Visit `http://localhost:8000/docs` for:
- Interactive API testing
- Request/response schemas
- Try out endpoints directly in browser
---
## ⚙️ What Changed?
The API now uses the **fast DOM-only scraper** (`modules/fast_scraper.py`) instead of the old scraper:
- **Before**: 155 seconds ❌
- **Now**: 18.9 seconds ✅

**Key optimizations**:
1. GDPR consent auto-handling
2. Dynamic scroll waiting (adapts to page speed)
3. JavaScript extraction (40x faster than Selenium)
4. Universal design (no hardcoded values)
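
To illustrate the idea behind item 2, here is one plausible way to stop scrolling based on what the page does rather than on a fixed schedule. This is a sketch, not the code in `modules/fast_scraper.py`. `scroll_once` and `count_reviews` are hypothetical callables standing in for the browser calls, and `idle_limit` is an assumed parameter. Only `max_scrolls` and its default of 35 come from the Scraping Options section.

```python
def scroll_until_idle(scroll_once, count_reviews, max_scrolls=35, idle_limit=3):
    """Scroll until the review count stops growing or max_scrolls is reached.

    Returns (scrolls_performed, final_count).
    """
    last_count = count_reviews()
    idle = 0
    scrolls = 0
    while scrolls < max_scrolls and idle < idle_limit:
        scroll_once()
        scrolls += 1
        current = count_reviews()
        if current > last_count:
            last_count = current
            idle = 0   # New reviews loaded, reset the idle counter
        else:
            idle += 1  # Nothing new appeared after this scroll
    return scrolls, last_count
```

A fast page that stops yielding reviews ends the loop early, while a slow or long page keeps scrolling up to `max_scrolls`.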
---
**Ready to scrape at 8.2x speed via API!** 🚀