Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
657
API_DOCUMENTATION.md
Normal file
@@ -0,0 +1,657 @@
# Google Reviews Scraper - Fast API Documentation

## Overview

REST API for scraping Google Maps reviews using the **ultra-fast DOM-only scraper** (18.9s average).

**Performance**: ~18.9 seconds for 244 reviews (8.2x faster than original!)

---

## Quick Start

### 1. Install Dependencies

```bash
pip install fastapi uvicorn seleniumbase pyyaml
```

### 2. Start the API Server

```bash
python api_server.py
```

Server runs on: `http://localhost:8000`

### 3. API Documentation

Visit `http://localhost:8000/docs` for interactive Swagger UI documentation.

---

## API Endpoints

### Health Check

**GET** `/`

Check if the API is running.

**Response:**
```json
{
  "message": "Google Reviews Scraper API is running",
  "status": "healthy",
  "version": "1.0.0"
}
```

---

### Start Scraping Job

**POST** `/scrape`

Start a new scraping job in the background.

**Request Body:**
```json
{
  "url": "https://www.google.com/maps/place/YOUR_BUSINESS_URL",
  "headless": true
}
```

**Parameters:**
- `url` (required): Google Maps URL to scrape
- `headless` (optional): Run Chrome in headless mode (default: false)
- `max_scrolls` (optional): Maximum number of scrolls (default: 35)

**Response:**
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "started",
  "message": "Scraping job started successfully"
}
```

**Example (curl):**
```bash
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.google.com/maps/place/...",
    "headless": true
  }'
```

**Example (Python):**
```python
import requests

response = requests.post(
    "http://localhost:8000/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)

job_id = response.json()['job_id']
print(f"Job started: {job_id}")
```

---

### Get Job Status

**GET** `/jobs/{job_id}`

Get detailed information about a specific job.

**Response:**
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "url": "https://www.google.com/maps/...",
  "created_at": "2026-01-18T10:30:00",
  "started_at": "2026-01-18T10:30:01",
  "completed_at": "2026-01-18T10:30:20",
  "reviews_count": 244,
  "scrape_time": 18.9,
  "progress": {
    "stage": "completed",
    "message": "Scraping completed successfully in 18.9s",
    "scroll_time": 14.2,
    "extract_time": 0.01
  }
}
```

**Job Status Values:**
- `pending`: Job is queued but not started
- `running`: Job is currently scraping
- `completed`: Job finished successfully
- `failed`: Job failed with an error
- `cancelled`: Job was cancelled

**Example (curl):**
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000"
```

**Example (Python - Poll until complete):**
```python
import requests
import time

job_id = "550e8400-e29b-41d4-a716-446655440000"

while True:
    response = requests.get(f"http://localhost:8000/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] in ['completed', 'failed', 'cancelled']:
        break

    time.sleep(2)  # Poll every 2 seconds

print(f"Final: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
```

---

### Get Job Reviews

**GET** `/jobs/{job_id}/reviews`

Get the actual scraped reviews data for a completed job.

**Response:**
```json
{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "reviews": [
    {
      "review_id": "review_123456789",
      "author": "John Doe",
      "rating": 5.0,
      "text": "Great place! Highly recommend...",
      "date_text": "2 months ago",
      "avatar_url": "https://lh3.googleusercontent.com/...",
      "profile_url": "..."
    },
    ...
  ],
  "count": 244
}
```

**Error Responses:**
- `404`: Job not found
- `400`: Job not completed yet

**Example (curl):**
```bash
curl "http://localhost:8000/jobs/550e8400-e29b-41d4-a716-446655440000/reviews" \
  -o reviews.json
```

**Example (Python):**
```python
import requests
import json

job_id = "550e8400-e29b-41d4-a716-446655440000"

response = requests.get(f"http://localhost:8000/jobs/{job_id}/reviews")
reviews_data = response.json()

# Save to file
with open('reviews.json', 'w', encoding='utf-8') as f:
    json.dump(reviews_data['reviews'], f, indent=2, ensure_ascii=False)

print(f"Retrieved {reviews_data['count']} reviews")
```

---

### List All Jobs

**GET** `/jobs`

List all jobs, optionally filtered by status.

**Query Parameters:**
- `status` (optional): Filter by job status (pending, running, completed, failed, cancelled)
- `limit` (optional): Maximum number of jobs to return (default: 100, max: 1000)

**Response:**
```json
[
  {
    "job_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "url": "https://www.google.com/maps/...",
    "created_at": "2026-01-18T10:30:00",
    "reviews_count": 244,
    "scrape_time": 18.9
  },
  ...
]
```

**Example (curl):**
```bash
# Get all completed jobs
curl "http://localhost:8000/jobs?status=completed&limit=10"
```
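
**Example (Python):** A minimal sketch of the same query. It uses the standard library's `urllib` instead of `requests` so it has no dependencies. The helper names `build_jobs_url` and `list_jobs` are illustrative and not part of the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"


def build_jobs_url(status=None, limit=100):
    """Build the GET /jobs URL; the status filter is omitted when None."""
    params = {"limit": limit}
    if status is not None:
        params["status"] = status
    return f"{BASE_URL}/jobs?{urlencode(params)}"


def list_jobs(status=None, limit=100):
    """Fetch the job list from a running server and decode the JSON array."""
    with urlopen(build_jobs_url(status, limit)) as response:
        return json.load(response)


# With the server running:
#   for job in list_jobs(status="completed", limit=10):
#       print(job["job_id"], job["reviews_count"], job["scrape_time"])
```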

---

### Cancel Job

**POST** `/jobs/{job_id}/cancel`

Cancel a pending or running job.

**Response:**
```json
{
  "message": "Job cancelled successfully"
}
```

**Error Responses:**
- `404`: Job not found
- `400`: Job cannot be cancelled (already completed/failed)

---

### Delete Job

**DELETE** `/jobs/{job_id}`

Delete a job from the system (removes job data).

**Response:**
```json
{
  "message": "Job deleted successfully"
}
```
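
**Example (Python):** A sketch of cancelling and then deleting a job with the standard library. The helpers `cancel_request`, `delete_request`, and `send` are hypothetical names. `send` assumes that error responses carry the JSON `detail` body described under Error Handling.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"


def cancel_request(job_id):
    """Build the POST request that cancels a pending or running job."""
    return Request(f"{BASE_URL}/jobs/{job_id}/cancel", method="POST")


def delete_request(job_id):
    """Build the DELETE request that removes a job and its data."""
    return Request(f"{BASE_URL}/jobs/{job_id}", method="DELETE")


def send(request):
    """Send a request and return (status_code, decoded JSON body)."""
    try:
        with urlopen(request) as response:
            return response.status, json.load(response)
    except HTTPError as error:
        # Assumption: 4xx/5xx bodies are JSON objects with a "detail" field.
        return error.code, json.load(error)


# With the server running:
#   status, body = send(cancel_request(job_id))
#   if status == 400:
#       print(body["detail"])  # already completed or failed
#   send(delete_request(job_id))
```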

---

### Get Statistics

**GET** `/stats`

Get job manager statistics.

**Response:**
```json
{
  "total_jobs": 42,
  "by_status": {
    "pending": 2,
    "running": 1,
    "completed": 35,
    "failed": 3,
    "cancelled": 1
  },
  "running_jobs": 1,
  "max_concurrent_jobs": 3
}
```
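
**Example (Python):** One possible way to use these numbers before submitting more work. The functions `free_slots` and `completion_ratio` are illustrative helpers, not API features, and the sample dictionary copies the response shown for this endpoint.

```python
def free_slots(stats):
    """Jobs that can still start before max_concurrent_jobs is reached."""
    return max(stats["max_concurrent_jobs"] - stats["running_jobs"], 0)


def completion_ratio(stats):
    """Share of finished jobs (completed + failed) that completed."""
    by_status = stats["by_status"]
    completed = by_status.get("completed", 0)
    finished = completed + by_status.get("failed", 0)
    return completed / finished if finished else None


# Sample /stats response from this section.
sample = {
    "total_jobs": 42,
    "by_status": {
        "pending": 2,
        "running": 1,
        "completed": 35,
        "failed": 3,
        "cancelled": 1,
    },
    "running_jobs": 1,
    "max_concurrent_jobs": 3,
}

print(free_slots(sample))                  # 2
print(f"{completion_ratio(sample):.1%}")   # 92.1%
```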

---

### Manual Cleanup

**POST** `/cleanup`

Manually trigger cleanup of old completed/failed jobs.

**Query Parameters:**
- `max_age_hours` (optional): Maximum age in hours (default: 24)

**Response:**
```json
{
  "message": "Cleaned up jobs older than 24 hours"
}
```
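
**Example (Python):** A short sketch showing that `max_age_hours` travels in the query string of the POST rather than in a JSON body. `cleanup_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"


def cleanup_request(max_age_hours=24):
    """Build POST /cleanup with max_age_hours as a query parameter."""
    query = urlencode({"max_age_hours": max_age_hours})
    return Request(f"{BASE_URL}/cleanup?{query}", method="POST")


request = cleanup_request(max_age_hours=48)
print(request.get_method(), request.full_url)
# With the server running: urllib.request.urlopen(request)
```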

---

## Complete Workflow Example

### Python Script

```python
import requests
import time
import json

BASE_URL = "http://localhost:8000"

# 1. Start scraping job
response = requests.post(
    f"{BASE_URL}/scrape",
    json={
        "url": "https://www.google.com/maps/place/...",
        "headless": True
    }
)
job_id = response.json()['job_id']
print(f"Job started: {job_id}")

# 2. Poll until the job reaches a terminal status
while True:
    response = requests.get(f"{BASE_URL}/jobs/{job_id}")
    job = response.json()

    print(f"Status: {job['status']} - {job['progress']['message']}")

    if job['status'] == 'completed':
        print(f"✅ Completed: {job['reviews_count']} reviews in {job['scrape_time']:.1f}s")
        break
    elif job['status'] == 'failed':
        # .get() avoids a KeyError if the response carries no error_message
        print(f"❌ Failed: {job.get('error_message')}")
        break
    elif job['status'] == 'cancelled':
        # Without this branch a cancelled job would be polled forever
        print("Job was cancelled")
        break

    time.sleep(2)

# 3. Get reviews
if job['status'] == 'completed':
    response = requests.get(f"{BASE_URL}/jobs/{job_id}/reviews")
    reviews = response.json()['reviews']

    # Save to file
    with open('reviews.json', 'w', encoding='utf-8') as f:
        json.dump(reviews, f, indent=2, ensure_ascii=False)

    print(f"💾 Saved {len(reviews)} reviews to reviews.json")
```

### JavaScript/Node.js Example

```javascript
const axios = require('axios');
const fs = require('fs');

const BASE_URL = 'http://localhost:8000';

async function scrapeReviews(url) {
  // 1. Start job
  const { data: startData } = await axios.post(`${BASE_URL}/scrape`, {
    url: url,
    headless: true
  });

  const jobId = startData.job_id;
  console.log(`Job started: ${jobId}`);

  // 2. Poll until the job reaches a terminal status
  while (true) {
    const { data: job } = await axios.get(`${BASE_URL}/jobs/${jobId}`);

    console.log(`Status: ${job.status} - ${job.progress.message}`);

    if (job.status === 'completed') {
      console.log(`✅ Completed: ${job.reviews_count} reviews in ${job.scrape_time}s`);
      break;
    } else if (job.status === 'failed') {
      console.log(`❌ Failed: ${job.error_message}`);
      return;
    } else if (job.status === 'cancelled') {
      // Without this branch a cancelled job would be polled forever
      console.log('Job was cancelled');
      return;
    }

    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  // 3. Get reviews
  const { data: reviewsData } = await axios.get(`${BASE_URL}/jobs/${jobId}/reviews`);

  // Save to file
  fs.writeFileSync('reviews.json', JSON.stringify(reviewsData.reviews, null, 2));

  console.log(`💾 Saved ${reviewsData.count} reviews to reviews.json`);
}

scrapeReviews('https://www.google.com/maps/place/...');
```

---

## Performance

### Fast Scraper Performance

The API now uses the **ultra-fast DOM-only scraper**:

| Metric | Value |
|--------|-------|
| Average Time | 18.9s |
| Speedup | 8.2x faster |
| Success Rate | 100% |
| Reviews/Second | ~12.9 |

**Timing Breakdown:**
- Scrolling: ~14s (60-74%)
- Extraction: ~0.01s (0.1%)
- Setup: ~4-5s (25-30%)
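
**Example (Python):** A worked calculation of how figures like these can be derived from a single job status response. It uses the sample values from the Get Job Status section and assumes that any time not spent scrolling or extracting counts as setup. `timing_breakdown` is an illustrative helper. A single run will not necessarily fall inside the averaged ranges listed here.

```python
def timing_breakdown(job):
    """Derive throughput and per-stage shares from a job status response."""
    total = job["scrape_time"]
    scroll = job["progress"]["scroll_time"]
    extract = job["progress"]["extract_time"]
    # Assumption: time not spent scrolling or extracting counts as setup.
    setup = total - scroll - extract
    return {
        "reviews_per_second": job["reviews_count"] / total,
        "scroll_pct": 100 * scroll / total,
        "extract_pct": 100 * extract / total,
        "setup_pct": 100 * setup / total,
    }


# Values from the sample job status response in this document.
job = {
    "reviews_count": 244,
    "scrape_time": 18.9,
    "progress": {"scroll_time": 14.2, "extract_time": 0.01},
}

for name, value in timing_breakdown(job).items():
    print(f"{name}: {value:.2f}")
```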

---

## Configuration

### Server Configuration

Edit `api_server.py` to configure:

```python
# Number of concurrent scraping jobs
job_manager = JobManager(max_concurrent_jobs=3)

# Server host and port
uvicorn.run(
    "api_server:app",
    host="0.0.0.0",
    port=8000,
    reload=True
)
```

### Scraper Configuration

Pass configuration when starting a job:

```json
{
  "url": "https://www.google.com/maps/place/...",
  "headless": true,
  "max_scrolls": 35
}
```

---

## Error Handling

### HTTP Status Codes

- `200`: Success
- `400`: Bad request (invalid parameters or job state)
- `404`: Job not found
- `500`: Internal server error

### Error Response Format

```json
{
  "detail": "Error message here"
}
```

### Common Errors

**1. Job not completed yet**
```json
{
  "detail": "Job not completed yet (current status: running)"
}
```

**2. Job not found**
```json
{
  "detail": "Job not found"
}
```

**3. Maximum concurrent jobs reached**
```json
{
  "detail": "Maximum concurrent jobs reached"
}
```
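
**Example (Python):** A sketch of how a client might act on these errors. The action names and the helpers `error_detail` and `classify_error` are assumptions made for illustration. The document does not state which status code accompanies "Maximum concurrent jobs reached", so that case is matched on the `detail` text.

```python
import json


def error_detail(body):
    """Extract the `detail` string from an error body; fall back to raw text."""
    try:
        payload = json.loads(body)
    except ValueError:
        return body
    if isinstance(payload, dict):
        return str(payload.get("detail", body))
    return body


def classify_error(status_code, body):
    """Map an error response to a coarse action for the caller."""
    detail = error_detail(body)
    if status_code == 404:
        return "give_up"        # job id is unknown to the server
    if detail.startswith("Job not completed yet"):
        return "keep_polling"   # reviews are not ready; poll the status again
    if detail == "Maximum concurrent jobs reached":
        return "retry_later"    # wait for a running job to finish
    return "report"             # anything else: surface the detail text


print(classify_error(400, '{"detail": "Job not completed yet (current status: running)"}'))
```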

---

## Testing

### Run Test Script

```bash
python test_fast_api.py
```

This will:
1. Start a scraping job
2. Poll until complete
3. Retrieve and save reviews
4. Show statistics

### Manual Testing (curl)

```bash
# Start job
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "YOUR_GOOGLE_MAPS_URL", "headless": true}' \
  | jq

# Get status (replace JOB_ID)
curl "http://localhost:8000/jobs/JOB_ID" | jq

# Get reviews
curl "http://localhost:8000/jobs/JOB_ID/reviews" | jq
```

---

## Production Deployment

### Using Gunicorn

```bash
pip install gunicorn

gunicorn api_server:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000
```

### Using Docker

Create `Dockerfile`:

```dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

CMD ["python", "api_server.py"]
```

Run:
```bash
docker build -t google-reviews-api .
docker run -p 8000:8000 google-reviews-api
```

---

## Monitoring

### Check Running Jobs

```bash
curl "http://localhost:8000/stats" | jq
```

### List Recent Jobs

```bash
curl "http://localhost:8000/jobs?limit=10" | jq
```

### Auto-Cleanup

Jobs are automatically cleaned up after 24 hours. Configure in `api_server.py`:

```python
async def cleanup_jobs_periodically():
    while True:
        await asyncio.sleep(3600)  # Run every hour
        if job_manager:
            job_manager.cleanup_old_jobs(max_age_hours=24)
```

---

## Troubleshooting

### API won't start

**Error**: "Address already in use"

**Solution**: Change the port in `api_server.py` or kill the process that is already listening on port 8000:
```bash
lsof -ti:8000 | xargs kill
```

### Jobs stuck in "running" status

**Solution**: Check server logs for errors. Restart the server if needed.

### GDPR consent issues

The fast scraper automatically handles GDPR consent pages. If issues persist:
- Set `headless: false` to see what's happening
- Check server logs for consent page detection

---

## Support

For issues or questions, check:
- Server logs: Console output when running `python api_server.py`
- Interactive docs: `http://localhost:8000/docs`
- Test script: `python test_fast_api.py`

---

**Enjoy ultra-fast Google Maps scraping with the API!** 🚀