Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

TESTING_INTERFACE.md Normal file

@@ -0,0 +1,268 @@
# Testing Interface - Quick Start Guide
A beautiful Next.js web interface for testing the Google Reviews Scraper API.
## 🎯 What You Get
### Business Search Mode
- **Search by name** - Just type "Soho Club Vilnius" instead of pasting URLs
- **Live map preview** - See the business location before scraping
- **Auto-generate URL** - Builds the Google Maps search URL from the business name
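As a rough illustration of the auto-generated URL, the sketch below builds a link in the public Google Maps URLs search format. The helper name `buildMapsSearchUrl` is hypothetical, and the interface's real URL builder is not shown in this guide, so treat this as one plausible approach rather than the actual implementation.

```typescript
// Hypothetical helper, not taken from the project source.
// Uses the documented Google Maps URLs "search" action (api=1 plus a query parameter).
function buildMapsSearchUrl(query: string): string {
  const trimmed = query.trim();
  if (trimmed.length === 0) {
    throw new Error("query must not be empty");
  }
  // encodeURIComponent turns spaces into %20 and escapes reserved characters.
  return `https://www.google.com/maps/search/?api=1&query=${encodeURIComponent(trimmed)}`;
}
```

Passing `"Soho Club Vilnius"` to this helper would give a URL that can be submitted in Direct URL Mode as well.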
### Direct URL Mode
- **Paste any URL** - For specific Google Maps business pages
- **Flexible input** - Works with any Google Maps URL format
### Real-Time Tracking
- **Live status updates** - Watch your job progress in real-time
- **Performance metrics** - Reviews count, time, speed
- **Beautiful UI** - Clean, modern interface with status icons
### Results Display
- **Review cards** - Author, rating, text, avatar, date
- **Export to JSON** - Download all reviews as formatted JSON
- **Scrollable list** - Handle hundreds of reviews smoothly
## 🚀 Quick Start
### 1. Start the Scraper API
```bash
# From project root
docker-compose -f docker-compose.production.yml up -d
```
API runs at: **http://localhost:8000**
### 2. Start the Web Interface
```bash
cd web
npm install
npm run dev
```
Web interface runs at: **http://localhost:3000** (or next available port)
## 💡 Usage Examples
### Search Mode (Recommended)
1. Click "🔍 Search Business"
2. Type: `Soho Club Vilnius`
3. Map shows the business location
4. Click "Scrape All Reviews"
5. Watch real-time progress
6. Export results as JSON
### URL Mode
1. Click "🔗 Paste URL"
2. Paste Google Maps URL
3. Click "Scrape"
4. View results
## 📊 Features
### Search Interface
- **Debounced search** - Updates map 500ms after typing stops
- **Enter key support** - Press Enter to search
- **Visual feedback** - Loading states, icons, colors
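The debounce behaviour above can be sketched with a small timer wrapper. This is a minimal version under assumptions: the names `debounce` and `cancel` are illustrative, and the real component may use a React hook or a library instead.

```typescript
// Illustrative debounce: delays fn until waitMs have passed without a new call.
type Debounced<A extends unknown[]> = ((...args: A) => void) & { cancel: () => void };

function debounce<A extends unknown[]>(fn: (...args: A) => void, waitMs: number): Debounced<A> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const run = (...args: A): void => {
    // Each new call discards the pending one, so only the last call fires.
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, waitMs);
  };

  const cancel = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
  };

  return Object.assign(run, { cancel });
}
```

With `waitMs` set to 500, typing "Soho", then "Soho Club", then "Soho Club Vilnius" in quick succession would trigger a single map update for the final text. Calling `cancel` when the component unmounts avoids a late update.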
### Job Tracking
- **Polling every 2 seconds** - Real-time status updates
- **Status indicators**:
- 🔵 Running (spinner animation)
- ✅ Completed (green checkmark)
- ❌ Failed (red X)
- ⏱️ Pending (clock icon)
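The polling loop described above could look roughly like the following. The lowercase status strings and the injected `fetchStatus` and `sleep` callbacks are assumptions made for this sketch; the actual response shape of the job endpoint is not documented here.

```typescript
// Assumed status values, mirroring the four indicators listed above.
type JobStatus = "pending" | "running" | "completed" | "failed";

// Polling ends once the job reaches either terminal status.
function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}

// fetchStatus and sleep are injected so the loop is not tied to fetch() or real timers.
async function pollJob(
  fetchStatus: () => Promise<JobStatus>,
  onUpdate: (status: JobStatus) => void,
  intervalMs = 2000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<JobStatus> {
  for (;;) {
    const status = await fetchStatus();
    onUpdate(status);
    if (isTerminal(status)) return status;
    await sleep(intervalMs);
  }
}
```

Injecting `sleep` keeps the 2-second interval out of tests, and returning the final status lets the caller decide whether to load reviews or show an error.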
### Performance Metrics
- **Reviews count** - Total scraped
- **Time taken** - Seconds elapsed
- **Speed** - Reviews per second
- **Start time** - When job began
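The speed metric is simply the review count divided by elapsed seconds. A small sketch, with hypothetical function names, that also guards against a zero elapsed time:

```typescript
// Hypothetical helpers for the "Speed" metric.
function reviewsPerSecond(reviewCount: number, elapsedSeconds: number): number {
  // Avoid dividing by zero before any time has elapsed.
  if (elapsedSeconds <= 0) return 0;
  return reviewCount / elapsedSeconds;
}

// Formats the speed with one decimal place for display.
function formatSpeed(reviewCount: number, elapsedSeconds: number): string {
  return reviewsPerSecond(reviewCount, elapsedSeconds).toFixed(1);
}
```

For a job with 190 reviews in 19.9 seconds, `formatSpeed(190, 19.9)` returns `"9.5"`.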
### Export
- **JSON download** - Formatted, ready to use
- **Filename** - Includes job ID for tracking
- **Complete data** - All review fields preserved
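A sketch of how the export payload might be assembled. The `reviews-{job_id}.json` filename pattern comes from the notes in this guide; the function name `buildExport` and the two-space indentation are assumptions.

```typescript
// Hypothetical helper: builds the filename and formatted JSON body for a download.
function buildExport(
  jobId: string,
  reviews: readonly unknown[],
): { filename: string; body: string } {
  return {
    filename: `reviews-${jobId}.json`,
    // Serializing the objects as-is preserves every review field.
    body: JSON.stringify(reviews, null, 2),
  };
}
```

In the browser, the returned `body` would typically be wrapped in a `Blob` and offered through a temporary download link.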
## 🏗️ Architecture
```
┌─────────────────────────────────────┐
│  Web Interface (Next.js)            │
│  http://localhost:3000              │
│                                     │
│  - Search business by name          │
│  - Or paste URL directly            │
│  - View map preview                 │
│  - Real-time job tracking           │
│  - Export results                   │
└──────────────┬──────────────────────┘
               │ API Calls
┌─────────────────────────────────────┐
│  API Proxy (Next.js API Routes)     │
│                                     │
│  POST /api/scrape                   │
│  GET  /api/jobs/[id]                │
│  GET  /api/jobs/[id]/reviews        │
└──────────────┬──────────────────────┘
               │ Forward to
┌─────────────────────────────────────┐
│  Scraper API (FastAPI)              │
│  http://localhost:8000              │
│                                     │
│  - Job queue management             │
│  - Chrome + SeleniumBase            │
│  - PostgreSQL storage               │
└─────────────────────────────────────┘
```
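The proxy routes in the diagram line up with the scraper endpoints used in the curl examples later in this guide (`/scrape`, `/jobs/{job_id}`, `/jobs/{job_id}/reviews`). Assuming the proxy simply strips the `/api` prefix, the path mapping could be sketched as follows. `toUpstreamUrl` is a hypothetical name, and the real route handlers may be structured differently.

```typescript
// Hypothetical mapping from a proxy path to the scraper API URL.
// Assumes the proxy forwards by dropping the leading "/api" segment.
function toUpstreamUrl(proxyPath: string, apiBaseUrl = "http://localhost:8000"): string {
  if (!proxyPath.startsWith("/api/")) {
    throw new Error(`not a proxy path: ${proxyPath}`);
  }
  // Drop trailing slashes so the joined URL has exactly one separator.
  const base = apiBaseUrl.replace(/\/+$/, "");
  return base + proxyPath.slice("/api".length);
}
```

A route handler would then forward the incoming request to the returned URL and relay the response to the browser.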
## 🎨 UI Components
### Mode Toggle
```
┌──────────────┬──────────────┐
│ 🔍 Search    │ 🔗 Paste URL │
└──────────────┴──────────────┘
```
### Search Interface
```
┌─────────────────────────────────────┐
│ 🔍 Business name and location...    │
├─────────────────────────────────────┤
│                                     │
│          Google Maps Embed          │
│                                     │
├─────────────────────────────────────┤
│        📥 Scrape All Reviews        │
└─────────────────────────────────────┘
```
### Job Status Card
```
┌─────────────────────────────────────┐
│ ✅ Job Status: COMPLETED            │
│ 5f1d394f-10c5-4f30-8c2b-cb789c05918f│
│                                     │
│  190         19.9s     9.5          │
│  Reviews     Time      Reviews/sec  │
└─────────────────────────────────────┘
```
### Review Card
```
┌─────────────────────────────────────┐
│ 👤 John Doe              ⭐⭐⭐⭐⭐ │
│    2 weeks ago                      │
│                                     │
│ Great place! Really enjoyed...      │
└─────────────────────────────────────┘
```
## 🔧 Configuration
### Environment Variables
Create `web/.env.local`:
```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```
### Custom Port
If port 3000 is taken, Next.js auto-selects the next available port (3001, 3002, etc.)
## 🐛 Troubleshooting
### Web interface won't connect to API
```bash
# Check API is running
curl http://localhost:8000/health/live
# CORS is unlikely to be the cause: the browser only calls the same-origin
# Next.js API routes, which forward requests to the scraper API
```
### Map not showing
- Check search query is at least 2 characters
- Wait 500ms after typing (debounce delay)
- Press Enter or click Search button
### Reviews not loading
- Check job status reached "completed"
- Look for error message in red box
- Check browser console for errors
## 📱 Mobile Friendly
The interface is fully responsive:
- Mobile: Single column, touch-optimized
- Tablet: Comfortable layout
- Desktop: Full width with max-width constraint
## 🎯 Example Businesses to Test
```
Soho Club Vilnius
McDonald's Times Square New York
Eiffel Tower Paris
Tokyo Tower Japan
Sydney Opera House
```
## 🚀 Production Deployment
### Option 1: Vercel (Recommended)
```bash
cd web
vercel deploy
```
### Option 2: Docker
```bash
cd web
docker build -t scraper-web .
docker run -p 3000:3000 -e NEXT_PUBLIC_API_URL=http://api:8000 scraper-web
```
### Option 3: Self-hosted
```bash
cd web
npm run build
npm run start
```
## 📝 Notes
- Interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews fetched with limit of 1000 (configurable)
- Export creates `reviews-{job_id}.json` file
- The browser talks only to the Next.js API routes; calls to the scraper API are made server-side
## 🎉 Benefits Over curl
Before (curl):
```bash
curl -X POST http://localhost:8000/scrape -H 'Content-Type: application/json' -d '{"url":"..."}'
# Copy job_id
curl http://localhost:8000/jobs/{job_id}
# Wait and check again
curl http://localhost:8000/jobs/{job_id}
# Finally get reviews
curl http://localhost:8000/jobs/{job_id}/reviews
```
After (Web UI):
1. Type business name
2. Click "Scrape All Reviews"
3. Watch progress
4. Export JSON
**Much better! 🚀**