Testing Interface - Quick Start Guide
A beautiful Next.js web interface for testing the Google Reviews Scraper API.
🎯 What You Get
Business Search Mode
- Search by name - Just type "Soho Club Vilnius" instead of pasting URLs
- Live map preview - See the business location before scraping
- Auto-generate URL - Builds the Google Maps search URL from the name you typed
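As a rough illustration of the auto-generated URL, the sketch below builds a search link from a business name. It assumes the public Google Maps URLs format (`api=1` plus a `query` parameter); `buildMapsSearchUrl` is a hypothetical helper name, and the interface's own URL builder may differ.

```typescript
// Sketch only: assumes the Google Maps URLs "search" format.
// buildMapsSearchUrl is a hypothetical name, not taken from the project.
function buildMapsSearchUrl(businessName: string): string {
  const query = businessName.trim();
  if (query.length === 0) {
    throw new Error("Business name must not be empty");
  }
  // URLSearchParams handles the encoding (spaces become "+").
  const params = new URLSearchParams({ api: "1", query });
  return `https://www.google.com/maps/search/?${params.toString()}`;
}

console.log(buildMapsSearchUrl("Soho Club Vilnius"));
// https://www.google.com/maps/search/?api=1&query=Soho+Club+Vilnius
```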
Direct URL Mode
- Paste any URL - For specific Google Maps business pages
- Flexible input - Works with any Google Maps URL format
Real-Time Tracking
- Live status updates - Watch your job progress in real-time
- Performance metrics - Reviews count, time, speed
- Beautiful UI - Clean, modern interface with status icons
Results Display
- Review cards - Author, rating, text, avatar, date
- Export to JSON - Download all reviews as formatted JSON
- Scrollable list - Handle hundreds of reviews smoothly
🚀 Quick Start
1. Start the Scraper API
# From project root
docker-compose -f docker-compose.production.yml up -d
API runs at: http://localhost:8000
2. Start the Web Interface
cd web
npm install
npm run dev
Web interface runs at: http://localhost:3000 (or next available port)
💡 Usage Examples
Search Mode (Recommended)
- Click "🔍 Search Business"
- Type: Soho Club Vilnius
- Map shows the business location
- Click "Scrape All Reviews"
- Watch real-time progress
- Export results as JSON
URL Mode
- Click "🔗 Paste URL"
- Paste Google Maps URL
- Click "Scrape"
- View results
📊 Features
Search Interface
- Debounced search - Updates map 500ms after typing stops
- Enter key support - Press Enter to search
- Visual feedback - Loading states, icons, colors
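The debounced search could look roughly like the sketch below: each keystroke resets a timer, and the map only updates once typing has paused for 500ms. The `debounce` and `updateMap` names are illustrative, and the 2-character minimum is borrowed from the troubleshooting notes; the real component may be structured differently.

```typescript
// Generic debounce: only the last call within delayMs actually runs.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer); // cancel the pending call
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical handler: refresh the map preview 500ms after typing stops.
const updateMap = debounce((query: string) => {
  if (query.trim().length >= 2) {
    console.log(`Updating map preview for "${query}"`);
  }
}, 500);

updateMap("So");
updateMap("Soho Club");
updateMap("Soho Club Vilnius"); // only this call reaches the handler
```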
Job Tracking
- Polling every 2 seconds - Real-time status updates
- Status indicators:
- 🔵 Running (spinner animation)
- ✅ Completed (green checkmark)
- ❌ Failed (red X)
- ⏱️ Pending (clock icon)
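A minimal sketch of the polling loop, assuming the job endpoint returns a JSON object with a lowercase `status` field. Only "completed" appears verbatim in this guide; the other status strings, the `JobSnapshot` shape, and the `pollJob` helper are assumptions made for illustration.

```typescript
// Assumed status values; only "completed" is confirmed by this guide.
type JobStatus = "pending" | "running" | "completed" | "failed";

interface JobSnapshot {
  status: JobStatus;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fetchJob every intervalMs until the job completes or fails,
// reporting each snapshot through onUpdate.
async function pollJob(
  fetchJob: () => Promise<JobSnapshot>,
  onUpdate: (job: JobSnapshot) => void,
  intervalMs = 2000,
): Promise<JobSnapshot> {
  for (;;) {
    const job = await fetchJob();
    onUpdate(job);
    if (job.status === "completed" || job.status === "failed") {
      return job; // terminal state: stop polling
    }
    await sleep(intervalMs);
  }
}
```

In the browser this might be called as `pollJob(() => fetch("/api/jobs/" + jobId).then((r) => r.json()), setJob)`, where `jobId` and `setJob` stand in for whatever the page component actually uses.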
Performance Metrics
- Reviews count - Total scraped
- Time taken - Seconds elapsed
- Speed - Reviews per second
- Start time - When job began
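The speed metric is simply reviews divided by elapsed seconds. A small sketch, with rounding to one decimal place assumed from the sample job status card in this guide (190 reviews in 19.9s shown as 9.5):

```typescript
// Reviews per second, rounded to one decimal place.
// Returns 0 while no time has elapsed to avoid dividing by zero.
function reviewsPerSecond(reviewCount: number, elapsedSeconds: number): number {
  if (elapsedSeconds <= 0) return 0;
  return Math.round((reviewCount / elapsedSeconds) * 10) / 10;
}

console.log(reviewsPerSecond(190, 19.9)); // 9.5
```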
Export
- JSON download - Formatted, ready to use
- Filename - Includes job ID for tracking
- Complete data - All review fields preserved
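The export step can be pictured as the sketch below, which produces the filename and the formatted JSON body. The `Review` field names are guesses based on the review card fields listed earlier, and `buildExport` is a hypothetical helper, not the project's actual function.

```typescript
// Field names are assumptions based on the review card description.
interface Review {
  author: string;
  rating: number;
  text: string;
  date: string;
  avatar?: string;
}

// Returns the download filename and a pretty-printed JSON body.
function buildExport(
  jobId: string,
  reviews: Review[],
): { filename: string; body: string } {
  return {
    filename: `reviews-${jobId}.json`,
    body: JSON.stringify(reviews, null, 2), // 2-space indent for readability
  };
}

const sample = buildExport("5f1d394f-10c5-4f30-8c2b-cb789c05918f", [
  { author: "John Doe", rating: 5, text: "Great place!", date: "2 weeks ago" },
]);
console.log(sample.filename);
// reviews-5f1d394f-10c5-4f30-8c2b-cb789c05918f.json
```

In the browser, the body would typically be wrapped in a `Blob` and offered through a link with a `download` attribute set to the filename.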
🏗️ Architecture
┌─────────────────────────────────────┐
│  Web Interface (Next.js)            │
│  http://localhost:3000              │
│                                     │
│  - Search business by name          │
│  - Or paste URL directly            │
│  - View map preview                 │
│  - Real-time job tracking           │
│  - Export results                   │
└──────────────┬──────────────────────┘
               │ API Calls
               ▼
┌─────────────────────────────────────┐
│  API Proxy (Next.js API Routes)     │
│                                     │
│  POST /api/scrape                   │
│  GET  /api/jobs/[id]                │
│  GET  /api/jobs/[id]/reviews        │
└──────────────┬──────────────────────┘
               │ Forward to
               ▼
┌─────────────────────────────────────┐
│  Scraper API (FastAPI)              │
│  http://localhost:8000              │
│                                     │
│  - Job queue management             │
│  - Chrome + SeleniumBase            │
│  - PostgreSQL storage               │
└─────────────────────────────────────┘
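The forwarding step in the middle layer can be sketched as a path mapping. This assumes each proxy route maps to the scraper endpoint of the same name without the `/api` prefix, which is inferred from the curl examples at the end of this guide; `toUpstreamUrl` is a hypothetical helper, and the actual route handlers may do more (error handling, query parameters).

```typescript
// Maps a proxy route such as /api/jobs/123 to the scraper API URL.
// The default base matches the documented NEXT_PUBLIC_API_URL default.
function toUpstreamUrl(
  proxyPath: string,
  apiBase = "http://localhost:8000",
): string {
  if (!proxyPath.startsWith("/api/")) {
    throw new Error(`Not a proxy route: ${proxyPath}`);
  }
  const base = apiBase.replace(/\/+$/, ""); // drop trailing slashes
  return base + proxyPath.slice("/api".length);
}

console.log(toUpstreamUrl("/api/scrape"));
// http://localhost:8000/scrape
console.log(toUpstreamUrl("/api/jobs/abc/reviews", "http://api:8000/"));
// http://api:8000/jobs/abc/reviews
```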
🎨 UI Components
Mode Toggle
┌──────────────┬──────────────┐
│ 🔍 Search    │ 🔗 Paste URL │
└──────────────┴──────────────┘
Search Interface
┌─────────────────────────────────────┐
│ 🔍 Business name and location...    │
├─────────────────────────────────────┤
│                                     │
│          Google Maps Embed          │
│                                     │
├─────────────────────────────────────┤
│ 📥 Scrape All Reviews               │
└─────────────────────────────────────┘
Job Status Card
┌─────────────────────────────────────┐
│ ✅ Job Status: COMPLETED            │
│ 5f1d394f-10c5-4f30-8c2b-cb789c05918f│
│                                     │
│ 190         19.9s       9.5         │
│ Reviews     Time        Reviews/sec │
└─────────────────────────────────────┘
Review Card
┌─────────────────────────────────────┐
│ 👤 John Doe              ⭐⭐⭐⭐⭐ │
│    2 weeks ago                      │
│                                     │
│ Great place! Really enjoyed...      │
└─────────────────────────────────────┘
🔧 Configuration
Environment Variables
Create web/.env.local:
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
Custom Port
If port 3000 is taken, Next.js auto-selects the next available port (3001, 3002, etc.)
🐛 Troubleshooting
Web interface won't connect to API
# Check API is running
curl http://localhost:8000/health/live
# Check for CORS issues
# (the browser only calls the Next.js API routes on the same origin,
#  which forward requests server-side, so CORS should not apply)
Map not showing
- Check search query is at least 2 characters
- Wait 500ms after typing (debounce delay)
- Press Enter or click Search button
Reviews not loading
- Check job status reached "completed"
- Look for error message in red box
- Check browser console for errors
📱 Mobile Friendly
The interface is fully responsive:
- Mobile: Single column, touch-optimized
- Tablet: Comfortable layout
- Desktop: Centered layout capped at a max width
🎯 Example Businesses to Test
Soho Club Vilnius
McDonald's Times Square New York
Eiffel Tower Paris
Tokyo Tower Japan
Sydney Opera House
🚀 Production Deployment
Option 1: Vercel (Recommended)
cd web
vercel deploy
Option 2: Docker
cd web
docker build -t scraper-web .
docker run -p 3000:3000 -e NEXT_PUBLIC_API_URL=http://api:8000 scraper-web
Option 3: Self-hosted
cd web
npm run build
npm run start
📝 Notes
- Interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews fetched with limit of 1000 (configurable)
- Export creates a reviews-{job_id}.json file
- All processing happens server-side (secure API calls)
🎉 Benefits Over curl
Before (curl):
curl -X POST http://localhost:8000/scrape -H 'Content-Type: application/json' -d '{"url":"..."}'
# Copy job_id
curl http://localhost:8000/jobs/{job_id}
# Wait and check again
curl http://localhost:8000/jobs/{job_id}
# Finally get reviews
curl http://localhost:8000/jobs/{job_id}/reviews
After (Web UI):
- Type business name
- Click "Scrape All Reviews"
- Watch progress
- Export JSON
Much better! 🚀