whyrating-engine-legacy/TESTING_INTERFACE.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00

Testing Interface - Quick Start Guide

A beautiful Next.js web interface for testing the Google Reviews Scraper API.

🎯 What You Get

Business Search Mode

  • Search by name - Just type "Soho Club Vilnius" instead of pasting URLs
  • Live map preview - See the business location before scraping
  • Auto-generate URL - Creates the perfect Google Maps search URL
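
As a rough illustration of the "auto-generate URL" step, the sketch below builds a search URL from a business name. It assumes the documented Google Maps URLs search format (`/maps/search/?api=1&query=...`); the function name `buildMapsSearchUrl` is hypothetical, and the interface may generate its URL differently.

```typescript
// Hypothetical helper: turn a free-text business name into a Google Maps search URL.
// Assumption: the public Maps URLs search format is used; the real app may differ.
function buildMapsSearchUrl(query: string): string {
  // Collapse repeated whitespace so "Soho   Club" and "Soho Club" yield the same URL.
  const normalized = query.trim().replace(/\s+/g, " ");
  return `https://www.google.com/maps/search/?api=1&query=${encodeURIComponent(normalized)}`;
}

console.log(buildMapsSearchUrl("Soho Club Vilnius"));
// https://www.google.com/maps/search/?api=1&query=Soho%20Club%20Vilnius
```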

Direct URL Mode

  • Paste any URL - For specific Google Maps business pages
  • Flexible input - Works with any Google Maps URL format

Real-Time Tracking

  • Live status updates - Watch your job progress in real-time
  • Performance metrics - Reviews count, time, speed
  • Beautiful UI - Clean, modern interface with status icons

Results Display

  • Review cards - Author, rating, text, avatar, date
  • Export to JSON - Download all reviews as formatted JSON
  • Scrollable list - Handle hundreds of reviews smoothly

🚀 Quick Start

1. Start the Scraper API

# From project root
docker-compose -f docker-compose.production.yml up -d

API runs at: http://localhost:8000

2. Start the Web Interface

cd web
npm install
npm run dev

Web interface runs at: http://localhost:3000 (or next available port)

💡 Usage Examples

Search Mode

  1. Click "🔍 Search Business"
  2. Type: Soho Club Vilnius
  3. Map shows the business location
  4. Click "Scrape All Reviews"
  5. Watch real-time progress
  6. Export results as JSON

URL Mode

  1. Click "🔗 Paste URL"
  2. Paste Google Maps URL
  3. Click "Scrape"
  4. View results

📊 Features

Search Interface

  • Debounced search - Updates map 500ms after typing stops
  • Enter key support - Press Enter to search
  • Visual feedback - Loading states, icons, colors
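
The debounced search described above could be sketched as follows. This is a generic trailing-edge debounce, not the project's actual code; the name `debounce` and its `cancel` method are illustrative, and only the 500ms delay comes from this guide.

```typescript
// Generic trailing-edge debounce: the wrapped function runs once,
// delayMs after the most recent call, with that call's arguments.
function debounce<A extends unknown[]>(fn: (...args: A) => void, delayMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const debounced = (...args: A): void => {
    if (timer !== undefined) clearTimeout(timer); // restart the wait on every keystroke
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, delayMs);
  };

  // Allow callers (e.g. on unmount) to drop a pending call.
  const cancel = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = undefined;
  };

  return Object.assign(debounced, { cancel });
}

// Usage: update the map preview 500ms after typing stops.
const updateMap = debounce((query: string) => console.log("preview:", query), 500);
updateMap("Soho");
updateMap("Soho Club");
updateMap("Soho Club Vilnius"); // only this call reaches the callback
```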

Job Tracking

  • Polling every 2 seconds - Real-time status updates
  • Status indicators:
    • 🔵 Running (spinner animation)
    • ✅ Completed (green checkmark)
    • ❌ Failed (red X)
    • ⏱️ Pending (clock icon)
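
One way to map job statuses to the indicators listed above is a small lookup table. The status strings (`pending`, `running`, `completed`, `failed`) and the `describeStatus` helper are assumptions for illustration; check the API responses for the exact values.

```typescript
// Assumed status values; only "completed" is spelled out elsewhere in this guide.
type JobStatus = "pending" | "running" | "completed" | "failed";

const STATUS_DISPLAY: Record<JobStatus, { icon: string; label: string }> = {
  pending: { icon: "⏱️", label: "Pending" },
  running: { icon: "🔵", label: "Running" },
  completed: { icon: "✅", label: "Completed" },
  failed: { icon: "❌", label: "Failed" },
};

// Case-insensitive lookup, since the status may arrive as "COMPLETED" or "completed".
function describeStatus(raw: string): string {
  const key = raw.toLowerCase();
  if (Object.prototype.hasOwnProperty.call(STATUS_DISPLAY, key)) {
    const { icon, label } = STATUS_DISPLAY[key as JobStatus];
    return `${icon} ${label}`;
  }
  return `❔ Unknown (${raw})`; // fall back instead of crashing on new statuses
}

console.log(describeStatus("COMPLETED")); // ✅ Completed
```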

Performance Metrics

  • Reviews count - Total scraped
  • Time taken - Seconds elapsed
  • Speed - Reviews per second
  • Start time - When job began
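
The speed metric is simply reviews divided by elapsed seconds. A minimal sketch, with a hypothetical helper name and one-decimal rounding assumed from the status card example (190 reviews in 19.9s shown as 9.5):

```typescript
// Reviews per second, rounded to one decimal place.
// Returns 0 when no time has elapsed yet, to avoid dividing by zero.
function reviewsPerSecond(reviewCount: number, elapsedSeconds: number): number {
  if (!(elapsedSeconds > 0)) return 0;
  return Math.round((reviewCount / elapsedSeconds) * 10) / 10;
}

console.log(reviewsPerSecond(190, 19.9)); // 9.5
```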

Export

  • JSON download - Formatted, ready to use
  • Filename - Includes job ID for tracking
  • Complete data - All review fields preserved
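
The export step can be pictured as the sketch below. The `reviews-{job_id}.json` filename pattern comes from this guide; the `Review` field names and the `buildExport` helper are assumptions based on what the review cards show.

```typescript
// Assumed review shape, based on the fields shown on each review card.
interface Review {
  author: string;
  rating: number;
  text: string;
  avatar?: string;
  date: string;
}

// Build the download payload: a filename containing the job ID
// and pretty-printed JSON (2-space indent) with every field preserved.
function buildExport(jobId: string, reviews: Review[]): { filename: string; body: string } {
  return {
    filename: `reviews-${jobId}.json`,
    body: JSON.stringify(reviews, null, 2),
  };
}

const exported = buildExport("5f1d394f-10c5-4f30-8c2b-cb789c05918f", [
  { author: "John Doe", rating: 5, text: "Great place!", date: "2 weeks ago" },
]);
console.log(exported.filename); // reviews-5f1d394f-10c5-4f30-8c2b-cb789c05918f.json
```

In the browser, the `body` string would typically be wrapped in a `Blob` and offered through a temporary download link.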

🏗️ Architecture

┌─────────────────────────────────────┐
│   Web Interface (Next.js)           │
│   http://localhost:3000              │
│                                      │
│   - Search business by name          │
│   - Or paste URL directly            │
│   - View map preview                 │
│   - Real-time job tracking           │
│   - Export results                   │
└──────────────┬──────────────────────┘
               │ API Calls
               ▼
┌─────────────────────────────────────┐
│   API Proxy (Next.js API Routes)    │
│                                      │
│   POST   /api/scrape                │
│   GET    /api/jobs/[id]             │
│   GET    /api/jobs/[id]/reviews     │
└──────────────┬──────────────────────┘
               │ Forward to
               ▼
┌─────────────────────────────────────┐
│   Scraper API (FastAPI)             │
│   http://localhost:8000              │
│                                      │
│   - Job queue management             │
│   - Chrome + SeleniumBase            │
│   - PostgreSQL storage               │
└─────────────────────────────────────┘
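
To make the forwarding step concrete, here is a sketch of how the proxy routes in the diagram could map onto scraper API paths. The upstream paths (`/scrape`, `/jobs/{id}`, `/jobs/{id}/reviews`) are taken from the curl examples in this guide; the `toUpstreamUrl` helper itself is hypothetical and not the project's actual route handler.

```typescript
// Map a proxy path (Next.js API route) to the matching scraper API URL.
// Returns null for paths the proxy does not expose.
function toUpstreamUrl(proxyPath: string, apiBase: string): string | null {
  const base = apiBase.replace(/\/+$/, ""); // tolerate a trailing slash in the base URL

  if (proxyPath === "/api/scrape") return `${base}/scrape`;

  // /api/jobs/[id] and /api/jobs/[id]/reviews
  const match = /^\/api\/jobs\/([^/]+)(\/reviews)?$/.exec(proxyPath);
  if (!match) return null;

  const jobId = encodeURIComponent(match[1]);
  return `${base}/jobs/${jobId}${match[2] ?? ""}`;
}

console.log(toUpstreamUrl("/api/jobs/abc123/reviews", "http://localhost:8000"));
// http://localhost:8000/jobs/abc123/reviews
```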

🎨 UI Components

Mode Toggle

┌──────────────┬──────────────┐
│ 🔍 Search    │ 🔗 Paste URL │
└──────────────┴──────────────┘

Search Interface

┌─────────────────────────────────────┐
│ 🔍  Business name and location...   │
├─────────────────────────────────────┤
│                                      │
│         Google Maps Embed            │
│                                      │
├─────────────────────────────────────┤
│       📥 Scrape All Reviews          │
└─────────────────────────────────────┘

Job Status Card

┌─────────────────────────────────────┐
│ ✅ Job Status: COMPLETED             │
│ 5f1d394f-10c5-4f30-8c2b-cb789c05918f│
│                                      │
│  190        19.9s       9.5         │
│ Reviews     Time    Reviews/sec      │
└─────────────────────────────────────┘

Review Card

┌─────────────────────────────────────┐
│ 👤 John Doe          ⭐⭐⭐⭐⭐      │
│ 2 weeks ago                          │
│                                      │
│ Great place! Really enjoyed...       │
└─────────────────────────────────────┘

🔧 Configuration

Environment Variables

Create web/.env.local:

# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
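
A minimal sketch of how the default might be applied when the variable is unset. The `resolveApiUrl` helper is hypothetical; the variable name and default value are the ones shown above.

```typescript
const DEFAULT_API_URL = "http://localhost:8000";

// Resolve the scraper API base URL from an environment-like object,
// falling back to the default when the variable is missing or blank.
function resolveApiUrl(env: Record<string, string | undefined>): string {
  const raw = (env["NEXT_PUBLIC_API_URL"] ?? "").trim();
  const url = raw.length > 0 ? raw : DEFAULT_API_URL;
  return url.replace(/\/+$/, ""); // drop trailing slashes so paths can be appended safely
}

console.log(resolveApiUrl({})); // http://localhost:8000
console.log(resolveApiUrl({ NEXT_PUBLIC_API_URL: "http://api:8000/" })); // http://api:8000
```

In the app this would usually be called with `process.env`.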

Custom Port

If port 3000 is taken, Next.js auto-selects the next available port (3001, 3002, etc.)

🐛 Troubleshooting

Web interface won't connect to API

# Check API is running
curl http://localhost:8000/health/live

# Check for CORS issues
# (The browser only calls the Next.js API routes on the same origin, so CORS should not come into play)

Map not showing

  • Check search query is at least 2 characters
  • Wait 500ms after typing (debounce delay)
  • Press Enter or click Search button
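
The first check in the list above amounts to a length guard before the map preview is requested. A sketch, assuming surrounding whitespace is ignored (the helper name and the trimming are assumptions):

```typescript
const MIN_QUERY_LENGTH = 2;

// True when the query is long enough to trigger a map preview.
function canPreview(query: string): boolean {
  return query.trim().length >= MIN_QUERY_LENGTH;
}

console.log(canPreview("S")); // false
console.log(canPreview("So")); // true
```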

Reviews not loading

  • Check job status reached "completed"
  • Look for error message in red box
  • Check browser console for errors

📱 Mobile Friendly

The interface is fully responsive:

  • Mobile: Single column, touch-optimized
  • Tablet: Comfortable layout
  • Desktop: Full width with max-width constraint

🎯 Example Businesses to Test

Soho Club Vilnius
McDonald's Times Square New York
Eiffel Tower Paris
Tokyo Tower Japan
Sydney Opera House

🚀 Production Deployment

Option 1: Vercel

cd web
vercel deploy

Option 2: Docker

cd web
docker build -t scraper-web .
docker run -p 3000:3000 -e NEXT_PUBLIC_API_URL=http://api:8000 scraper-web

Option 3: Self-hosted

cd web
npm run build
npm run start

📝 Notes

  • Interface polls job status every 2 seconds
  • Polling stops when job completes or fails
  • Reviews fetched with limit of 1000 (configurable)
  • Export creates reviews-{job_id}.json file
  • Calls to the scraper API are made server-side by the Next.js API routes, so the browser never talks to the scraper API directly
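
The polling behaviour in these notes can be sketched as a loop that stops on a terminal status. This is an illustration under assumptions, not the interface's actual code: the `JobSnapshot` shape, the status strings, and `pollUntilDone` are hypothetical, and the status fetcher is injected so the loop stays independent of any HTTP client.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed";

// Assumed minimal shape of a job status response.
interface JobSnapshot {
  status: JobStatus;
  reviews_count?: number;
}

// Poll until the job completes or fails, reporting every snapshot along the way.
async function pollUntilDone(
  fetchJob: () => Promise<JobSnapshot>,
  intervalMs: number = 2000, // the guide's polling interval
  onUpdate: (job: JobSnapshot) => void = () => {},
): Promise<JobSnapshot> {
  for (;;) {
    const job = await fetchJob();
    onUpdate(job);
    if (job.status === "completed" || job.status === "failed") return job;
    // Wait before the next status check.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A caller would pass a function that requests `/api/jobs/[id]` and parses the JSON response, then load the reviews once the returned status is `completed`.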

🎉 Benefits Over curl

Before (curl):

curl -X POST http://localhost:8000/scrape -H "Content-Type: application/json" -d '{"url":"..."}'
# Copy job_id
curl http://localhost:8000/jobs/{job_id}
# Wait and check again
curl http://localhost:8000/jobs/{job_id}
# Finally get reviews
curl http://localhost:8000/jobs/{job_id}/reviews

After (Web UI):

  1. Type business name
  2. Click "Scrape All Reviews"
  3. Watch progress
  4. Export JSON

Much better! 🚀