# Testing Interface - Quick Start Guide

A beautiful Next.js web interface for testing the Google Reviews Scraper API.

## 🎯 What You Get

### Business Search Mode
- **Search by name** - Just type "Soho Club Vilnius" instead of pasting URLs
- **Live map preview** - See the business location before scraping
- **Auto-generate URL** - Builds the Google Maps search URL from your query

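
As a rough illustration of the auto-generated URL, the sketch below builds a search link from a business name. It assumes the publicly documented Google Maps URLs search format (`api=1` plus a URL-encoded `query`); the function name `buildMapsSearchUrl` is hypothetical, and the interface's actual URL builder may differ.

```typescript
// Hypothetical helper: turn a free-text business query into a Google Maps search URL.
// Assumes the documented Maps URLs format (api=1 plus a URL-encoded query parameter).
function buildMapsSearchUrl(query: string): string {
  const trimmed = query.trim();
  if (trimmed.length === 0) {
    throw new Error("query must not be empty");
  }
  return `https://www.google.com/maps/search/?api=1&query=${encodeURIComponent(trimmed)}`;
}

console.log(buildMapsSearchUrl("Soho Club Vilnius"));
// https://www.google.com/maps/search/?api=1&query=Soho%20Club%20Vilnius
```
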
### Direct URL Mode
- **Paste any URL** - For specific Google Maps business pages
- **Flexible input** - Works with any Google Maps URL format

### Real-Time Tracking
- **Live status updates** - Watch your job progress in real-time
- **Performance metrics** - Reviews count, time, speed
- **Beautiful UI** - Clean, modern interface with status icons

### Results Display
- **Review cards** - Author, rating, text, avatar, date
- **Export to JSON** - Download all reviews as formatted JSON
- **Scrollable list** - Handle hundreds of reviews smoothly

## 🚀 Quick Start

### 1. Start the Scraper API

```bash
# From project root
docker-compose -f docker-compose.production.yml up -d
```

API runs at: **http://localhost:8000**

### 2. Start the Web Interface

```bash
cd web
npm install
npm run dev
```

Web interface runs at: **http://localhost:3000** (or next available port)

## 💡 Usage Examples

### Search Mode (Recommended)
1. Click "🔍 Search Business"
2. Type: `Soho Club Vilnius`
3. Map shows the business location
4. Click "Scrape All Reviews"
5. Watch real-time progress
6. Export results as JSON

### URL Mode
1. Click "🔗 Paste URL"
2. Paste Google Maps URL
3. Click "Scrape"
4. View results

## 📊 Features

### Search Interface
- **Debounced search** - Updates map 500ms after typing stops
- **Enter key support** - Press Enter to search
- **Visual feedback** - Loading states, icons, colors

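
The debounced search can be pictured with the sketch below. It is a generic debounce, not the interface's actual component code: `debounce`, `shouldSearch`, and `updateMap` are hypothetical names, while the 500ms delay and the two-character minimum come from this guide.

```typescript
// Hypothetical constants taken from the behaviour described in this guide.
const DEBOUNCE_MS = 500;
const MIN_QUERY_LENGTH = 2;

// Returns a wrapper that runs `fn` only after `delayMs` have passed without another call.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    if (timer !== undefined) {
      clearTimeout(timer); // a newer keystroke cancels the pending update
    }
    timer = setTimeout(() => {
      timer = undefined;
      fn(...args);
    }, delayMs);
  };
}

// The map is only refreshed for queries of at least two characters after trimming.
function shouldSearch(query: string): boolean {
  return query.trim().length >= MIN_QUERY_LENGTH;
}

// Usage sketch: `updateMap` stands in for whatever refreshes the map embed.
const updateMap = (query: string): void => {
  if (shouldSearch(query)) {
    console.log(`map preview -> ${query}`);
  }
};
const onQueryChange = debounce(updateMap, DEBOUNCE_MS);
onQueryChange("So");
onQueryChange("Soho Club Vilnius"); // only this call reaches updateMap, after the delay
```

Pressing Enter would bypass the wrapper and call `updateMap` directly, which matches the "Enter key support" bullet without waiting for the delay.
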
### Job Tracking
- **Polling every 2 seconds** - Real-time status updates
- **Status indicators**:
  - 🔵 Running (spinner animation)
  - ✅ Completed (green checkmark)
  - ❌ Failed (red X)
  - ⏱️ Pending (clock icon)

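
A minimal sketch of the polling behaviour, assuming the API reports the four statuses as lowercase strings (`"pending"`, `"running"`, `"completed"`, `"failed"`). The names `JobStatus`, `isTerminal`, and `pollUntilDone` are hypothetical, and the status fetcher is injected so the example runs without a live API.

```typescript
// Assumed status values; the real API may use different strings.
type JobStatus = "pending" | "running" | "completed" | "failed";

// Polling stops once the job has completed or failed.
function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}

// Ask for the status, report it, and wait `intervalMs` between attempts
// until a terminal status is seen.
async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  intervalMs = 2000,
  onUpdate: (status: JobStatus) => void = () => {},
): Promise<JobStatus> {
  for (;;) {
    const status = await getStatus();
    onUpdate(status);
    if (isTerminal(status)) {
      return status;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage sketch with a canned status sequence instead of real HTTP calls.
const canned: JobStatus[] = ["pending", "running", "completed"];
pollUntilDone(async () => canned.shift() ?? "failed", 10, (s) => console.log(s))
  .then((finalStatus) => console.log(`finished: ${finalStatus}`));
```

In the real interface, `getStatus` would call the job status route and the default 2000ms interval would apply.
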
### Performance Metrics
- **Reviews count** - Total scraped
- **Time taken** - Seconds elapsed
- **Speed** - Reviews per second
- **Start time** - When job began

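
The speed metric is simply the review count divided by the elapsed seconds. The sketch below assumes rounding to one decimal place, as in the sample job status card (190 reviews in 19.9s); `reviewsPerSecond` is a hypothetical name.

```typescript
// Reviews per second, rounded to one decimal place.
// Returns 0 when no time has elapsed to avoid dividing by zero.
function reviewsPerSecond(reviewCount: number, elapsedSeconds: number): number {
  if (elapsedSeconds <= 0) {
    return 0;
  }
  return Math.round((reviewCount / elapsedSeconds) * 10) / 10;
}

console.log(reviewsPerSecond(190, 19.9)); // 9.5
```
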
### Export
- **JSON download** - Formatted, ready to use
- **Filename** - Includes job ID for tracking
- **Complete data** - All review fields preserved

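
The export step can be sketched as follows. The `reviews-{job_id}.json` filename pattern comes from this guide; the `Review` field names and the `buildExport` helper are assumptions for illustration, and the real review objects may carry more fields.

```typescript
// Assumed review shape, based on the fields shown on the review cards.
interface Review {
  author: string;
  rating: number;
  text: string;
  avatar: string;
  date: string;
}

// Build the download filename and the pretty-printed JSON payload.
function buildExport(jobId: string, reviews: Review[]): { filename: string; content: string } {
  return {
    filename: `reviews-${jobId}.json`,
    content: JSON.stringify(reviews, null, 2), // two-space indentation for readability
  };
}

const sample: Review[] = [
  { author: "John Doe", rating: 5, text: "Great place!", avatar: "", date: "2 weeks ago" },
];
console.log(buildExport("5f1d394f-10c5-4f30-8c2b-cb789c05918f", sample).filename);
```

In a browser, the `content` string would typically be wrapped in a `Blob` and offered through a temporary download link.
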
## 🏗️ Architecture

```
┌─────────────────────────────────────┐
│  Web Interface (Next.js)            │
│  http://localhost:3000              │
│                                     │
│  - Search business by name          │
│  - Or paste URL directly            │
│  - View map preview                 │
│  - Real-time job tracking           │
│  - Export results                   │
└──────────────┬──────────────────────┘
               │ API Calls
               ▼
┌─────────────────────────────────────┐
│  API Proxy (Next.js API Routes)     │
│                                     │
│  POST /api/scrape                   │
│  GET  /api/jobs/[id]                │
│  GET  /api/jobs/[id]/reviews        │
└──────────────┬──────────────────────┘
               │ Forward to
               ▼
┌─────────────────────────────────────┐
│  Scraper API (FastAPI)              │
│  http://localhost:8000              │
│                                     │
│  - Job queue management             │
│  - Chrome + SeleniumBase            │
│  - PostgreSQL storage               │
└─────────────────────────────────────┘
```

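
The proxy layer in the diagram can be sketched as a path mapping plus a forwarded request. This is not the project's route code: it assumes the proxy simply drops the `/api` prefix (which matches the scraper endpoints used in the curl examples), and `FetchLike`, `upstreamUrl`, and `forwardScrape` are hypothetical names. The fetch function is injected so the sketch can run without a live scraper.

```typescript
// Minimal shape of the fetch-like function the forwarder needs (the global
// `fetch` satisfies it); injected so the sketch works without a running API.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<{ status: number; json(): Promise<unknown> }>;

// Map a proxy path (as seen by the browser) to the upstream scraper URL.
// Assumption: the proxy strips the "/api" prefix and keeps the rest of the path.
function upstreamUrl(apiBase: string, proxyPath: string): string {
  const path = proxyPath.startsWith("/api/") ? proxyPath.slice(4) : proxyPath;
  return apiBase.replace(/\/+$/, "") + path;
}

// Forward a scrape request body to the scraper API and relay status and JSON.
async function forwardScrape(
  apiBase: string,
  payload: unknown,
  fetchImpl: FetchLike,
): Promise<{ status: number; body: unknown }> {
  const res = await fetchImpl(upstreamUrl(apiBase, "/api/scrape"), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  return { status: res.status, body: await res.json() };
}

console.log(upstreamUrl("http://localhost:8000", "/api/jobs/123/reviews"));
// http://localhost:8000/jobs/123/reviews
```

Because the browser only talks to the same-origin `/api/...` routes, the scraper API address never has to be reachable from the client.
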
## 🎨 UI Components

### Mode Toggle
```
┌──────────────┬──────────────┐
│  🔍 Search   │ 🔗 Paste URL │
└──────────────┴──────────────┘
```

### Search Interface
```
┌─────────────────────────────────────┐
│ 🔍 Business name and location...    │
├─────────────────────────────────────┤
│                                     │
│          Google Maps Embed          │
│                                     │
├─────────────────────────────────────┤
│        📥 Scrape All Reviews        │
└─────────────────────────────────────┘
```

### Job Status Card
```
┌─────────────────────────────────────┐
│ ✅ Job Status: COMPLETED            │
│ 5f1d394f-10c5-4f30-8c2b-cb789c05918f│
│                                     │
│   190       19.9s         9.5       │
│ Reviews      Time     Reviews/sec   │
└─────────────────────────────────────┘
```

### Review Card
```
┌─────────────────────────────────────┐
│ 👤 John Doe              ⭐⭐⭐⭐⭐ │
│    2 weeks ago                      │
│                                     │
│ Great place! Really enjoyed...      │
└─────────────────────────────────────┘
```

## 🔧 Configuration

### Environment Variables

Create `web/.env.local`:

```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```

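
How the variable might be consumed is sketched below. The default of `http://localhost:8000` is taken from the comment in `.env.local`; `resolveApiUrl` is a hypothetical helper, and the environment is passed in as a plain record so the example does not depend on Node typings.

```typescript
const DEFAULT_API_URL = "http://localhost:8000";

// Pick the API base URL from the environment, falling back to the default
// when the variable is missing or blank. A trailing slash is removed so
// paths can be appended safely.
function resolveApiUrl(env: Record<string, string | undefined>): string {
  const raw = env["NEXT_PUBLIC_API_URL"]?.trim();
  const base = raw ? raw : DEFAULT_API_URL;
  return base.replace(/\/+$/, "");
}

console.log(resolveApiUrl({})); // http://localhost:8000
console.log(resolveApiUrl({ NEXT_PUBLIC_API_URL: "http://api:8000/" })); // http://api:8000
```

In a Next.js app this would typically be called as `resolveApiUrl(process.env)` on the server side.
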
### Custom Port

If port 3000 is taken, Next.js auto-selects the next available port (3001, 3002, etc.)

## 🐛 Troubleshooting

### Web interface won't connect to API
```bash
# Check API is running
curl http://localhost:8000/health/live

# CORS should not be a factor: the browser calls same-origin Next.js API routes,
# which forward requests to the scraper API server-side
```

### Map not showing
- Check search query is at least 2 characters
- Wait 500ms after typing (debounce delay)
- Press Enter or click Search button

### Reviews not loading
- Check job status reached "completed"
- Look for error message in red box
- Check browser console for errors

## 📱 Mobile Friendly

The interface is fully responsive:
- Mobile: Single column, touch-optimized
- Tablet: Comfortable layout
- Desktop: Full width with max-width constraint

## 🎯 Example Businesses to Test

```
Soho Club Vilnius
McDonald's Times Square New York
Eiffel Tower Paris
Tokyo Tower Japan
Sydney Opera House
```

## 🚀 Production Deployment

### Option 1: Vercel (Recommended)
```bash
cd web
vercel deploy
```

### Option 2: Docker
```bash
cd web
docker build -t scraper-web .
docker run -p 3000:3000 -e NEXT_PUBLIC_API_URL=http://api:8000 scraper-web
```

### Option 3: Self-hosted
```bash
cd web
npm run build
npm run start
```

## 📝 Notes

- Interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews are fetched with a limit of 1000 (configurable)
- Export creates a `reviews-{job_id}.json` file
- Scraper API calls are made server-side through the Next.js API routes

## 🎉 Benefits Over curl

Before (curl):
```bash
# Send the JSON body with an explicit content type
curl -X POST http://localhost:8000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url":"..."}'
# Copy job_id
curl http://localhost:8000/jobs/{job_id}
# Wait and check again
curl http://localhost:8000/jobs/{job_id}
# Finally get reviews
curl http://localhost:8000/jobs/{job_id}/reviews
```

After (Web UI):
1. Type business name
2. Click "Scrape All Reviews"
3. Watch progress
4. Export JSON

**Much better! 🚀**