Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
268
TESTING_INTERFACE.md
Normal file
@@ -0,0 +1,268 @@
# Testing Interface - Quick Start Guide

A beautiful Next.js web interface for testing the Google Reviews Scraper API.

## 🎯 What You Get

### Business Search Mode
- **Search by name** - Just type "Soho Club Vilnius" instead of pasting URLs
- **Live map preview** - See the business location before scraping
- **Auto-generate URL** - Creates the perfect Google Maps search URL
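
The sketch below shows one way a business name could be turned into a Google Maps search URL. The function name `buildMapsSearchUrl` is hypothetical, and this guide does not show the interface's own implementation. The URL shape follows Google's documented Maps URLs search format.

```typescript
// Hypothetical helper: not taken from the web interface's source code.
function buildMapsSearchUrl(businessName: string): string {
  const query = businessName.trim();
  if (query.length === 0) {
    throw new Error("business name must not be empty");
  }
  // Google's documented Maps URLs search action; the query must be URL-encoded.
  return "https://www.google.com/maps/search/?api=1&query=" + encodeURIComponent(query);
}

console.log(buildMapsSearchUrl("Soho Club Vilnius"));
// prints https://www.google.com/maps/search/?api=1&query=Soho%20Club%20Vilnius
```

Encoding the query keeps names with spaces, apostrophes, or ampersands (for example "McDonald's Times Square New York") from breaking the URL.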

### Direct URL Mode
- **Paste any URL** - For specific Google Maps business pages
- **Flexible input** - Works with any Google Maps URL format

### Real-Time Tracking
- **Live status updates** - Watch your job progress in real-time
- **Performance metrics** - Reviews count, time, speed
- **Beautiful UI** - Clean, modern interface with status icons

### Results Display
- **Review cards** - Author, rating, text, avatar, date
- **Export to JSON** - Download all reviews as formatted JSON
- **Scrollable list** - Handle hundreds of reviews smoothly

## 🚀 Quick Start

### 1. Start the Scraper API

```bash
# From project root
docker-compose -f docker-compose.production.yml up -d
```

API runs at: **http://localhost:8000**

### 2. Start the Web Interface

```bash
cd web
npm install
npm run dev
```

Web interface runs at: **http://localhost:3000** (or next available port)

## 💡 Usage Examples

### Search Mode (Recommended)
1. Click "🔍 Search Business"
2. Type: `Soho Club Vilnius`
3. Map shows the business location
4. Click "Scrape All Reviews"
5. Watch real-time progress
6. Export results as JSON

### URL Mode
1. Click "🔗 Paste URL"
2. Paste Google Maps URL
3. Click "Scrape"
4. View results

## 📊 Features

### Search Interface
- **Debounced search** - Updates map 500ms after typing stops
- **Enter key support** - Press Enter to search
- **Visual feedback** - Loading states, icons, colors
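
For readers unfamiliar with debouncing, here is a minimal sketch of the technique with the 500ms delay mentioned above. The names `debounce`, `updateMap`, and `mapQueries` are illustrative assumptions; the interface's own code may differ (for example, it may use a React hook).

```typescript
// Generic debounce: only the last call within `delayMs` actually runs.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  delayMs: number,
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A): void => {
    // Cancel the pending call, if any, and schedule a fresh one.
    if (timer !== undefined) {
      clearTimeout(timer);
    }
    timer = setTimeout(() => fn(...args), delayMs);
  };
}

// Hypothetical stand-in for "update the map preview".
const mapQueries: string[] = [];
const updateMap = debounce((query: string) => {
  mapQueries.push(query);
}, 500);

// Simulate fast typing: three calls in a row, nothing runs immediately.
updateMap("Soho");
updateMap("Soho Club");
updateMap("Soho Club Vilnius");
```

About 500ms after the last call, `updateMap` runs once with "Soho Club Vilnius"; the two earlier calls are cancelled, so the map embed is not reloaded on every keystroke.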

### Job Tracking
- **Polling every 2 seconds** - Real-time status updates
- **Status indicators**:
  - 🔵 Running (spinner animation)
  - ✅ Completed (green checkmark)
  - ❌ Failed (red X)
  - ⏱️ Pending (clock icon)
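
The polling behaviour can be sketched roughly as follows. This is an illustration under assumptions: the lowercase status strings, the `pollJob` function, and the injected `fetchStatus` callback are hypothetical names, not the interface's actual code. The status source is stubbed so the example runs without a live API.

```typescript
// Assumed status values, mirroring the four indicators listed above.
type JobStatus = "pending" | "running" | "completed" | "failed";

// Polling should stop once the job has completed or failed.
function isTerminal(status: JobStatus): boolean {
  return status === "completed" || status === "failed";
}

async function pollJob(
  fetchStatus: () => Promise<JobStatus>,
  onUpdate: (status: JobStatus) => void,
  intervalMs: number = 2000,
): Promise<JobStatus> {
  for (;;) {
    const status = await fetchStatus();
    onUpdate(status);
    if (isTerminal(status)) {
      return status;
    }
    // Wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage with a scripted status source and a short interval for demonstration.
const scripted: JobStatus[] = ["pending", "running", "completed"];
let cursor = 0;
const fakeFetchStatus = async (): Promise<JobStatus> => scripted[cursor++] ?? "completed";

const seen: JobStatus[] = [];
pollJob(fakeFetchStatus, (s) => { seen.push(s); }, 10).then((finalStatus) => {
  console.log(finalStatus, seen.join(" -> "));
});
```

Injecting `fetchStatus` keeps the loop testable; in the real interface it would presumably wrap a request to `GET /api/jobs/[id]`.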

### Performance Metrics
- **Reviews count** - Total scraped
- **Time taken** - Seconds elapsed
- **Speed** - Reviews per second
- **Start time** - When job began
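
The speed metric is simply the review count divided by the elapsed time. A small sketch, with the hypothetical helper name `reviewsPerSecond` and sample figures of 190 reviews in 19.9 seconds:

```typescript
// Hypothetical helper: reviews per second, guarding against a zero duration.
function reviewsPerSecond(reviewCount: number, elapsedSeconds: number): number {
  if (elapsedSeconds <= 0) {
    return 0;
  }
  return reviewCount / elapsedSeconds;
}

// 190 reviews in 19.9 seconds, shown with one decimal place.
console.log(reviewsPerSecond(190, 19.9).toFixed(1)); // prints 9.5
```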

### Export
- **JSON download** - Formatted, ready to use
- **Filename** - Includes job ID for tracking
- **Complete data** - All review fields preserved
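
A sketch of how the export payload might be assembled. The `Review` field names are assumptions based on the review card description (author, rating, text, avatar, date), and `buildExport` is a hypothetical helper; only the `reviews-{job_id}.json` filename pattern comes from this guide.

```typescript
// Assumed review shape; the real API may use different field names.
interface Review {
  author: string;
  rating: number;
  text: string;
  avatar: string;
  date: string;
}

// Hypothetical helper: builds the download filename and pretty-printed JSON body.
function buildExport(jobId: string, reviews: Review[]): { filename: string; body: string } {
  return {
    filename: `reviews-${jobId}.json`,
    // Two-space indentation gives the "formatted" JSON output.
    body: JSON.stringify(reviews, null, 2),
  };
}

const sample: Review[] = [
  { author: "John Doe", rating: 5, text: "Great place!", avatar: "", date: "2 weeks ago" },
];
const exported = buildExport("5f1d394f-10c5-4f30-8c2b-cb789c05918f", sample);
console.log(exported.filename); // prints reviews-5f1d394f-10c5-4f30-8c2b-cb789c05918f.json
```

In a browser, the `body` string would typically be wrapped in a `Blob` and offered through a temporary download link.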

## 🏗️ Architecture

```
┌─────────────────────────────────────┐
│     Web Interface (Next.js)         │
│     http://localhost:3000           │
│                                     │
│  - Search business by name          │
│  - Or paste URL directly            │
│  - View map preview                 │
│  - Real-time job tracking           │
│  - Export results                   │
└──────────────┬──────────────────────┘
               │ API Calls
               ▼
┌─────────────────────────────────────┐
│     API Proxy (Next.js API Routes)  │
│                                     │
│  POST /api/scrape                   │
│  GET /api/jobs/[id]                 │
│  GET /api/jobs/[id]/reviews         │
└──────────────┬──────────────────────┘
               │ Forward to
               ▼
┌─────────────────────────────────────┐
│     Scraper API (FastAPI)           │
│     http://localhost:8000           │
│                                     │
│  - Job queue management             │
│  - Chrome + SeleniumBase            │
│  - PostgreSQL storage               │
└─────────────────────────────────────┘
```
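
To make the proxy layer concrete, here is a rough sketch of the forwarding step for `GET /api/jobs/[id]`. It assumes the proxy maps that route to `GET /jobs/{job_id}` on the scraper API (the path used in the curl examples in this guide). The names `upstreamUrl`, `forwardJobStatus`, and `FetchLike` are hypothetical, and the real Next.js route code is not shown here.

```typescript
// Assumed scraper API base URL (the default given in this guide).
const SCRAPER_API_URL = "http://localhost:8000";

// Joins a base URL and a path without doubling or dropping slashes.
function upstreamUrl(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// Minimal subset of the fetch API, so the forwarder can be tested with a stub.
type FetchLike = (url: string) => Promise<{ status: number; text(): Promise<string> }>;

// Forwards a job-status request and passes the upstream status and body through.
async function forwardJobStatus(
  jobId: string,
  doFetch: FetchLike,
): Promise<{ status: number; body: string }> {
  const target = upstreamUrl(SCRAPER_API_URL, "jobs/" + encodeURIComponent(jobId));
  const upstream = await doFetch(target);
  return { status: upstream.status, body: await upstream.text() };
}

// Usage with a stubbed upstream, so no running API is required.
const stubFetch: FetchLike = async (url) => ({
  status: 200,
  text: async () => JSON.stringify({ requested: url }),
});
forwardJobStatus("abc-123", stubFetch).then((res) => console.log(res.status, res.body));
```

In an actual route handler, the global `fetch` could be passed as `doFetch` and the result wrapped in a `Response`. Because the browser only talks to the Next.js routes, the scraper API address never needs to be reachable from the client.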

## 🎨 UI Components

### Mode Toggle
```
┌──────────────┬──────────────┐
│ 🔍 Search    │ 🔗 Paste URL │
└──────────────┴──────────────┘
```

### Search Interface
```
┌─────────────────────────────────────┐
│ 🔍 Business name and location...    │
├─────────────────────────────────────┤
│                                     │
│          Google Maps Embed          │
│                                     │
├─────────────────────────────────────┤
│        📥 Scrape All Reviews        │
└─────────────────────────────────────┘
```

### Job Status Card
```
┌─────────────────────────────────────┐
│ ✅ Job Status: COMPLETED            │
│ 5f1d394f-10c5-4f30-8c2b-cb789c05918f│
│                                     │
│  190        19.9s       9.5         │
│  Reviews    Time        Reviews/sec │
└─────────────────────────────────────┘
```

### Review Card
```
┌─────────────────────────────────────┐
│ 👤 John Doe              ⭐⭐⭐⭐⭐ │
│    2 weeks ago                      │
│                                     │
│ Great place! Really enjoyed...      │
└─────────────────────────────────────┘
```

## 🔧 Configuration

### Environment Variables

Create `web/.env.local`:

```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```
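
As an illustration of the default noted in the comment above, the setting could be resolved along these lines. The helper `resolveApiUrl` is hypothetical, and the environment is passed in as a plain object so the sketch stays self-contained; in the app it would presumably read from `process.env`.

```typescript
// Hypothetical helper: picks the configured API URL or falls back to the default.
function resolveApiUrl(env: Record<string, string | undefined>): string {
  const configured = env["NEXT_PUBLIC_API_URL"]?.trim();
  // Strip trailing slashes so paths can be appended consistently.
  return configured ? configured.replace(/\/+$/, "") : "http://localhost:8000";
}

console.log(resolveApiUrl({})); // prints http://localhost:8000
console.log(resolveApiUrl({ NEXT_PUBLIC_API_URL: "http://api:8000/" })); // prints http://api:8000
```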

### Custom Port

If port 3000 is taken, Next.js auto-selects the next available port (3001, 3002, etc.)

## 🐛 Troubleshooting

### Web interface won't connect to API
```bash
# Check API is running
curl http://localhost:8000/health/live

# CORS is unlikely to be the cause: the browser only calls the
# same-origin Next.js API routes, which forward requests server-side
```

### Map not showing
- Check that the search query is at least 2 characters
- Wait 500ms after typing (debounce delay)
- Press Enter or click the Search button

### Reviews not loading
- Check that the job status reached "completed"
- Look for an error message in the red box
- Check the browser console for errors

## 📱 Mobile Friendly

The interface is fully responsive:
- Mobile: Single column, touch-optimized
- Tablet: Comfortable layout
- Desktop: Full width with max-width constraint

## 🎯 Example Businesses to Test

```
Soho Club Vilnius
McDonald's Times Square New York
Eiffel Tower Paris
Tokyo Tower Japan
Sydney Opera House
```

## 🚀 Production Deployment

### Option 1: Vercel (Recommended)
```bash
cd web
vercel deploy
```

### Option 2: Docker
```bash
cd web
docker build -t scraper-web .
docker run -p 3000:3000 -e NEXT_PUBLIC_API_URL=http://api:8000 scraper-web
```

### Option 3: Self-hosted
```bash
cd web
npm run build
npm run start
```

## 📝 Notes

- Interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews fetched with limit of 1000 (configurable)
- Export creates `reviews-{job_id}.json` file
- All processing happens server-side (secure API calls)

## 🎉 Benefits Over curl

Before (curl):
```bash
# Send the request body as JSON
curl -X POST http://localhost:8000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url":"..."}'
# Copy job_id
curl http://localhost:8000/jobs/{job_id}
# Wait and check again
curl http://localhost:8000/jobs/{job_id}
# Finally get reviews
curl http://localhost:8000/jobs/{job_id}/reviews
```

After (Web UI):
1. Type business name
2. Click "Scrape All Reviews"
3. Watch progress
4. Export JSON

**Much better! 🚀**