Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (roughly 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

web/README.md Normal file

@@ -0,0 +1,90 @@
# Google Reviews Scraper - Testing Interface
A Next.js web interface for testing the containerized Google Reviews Scraper API.
## Features
- 🎯 **URL Input** - Paste any Google Maps business URL
- 📊 **Real-time Status** - Live job tracking with polling
- **Performance Metrics** - Review count, elapsed time, speed
- 📱 **Review Display** - Beautiful UI for scraped reviews
- 💾 **Export JSON** - Download reviews as JSON
## Quick Start
### 1. Start the Scraper API
First, make sure the containerized scraper is running:
```bash
cd ..
docker-compose -f docker-compose.production.yml up -d
```
The API should now be reachable at `http://localhost:8000`.
### 2. Start the Web Interface
```bash
npm install
npm run dev
```
Open [http://localhost:3000](http://localhost:3000)
## Usage
1. **Paste a Google Maps URL**
```
https://www.google.com/maps/place/Business+Name/...
```
2. **Click "Scrape"**
- Job is submitted to the API
- Status updates in real-time
- Reviews appear when complete
3. **View Results**
- See all scraped reviews
- Export as JSON
- View performance metrics
## Environment Variables
Create `.env.local` if you need to customize:
```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```
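As a rough illustration of how the variable and its default might be consumed, here is a minimal sketch. The helper name `resolveApiUrl` is hypothetical and does not come from this project; only the variable name and the default value are taken from the snippet above.

```typescript
// Hypothetical helper (not part of this repository): resolves the scraper API base URL.
// The environment is passed in as a plain object so the function stays easy to test.
function resolveApiUrl(env: Record<string, string | undefined>): string {
  const fallback = "http://localhost:8000"; // default documented above
  const raw = env.NEXT_PUBLIC_API_URL?.trim();
  const base = raw && raw.length > 0 ? raw : fallback;
  // Drop trailing slashes so endpoint paths can be appended with a single "/".
  return base.replace(/\/+$/, "");
}
```

In a Next.js app this would typically be called with an object built from `process.env.NEXT_PUBLIC_API_URL`.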
## API Endpoints Used
This interface connects to:
- `POST /scrape` - Submit scraping job
- `GET /jobs/{job_id}` - Get job status
- `GET /jobs/{job_id}/reviews` - Get reviews
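The three endpoints above could be wrapped roughly as follows. This is a sketch under assumptions, not the interface's actual client code: the `endpoints` and `submitJob` names are hypothetical, as are the `{ url }` request body and the `limit` query parameter, none of which are confirmed by this README.

```typescript
const API_BASE = "http://localhost:8000"; // default API URL from this README

// Path builders for the endpoints listed above.
const endpoints = {
  scrape: (): string => `${API_BASE}/scrape`,
  job: (jobId: string): string => `${API_BASE}/jobs/${encodeURIComponent(jobId)}`,
  // ASSUMPTION: the review limit is sent as a `limit` query parameter.
  reviews: (jobId: string, limit = 1000): string =>
    `${API_BASE}/jobs/${encodeURIComponent(jobId)}/reviews?limit=${limit}`,
};

// Minimal fetch-like signature so a stub can be injected in tests.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Submits a scraping job and returns the parsed JSON response.
async function submitJob(mapsUrl: string, fetchFn: FetchLike): Promise<unknown> {
  const res = await fetchFn(endpoints.scrape(), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: mapsUrl }), // ASSUMED request body shape
  });
  if (!res.ok) {
    throw new Error(`POST /scrape failed with HTTP ${res.status}`);
  }
  return res.json();
}
```

Injecting `fetchFn` keeps the sketch independent of any particular runtime and lets a stub stand in for the network during tests.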
## Tech Stack
- **Next.js 15** - React framework
- **TypeScript** - Type safety
- **Tailwind CSS** - Styling
- **API Proxy** - Next.js API routes proxy to scraper API
## Development
```bash
npm run dev # Start dev server
npm run build # Build for production
npm run start # Start production server
npm run lint # Run ESLint
```
## Notes
- The interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews are fetched with a limit of 1000 by default
- Export button downloads reviews as formatted JSON
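
The polling and export behaviour described in these notes might look roughly like the sketch below. The `status` field and the `"completed"` and `"failed"` values are assumptions about the API response, and `pollJob`, `isTerminal`, and `toExportJson` are hypothetical names used only for illustration.

```typescript
type JobStatus = { status: string };

// ASSUMED terminal status strings; the real API may use different values.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

function isTerminal(job: JobStatus): boolean {
  return TERMINAL_STATUSES.has(job.status);
}

// Calls `getStatus` every `intervalMs` (2 seconds by default, as noted above)
// until the job reaches a terminal status, then returns the final status object.
async function pollJob(
  getStatus: () => Promise<JobStatus>,
  intervalMs = 2000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<JobStatus> {
  for (;;) {
    const job = await getStatus();
    if (isTerminal(job)) {
      return job;
    }
    await sleep(intervalMs);
  }
}

// Serializes reviews as indented ("formatted") JSON for download.
function toExportJson(reviews: unknown[]): string {
  return JSON.stringify(reviews, null, 2);
}
```

Passing `sleep` as a parameter is a design convenience: a test can supply a no-op so the loop runs without real delays.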