Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (roughly 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
90
web/README.md
Normal file
@@ -0,0 +1,90 @@
# Google Reviews Scraper - Testing Interface

A Next.js web interface for testing the containerized Google Reviews Scraper API.

## Features

- 🎯 **URL Input** - Paste any Google Maps business URL
- 📊 **Real-time Status** - Live job tracking with polling
- ⚡ **Performance Metrics** - Reviews count, time, speed
- 📱 **Review Display** - Beautiful UI for scraped reviews
- 💾 **Export JSON** - Download reviews as JSON

## Quick Start

### 1. Start the Scraper API

First, make sure the containerized scraper is running:

```bash
cd ..
docker-compose -f docker-compose.production.yml up -d
```

The API should be running at `http://localhost:8000`.

### 2. Start the Web Interface

```bash
npm install
npm run dev
```

Open [http://localhost:3000](http://localhost:3000)

## Usage

1. **Paste a Google Maps URL**
   ```
   https://www.google.com/maps/place/Business+Name/...
   ```

2. **Click "Scrape"**
   - Job is submitted to the API
   - Status updates in real-time
   - Reviews appear when complete

3. **View Results**
   - See all scraped reviews
   - Export as JSON
   - View performance metrics

## Environment Variables

Create `.env.local` if you need to customize:

```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```
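
As a minimal sketch of how the default could be applied in code, the helper below falls back to `http://localhost:8000` when the variable is unset or blank. The name `resolveApiUrl` is hypothetical and not taken from this project.

```typescript
// Default stated in this README; used when NEXT_PUBLIC_API_URL is not set.
const DEFAULT_API_URL = "http://localhost:8000";

// Hypothetical helper (not part of the project): normalizes the configured API URL.
function resolveApiUrl(raw: string | undefined): string {
  const trimmed = raw === undefined ? "" : raw.trim();
  const base = trimmed.length > 0 ? trimmed : DEFAULT_API_URL;
  // Drop trailing slashes so paths such as "/scrape" can be appended safely.
  return base.replace(/\/+$/, "");
}
```

A caller would pass the variable explicitly, for example `resolveApiUrl(process.env.NEXT_PUBLIC_API_URL)`. Writing out the full literal name matters in browser code, because Next.js inlines `NEXT_PUBLIC_` variables at build time only where they are referenced by that exact expression.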

## API Endpoints Used

This interface connects to:

- `POST /scrape` - Submit scraping job
- `GET /jobs/{job_id}` - Get job status
- `GET /jobs/{job_id}/reviews` - Get reviews
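
The three endpoints above could be wrapped in a small typed client along these lines. This is a sketch under assumptions: the request body field `url`, the response field `job_id`, and the `limit` query parameter are guesses at the API's shape, and the names `endpoints` and `submitScrapeJob` are illustrative only.

```typescript
// Minimal fetch-like signature so the sketch does not depend on DOM or Node typings.
type HttpResponse = { ok: boolean; status: number; json(): Promise<unknown> };
type HttpFetch = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<HttpResponse>;

// URL builders for the three endpoints listed in this README.
const endpoints = {
  scrape: (base: string): string => `${base}/scrape`,
  job: (base: string, jobId: string): string =>
    `${base}/jobs/${encodeURIComponent(jobId)}`,
  // Assumption: the reviews limit is passed as a `limit` query parameter.
  reviews: (base: string, jobId: string, limit: number): string =>
    `${base}/jobs/${encodeURIComponent(jobId)}/reviews?limit=${limit}`,
};

// Assumption: POST /scrape takes {"url": ...} and answers with {"job_id": ...}.
async function submitScrapeJob(
  http: HttpFetch,
  base: string,
  mapsUrl: string,
): Promise<string> {
  const res = await http(endpoints.scrape(base), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url: mapsUrl }),
  });
  if (!res.ok) {
    throw new Error(`POST /scrape failed with status ${res.status}`);
  }
  const data = (await res.json()) as { job_id?: unknown };
  if (typeof data.job_id !== "string") {
    throw new Error("response did not include a job_id");
  }
  return data.job_id;
}
```

In a browser or on Node 18+, the global `fetch` can be passed as `http`. Injecting it keeps the function easy to exercise with a fake in tests.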

## Tech Stack

- **Next.js 15** - React framework
- **TypeScript** - Type safety
- **Tailwind CSS** - Styling
- **API Proxy** - Next.js API routes proxy to scraper API
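
To illustrate the proxy idea, here is a rough sketch of the core logic such a route might contain, kept free of framework types so it can run on its own. Everything in it is an assumption for illustration: the catch-all route layout, the names `upstreamUrl` and `proxyRequest`, and the 502 response on connection failure are not taken from the project.

```typescript
// Result of forwarding a request to the scraper API.
type UpstreamResult = { status: number; body: string };
// Injected transport; a real route would wrap `fetch` here.
type UpstreamSend = (
  url: string,
  init: { method: string; body?: string },
) => Promise<UpstreamResult>;

// Maps path segments from a hypothetical catch-all route (e.g. /api/[...path])
// onto the scraper API base URL, preserving the query string.
function upstreamUrl(apiBase: string, pathSegments: string[], search: string): string {
  const path = pathSegments.map((segment) => encodeURIComponent(segment)).join("/");
  return `${apiBase.replace(/\/+$/, "")}/${path}${search}`;
}

// Forwards the request and converts transport failures into a gateway error,
// so the UI receives a JSON response even when the container is down.
async function proxyRequest(
  send: UpstreamSend,
  apiBase: string,
  method: string,
  pathSegments: string[],
  search: string,
  body?: string,
): Promise<UpstreamResult> {
  try {
    return await send(upstreamUrl(apiBase, pathSegments, search), { method, body });
  } catch (err) {
    return { status: 502, body: JSON.stringify({ error: "scraper API unreachable" }) };
  }
}
```

Routing browser calls through the Next.js server this way avoids cross-origin requests from the page to the scraper container.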

## Development

```bash
npm run dev    # Start dev server
npm run build  # Build for production
npm run start  # Start production server
npm run lint   # Run ESLint
```

## Notes

- The interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews are fetched with a limit of 1000 by default
- Export button downloads reviews as formatted JSON
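
The polling and export behavior described in these notes might look roughly like the sketch below. The status strings `"completed"` and `"failed"` are assumptions, since this README does not list the API's actual status values. The names `pollUntilDone` and `toExportJson` are illustrative, and two-space indentation is one plausible reading of "formatted JSON".

```typescript
// Assumed shape of a job status payload; only `status` is needed for polling.
type JobStatus = { status: string };

const POLL_INTERVAL_MS = 2000; // "polls job status every 2 seconds"

// Assumed terminal status values (not confirmed by the README).
function isTerminal(job: JobStatus): boolean {
  return job.status === "completed" || job.status === "failed";
}

// Calls getStatus repeatedly, reporting each result, until the job finishes.
// `sleep` is injectable so tests do not have to wait in real time.
async function pollUntilDone<T extends JobStatus>(
  getStatus: () => Promise<T>,
  onUpdate: (job: T) => void,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<T> {
  for (;;) {
    const job = await getStatus();
    onUpdate(job);
    if (isTerminal(job)) return job;
    await sleep(POLL_INTERVAL_MS);
  }
}

// Serializes reviews as pretty-printed JSON for download.
function toExportJson(reviews: unknown[]): string {
  return JSON.stringify(reviews, null, 2);
}
```

Here `getStatus` would wrap a call to `GET /jobs/{job_id}`, and `onUpdate` would push the latest status into component state.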