Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
91 lines
2.0 KiB
Markdown
# Google Reviews Scraper - Testing Interface

A Next.js web interface for testing the containerized Google Reviews Scraper API.

## Features

- 🎯 **URL Input** - Paste any Google Maps business URL
- 📊 **Real-time Status** - Live job tracking with polling
- ⚡ **Performance Metrics** - Reviews count, time, speed
- 📱 **Review Display** - Beautiful UI for scraped reviews
- 💾 **Export JSON** - Download reviews as JSON

## Quick Start

### 1. Start the Scraper API

First, make sure the containerized scraper is running:

```bash
cd ..
docker-compose -f docker-compose.production.yml up -d
```

The API should be running at `http://localhost:8000`

### 2. Start the Web Interface

```bash
npm install
npm run dev
```

Open [http://localhost:3000](http://localhost:3000)

## Usage

1. **Paste a Google Maps URL**
   ```
   https://www.google.com/maps/place/Business+Name/...
   ```

2. **Click "Scrape"**
   - Job is submitted to the API
   - Status updates in real-time
   - Reviews appear when complete

3. **View Results**
   - See all scraped reviews
   - Export as JSON
   - View performance metrics
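
The performance metrics shown with the results (review count, time, speed) could be derived roughly as follows. This is a minimal sketch that assumes "speed" means reviews per second; the `ScrapeMetrics` type and `computeMetrics` function are hypothetical names, not taken from this project.

```typescript
// Hypothetical shape of the metrics panel data.
interface ScrapeMetrics {
  reviewCount: number;
  elapsedSeconds: number;
  reviewsPerSecond: number;
}

// Assumption: speed is the ratio of reviews to elapsed seconds,
// rounded to two decimal places for display.
function computeMetrics(reviewCount: number, elapsedSeconds: number): ScrapeMetrics {
  // Avoid dividing by zero when no time has elapsed yet.
  const rate = elapsedSeconds > 0 ? reviewCount / elapsedSeconds : 0;
  return {
    reviewCount,
    elapsedSeconds,
    reviewsPerSecond: Math.round(rate * 100) / 100,
  };
}
```

With the figures from the commit message's test run (550 reviews in 150.53 s), `computeMetrics(550, 150.53).reviewsPerSecond` evaluates to `3.65`.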

## Environment Variables

Create `.env.local` if you need to customize:

```bash
# API URL (default: http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8000
```
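
As a rough sketch of how the interface might resolve this setting, the helper below falls back to the documented default when the variable is unset or blank. The name `resolveApiUrl` is hypothetical and not taken from this project's code.

```typescript
// The default mirrors the value documented in the README.
const DEFAULT_API_URL = "http://localhost:8000";

// Hypothetical helper: normalize the configured API base URL.
function resolveApiUrl(raw: string | undefined): string {
  const value = (raw ?? "").trim();
  // Fall back to the default when the variable is unset or blank.
  const base = value.length > 0 ? value : DEFAULT_API_URL;
  // Drop trailing slashes so paths like "/scrape" can be appended safely.
  return base.replace(/\/+$/, "");
}
```

In a Next.js app this would typically be called as `resolveApiUrl(process.env.NEXT_PUBLIC_API_URL)`, referencing the variable literally so that Next.js can inline it at build time.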

## API Endpoints Used

This interface connects to:

- `POST /scrape` - Submit scraping job
- `GET /jobs/{job_id}` - Get job status
- `GET /jobs/{job_id}/reviews` - Get reviews
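
The three endpoints above can be sketched as small request builders. Only the paths come from this README; the JSON body field `url`, the `limit` query parameter, and all type and function names are assumptions made for illustration, so check them against the scraper API before relying on them.

```typescript
// Plain description of an HTTP request, independent of any HTTP client.
interface ApiRequest {
  method: "GET" | "POST";
  url: string;
  body?: string;
}

// POST /scrape: submit a scraping job for a Google Maps URL.
function submitJobRequest(baseUrl: string, mapsUrl: string): ApiRequest {
  return {
    method: "POST",
    url: `${baseUrl}/scrape`,
    body: JSON.stringify({ url: mapsUrl }), // assumed payload shape
  };
}

// GET /jobs/{job_id}: fetch the current job status.
function jobStatusRequest(baseUrl: string, jobId: string): ApiRequest {
  return { method: "GET", url: `${baseUrl}/jobs/${encodeURIComponent(jobId)}` };
}

// GET /jobs/{job_id}/reviews: fetch the scraped reviews.
// The "limit" parameter name is an assumption; 1000 matches the default in Notes.
function reviewsRequest(baseUrl: string, jobId: string, limit = 1000): ApiRequest {
  return {
    method: "GET",
    url: `${baseUrl}/jobs/${encodeURIComponent(jobId)}/reviews?limit=${limit}`,
  };
}
```

Each descriptor can then be passed to `fetch`, for example `fetch(req.url, { method: req.method, body: req.body, headers: { "Content-Type": "application/json" } })`.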

## Tech Stack

- **Next.js 15** - React framework
- **TypeScript** - Type safety
- **Tailwind CSS** - Styling
- **API Proxy** - Next.js API routes proxy to scraper API
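
To illustrate the proxy idea, the sketch below maps a path received by a Next.js API route to the matching scraper API URL. The `/api` prefix and the name `toUpstreamUrl` are assumptions; the README only states that API routes proxy to the scraper API, not how the routes are laid out.

```typescript
// Hypothetical mapping from a proxy route path to the upstream scraper URL.
// Assumption: proxy routes live under "/api" and mirror the scraper's paths.
function toUpstreamUrl(requestPath: string, apiBase: string, prefix = "/api"): string {
  const underPrefix = requestPath === prefix || requestPath.indexOf(prefix + "/") === 0;
  if (!underPrefix) {
    throw new Error(`Path ${requestPath} is not under ${prefix}`);
  }
  // Strip the prefix, e.g. "/api/jobs/abc" becomes "/jobs/abc".
  const rest = requestPath.slice(prefix.length);
  // Remove trailing slashes from the base before joining.
  return apiBase.replace(/\/+$/, "") + rest;
}
```

A route handler could forward the incoming request to the returned URL with `fetch` and relay the response, which keeps the browser talking only to the Next.js origin.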

## Development

```bash
npm run dev      # Start dev server
npm run build    # Build for production
npm run start    # Start production server
npm run lint     # Run ESLint
```

## Notes

- The interface polls job status every 2 seconds
- Polling stops when job completes or fails
- Reviews are fetched with a limit of 1000 by default
- Export button downloads reviews as formatted JSON
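
The polling and export behaviour described above might look roughly like this. It is a sketch under stated assumptions: the status names (`"pending"`, `"running"`, `"completed"`, `"failed"`), the 2-space JSON indentation, and the function names are hypothetical, and the status fetcher is injected so the loop does not depend on a particular HTTP client.

```typescript
// Assumed status values; the real API may use different names.
type JobStatus = "pending" | "running" | "completed" | "failed";

interface JobState {
  status: JobStatus;
}

// Poll interval from the Notes section: every 2 seconds.
const POLL_INTERVAL_MS = 2000;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls fetchStatus repeatedly and resolves once the job completes or fails.
async function pollUntilDone(
  fetchStatus: () => Promise<JobState>,
  intervalMs: number = POLL_INTERVAL_MS,
  wait: (ms: number) => Promise<void> = sleep
): Promise<JobState> {
  for (;;) {
    const state = await fetchStatus();
    if (state.status === "completed" || state.status === "failed") {
      return state; // terminal state: stop polling
    }
    await wait(intervalMs);
  }
}

// Serialize reviews as indented JSON for download (indent width is an assumption).
function toExportJson(reviews: unknown[]): string {
  return JSON.stringify(reviews, null, 2);
}
```

Injecting `wait` keeps the loop easy to test without real delays; in the browser the default `sleep` would be used as is.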