Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (about 5.4x faster)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions


@@ -0,0 +1,140 @@
# API Interceptor Debug Summary
## Problem Statement
The scraper was working but **very slow** due to scrolling + DOM parsing. We wanted to use Google's internal API (`/maps/rpc/listugcposts`) to get reviews faster.
## What We Discovered
### ✅ API Interception IS Working!
The JavaScript interceptor successfully captures Google Maps API calls:
- **Endpoint**: `/maps/rpc/listugcposts`
- **Response sizes**: 41KB - 96KB per request
- **Frequency**: 2-5 responses captured per scroll cycle
- **Content**: Each response contains ~10-20 reviews in Google's nested array format
### ❌ What Was Broken
1. **Parser Bug**: `TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'`
   - The recursive parser was trying to compare InterceptedReview objects with integers
   - Caused ALL parsing to fail despite responses being captured
2. **Missing Specialized Parser**: Generic recursive extraction didn't understand Google's `listugcposts` format
3. **Insufficient Logging**: Hard to diagnose without seeing what was captured
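The parser bug can be reproduced in isolation. The sketch below is only an illustration: `InterceptedReview` here is a hypothetical two-field stand-in for the real class in `api_interceptor.py`, and `reproduce_type_error` is a made-up helper name.

```python
from dataclasses import dataclass


@dataclass
class InterceptedReview:
    """Hypothetical stand-in for the real class in api_interceptor.py."""
    review_id: str
    rating: int


def reproduce_type_error() -> str:
    """Compare a review object with an int and return the error message."""
    review = InterceptedReview(review_id="abc", rating=5)
    try:
        review > 0  # dataclasses define no ordering unless order=True
    except TypeError as exc:
        return str(exc)
    return "no error"


print(reproduce_type_error())
```

Any code path that treats an already-parsed review as if it were a raw number (for example a rating range check) would fail this way.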
## Fixes Implemented
### 1. Fixed Recursion Bug (api_interceptor.py:527-555)
```python
def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Skip if data is already an InterceptedReview object
    if isinstance(data, InterceptedReview):
        return [data]
    # ... rest of logic with proper type checks
```
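For context, a self-contained sketch of how such a guard might sit inside a recursive walk is shown here. It is not the code from `api_interceptor.py`: the stand-in `InterceptedReview` class, the `extract_reviews` name, and the `MAX_DEPTH` cap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    """Hypothetical stand-in; the real class has more fields."""
    review_id: str


MAX_DEPTH = 20  # assumed recursion cap, not taken from the source


def extract_reviews(data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Already-parsed objects are returned as-is, never compared or re-parsed.
    if isinstance(data, InterceptedReview):
        return [data]
    if depth > MAX_DEPTH:
        return []
    found: List[InterceptedReview] = []
    if isinstance(data, dict):
        children = data.values()
    elif isinstance(data, (list, tuple)):
        children = data
    else:
        return found  # strings, numbers, None: nothing to descend into
    for child in children:
        found.extend(extract_reviews(child, depth + 1))
    return found


mixed = [1, "x", [InterceptedReview("a"), {"k": [InterceptedReview("b"), 3]}]]
print([r.review_id for r in extract_reviews(mixed)])
```

Checking the type before any comparison means a payload that mixes raw arrays with already-parsed objects is handled without raising.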
### 2. Added Enhanced Debug Logging
**JavaScript Interceptor** (api_interceptor.py:204-307):
- Console logs with `[API Interceptor]` prefix
- Real-time stats every 10 seconds
- Captures ALL network requests (not just matches)
- Logs request types, URLs, and sizes
**Python Side** (api_interceptor.py:331-369, scraper.py:1419-1436):
- Shows number of responses retrieved
- Logs parsing attempts and results
- Reports final stats even if 0 reviews captured
- Browser console log extraction
- Optional response dumping to files in debug mode
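The optional response dumping could look roughly like the sketch below. The function name `dump_responses`, the logger name, the file naming scheme, and the shape of each response (a JSON-serializable dict) are assumptions; only the `debug_api_dump/` directory name comes from this document.

```python
import json
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger("api_interceptor_sketch")  # hypothetical logger name


def dump_responses(responses: List[Dict[str, Any]],
                   dump_dir: str = "debug_api_dump") -> List[Path]:
    """Write each captured response to its own JSON file, only in debug mode."""
    if not logger.isEnabledFor(logging.DEBUG):
        return []  # outside debug mode nothing is written to disk
    out_dir = Path(dump_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, response in enumerate(responses):
        path = out_dir / f"response_{index:03d}.json"
        path.write_text(json.dumps(response, ensure_ascii=False, indent=2),
                        encoding="utf-8")
        logger.debug("Dumped %s (%d bytes)", path.name, path.stat().st_size)
        written.append(path)
    return written
```

Gating on the logger level keeps the dump directory empty during normal runs while still leaving raw payloads available for inspection when `LOG_LEVEL=DEBUG` is set.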
### 3. Specialized Parser for listugcposts (api_interceptor.py:435-558)
```python
def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
    """
    Parse Google Maps listugcposts API response.
    Handles deeply nested array format with pattern matching.
    """
```
**Detection Patterns**:
- Long string (30+ chars) = Review ID
- Number 1-5 = Rating
- Long string (50+ chars, not URL) = Review text
- Short string (3-100 chars) = Author name
- Date patterns = Review date
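The detection patterns above overlap (a 60-character string qualifies as an ID, as text, and as an author name), so any implementation needs a tie-breaking order. The sketch below shows one possible ordering. The `classify_value` name, the date regex, and the rule that IDs contain no spaces are assumptions, not the behavior of the real parser.

```python
import re
from typing import Any, Optional

# Assumed date shapes: ISO-like ("2026-01-18") or relative ("2 months ago").
DATE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}|(a|an|\d+)\s+(day|week|month|year)s?\s+ago)$",
    re.IGNORECASE,
)


def classify_value(value: Any) -> Optional[str]:
    """Guess which review field a leaf value might be; None if nothing fits."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return "rating" if 1 <= value <= 5 else None
    if not isinstance(value, str):
        return None
    text = value.strip()
    if DATE_RE.match(text):
        return "date"
    if text.startswith(("http://", "https://")):
        return None  # URLs are never treated as review text
    if len(text) >= 30 and " " not in text:
        return "review_id"  # assumption: IDs are long tokens without spaces
    if len(text) >= 50:
        return "text"
    if 3 <= len(text) <= 100:
        return "author"
    return None


print(classify_value("Jane Doe"), classify_value(4), classify_value("2 months ago"))
```

Heuristics like these are position-blind, so they can mislabel fields; inspecting real captured responses remains the way to confirm where each field actually lives.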
### 4. Stats & Diagnostics (scraper.py:1487-1509)
When API interception is enabled but captures 0 reviews:
```
⚠️ API interception was enabled but captured 0 reviews.
Network stats - Fetch requests: 0/X, XHR requests: Y/Z
Found N API interceptor console messages
```
## How to Use Debug Mode
```bash
# Enable debug logging
LOG_LEVEL=DEBUG python start.py
# You'll see output like:
[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG] - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses # If parsing fails
[INFO] API interceptor captured 10 reviews (total unique API: 10) # If parsing works!
```
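How `start.py` reads `LOG_LEVEL` is not shown here. A minimal sketch of one common approach, using only the standard `logging` module, might look like this; the `configure_logging` name and the fallback to `INFO` are assumptions.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Set the root log level from the LOG_LEVEL environment variable."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to INFO
    # force=True replaces any handlers configured earlier (Python 3.8+).
    logging.basicConfig(level=level, format="[%(levelname)s] %(message)s", force=True)
    return level


if __name__ == "__main__":
    configure_logging()
    logging.debug("Retrieved %d intercepted responses from browser", 2)
```

The format string is chosen to match the `[DEBUG] ...` style of the sample output above.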
## Next Steps to Complete API Speed Optimization
1. **Test with Real Data**: Run scraper with DEBUG logging to see actual listugcposts responses
2. **Analyze Response Format**: Examine captured responses in `debug_api_dump/` directory
3. **Refine Parser**: Adjust field detection based on actual Google API format
4. **Benchmark Performance**: Compare DOM vs API scraping speed
5. **Add Pure API Mode**: Option to skip DOM scraping entirely and only use API
## Expected Performance Improvement
**Current (DOM Scraping)**:
- ~2-4 reviews/second
- Requires scrolling + waiting for render
- 244 reviews in ~3 minutes
**Target (API Mode)**:
- ~20-50 reviews/second (10-25x faster!)
- No scrolling needed
- 244 reviews in ~10-20 seconds
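The arithmetic behind these estimates can be checked with a few lines. The helper names are hypothetical; the inputs are the figures quoted above plus the 550-review test run from the commit message.

```python
def reviews_per_second(reviews: int, seconds: float) -> float:
    """Throughput for a completed run."""
    return reviews / seconds


def seconds_needed(reviews: int, rate: float) -> float:
    """Time to fetch a number of reviews at a given rate."""
    return reviews / rate


# Current DOM scraping: 244 reviews in ~3 minutes, and the 550-review test run.
print(f"DOM, 244 reviews: {reviews_per_second(244, 180):.2f} reviews/s")
print(f"DOM, 550 reviews: {reviews_per_second(550, 150.53):.2f} reviews/s")

# Target API mode: 20-50 reviews/s applied to the same 244 reviews.
print(f"API at 50/s: {seconds_needed(244, 50):.2f} s")
print(f"API at 20/s: {seconds_needed(244, 20):.2f} s")
```

The 244-review example works out to roughly 1.4 reviews/second, while the 550-review run is about 3.7 reviews/second, so the quoted 2-4 reviews/second is best read as a loose range. At 20-50 reviews/second, 244 reviews would take roughly 5-12 seconds.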
## Files Modified
1. `modules/api_interceptor.py` - Core interceptor with parsing logic
2. `modules/scraper.py` - Integration and stats reporting
3. `config.yaml` - `enable_api_intercept: true`
## Testing the Fixes
```bash
# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete
# Run with debug logging
LOG_LEVEL=DEBUG python start.py
# Or run specific test
python test_api_quick.py
```
## Browser Console Messages
When the interceptor is working, you'll see in the browser console:
```
[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
```
These messages confirm the interceptor is active and capturing responses.
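If these console lines are collected on the Python side, they can be filtered and parsed for diagnostics. The sketch below assumes the lines arrive as plain strings and reads `5/15` as captured/total; both the helper names and that interpretation are assumptions.

```python
import re
from typing import Dict, List, Optional

PREFIX = "[API Interceptor]"
STATS_RE = re.compile(r"Stats: Fetch: (\d+)/(\d+) XHR: (\d+)/(\d+) Queue: (\d+)")


def interceptor_messages(console_lines: List[str]) -> List[str]:
    """Keep only the lines emitted by the injected interceptor."""
    return [line for line in console_lines if PREFIX in line]


def parse_stats(line: str) -> Optional[Dict[str, int]]:
    """Extract counters from a Stats line; None if the line is not a Stats line."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    # Assumed meaning of each pair: captured/total.
    keys = ("fetch_captured", "fetch_total", "xhr_captured", "xhr_total", "queue")
    return dict(zip(keys, map(int, match.groups())))


lines = [
    "[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5",
    "Some unrelated page warning",
]
for message in interceptor_messages(lines):
    print(parse_stats(message))
```

A count of zero matching messages would suggest the script was never injected, whereas non-zero totals with zero captures would point at the URL filter instead.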