Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
140
API_INTERCEPTOR_DEBUG_SUMMARY.md
Normal file
@@ -0,0 +1,140 @@
# API Interceptor Debug Summary

## Problem Statement

The scraper worked but was **very slow** because it relied on scrolling and DOM parsing. The goal was to read reviews from Google's internal API (`/maps/rpc/listugcposts`) instead, which should be faster.

## What We Discovered

### ✅ API Interception IS Working!

The JavaScript interceptor successfully captures Google Maps API calls:
- **Endpoint**: `/maps/rpc/listugcposts`
- **Response sizes**: 41KB - 96KB per request
- **Frequency**: 2-5 responses captured per scroll cycle
- **Content**: Each response contains ~10-20 reviews in Google's nested array format

### ❌ What Was Broken

1. **Parser Bug**: `TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'`
   - The recursive parser was trying to compare InterceptedReview objects with integers
   - Caused ALL parsing to fail despite responses being captured

2. **Missing Specialized Parser**: Generic recursive extraction didn't understand Google's `listugcposts` format

3. **Insufficient Logging**: Hard to diagnose without seeing what was captured

## Fixes Implemented

### 1. Fixed Recursion Bug (api_interceptor.py:527-555)

```python
def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Skip if data is already an InterceptedReview object
    if isinstance(data, InterceptedReview):
        return [data]

    # ... rest of logic with proper type checks
```

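
To illustrate why the early `isinstance` guard matters, here is a minimal, self-contained sketch. It is not the code from `api_interceptor.py`: the `InterceptedReview` dataclass, its fields, the `extract_reviews` function, and the `max_depth` parameter are all hypothetical stand-ins. It only shows the general pattern of returning already-parsed objects before any numeric comparison can touch them.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    """Hypothetical stand-in for the project's InterceptedReview class."""
    review_id: str
    rating: int


def extract_reviews(data: Any, depth: int = 0, max_depth: int = 20) -> List[InterceptedReview]:
    """Walk nested lists/dicts and collect InterceptedReview objects."""
    # Guard first: already-parsed reviews are returned as-is, so they never
    # reach a numeric comparison (the source of the original TypeError).
    if isinstance(data, InterceptedReview):
        return [data]
    if depth > max_depth:  # depth is always an int here, so this is safe
        return []
    if isinstance(data, dict):
        children = list(data.values())
    elif isinstance(data, (list, tuple)):
        children = list(data)
    else:
        return []  # scalars carry no reviews on their own
    found: List[InterceptedReview] = []
    for child in children:
        found.extend(extract_reviews(child, depth + 1, max_depth))
    return found


nested = [1, ["x", InterceptedReview("abc", 5)], {"k": [InterceptedReview("def", 4)]}]
print([r.review_id for r in extract_reviews(nested)])  # ['abc', 'def']
```
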
### 2. Added Enhanced Debug Logging

**JavaScript Interceptor** (api_interceptor.py:204-307):
- Console logs with `[API Interceptor]` prefix
- Real-time stats every 10 seconds
- Captures ALL network requests (not just matches)
- Logs request types, URLs, and sizes

**Python Side** (api_interceptor.py:331-369, scraper.py:1419-1436):
- Shows number of responses retrieved
- Logs parsing attempts and results
- Reports final stats even if 0 reviews captured
- Browser console log extraction
- Optional response dumping to files in debug mode

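
The optional response dumping could look roughly like the sketch below. This is an assumption about the shape of such a helper, not the project's actual code: the `dump_response` name, the `response_NNNN.txt` file naming, and the function signature are hypothetical. Only the `debug_api_dump/` directory name comes from this document.

```python
import tempfile
from pathlib import Path


def dump_response(body: str, index: int, dump_dir: Path) -> Path:
    """Write one raw intercepted response body to its own file.

    The file naming scheme is hypothetical; it simply keeps dumps in
    capture order so they can be inspected one by one.
    """
    dump_dir.mkdir(parents=True, exist_ok=True)
    path = dump_dir / f"response_{index:04d}.txt"
    path.write_text(body, encoding="utf-8")
    return path


if __name__ == "__main__":
    # Demo in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        written = dump_response("[[1, 2, 3]]", 0, Path(tmp) / "debug_api_dump")
        print(written.name, written.stat().st_size, "bytes")
```

Keeping the body as raw text rather than parsed JSON means a dump is still written when parsing fails, which is when it is most useful.
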
### 3. Specialized Parser for listugcposts (api_interceptor.py:435-558)

```python
def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
    """
    Parse Google Maps listugcposts API response.
    Handles deeply nested array format with pattern matching.
    """
```

**Detection Patterns**:
- Long string (30+ chars) = Review ID
- Number 1-5 = Rating
- Long string (50+ chars, not URL) = Review text
- Short string (3-100 chars) = Author name
- Date patterns = Review date

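
The detection patterns above overlap (a 60-character string satisfies three of them), so any implementation needs a tie-breaking order. The sketch below shows one possible ordering. It is not the real parser: `classify_field`, the `DATE_RE` pattern, and the use of whitespace to separate review IDs from review text are all assumptions made for illustration.

```python
import re
from typing import Any, Optional

# Hypothetical date pattern; the real parser's date patterns are not shown in this document.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$|^(a|an|\d+) (day|week|month|year)s? ago$")


def classify_field(value: Any) -> Optional[str]:
    """Guess which review field a scalar value represents, or None."""
    if isinstance(value, bool):  # bool is a subclass of int, so exclude it first
        return None
    if isinstance(value, int) and 1 <= value <= 5:
        return "rating"
    if not isinstance(value, str):
        return None
    text = value.strip()
    if DATE_RE.match(text):
        return "date"
    if text.startswith(("http://", "https://")):
        return None  # URLs are never review text
    # Assumed tie-breaker: review text contains spaces, opaque IDs do not.
    if len(text) >= 50 and " " in text:
        return "text"
    if len(text) >= 30 and " " not in text:
        return "review_id"
    if 3 <= len(text) <= 100:
        return "author"
    return None


for sample in (5, "2 weeks ago", "Jane Doe", "A" * 40):
    print(repr(sample)[:20], "->", classify_field(sample))
```

Heuristics like these can misfire (a short review reads like an author name), which is why step 3 of the next steps calls for refining field detection against real responses.
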
### 4. Stats & Diagnostics (scraper.py:1487-1509)

When API interception is enabled but captures 0 reviews, the following diagnostics are reported:
```
⚠️ API interception was enabled but captured 0 reviews.
Network stats - Fetch requests: 0/X, XHR requests: Y/Z
Found N API interceptor console messages
```

## How to Use Debug Mode

```bash
# Enable debug logging
LOG_LEVEL=DEBUG python start.py

# You'll see output like:
[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG] - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses # If parsing fails
[INFO] API interceptor captured 10 reviews (total unique API: 10) # If parsing works!
```

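
How `start.py` actually reads `LOG_LEVEL` is not shown in this document. As a rough sketch of the usual approach, the environment variable can be mapped onto the standard `logging` levels like this. The `configure_logging` helper and the log format string are hypothetical.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from the LOG_LEVEL environment variable.

    Unknown level names fall back to INFO rather than raising.
    """
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(level=level, format="[%(levelname)s] %(message)s", force=True)
    return level


if __name__ == "__main__":
    configure_logging()
    logging.getLogger(__name__).debug("Only visible when LOG_LEVEL=DEBUG")
```
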
## Next Steps to Complete API Speed Optimization

1. **Test with Real Data**: Run scraper with DEBUG logging to see actual listugcposts responses
2. **Analyze Response Format**: Examine captured responses in `debug_api_dump/` directory
3. **Refine Parser**: Adjust field detection based on actual Google API format
4. **Benchmark Performance**: Compare DOM vs API scraping speed
5. **Add Pure API Mode**: Option to skip DOM scraping entirely and only use API

## Expected Performance Improvement

**Current (DOM Scraping)**:
- ~2-4 reviews/second
- Requires scrolling + waiting for render
- 244 reviews in ~3 minutes

**Target (API Mode)**:
- ~20-50 reviews/second (10-25x faster!)
- No scrolling needed
- 244 reviews in ~10-20 seconds

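
As a quick sanity check, the throughput implied by the 244-review totals above can be worked out directly. The helper name below is illustrative. The totals imply somewhat lower rates than the per-second ranges quoted, so both sets of numbers are best read as rough estimates until the benchmark in the next steps is run.

```python
def reviews_per_second(reviews: int, seconds: float) -> float:
    """Average throughput over a whole scrape."""
    return reviews / seconds


dom_rate = reviews_per_second(244, 3 * 60)  # 244 reviews in ~3 minutes
api_low = reviews_per_second(244, 20)       # 244 reviews in ~20 seconds
api_high = reviews_per_second(244, 10)      # 244 reviews in ~10 seconds

print(f"DOM:        {dom_rate:.2f} reviews/s")
print(f"API target: {api_low:.1f}-{api_high:.1f} reviews/s")
print(f"Speedup:    {api_low / dom_rate:.0f}x-{api_high / dom_rate:.0f}x")
```
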
## Files Modified

1. `modules/api_interceptor.py` - Core interceptor with parsing logic
2. `modules/scraper.py` - Integration and stats reporting
3. `config.yaml` - `enable_api_intercept: true`

## Testing the Fixes

```bash
# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete

# Run with debug logging
LOG_LEVEL=DEBUG python start.py

# Or run specific test
python test_api_quick.py
```

## Browser Console Messages

When the interceptor is working, you'll see messages like these in the browser console:
```
[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
```

These messages confirm the interceptor is active and capturing responses.
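
On the Python side, these console messages can be filtered out of the browser log and the stats line parsed. The sketch below assumes log entries shaped like the dictionaries Selenium's `driver.get_log("browser")` returns for Chrome (each with a `message` key). It also assumes that each `a/b` pair in the stats line means captured/total, which this document does not state. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional

PREFIX = "[API Interceptor]"
STATS_RE = re.compile(r"Stats: Fetch: (\d+)/(\d+) XHR: (\d+)/(\d+) Queue: (\d+)")


def interceptor_messages(entries: Iterable[Dict[str, object]]) -> List[str]:
    """Keep only console messages emitted by the interceptor."""
    messages = (str(entry.get("message", "")) for entry in entries)
    return [m for m in messages if PREFIX in m]


def parse_stats(message: str) -> Optional[Dict[str, int]]:
    """Parse a stats line; the captured/total reading of each pair is an assumption."""
    match = STATS_RE.search(message)
    if match is None:
        return None
    keys = ("fetch_captured", "fetch_total", "xhr_captured", "xhr_total", "queue")
    return dict(zip(keys, map(int, match.groups())))


sample_log = [
    {"level": "INFO", "message": "[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5"},
    {"level": "WARNING", "message": "Unrelated page warning"},
]
for msg in interceptor_messages(sample_log):
    print(parse_stats(msg))
```
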