Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

141 lines
4.6 KiB
Markdown
# API Interceptor Debug Summary

## Problem Statement

The scraper worked but was **very slow** because it relied on scrolling and DOM parsing. We wanted to use Google's internal API (`/maps/rpc/listugcposts`) to fetch reviews faster.

## What We Discovered

### ✅ API Interception IS Working!

The JavaScript interceptor successfully captures Google Maps API calls:
- **Endpoint**: `/maps/rpc/listugcposts`
- **Response sizes**: 41-96 KB per request
- **Frequency**: 2-5 responses captured per scroll cycle
- **Content**: Each response contains ~10-20 reviews in Google's nested array format

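
Before a captured body can be walked as a nested array, it has to be decoded. The sketch below assumes the body may start with the `)]}'` guard prefix that Google JSON endpoints commonly prepend; whether `listugcposts` responses carry it is an assumption to verify against a real dump, and `decode_captured_body` is a hypothetical helper, not a function from the project.

```python
import json
from typing import Any

# Guard prefix commonly seen on Google JSON responses (assumed here, not confirmed).
XSSI_PREFIX = ")]}'"


def decode_captured_body(body: str) -> Any:
    """Strip an optional guard prefix, then parse the remainder as JSON."""
    text = body.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


# Made-up sample body, shaped like a tiny nested array.
sample = ")]}'\n[[\"review-id\", 5, \"Great place\"]]"
decoded = decode_captured_body(sample)
print(decoded)
```

Bodies without the prefix pass straight through to `json.loads`, so the helper is safe to call on every captured response.
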
### ❌ What Was Broken

1. **Parser Bug**: `TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'`
   - The recursive parser was trying to compare `InterceptedReview` objects with integers
   - This caused ALL parsing to fail even though responses were being captured

2. **Missing Specialized Parser**: The generic recursive extraction didn't understand Google's `listugcposts` format

3. **Insufficient Logging**: Failures were hard to diagnose without seeing what was captured

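
The parser bug in item 1 can be reproduced in isolation. This is a minimal sketch with a hypothetical two-field stand-in for `InterceptedReview` (the real class is not shown here): any code path that treats an already-parsed review as a number raises the same error.

```python
from dataclasses import dataclass


@dataclass
class InterceptedReview:
    # Hypothetical minimal stand-in; the real class likely has more fields.
    review_id: str
    rating: int


def is_positive(value) -> bool:
    """Naive check that assumes every value it receives is a number."""
    return value > 0


review = InterceptedReview(review_id="abc", rating=5)
try:
    is_positive(review)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

A plain dataclass defines no ordering methods, so the comparison against an `int` has no valid implementation on either side and Python raises `TypeError`.
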
## Fixes Implemented

### 1. Fixed Recursion Bug (api_interceptor.py:527-555)

```python
def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Skip if data is already an InterceptedReview object
    if isinstance(data, InterceptedReview):
        return [data]

    # ... rest of logic with proper type checks
```

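
To make the guard concrete, here is a self-contained sketch of a recursive walk with the same early return. It is not the project's implementation: the stand-in `InterceptedReview`, the `extract_reviews` name, and the `MAX_DEPTH` limit are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    # Hypothetical stand-in for the real class.
    review_id: str
    rating: int


MAX_DEPTH = 20  # assumed recursion limit, not taken from the project


def extract_reviews(data: Any, depth: int = 0) -> List[InterceptedReview]:
    """Collect InterceptedReview objects from arbitrarily nested lists and dicts."""
    if isinstance(data, InterceptedReview):
        return [data]  # already parsed: return it instead of inspecting it
    if depth > MAX_DEPTH:
        return []
    if isinstance(data, dict):
        children = list(data.values())
    elif isinstance(data, (list, tuple)):
        children = list(data)
    else:
        return []  # scalars (str, int, None, ...) hold no reviews
    found: List[InterceptedReview] = []
    for child in children:
        found.extend(extract_reviews(child, depth + 1))
    return found


nested = [1, "text", [InterceptedReview("a", 5), {"k": [InterceptedReview("b", 3)]}]]
reviews = extract_reviews(nested)
print([r.review_id for r in reviews])
```

Checking the type before any comparison means parsed objects never reach numeric or string checks, which is the property the fix relies on.
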
### 2. Added Enhanced Debug Logging

**JavaScript Interceptor** (api_interceptor.py:204-307):
- Console logs with `[API Interceptor]` prefix
- Real-time stats every 10 seconds
- Captures ALL network requests (not just matches)
- Logs request types, URLs, and sizes

**Python Side** (api_interceptor.py:331-369, scraper.py:1419-1436):
- Shows number of responses retrieved
- Logs parsing attempts and results
- Reports final stats even if 0 reviews captured
- Browser console log extraction
- Optional response dumping to files in debug mode

### 3. Specialized Parser for listugcposts (api_interceptor.py:435-558)

```python
def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
    """
    Parse Google Maps listugcposts API response.
    Handles deeply nested array format with pattern matching.
    """
```

**Detection Patterns**:
- Long string (30+ chars) = Review ID
- Number 1-5 = Rating
- Long string (50+ chars, not URL) = Review text
- Short string (3-100 chars) = Author name
- Date patterns = Review date

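
The detection patterns above overlap (a 60-character string satisfies three of them), so a useful first step is to list every field a value could be. The sketch below does only that; the `candidate_fields` helper and the date regex are assumptions for illustration, since the real parser's date patterns and tie-breaking rules are not shown.

```python
import re
from typing import Any, Set

# Assumed date shapes for illustration; the real parser's date patterns are not shown.
DATE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}|\d+ (day|week|month|year)s? ago)$")


def candidate_fields(value: Any) -> Set[str]:
    """Return every field a value could be under the listed heuristics."""
    candidates: Set[str] = set()
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        if 1 <= value <= 5:
            candidates.add("rating")
        return candidates
    if not isinstance(value, str):
        return candidates
    if DATE_PATTERN.match(value):
        candidates.add("date")
    is_url = value.startswith(("http://", "https://"))
    if len(value) >= 30:
        candidates.add("review_id")
    if len(value) >= 50 and not is_url:
        candidates.add("text")
    if 3 <= len(value) <= 100:
        candidates.add("author")
    return candidates


print(candidate_fields(5))
print(candidate_fields("Jane Doe"))
print(sorted(candidate_fields("x" * 60)))
```

Because length alone cannot separate an ID from review text or a long author name, a parser built on these heuristics needs an extra signal such as the value's position in the array, which is worth checking against the real response dumps.
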
### 4. Stats & Diagnostics (scraper.py:1487-1509)

When API interception is enabled but captures 0 reviews:
```
⚠️ API interception was enabled but captured 0 reviews.
Network stats - Fetch requests: 0/X, XHR requests: Y/Z
Found N API interceptor console messages
```

## How to Use Debug Mode

```bash
# Enable debug logging
LOG_LEVEL=DEBUG python start.py

# You'll see output like:
[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG] - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses # If parsing fails
[INFO] API interceptor captured 10 reviews (total unique API: 10) # If parsing works!
```

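
How `start.py` reads `LOG_LEVEL` is not shown here. One common way to wire an environment variable into Python's `logging` module is sketched below; `resolve_log_level` is a hypothetical helper, and the fallback to `INFO` is an assumption.

```python
import logging
import os


def resolve_log_level(default: str = "INFO") -> int:
    """Map the LOG_LEVEL environment variable to a logging level constant."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    # Unknown names, or attributes that are not numeric levels, fall back to INFO.
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level(), format="[%(levelname)s] %(message)s")
logging.getLogger(__name__).debug("Retrieved %d intercepted responses from browser", 2)
```

With `LOG_LEVEL=DEBUG` set, the final call emits a line in the same `[DEBUG] ...` shape as the sample output; at the default level it stays silent.
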
## Next Steps to Complete API Speed Optimization

1. **Test with Real Data**: Run scraper with DEBUG logging to see actual listugcposts responses
2. **Analyze Response Format**: Examine captured responses in `debug_api_dump/` directory
3. **Refine Parser**: Adjust field detection based on actual Google API format
4. **Benchmark Performance**: Compare DOM vs API scraping speed
5. **Add Pure API Mode**: Option to skip DOM scraping entirely and only use API

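
For step 2, a compact outline of a response's nesting can make the format easier to read than the raw array. The sketch below is one possible approach: `describe_shape` is a hypothetical helper, and it runs on an inline sample because the naming and encoding of the files in `debug_api_dump/` are not documented here.

```python
import json
from typing import Any


def describe_shape(node: Any, depth: int = 0, max_depth: int = 3) -> str:
    """Summarise nested JSON as a compact type/length outline."""
    if isinstance(node, list):
        if depth >= max_depth or not node:
            return f"list[{len(node)}]"
        # Only the first five children are expanded to keep the outline short.
        inner = ", ".join(describe_shape(child, depth + 1, max_depth) for child in node[:5])
        suffix = ", ..." if len(node) > 5 else ""
        return f"list[{len(node)}]({inner}{suffix})"
    if isinstance(node, dict):
        return f"dict[{len(node)}]"
    if isinstance(node, str):
        return f"str[{len(node)}]"
    return type(node).__name__


# Made-up sample; a real dump would be loaded from disk instead.
sample = json.loads('[["abc", 5, null], [["nested"]]]')
outline = describe_shape(sample)
print(outline)
```

Comparing outlines across several dumps shows which positions are stable, which helps when adjusting field detection in step 3.
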
## Expected Performance Improvement

**Current (DOM Scraping)**:
- ~2-4 reviews/second
- Requires scrolling + waiting for render
- 244 reviews in ~3 minutes

**Target (API Mode)**:
- ~20-50 reviews/second (10-25x faster!)
- No scrolling needed
- 244 reviews in ~10-20 seconds

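
The end-to-end totals above can be turned into rates directly. The short calculation below uses only the quoted figures (244 reviews, ~3 minutes, ~10-20 seconds); it suggests the totals imply somewhat lower rates than the per-second ranges, possibly because the totals include startup and page load, though that explanation is an assumption.

```python
def reviews_per_second(reviews: int, seconds: float) -> float:
    """Average throughput over a whole run."""
    return reviews / seconds


REVIEWS = 244
DOM_SECONDS = 180.0          # "~3 minutes"
API_SECONDS = (20.0, 10.0)   # "~10-20 seconds", slow and fast ends

dom_rate = reviews_per_second(REVIEWS, DOM_SECONDS)
api_rates = [reviews_per_second(REVIEWS, s) for s in API_SECONDS]
speedups = [DOM_SECONDS / s for s in API_SECONDS]

print(f"DOM: {dom_rate:.2f} reviews/s")
print(f"API: {api_rates[0]:.1f} to {api_rates[1]:.1f} reviews/s")
print(f"Speedup: {speedups[0]:.0f}x to {speedups[1]:.0f}x")
```

Measured this way, the target is roughly a 9x to 18x end-to-end speedup, a useful baseline to compare against once the benchmark is run.
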
## Files Modified

1. `modules/api_interceptor.py` - Core interceptor with parsing logic
2. `modules/scraper.py` - Integration and stats reporting
3. `config.yaml` - `enable_api_intercept: true`

## Testing the Fixes

```bash
# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete

# Run with debug logging
LOG_LEVEL=DEBUG python start.py

# Or run specific test
python test_api_quick.py
```

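
On systems without `find`, the cache cleanup can be done from Python instead. This is a sketch of an equivalent, not part of the project; `clean_python_cache` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def clean_python_cache(root: Path) -> int:
    """Remove __pycache__ directories and stray .pyc files under root.

    Returns the number of directories and files removed.
    """
    removed = 0
    # Materialise the list first so the tree is not modified while iterating.
    for cache_dir in list(root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir, ignore_errors=True)
            removed += 1
    # Any .pyc files left outside __pycache__ directories.
    for pyc in list(root.rglob("*.pyc")):
        pyc.unlink(missing_ok=True)
        removed += 1
    return removed
```

Calling `clean_python_cache(Path("."))` from the project root mirrors the two `find` commands above.
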
## Browser Console Messages

When the interceptor is working, you'll see in the browser console:
```
[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
```

These messages confirm the interceptor is active and capturing responses.
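
When console messages are pulled back into Python, the stats line can be parsed into numbers for assertions or reporting. The sketch below assumes each `a/b` pair means captured/total, which matches the sample line but is not confirmed by the interceptor source; `parse_stats_line` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

STATS_PATTERN = re.compile(
    r"\[API Interceptor\] Stats: Fetch: (\d+)/(\d+) XHR: (\d+)/(\d+) Queue: (\d+)"
)


def parse_stats_line(message: str) -> Optional[Dict[str, int]]:
    """Extract counters from a stats console message, or None if it does not match."""
    match = STATS_PATTERN.search(message)
    if match is None:
        return None
    # Key names are illustrative; "captured/total" is an assumed reading of "a/b".
    keys = ("fetch_captured", "fetch_total", "xhr_captured", "xhr_total", "queue")
    return dict(zip(keys, map(int, match.groups())))


stats = parse_stats_line("[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5")
print(stats)
```

A result with `xhr_captured` above zero but no parsed reviews points at the parser rather than the interceptor.
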