Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions
--- a/API_INTERCEPTOR_DEBUG_SUMMARY.md
+++ b/API_INTERCEPTOR_DEBUG_SUMMARY.md
@@ -0,0 +1,140 @@
+# API Interceptor Debug Summary
+
+## Problem Statement
+The scraper was working but **very slow** due to scrolling + DOM parsing. We wanted to use Google's internal API (`/maps/rpc/listugcposts`) to get reviews faster.
+
+## What We Discovered
+
+### ✅ API Interception IS Working!
+The JavaScript interceptor successfully captures Google Maps API calls:
+- **Endpoint**: `/maps/rpc/listugcposts`
+- **Response sizes**: 41KB - 96KB per request
+- **Frequency**: 2-5 responses captured per scroll cycle
+- **Content**: Each response contains ~10-20 reviews in Google's nested array format
+
+### ❌ What Was Broken
+1. **Parser Bug**: `TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'`
+   - The recursive parser was trying to compare InterceptedReview objects with integers
+   - Caused ALL parsing to fail despite responses being captured
+
+2. **Missing Specialized Parser**: Generic recursive extraction didn't understand Google's `listugcposts` format
+
+3. **Insufficient Logging**: Hard to diagnose without seeing what was captured
+
+## Fixes Implemented
+
+### 1. Fixed Recursion Bug (api_interceptor.py:527-555)
+```python
+def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
+    # Skip if data is already an InterceptedReview object
+    if isinstance(data, InterceptedReview):
+        return [data]
+
+    # ... rest of logic with proper type checks
+```
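Reviewer sketch (not part of the committed file): a minimal reproduction of the reported `TypeError` and of the `isinstance` guard that avoids it. The `InterceptedReview` dataclass and its fields are hypothetical stand-ins, and `extract_reviews` is a simplified walker written for illustration, not the code in `api_interceptor.py`.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    # Hypothetical stand-in; the real class lives in modules/api_interceptor.py.
    review_id: str
    rating: int


def extract_reviews(data: Any, depth: int = 0, max_depth: int = 20) -> List[InterceptedReview]:
    """Walk nested lists/dicts and collect InterceptedReview objects."""
    # Guard first: an already-parsed review must never reach numeric comparisons.
    if isinstance(data, InterceptedReview):
        return [data]
    if depth > max_depth:
        return []
    if isinstance(data, dict):
        children = list(data.values())
    elif isinstance(data, (list, tuple)):
        children = list(data)
    else:
        children = []  # scalars (str, int, None, ...) carry no nested reviews
    found: List[InterceptedReview] = []
    for child in children:
        found.extend(extract_reviews(child, depth + 1, max_depth))
    return found


review = InterceptedReview(review_id="abc", rating=5)

# Without a type check, comparing the object to an int raises the reported error.
try:
    review > 0
except TypeError as exc:
    error_message = str(exc)

nested = [1, ["text", review], {"k": [review, 3]}]
reviews = extract_reviews(nested)
```

Checking the type before any comparison means previously parsed objects pass through unchanged, so mixed input (raw arrays plus parsed reviews) no longer aborts the whole parse.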
+
+### 2. Added Enhanced Debug Logging
+
+**JavaScript Interceptor** (api_interceptor.py:204-307):
+- Console logs with `[API Interceptor]` prefix
+- Real-time stats every 10 seconds
+- Captures ALL network requests (not just matches)
+- Logs request types, URLs, and sizes
+
+**Python Side** (api_interceptor.py:331-369, scraper.py:1419-1436):
+- Shows number of responses retrieved
+- Logs parsing attempts and results
+- Reports final stats even if 0 reviews captured
+- Browser console log extraction
+- Optional response dumping to files in debug mode
+
+### 3. Specialized Parser for listugcposts (api_interceptor.py:435-558)
+
+```python
+def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
+    """
+    Parse Google Maps listugcposts API response.
+    Handles deeply nested array format with pattern matching.
+    """
+```
+
+**Detection Patterns**:
+- Long string (30+ chars) = Review ID
+- Number 1-5 = Rating
+- Long string (50+ chars, not URL) = Review text
+- Short string (3-100 chars) = Author name
+- Date patterns = Review date
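Reviewer sketch (not part of the committed file): one way the detection patterns above could be applied to a single value. The listed length ranges overlap, so this sketch adds assumptions the source does not state: IDs are treated as long tokens without whitespace, checks run in a fixed priority order, and the date regex covers only ISO dates and "N months ago" style strings. `classify_field` is a hypothetical helper name.

```python
import re
from typing import Any, Optional

# Assumed date shapes; the source says only "date patterns" without listing them.
DATE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}|(a|an|\d+) (day|week|month|year)s? ago)$"
)


def classify_field(value: Any) -> Optional[str]:
    """Guess which review field a raw value holds, or None if nothing fits."""
    # bool is a subclass of int, so exclude it before the rating check.
    if isinstance(value, int) and not isinstance(value, bool):
        return "rating" if 1 <= value <= 5 else None
    if not isinstance(value, str):
        return None
    text = value.strip()
    if DATE_RE.match(text):
        return "date"
    if text.startswith(("http://", "https://")):
        return None  # URLs are never review text
    has_space = any(ch.isspace() for ch in text)
    # Assumption: IDs are long tokens with no whitespace.
    if len(text) >= 30 and not has_space:
        return "review_id"
    if len(text) >= 50:
        return "text"
    if 3 <= len(text) <= 100:
        return "author"
    return None
```

Because the heuristics are ordered, a 40-character string containing spaces falls through to `author`. Thresholds like these would need tuning against real captured responses.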
+
+### 4. Stats & Diagnostics (scraper.py:1487-1509)
+
+When API interception is enabled but captures 0 reviews:
+```
+⚠️  API interception was enabled but captured 0 reviews.
+Network stats - Fetch requests: 0/X, XHR requests: Y/Z
+Found N API interceptor console messages
+```
+
+## How to Use Debug Mode
+
+```bash
+# Enable debug logging
+LOG_LEVEL=DEBUG python start.py
+
+# You'll see output like the following (commented out so the block stays valid shell):
+# [DEBUG] Retrieved 2 intercepted responses from browser
+# [DEBUG]   - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
+# [DEBUG] Collected 2 network responses from browser
+# [DEBUG] Parsed 0 reviews from responses                                    <- if parsing fails
+# [INFO] API interceptor captured 10 reviews (total unique API: 10)          <- if parsing works
+```
+
+## Next Steps to Complete API Speed Optimization
+
+1. **Test with Real Data**: Run scraper with DEBUG logging to see actual listugcposts responses
+2. **Analyze Response Format**: Examine captured responses in `debug_api_dump/` directory
+3. **Refine Parser**: Adjust field detection based on actual Google API format
+4. **Benchmark Performance**: Compare DOM vs API scraping speed
+5. **Add Pure API Mode**: Option to skip DOM scraping entirely and only use API
+
+## Expected Performance Improvement
+
+**Current (DOM Scraping)**:
+- ~2-4 reviews/second
+- Requires scrolling + waiting for render
+- 244 reviews in ~3 minutes
+
+**Target (API Mode)**:
+- ~20-50 reviews/second (10-25x faster!)
+- No scrolling needed
+- 244 reviews in ~10-20 seconds
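Reviewer sketch (not part of the committed file): a quick arithmetic check of the end-to-end figures above, treating "~3 minutes" as exactly 180 seconds (an assumption). It works out to about 1.36 reviews/s for DOM scraping, 12.2 to 24.4 reviews/s for the API target, and a 9x to 18x speedup. These are somewhat below the per-second estimates quoted above, so both sets of numbers are best read as rough.

```python
REVIEWS = 244
DOM_SECONDS = 3 * 60        # "~3 minutes", taken as exactly 180 s (assumption)
API_SECONDS = (10, 20)      # target range: fastest and slowest case

dom_rate = REVIEWS / DOM_SECONDS
api_rates = tuple(REVIEWS / s for s in API_SECONDS)       # (fastest, slowest)
speedups = tuple(DOM_SECONDS / s for s in API_SECONDS)    # (fastest, slowest)

print(f"DOM: {dom_rate:.2f} reviews/s")
print(f"API: {api_rates[1]:.1f}-{api_rates[0]:.1f} reviews/s")
print(f"Speedup: {speedups[1]:.0f}x-{speedups[0]:.0f}x")
```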
+
+## Files Modified
+
+1. `modules/api_interceptor.py` - Core interceptor with parsing logic
+2. `modules/scraper.py` - Integration and stats reporting
+3. `config.yaml` - `enable_api_intercept: true`
+
+## Testing the Fixes
+
+```bash
+# Clean Python cache first
+find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
+find . -name "*.pyc" -delete
+
+# Run with debug logging
+LOG_LEVEL=DEBUG python start.py
+
+# Or run specific test
+python test_api_quick.py
+```
+
+## Browser Console Messages
+
+When the interceptor is working, the browser console shows messages like these:
+```
+[API Interceptor] ✅ Injected successfully! Monitoring network requests...
+[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
+[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
+```
+
+These messages confirm the interceptor is active and capturing responses.
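Reviewer sketch (not part of the committed file): if the console messages are collected on the Python side, the `Stats:` line shown above could be turned into numbers with a regular expression. The function name, the key names, and the reading of each `a/b` pair as captured/total are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Matches e.g. "[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5"
STATS_RE = re.compile(
    r"Stats:\s*Fetch:\s*(\d+)/(\d+)\s+XHR:\s*(\d+)/(\d+)\s+Queue:\s*(\d+)"
)


def parse_stats_line(line: str) -> Optional[Dict[str, int]]:
    """Return the counters from a stats console line, or None if it does not match."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    # Assumed meaning: the first number of each pair is captured, the second is total seen.
    keys = ("fetch_captured", "fetch_total", "xhr_captured", "xhr_total", "queue")
    return dict(zip(keys, map(int, match.groups())))


stats = parse_stats_line("[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5")
```

A result with `xhr_captured` above zero but no parsed reviews would point at the parser rather than the interceptor.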