Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
API Interceptor Debug Summary
Problem Statement
The scraper was working but very slow due to scrolling + DOM parsing. We wanted to use Google's internal API (/maps/rpc/listugcposts) to get reviews faster.
What We Discovered
✅ API Interception IS Working!
The JavaScript interceptor successfully captures Google Maps API calls:
- Endpoint: /maps/rpc/listugcposts
- Response sizes: 41KB - 96KB per request
- Frequency: 2-5 responses captured per scroll cycle
- Content: Each response contains ~10-20 reviews in Google's nested array format
❌ What Was Broken
- Parser Bug: TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'
  - The recursive parser was trying to compare InterceptedReview objects with integers
  - Caused ALL parsing to fail despite responses being captured
- Missing Specialized Parser: Generic recursive extraction didn't understand Google's listugcposts format
- Insufficient Logging: Hard to diagnose without seeing what was captured
Fixes Implemented
1. Fixed Recursion Bug (api_interceptor.py:527-555)
def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Skip if data is already an InterceptedReview object
    if isinstance(data, InterceptedReview):
        return [data]
    # ... rest of logic with proper type checks
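To illustrate why the early `isinstance` guard matters, here is a minimal standalone sketch of a recursive walk over nested data. The `InterceptedReview` fields, the `max_depth` parameter, and the free-function form are assumptions made for illustration; the real class and method in api_interceptor.py may look different.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    # Hypothetical minimal shape; the real class may carry more fields.
    review_id: str
    rating: int
    text: str = ""


def extract_reviews_recursive(
    data: Any, depth: int = 0, max_depth: int = 25
) -> List[InterceptedReview]:
    """Collect InterceptedReview objects from arbitrarily nested lists/dicts."""
    # Guard first: an already-parsed review must never reach any
    # numeric comparison or container logic further down.
    if isinstance(data, InterceptedReview):
        return [data]
    if depth > max_depth:
        return []
    if isinstance(data, dict):
        children = list(data.values())
    elif isinstance(data, (list, tuple)):
        children = list(data)
    else:
        return []  # scalars (str, int, None, ...) hold no nested reviews
    found: List[InterceptedReview] = []
    for child in children:
        found.extend(extract_reviews_recursive(child, depth + 1, max_depth))
    return found
```

Because every branch checks the type before touching the value, a review object that appears inside a list is returned as-is instead of being compared with an integer.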
2. Added Enhanced Debug Logging
JavaScript Interceptor (api_interceptor.py:204-307):
- Console logs with [API Interceptor] prefix
- Real-time stats every 10 seconds
- Captures ALL network requests (not just matches)
- Logs request types, URLs, and sizes
Python Side (api_interceptor.py:331-369, scraper.py:1419-1436):
- Shows number of responses retrieved
- Logs parsing attempts and results
- Reports final stats even if 0 reviews captured
- Browser console log extraction
- Optional response dumping to files in debug mode
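The optional response dumping could be sketched roughly as follows. The function name `dump_responses`, the logger name, and the `response_NNN.txt` file naming are assumptions for illustration, not the project's actual code.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger("api_interceptor")  # assumed logger name


def dump_responses(bodies: List[str], dump_dir: Path) -> List[Path]:
    """Write each captured response body to its own file, only in debug mode."""
    if not logger.isEnabledFor(logging.DEBUG):
        return []  # dumping is opt-in, so do nothing outside debug mode
    dump_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, body in enumerate(bodies):
        path = dump_dir / f"response_{index:03d}.txt"
        path.write_text(body, encoding="utf-8")
        logger.debug("Dumped %d bytes to %s", len(body.encode("utf-8")), path)
        written.append(path)
    return written
```

Gating on the logger level keeps normal runs free of disk writes while still letting a DEBUG run leave raw responses behind for offline inspection.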
3. Specialized Parser for listugcposts (api_interceptor.py:435-558)
def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
    """
    Parse Google Maps listugcposts API response.
    Handles deeply nested array format with pattern matching.
    """
Detection Patterns:
- Long string (30+ chars) = Review ID
- Number 1-5 = Rating
- Long string (50+ chars, not URL) = Review text
- Short string (3-100 chars) = Author name
- Date patterns = Review date
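The detection patterns above overlap (a 60-character string fits the ID, text, and author rules at once), so any implementation needs a tie-breaker. The sketch below is one possible reading, not the project's parser: `classify_field`, the date regex, and the "IDs contain no whitespace, review text does" rule are all assumptions.

```python
import re
from typing import Any, Optional

# Assumed date shapes: ISO-like ("2024-05-01") or relative ("2 weeks ago").
_DATE_RE = re.compile(
    r"^(?:\d{4}-\d{2}-\d{2}|(?:a|an|\d+)\s+(?:day|week|month|year)s?\s+ago)$",
    re.IGNORECASE,
)


def classify_field(value: Any) -> Optional[str]:
    """Guess which review field a scalar represents, or None if unknown."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, int):
        return "rating" if 1 <= value <= 5 else None
    if not isinstance(value, str):
        return None
    text = value.strip()
    if _DATE_RE.match(text):
        return "date"
    if text.startswith(("http://", "https://")):
        return None  # URLs are never IDs, text, or names
    has_space = " " in text
    if len(text) >= 50 and has_space:
        return "text"  # long prose
    if len(text) >= 30 and not has_space:
        return "review_id"  # long opaque token (assumed to have no spaces)
    if 3 <= len(text) <= 100:
        return "author"
    return None
```

Checking the date pattern and URL prefix before the length rules keeps short date strings from being labeled as author names and long links from being labeled as review text.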
4. Stats & Diagnostics (scraper.py:1487-1509)
When API interception is enabled but captures 0 reviews:
⚠️ API interception was enabled but captured 0 reviews.
Network stats - Fetch requests: 0/X, XHR requests: Y/Z
Found N API interceptor console messages
How to Use Debug Mode
# Enable debug logging
LOG_LEVEL=DEBUG python start.py
# You'll see output like:
[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG] - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses # If parsing fails
[INFO] API interceptor captured 10 reviews (total unique API: 10) # If parsing works!
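For reference, an entry point can honor the LOG_LEVEL environment variable along these lines. This is a sketch using only the standard `logging` module; `configure_logging` and the log format are assumptions, and start.py may do it differently.

```python
import logging
import os


def configure_logging() -> int:
    """Configure root logging from LOG_LEVEL, falling back to INFO."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to a safe default
    # force=True (Python 3.8+) replaces any handlers installed earlier.
    logging.basicConfig(
        level=level, format="[%(levelname)s] %(message)s", force=True
    )
    return level
```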
Next Steps to Complete API Speed Optimization
- Test with Real Data: Run scraper with DEBUG logging to see actual listugcposts responses
- Analyze Response Format: Examine captured responses in debug_api_dump/ directory
- Refine Parser: Adjust field detection based on actual Google API format
- Benchmark Performance: Compare DOM vs API scraping speed
- Add Pure API Mode: Option to skip DOM scraping entirely and only use API
Expected Performance Improvement
Current (DOM Scraping):
- ~2-4 reviews/second
- Requires scrolling + waiting for render
- 244 reviews in ~3 minutes
Target (API Mode):
- ~20-50 reviews/second (10-25x faster!)
- No scrolling needed
- 244 reviews in ~10-20 seconds
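As a quick sanity check on these figures, the throughput implied by the 244-review timings can be computed directly. The numbers below are taken from the lists above; the variable names are illustrative.

```python
reviews = 244

# Current DOM scraping: 244 reviews in ~3 minutes.
dom_seconds = 3 * 60
dom_rate = reviews / dom_seconds

# Target API mode: 244 reviews in ~10-20 seconds (slowest case first).
api_rates = [reviews / seconds for seconds in (20, 10)]
speedups = [rate / dom_rate for rate in api_rates]

print(f"DOM: {dom_rate:.2f} reviews/s")
print(f"API: {api_rates[0]:.1f} to {api_rates[1]:.1f} reviews/s")
print(f"Speedup: {speedups[0]:.0f}x to {speedups[1]:.0f}x")
```

The timing figures imply roughly 1.4 reviews/second for DOM scraping and about 12-24 reviews/second for API mode (a 9-18x speedup), so the per-second rates quoted above sit at the optimistic end and should be treated as estimates until the benchmark is run.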
Files Modified
- modules/api_interceptor.py - Core interceptor with parsing logic
- modules/scraper.py - Integration and stats reporting
- config.yaml - enable_api_intercept: true
Testing the Fixes
# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete
# Run with debug logging
LOG_LEVEL=DEBUG python start.py
# Or run specific test
python test_api_quick.py
Browser Console Messages
When the interceptor is working, you'll see in the browser console:
[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
These messages confirm the interceptor is active and capturing responses.
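If the stats line needs to be checked programmatically (for example, when extracting browser console logs on the Python side), it can be parsed with a small regex. This sketch assumes each `a/b` pair means captured/total, which the messages above suggest but do not state; `parse_stats_line` and the dictionary keys are hypothetical names.

```python
import re
from typing import Dict, Optional

_STATS_RE = re.compile(
    r"\[API Interceptor\] Stats: "
    r"Fetch: (\d+)/(\d+) XHR: (\d+)/(\d+) Queue: (\d+)"
)


def parse_stats_line(message: str) -> Optional[Dict[str, int]]:
    """Extract counters from an interceptor stats console message."""
    match = _STATS_RE.search(message)
    if match is None:
        return None  # not a stats line
    keys = ("fetch_captured", "fetch_total", "xhr_captured", "xhr_total", "queue")
    return dict(zip(keys, map(int, match.groups())))
```

For the sample line above, `parse_stats_line` would report 5 captured XHR responses out of 15 and a queue length of 5.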