# API Interceptor Debug Summary

## Problem Statement

The scraper was working but **very slow** due to scrolling + DOM parsing. We wanted to use Google's internal API (`/maps/rpc/listugcposts`) to get reviews faster.

## What We Discovered

### ✅ API Interception IS Working!

The JavaScript interceptor successfully captures Google Maps API calls:

- **Endpoint**: `/maps/rpc/listugcposts`
- **Response sizes**: 41KB - 96KB per request
- **Frequency**: 2-5 responses captured per scroll cycle
- **Content**: Each response contains ~10-20 reviews in Google's nested array format

### ❌ What Was Broken

1. **Parser Bug**: `TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'`
   - The recursive parser was trying to compare InterceptedReview objects with integers
   - Caused ALL parsing to fail despite responses being captured
2. **Missing Specialized Parser**: Generic recursive extraction didn't understand Google's `listugcposts` format
3. **Insufficient Logging**: Hard to diagnose without seeing what was captured

## Fixes Implemented

### 1. Fixed Recursion Bug (api_interceptor.py:527-555)

```python
def _extract_reviews_recursive(self, data: Any, depth: int = 0) -> List[InterceptedReview]:
    # Skip if data is already an InterceptedReview object
    if isinstance(data, InterceptedReview):
        return [data]
    # ... rest of logic with proper type checks
```

### 2. Added Enhanced Debug Logging

**JavaScript Interceptor** (api_interceptor.py:204-307):

- Console logs with `[API Interceptor]` prefix
- Real-time stats every 10 seconds
- Captures ALL network requests (not just matches)
- Logs request types, URLs, and sizes

**Python Side** (api_interceptor.py:331-369, scraper.py:1419-1436):

- Shows number of responses retrieved
- Logs parsing attempts and results
- Reports final stats even if 0 reviews captured
- Browser console log extraction
- Optional response dumping to files in debug mode

### 3. Specialized Parser for listugcposts (api_interceptor.py:435-558)

```python
def _parse_listugcposts_response(self, data: Any) -> List[InterceptedReview]:
    """
    Parse Google Maps listugcposts API response.
    Handles deeply nested array format with pattern matching.
    """
```

**Detection Patterns**:

- Long string (30+ chars) = Review ID
- Number 1-5 = Rating
- Long string (50+ chars, not URL) = Review text
- Short string (3-100 chars) = Author name
- Date patterns = Review date

### 4. Stats & Diagnostics (scraper.py:1487-1509)

When API interception is enabled but captures 0 reviews:

```
⚠️ API interception was enabled but captured 0 reviews.
   Network stats - Fetch requests: 0/X, XHR requests: Y/Z
   Found N API interceptor console messages
```

## How to Use Debug Mode

```bash
# Enable debug logging
LOG_LEVEL=DEBUG python start.py

# You'll see output like:
[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG]   - XHR: /maps/rpc/listugcposts?authuser=0... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses  # If parsing fails
[INFO] API interceptor captured 10 reviews (total unique API: 10)  # If parsing works!
```

## Next Steps to Complete API Speed Optimization

1. **Test with Real Data**: Run scraper with DEBUG logging to see actual listugcposts responses
2. **Analyze Response Format**: Examine captured responses in `debug_api_dump/` directory
3. **Refine Parser**: Adjust field detection based on actual Google API format
4. **Benchmark Performance**: Compare DOM vs API scraping speed
5. **Add Pure API Mode**: Option to skip DOM scraping entirely and only use API

## Expected Performance Improvement

**Current (DOM Scraping)**:

- ~2-4 reviews/second
- Requires scrolling + waiting for render
- 244 reviews in ~3 minutes

**Target (API Mode)**:

- ~20-50 reviews/second (10-25x faster!)
- No scrolling needed
- 244 reviews in ~10-20 seconds

## Files Modified

1. `modules/api_interceptor.py` - Core interceptor with parsing logic
2. `modules/scraper.py` - Integration and stats reporting
3. `config.yaml` - `enable_api_intercept: true`

## Testing the Fixes

```bash
# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete

# Run with debug logging
LOG_LEVEL=DEBUG python start.py

# Or run specific test
python test_api_quick.py
```

## Browser Console Messages

When the interceptor is working, you'll see in the browser console:

```
[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
```

These messages confirm the interceptor is active and capturing responses.
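
The stats message shown above can also be checked programmatically instead of by eye. The following is a minimal sketch, not the project's actual code: `InterceptorStats` and `parse_interceptor_stats` are hypothetical names, it assumes the console text has already been collected as plain strings, and it assumes each `a/b` pair means captured/total, which this summary does not state explicitly.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches stats messages such as:
#   [API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5
# Whitespace between fields is flexible, so a multi-line variant also matches.
_STATS_RE = re.compile(
    r"\[API Interceptor\]\s*Stats:\s*"
    r"Fetch:\s*(\d+)/(\d+)\s+"
    r"XHR:\s*(\d+)/(\d+)\s+"
    r"Queue:\s*(\d+)"
)


@dataclass
class InterceptorStats:
    """Hypothetical container; a/b is assumed to mean captured/total."""
    fetch_captured: int
    fetch_total: int
    xhr_captured: int
    xhr_total: int
    queue: int

    @property
    def captured(self) -> int:
        # Total responses captured across both request types.
        return self.fetch_captured + self.xhr_captured


def parse_interceptor_stats(line: str) -> Optional[InterceptorStats]:
    """Return parsed stats, or None if the line is not a stats message."""
    match = _STATS_RE.search(line)
    if match is None:
        return None
    return InterceptorStats(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    sample = "[API Interceptor] Stats: Fetch: 0/0 XHR: 5/15 Queue: 5"
    print(parse_interceptor_stats(sample))
```

A caller could treat `captured == 0` after several scroll cycles as a sign that the interceptor was injected but is not matching any requests, which is the same condition the "captured 0 reviews" warning reports.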