whyrating-engine-legacy/QUICK_START_API_MODE.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
2026-01-18 19:49:24 +00:00


Quick Start: API Interception Mode

Status: API Interceptor Enhanced & Ready

The API interceptor has been debugged and enhanced. It captures Google Maps API responses successfully, but the parser may still need tuning for your specific use case.

🚀 Quick Start

Enable API Mode

Your config.yaml already has:

enable_api_intercept: true

Run with Debug Logging

# Clean Python cache first
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete

# Run with debug output
LOG_LEVEL=DEBUG python start.py 2>&1 | tee scraper_debug.log

What You'll See

Successful Setup:

[INFO] API interception enabled via CDP
[INFO] JavaScript response interceptor injected with enhanced debugging
[INFO] API interceptor ready - capturing network responses

📊 During Scraping:

[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG]   - XHR: /maps/rpc/listugcposts?... (68426 bytes)
[DEBUG] Collected 2 network responses from browser
[DEBUG] Parsed 0 reviews from responses  # If parser needs tuning

OR

[INFO] API interceptor captured 10 reviews (total unique API: 10)  # SUCCESS!

🔧 What I Fixed

1. Fixed Critical Bug (api_interceptor.py:527)

  • Bug: TypeError: '>' not supported between instances of 'InterceptedReview' and 'int'
  • Fix: Added proper type checking in recursive extraction
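The kind of type check described above can be sketched as follows. This is a minimal illustration, not the code at api_interceptor.py:527: the `InterceptedReview` dataclass is a hypothetical stand-in for the real class, and `find_ratings` is a made-up helper name. The point is that a recursive walk should confirm a value is numeric before comparing it, so objects already converted to reviews are skipped instead of raising a TypeError.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class InterceptedReview:
    """Hypothetical stand-in for the real class in api_interceptor.py."""
    review_id: str
    rating: int


def find_ratings(node: Any) -> List[int]:
    """Recursively collect integers in the 1-5 range from a nested structure."""
    found: List[int] = []
    if isinstance(node, bool):
        # bool is a subclass of int in Python, so rule it out first.
        return found
    if isinstance(node, int):
        # Compare only after confirming the value is numeric.
        if 1 <= node <= 5:
            found.append(node)
    elif isinstance(node, (list, tuple)):
        for child in node:
            found.extend(find_ratings(child))
    elif isinstance(node, dict):
        for child in node.values():
            found.extend(find_ratings(child))
    # Everything else (strings, None, InterceptedReview objects) is skipped
    # rather than compared against an int.
    return found


mixed = [4, "text", [InterceptedReview("abc", 5), 2, None], {"k": 9, "r": 1}, True]
ratings = find_ratings(mixed)
print(ratings)
```

Without the `isinstance` guards, a comparison such as `node > 0` would fail as soon as the walk reached an `InterceptedReview` instance, which matches the error message quoted in the bug report.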

2. Enhanced Logging (api_interceptor.py:204-369)

  • Browser console logs with [API Interceptor] prefix
  • Real-time network stats (Fetch/XHR counts)
  • Response URL and size tracking
  • Automatic response dumping in debug mode

3. Specialized Parser (api_interceptor.py:435-558)

  • Created _parse_listugcposts_response() for Google's API format
  • Pattern-based detection:
    • Long string (30+ chars) → Review ID
    • Number 1-5 → Rating
    • Long string (50+ chars, not URL) → Review text
    • Short string (3-100 chars) → Author name
    • Date patterns → Review date
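The detection rules above overlap (a 60-character string satisfies the ID, text, and author rules), so any implementation has to pick a precedence. The sketch below shows one possible ordering and is not the actual logic of `_parse_listugcposts_response()`. The function name `classify`, the date regex, and the "IDs contain no whitespace" tie-breaker are all assumptions made for illustration.

```python
import re
from typing import Any, Optional

# Assumed date shapes: ISO dates or relative phrases such as "2 months ago".
DATE_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}|(?:a|an|\d+)\s+(?:day|week|month|year)s?\s+ago"
)


def classify(value: Any) -> Optional[str]:
    """Guess which review field a raw value represents, or None if unknown."""
    if isinstance(value, bool):
        return None
    if isinstance(value, int):
        return "rating" if 1 <= value <= 5 else None
    if not isinstance(value, str):
        return None
    text = value.strip()
    if DATE_RE.fullmatch(text):
        return "date"
    if text.startswith(("http://", "https://")):
        return None  # URLs are never review text
    has_space = any(ch.isspace() for ch in text)
    # Assumption: IDs are long opaque tokens without whitespace.
    if len(text) >= 30 and not has_space:
        return "review_id"
    if len(text) >= 50:
        return "review_text"
    if 3 <= len(text) <= 100:
        return "author"
    return None


# Made-up sample values, not taken from a real response.
samples = [
    5,
    "2 months ago",
    "ChZDSUhNMG9nS0VJQ0FnSUR1aE5fZ0dREAE",
    "Great coffee and friendly staff, will definitely come back again soon.",
    "Jane Doe",
    "https://example.com/" + "x" * 40,
]
labels = [classify(s) for s in samples]
print(labels)
```

Checking dates and URLs first keeps short relative dates from being mislabeled as author names. If real responses turn out to use a different ID or date format, only the corresponding rule needs to change.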

4. Stats & Diagnostics (scraper.py:1487-1509)

  • Reports captured vs parsed reviews
  • Shows browser console messages
  • Dumps raw responses for analysis

📈 Expected Performance

| Mode | Speed | Time for 244 Reviews |
|---|---|---|
| Current (DOM) | 2-4 reviews/sec | ~3 minutes |
| Target (API) | 20-50 reviews/sec | ~10-20 seconds |
| Speed Up | 10-25x faster! 🚀 | |

🧪 Testing & Tuning

Step 1: Capture Sample Responses

# Run in debug mode to dump API responses
LOG_LEVEL=DEBUG python start.py

# Check for dumped responses
ls -lh debug_api_dump/

Step 2: Analyze Response Format

# View captured response structure
head -n 100 debug_api_dump/response_0_body.txt

Step 3: Tune Parser

If parsing returns 0 reviews, the Google API format may differ from our patterns. Open debug_api_dump/response_0_body.txt and:

  1. Look for review data patterns
  2. Adjust detection logic in _parse_listugcposts_response()
  3. Test again with LOG_LEVEL=DEBUG python start.py
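To look for review data patterns in a dumped response, a small script can list where the long strings sit in the nested structure. This is a rough sketch under two assumptions that should be checked against your own dump: that the body is JSON, and that it may start with the `)]}'` anti-hijacking prefix that Google endpoints often prepend. The helper names `load_response` and `long_strings` are hypothetical.

```python
import json
from typing import Any, Iterator, Tuple

XSSI_PREFIX = ")]}'"  # assumed prefix; verify against your own dump


def load_response(raw: str) -> Any:
    """Strip the optional prefix and parse the rest as JSON."""
    body = raw.lstrip()
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def long_strings(
    node: Any, path: Tuple = (), min_len: int = 50
) -> Iterator[Tuple[Tuple, str]]:
    """Yield (path, value) for every string of at least min_len characters."""
    if isinstance(node, str):
        if len(node) >= min_len:
            yield path, node
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from long_strings(child, path + (index,), min_len)
    elif isinstance(node, dict):
        for key, child in node.items():
            yield from long_strings(child, path + (key,), min_len)


# Made-up sample body standing in for a dumped response.
sample = ")]}'\n" + json.dumps([[None, ["Jane Doe", 5, "x" * 60]], "short"])
data = load_response(sample)
hits = list(long_strings(data))
for path, value in hits:
    print(path, value[:40])
```

To run it on a real dump, read the file with `open("debug_api_dump/response_0_body.txt", encoding="utf-8").read()` and pass the contents to `load_response`. Paths that repeat with a changing first index are good candidates for the review text position.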

🎯 Browser Console Verification

Open the browser console (F12) while scraping. You should see:

[API Interceptor] ✅ Injected successfully! Monitoring network requests...
[API Interceptor] XHR: /maps/rpc/listugcposts?authuser=0&hl=es...
[API Interceptor] ✅ CAPTURED XHR: /maps/rpc/listugcposts... Size: 68426
[API Interceptor] Stats: Fetch: 0/0 XHR: 5/20 Queue: 5

This confirms the interceptor is actively capturing API calls.

🐛 Troubleshooting

No Responses Captured

⚠️  API interception was enabled but captured 0 reviews.
Network stats - Fetch: 0/0, XHR: 0/0

Solutions:

  1. Check browser console for [API Interceptor] messages
  2. Verify Google Maps is loading reviews (not empty page)
  3. Try scrolling manually to trigger API calls

Responses Captured But 0 Reviews Parsed

[DEBUG] Retrieved 2 intercepted responses from browser
[DEBUG] Parsed 0 reviews from responses

Solutions:

  1. Check debug_api_dump/ for raw responses
  2. Analyze the response format
  3. Adjust parser patterns in _parse_listugcposts_response()

Python Cache Issues

# Thoroughly clean cache
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
find . -name "*.pyc" -delete
find . -name "*.pyo" -delete

# Restart scraper
python start.py

📊 Monitoring Progress

# Real-time monitoring
tail -f scraper_debug.log | grep -E "(API|captured|Parsed|Merging)"

# Check final results
grep -E "(total unique reviews|API interceptor captured|Merging)" scraper_debug.log
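If you prefer totals over raw grep output, the same log can be summarized in Python. The sketch below assumes the log lines follow the `API interceptor captured N reviews (total unique API: M)` format shown in this guide. The `summarize` helper is hypothetical and not part of the scraper.

```python
import re
from typing import Iterable, Tuple

CAPTURE_RE = re.compile(
    r"API interceptor captured (\d+) reviews \(total unique API: (\d+)\)"
)


def summarize(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (batches seen, reviews captured, last reported unique total)."""
    batches = 0
    captured = 0
    last_total = 0
    for line in lines:
        match = CAPTURE_RE.search(line)
        if match:
            batches += 1
            captured += int(match.group(1))
            last_total = int(match.group(2))
    return batches, captured, last_total


sample_log = [
    "[DEBUG] Retrieved 2 intercepted responses from browser",
    "[INFO] API interceptor captured 15 reviews (total unique API: 15)",
    "[INFO] API interceptor captured 12 reviews (total unique API: 27)",
]
summary = summarize(sample_log)
print(summary)
```

For a real run, pass an open file handle instead of the sample list, for example `summarize(open("scraper_debug.log", encoding="utf-8"))`. A captured sum larger than the last unique total would suggest that duplicate reviews are being intercepted.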

🎉 Success Indicators

When API mode is working optimally, you'll see:

[INFO] API interceptor captured 15 reviews (total unique API: 15)
[INFO] API interceptor captured 12 reviews (total unique API: 27)
[INFO] Merging 244 reviews captured via API interception
[INFO] After merge: 244 total reviews
[INFO] Execution completed in 18.5 seconds  # vs 174 seconds before!
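As a quick sanity check, the figures in the sample log above (244 reviews, 18.5 seconds with API mode, 174 seconds before) can be turned into throughput and speed-up numbers. The values are taken from this guide's example output, not from a new measurement.

```python
reviews = 244
dom_seconds = 174.0   # "before" time from the sample log
api_seconds = 18.5    # API-mode time from the sample log

dom_rate = reviews / dom_seconds      # reviews per second, DOM mode
api_rate = reviews / api_seconds      # reviews per second, API mode
speedup = dom_seconds / api_seconds   # how many times faster

print(f"DOM: {dom_rate:.1f} reviews/sec")   # DOM: 1.4 reviews/sec
print(f"API: {api_rate:.1f} reviews/sec")   # API: 13.2 reviews/sec
print(f"Speed-up: {speedup:.1f}x")          # Speed-up: 9.4x
```

On these sample numbers the gain is roughly 9x, a little under the 10-25x target, so treat the target range as an estimate until you have benchmarked your own runs.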

📁 Key Files

  • modules/api_interceptor.py - Core interceptor logic
  • modules/scraper.py - Integration with main scraper
  • config.yaml - Configuration (enable_api_intercept: true)
  • API_INTERCEPTOR_DEBUG_SUMMARY.md - Detailed technical docs
  • QUICK_START_API_MODE.md - This file

🔮 Next Steps

  1. Test with Debug Mode: LOG_LEVEL=DEBUG python start.py
  2. Verify Capturing: Check browser console for interceptor messages
  3. Analyze Responses: Review debug_api_dump/ if parsing fails
  4. Tune Parser: Adjust patterns based on actual API format
  5. Benchmark: Compare speed vs DOM-only mode
  6. Pure API Mode: Once working, add option to skip DOM entirely

Ready to test! Run LOG_LEVEL=DEBUG python start.py and watch the magic happen! 🚀