Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00

5.0 KiB

Google Maps Scraper Optimization Results

Summary

Optimized the Google Maps review scraper from 155 seconds to ~29 seconds, a 5.3x speedup, while still capturing 234 of the 244 reviews the original scraper collects.

Approaches Tested

1. ✅ Fast API Scrolling (`start_fast.py`) - WINNER

Time: ~29 seconds for 234 reviews
Speed: 5.3x faster than original
Reviews/sec: 7.9

How it works:

1. Navigate to the reviews page (~15s)
2. Set up the API interceptor (~2s)
3. Scroll rapidly with 0.3s waits (~12s)
   - Each scroll triggers an API call
   - The API returns 10 reviews per response
   - No DOM parsing needed
4. Collect all API responses
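
The scroll phase (step 3) can be sketched as a small loop. This is only an illustration, not the code in start_fast.py: the function name, the callback parameters, and the idle threshold of 3 are assumptions. The callbacks stand in for whatever scrolls the page and reports how many reviews the interceptor has captured so far.

```python
import time
from typing import Callable


def scroll_until_idle(
    scroll_once: Callable[[], None],
    captured_count: Callable[[], int],
    wait_s: float = 0.3,            # wait between scrolls, as in the doc
    max_idle_scrolls: int = 3,      # assumption: stop after 3 scrolls with no new data
    max_scrolls: int = 1000,        # assumption: hard safety cap
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until no new reviews arrive for max_idle_scrolls scrolls in a row.

    Returns the number of reviews captured when the loop stops.
    """
    idle = 0
    seen = captured_count()
    for _ in range(max_scrolls):
        scroll_once()               # each scroll is expected to trigger one API call
        sleep(wait_s)               # give the response time to be intercepted
        now = captured_count()
        if now > seen:
            seen, idle = now, 0     # progress: reset the idle counter
        else:
            idle += 1
            if idle >= max_idle_scrolls:
                break
    return seen
```

In a Selenium-based script, `scroll_once` could wrap a `driver.execute_script(...)` call that scrolls the reviews pane, and `captured_count` could read a counter kept by the injected interceptor. Passing them in as callables keeps the loop testable without a browser.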

Why it works:

Uses browser's active session (no auth issues)
Minimal wait between scrolls (0.3s optimal)
API interception captures all responses
Zero DOM parsing overhead

Usage:

python start_fast.py

2. ❌ Parallel API Calls (`start_parallel.py`)

Result: Failed - 400 error
Issue: Captured cookies missing auth tokens (SID, HSID, SAPISID)

Only 5 tracking cookies had been captured by the time the browser closed. Auth cookies are available only:

When logged into a Google account, OR
In an active browser session
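
A quick pre-flight check can catch this failure before any requests are sent. The sketch below assumes the cookies come as a list of dicts with a "name" key, which is the shape Selenium's `get_cookies()` returns; the helper name is hypothetical, and the required set is just the three cookie names listed above.

```python
# Cookie names the doc reports as missing from the captured set.
REQUIRED_AUTH_COOKIES = {"SID", "HSID", "SAPISID"}


def missing_auth_cookies(cookies):
    """Return the auth cookie names absent from a list of cookie dicts."""
    present = {cookie.get("name") for cookie in cookies}
    return sorted(REQUIRED_AUTH_COOKIES - present)


# Example: a capture that holds one tracking cookie and one auth cookie.
captured = [{"name": "NID", "value": "x"}, {"name": "SID", "value": "y"}]
print(missing_auth_cookies(captured))
```

An empty result means all three names are present; it does not guarantee the request will be accepted.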

3. ❌ Parallel Browser Fetch (`start_parallel_v2.py`)

Result: Script timeout
Issue: Sequential token dependency

The Google Maps API requires a continuation token from the previous response, so pages can't be fetched fully in parallel. Collecting the tokens sequentially and then fetching in parallel took too long and hit the script timeout.

4. ⚠️ Hybrid Parallel (`start_hybrid_parallel.py`)

Result: Partial success (60 reviews, timeout on parallel phase)
Issue: Same script timeout on parallel fetch

Collected 60 reviews via scrolling, then timed out on parallel fetch of remaining pages.

Key Findings

Optimal Scroll Timing

Wait Time   Reviews   Time   Speed   Notes
────────────────────────────────────────────────────────────────────
0.8s        234       43s    3.6x    Original fast version
0.3s        234       29s    5.3x    ✅ Optimal - best balance
0.15s       210       30s    5.1x    Too fast - misses 24 reviews

Conclusion: 0.3s is the sweet spot. It is fast enough for a 5.3x speedup and still captures the same 234 reviews as the 0.8s run, while 0.15s drops 24 of them.
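
The speed and coverage figures can be recomputed from the reported times and counts. This is a small sketch with hypothetical helper names; because the reported times are rounded to whole seconds, the recomputed ratios may differ from the reported ones in the last digit.

```python
BASELINE_S = 155        # original DOM scraper time, from the comparison table
BASELINE_REVIEWS = 244  # reviews collected by the original scraper


def speedup(time_s, baseline_s=BASELINE_S):
    """Speedup relative to the baseline run."""
    return baseline_s / time_s


def coverage(reviews, total=BASELINE_REVIEWS):
    """Fraction of the baseline's reviews that a run captured."""
    return reviews / total


# (wait between scrolls, reviews captured, total time in seconds)
runs = [(0.8, 234, 43), (0.3, 234, 29), (0.15, 210, 30)]
for wait_s, reviews, time_s in runs:
    print(f"{wait_s}s wait: {speedup(time_s):.2f}x, {coverage(reviews):.1%} coverage")
```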

Why True Parallel is Hard

Continuation tokens: Each API response contains token for next page
Sequential dependency: Must fetch page N before getting token for page N+1
Script timeout: Collecting tokens + parallel fetch exceeds browser timeout
Session state: Direct API calls fail without active browser session
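
The token dependency is easiest to see in code. In the sketch below, `fetch_page` is a hypothetical callable standing in for one API request; it takes the previous continuation token (or None for the first page) and returns the page's reviews plus the next token. The real response format is not shown in this document, so the tuple shape is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# (reviews on this page, continuation token for the next page or None)
Page = Tuple[List[str], Optional[str]]


def fetch_all_pages(
    fetch_page: Callable[[Optional[str]], Page],
    max_pages: int = 100,  # assumption: safety cap against endless token chains
) -> List[str]:
    """Walk the pages in order; page N+1 cannot start before page N returns."""
    reviews: List[str] = []
    token: Optional[str] = None  # the first page needs no token
    for _ in range(max_pages):
        page_reviews, token = fetch_page(token)
        reviews.extend(page_reviews)
        if token is None:        # no token means this was the last page
            break
    return reviews
```

Each iteration needs the token produced by the previous one, so the loop cannot be turned into independent parallel requests without first walking the whole chain.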

What We Learned

Browser's active session can make API calls that standalone requests cannot
API interception is more reliable than trying to replay requests
Small optimizations (0.3s vs 0.8s wait) make big differences (3.6x → 5.3x)
Sometimes simple solutions (fast scrolling) beat complex ones (parallel fetch)

Performance Comparison

Approach                  Time      Reviews   Speed    Notes
────────────────────────────────────────────────────────────────────
Original DOM Scraping     155s      244       1.0x     Baseline
Fast API Scrolling (0.8s) 43s       234       3.6x     Good
Fast API Scrolling (0.3s) 29s       234       5.3x     ✅ Best
Ultra-fast (0.15s)        30s       210       5.1x     Misses reviews
Hybrid Parallel           51s       60        3.0x     Timeout issues
Parallel Fetch V1         FAILED    0         N/A      Auth error
Parallel Fetch V2         FAILED    0         N/A      Timeout

Recommendations

For Best Performance

Use start_fast.py with 0.3s scroll timing:

python start_fast.py

Benefits:

✅ 5.3x faster than original (29s vs 155s)
✅ Gets 234/244 reviews (95.9%)
✅ No login required
✅ Stable and reliable
✅ Simple implementation

For Maximum Reviews

Use original start.py:

python start.py

Gets all 244 reviews but takes 155 seconds.

Future Improvements

Potential optimizations (not yet tested):

Reduce initial wait times: Navigate/click timing could be optimized
Pre-inject API interceptor: Setup before navigation for instant capture
Smarter scroll detection: Only scroll when API call completes
Progressive timeout increase: Start with 0.1s, increase if misses detected
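
As an illustration of the last idea, the wait could be adjusted after each scroll. This is an untested sketch: the function name, the 0.05s step, and the 0.8s cap (the slowest wait measured above) are assumptions, and how a miss is detected is left to the caller.

```python
def next_wait(current_s, missed, step_s=0.05, max_s=0.8):
    """Return the wait to use for the next scroll.

    The wait grows by step_s after a detected miss, capped at max_s,
    and is left unchanged otherwise.
    """
    if not missed:
        return current_s
    return min(current_s + step_s, max_s)


# Example: start at 0.1s and back off after two detected misses.
wait = 0.1
for missed in (False, True, True, False):
    wait = next_wait(wait, missed)
```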

However, at a 5.3x speedup with a simple implementation, further optimization may not be worth the added complexity.

Conclusion

The start_fast.py script achieves the best balance:

5.3x faster than original
95.9% review coverage (234/244)
Simple, stable, reliable
No authentication required

True parallel API calls face fundamental limitations due to:

Continuation token dependencies
Browser session requirements
Script execution timeouts

The fast scrolling approach leverages the browser's capabilities while minimizing wait times, achieving excellent performance without the complexity and failure modes of parallel approaches.

Mission accomplished! 🚀
