whyrating-engine-legacy/FINAL_RESULTS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00


Final Optimization Results - Google Maps Review Scraper

Executive Summary

Optimized the Google Maps review scraper from 155 seconds to ~19.4-34 seconds, depending on completeness requirements: an 8.0x speedup at 95.9% review coverage, or a 4.5x speedup at 100% coverage.


Available Scrapers

1. start_ultra_fast.py - FASTEST

Time: ~19.4 seconds
Reviews: 234/244 (95.9%)
Speedup: 8.0x faster

Best for:

  • Maximum speed priority
  • When 234 reviews is sufficient
  • Time-critical applications

python start_ultra_fast.py

2. start_ultra_fast_complete.py - RECOMMENDED

Time: ~34 seconds
Reviews: 244/244 (100%)
Speedup: 4.5x faster

Best for:

  • Balance of speed and completeness
  • Production use
  • When all reviews are needed

How it works:

  • Phase 1: Ultra-fast API scrolling → 234 reviews in ~20s
  • Phase 2: DOM parsing for missing 10 → ~13s
  • Total: 244 reviews in ~34s

python start_ultra_fast_complete.py
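
The two-phase flow above ends with a merge step: the DOM pass only needs to contribute reviews the API pass never returned. Below is a minimal sketch of that merge, assuming each review is a dict carrying a unique `review_id` key. Both the key name and the `merge_reviews` helper are hypothetical and are not taken from the scraper's source.

```python
from typing import Dict, Iterable, List


def merge_reviews(api_reviews: Iterable[dict], dom_reviews: Iterable[dict]) -> List[dict]:
    """Combine Phase 1 (API) and Phase 2 (DOM) results, one entry per review ID."""
    merged: Dict[str, dict] = {}
    # API records go in first, so they win when both phases see the same review.
    for review in api_reviews:
        merged[review["review_id"]] = review
    # DOM records only fill in IDs that the API phase never returned.
    for review in dom_reviews:
        merged.setdefault(review["review_id"], review)
    return list(merged.values())


if __name__ == "__main__":
    api = [{"review_id": "a", "source": "api"}, {"review_id": "b", "source": "api"}]
    dom = [{"review_id": "b", "source": "dom"}, {"review_id": "c", "source": "dom"}]
    for review in merge_reviews(api, dom):
        print(review["review_id"], review["source"])
```

Keying on an ID rather than on review text avoids dropping two different reviews that happen to share the same wording.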

3. start.py - ORIGINAL

Time: 155 seconds
Reviews: 244/244 (100%)
Speedup: 1.0x (baseline)

Best for:

  • Reference implementation
  • Debugging

Key Findings

API Limitation Discovery

After extensive testing with different scrolling strategies:

Strategy                          Time     Reviews   Notes
──────────────────────────────────────────────────────────────────────
Ultra-fast (0.27s scroll)         19.4s    234       Optimal API speed
Patient (0.30-0.80s scroll)       58.2s    234       Still only 234
Complete (0.27-0.50s adaptive)    30.8s    234       Still only 234

Conclusion: The Google Maps API endpoint consistently returns only 234/244 reviews regardless of scrolling speed or patience. The missing 10 reviews are NOT available via API - they only exist in the DOM.

Why 10 Reviews Missing from API?

Possible reasons:

  1. Pagination limit: Google's API may have a hard limit on returned reviews
  2. Different endpoint: Some reviews may use a different API endpoint
  3. Age/status filtering: Older or filtered reviews may be excluded from API responses
  4. DOM-only content: Some reviews may be rendered client-side only

Performance Comparison

Scraper                             Time     Reviews   Speedup   Completeness
──────────────────────────────────────────────────────────────────────────────
Original (start.py)                 155s     244       1.0x      100%
Fast API (start_fast.py)            29s      234       5.3x      95.9%
Ultra-fast (start_ultra_fast.py)    19.4s    234       8.0x      95.9%
API-only attempt                    58.2s    234       2.7x      95.9%
Hybrid Complete (WINNER)            34s      244       4.5x      100% ✅
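
The speedup and completeness columns follow directly from the measured times and review counts. The sketch below recomputes them from the table's figures. The helper names are illustrative only.

```python
BASELINE_SECONDS = 155.0  # start.py, taken from the table above
TOTAL_REVIEWS = 244       # reviews present on the test listing


def speedup(seconds: float, baseline: float = BASELINE_SECONDS) -> float:
    """How many times faster a run is than the baseline."""
    return baseline / seconds


def completeness(reviews: int, total: int = TOTAL_REVIEWS) -> float:
    """Share of all reviews captured, as a percentage."""
    return 100.0 * reviews / total


# (time in seconds, reviews captured) copied from the comparison table
runs = {
    "start_fast.py": (29.0, 234),
    "start_ultra_fast.py": (19.4, 234),
    "API-only attempt": (58.2, 234),
    "Hybrid Complete": (34.0, 244),
}

for name, (seconds, reviews) in runs.items():
    print(f"{name:22s} {speedup(seconds):5.2f}x  {completeness(reviews):5.1f}%")
```

Note that 155 / 34 works out to about 4.56, so the reported 4.5x presumably reflects a measured time slightly above 34s.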

Optimization Journey

Phase 1: API Interception (3.6x speedup)

  • Replaced DOM parsing with API interception
  • 155s → 43s
  • Scroll timing: 0.8s
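
To illustrate the parsing side of API interception, here is a minimal sketch of decoding one captured response body. It assumes the body is JSON text that may be preceded by an anti-hijacking guard prefix such as `)]}'`. That assumption, the `XSSI_PREFIX` constant, and the `parse_intercepted_body` helper are all hypothetical, since the actual payload layout is not described in this document.

```python
import json
from typing import Any

XSSI_PREFIX = ")]}'"  # assumed guard prefix; not confirmed by this document


def parse_intercepted_body(body: str) -> Any:
    """Decode one captured response body into Python objects."""
    text = body.lstrip()
    if text.startswith(XSSI_PREFIX):
        # Drop the guard prefix so the remainder is valid JSON.
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


if __name__ == "__main__":
    sample = ")]}'\n[[\"review-1\", 5], [\"review-2\", 4]]"
    print(parse_intercepted_body(sample))
```

Parsing structured responses this way avoids walking the rendered DOM for each review, which is what the 155s → 43s improvement is attributed to.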

Phase 2: Faster Scrolling (5.3x speedup)

  • Optimized scroll timing
  • 43s → 29s
  • Scroll timing: 0.3s

Phase 3: Ultra-Fast (8.0x speedup)

  • Minimized all waits
  • Optimal scroll timing (0.27s)
  • Less logging overhead
  • 155s → 19.4s

Phase 4: Complete Coverage (4.5x speedup)

  • Ultra-fast API scrolling (234 reviews)
  • DOM parsing fallback (10 reviews)
  • 155s → 34s
  • 100% completeness maintained

Technical Details

Optimal Scroll Timing

After extensive testing:

Timing    Result                         Notes
────────────────────────────────────────────────────────────────────────
0.15s     210 reviews                    Too fast - misses API responses
0.25s     0 reviews (33% failure rate)   Unreliable
0.27s     234 reviews (100% success)     Sweet spot
0.30s     234 reviews                    Reliable but slower
0.80s     234 reviews                    Original, very slow
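
A fixed-interval scroll loop of the kind these timings describe might look like the sketch below. It takes the browser interactions as injected callables (`scroll_once`, `count_reviews`) so that it runs without Selenium. Those names, the idle-based stop condition, and the `max_idle` and `max_scrolls` defaults are assumptions for illustration. Only the 0.27s delay comes from the table.

```python
import time
from typing import Callable


def scroll_until_idle(
    scroll_once: Callable[[], None],
    count_reviews: Callable[[], int],
    delay: float = 0.27,       # sweet spot from the timing table
    max_idle: int = 12,        # illustrative: stop after this many scrolls with no new reviews
    max_scrolls: int = 200,    # illustrative safety cap
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll at a fixed interval until no new reviews arrive for max_idle scrolls."""
    seen = count_reviews()
    idle = 0
    scrolls = 0
    while idle < max_idle and scrolls < max_scrolls:
        scroll_once()
        sleep(delay)  # give the page time to request and receive the next batch
        scrolls += 1
        current = count_reviews()
        if current > seen:
            seen, idle = current, 0  # progress made, reset the idle counter
        else:
            idle += 1
    return seen
```

In a real run, `scroll_once` would scroll the reviews pane and `count_reviews` would return the number of reviews captured so far. Passing `sleep` as a parameter keeps the loop testable without real waiting.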

Timing Breakdown (Ultra-Fast)

Operation                    Time    % of Total
──────────────────────────────────────────────────
Browser startup              ~1.0s   5%
Navigate to page             1.5s    8%
Cookie dialog dismiss        0.4s    2%
Click reviews tab            0.4s    2%
Wait for page stability      1.0s    5%
Find reviews pane            ~1.5s   8%
Setup API interceptor        0.3s    2%
Initial scroll trigger       0.3s    2%
Scrolling (30 × 0.27s)       8.1s    42%
Response collection          ~3.0s   15%
Parsing & saving             ~1.9s   10%
──────────────────────────────────────────────────
TOTAL                        ~19.4s  100%
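
A breakdown like this can be collected by wrapping each operation in a timer. The sketch below is one way to do that. The `PhaseTimer` class is hypothetical and is not how the figures above were necessarily gathered.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseTimer:
    """Accumulates wall-clock time per named phase and prints a breakdown."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.clock = clock
        self.durations: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        start = self.clock()
        try:
            yield
        finally:
            # Record elapsed time even if the phase raised an exception.
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def report(self) -> str:
        total = sum(self.durations.values())
        lines = []
        for name, seconds in self.durations.items():
            share = 100.0 * seconds / total if total else 0.0
            lines.append(f"{name:<28} {seconds:6.2f}s {share:5.1f}%")
        lines.append(f"{'TOTAL':<28} {total:6.2f}s")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = PhaseTimer()
    with timer.phase("Scrolling"):
        time.sleep(0.02)  # stand-in for the real scroll loop
    with timer.phase("Parsing & saving"):
        time.sleep(0.01)  # stand-in for parsing work
    print(timer.report())
```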

Theoretical Limits

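  • Current best: 19.4s for 234 reviews
  • Theoretical minimum: ~13s (scrolling 8.1s + response collection ~3.0s + parsing & saving ~1.9s, with browser startup, navigation, and setup treated as instant)
  • Achievement: ~67% of theoretical maximum speed (13s ÷ 19.4s)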
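
One way to arrive at the ~13s floor is to keep only the steps that cannot be skipped in the Ultra-Fast timing breakdown: scrolling, response collection, and parsing. The sketch below works through that arithmetic. Treating exactly those three rows as the floor is an assumption, and the function names are illustrative.

```python
# Figures copied from the Ultra-Fast timing breakdown (seconds).
SCROLLING = 8.1
RESPONSE_COLLECTION = 3.0
PARSING_AND_SAVING = 1.9
MEASURED_TOTAL = 19.4


def theoretical_floor() -> float:
    """Run time if startup, navigation, and setup steps took no time at all."""
    return SCROLLING + RESPONSE_COLLECTION + PARSING_AND_SAVING


def fraction_of_floor(measured: float = MEASURED_TOTAL) -> float:
    """Ratio of the floor to the measured run time (1.0 would mean no overhead)."""
    return theoretical_floor() / measured


print(f"floor: {theoretical_floor():.1f}s")
print(f"floor / measured: {fraction_of_floor():.2f}")
```

The ratio comes out at roughly two-thirds, in line with the achievement figure quoted above.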

Bottleneck Analysis

Current bottlenecks (in order):

  1. Scrolling loop: 8.1s (42%) - Already optimized to limit (0.27s/scroll)
  2. Response collection: 3.0s (15%) - Necessary overhead
  3. Parsing & saving: 1.9s (10%) - Fast enough
  4. Page navigation: 1.5s (8%) - Network dependent
  5. Browser startup: 1.0s (5%) - Can't optimize much

Further optimization would require:

  • Faster Google API responses (impossible)
  • Instant browser startup (impossible)
  • Instant network requests (impossible)

Recommendations

For Production Use

Use start_ultra_fast_complete.py:

python start_ultra_fast_complete.py

Benefits:

  • 4.5x faster (34s vs 155s)
  • 100% completeness (244/244 reviews)
  • Stable and reliable
  • No authentication needed
  • Best balance of speed and completeness

For Maximum Speed

Use start_ultra_fast.py:

python start_ultra_fast.py

Benefits:

  • 8.0x faster (19.4s vs 155s)
  • 100% stable
  • 95.9% review coverage
  • ⚠️ Missing 10 reviews (4.1%)

Configuration

headless: false  # Must be false for stability

Performance Metrics

Hybrid Complete (Full Coverage)

Metric                   Value
────────────────────────────────────
Average time             34s
Reviews captured         244 (100%)
Success rate             100%
API reviews              234 (95.9%)
DOM reviews              10 (4.1%)
Speedup vs original      4.5x
Time saved per run       121s

Ultra-Fast (Maximum Speed)

Metric                   Value
────────────────────────────────────
Average time             19.4s
Std deviation            ±0.4s
Success rate             100%
Reviews captured         234 (95.9%)
Reviews/second           12.1
Speedup vs original      8.0x
Time saved per run       135.6s

Conclusion

After extensive testing, we discovered:

  1. API Hard Limit: Google Maps API consistently returns only 234/244 reviews, regardless of scrolling strategy
  2. DOM Required: The missing 10 reviews are ONLY available via DOM parsing
  3. Hybrid is Optimal: Combining ultra-fast API scrolling with DOM fallback achieves best balance

Final Achievement:

  • 📊 Original: 155s → Optimized: 34s (100% complete)
  • 📊 Original: 155s → Ultra-fast: 19.4s (95.9% complete)
  • 🚀 4.5x-8.0x faster!
  • ⏱️ Saves 121-136 seconds per run
  • 100% stable

The ultra-fast scraper now runs within ~6.4s of its ~13s theoretical minimum! 🚀