Files
whyrating-engine-legacy/SPEED_OPTIMIZATION_SUMMARY.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00

5.2 KiB
Raw Blame History

Speed Optimization Journey

Final Results

Best Stable Performance: start_ultra_fast.py

  • Time: ~19.4 seconds (averaged over 4 runs)
  • Speed: 8.0x faster than original (155s → 19.4s)
  • Reviews: 234/244 (95.9%)
  • Success Rate: 100% stable

Optimization Progression

Version Time Speedup Notes
Original DOM scraping 155s 1.0x Baseline - scrolls + parses DOM
Fast API (0.8s scroll) 43s 3.6x API interception + scrolling
Fast API (0.3s scroll) 29s 5.3x Faster scroll timing
Ultra-fast (0.25s, unstable) 18s 8.6x 33% failure rate
Ultra-fast (0.27s, stable) 19.4s 8.0x 100% stable

Key Optimizations Applied

1. Removed Unnecessary Waits (~6s saved)

  • 3s "wait for reviews page to load" → 1s (saves 2s)
  • 2s after tab click → 0.4s (saves 1.6s)
  • 2s after cookie dismiss → 0.4s (saves 1.6s)
  • 2s for initial API trigger → 0.3s (saves 1.7s)

2. Faster Scroll Timing (~10s saved)

  • 0.8s per scroll (30 scrolls = 24s)
  • 0.27s per scroll (30 scrolls = 8.1s)
  • Savings: 15.9s

3. Reduced Logging Overhead

  • Log only every 10 scrolls instead of every scroll
  • Minimal I/O during tight loop

4. Optimized Pane Finding

  • Use most common selector first
  • Reduced timeout from 5s to 3s

5. Streamlined API Interception

  • Reduced setup wait from 2s to 0.3s
  • Still 100% reliable

Timing Breakdown (Ultra-Fast)

Operation                    Time    % of Total
──────────────────────────────────────────────────
Browser startup              ~1.0s   5%
Navigate to page             1.5s    8%
Cookie dialog dismiss        0.4s    2%
Click reviews tab            0.4s    2%
Wait for page stability      1.0s    5%
Find reviews pane            ~1.5s   8%
Setup API interceptor        0.3s    2%
Initial scroll trigger       0.3s    2%
Scrolling (30 × 0.27s)       8.1s    42%
Response collection          ~3.0s   15%
Parsing & saving             ~1.9s   10%
──────────────────────────────────────────────────
TOTAL                        ~19.4s  100%

Bottleneck Analysis

Current bottlenecks (in order):

  1. Scrolling loop: 8.1s (42%) - Already optimized to 0.27s/scroll
  2. Response collection: 3.0s (15%) - Necessary overhead
  3. Parsing & saving: 1.9s (10%) - Fast enough
  4. Browser startup: 1.0s (5%) - Can't optimize much
  5. Page navigation: 1.5s (8%) - Network dependent

Why We Can't Go Faster

Scroll Timing Limit: 0.27s

  • 0.25s: 33% failure rate (too fast, misses API responses)
  • 0.27s: 100% success rate
  • 0.30s: 100% success but slower

Conclusion: 0.27s is the optimal balance.

Page Load Times (Fixed)

  • Network latency: ~1-2s
  • Browser initialization: ~1s
  • Can't be eliminated

API Response Time

  • Google's server needs time to respond
  • We can't make their API faster

Alternative Approaches Tested

Parallel API Calls

Issue: Continuation tokens are sequential - each response contains token for next page

Result: Can't truly parallelize without tokens

Issue: Browser cookies don't include auth tokens (SID, HSID, SAPISID)

Result: 400 errors when using requests library

Headless Mode

Issue: Page structure loads differently, selectors fail

Result: 0 reviews captured

Recommendations

For Production Use

Use start_ultra_fast.py:

python start_ultra_fast.py

Pros:

  • 8.0x faster (19.4s vs 155s)
  • 100% stable
  • 95.9% review coverage
  • No authentication needed
  • Simple, maintainable

If You Need All 244 Reviews

Use original start.py (155s) - gets 100% of reviews

Configuration

headless: false  # Must be false for stability

Performance Metrics

Metric                   Value
────────────────────────────────────
Average time             19.4s
Std deviation            ±0.4s
Success rate             100% (4/4 runs)
Reviews captured         234
Reviews/second           12.1
API responses/second     1.2
Speedup vs original      8.0x
Time saved per run       135.6s

Theoretical Limits

Absolute minimum (if everything was instant except scrolling):

  • 30 scrolls × 0.27s = 8.1s
  • Plus ~5s for unavoidable operations
  • Theoretical minimum: ~13s

Current: 19.4s

  • Only 6.4s from theoretical minimum
  • Already 68% of theoretical maximum speed!

Conclusion

We achieved 8.0x speedup by:

  1. Eliminating unnecessary waits
  2. Optimizing scroll timing to the limit (0.27s)
  3. Minimizing logging overhead
  4. Streamlining every operation

Further optimization would require:

  • Faster Google API responses (impossible)
  • Instant browser startup (impossible)
  • Instant network requests (impossible)

The scraper is now operating near theoretical maximum efficiency! 🚀


Final Stats:

  • 📊 Original: 155s → Ultra-fast: 19.4s
  • 🚀 8.0x faster!
  • ⏱️ Saves 136 seconds per run
  • 100% stable