whyrating-engine-legacy/PARALLEL_OPTIMIZATION_RESULTS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00


Parallel Optimization Results

Question: Can we do scrolling and DOM parsing in parallel?

TL;DR: No, sequential is faster. DOM parsing during scrolling adds too much overhead.


Approaches Tested

1. Full Parallel Hybrid (start_parallel_hybrid.py)

Strategy: Parse DOM every 5 scrolls while collecting API responses

Results:

  • Time: 76-103 seconds
  • Reviews: 244/244
  • Verdict: 2.3-3.2x slower than sequential (76-103s vs 32.4s)

Why it failed: DOM parsing is heavyweight. Even parsing every 5 scrolls adds 50-80 seconds of overhead to the scroll loop.


2. Optimized Parallel (start_parallel_hybrid.py v2)

Strategy: Only parse DOM in last 10 scrolls when near 234 reviews

Results:

  • Time: 76 seconds
  • Reviews: 244/244
  • Verdict: Still about 2.3x slower than sequential (76s vs 32.4s)

Why it failed: DOM parsing at any point during scrolling slows down the critical scroll loop.


3. Minimal Overhead Parallel (start_optimized_hybrid.py)

Strategy: Keep scroll loop completely clean, only parse DOM at very end

Results:

  • Time: N/A (run failed)
  • Reviews: 0/244
  • Verdict: FAILED - page not ready, 0 reviews captured

Why it failed: Timing instability. The page was not ready when the single DOM pass ran, and getting the initialization timing exactly right proved difficult.


4. WINNER: Sequential Hybrid (start_ultra_fast_complete.py)

Strategy:

  1. Phase 1: Ultra-fast API scrolling (no DOM parsing)
  2. Phase 2: Targeted DOM parsing for the 10 missing reviews

Results:

  • Time: 32.4 seconds
  • Reviews: 244/244 (100%)
  • Speedup: 4.8x faster than original
  • Stability: 100% reliable

Why it works:

  • API scrolling is fastest when uninterrupted (19.5s)
  • DOM parsing is most efficient on fully loaded page (12.9s)
  • Clean separation = predictable, stable performance
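
The two phases above can be sketched as follows. This is a minimal illustration, not the contents of start_ultra_fast_complete.py: the function name `run_sequential_hybrid`, the `Review` shape, and the two stand-in callables are assumptions made for the example. The only point it shows is the ordering: API collection runs to completion first, and the DOM pass runs once afterwards, only to fill gaps.

```python
from typing import Callable, Dict, Iterable, List

Review = Dict[str, str]  # hypothetical shape: {"id": ..., "text": ...}


def run_sequential_hybrid(
    collect_api_reviews: Callable[[], Iterable[Review]],
    parse_dom_reviews: Callable[[], Iterable[Review]],
    expected_total: int,
) -> List[Review]:
    """Phase 1: uninterrupted API collection. Phase 2: DOM pass only if needed."""
    # Phase 1: scroll and collect API responses; no DOM queries happen here.
    by_id: Dict[str, Review] = {r["id"]: r for r in collect_api_reviews()}

    # Phase 2: runs once, after scrolling has finished and the page is stable.
    if len(by_id) < expected_total:
        for review in parse_dom_reviews():
            # Keep the API copy when both sources saw the same review.
            by_id.setdefault(review["id"], review)

    return list(by_id.values())


# Stand-ins for the real phases (assumed data): the API pass misses two
# reviews, and the DOM pass over the top of the list recovers them.
def fake_api() -> List[Review]:
    return [{"id": f"r{i}", "text": "api"} for i in range(2, 10)]


def fake_dom() -> List[Review]:
    return [{"id": f"r{i}", "text": "dom"} for i in range(0, 4)]


reviews = run_sequential_hybrid(fake_api, fake_dom, expected_total=10)
print(len(reviews))
```

Deduplicating by review ID keeps the merge cheap and means the DOM pass can safely overlap with reviews the API already returned.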

Performance Comparison

Approach                          Time      Speedup    Reviews   Status
────────────────────────────────────────────────────────────────────────────
Original DOM Scraping             155s      1.0x       244       Baseline
Ultra-Fast API Only               19.4s     8.0x       234       Fast but incomplete
Sequential Hybrid (WINNER)        32.4s     4.8x       244       ✅ Best balance
Parallel Hybrid (every 5 scrolls) 103s      1.5x       244       Too slow
Parallel Hybrid (last 10 scrolls) 76s       2.0x       244       Still slow
Minimal Overhead Parallel         FAILED    N/A        0         Unstable
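
The speedup column is the 155s baseline divided by each run's time. A short check, using only the timings from the table (the helper names `speedup` and `slowdown_vs_sequential` are made up for this example):

```python
BASELINE_S = 155.0    # Original DOM Scraping
SEQUENTIAL_S = 32.4   # Sequential Hybrid


def speedup(seconds: float, baseline: float = BASELINE_S) -> float:
    """How many times faster a run is than the baseline."""
    return round(baseline / seconds, 1)


def slowdown_vs_sequential(seconds: float) -> float:
    """How many times slower a run is than the sequential hybrid."""
    return round(seconds / SEQUENTIAL_S, 1)


timings = {
    "Ultra-Fast API Only": 19.4,
    "Sequential Hybrid": 32.4,
    "Parallel Hybrid (every 5 scrolls)": 103.0,
    "Parallel Hybrid (last 10 scrolls)": 76.0,
}

for name, seconds in timings.items():
    print(f"{name}: {speedup(seconds)}x vs baseline, "
          f"{slowdown_vs_sequential(seconds)}x the sequential time")
```

Measured against the sequential hybrid rather than the baseline, the two parallel runs come out at roughly 3.2x and 2.3x its time.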

Key Findings

Why Parallel Doesn't Help

  1. DOM Parsing is Heavy

    • Finding elements: ~100-200ms per query
    • Parsing each element: ~10-50ms
    • Total overhead: 50-80 seconds when done during scrolling
  2. Scroll Loop is Time-Critical

    • Optimal scroll timing: 0.27 seconds
    • API response collection: ~30-50ms
    • Adding DOM parsing: +100-200ms per element query, several times the API collection cost on every scroll where it runs
  3. Page State Matters

    • During scrolling: Elements constantly changing (stale references)
    • After scrolling: Stable DOM, faster parsing

Why Sequential Wins

  1. Clean Scroll Loop

    • Only API collection (fast)
    • No element queries during critical path
    • Predictable timing
  2. Efficient DOM Parsing

    • Parse on stable page (no stale elements)
    • Only parse top 15-20 reviews (missing ones are at top)
    • Batch operation is faster than incremental
  3. Simple = Stable

    • Two clear phases, easy to debug
    • No complex synchronization
    • Consistent results

Theoretical Analysis

Time Breakdown

Sequential Approach:

Phase 1: API Scrolling
  - 35 scrolls × 0.27s = 9.5s
  - API collection overhead = 10.0s
  - Total Phase 1 = 19.5s

Phase 2: DOM Parsing
  - Scroll to top = 0.5s
  - Find elements = 0.8s
  - Parse 15 elements = 11.6s
  - Total Phase 2 = 12.9s

TOTAL = 32.4s

Parallel Approach (every 5 scrolls):

Combined Scrolling + DOM:
  - 40 scrolls with DOM parsing
  - Per scroll (DOM cost averaged over all scrolls): 0.27s scroll + 2.0s DOM = 2.27s
  - Total = 90.8s (plus overhead)

TOTAL = ~103s

Parallel Approach (last 10 scrolls):

Phase 1: Fast scrolling (30 scrolls)
  - 30 × 0.27s = 8.1s

Phase 2: Slow scrolling with DOM (10 scrolls)
  - 10 × (0.27s + 6.5s) = 67.7s

TOTAL = 75.8s
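
The three breakdowns above reduce to a small cost model. The sketch below reproduces the arithmetic using the figures from this section; the function names and parameter defaults are chosen for illustration only. The sequential total comes out about 0.05s under 32.4s because the breakdown rounds 35 × 0.27s = 9.45s up to 9.5s.

```python
SCROLL_S = 0.27  # per-scroll time from the breakdown


def sequential_total(scrolls: int = 35, api_overhead: float = 10.0,
                     scroll_to_top: float = 0.5, find: float = 0.8,
                     parse: float = 11.6) -> float:
    """Phase 1 (clean scrolling) plus Phase 2 (one DOM pass at the end)."""
    phase1 = scrolls * SCROLL_S + api_overhead
    phase2 = scroll_to_top + find + parse
    return phase1 + phase2


def parallel_total(fast_scrolls: int, slow_scrolls: int,
                   dom_per_scroll: float) -> float:
    """Scrolls without DOM work plus scrolls that also pay a DOM cost."""
    return (fast_scrolls * SCROLL_S
            + slow_scrolls * (SCROLL_S + dom_per_scroll))


seq = sequential_total()
every_5 = parallel_total(fast_scrolls=0, slow_scrolls=40, dom_per_scroll=2.0)
last_10 = parallel_total(fast_scrolls=30, slow_scrolls=10, dom_per_scroll=6.5)

print(f"sequential:      {seq:.2f}s")
print(f"every 5 scrolls: {every_5:.2f}s (before extra overhead)")
print(f"last 10 scrolls: {last_10:.2f}s")
```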

Why DOM is So Slow During Scrolling

  1. Stale Element References: Elements change as page scrolls, requiring re-queries
  2. Layout Thrashing: DOM queries force layout recalculation
  3. Concurrent Modifications: Page is updating while we're reading
  4. No Batch Optimization: Can't batch when elements keep changing
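
To illustrate the first point, here is a simulated sketch of the re-query pattern that stale references force. It does not use Selenium: `StaleElementError` is a stand-in for Selenium's `StaleElementReferenceException`, and `find`, `read`, and `read_with_requery` are hypothetical names. What it shows is that each stale read costs another full element query.

```python
class StaleElementError(Exception):
    """Stand-in for Selenium's StaleElementReferenceException."""


def read_with_requery(find, read, max_attempts=3):
    """Call read(find()), re-querying when the reference has gone stale.

    Returns (value, number_of_queries_issued).
    """
    for attempt in range(1, max_attempts + 1):
        element = find()  # every retry pays for a fresh DOM query
        try:
            return read(element), attempt
        except StaleElementError:
            if attempt == max_attempts:
                raise


# Simulated page: the first two reads hit a node replaced mid-scroll.
state = {"reads": 0}


def find():
    return "review-node"


def read(element):
    state["reads"] += 1
    if state["reads"] <= 2:
        raise StaleElementError(element)
    return "5 stars"


value, queries = read_with_requery(find, read)
print(value, queries)
```

At the ~100-200ms per query noted earlier, three queries for a single element already exceed the 0.27s scroll interval.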

Conclusion

Sequential is 2.3-3.2x faster than the parallel variants that completed (32.4s vs 76-103s) for this use case.

Recommended Solution: start_ultra_fast_complete.py

python start_ultra_fast_complete.py

Benefits:

  • 4.8x faster than original (32.4s vs 155s)
  • 100% completeness (244/244 reviews)
  • 100% stable and reliable
  • Simple, maintainable code
  • Saves about 123 seconds per run (155s → 32.4s)

Why not ultra-fast API-only (8.0x)?

  • Missing 10 reviews (4.1%)
  • Only 13 seconds slower to get 100% completeness
  • Worth the trade-off for most use cases

Lessons Learned

  1. "Parallel" doesn't always mean faster - overhead matters
  2. Keep critical loops clean - don't add slow operations to tight loops
  3. Stable state = faster operations - parse DOM when it's not changing
  4. Simple often wins - clear phases beat complex synchronization
  5. Measure, don't assume - testing showed that sequential is faster
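
For the last lesson, per-phase timings like those reported in this document can be gathered with a small timer. This is a generic sketch using only the standard library; the `timed` helper and the phase names are assumptions, and the `time.sleep` calls are placeholders for the real scroll loop and DOM pass.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(phase):
    """Record the wall-clock duration of a phase under its name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start


with timed("api_scroll"):
    time.sleep(0.02)  # placeholder for the API scroll loop

with timed("dom_parse"):
    time.sleep(0.01)  # placeholder for the final DOM pass

for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.3f}s")
print(f"total: {sum(timings.values()):.3f}s")
```

Timing each phase separately is what makes it possible to see whether the scroll loop or the DOM work dominates a run.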

Final Recommendation: Use sequential hybrid approach (start_ultra_fast_complete.py) for best balance of speed and completeness.