Files

Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-18 19:49:24 +00:00

5.6 KiB

Raw Blame History

Parallel Optimization Results

Question: Can we do scrolling and DOM parsing in parallel?

TL;DR: No, sequential is faster. DOM parsing during scrolling adds too much overhead.

Approaches Tested

1. ❌ Full Parallel Hybrid (`start_parallel_hybrid.py`)

Strategy: Parse DOM every 5 scrolls while collecting API responses

Results:

Time: 76-103 seconds
Reviews: 244/244
Verdict: 2.3x SLOWER than sequential

Why it failed: DOM parsing is heavyweight. Even parsing every 5 scrolls adds 50-80 seconds of overhead to the scroll loop.

2. ❌ Optimized Parallel (`start_parallel_hybrid.py` v2)

Strategy: Only parse DOM in last 10 scrolls when near 234 reviews

Results:

Time: 76 seconds
Reviews: 244/244
Verdict: Still 2.2x slower than sequential

Why it failed: DOM parsing at any point during scrolling slows down the critical scroll loop.

3. ❌ Minimal Overhead Parallel (`start_optimized_hybrid.py`)

Strategy: Keep scroll loop completely clean, only parse DOM at very end

Results:

Time: 0 reviews (instability)
Verdict: FAILED - page not ready, 0 reviews captured

Why it failed: Timing instability. Difficult to get initialization exactly right.

4. ✅ WINNER: Sequential Hybrid (`start_ultra_fast_complete.py`)

Strategy:

Phase 1: Ultra-fast API scrolling (no DOM parsing)
Phase 2: Targeted DOM parsing for missing 10 reviews

Results:

Time: 32.4 seconds
Reviews: 244/244 (100%)
Speedup: 4.8x faster than original
Stability: 100% reliable

Why it works:

API scrolling is fastest when uninterrupted (19.5s)
DOM parsing is most efficient on fully loaded page (12.9s)
Clean separation = predictable, stable performance

Performance Comparison

Approach                          Time      Speedup    Reviews   Status
────────────────────────────────────────────────────────────────────────────
Original DOM Scraping             155s      1.0x       244       Baseline
Ultra-Fast API Only               19.4s     8.0x       234       Fast but incomplete
Sequential Hybrid (WINNER)        32.4s     4.8x       244       ✅ Best balance
Parallel Hybrid (every 5 scrolls) 103s      1.5x       244       Too slow
Parallel Hybrid (last 10 scrolls) 76s       2.0x       244       Still slow
Optimized Parallel                FAILED    N/A        0         Unstable

Key Findings

Why Parallel Doesn't Help

DOM Parsing is Heavy
- Finding elements: ~100-200ms per query
- Parsing each element: ~10-50ms
- Total overhead: 50-80 seconds when done during scrolling
Scroll Loop is Time-Critical
- Optimal scroll timing: 0.27 seconds
- API response collection: ~30-50ms
- Adding DOM parsing: +100-200ms = 4-8x slower per scroll
Page State Matters
- During scrolling: Elements constantly changing (stale references)
- After scrolling: Stable DOM, faster parsing

Why Sequential Wins

Clean Scroll Loop
- Only API collection (fast)
- No element queries during critical path
- Predictable timing
Efficient DOM Parsing
- Parse on stable page (no stale elements)
- Only parse top 15-20 reviews (missing ones are at top)
- Batch operation is faster than incremental
Simple = Stable
- Two clear phases, easy to debug
- No complex synchronization
- Consistent results

Theoretical Analysis

Time Breakdown

Sequential Approach:

Phase 1: API Scrolling
  - 35 scrolls × 0.27s = 9.5s
  - API collection overhead = 10.0s
  - Total Phase 1 = 19.5s

Phase 2: DOM Parsing
  - Scroll to top = 0.5s
  - Find elements = 0.8s
  - Parse 15 elements = 11.6s
  - Total Phase 2 = 12.9s

TOTAL = 32.4s

Parallel Approach (every 5 scrolls):

Combined Scrolling + DOM:
  - 40 scrolls with DOM parsing
  - Per scroll: 0.27s scroll + 2.0s DOM = 2.27s
  - Total = 90.8s (plus overhead)

TOTAL = ~103s

Parallel Approach (last 10 scrolls):

Phase 1: Fast scrolling (30 scrolls)
  - 30 × 0.27s = 8.1s

Phase 2: Slow scrolling with DOM (10 scrolls)
  - 10 × (0.27s + 6.5s) = 67.7s

TOTAL = 75.8s

Why DOM is So Slow During Scrolling

Stale Element References: Elements change as page scrolls, requiring re-queries
Layout Thrashing: DOM queries force layout recalculation
Concurrent Modifications: Page is updating while we're reading
No Batch Optimization: Can't batch when elements keep changing

Conclusion

Sequential is 2-3x faster than parallel for this use case.

Recommended Solution: start_ultra_fast_complete.py

python start_ultra_fast_complete.py

Benefits:

✅ 4.8x faster than original (32.4s vs 155s)
✅ 100% completeness (244/244 reviews)
✅ 100% stable and reliable
✅ Simple, maintainable code
✅ Saves 122 seconds per run

Why not ultra-fast API-only (8.0x)?

Missing 10 reviews (4.1%)
Only 13 seconds slower to get 100% completeness
Worth the trade-off for most use cases

Lessons Learned

"Parallel" doesn't always mean faster - overhead matters
Keep critical loops clean - don't add slow operations to tight loops
Stable state = faster operations - parse DOM when it's not changing
Simple often wins - clear phases beat complex synchronization
Measure, don't assume - test proves sequential is faster

Final Recommendation: Use sequential hybrid approach (start_ultra_fast_complete.py) for best balance of speed and completeness.

5.6 KiB Raw Blame History Unescape Escape

Parallel Optimization Results

Question: Can we do scrolling and DOM parsing in parallel?

Approaches Tested

1. ❌ Full Parallel Hybrid (start_parallel_hybrid.py)

2. ❌ Optimized Parallel (start_parallel_hybrid.py v2)

3. ❌ Minimal Overhead Parallel (start_optimized_hybrid.py)

4. ✅ WINNER: Sequential Hybrid (start_ultra_fast_complete.py)

Performance Comparison

Key Findings

Why Parallel Doesn't Help

Why Sequential Wins

Theoretical Analysis

Time Breakdown

Why DOM is So Slow During Scrolling

Conclusion

Lessons Learned

5.6 KiB

Raw Blame History

1. ❌ Full Parallel Hybrid (`start_parallel_hybrid.py`)

2. ❌ Optimized Parallel (`start_parallel_hybrid.py` v2)

3. ❌ Minimal Overhead Parallel (`start_optimized_hybrid.py`)

4. ✅ WINNER: Sequential Hybrid (`start_ultra_fast_complete.py`)