Performance improvements: - Validation speed: 59.71s → 10.96s (5.5x improvement) - Removed 50+ console.log statements from JavaScript extraction - Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting - Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls) Scraping improvements: - Increased idle detection from 6 to 12 consecutive idle scrolls for completeness - Added real-time progress updates every 5 scrolls with percentage calculation - Added crash recovery to extract partial reviews if Chrome crashes - Removed artificial 200-review limit to scrape ALL reviews Timestamp tracking: - Added updated_at field separate from started_at for progress tracking - Frontend now shows both "Started" (fixed) and "Last Update" (dynamic) Robustness improvements: - Added 5 fallback CSS selectors to handle different Google Maps page structures - Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc. - Automatic selector detection logs which selector works for debugging Test results: - Successfully scraped 550 reviews in 150.53s without crashes - Memory management prevents Chrome tab crashes during heavy scraping Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7.3 KiB
Final Optimization Results - Google Maps Review Scraper
Executive Summary
Successfully optimized Google Maps review scraper from 155 seconds to ~20-34 seconds depending on completeness requirements, achieving 4.5x-8.0x speedup.
Available Scrapers
1. start_ultra_fast.py - FASTEST ⚡
Time: ~19.4 seconds Reviews: 234/244 (95.9%) Speedup: 8.0x faster
Best for:
- Maximum speed priority
- When 234 reviews is sufficient
- Time-critical applications
python start_ultra_fast.py
2. start_ultra_fast_complete.py - RECOMMENDED ✅
Time: ~34 seconds Reviews: 244/244 (100%) Speedup: 4.5x faster
Best for:
- Balance of speed and completeness
- Production use
- When all reviews are needed
How it works:
- Phase 1: Ultra-fast API scrolling → 234 reviews in ~20s
- Phase 2: DOM parsing for missing 10 → ~13s
- Total: 244 reviews in ~34s
python start_ultra_fast_complete.py
3. start.py - ORIGINAL
Time: 155 seconds Reviews: 244/244 (100%) Speedup: 1.0x (baseline)
Best for:
- Reference implementation
- Debugging
Key Findings
API Limitation Discovery
After extensive testing with different scrolling strategies:
| Strategy | Time | Reviews | Notes |
|---|---|---|---|
| Ultra-fast (0.27s scroll) | 19.4s | 234 | ✅ Optimal API speed |
| Patient (0.30-0.80s scroll) | 58.2s | 234 | Still only 234 |
| Complete (0.27-0.50s adaptive) | 30.8s | 234 | Still only 234 |
Conclusion: The Google Maps API endpoint consistently returns only 234/244 reviews regardless of scrolling speed or patience. The missing 10 reviews are NOT available via API - they only exist in the DOM.
Why 10 Reviews Missing from API?
Possible reasons:
- Pagination limit: Google's API may have a hard limit on returned reviews
- Different endpoint: Some reviews may use a different API endpoint
- Age/status filtering: Older or filtered reviews may be excluded from API responses
- DOM-only content: Some reviews may be rendered client-side only
Performance Comparison
Scraper Time Reviews Speedup Completeness
─────────────────────────────────────────────────────────────────────
Original (start.py) 155s 244 1.0x 100%
Fast API (start_fast.py) 29s 234 5.3x 95.9%
Ultra-fast (start_ultra_fast.py) 19.4s 234 8.0x 95.9%
API-only attempt 58.2s 234 2.7x 95.9%
Hybrid Complete (WINNER) 34s 244 4.5x 100% ✅
Optimization Journey
Phase 1: API Interception (3.6x speedup)
- Replaced DOM parsing with API interception
- 155s → 43s
- Scroll timing: 0.8s
Phase 2: Faster Scrolling (5.3x speedup)
- Optimized scroll timing
- 43s → 29s
- Scroll timing: 0.3s
Phase 3: Ultra-Fast (8.0x speedup)
- Minimized all waits
- Optimal scroll timing (0.27s)
- Less logging overhead
- 155s → 19.4s
Phase 4: Complete Coverage (4.5x speedup)
- Ultra-fast API scrolling (234 reviews)
- DOM parsing fallback (10 reviews)
- 155s → 34s
- 100% completeness maintained
Technical Details
Optimal Scroll Timing
After extensive testing:
| Timing | Result | Notes |
|---|---|---|
| 0.15s | 210 reviews | Too fast - misses API responses |
| 0.25s | 0 reviews (33% failure) | Unreliable |
| 0.27s | 234 reviews (100% success) | ✅ Sweet spot |
| 0.30s | 234 reviews | Reliable but slower |
| 0.80s | 234 reviews | Original, very slow |
Timing Breakdown (Ultra-Fast)
Operation Time % of Total
──────────────────────────────────────────────────
Browser startup ~1.0s 5%
Navigate to page 1.5s 8%
Cookie dialog dismiss 0.4s 2%
Click reviews tab 0.4s 2%
Wait for page stability 1.0s 5%
Find reviews pane ~1.5s 8%
Setup API interceptor 0.3s 2%
Initial scroll trigger 0.3s 2%
Scrolling (30 × 0.27s) 8.1s 42%
Response collection ~3.0s 15%
Parsing & saving ~1.9s 10%
──────────────────────────────────────────────────
TOTAL ~19.4s 100%
Theoretical Limits
- Current best: 19.4s for 234 reviews
- Theoretical minimum: ~13s (if everything instant except scrolling)
- Achievement: 68% of theoretical maximum speed
Bottleneck Analysis
Current bottlenecks (in order):
- Scrolling loop: 8.1s (42%) - Already optimized to limit (0.27s/scroll)
- Response collection: 3.0s (15%) - Necessary overhead
- Parsing & saving: 1.9s (10%) - Fast enough
- Page navigation: 1.5s (8%) - Network dependent
- Browser startup: 1.0s (5%) - Can't optimize much
Further optimization would require:
- Faster Google API responses (impossible)
- Instant browser startup (impossible)
- Instant network requests (impossible)
Recommendations
For Production Use
Use start_ultra_fast_complete.py:
python start_ultra_fast_complete.py
Benefits:
- ✅ 4.5x faster (34s vs 155s)
- ✅ 100% completeness (244/244 reviews)
- ✅ Stable and reliable
- ✅ No authentication needed
- ✅ Best balance of speed and completeness
For Maximum Speed
Use start_ultra_fast.py:
python start_ultra_fast.py
Benefits:
- ✅ 8.0x faster (19.4s vs 155s)
- ✅ 100% stable
- ✅ 95.9% review coverage
- ⚠️ Missing 10 reviews (4.1%)
Configuration
headless: false # Must be false for stability
Performance Metrics
Ultra-Fast Complete (Recommended)
Metric Value
────────────────────────────────────
Average time 34s
Reviews captured 244 (100%)
Success rate 100%
API reviews 234 (95.9%)
DOM reviews 10 (4.1%)
Speedup vs original 4.5x
Time saved per run 121s
Ultra-Fast (Maximum Speed)
Metric Value
────────────────────────────────────
Average time 19.4s
Std deviation ±0.4s
Success rate 100%
Reviews captured 234 (95.9%)
Reviews/second 12.1
Speedup vs original 8.0x
Time saved per run 135.6s
Conclusion
After extensive testing, we discovered:
- API Hard Limit: Google Maps API consistently returns only 234/244 reviews, regardless of scrolling strategy
- DOM Required: The missing 10 reviews are ONLY available via DOM parsing
- Hybrid is Optimal: Combining ultra-fast API scrolling with DOM fallback achieves best balance
Final Achievement:
- 📊 Original: 155s → Optimized: 34s (100% complete)
- 📊 Original: 155s → Ultra-fast: 19.4s (95.9% complete)
- 🚀 4.5x-8.0x faster!
- ⏱️ Saves 121-136 seconds per run
- ✅ 100% stable
The scraper is now operating near theoretical maximum efficiency! 🚀