Commit Graph

16 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
b55a7a0fb1 Refresh scroll container after sorting to prevent stale reference
After sorting by newest, Google Maps may recreate DOM elements which
makes the Python scroll_container reference stale. Now re-find the
container after sorting to ensure we have a valid reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:37:19 +00:00
Alejandro Gutiérrez
5db277ad2f Stop immediately when all reviews collected
- Check total_reviews before recovery attempts
- Exit loop as soon as current_count >= total_reviews
- Reduces scrape time significantly (13s vs 56s for 247 reviews)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:19:45 +00:00
Alejandro Gutiérrez
f1f1aa0785 Sort output by DOM visual order + fix browser issue
- Track DOM order for all reviews (review_order dict)
- Sort output by DOM position (preserves "Newest" sort order)
- API content + DOM order = best of both
- Remove click in recovery method 4 to avoid opening profile pages

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:17:11 +00:00
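The ordering idea in this commit can be sketched as follows. This is a minimal illustration, not the repository's code: only the name `review_order` comes from the commit message, while `record_dom_order`, `sort_by_dom_order`, and the shape of the `reviews` mapping (review ID to record) are assumptions.

```python
from typing import Dict, Iterable, List


def record_dom_order(review_order: Dict[str, int], visible_ids: Iterable[str]) -> None:
    """Remember the first position at which each review ID was seen in the DOM."""
    for review_id in visible_ids:
        # setdefault keeps the original position if the ID was seen before.
        review_order.setdefault(review_id, len(review_order))


def sort_by_dom_order(reviews: Dict[str, dict], review_order: Dict[str, int]) -> List[dict]:
    """Return review records ordered by DOM position.

    Reviews that were never seen in the DOM (for example, API-only ones)
    sort last; sorted() is stable, so they keep their insertion order.
    """
    ordered_ids = sorted(reviews, key=lambda rid: review_order.get(rid, float("inf")))
    return [reviews[rid] for rid in ordered_ids]
```

Calling `record_dom_order` once per scroll cycle with the currently visible IDs, then `sort_by_dom_order` at output time, keeps the page's "Newest" ordering even when the record contents come from the API.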
Alejandro Gutiérrez
7abff25dc6 Full text + deduplication: API parser + More button expansion
- Fix API parser to use correct Google Maps response structure
  - Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0]
  - Text at [2][15][0][0], Timestamp at [1][6]
- Use review_id as key for both API and DOM to avoid duplicates
- Prefer API data (original language, full text)
- Expand "More" buttons before sorting and during scroll loop
- Results: 246/247 full text (99.6%), up from 36/247 before

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:09:40 +00:00
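The index paths listed in this commit can be read with a small defensive lookup. The sketch below assumes each review arrives as a nested list; the helper names `dig`, `FIELD_PATHS`, and `parse_api_review` are hypothetical, and only the index paths themselves are taken from the commit message.

```python
from typing import Any, Optional, Sequence

# Index paths as stated in the commit message.
FIELD_PATHS = {
    "review_id": (0,),
    "author": (1, 4, 5, 0),
    "rating": (2, 0, 0),
    "text": (2, 15, 0, 0),
    "timestamp": (1, 6),
}


def dig(node: Any, path: Sequence[int]) -> Any:
    """Follow a path of list indexes, returning None if any step is missing."""
    for index in path:
        if not isinstance(node, list) or index >= len(node):
            return None
        node = node[index]
    return node


def parse_api_review(entry: list) -> Optional[dict]:
    """Extract the known fields from one review entry, or None without an ID."""
    review = {name: dig(entry, path) for name, path in FIELD_PATHS.items()}
    # The review ID is the deduplication key, so an entry without one is dropped.
    return review if review["review_id"] else None
```

Returning `None` for a missing step, rather than raising, lets a single malformed entry be skipped without aborting the whole batch.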
Alejandro Gutiérrez
b4fae38027 Add polling for total count detection on page load
- Poll for up to 5s waiting for span[role="img"][aria-label*="review"]
- Element may not be present immediately after consent handling
- Tested: Soho Club 247/247 reviews in 31.4s with correct total

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:30:17 +00:00
Alejandro Gutiérrez
94240ef2cc Fix total review count detection - use robust selector on Overview tab
- Detect total BEFORE clicking reviews tab (element is on Overview)
- Use span[role="img"][aria-label*="review"] (robust, no class names)
- Extract count from aria-label (e.g., "260 reviews" → 260)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:23:00 +00:00
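Extracting the count from the aria-label could look roughly like this. The function name `parse_review_total` and the handling of thousands separators are assumptions; the commit only gives the example "260 reviews" → 260.

```python
import re
from typing import Optional


def parse_review_total(aria_label: str) -> Optional[int]:
    """Return the integer in front of "review(s)", or None if there is none."""
    match = re.search(r"(\d[\d.,]*)\s+reviews?\b", aria_label, re.IGNORECASE)
    if not match:
        return None
    # Drop separators such as "," or "." before converting (assumed formats).
    return int(re.sub(r"\D", "", match.group(1)))
```

For instance, `parse_review_total("260 reviews")` yields `260`, while a label without a review count yields `None` so the caller can fall back to another method.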
Alejandro Gutiérrez
10b32244d7 Add delayed separator removal to keep DOM light
- Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards
- Separators removed on next cycle, not immediately (preserves scroll)
- DOM growth reduced by ~50% during long scrapes
- Tested: 2000 reviews in 103s (19.3/s) with all features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:18:50 +00:00
Alejandro Gutiérrez
cbc2e9c617 Robust selectors: Replace CSS class names with data/aria attributes
- Use [data-review-id] + aria-label check for review cards
- Extract author from button[aria-label^="Photo of"]
- Use span[role="img"][aria-label*="star"] for rating
- Pattern matching for timestamp ("X time ago")
- Longest text span heuristic for review text

A/B tested: 100% match with old class-based selectors.
Survives Google's CSS class name changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:20:51 +00:00
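Two of the heuristics above (timestamp pattern matching and the longest-text rule) can be illustrated without a browser. This is a sketch under assumptions: the input is taken to be the list of span texts already read from one review card, the English "X time ago" pattern is a guess at the actual pattern, and `pick_timestamp_and_text` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional

# Assumed English-only pattern for relative timestamps such as "2 weeks ago".
_RELATIVE_TIME = re.compile(
    r"^(?:a|an|\d+)\s+(?:second|minute|hour|day|week|month|year)s?\s+ago$",
    re.IGNORECASE,
)


def pick_timestamp_and_text(span_texts: List[str]) -> Dict[str, Optional[str]]:
    """Pick the timestamp by pattern and the review text as the longest span."""
    timestamp = next(
        (t for t in span_texts if _RELATIVE_TIME.match(t.strip())), None
    )
    # Exclude the timestamp so a short review cannot lose to it.
    candidates = [t for t in span_texts if t != timestamp]
    text = max(candidates, key=len, default="")
    return {"timestamp": timestamp, "text": text}
```

Because neither rule depends on a class name, both keep working when the page's generated CSS classes change, which is the point the commit makes.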
Alejandro Gutiérrez
d989178119 7x faster scraping with JS parsing + batch flushing
Performance improvements:
- JS-based DOM parsing (single browser call vs Selenium round-trips)
- Batch flushing to disk every 500 reviews to free memory
- Hide parsed elements (display:none) to reduce DOM overhead
- Cycle timing instrumentation for debugging slowdowns

Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:01:22 +00:00
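Batch flushing of this kind can be sketched as a small buffered writer. The class name `BatchWriter`, the JSON Lines output format, and the `flushed` counter are assumptions made for illustration; only the batch size of 500 comes from the commit.

```python
import json
from pathlib import Path
from typing import List, Union


class BatchWriter:
    """Buffer reviews in memory and append them to disk in batches."""

    def __init__(self, path: Union[str, Path], batch_size: int = 500) -> None:
        self.path = Path(path)
        self.batch_size = batch_size
        self.buffer: List[dict] = []
        self.flushed = 0  # number of reviews already written to disk

    def add(self, review: dict) -> None:
        self.buffer.append(review)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Append mode, one JSON object per line, so earlier batches are kept.
        with self.path.open("a", encoding="utf-8") as handle:
            for review in self.buffer:
                handle.write(json.dumps(review, ensure_ascii=False) + "\n")
        self.flushed += len(self.buffer)
        self.buffer.clear()
```

A final `flush()` after the scroll loop ends is needed to write the last partial batch.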
Alejandro Gutiérrez
0778b2e07d Fix total review count detection - sum star ratings
Previous detection was matching wrong elements (partial counts).
Now sums "X stars, Y reviews" aria-labels for accurate total.

Fallback methods:
1. Sum star rating counts (most accurate)
2. Reviews tab text like "Reviews (247)"
3. Span with "X reviews" text

Tested: Soho Club 247/247 correctly detected

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:50:06 +00:00
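The fallback chain described in this commit might be expressed as below. The inputs are assumed to be strings already read from the page (aria-labels, tab text, span text); the function names and exact regular expressions are hypothetical.

```python
import re
from typing import Iterable, Optional

_STAR_ROW = re.compile(r"\d\s+stars?,\s*(\d[\d.,]*)\s+reviews?", re.IGNORECASE)


def _to_int(number_text: str) -> int:
    """Convert "1,234" or "1.234" to 1234 by dropping non-digits."""
    return int(re.sub(r"\D", "", number_text))


def total_from_star_rows(labels: Iterable[str]) -> Optional[int]:
    """Method 1: sum the per-star counts from "X stars, Y reviews" labels."""
    counts = [_to_int(m.group(1)) for m in map(_STAR_ROW.search, labels) if m]
    return sum(counts) if counts else None


def detect_total(star_labels: Iterable[str], tab_text: str = "", span_text: str = "") -> Optional[int]:
    """Try the three methods in the order listed in the commit message."""
    total = total_from_star_rows(star_labels)
    if total is not None:
        return total
    fallbacks = (
        (r"Reviews\s*\((\d[\d.,]*)\)", tab_text),   # method 2: "Reviews (247)"
        (r"(\d[\d.,]*)\s+reviews?\b", span_text),   # method 3: "X reviews"
    )
    for pattern, text in fallbacks:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return _to_int(match.group(1))
    return None
```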
Alejandro Gutiérrez
6934838a69 Real-time parsing + image blocking for large datasets
Key improvements:
- Parse reviews immediately during scroll (not at end)
- Fixes virtual scroll issue - was losing reviews after ~1000
- Block images via CDP for faster loading
- Smart recovery: 4 methods (keys, wheel, scroll up/down, click card)
- Dynamic timeout based on scroll state and content growth
- Spinner + network activity detection resets idle timer
- Sort by newest first option

Results: 1930 reviews (was 990) on 2433-review location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:25:26 +00:00
Alejandro Gutiérrez
6a75159ebe Use immediate element detection with 10ms polling
- Replace fixed waits with tight polling loops
- 10ms sleep between polls (responsive but low CPU)
- Consent, tabs, scroll container all detected immediately
- Total time reduced to ~11-12 seconds

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:52:18 +00:00
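A tight polling loop of the kind described here can be sketched in a few lines. `poll_until` and its `probe` callable are hypothetical names; the 10 ms interval comes from the commit, and the 5-second default timeout is an assumption.

```python
import time
from typing import Any, Callable, Optional


def poll_until(probe: Callable[[], Any], timeout: float = 5.0, interval: float = 0.01) -> Optional[Any]:
    """Call probe() every `interval` seconds until it returns something truthy.

    Returns the truthy result, or None once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With Selenium, `probe` could be a lambda that looks up the consent button, the reviews tab, or the scroll container and returns it when present. The loop returns as soon as the element appears instead of always paying a fixed wait.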
Alejandro Gutiérrez
4f48fb28cd Optimize wait times for faster scraping
- Reduce initial page load wait: 3s -> 1s
- Reduce consent click wait: 2s -> 0.5s
- Reduce post-consent reload wait: 3s -> 1s
- Reduce tab click wait: 2s -> 0.3s
- Use smart polling for tabs (0.25s intervals, up to 2.5s)
- Use faster scroll container polling (0.25s intervals)
- Remove redundant 2s wait after reviews load

Total execution time reduced from ~22s to ~13s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:49:12 +00:00
Alejandro Gutiérrez
218927bd9b Filter out garbage API data (language codes, metadata)
- Reject authors with <= 3 chars (language codes like "es", "it", "no")
- Reject known non-review authors ("google", "maps", etc.)
- Reject timestamps that are URLs or very short strings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:47:08 +00:00
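The three rejection rules could be combined into one predicate, roughly as follows. The blocked-author set contains only the two names the commit gives (the rest of its list is not shown), and `MIN_TIMESTAMP_LEN` is an assumed threshold for "very short"; both, like the function name, are illustrative.

```python
from typing import Optional

BLOCKED_AUTHORS = {"google", "maps"}  # only the names quoted in the commit
MIN_TIMESTAMP_LEN = 5                 # assumed cutoff for "very short strings"


def is_plausible_review(author: Optional[str], timestamp: Optional[str]) -> bool:
    """Return False for entries that look like metadata rather than reviews."""
    name = (author or "").strip()
    if len(name) <= 3:                      # language codes such as "es", "it"
        return False
    if name.lower() in BLOCKED_AUTHORS:     # known non-review authors
        return False
    stamp = str(timestamp or "").strip()
    if stamp.lower().startswith(("http://", "https://")):  # URL, not a time
        return False
    if len(stamp) < MIN_TIMESTAMP_LEN:      # too short to be a timestamp
        return False
    return True
```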
Alejandro Gutiérrez
0e8a711a9c Fix clean scraper: specific selectors, consent reload, DOM parsing
- Use div.jftiEf[data-review-id] selector to exclude button elements
- Reload original URL after consent (prevents URL corruption)
- Parse full DOM data after scrolling stops
- Deduplicate API reviews by author match
- Remove slow "More" button clicking for speed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:40:15 +00:00
Alejandro Gutiérrez
2c7ba2ae40 Add clean scraper with fixed smooth scrolling
Key improvements:
- Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll
- JavaScript-based review ID collection (doesn't affect scroll position)
- API interception via injected fetch/XHR interceptor
- Total review count extraction from page
- Auto-stop when all reviews collected or timeout reached

The scroll issue was caused by Selenium's find_elements() affecting scroll
position. Using pure JavaScript for data collection keeps scroll pinned to bottom.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:28:24 +00:00
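The 10 Hz background scrolling can be sketched with a stoppable thread. `BackgroundScroller` and its `scroll_once` callable are hypothetical; in a Selenium setup `scroll_once` might wrap a `driver.execute_script(...)` call that sets the container's `scrollTop`, but that wiring is an assumption and is not shown in the commit.

```python
import threading
from typing import Callable


class BackgroundScroller:
    """Invoke `scroll_once` at a fixed interval on a daemon thread."""

    def __init__(self, scroll_once: Callable[[], None], interval: float = 0.1) -> None:
        self._scroll_once = scroll_once
        self._interval = interval  # 0.1 s gives roughly 10 scrolls per second
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop sleeps between scrolls and exits promptly when stopped.
        while not self._stop.wait(self._interval):
            self._scroll_once()

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

Using an `Event` rather than `time.sleep` means `stop()` does not have to wait out a full interval, which matters for the auto-stop behaviour once all reviews are collected.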