After sorting by newest, Google Maps may recreate DOM elements, which
makes the Python scroll_container reference stale. Re-find the
container after sorting so that we hold a valid reference.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Check total_reviews before recovery attempts
- Exit loop as soon as current_count >= total_reviews
- Reduces scrape time significantly (13s vs 56s for 247 reviews)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
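The early-exit check from the commit above could look roughly like the sketch below. The names `scroll_until_done`, `get_count`, `scroll_step`, `recover` and `max_stalls` are hypothetical stand-ins, not the scraper's real functions; only the ordering (compare against `total_reviews` first, attempt recovery second) comes from the commit message.

```python
from typing import Callable, Optional


def scroll_until_done(
    get_count: Callable[[], int],
    scroll_step: Callable[[], None],
    recover: Callable[[], None],
    total_reviews: Optional[int],
    max_stalls: int = 3,  # assumed limit, not taken from the source
) -> int:
    """Scroll until every review is loaded or progress stalls for good."""
    stalls = 0
    previous = -1
    while True:
        current = get_count()
        # Check the known total first so no recovery attempt is wasted.
        if total_reviews is not None and current >= total_reviews:
            return current
        if current > previous:
            stalls = 0
        else:
            stalls += 1
            if stalls > max_stalls:
                return current
            recover()
        previous = current
        scroll_step()
```

When the total is unknown (`None`), the loop falls back to the stall counter alone.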
- Track DOM order for all reviews (review_order dict)
- Sort output by DOM position (preserves "Newest" sort order)
- API content + DOM order = best of both
- Remove click in recovery method 4 to avoid opening profile pages
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
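A minimal sketch of the ordering step described above, assuming `review_order` maps each review ID to its position in the DOM and that each review is a dict with a `review_id` key. The function name and the choice to place reviews missing from the DOM at the end are assumptions.

```python
from typing import Dict, List


def sort_by_dom_order(
    reviews: Dict[str, dict], review_order: Dict[str, int]
) -> List[dict]:
    """Return reviews ordered by DOM position (the "Newest" order)."""
    # Reviews never seen in the DOM sort after all known positions.
    fallback = len(review_order)
    return sorted(
        reviews.values(),
        key=lambda review: review_order.get(review["review_id"], fallback),
    )
```

Because `sorted` is stable, reviews without a recorded DOM position keep their relative insertion order at the tail.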
- Fix API parser to use correct Google Maps response structure
- Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0]
- Text at [2][15][0][0], Timestamp at [1][6]
- Use review_id as key for both API and DOM to avoid duplicates
- Prefer API data (original language, full text)
- Expand "More" buttons before sorting and during scroll loop
- Results: 246/247 full text (99.6%), up from 36/247 before
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
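The index paths listed above can be read with a small defensive helper, sketched here. The paths are copied from the commit message; the response layout is undocumented, so `dig`, `parse_api_review` and `merge_reviews` are illustrative names and every lookup returns `None` rather than raising when the structure differs.

```python
from typing import Any, Dict, Tuple

# Index paths as listed in the commit message.
FIELD_PATHS: Dict[str, Tuple[int, ...]] = {
    "review_id": (0,),
    "author": (1, 4, 5, 0),
    "rating": (2, 0, 0),
    "text": (2, 15, 0, 0),
    "timestamp": (1, 6),
}


def dig(node: Any, path: Tuple[int, ...]) -> Any:
    """Follow a chain of list indices, returning None if any step is missing."""
    for index in path:
        if not isinstance(node, list) or index >= len(node):
            return None
        node = node[index]
    return node


def parse_api_review(entry: list) -> Dict[str, Any]:
    """Extract the known fields from one raw API review entry."""
    return {field: dig(entry, path) for field, path in FIELD_PATHS.items()}


def merge_reviews(dom: Dict[str, dict], api: Dict[str, dict]) -> Dict[str, dict]:
    """Key both sources by review_id; API data wins on conflict."""
    merged = dict(dom)
    merged.update(api)
    return merged
```

Keying both sources by `review_id` means a review seen in both places occupies one slot, with the API version (original language, full text) replacing the DOM version.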
- Poll for up to 5s waiting for span[role="img"][aria-label*="review"]
- Element may not be present immediately after consent handling
- Tested: Soho Club 247/247 reviews in 31.4s with correct total
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Detect total BEFORE clicking reviews tab (element is on Overview)
- Use span[role="img"][aria-label*="review"] (robust, no class names)
- Extract count from aria-label (e.g., "260 reviews" → 260)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
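Extracting the count from the aria-label could be done along these lines. This is a sketch that assumes an English label and tolerates thousands separators; `parse_review_count` is a hypothetical name.

```python
import re
from typing import Optional

COUNT_PATTERN = re.compile(r"(\d[\d,.]*)\s+reviews?", re.IGNORECASE)


def parse_review_count(aria_label: str) -> Optional[int]:
    """Return the review count from a label such as "260 reviews"."""
    match = COUNT_PATTERN.search(aria_label)
    if match is None:
        return None
    # Drop separators such as "," or "." before converting.
    return int(re.sub(r"\D", "", match.group(1)))
```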
- Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards
- Separators removed on next cycle, not immediately (preserves scroll)
- DOM growth reduced by ~50% during long scrapes
- Tested: 2000 reviews in 103s (19.3/s) with all features
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use [data-review-id] + aria-label check for review cards
- Extract author from button[aria-label^="Photo of"]
- Use span[role="img"][aria-label*="star"] for rating
- Pattern matching for timestamp ("X time ago")
- Longest text span heuristic for review text
A/B tested: 100% match with old class-based selectors.
Survives Google's CSS class name changes.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
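The timestamp pattern and the longest-span heuristic from the commit above might be combined as in this sketch. It assumes the text of each span inside a review card has already been collected into a list and that timestamps are English relative times; the function names and the exact regular expression are illustrative.

```python
import re
from typing import Iterable, Optional

# Matches strings such as "2 weeks ago" or "a month ago".
RELATIVE_TIME = re.compile(
    r"^(?:a|an|\d+)\s+(?:second|minute|hour|day|week|month|year)s?\s+ago$",
    re.IGNORECASE,
)


def pick_timestamp(span_texts: Iterable[str]) -> Optional[str]:
    """Return the first span that looks like "X time ago"."""
    for text in span_texts:
        if RELATIVE_TIME.match(text.strip()):
            return text.strip()
    return None


def pick_review_text(span_texts: Iterable[str]) -> str:
    """Return the longest non-timestamp span, or "" if none qualifies."""
    candidates = [
        text.strip()
        for text in span_texts
        if text.strip() and not RELATIVE_TIME.match(text.strip())
    ]
    return max(candidates, key=len, default="")
```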
Performance improvements:
- JS-based DOM parsing (single browser call vs Selenium round-trips)
- Batch flushing to disk every 500 reviews to free memory
- Hide parsed elements (display:none) to reduce DOM overhead
- Cycle timing instrumentation for debugging slowdowns
Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
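Batch flushing as described above can be sketched as a small buffer that appends to disk once it reaches 500 entries. The JSON Lines output format and the `BatchWriter` class are assumptions made for illustration; the commit message does not say how reviews are serialized.

```python
import json
from pathlib import Path
from typing import List, Union


class BatchWriter:
    """Buffer reviews in memory and append them to a file in batches."""

    def __init__(self, path: Union[str, Path], batch_size: int = 500) -> None:
        self.path = Path(path)
        self.batch_size = batch_size
        self.flushed = 0  # number of reviews written so far
        self._buffer: List[dict] = []

    def add(self, review: dict) -> None:
        self._buffer.append(review)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write buffered reviews as one JSON object per line, then clear."""
        if not self._buffer:
            return
        with self.path.open("a", encoding="utf-8") as handle:
            for review in self._buffer:
                handle.write(json.dumps(review, ensure_ascii=False) + "\n")
        self.flushed += len(self._buffer)
        self._buffer.clear()
```

A final `flush()` after the scrape loop is needed to write the last partial batch.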
Previous detection was matching wrong elements (partial counts).
Now sums "X stars, Y reviews" aria-labels for accurate total.
Fallback methods:
1. Sum star rating counts (most accurate)
2. Reviews tab text like "Reviews (247)"
3. Span with "X reviews" text
Tested: Soho Club 247/247 correctly detected
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
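The fallback chain above could be expressed as follows. This sketch assumes the aria-labels, the tab text and the span texts have already been read from the page as plain strings; `detect_total` and its parameters are hypothetical names, and the patterns are inferred from the example strings in the commit message.

```python
import re
from typing import Iterable, Optional

STAR_ROW = re.compile(r"(\d)\s+stars?,\s*(\d[\d,.]*)\s+reviews?", re.IGNORECASE)
TAB_TEXT = re.compile(r"Reviews\s*\((\d[\d,.]*)\)", re.IGNORECASE)
SPAN_TEXT = re.compile(r"(\d[\d,.]*)\s+reviews?", re.IGNORECASE)


def _to_int(digits: str) -> int:
    return int(re.sub(r"\D", "", digits))


def detect_total(
    star_labels: Iterable[str],
    tab_text: str = "",
    span_texts: Iterable[str] = (),
) -> Optional[int]:
    """Try each detection method in order of expected accuracy."""
    # 1. Sum the per-star histogram rows ("5 stars, 200 reviews").
    counts = [
        _to_int(match.group(2))
        for match in map(STAR_ROW.search, star_labels)
        if match
    ]
    if counts:
        return sum(counts)
    # 2. Reviews tab text such as "Reviews (247)".
    match = TAB_TEXT.search(tab_text)
    if match:
        return _to_int(match.group(1))
    # 3. A span whose whole text is "X reviews".
    for text in span_texts:
        match = SPAN_TEXT.fullmatch(text.strip())
        if match:
            return _to_int(match.group(1))
    return None
```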
- Replace fixed waits with tight polling loops
- 10ms sleep between polls (responsive but low CPU)
- Consent, tabs, scroll container all detected immediately
- Total time reduced to ~11-12 seconds
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
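A tight polling loop of the kind described above might look like this sketch. `poll_until` and `probe` are hypothetical names; in the scraper, `probe` would presumably wrap an element lookup. The 10 ms interval comes from the commit message, while the 5 s default timeout is an assumption.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    probe: Callable[[], Optional[T]],
    timeout: float = 5.0,  # assumed default
    interval: float = 0.01,  # 10 ms between polls
) -> Optional[T]:
    """Call probe repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Unlike a fixed wait, the call returns as soon as the probe succeeds, so the common case costs at most one interval of latency.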
- Reject authors with <= 3 chars (language codes like "es", "it", "no")
- Reject known non-review authors ("google", "maps", etc.)
- Reject timestamps that are URLs or very short strings
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
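The rejection rules above can be sketched as a single predicate. The blocklist contains only the two names the commit message spells out, and `MIN_TIMESTAMP_LENGTH` is an assumed threshold for "very short"; both would need adjusting to match the real scraper.

```python
# Only the names given explicitly in the commit message; the real list is longer.
NON_REVIEW_AUTHORS = {"google", "maps"}
MIN_TIMESTAMP_LENGTH = 3  # assumed cutoff for "very short" timestamps


def is_plausible_review(author: str, timestamp: str) -> bool:
    """Return False for entries that look like UI elements, not reviews."""
    name = author.strip()
    if len(name) <= 3:  # catches language codes such as "es", "it", "no"
        return False
    if name.lower() in NON_REVIEW_AUTHORS:
        return False
    stamp = timestamp.strip()
    if stamp.lower().startswith(("http://", "https://")):
        return False
    if len(stamp) < MIN_TIMESTAMP_LENGTH:
        return False
    return True
```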
- Use div.jftiEf[data-review-id] selector to exclude button elements
- Reload original URL after consent (prevents URL corruption)
- Parse full DOM data after scrolling stops
- Deduplicate API reviews by author match
- Remove slow "More" button clicking for speed
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Key improvements:
- Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll
- JavaScript-based review ID collection (doesn't affect scroll position)
- API interception via injected fetch/XHR interceptor
- Total review count extraction from page
- Auto-stop when all reviews collected or timeout reached
The scroll issue was caused by Selenium's find_elements() affecting the scroll
position. Using pure JavaScript for data collection keeps the scroll pinned to the bottom.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
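The 10 Hz background scroll described above could be structured like the sketch below. `BackgroundScroller` and `scroll_once` are hypothetical names; in the scraper, `scroll_once` would presumably run a JavaScript snippet that sets the container's `scrollTop` to its `scrollHeight`.

```python
import threading
from typing import Callable


class BackgroundScroller:
    """Invoke a scroll callback at a fixed interval on a daemon thread."""

    def __init__(self, scroll_once: Callable[[], None], interval: float = 0.1) -> None:
        self._scroll_once = scroll_once
        self._interval = interval  # 0.1 s gives roughly 10 Hz
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self._scroll_once()
            # wait() doubles as the sleep and returns early on stop().
            self._stop.wait(self._interval)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

Using `Event.wait` instead of `time.sleep` lets `stop()` take effect without waiting out the rest of the interval, which suits the auto-stop condition.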