Commit Graph

20 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
0682c0ec61 Add get_business_card_info to scraper_clean with multilingual support
Replaces fast_scraper validation with efficient polling-based extraction
using the same navigation pattern as scrape_reviews:
- 10ms polling for consent handling (no fixed waits)
- 100ms polling for data extraction
- Exits early when data found

Supports multiple languages:
- Rating: stars/estrellas/étoiles/sterne/stelle
- Reviews: reviews/reseñas/avis/bewertungen/recensioni
- Handles comma decimals (4,8 -> 4.8)

Result: 6.3s to extract name, address, rating, total_reviews
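
The multilingual parsing listed above could look roughly like the sketch below. The function names and regular expressions are illustrative assumptions, not the code in scraper_clean. Only the plural word forms named in this message are handled.

```python
import re

# Word lists copied from the commit message; singular forms are not covered.
RATING_WORDS = ("stars", "estrellas", "étoiles", "sterne", "stelle")
REVIEW_WORDS = ("reviews", "reseñas", "avis", "bewertungen", "recensioni")

_RATING_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(?:" + "|".join(RATING_WORDS) + r")", re.IGNORECASE
)
_REVIEWS_RE = re.compile(
    r"(\d[\d.,\s]*)\s*(?:" + "|".join(REVIEW_WORDS) + r")", re.IGNORECASE
)


def parse_rating(label):
    """Return the rating as a float, accepting comma decimals (4,8 -> 4.8)."""
    match = _RATING_RE.search(label)
    if not match:
        return None
    return float(match.group(1).replace(",", "."))


def parse_review_count(label):
    """Return the review count as an int, ignoring thousands separators."""
    match = _REVIEWS_RE.search(label)
    if not match:
        return None
    digits = re.sub(r"\D", "", match.group(1))
    return int(digits) if digits else None
```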

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:52:06 +00:00
Alejandro Gutiérrez
80e7771c00 Fix DOM cleanup: hide cards from API interception too
The continue statement was skipping the card.style.display='none'
and card.innerHTML='' cleanup for cards already seen via API
interception. This caused the DOM to grow without bound during long scrapes.

Now ALL processed cards are hidden regardless of data source.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:23:51 +00:00
Alejandro Gutiérrez
a6d6531543 Switch production to scraper_clean with hard refresh recovery
- Add fast_scrape_reviews() wrapper to scraper_clean.py for API compatibility
- Set window size (1200x900) in wrapper to ensure proper Google Maps rendering
- Update job_manager.py to import from scraper_clean instead of fast_scraper
- Production now uses clean scraper with:
  - Hard refresh recovery when stuck after 8+ soft recovery attempts
  - API interception + DOM parsing for complete data collection
  - Automatic deduplication across refreshes

Tested: 589/589 reviews collected in 55s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:18:10 +00:00
Alejandro Gutiérrez
ff03a4a1b7 Add hard refresh recovery for stuck scraper
When the scraper gets stuck (8+ failed soft recovery attempts), it now
does a hard page refresh and re-runs the full page setup:
- Reloads the page
- Re-clicks reviews tab
- Re-sorts by newest
- Re-injects API interceptor
- Continues collecting with existing seen_ids for deduplication

Key changes:
- Extract page setup into reusable setup_reviews_page() function
- Add do_hard_refresh() that calls setup on refresh
- Trigger hard refresh after 8 failed soft recoveries
- Try hard refresh before timeout gives up completely
- Max 3 hard refreshes before truly giving up
- Reset recovery counter after successful hard refresh

This ensures the scraper can recover from browser issues, DOM detachment,
or other problems that soft recovery (scroll tricks) can't fix.
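
The escalation rules above (8 soft recoveries, then a hard refresh, at most 3 hard refreshes) can be modelled as a small state machine. This is a sketch under stated assumptions: RecoveryState and next_action are hypothetical names, and every hard refresh is assumed to succeed, so the soft counter always resets.

```python
from dataclasses import dataclass

SOFT_RECOVERY_LIMIT = 8  # from the commit message
MAX_HARD_REFRESHES = 3   # from the commit message


@dataclass
class RecoveryState:
    soft_failures: int = 0
    hard_refreshes: int = 0

    def next_action(self):
        """Return 'soft', 'hard', or 'give_up' for the next stuck cycle."""
        if self.soft_failures < SOFT_RECOVERY_LIMIT:
            self.soft_failures += 1
            return "soft"
        if self.hard_refreshes < MAX_HARD_REFRESHES:
            self.hard_refreshes += 1
            # Assumption: the refresh succeeded, so soft recovery starts over.
            self.soft_failures = 0
            return "hard"
        return "give_up"
```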

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:42:54 +00:00
Alejandro Gutiérrez
b55a7a0fb1 Refresh scroll container after sorting to prevent stale reference
After sorting by newest, Google Maps may recreate DOM elements which
makes the Python scroll_container reference stale. Now re-find the
container after sorting to ensure we have a valid reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:37:19 +00:00
Alejandro Gutiérrez
5db277ad2f Stop immediately when all reviews collected
- Check total_reviews before recovery attempts
- Exit loop as soon as current_count >= total_reviews
- Reduces scrape time significantly (13s vs 56s for 247 reviews)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:19:45 +00:00
Alejandro Gutiérrez
f1f1aa0785 Sort output by DOM visual order + fix browser issue
- Track DOM order for all reviews (review_order dict)
- Sort output by DOM position (preserves "Newest" sort order)
- API content + DOM order = best of both
- Remove click in recovery method 4 to avoid opening profile pages
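
A minimal sketch of combining API content with DOM order, assuming reviews are kept in dicts keyed by review ID. The function name and argument shapes are assumptions; only review_order is named in this message.

```python
def merge_in_dom_order(api_reviews, dom_reviews, review_order):
    """Merge both sources by review ID and sort by DOM position.

    api_reviews / dom_reviews: dicts mapping review_id -> review dict.
    review_order: dict mapping review_id -> position seen in the DOM.
    """
    merged = dict(dom_reviews)
    merged.update(api_reviews)  # API content wins when both sources have an ID
    # IDs never seen in the DOM sort after everything that was seen.
    last = len(review_order)
    ordered_ids = sorted(merged, key=lambda rid: review_order.get(rid, last))
    return [merged[rid] for rid in ordered_ids]
```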

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:17:11 +00:00
Alejandro Gutiérrez
7abff25dc6 Full text + deduplication: API parser + More button expansion
- Fix API parser to use correct Google Maps response structure
  - Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0]
  - Text at [2][15][0][0], Timestamp at [1][6]
- Use review_id as key for both API and DOM to avoid duplicates
- Prefer API data (original language, full text)
- Expand "More" buttons before sorting and during scroll loop
- Results: 246/247 full text (99.6%), up from 36/247 before
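
The index paths listed above can be read defensively, so that a malformed entry yields None instead of raising. The helper names dig and parse_api_review are hypothetical; only the index paths come from this message, and the shape of each value (for example whether the timestamp is a number or a string) is not specified here.

```python
def dig(node, *path):
    """Follow a chain of list indexes, returning None if any step is missing."""
    for index in path:
        if not isinstance(node, list) or index >= len(node):
            return None
        node = node[index]
    return node


def parse_api_review(entry):
    """Extract review fields using the index paths from the commit message."""
    review_id = dig(entry, 0)
    if not isinstance(review_id, str) or not review_id:
        return None  # without an ID the entry cannot be deduplicated
    return {
        "review_id": review_id,
        "author": dig(entry, 1, 4, 5, 0),
        "rating": dig(entry, 2, 0, 0),
        "text": dig(entry, 2, 15, 0, 0),
        "timestamp": dig(entry, 1, 6),
    }
```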

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:09:40 +00:00
Alejandro Gutiérrez
b4fae38027 Add polling for total count detection on page load
- Poll for up to 5s waiting for span[role="img"][aria-label*="review"]
- Element may not be present immediately after consent handling
- Tested: Soho Club 247/247 reviews in 31.4s with correct total

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:30:17 +00:00
Alejandro Gutiérrez
94240ef2cc Fix total review count detection - use robust selector on Overview tab
- Detect total BEFORE clicking reviews tab (element is on Overview)
- Use span[role="img"][aria-label*="review"] (robust, no class names)
- Extract count from aria-label (e.g., "260 reviews" → 260)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:23:00 +00:00
Alejandro Gutiérrez
10b32244d7 Add delayed separator removal to keep DOM light
- Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards
- Separators removed on next cycle, not immediately (preserves scroll)
- DOM growth reduced by ~50% during long scrapes
- Tested: 2000 reviews in 103s (19.3/s) with all features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:18:50 +00:00
Alejandro Gutiérrez
cbc2e9c617 Robust selectors: Replace CSS class names with data/aria attributes
- Use [data-review-id] + aria-label check for review cards
- Extract author from button[aria-label^="Photo of"]
- Use span[role="img"][aria-label*="star"] for rating
- Pattern matching for timestamp ("X time ago")
- Longest text span heuristic for review text

A/B tested: 100% match with old class-based selectors.
Survives Google's CSS class name changes.
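
The timestamp pattern and the longest-text heuristic could be sketched as below, operating on the text of each span inside a review card. The regular expression is an assumption that covers only English "X time ago" strings, and both function names are hypothetical.

```python
import re

# Assumption: English relative timestamps such as "3 weeks ago" or "a month ago".
_RELATIVE_TIME_RE = re.compile(
    r"^(?:a|an|\d+)\s+(?:second|minute|hour|day|week|month|year)s?\s+ago$",
    re.IGNORECASE,
)


def pick_timestamp(span_texts):
    """Return the first span text that looks like 'X time ago', else None."""
    for text in span_texts:
        if _RELATIVE_TIME_RE.match(text.strip()):
            return text.strip()
    return None


def pick_review_text(span_texts):
    """Return the longest span text that is not a timestamp."""
    candidates = [
        text.strip()
        for text in span_texts
        if not _RELATIVE_TIME_RE.match(text.strip())
    ]
    return max(candidates, key=len, default="")
```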

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:20:51 +00:00
Alejandro Gutiérrez
d989178119 7x faster scraping with JS parsing + batch flushing
Performance improvements:
- JS-based DOM parsing (single browser call vs Selenium round-trips)
- Batch flushing to disk every 500 reviews to free memory
- Hide parsed elements (display:none) to reduce DOM overhead
- Cycle timing instrumentation for debugging slowdowns

Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec)
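
Batch flushing as described above might be sketched like this. BatchWriter is a hypothetical name, and the JSON Lines output format is an assumption; this message only states that reviews are flushed to disk every 500 to free memory.

```python
import json


class BatchWriter:
    """Buffer reviews in memory and append them to disk in batches."""

    def __init__(self, path, batch_size=500):
        self.path = path
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = 0

    def add(self, review):
        self.buffer.append(review)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Append buffered reviews as JSON Lines, then clear the buffer."""
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as handle:
            for review in self.buffer:
                handle.write(json.dumps(review, ensure_ascii=False) + "\n")
        self.flushed += len(self.buffer)
        self.buffer.clear()
```

A caller would need one final flush() after the scroll loop ends so the last partial batch is written.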

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:01:22 +00:00
Alejandro Gutiérrez
0778b2e07d Fix total review count detection - sum star ratings
Previous detection was matching wrong elements (partial counts).
Now sums "X stars, Y reviews" aria-labels for accurate total.

Fallback methods:
1. Sum star rating counts (most accurate)
2. Reviews tab text like "Reviews (247)"
3. Span with "X reviews" text

Tested: Soho Club 247/247 correctly detected
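
A sketch of the first two fallbacks, assuming the aria-labels and tab text have already been read from the page as strings. The function names and exact patterns are assumptions based on the label formats quoted above.

```python
import re

_HISTOGRAM_RE = re.compile(r"(\d)\s+stars?,\s*([\d,.]+)\s+reviews?", re.IGNORECASE)
_TAB_RE = re.compile(r"Reviews\s*\((\d[\d,.]*)\)")


def total_from_histogram(aria_labels):
    """Sum the per-star counts from labels like '5 stars, 180 reviews'."""
    total = 0
    matched = False
    for label in aria_labels:
        match = _HISTOGRAM_RE.search(label)
        if match:
            total += int(re.sub(r"\D", "", match.group(2)))
            matched = True
    return total if matched else None


def detect_total(aria_labels, tab_text=""):
    """Try the histogram sum first, then tab text like 'Reviews (247)'."""
    total = total_from_histogram(aria_labels)
    if total is not None:
        return total
    match = _TAB_RE.search(tab_text)
    if match:
        return int(re.sub(r"\D", "", match.group(1)))
    return None
```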

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:50:06 +00:00
Alejandro Gutiérrez
6934838a69 Real-time parsing + image blocking for large datasets
Key improvements:
- Parse reviews immediately during scroll (not at end)
- Fixes virtual scroll issue - was losing reviews after ~1000
- Block images via CDP for faster loading
- Smart recovery: 4 methods (keys, wheel, scroll up/down, click card)
- Dynamic timeout based on scroll state and content growth
- Spinner + network activity detection resets idle timer
- Sort by newest first option

Results: 1930 reviews (was 990) on 2433-review location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:25:26 +00:00
Alejandro Gutiérrez
6a75159ebe Use immediate element detection with 10ms polling
- Replace fixed waits with tight polling loops
- 10ms sleep between polls (responsive but low CPU)
- Consent, tabs, scroll container all detected immediately
- Total time reduced to ~11-12 seconds
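
A tight polling loop of the kind described above can be sketched as a generic helper. poll_until is a hypothetical name and the 5-second default timeout is an arbitrary assumption; only the 10ms interval comes from this message.

```python
import time


def poll_until(probe, timeout=5.0, interval=0.01):
    """Call probe() every `interval` seconds until it returns a truthy value.

    Returns the truthy value as soon as it appears, or None once `timeout`
    seconds have passed without one.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With Selenium, probe could be a lambda wrapping driver.find_elements(...), which returns an empty (falsy) list while the element is absent.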

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:52:18 +00:00
Alejandro Gutiérrez
4f48fb28cd Optimize wait times for faster scraping
- Reduce initial page load wait: 3s -> 1s
- Reduce consent click wait: 2s -> 0.5s
- Reduce post-consent reload wait: 3s -> 1s
- Reduce tab click wait: 2s -> 0.3s
- Use smart polling for tabs (0.25s intervals, up to 2.5s)
- Use faster scroll container polling (0.25s intervals)
- Remove redundant 2s wait after reviews load

Total execution time reduced from ~22s to ~13s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:49:12 +00:00
Alejandro Gutiérrez
218927bd9b Filter out garbage API data (language codes, metadata)
- Reject authors with <= 3 chars (language codes like "es", "it", "no")
- Reject known non-review authors ("google", "maps", etc.)
- Reject timestamps that are URLs or very short strings
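
The three rejection rules could be sketched as a single predicate. The function name is hypothetical, the blocklist contains only the two names quoted above (the rest of the list is not given here), and the minimum timestamp length is an assumed threshold.

```python
# Only the names quoted in the commit message; the full list is not shown there.
NON_REVIEW_AUTHORS = {"google", "maps"}
MIN_TIMESTAMP_LEN = 5  # assumption: the commit only says "very short strings"


def is_plausible_review(author, timestamp):
    """Return False for API entries that look like metadata, not reviews."""
    name = (author or "").strip()
    if len(name) <= 3:  # language codes such as "es", "it", "no"
        return False
    if name.lower() in NON_REVIEW_AUTHORS:
        return False
    stamp = str(timestamp or "").strip()
    if stamp.lower().startswith(("http://", "https://")):
        return False
    if len(stamp) < MIN_TIMESTAMP_LEN:
        return False
    return True
```

The length rule also rejects genuine three-letter author names, which is a trade-off of the rule as stated.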

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:47:08 +00:00
Alejandro Gutiérrez
0e8a711a9c Fix clean scraper: specific selectors, consent reload, DOM parsing
- Use div.jftiEf[data-review-id] selector to exclude button elements
- Reload original URL after consent (prevents URL corruption)
- Parse full DOM data after scrolling stops
- Deduplicate API reviews by author match
- Remove slow "More" button clicking for speed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:40:15 +00:00
Alejandro Gutiérrez
2c7ba2ae40 Add clean scraper with fixed smooth scrolling
Key improvements:
- Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll
- JavaScript-based review ID collection (doesn't affect scroll position)
- API interception via injected fetch/XHR interceptor
- Total review count extraction from page
- Auto-stop when all reviews collected or timeout reached

The scroll issue was caused by Selenium's find_elements() affecting scroll
position. Using pure JavaScript for data collection keeps scroll pinned to bottom.
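
The 10Hz background scrolling can be sketched with a thread and a stop event. BackgroundScroller is a hypothetical name, and scroll_once stands in for whatever callable performs one scroll step, for example a wrapper around a JavaScript snippet run in the browser.

```python
import threading


class BackgroundScroller:
    """Call scroll_once() at a fixed interval on a background thread."""

    def __init__(self, scroll_once, interval=0.1):
        self._scroll_once = scroll_once
        self._interval = interval  # 0.1s gives roughly 10Hz
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout, True once stop() sets the event.
        while not self._stop.wait(self._interval):
            self._scroll_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```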

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:28:24 +00:00