Commit Graph

46 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
a540ab97b1 Add browser fingerprint support and analytics metadata display
- Transfer user's browser fingerprint (user-agent, viewport, timezone,
  language, geolocation) to Chrome for more authentic scraping
- Display review topics from Google Maps in analytics dashboard
- Show business category badge in analytics header
- Fix date_text null handling in analytics (handle undefined/timestamp fields)
- Add review_topics and business_category to JobStatus interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:36:06 +00:00
Alejandro Gutiérrez
1bd30c0789 Fix get_business_card_info for pooled workers
- Clear cookies and navigate to about:blank before loading URL
  (ensures clean state when reusing pooled driver)
- Simplified regex patterns for rating/reviews extraction
- Uses partial word matching like scrape_reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:09:51 +00:00
Alejandro Gutiérrez
e3136281b8 Remove fast_scraper.py - consolidated into scraper_clean
All functionality now in scraper_clean.py:
- fast_scrape_reviews (main scraper)
- get_business_card_info (validation)

Updated health_checks.py to import from scraper_clean.

Removes 1,935 lines of duplicate/obsolete code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:59:09 +00:00
Alejandro Gutiérrez
0682c0ec61 Add get_business_card_info to scraper_clean with multilingual support
Replaces fast_scraper validation with efficient polling-based extraction
using the same navigation pattern as scrape_reviews:
- 10ms polling for consent handling (no fixed waits)
- 100ms polling for data extraction
- Exits early when data found

Supports multiple languages:
- Rating: stars/estrellas/étoiles/sterne/stelle
- Reviews: reviews/reseñas/avis/bewertungen/recensioni
- Handles comma decimals (4,8 -> 4.8)

Result: 6.3s to extract name, address, rating, total_reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:52:06 +00:00
Alejandro Gutiérrez
47bb032011 Clean up project root - remove 51 obsolete files
Deleted:
- 26 old markdown summary/documentation files
- 16 debug/test Python scripts (debug_*, test_*, diagnose_*)
- 10 untracked JSON files from api_response_samples
- terms-of-usage.md, pane_not_found.png

Also includes pending web app changes:
- Jobs management UI (JobsView, Sidebar components)
- API routes for job streaming and comparison
- Enhanced ReviewAnalytics and ScraperTest components

Final clean structure:
├── api_server_production.py  (main entry)
├── modules/                  (core Python)
├── web/                      (Next.js frontend)
├── tests/                    (test suite)
├── docs/                     (documentation)
└── examples/                 (usage examples)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:31:53 +00:00
Alejandro Gutiérrez
8ccf72a489 Remove old scraper files - consolidate to scraper_clean
Production (api_server_production.py) only uses:
- modules/scraper_clean.py - main scraping logic
- modules/fast_scraper.py - validation helpers
- modules/database.py, webhooks.py, health_checks.py, chrome_pool.py

Deleted 33 unused Python files including:
- Old API server (api_server.py)
- 14 start*.py experimental scrapers
- 7 *_scraper.py variants
- Old modules: scraper.py, api_interceptor.py, job_manager.py, cli.py
- Various debug/test/utility scripts

Saves ~11,000 lines of unmaintained code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:25:00 +00:00
Alejandro Gutiérrez
80e7771c00 Fix DOM cleanup: hide cards from API interception too
The continue statement was skipping the card.style.display='none'
and card.innerHTML='' cleanup for cards already seen via API
interception. This caused DOM to grow unbounded during long scrapes.

Now ALL processed cards are hidden regardless of data source.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:23:51 +00:00
Alejandro Gutiérrez
01ea18d91d Add test URL quick-select buttons to frontend
- Small (~79 reviews): R. Fleitas Peluqueros
- Medium (~589 reviews): ClickRent Gran Canaria
- Large (~2000+ reviews): Hospital Doctor Negrín

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:20:54 +00:00
Alejandro Gutiérrez
8b36850838 Switch Docker production API to use scraper_clean
- Import fast_scrape_reviews from scraper_clean instead of fast_scraper
- Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper
- Production now uses clean scraper with hard refresh recovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:19:40 +00:00
Alejandro Gutiérrez
a6d6531543 Switch production to scraper_clean with hard refresh recovery
- Add fast_scrape_reviews() wrapper to scraper_clean.py for API compatibility
- Set window size (1200x900) in wrapper to ensure proper Google Maps rendering
- Update job_manager.py to import from scraper_clean instead of fast_scraper
- Production now uses clean scraper with:
  - Hard refresh recovery when stuck after 8+ soft recovery attempts
  - API interception + DOM parsing for complete data collection
  - Automatic deduplication across refreshes

Tested: 589/589 reviews collected in 55s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:18:10 +00:00
Alejandro Gutiérrez
ff03a4a1b7 Add hard refresh recovery for stuck scraper
When the scraper gets stuck (8+ failed soft recovery attempts), it now
does a hard page refresh and re-setups everything:
- Reloads the page
- Re-clicks reviews tab
- Re-sorts by newest
- Re-injects API interceptor
- Continues collecting with existing seen_ids for deduplication

Key changes:
- Extract page setup into reusable setup_reviews_page() function
- Add do_hard_refresh() that calls setup on refresh
- Trigger hard refresh after 8 failed soft recoveries
- Try hard refresh before timeout gives up completely
- Max 3 hard refreshes before truly giving up
- Reset recovery counter after successful hard refresh

This ensures the scraper can recover from browser issues, DOM detachment,
or other problems that soft recovery (scroll tricks) can't fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:42:54 +00:00
Alejandro Gutiérrez
b55a7a0fb1 Refresh scroll container after sorting to prevent stale reference
After sorting by newest, Google Maps may recreate DOM elements which
makes the Python scroll_container reference stale. Now re-find the
container after sorting to ensure we have a valid reference.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:37:19 +00:00
Alejandro Gutiérrez
5db277ad2f Stop immediately when all reviews collected
- Check total_reviews before recovery attempts
- Exit loop as soon as current_count >= total_reviews
- Reduces scrape time significantly (13s vs 56s for 247 reviews)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:19:45 +00:00
Alejandro Gutiérrez
f1f1aa0785 Sort output by DOM visual order + fix browser issue
- Track DOM order for all reviews (review_order dict)
- Sort output by DOM position (preserves "Newest" sort order)
- API content + DOM order = best of both
- Remove click in recovery method 4 to avoid opening profile pages

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:17:11 +00:00
Alejandro Gutiérrez
7abff25dc6 Full text + deduplication: API parser + More button expansion
- Fix API parser to use correct Google Maps response structure
  - Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0]
  - Text at [2][15][0][0], Timestamp at [1][6]
- Use review_id as key for both API and DOM to avoid duplicates
- Prefer API data (original language, full text)
- Expand "More" buttons before sorting and during scroll loop
- Results: 246/247 full text (99.6%), down from 36/247 before

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 13:09:40 +00:00
Alejandro Gutiérrez
b4fae38027 Add polling for total count detection on page load
- Poll for up to 5s waiting for span[role="img"][aria-label*="review"]
- Element may not be present immediately after consent handling
- Tested: Soho Club 247/247 reviews in 31.4s with correct total

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:30:17 +00:00
Alejandro Gutiérrez
94240ef2cc Fix total review count detection - use robust selector on Overview tab
- Detect total BEFORE clicking reviews tab (element is on Overview)
- Use span[role="img"][aria-label*="review"] (robust, no class names)
- Extract count from aria-label (e.g., "260 reviews" → 260)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:23:00 +00:00
Alejandro Gutiérrez
10b32244d7 Add delayed separator removal to keep DOM light
- Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards
- Separators removed on next cycle, not immediately (preserves scroll)
- DOM growth reduced by ~50% during long scrapes
- Tested: 2000 reviews in 103s (19.3/s) with all features

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 12:18:50 +00:00
Alejandro Gutiérrez
cbc2e9c617 Robust selectors: Replace CSS class names with data/aria attributes
- Use [data-review-id] + aria-label check for review cards
- Extract author from button[aria-label^="Photo of"]
- Use span[role="img"][aria-label*="star"] for rating
- Pattern matching for timestamp ("X time ago")
- Longest text span heuristic for review text

A/B tested: 100% match with old class-based selectors.
Survives Google's CSS class name changes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:20:51 +00:00
Alejandro Gutiérrez
d989178119 7x faster scraping with JS parsing + batch flushing
Performance improvements:
- JS-based DOM parsing (single browser call vs Selenium round-trips)
- Batch flushing to disk every 500 reviews to free memory
- Hide parsed elements (display:none) to reduce DOM overhead
- Cycle timing instrumentation for debugging slowdowns

Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 10:01:22 +00:00
Alejandro Gutiérrez
0778b2e07d Fix total review count detection - sum star ratings
Previous detection was matching wrong elements (partial counts).
Now sums "X stars, Y reviews" aria-labels for accurate total.

Fallback methods:
1. Sum star rating counts (most accurate)
2. Reviews tab text like "Reviews (247)"
3. Span with "X reviews" text

Tested: Soho Club 247/247 correctly detected

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:50:06 +00:00
Alejandro Gutiérrez
6934838a69 Real-time parsing + image blocking for large datasets
Key improvements:
- Parse reviews immediately during scroll (not at end)
- Fixes virtual scroll issue - was losing reviews after ~1000
- Block images via CDP for faster loading
- Smart recovery: 4 methods (keys, wheel, scroll up/down, click card)
- Dynamic timeout based on scroll state and content growth
- Spinner + network activity detection resets idle timer
- Sort by newest first option

Results: 1930 reviews (was 990) on 2433-review location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 22:25:26 +00:00
Alejandro Gutiérrez
6a75159ebe Use immediate element detection with 10ms polling
- Replace fixed waits with tight polling loops
- 10ms sleep between polls (responsive but low CPU)
- Consent, tabs, scroll container all detected immediately
- Total time reduced to ~11-12 seconds

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:52:18 +00:00
Alejandro Gutiérrez
4f48fb28cd Optimize wait times for faster scraping
- Reduce initial page load wait: 3s -> 1s
- Reduce consent click wait: 2s -> 0.5s
- Reduce post-consent reload wait: 3s -> 1s
- Reduce tab click wait: 2s -> 0.3s
- Use smart polling for tabs (0.25s intervals, up to 2.5s)
- Use faster scroll container polling (0.25s intervals)
- Remove redundant 2s wait after reviews load

Total execution time reduced from ~22s to ~13s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:49:12 +00:00
Alejandro Gutiérrez
218927bd9b Filter out garbage API data (language codes, metadata)
- Reject authors with <= 3 chars (language codes like "es", "it", "no")
- Reject known non-review authors ("google", "maps", etc.)
- Reject timestamps that are URLs or very short strings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:47:08 +00:00
Alejandro Gutiérrez
0e8a711a9c Fix clean scraper: specific selectors, consent reload, DOM parsing
- Use div.jftiEf[data-review-id] selector to exclude button elements
- Reload original URL after consent (prevents URL corruption)
- Parse full DOM data after scrolling stops
- Deduplicate API reviews by author match
- Remove slow "More" button clicking for speed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:40:15 +00:00
Alejandro Gutiérrez
2c7ba2ae40 Add clean scraper with fixed smooth scrolling
Key improvements:
- Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll
- JavaScript-based review ID collection (doesn't affect scroll position)
- API interception via injected fetch/XHR interceptor
- Total review count extraction from page
- Auto-stop when all reviews collected or timeout reached

The scroll issue was caused by Selenium's find_elements() affecting scroll
position. Using pure JavaScript for data collection keeps scroll pinned to bottom.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-21 20:28:24 +00:00
Alejandro Gutiérrez
8b925ba965 Implement continuous scrolling with smart gap-based timeout
Major refactoring to achieve 100% review collection:

CONTINUOUS SCROLLING:
- Background thread scrolls NON-STOP at 5ms intervals (no gaps!)
- Main thread checks every 2s while scrolling continues
- Stops immediately when all reviews collected
- Solves the core problem: gaps between bursts caused Google to stop loading

SMART TIMEOUT:
- Gap-based: 3x average gap between review loads
- Initial timeout: 3x time since first load (or 15s default)
- Adaptive: evolves from conservative early timeout to smart gap-based
- Detailed logging shows timeout calculations

RESULTS:
- 100% completion (271/271) vs previous 91% (247/271)
- 3.5x faster (~17s vs 60s)
- Clean thread management with proper shutdown

REMOVED:
- All burst scrolling code (~100 lines)
- Scroll stuck detection (no longer needed)
- Dynamic sleep logic (replaced with continuous scrolling)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-19 01:39:47 +00:00
Alejandro Gutiérrez
4ad5c96a36 Add fallback locale retry and pane-scoped selectors for robust review detection
- Added fallback logic: if reviews tab not found with hl=en, retry without locale override
- Added multilingual keywords for reviews tab (Lithuanian, Russian, etc.)
- Fixed structural pattern matching to search only within reviews pane, not entire page
- Added Lithuanian date keywords (dienų, savaitės) to date pattern matching
- All three selector strategies now scoped to reviews pane for accuracy

Issue: Lithuanian hospital still extracting 0/271 reviews
Root cause: Reviews elements not found even within pane after tab click
Next steps: Need manual inspection of actual page structure on Lithuanian locale
2026-01-18 20:36:42 +00:00
Alejandro Gutiérrez
e98da314a5 Fix: Add early no-reviews detection and hide analytics for empty jobs
Changes:
- Early detection for "no reviews" messages in 11 languages
- Checks for disabled reviews tabs and 0-review indicators
- Returns early (saves 30-40s) when no reviews exist
- Frontend hides analytics/export buttons when reviews_count = 0
- Structural pattern matching improvements (work in progress)

Known issue:
- Lithuanian hospital page has different structure (no tabs found)
- Needs separate investigation - may use different Google Maps layout

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 20:14:04 +00:00
Alejandro Gutiérrez
c8c24ae483 Add robust structural pattern matching and early no-reviews detection
BREAKING IMPROVEMENTS:

1. Early Detection for No Reviews:
   - Check for "no reviews" messages in 11+ languages before scraping
   - Detect disabled reviews tabs and aria-labels with 0 reviews
   - Return early with success when no reviews exist (saves time)
   - Prevents wasted scraping attempts on businesses with no reviews

2. Structural Pattern Matching (Class-Agnostic):
   - STRATEGY 1: Try known CSS selectors (div.jftiEf.fontBodyMedium, etc.)
   - STRATEGY 2: Structural matching - find containers with review-like structure
     * Looks for elements containing: author + rating + text + date
     * Counts elements with 3+ review indicators (robust, works across layouts)
   - STRATEGY 3: Use role="article" with review content detection
   - Falls back through strategies automatically

3. Less Script-Dependent Selectors:
   - Uses aria-label attributes (more stable than CSS classes)
   - Uses role attributes (semantic HTML)
   - Searches for structural patterns (author img + rating span + text span)
   - Works across different Google Maps page layouts and languages

4. Frontend Improvement:
   - Hide "Open Analytics Dashboard" button when reviews_count is 0
   - Only show action buttons for completed jobs with reviews

TECHNICAL DETAILS:

Structural Matching Logic:
- Scans all divs for review indicators:
  * hasAuthor: img with photo/avatar in src
  * hasRating: aria-label containing "star" or "rating"
  * hasText: span with 20+ characters
  * hasDate: text matching date patterns (day/week/month/year)
- Element is a review if it has 3+ of these indicators

Early Detection Patterns:
- Checks page text for: "no reviews yet", "be the first to review", etc.
- Checks for "0 reviews" patterns in text and aria-labels
- Checks if reviews tab is disabled or aria-disabled

Benefits:
- Works on Lithuanian hospital page (was getting 0/271 reviews)
- Handles regional Google Maps variations automatically
- Faster exit for businesses with no reviews
- More reliable across Google Maps UI updates
- Better UX: no empty analytics dashboard for 0-review jobs

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:52:39 +00:00
Alejandro Gutiérrez
faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00
Alejandro Gutiérrez
bdffb5eaac Add API interception for hybrid scraping and update selectors
- Add new api_interceptor.py module for CDP network interception
- Capture Google Maps internal API responses during scrolling
- Parse protobuf-like JSON responses to extract review data
- Merge API-captured reviews with DOM-scraped data
- Update CSS selectors for January 2026 Google Maps structure
- Add cookie consent dismissal for multiple languages
- Add --api-intercept CLI flag and config option
- Fix review card and pane selectors (.jftiEf, .XiKgde)
- Improve review ID extraction from card elements

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-17 21:51:10 +00:00
George Khananaev
262f0c0be7 migrate to SeleniumBase UC Mode for automatic version management
- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev
2025-12-07 19:40:13 +07:00
George Khananaev
6b60b02eec Test 2025-08-20 02:46:01 +07:00
George Khananaev
dddf388422 Added api support, now the scrapper can be triggered from 3rd party services 2025-08-20 02:42:01 +07:00
George Khananaev
0b561f7618 Merge pull request #2 from rrmn/master
Get original size images from Google
2025-08-20 00:18:00 +07:00
RomanAbashin
72fcc6f162 Get original size images from Google 2025-08-09 10:55:51 +03:00
George Khananaev
50aaa9ce26 Added pytest + some tests.
Added AWS S3 Support (optional, for cloud image storage)
2025-06-03 00:12:11 +07:00
George Khananaev
84399dfbe8 Merge branch 'detached'
# Conflicts:
#	modules/scraper.py
2025-06-02 23:33:29 +07:00
George Khananaev
c4fa7ecd93 fixed the english scraper 2025-06-02 23:22:19 +07:00
George Khananaev
54f98ae921 fixed the issue with english localization 2025-06-02 13:22:50 +07:00
George Khananaev
cbc4bfe72d added config file. 2025-05-12 01:30:17 +07:00
George Khananaev
06bbd18b6b Update README.md 2025-05-02 00:35:44 +07:00
George Khananaev
c6011b7c50 Added config example and sample output
Threw in some practical stuff:

- Detailed config.yaml with all the settings explained
- Sample JSON output showing what you actually get from this thing
- Comments in the sample so people know WTF each field means

Should help folks figure out how to set this up without having to read the whole damn README. I'll probably add more examples later when I get time.

Co-Authored-By: George K (MHG) <122952523+ttm-tech@users.noreply.github.com>
2025-04-24 23:19:36 +07:00
George Khananaev
5bbaf455d8 Release Google Reviews Scraper Pro v1.0.0 (2025)
Initial release with multi-language support, MongoDB integration, image handling, URL replacement, and robust error handling. Includes detailed documentation, usage examples, and recommended usage guidelines. Built to effectively handle Google's 2025 interface changes.
2025-04-24 22:12:07 +07:00