whyrating-engine-legacy

Author	SHA1	Message	Date
Alejandro Gutiérrez	94240ef2cc	Fix total review count detection - use robust selector on Overview tab - Detect total BEFORE clicking reviews tab (element is on Overview) - Use span[role="img"][aria-label*="review"] (robust, no class names) - Extract count from aria-label (e.g., "260 reviews" → 260) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 12:23:00 +00:00
Alejandro Gutiérrez	10b32244d7	Add delayed separator removal to keep DOM light - Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards - Separators removed on next cycle, not immediately (preserves scroll) - DOM growth reduced by ~50% during long scrapes - Tested: 2000 reviews in 103s (19.3/s) with all features Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 12:18:50 +00:00
Alejandro Gutiérrez	cbc2e9c617	Robust selectors: Replace CSS class names with data/aria attributes - Use [data-review-id] + aria-label check for review cards - Extract author from button[aria-label^="Photo of"] - Use span[role="img"][aria-label*="star"] for rating - Pattern matching for timestamp ("X time ago") - Longest text span heuristic for review text. A/B tested: 100% match with old class-based selectors. Survives Google's CSS class name changes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 10:20:51 +00:00
Alejandro Gutiérrez	d989178119	7x faster scraping with JS parsing + batch flushing. Performance improvements: - JS-based DOM parsing (single browser call vs Selenium round-trips) - Batch flushing to disk every 500 reviews to free memory - Hide parsed elements (display:none) to reduce DOM overhead - Cycle timing instrumentation for debugging slowdowns. Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 10:01:22 +00:00
Alejandro Gutiérrez	0778b2e07d	Fix total review count detection - sum star ratings. Previous detection was matching wrong elements (partial counts). Now sums "X stars, Y reviews" aria-labels for accurate total. Fallback methods: 1. Sum star rating counts (most accurate) 2. Reviews tab text like "Reviews (247)" 3. Span with "X reviews" text. Tested: Soho Club 247/247 correctly detected. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 22:50:06 +00:00
Alejandro Gutiérrez	6934838a69	Real-time parsing + image blocking for large datasets. Key improvements: - Parse reviews immediately during scroll (not at end) - Fixes virtual scroll issue - was losing reviews after ~1000 - Block images via CDP for faster loading - Smart recovery: 4 methods (keys, wheel, scroll up/down, click card) - Dynamic timeout based on scroll state and content growth - Spinner + network activity detection resets idle timer - Sort by newest first option. Results: 1930 reviews (was 990) on 2433-review location. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 22:25:26 +00:00
Alejandro Gutiérrez	6a75159ebe	Use immediate element detection with 10ms polling - Replace fixed waits with tight polling loops - 10ms sleep between polls (responsive but low CPU) - Consent, tabs, scroll container all detected immediately - Total time reduced to ~11-12 seconds Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:52:18 +00:00
Alejandro Gutiérrez	4f48fb28cd	Optimize wait times for faster scraping - Reduce initial page load wait: 3s -> 1s - Reduce consent click wait: 2s -> 0.5s - Reduce post-consent reload wait: 3s -> 1s - Reduce tab click wait: 2s -> 0.3s - Use smart polling for tabs (0.25s intervals, up to 2.5s) - Use faster scroll container polling (0.25s intervals) - Remove redundant 2s wait after reviews load. Total execution time reduced from ~22s to ~13s. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:49:12 +00:00
Alejandro Gutiérrez	218927bd9b	Filter out garbage API data (language codes, metadata) - Reject authors with <= 3 chars (language codes like "es", "it", "no") - Reject known non-review authors ("google", "maps", etc.) - Reject timestamps that are URLs or very short strings Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:47:08 +00:00
Alejandro Gutiérrez	0e8a711a9c	Fix clean scraper: specific selectors, consent reload, DOM parsing - Use div.jftiEf[data-review-id] selector to exclude button elements - Reload original URL after consent (prevents URL corruption) - Parse full DOM data after scrolling stops - Deduplicate API reviews by author match - Remove slow "More" button clicking for speed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:40:15 +00:00
Alejandro Gutiérrez	2c7ba2ae40	Add clean scraper with fixed smooth scrolling. Key improvements: - Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll - JavaScript-based review ID collection (doesn't affect scroll position) - API interception via injected fetch/XHR interceptor - Total review count extraction from page - Auto-stop when all reviews collected or timeout reached. The scroll issue was caused by Selenium's find_elements() affecting scroll position. Using pure JavaScript for data collection keeps scroll pinned to bottom. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:28:24 +00:00
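
Commits 94240ef2cc and 0778b2e07d describe two ways of reading the total review count from aria-label text: taking the number out of a label such as "260 reviews", and summing per-star labels of the form "X stars, Y reviews". The sketch below illustrates only the string handling, not the repository's code. The function names `parse_review_count` and `sum_star_histogram` are hypothetical, and the handling of thousands separators is an assumption, since the commit messages show plain integers only.

```python
import re

# Assumption: counts may carry "," or "." as thousands separators.
_COUNT_RE = re.compile(r"(\d[\d,.]*)\s+reviews?\b", re.IGNORECASE)
_STAR_RE = re.compile(r"\d\s+stars?,\s*(\d[\d,.]*)\s+reviews?\b", re.IGNORECASE)


def _to_int(raw):
    """Strip separators from a matched number and convert it to int."""
    return int(re.sub(r"\D", "", raw))


def parse_review_count(aria_label):
    """Return the count in a label like '260 reviews', or None if absent."""
    match = _COUNT_RE.search(aria_label)
    return _to_int(match.group(1)) if match else None


def sum_star_histogram(aria_labels):
    """Sum the Y values of 'X stars, Y reviews' labels.

    Returns None when no label matches, so a caller can fall back
    to another detection method.
    """
    counts = [
        _to_int(m.group(1))
        for m in (_STAR_RE.search(label) for label in aria_labels)
        if m
    ]
    return sum(counts) if counts else None
```

In a scraper, the labels would come from elements matched by a selector such as `span[role="img"][aria-label*="review"]`, as named in 94240ef2cc; returning `None` rather than 0 keeps "not found" distinct from "no reviews".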

11 Commits
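
Commit 6a75159ebe replaces fixed waits with tight polling loops that sleep 10 ms between polls. A minimal sketch of that pattern follows, under stated assumptions: `poll_until` and `probe` are hypothetical names, and the 2.5 s default timeout is borrowed from the tab-polling limit mentioned in 4f48fb28cd rather than taken from the code itself.

```python
import time


def poll_until(probe, timeout=2.5, interval=0.01):
    """Call probe() repeatedly until it returns a truthy value.

    probe    -- zero-argument callable, e.g. a lambda wrapping an element lookup
    timeout  -- seconds to keep trying before giving up (assumed default)
    interval -- sleep between polls; 0.01 s matches the 10 ms in the commit

    Returns the first truthy result, or None if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With Selenium, a probe could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, selector)`, which returns an empty (falsy) list until the element appears. The call returns as soon as the element exists instead of always paying the full fixed wait.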