whyrating-engine-legacy

Author	SHA1	Message	Date
Alejandro Gutiérrez	12d37e350b	Fix JobDevTools contrast + log normalization, add Platform Spec - Fix contrast issues in JobDevTools (level badges, text colors, timestamps) - Make log normalization more robust (handles old/new formats, edge cases) - Add ReviewIQ Platform Spec v1.2 defining: - Multi-tenant scraping-as-a-service architecture - Requester metadata, batches, webhooks, priority - Scraper versioning with A/B testing (stable/beta/canary) - API endpoints for job types, dashboard, admin - Output schemas for external service integration - Project structure reorganization plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:13:19 +00:00
Alejandro Gutiérrez	1e5401a9d1	Fix: Handle undefined rating_snapshot in job detail page	2026-01-24 13:15:14 +00:00
Alejandro Gutiérrez	eab0b4a7e9	Fix: Maximum update depth exceeded in NewScrapePage Wrap handleJobsChange in useCallback to prevent infinite re-renders caused by onJobsChange dependency changing on every render. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 13:14:23 +00:00
Alejandro Gutiérrez	cd9639f3b1	Wave 7: Integrate JobDevTools into job detail page (FINAL) - Task #18: Complete integration of all JobDevTools components - Updated job detail page (/jobs/[id]) with full JobDevTools UI - Connected SSE stream for real-time structured logs + metrics - Added crash-report and retry API routes for Next.js - Added format conversion for old/new log formats - Added DevTools links to JobsView modal and actions column - Wired up CrashReport retry with auto-fix parameters - Integrated SessionPanel for fingerprint display - Integrated MetricsDashboard for real-time charts Job DevTools implementation complete: 18/18 tasks Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 13:11:19 +00:00
Alejandro Gutiérrez	f99827717f	Final polish: v3.1.2 operational safety constraints - Add chk_dedup_scoped constraint enforcing tenant-scoped dedup format - Filter location_type='owned' in populate_facts() for 'ALL' rollup - Document competitor exclusion from 'ALL' sentinel rollups - Add explicit comments in aggregation code for maintainability Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:55:31 +00:00
Alejandro Gutiérrez	c6443166b2	Wave 6: CopyToolbar utilities and LogEntry row component - Task #7: Create CopyToolbar and copy utilities (copy-utils.ts with text/JSON/CSV formatting, clipboard API with fallback) (CopyToolbar with copy all/selected, format dropdown, download export) - Task #8: Create LogEntry row component (click-to-copy with visual feedback, expandable metrics view) (level/category badges, search highlighting, shift+click selection) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:51:48 +00:00
Alejandro Gutiérrez	3987a9ab4e	Document v3.1.2 conventions: dedup scoping and sentinel values Two micro-risk mitigations documented: 1. dedup_group_id: Format "{business_id}:{hash}" to prevent cross-tenant collision on similar reviews. 2. Sentinel conventions: 'ALL' (spatial) vs 'all' (semantic). Case matters — do not normalize. Spec frozen as v3.1.2. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:50:29 +00:00
Alejandro Gutiérrez	5ce3248efd	Wave 5: LogViewer virtualized list and CrashReport component - Task #6: Create LogViewer with react-window virtualization (search with highlighting, auto-scroll toggle, timestamp format toggle) (shift+click range selection, level/category color badges) - Task #12: Create CrashReport frontend component (crash timeline SVG, pattern analysis with confidence bar) (auto-fix params display, retry API integration) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:44:35 +00:00
Alejandro Gutiérrez	2637d982e0	Wave 4: JobDevTools UI components and crash report API - Task #5: Create JobDevTools container component (tabs: All/Scraper/Browser/Network/System, level filters, count badges) - Task #11: Add crash report API endpoints (GET /jobs/{id}/crash-report, POST /jobs/{id}/retry?apply_fix=true, GET /crashes/stats) - Task #14: Create SessionPanel component (fingerprint display, bot detection indicators, collapsible sections) - Task #15: Create MetricsDashboard with recharts (extraction rate, cumulative reviews, memory usage, scroll progress) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:37:56 +00:00
Alejandro Gutiérrez	9515dd2d42	Polish ReviewIQ v3.1.2: tenant-scoping and FK integrity Final fixes for production-ready spec: 1. locations.location_type: Added 'owned'|'competitor' flag. Competitors now inserted into locations (preserves FK integrity). 2. Competitor fact query: Added business_id filter to prevent cross-tenant contamination when same competitor tracked by multiple customers. 3. issue_events versioning: Added source + review_version columns for complete review reference in audit log. 4. Enrichment tenant-scoping: business_id now passed from ingest job (not looked up). Validates place_id exists under tenant. 5. Footer: Fixed version string v3.1.1 → v3.1.2. Status: Ship-ready specification. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:34:35 +00:00
Alejandro Gutiérrez	f4ca60349e	Wave 3: SSE structured logs, crash analyzer, session fingerprint - Task #3: Update SSE stream to emit structured log events (type: "log" for entries, type: "metrics" every 5s, ?format=legacy for backward compat) - Task #10: Create crash pattern analyzer module (6 patterns: memory_exhaustion, dom_bloat, rate_limited, consent_loop, scroll_timeout, element_stale) (confidence scoring, auto-fix params, summarize_crash_patterns for recurring issues) - Task #13: Capture session fingerprint in backend (user_agent, platform, timezone, webgl, canvas, bot_detection_tests) (saved on success and failure for debugging) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:34:17 +00:00
Alejandro Gutiérrez	44d017b3f7	Finalize ReviewIQ Architecture v3.1.2 (production-ready) Three final fixes applied: 1. issue_spans versioning: Added source + review_version columns with FK to reviews_enriched(source, review_id, review_version). Spans now correctly reference the exact review version. 2. Competitor business_id rule: Clarified that competitor reviews use customer's business_id + competitor's place_id (not NULL). Keeps facts and joins working without special-case logic. 3. Trust-weighted facts: Clarified trust_weighted_* columns are reserved but not populated in v3.1. Trust scoring applies to issue priority only. Aggregation deferred to v3.2. Status: Production-grade architecture specification. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:31:16 +00:00
Alejandro Gutiérrez	d43c574b0c	Add ReviewIQ Architecture v3.1.1 specification Complete pipeline architecture for Google Reviews intelligence: - Versioned reviews_enriched with (source, review_id, version) PK - Tenant-scoped locations with (business_id, place_id) PK - Relational issue_spans replacing array aggregation - Unified fact_timeseries spine with 'ALL' sentinel for rollups - Clean competitor model (separate table, no fake business_ids) - Trust scoring and dedup support - KPI-ready join keys Reviewed and fixed: PK for edited reviews, multi-tenant overlap, param ordering bugs, fact population scope, entity field deferral. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:25:46 +00:00
Alejandro Gutiérrez	9e1bcde981	Wave 2: Migrate scraper to StructuredLogger, add crash detection & topic tags - Task #2: Migrate scraper_clean.py to use StructuredLogger with categories (37 log calls with metrics across browser/scraper/network/system) - Task #4: Add crash_reports table schema and database methods (save_crash_report, get_crash_report, get_crash_stats) - Task #9: Implement crash detection wrapper with metrics sampling (get_chrome_memory, get_dom_node_count, classify_crash) - Task #17: Add topic tags to frontend ReviewAnalytics (topic filter UI, tags on cards, topics in modal) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:17:23 +00:00
Alejandro Gutiérrez	313e32f358	Wave 1: Add StructuredLogger and review topics inference Task #1: StructuredLogger class (modules/structured_logger.py) - LogEntry dataclass with timestamp, level, category, metrics, network - Thread-safe storage with automatic pruning at 10k entries - Level methods: debug(), info(), warn(), error(), fatal() - Backward-compatible log() method for migration - Filter methods: get_logs_by_category(), get_logs_by_level() Task #16: Review topics inference (modules/scraper_clean.py) - get_topic_variants(): Generate word variants (plural, -ing, -ed forms) - infer_review_topics(): Match review text to topic keywords - Word boundary matching to avoid false positives - Integrated into scrape_reviews() to add 'topics' field to reviews Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 11:27:32 +00:00
Alejandro Gutiérrez	3da243be79	Add ReviewIQ pipeline spec and metadata extraction test - reviewiq-pipeline-v1-final.md: Earlier pipeline specification - test_metadata_extraction.py: Test script for metadata extraction Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 11:21:33 +00:00
Alejandro Gutiérrez	59368a5bd5	Add Job DevTools implementation task breakdown 18 tasks organized in 5 parallel tracks: - Track A: Backend logging infrastructure (4 tasks) - Track B: Frontend log viewer (5 tasks) - Track C: Crash analysis (4 tasks) - Track D: Session & metrics (3 tasks) - Track E: Review topics (2 tasks) Includes dependency graph and 7-wave execution plan for parallel AI agent workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 11:14:02 +00:00
Alejandro Gutiérrez	65fcaf43e8	Add Job DevTools specification document Comprehensive spec for observability suite including: - Structured logging system with categories - Crash intelligence and pattern analysis - Copy/export functionality - Session fingerprint panel - Real-time metrics dashboard - Review topics inference Organized by priority (P0-P3) with parallel implementation tracks. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 11:10:34 +00:00
Alejandro Gutiérrez	b1296059a9	Add URL-based routing with sidebar navigation Replace client-side state switching with proper Next.js routes: - /new - New scrape form - /jobs - Jobs list with table view - /jobs/[id] - Individual job details and logs - /analytics - Analytics overview (completed jobs) - /analytics/[id] - Analytics for specific job Add JobsContext for shared state across routes. Update Sidebar to use next/link with pathname matching. Root page redirects to /new. Also adds partial job status styling to JobsView. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 10:58:48 +00:00
Alejandro Gutiérrez	3eda9bdbfa	Add complete URT v5.1 taxonomy framework (11 artifacts) Universal Review Taxonomy v5.1 implementation with: - Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual - Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract - Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide - Track D (Integration): D1 Dashboard Specification Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 10:51:41 +00:00
Alejandro Gutiérrez	a540ab97b1	Add browser fingerprint support and analytics metadata display - Transfer user's browser fingerprint (user-agent, viewport, timezone, language, geolocation) to Chrome for more authentic scraping - Display review topics from Google Maps in analytics dashboard - Show business category badge in analytics header - Fix date_text null handling in analytics (handle undefined/timestamp fields) - Add review_topics and business_category to JobStatus interface Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 10:36:06 +00:00
Alejandro Gutiérrez	1bd30c0789	Fix get_business_card_info for pooled workers - Clear cookies and navigate to about:blank before loading URL (ensures clean state when reusing pooled driver) - Simplified regex patterns for rating/reviews extraction - Uses partial word matching like scrape_reviews Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 18:09:51 +00:00
Alejandro Gutiérrez	e3136281b8	Remove fast_scraper.py - consolidated into scraper_clean All functionality now in scraper_clean.py: - fast_scrape_reviews (main scraper) - get_business_card_info (validation) Updated health_checks.py to import from scraper_clean. Removes 1,935 lines of duplicate/obsolete code. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:59:09 +00:00
Alejandro Gutiérrez	0682c0ec61	Add get_business_card_info to scraper_clean with multilingual support Replaces fast_scraper validation with efficient polling-based extraction using the same navigation pattern as scrape_reviews: - 10ms polling for consent handling (no fixed waits) - 100ms polling for data extraction - Exits early when data found Supports multiple languages: - Rating: stars/estrellas/étoiles/sterne/stelle - Reviews: reviews/reseñas/avis/bewertungen/recensioni - Handles comma decimals (4,8 -> 4.8) Result: 6.3s to extract name, address, rating, total_reviews Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:52:06 +00:00
Alejandro Gutiérrez	47bb032011	Clean up project root - remove 51 obsolete files Deleted: - 26 old markdown summary/documentation files - 16 debug/test Python scripts (debug_*, test_*, diagnose_*) - 10 untracked JSON files from api_response_samples - terms-of-usage.md, pane_not_found.png Also includes pending web app changes: - Jobs management UI (JobsView, Sidebar components) - API routes for job streaming and comparison - Enhanced ReviewAnalytics and ScraperTest components Final clean structure: ├── api_server_production.py (main entry) ├── modules/ (core Python) ├── web/ (Next.js frontend) ├── tests/ (test suite) ├── docs/ (documentation) └── examples/ (usage examples) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:31:53 +00:00
Alejandro Gutiérrez	8ccf72a489	Remove old scraper files - consolidate to scraper_clean Production (api_server_production.py) only uses: - modules/scraper_clean.py - main scraping logic - modules/fast_scraper.py - validation helpers - modules/database.py, webhooks.py, health_checks.py, chrome_pool.py Deleted 33 unused Python files including: - Old API server (api_server.py) - 14 start*.py experimental scrapers - 7 *_scraper.py variants - Old modules: scraper.py, api_interceptor.py, job_manager.py, cli.py - Various debug/test/utility scripts Saves ~11,000 lines of unmaintained code. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:25:00 +00:00
Alejandro Gutiérrez	80e7771c00	Fix DOM cleanup: hide cards from API interception too The continue statement was skipping the card.style.display='none' and card.innerHTML='' cleanup for cards already seen via API interception. This caused DOM to grow unbounded during long scrapes. Now ALL processed cards are hidden regardless of data source. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:23:51 +00:00
Alejandro Gutiérrez	01ea18d91d	Add test URL quick-select buttons to frontend - Small (~79 reviews): R. Fleitas Peluqueros - Medium (~589 reviews): ClickRent Gran Canaria - Large (~2000+ reviews): Hospital Doctor Negrín Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 14:20:54 +00:00
Alejandro Gutiérrez	8b36850838	Switch Docker production API to use scraper_clean - Import fast_scrape_reviews from scraper_clean instead of fast_scraper - Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper - Production now uses clean scraper with hard refresh recovery Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 14:19:40 +00:00
Alejandro Gutiérrez	a6d6531543	Switch production to scraper_clean with hard refresh recovery - Add fast_scrape_reviews() wrapper to scraper_clean.py for API compatibility - Set window size (1200x900) in wrapper to ensure proper Google Maps rendering - Update job_manager.py to import from scraper_clean instead of fast_scraper - Production now uses clean scraper with: - Hard refresh recovery when stuck after 8+ soft recovery attempts - API interception + DOM parsing for complete data collection - Automatic deduplication across refreshes Tested: 589/589 reviews collected in 55s Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 14:18:10 +00:00
Alejandro Gutiérrez	ff03a4a1b7	Add hard refresh recovery for stuck scraper When the scraper gets stuck (8+ failed soft recovery attempts), it now does a hard page refresh and re-setups everything: - Reloads the page - Re-clicks reviews tab - Re-sorts by newest - Re-injects API interceptor - Continues collecting with existing seen_ids for deduplication Key changes: - Extract page setup into reusable setup_reviews_page() function - Add do_hard_refresh() that calls setup on refresh - Trigger hard refresh after 8 failed soft recoveries - Try hard refresh before timeout gives up completely - Max 3 hard refreshes before truly giving up - Reset recovery counter after successful hard refresh This ensures the scraper can recover from browser issues, DOM detachment, or other problems that soft recovery (scroll tricks) can't fix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 13:42:54 +00:00
Alejandro Gutiérrez	b55a7a0fb1	Refresh scroll container after sorting to prevent stale reference After sorting by newest, Google Maps may recreate DOM elements which makes the Python scroll_container reference stale. Now re-find the container after sorting to ensure we have a valid reference. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 13:37:19 +00:00
Alejandro Gutiérrez	5db277ad2f	Stop immediately when all reviews collected - Check total_reviews before recovery attempts - Exit loop as soon as current_count >= total_reviews - Reduces scrape time significantly (13s vs 56s for 247 reviews) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 13:19:45 +00:00
Alejandro Gutiérrez	f1f1aa0785	Sort output by DOM visual order + fix browser issue - Track DOM order for all reviews (review_order dict) - Sort output by DOM position (preserves "Newest" sort order) - API content + DOM order = best of both - Remove click in recovery method 4 to avoid opening profile pages Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 13:17:11 +00:00
Alejandro Gutiérrez	7abff25dc6	Full text + deduplication: API parser + More button expansion - Fix API parser to use correct Google Maps response structure - Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0] - Text at [2][15][0][0], Timestamp at [1][6] - Use review_id as key for both API and DOM to avoid duplicates - Prefer API data (original language, full text) - Expand "More" buttons before sorting and during scroll loop - Results: 246/247 full text (99.6%), down from 36/247 before Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 13:09:40 +00:00
Alejandro Gutiérrez	b4fae38027	Add polling for total count detection on page load - Poll for up to 5s waiting for span[role="img"][aria-label*="review"] - Element may not be present immediately after consent handling - Tested: Soho Club 247/247 reviews in 31.4s with correct total Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 12:30:17 +00:00
Alejandro Gutiérrez	94240ef2cc	Fix total review count detection - use robust selector on Overview tab - Detect total BEFORE clicking reviews tab (element is on Overview) - Use span[role="img"][aria-label*="review"] (robust, no class names) - Extract count from aria-label (e.g., "260 reviews" → 260) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 12:23:00 +00:00
Alejandro Gutiérrez	10b32244d7	Add delayed separator removal to keep DOM light - Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards - Separators removed on next cycle, not immediately (preserves scroll) - DOM growth reduced by ~50% during long scrapes - Tested: 2000 reviews in 103s (19.3/s) with all features Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 12:18:50 +00:00
Alejandro Gutiérrez	cbc2e9c617	Robust selectors: Replace CSS class names with data/aria attributes - Use [data-review-id] + aria-label check for review cards - Extract author from button[aria-label^="Photo of"] - Use span[role="img"][aria-label*="star"] for rating - Pattern matching for timestamp ("X time ago") - Longest text span heuristic for review text A/B tested: 100% match with old class-based selectors. Survives Google's CSS class name changes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 10:20:51 +00:00
Alejandro Gutiérrez	d989178119	7x faster scraping with JS parsing + batch flushing Performance improvements: - JS-based DOM parsing (single browser call vs Selenium round-trips) - Batch flushing to disk every 500 reviews to free memory - Hide parsed elements (display:none) to reduce DOM overhead - Cycle timing instrumentation for debugging slowdowns Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 10:01:22 +00:00
Alejandro Gutiérrez	0778b2e07d	Fix total review count detection - sum star ratings Previous detection was matching wrong elements (partial counts). Now sums "X stars, Y reviews" aria-labels for accurate total. Fallback methods: 1. Sum star rating counts (most accurate) 2. Reviews tab text like "Reviews (247)" 3. Span with "X reviews" text Tested: Soho Club 247/247 correctly detected Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 22:50:06 +00:00
Alejandro Gutiérrez	6934838a69	Real-time parsing + image blocking for large datasets Key improvements: - Parse reviews immediately during scroll (not at end) - Fixes virtual scroll issue - was losing reviews after ~1000 - Block images via CDP for faster loading - Smart recovery: 4 methods (keys, wheel, scroll up/down, click card) - Dynamic timeout based on scroll state and content growth - Spinner + network activity detection resets idle timer - Sort by newest first option Results: 1930 reviews (was 990) on 2433-review location Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 22:25:26 +00:00
Alejandro Gutiérrez	6a75159ebe	Use immediate element detection with 10ms polling - Replace fixed waits with tight polling loops - 10ms sleep between polls (responsive but low CPU) - Consent, tabs, scroll container all detected immediately - Total time reduced to ~11-12 seconds Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:52:18 +00:00
Alejandro Gutiérrez	4f48fb28cd	Optimize wait times for faster scraping - Reduce initial page load wait: 3s -> 1s - Reduce consent click wait: 2s -> 0.5s - Reduce post-consent reload wait: 3s -> 1s - Reduce tab click wait: 2s -> 0.3s - Use smart polling for tabs (0.25s intervals, up to 2.5s) - Use faster scroll container polling (0.25s intervals) - Remove redundant 2s wait after reviews load Total execution time reduced from ~22s to ~13s Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:49:12 +00:00
Alejandro Gutiérrez	218927bd9b	Filter out garbage API data (language codes, metadata) - Reject authors with <= 3 chars (language codes like "es", "it", "no") - Reject known non-review authors ("google", "maps", etc.) - Reject timestamps that are URLs or very short strings Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:47:08 +00:00
Alejandro Gutiérrez	0e8a711a9c	Fix clean scraper: specific selectors, consent reload, DOM parsing - Use div.jftiEf[data-review-id] selector to exclude button elements - Reload original URL after consent (prevents URL corruption) - Parse full DOM data after scrolling stops - Deduplicate API reviews by author match - Remove slow "More" button clicking for speed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:40:15 +00:00
Alejandro Gutiérrez	2c7ba2ae40	Add clean scraper with fixed smooth scrolling Key improvements: - Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll - JavaScript-based review ID collection (doesn't affect scroll position) - API interception via injected fetch/XHR interceptor - Total review count extraction from page - Auto-stop when all reviews collected or timeout reached The scroll issue was caused by Selenium's find_elements() affecting scroll position. Using pure JavaScript for data collection keeps scroll pinned to bottom. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-21 20:28:24 +00:00
Alejandro Gutiérrez	8b925ba965	Implement continuous scrolling with smart gap-based timeout Major refactoring to achieve 100% review collection: CONTINUOUS SCROLLING: - Background thread scrolls NON-STOP at 5ms intervals (no gaps!) - Main thread checks every 2s while scrolling continues - Stops immediately when all reviews collected - Solves the core problem: gaps between bursts caused Google to stop loading SMART TIMEOUT: - Gap-based: 3x average gap between review loads - Initial timeout: 3x time since first load (or 15s default) - Adaptive: evolves from conservative early timeout to smart gap-based - Detailed logging shows timeout calculations RESULTS: - 100% completion (271/271) vs previous 91% (247/271) - 3.5x faster (~17s vs 60s) - Clean thread management with proper shutdown REMOVED: - All burst scrolling code (~100 lines) - Scroll stuck detection (no longer needed) - Dynamic sleep logic (replaced with continuous scrolling) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-19 01:39:47 +00:00
Alejandro Gutiérrez	4ad5c96a36	Add fallback locale retry and pane-scoped selectors for robust review detection - Added fallback logic: if reviews tab not found with hl=en, retry without locale override - Added multilingual keywords for reviews tab (Lithuanian, Russian, etc.) - Fixed structural pattern matching to search only within reviews pane, not entire page - Added Lithuanian date keywords (dienų, savaitės) to date pattern matching - All three selector strategies now scoped to reviews pane for accuracy Issue: Lithuanian hospital still extracting 0/271 reviews Root cause: Reviews elements not found even within pane after tab click Next steps: Need manual inspection of actual page structure on Lithuanian locale	2026-01-18 20:36:42 +00:00
Alejandro Gutiérrez	e98da314a5	Fix: Add early no-reviews detection and hide analytics for empty jobs Changes: - Early detection for "no reviews" messages in 11 languages - Checks for disabled reviews tabs and 0-review indicators - Returns early (saves 30-40s) when no reviews exist - Frontend hides analytics/export buttons when reviews_count = 0 - Structural pattern matching improvements (work in progress) Known issue: - Lithuanian hospital page has different structure (no tabs found) - Needs separate investigation - may use different Google Maps layout Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-18 20:14:04 +00:00
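
Note on commit 8b925ba965 (smart gap-based timeout): the rule described in that message can be sketched roughly as below. This is an illustrative reading of the commit text, not the code from the repository. The function name `smart_timeout`, its parameters, and the constant names are hypothetical; only the 3x multiplier and the 15s default come from the message. "3x time since first load" is interpreted here as three times the delay between the start of scrolling and the first batch of reviews, which is an assumption.

```python
from typing import Sequence

DEFAULT_TIMEOUT_S = 15.0  # "15s default" from the commit message
GAP_MULTIPLIER = 3.0      # "3x" factor from the commit message


def smart_timeout(load_times: Sequence[float], start_time: float) -> float:
    """Return how many seconds to wait for the next batch of reviews.

    load_times: timestamps (seconds) at which new reviews arrived, ascending.
    start_time: timestamp at which scrolling began.
    Both names are hypothetical; the real scraper may track this differently.
    """
    if not load_times:
        # Nothing has loaded yet: use the conservative default.
        return DEFAULT_TIMEOUT_S
    if len(load_times) == 1:
        # Only one load observed: 3x the time it took to arrive (assumed reading).
        return GAP_MULTIPLIER * (load_times[0] - start_time)
    # Two or more loads: 3x the average gap between consecutive loads.
    gaps = [later - earlier for earlier, later in zip(load_times, load_times[1:])]
    return GAP_MULTIPLIER * (sum(gaps) / len(gaps))
```

Under this reading the timeout starts conservative and tightens as more loads are observed, matching the "adaptive" behaviour the message describes.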


66 Commits