whyrating-engine-legacy

Author	SHA1	Message	Date
Alejandro Gutiérrez	2206ddeff2	Initial commit - WhyRating Engine (Google Reviews Scraper)	2026-02-02 18:19:00 +00:00
Alejandro Gutiérrez	d64f06ba9e	feat: Add scraper version routing with v1.1.0 as default - Import both v1.0.0 and v1.1.0 scraper versions - Add SCRAPER_VERSIONS registry mapping version strings to functions - Add get_scraper_for_version() to route based on job metadata - Default to v1.1.0 (multi-sort) for new jobs - Frontend can select specific version via scraper_version parameter - Validation endpoint continues using v1.0.0 for speed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:04:06 +00:00
Alejandro Gutiérrez	39c80fc8be	Phases 5-7: Dashboard UI, Admin API, and Auth middleware Phase 5 - Main Dashboard: - Dashboard overview page with system health stats - Jobs by status breakdown, success rates, top clients - Dashboard API (/api/dashboard/overview, by-client, problems, by-version) Phase 6 - Admin/Scraper Management: - Scrapers management page with traffic allocation UI - Admin API for scraper CRUD operations - Traffic percentage updates for A/B testing - Promote/deprecate scraper versions Phase 7 - Authentication: - API key authentication middleware - SHA-256 key hashing (keys never stored in plain text) - Scope-based authorization (jobs:read, jobs:write, admin) - Rate limiting per API key Also: - Updated api_server_production.py to include new routers - Extended core/database.py with dashboard query methods - Added dashboard link to sidebar navigation - Updated CONTEXT-KEEPER.md to mark all phases complete Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:43:00 +00:00
Alejandro Gutiérrez	788ef84756	Phases 2-4: Requester support, batches, webhooks, scraper registry Phase 2 - Requester & Batch Support: - core/database.py: Added create_job params (requester_, batch_, priority, callback_*) - core/database.py: Added batch methods (create_batch, get_batch, update_batch_progress, get_batches) - core/database.py: Added update_job_callback for tracking webhook delivery - api/routes/batches.py: New endpoints: - POST /api/scrape/google-reviews/batch (submit batch) - GET /api/batches (list batches) - GET /api/batches/{id} (batch detail) - DELETE /api/batches/{id} (cancel batch) - api_server_production.py: Updated /api/scrape with requester, priority, callback fields - api_server_production.py: New primary endpoint POST /api/scrape/google-reviews Phase 3 - Webhooks: - services/job_callback_service.py: New service with: - JobCallbackService: send_job_callback, send_batch_callback, retry_failed_callbacks - JobCallbackDispatcher: Background worker for callback monitoring - Payload formats per spec (job.completed, job.failed, batch.completed) - Exponential backoff for retries - Error classification for failure payloads Phase 4 - Scraper Registry: - scrapers/registry.py: Database-backed version routing: - get_scraper(): Version/variant/A/B routing - _get_weighted_scraper(): Traffic-weighted random selection - 60-second TTL cache for performance - register_scraper, deprecate_scraper, update_traffic_allocation - LegacyScraperRegistry preserved for backwards compatibility Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:35:58 +00:00
Alejandro Gutiérrez	544e028c3f	Phase 0: Project restructure to ReviewIQ platform architecture New structure: - scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py) - scrapers/base.py (BaseScraper interface) - scrapers/registry.py (ScraperRegistry for version routing) - core/database.py, models.py, config.py, enums.py - utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py - workers/chrome_pool.py - services/webhook_service.py - api/ routes structure (empty, ready for Phase 2) - tests/ structure mirroring source All imports updated in: - api_server_production.py (7 import paths updated) - utils/health_checks.py (scraper import path) Legacy modules moved to modules/_legacy/: - data_storage.py, image_handler.py, s3_handler.py (unused) Syntax verified, frontend build passing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 15:22:08 +00:00
Alejandro Gutiérrez	2637d982e0	Wave 4: JobDevTools UI components and crash report API - Task #5: Create JobDevTools container component (tabs: All/Scraper/Browser/Network/System, level filters, count badges) - Task #11: Add crash report API endpoints (GET /jobs/{id}/crash-report, POST /jobs/{id}/retry?apply_fix=true, GET /crashes/stats) - Task #14: Create SessionPanel component (fingerprint display, bot detection indicators, collapsible sections) - Task #15: Create MetricsDashboard with recharts (extraction rate, cumulative reviews, memory usage, scroll progress) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:37:56 +00:00
Alejandro Gutiérrez	f4ca60349e	Wave 3: SSE structured logs, crash analyzer, session fingerprint - Task #3: Update SSE stream to emit structured log events (type: "log" for entries, type: "metrics" every 5s, ?format=legacy for backward compat) - Task #10: Create crash pattern analyzer module (6 patterns: memory_exhaustion, dom_bloat, rate_limited, consent_loop, scroll_timeout, element_stale) (confidence scoring, auto-fix params, summarize_crash_patterns for recurring issues) - Task #13: Capture session fingerprint in backend (user_agent, platform, timezone, webgl, canvas, bot_detection_tests) (saved on success and failure for debugging) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 12:34:17 +00:00
Alejandro Gutiérrez	a540ab97b1	Add browser fingerprint support and analytics metadata display - Transfer user's browser fingerprint (user-agent, viewport, timezone, language, geolocation) to Chrome for more authentic scraping - Display review topics from Google Maps in analytics dashboard - Show business category badge in analytics header - Fix date_text null handling in analytics (handle undefined/timestamp fields) - Add review_topics and business_category to JobStatus interface Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 10:36:06 +00:00
Alejandro Gutiérrez	0682c0ec61	Add get_business_card_info to scraper_clean with multilingual support Replaces fast_scraper validation with efficient polling-based extraction using the same navigation pattern as scrape_reviews: - 10ms polling for consent handling (no fixed waits) - 100ms polling for data extraction - Exits early when data found Supports multiple languages: - Rating: stars/estrellas/étoiles/sterne/stelle - Reviews: reviews/reseñas/avis/bewertungen/recensioni - Handles comma decimals (4,8 -> 4.8) Result: 6.3s to extract name, address, rating, total_reviews Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:52:06 +00:00
Alejandro Gutiérrez	8ccf72a489	Remove old scraper files - consolidate to scraper_clean Production (api_server_production.py) only uses: - modules/scraper_clean.py - main scraping logic - modules/fast_scraper.py - validation helpers - modules/database.py, webhooks.py, health_checks.py, chrome_pool.py Deleted 33 unused Python files including: - Old API server (api_server.py) - 14 start.py experimental scrapers - 7 _scraper.py variants - Old modules: scraper.py, api_interceptor.py, job_manager.py, cli.py - Various debug/test/utility scripts Saves ~11,000 lines of unmaintained code. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:25:00 +00:00
Alejandro Gutiérrez	8b36850838	Switch Docker production API to use scraper_clean - Import fast_scrape_reviews from scraper_clean instead of fast_scraper - Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper - Production now uses clean scraper with hard refresh recovery Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 14:19:40 +00:00
Alejandro Gutiérrez	faa0704737	Optimize scraper performance and add fallback selectors for robustness Performance improvements: - Validation speed: 59.71s → 10.96s (5.5x improvement) - Removed 50+ console.log statements from JavaScript extraction - Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting - Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls) Scraping improvements: - Increased idle detection from 6 to 12 consecutive idle scrolls for completeness - Added real-time progress updates every 5 scrolls with percentage calculation - Added crash recovery to extract partial reviews if Chrome crashes - Removed artificial 200-review limit to scrape ALL reviews Timestamp tracking: - Added updated_at field separate from started_at for progress tracking - Frontend now shows both "Started" (fixed) and "Last Update" (dynamic) Robustness improvements: - Added 5 fallback CSS selectors to handle different Google Maps page structures - Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc. - Automatic selector detection logs which selector works for debugging Test results: - Successfully scraped 550 reviews in 150.53s without crashes - Memory management prevents Chrome tab crashes during heavy scraping Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-18 19:49:24 +00:00
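
Commit 788ef84756 mentions `_get_weighted_scraper()` and "traffic-weighted random selection" for A/B routing between scraper versions. The log does not show that code, so the following is only a minimal sketch of the general technique. The function name `pick_weighted_version`, the `allocations` mapping and the percentages are hypothetical, not taken from `scrapers/registry.py`.

```python
import random
from typing import Dict, Optional


def pick_weighted_version(
    allocations: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one version string, with probability proportional to its traffic share.

    `allocations` maps a version string to a non-negative weight
    (hypothetical shape; the real registry is database-backed).
    """
    rng = rng if rng is not None else random.Random()

    # Versions with zero traffic are never selected.
    active = {version: weight for version, weight in allocations.items() if weight > 0}
    if not active:
        raise ValueError("no scraper version has a positive traffic share")

    versions = list(active)
    weights = [active[version] for version in versions]
    # random.choices performs weighted sampling with replacement; k=1 gives one pick.
    return rng.choices(versions, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Illustrative split only: most traffic to v1.1.0, a small share to v1.0.0.
    print(pick_weighted_version({"1.0.0": 10, "1.1.0": 90}))
```

Passing a seeded `random.Random` makes the selection reproducible in tests, while production callers can omit it.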

12 Commits
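
Commit 39c80fc8be describes API key authentication with "SHA-256 key hashing (keys never stored in plain text)" and scope-based authorization (`jobs:read`, `jobs:write`, `admin`). The sketch below illustrates that idea under stated assumptions. The helper names `hash_api_key`, `verify_api_key` and `has_scope` are hypothetical, and the actual middleware may differ.

```python
import hashlib
import hmac
from typing import Iterable


def hash_api_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest of a key; only this digest would be stored."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def verify_api_key(raw_key: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it with the stored digest.

    hmac.compare_digest compares in constant time, which avoids leaking
    how many leading characters matched.
    """
    return hmac.compare_digest(hash_api_key(raw_key), stored_hash)


def has_scope(granted: Iterable[str], required: str) -> bool:
    """Assumed rule: a request is allowed only if the exact scope was granted."""
    return required in set(granted)


if __name__ == "__main__":
    stored = hash_api_key("example-key")  # placeholder key, not a real credential
    print(verify_api_key("example-key", stored))
    print(has_scope(["jobs:read"], "jobs:write"))
```

Whether `admin` implies the other scopes is not stated in the log, so this sketch treats every scope as independent.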