whyrating-engine-legacy

Author	SHA1	Message	Date
Alejandro Gutiérrez	fbd61ff7f7	feat: Add multi-sort scraper v1.1.0 and improve v1.0.0 reliability v1.0.0 improvements: - Add captcha detection (reCAPTCHA, unusual traffic, challenges) - Block fonts, analytics, maps tiles for faster scrolling - Add 95% close-enough threshold to skip unnecessary retries - Stop immediately if captcha detected instead of retrying v1.1.0 new features: - Multi-sort strategy to bypass ~1000 review limit - Cycles through newest/lowest/highest/relevant sorts - Auto mode: enables multi-sort when total > 1000 - Diminishing returns detection (stops if <5% new per pass) - Configurable sort order and thresholds Also adds test_scraper_v110.py CLI tool for testing multi-sort. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:47:30 +00:00
Alejandro Gutiérrez	e2d7f6f118	feat: Add ScraperV1Adapter and real data pipeline test - Add ScraperV1Adapter to transform scraped reviews into pipeline format - Handles relative timestamps (centerDate) - Generates deterministic IDs for DOM-sourced reviews - Filters out empty (rating-only) reviews - Add sample barbershop reviews (79 reviews, 46 with text) - Real data from Las Palmas barbershop - Multi-language: Spanish, English, German, Norwegian, Italian - Add test_pipeline_real_data.py for E2E testing with real data - Uses mock classifier based on keywords and rating - Full pipeline flow: raw -> enriched -> spans -> issues -> facts Test results with real data: - 46 reviews processed - 6 languages detected (es: 35, en: 7, de: 1, no: 1, it: 1, ca: 1) - 3 issues identified from negative reviews - 29 fact records aggregated across date range 2017-2025 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:35:09 +00:00
Alejandro Gutiérrez	3e57c887e9	test: Add E2E pipeline test with real database Tests the full pipeline flow: - Stage 1: Insert raw reviews, normalize text - Stage 2: Mock LLM classification, insert spans - Stage 3: Route negative spans to issues - Stage 4: Aggregate facts by URT code and date Validates all pipeline.* tables are populated correctly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:28:53 +00:00
Alejandro Gutiérrez	03ed7029e2	feat: Add decoupled pipeline schema with separate PostgreSQL namespace - Create consolidated migration (005_create_pipeline_schema.sql) with 'pipeline' schema for all classification tables - Update pipeline repositories to use schema prefix (pipeline.*) - Add run_migrations() method to DatabaseManager - Add CLI tool for running versioned migrations Tables created in pipeline schema: - reviews_raw, reviews_enriched (Stage 1) - review_spans (Stage 2) - issues, issue_spans, issue_events (Stage 3) - fact_timeseries (Stage 4) - urt_domains, urt_categories (taxonomy lookup) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 18:17:20 +00:00
Alejandro Gutiérrez	b780a23b66	fix: Correct imports in test_scraper CLI tool - Import LogCapture from scraper module - Remove unused StructuredLogger import - Use correct log_capture parameter name Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 17:24:07 +00:00
Alejandro Gutiérrez	84f5efb5c7	feat: Add CLI tool for quick scraper testing Usage: python tools/test_scraper.py "ClickRent Gran Canaria" python tools/test_scraper.py "Starbucks NYC" --max 100 python tools/test_scraper.py --url "https://..." --headless python tools/test_scraper.py "Business" -o results.json -v Features: - Search by business name or direct URL - Configurable max reviews and timeout - Headless mode support - JSON output option - Real-time progress display - Verbose logging mode Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 17:20:12 +00:00

6 Commits
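
The multi-sort strategy described in commit fbd61ff7f7 (cycle through newest/lowest/highest/relevant, enable automatically when the total exceeds 1000, stop when a pass yields under 5% new reviews) can be illustrated with a minimal sketch. This is not the repository's code: the function name `collect_multi_sort`, the `fetch_pass` callback, the `"id"` key, and the parameter names are all hypothetical, and the sketch assumes that "new per pass" means the fraction of a pass's reviews that were not already collected.

```python
from typing import Callable, Dict, List, Sequence

# Sort orders named in the commit message; the order used here is an assumption.
SORT_ORDERS = ("newest", "lowest", "highest", "relevant")


def collect_multi_sort(
    fetch_pass: Callable[[str], List[dict]],  # hypothetical: returns reviews for one sort order
    total_reviews: int,
    sort_orders: Sequence[str] = SORT_ORDERS,
    multi_sort_threshold: int = 1000,  # "auto mode" cutoff from the commit message
    min_new_ratio: float = 0.05,       # stop when a pass yields < 5% new reviews
) -> List[dict]:
    """Merge reviews from several sort orders, deduplicating by review id."""
    # Auto mode: use every sort order only when the listing exceeds the threshold.
    if total_reviews > multi_sort_threshold:
        orders = list(sort_orders)
    else:
        orders = list(sort_orders)[:1]

    collected: Dict[str, dict] = {}
    for order in orders:
        batch = fetch_pass(order)
        new_count = 0
        for review in batch:
            if review["id"] not in collected:
                collected[review["id"]] = review
                new_count += 1
        # Diminishing returns: an empty pass counts as 0% new.
        new_ratio = new_count / len(batch) if batch else 0.0
        if new_ratio < min_new_ratio:
            break
    return list(collected.values())
```

In this sketch the per-pass ratio is measured against the size of the pass itself; measuring it against the running total would be an equally plausible reading of the commit message and would stop earlier on large listings.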