v1.0.0 improvements:
- Add captcha detection (reCAPTCHA, unusual traffic, challenges)
- Block fonts, analytics, and map tiles for faster scrolling
- Add 95% close-enough threshold to skip unnecessary retries
- Stop immediately if captcha detected instead of retrying
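The captcha and close-enough rules above can be sketched as a single decision function. This is a minimal illustration, not the scraper's actual code: `next_action`, `looks_like_captcha`, and the marker strings are hypothetical names chosen for the example; only the 95% threshold and the stop-on-captcha behavior come from the notes above.

```python
# Hypothetical marker strings; the real detection rules are not shown in this log.
CAPTCHA_MARKERS = ("recaptcha", "unusual traffic")

# The 95% close-enough threshold, kept as an integer to avoid float rounding.
CLOSE_ENOUGH_PERCENT = 95


def looks_like_captcha(page_text: str) -> bool:
    """Return True if any known captcha marker appears in the page text."""
    text = page_text.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)


def next_action(collected: int, expected: int, page_text: str) -> str:
    """Decide what to do after a scroll pass.

    Returns 'stop_captcha', 'done', or 'retry'.
    """
    # A captcha ends the run immediately; retrying would only repeat the block.
    if looks_like_captcha(page_text):
        return "stop_captcha"
    # Close enough: at least 95% of the expected reviews were collected.
    if expected <= 0 or collected * 100 >= expected * CLOSE_ENOUGH_PERCENT:
        return "done"
    return "retry"
```

The captcha check runs first so that a blocked page never counts as a completed pass, even when the collected total already meets the threshold.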
v1.1.0 new features:
- Multi-sort strategy to bypass ~1000 review limit
- Cycles through newest/lowest/highest/relevant sorts
- Auto mode: enables multi-sort when total > 1000
- Diminishing returns detection (stops if <5% new per pass)
- Configurable sort order and thresholds
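A rough sketch of the multi-sort loop described above, under stated assumptions: `collect_multi_sort` and `fetch_pass` are hypothetical names, reviews are assumed to carry an `id` key, and "<5% new per pass" is interpreted here as new reviews relative to the number already collected. The actual scraper may measure this differently.

```python
from typing import Callable, Dict, Iterable, List, Sequence

SORT_ORDERS = ("newest", "lowest", "highest", "relevant")  # order from the notes above
MULTI_SORT_MIN_TOTAL = 1000  # auto mode enables multi-sort above this total
MIN_NEW_PERCENT = 5          # stop once a pass adds less than this share


def collect_multi_sort(
    fetch_pass: Callable[[str], Iterable[Dict]],
    total: int,
    sort_orders: Sequence[str] = SORT_ORDERS,
    min_new_percent: int = MIN_NEW_PERCENT,
) -> List[Dict]:
    """Merge reviews from several sort orders, deduplicated by review id."""
    seen: Dict[object, Dict] = {}
    # Auto mode: only cycle through every sort when the listing exceeds the limit.
    orders = sort_orders if total > MULTI_SORT_MIN_TOTAL else sort_orders[:1]
    for order in orders:
        before = len(seen)
        for review in fetch_pass(order):
            seen.setdefault(review["id"], review)  # keep the first copy seen
        new = len(seen) - before
        # Diminishing returns: this pass added too little to justify another.
        if before and new * 100 < before * min_new_percent:
            break
    return list(seen.values())
```

Each sort order surfaces a different slice of the reviews, so merging by id is what lets the combined result exceed what a single sort returns.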
Also adds test_scraper_v110.py CLI tool for testing multi-sort.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ScraperV1Adapter to transform scraped reviews into pipeline format
  - Handles relative timestamps (centerDate)
  - Generates deterministic IDs for DOM-sourced reviews
  - Filters out empty (rating-only) reviews
- Add sample barbershop reviews (79 reviews, 46 with text)
  - Real data from Las Palmas barbershop
  - Multi-language: Spanish, English, German, Norwegian, Italian
- Add test_pipeline_real_data.py for E2E testing with real data
  - Uses mock classifier based on keywords and rating
  - Full pipeline flow: raw -> enriched -> spans -> issues -> facts
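The adapter's deterministic IDs and rating-only filter might look roughly like the following. This is a sketch, not the ScraperV1Adapter implementation: the field names (`author`, `rating`, `text`), the `place_id` argument, and the choice of a truncated SHA-256 digest are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List


def deterministic_id(place_id: str, author: str, rating: int, text: str) -> str:
    """Derive a stable ID from review content (hypothetical field choice)."""
    # The unit separator keeps adjacent fields from running together.
    payload = "\x1f".join([place_id, author, str(rating), text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def to_pipeline_records(place_id: str, scraped: Iterable[Dict]) -> List[Dict]:
    """Drop rating-only reviews and attach a deterministic ID to the rest."""
    records = []
    for review in scraped:
        text = (review.get("text") or "").strip()
        if not text:
            continue  # rating-only review: nothing to classify
        records.append({
            "review_id": deterministic_id(
                place_id, review["author"], review["rating"], text
            ),
            "rating": review["rating"],
            "text": text,
        })
    return records
```

Because the ID depends only on the review's content, re-scraping the same page yields the same IDs, which makes repeated pipeline runs idempotent.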
Test results with real data:
- 46 reviews processed
- 6 languages detected (es: 35, en: 7, de: 1, no: 1, it: 1, ca: 1)
- 3 issues identified from negative reviews
- 29 fact records aggregated across date range 2017-2025
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests the full pipeline flow:
- Stage 1: Insert raw reviews, normalize text
- Stage 2: Mock LLM classification, insert spans
- Stage 3: Route negative spans to issues
- Stage 4: Aggregate facts by URT code and date
Validates all pipeline.* tables are populated correctly.
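As an illustration of Stage 4, grouping spans by URT code and date could be sketched as below. The `aggregate_facts` function, the span keys, and the output shape are hypothetical; the real pipeline writes to pipeline.* tables whose schema is not shown here.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, List


def aggregate_facts(spans: Iterable[Dict]) -> List[Dict]:
    """Count spans per (URT code, review date) pair."""
    counts = Counter((s["urt_code"], s["review_date"]) for s in spans)
    # Sort for a stable output order: by code first, then by date.
    return [
        {"urt_code": code, "date": day, "span_count": n}
        for (code, day), n in sorted(counts.items())
    ]
```

A usage example with made-up codes:

```python
facts = aggregate_facts([
    {"urt_code": "X.01", "review_date": date(2024, 1, 5)},
    {"urt_code": "X.01", "review_date": date(2024, 1, 5)},
])
```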
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Import LogCapture from scraper module
- Remove unused StructuredLogger import
- Use correct log_capture parameter name
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>