5 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
e2d7f6f118 feat: Add ScraperV1Adapter and real data pipeline test
- Add ScraperV1Adapter to transform scraped reviews into pipeline format
  - Handles relative timestamps (centerDate)
  - Generates deterministic IDs for DOM-sourced reviews
  - Filters out empty (rating-only) reviews
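
The ID generation and empty-review filtering described above could look roughly like the sketch below. The field names (`author`, `text`, `rating`), the choice of SHA-1, and the 16-character truncation are assumptions made for illustration, not taken from the actual ScraperV1Adapter; relative-timestamp handling is not covered here.

```python
import hashlib


def deterministic_id(author: str, text: str, rating: int) -> str:
    """Hash the review content so the same review always gets the same ID."""
    key = f"{author}|{rating}|{text}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()[:16]


def adapt(scraped: list[dict]) -> list[dict]:
    """Convert scraped reviews to pipeline rows, dropping rating-only reviews."""
    rows = []
    for review in scraped:
        text = (review.get("text") or "").strip()
        if not text:
            # Rating-only review: nothing to classify, so skip it.
            continue
        rows.append(
            {
                "id": deterministic_id(review.get("author", ""), text, review["rating"]),
                "text": text,
                "rating": review["rating"],
            }
        )
    return rows
```

Because the ID depends only on the review content, re-running the adapter on the same scrape should produce the same IDs, which keeps repeated imports idempotent.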

- Add sample barbershop reviews (79 reviews, 46 with text)
  - Real data from Las Palmas barbershop
  - Multi-language: Spanish, English, German, Norwegian, Italian

- Add test_pipeline_real_data.py for E2E testing with real data
  - Uses mock classifier based on keywords and rating
  - Full pipeline flow: raw -> enriched -> spans -> issues -> facts
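
A keyword-and-rating mock classifier of the kind mentioned above might be sketched as follows. The keyword list, the rating thresholds, and the output shape are hypothetical and are not taken from test_pipeline_real_data.py.

```python
import re

# Hypothetical keyword list; the real test may use different terms or languages.
NEGATIVE_KEYWORDS = {"rude", "dirty", "late", "expensive"}


def mock_classify(text: str, rating: int) -> dict:
    """Stand-in for an LLM classifier: decide sentiment from keywords and stars."""
    words = set(re.findall(r"\w+", text.lower()))
    hits = sorted(words & NEGATIVE_KEYWORDS)
    if rating <= 2 or hits:
        sentiment = "negative"
    elif rating >= 4:
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return {"sentiment": sentiment, "keywords": hits}
```

A deterministic stand-in like this lets the end-to-end flow run without network calls or model variance.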

Test results with real data:
- 46 reviews processed
- 6 languages detected (es: 35, en: 7, de: 1, no: 1, it: 1, ca: 1)
- 3 issues identified from negative reviews
- 29 fact records aggregated across date range 2017-2025

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 18:35:09 +00:00
Alejandro Gutiérrez
3e57c887e9 test: Add E2E pipeline test with real database
Tests the full pipeline flow:
- Stage 1: Insert raw reviews, normalize text
- Stage 2: Mock LLM classification, insert spans
- Stage 3: Route negative spans to issues
- Stage 4: Aggregate facts by URT code and date
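
As a rough in-memory illustration of Stages 3 and 4, the sketch below routes negative spans and counts spans per URT code and date. The span fields (`sentiment`, `urt_code`, `review_date`) and the output shape are assumptions; the real test writes to the pipeline.* tables instead.

```python
from collections import Counter
from datetime import date


def route_negative(spans: list[dict]) -> list[dict]:
    """Stage 3 (sketch): keep only the spans that should feed issues."""
    return [s for s in spans if s["sentiment"] == "negative"]


def aggregate_facts(spans: list[dict]) -> list[dict]:
    """Stage 4 (sketch): count spans per (URT code, date) pair."""
    counts = Counter((s["urt_code"], s["review_date"]) for s in spans)
    return [
        {"urt_code": code, "date": day, "span_count": n}
        for (code, day), n in sorted(counts.items())
    ]
```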

Validates all pipeline.* tables are populated correctly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 18:28:53 +00:00
Alejandro Gutiérrez
03ed7029e2 feat: Add decoupled pipeline schema with separate PostgreSQL namespace
- Create consolidated migration (005_create_pipeline_schema.sql) with
  'pipeline' schema for all classification tables
- Update pipeline repositories to use schema prefix (pipeline.*)
- Add run_migrations() method to DatabaseManager
- Add CLI tool for running versioned migrations
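
One way a versioned migration runner can decide what to apply is sketched below. It assumes files are named with a numeric prefix, as in 005_create_pipeline_schema.sql, and that the set of already-applied versions is tracked elsewhere; the helper name `pending_migrations` is hypothetical and is not the actual run_migrations() implementation.

```python
import re

# Matches names such as "005_create_pipeline_schema.sql".
MIGRATION_RE = re.compile(r"^(\d+)_.+\.sql$")


def pending_migrations(filenames: list[str], applied_versions: set[int]) -> list[str]:
    """Return unapplied migration files, ordered by their numeric prefix."""
    found = []
    for name in filenames:
        match = MIGRATION_RE.match(name)
        if match is None:
            continue  # Not a versioned migration file.
        version = int(match.group(1))
        if version not in applied_versions:
            found.append((version, name))
    return [name for _, name in sorted(found)]
```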

Tables created in pipeline schema:
- reviews_raw, reviews_enriched (Stage 1)
- review_spans (Stage 2)
- issues, issue_spans, issue_events (Stage 3)
- fact_timeseries (Stage 4)
- urt_domains, urt_categories (taxonomy lookup)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 18:17:20 +00:00
Alejandro Gutiérrez
b780a23b66 fix: Correct imports in test_scraper CLI tool
- Import LogCapture from scraper module
- Remove unused StructuredLogger import
- Use correct log_capture parameter name

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:24:07 +00:00
Alejandro Gutiérrez
84f5efb5c7 feat: Add CLI tool for quick scraper testing
Usage:
  python tools/test_scraper.py "ClickRent Gran Canaria"
  python tools/test_scraper.py "Starbucks NYC" --max 100
  python tools/test_scraper.py --url "https://..." --headless
  python tools/test_scraper.py "Business" -o results.json -v

Features:
- Search by business name or direct URL
- Configurable max reviews and timeout
- Headless mode support
- JSON output option
- Real-time progress display
- Verbose logging mode
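
The flags shown in the usage lines could be parsed with argparse along these lines. This is a sketch only: the destination names are assumptions, the timeout option is omitted because its flag name is not given above, and the real tools/test_scraper.py may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(prog="test_scraper.py")
    parser.add_argument("query", nargs="?", help="business name to search for")
    parser.add_argument("--url", help="direct URL, used instead of a search query")
    parser.add_argument("--max", type=int, dest="max_reviews", help="maximum reviews to fetch")
    parser.add_argument("--headless", action="store_true", help="run the browser headless")
    parser.add_argument("-o", dest="output", help="path of the JSON output file")
    parser.add_argument("-v", dest="verbose", action="store_true", help="verbose logging")
    return parser
```

Making the positional `query` optional (`nargs="?"`) is what allows the `--url` form to be used without a business name.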

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:20:12 +00:00