86 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
acd3b22e88 docs: Add pipeline development artifacts for parallel implementation
New artifacts:
- ReviewIQ-Pipeline-DevGuide.md: Entry point for pipeline work
- ReviewIQ-Pipeline-Contracts-v1.md: Stage I/O specs, validation rules, test fixtures
- ReviewIQ-Pipeline-Checklist.md: Per-stage implementation checklists
- ReviewIQ-Codebase-Overview.md: File structure, integration points
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md: Taxonomy versioning addendum

Updated:
- ReviewIQ-v32-Decisions.md: Added B2 audit findings, taxonomy versioning decisions, pipeline status

These artifacts enable parallel development of pipeline stages 1-4 with:
- Independent validation (35 rules across stages)
- Clear input/output contracts
- Test fixtures for each stage
- Definition of done criteria

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:08:40 +00:00
Alejandro Gutiérrez
c2996bef1e fix: Calculate job speed using last successful data retrieval timestamp
- Use updated_at (last successful data loop) instead of Date.now()
- Speed now reflects the actual data retrieval rate instead of declining over time
- Updated in table column, monitored job view, and stats row
- Fall back to Date.now() if updated_at is not available

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:04:35 +00:00
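
Note on c2996bef1e: the speed calculation described above can be sketched roughly as follows. This is a Python rendering of the logic only, not the frontend code; the helper name `job_speed`, its parameters, and the reviews-per-second unit are assumptions made for illustration.

```python
import time
from typing import Optional


def job_speed(reviews_count: int, started_at: float,
              updated_at: Optional[float] = None,
              now: Optional[float] = None) -> float:
    """Return reviews per second for a job (hypothetical helper).

    The elapsed window ends at updated_at (the last successful data loop),
    so the rate does not decay while the job sits idle. If updated_at is
    not available, fall back to the current time.
    """
    if updated_at is not None:
        end = updated_at
    else:
        end = now if now is not None else time.time()
    elapsed = end - started_at
    if elapsed <= 0:
        return 0.0  # avoid division by zero for jobs that just started
    return reviews_count / elapsed
```

With `updated_at` set, calling the function later does not change the result, which is the behaviour the commit describes.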
Alejandro Gutiérrez
5165d65152 fix: Center confirmation modal using transform
- Use fixed positioning with top/left 50% and translate -50%
- More reliable centering regardless of parent containers
- Add max-width for mobile responsiveness

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:50:08 +00:00
Alejandro Gutiérrez
83b245bbfc fix: Show blue background with spinner during validation
- Keep blue background when isCheckingReviews is true
- Add cursor-wait during validation
- Move disabled styling to explicit condition check
- White spinner now visible on blue background

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:49:35 +00:00
Alejandro Gutiérrez
e0e86d2830 feat: Persist jobs to localStorage and reset search after launch
- Reset search fields after job is successfully launched
- Allow user to immediately start another scrape
- Save active jobs to localStorage for persistence across refresh
- Restore jobs from localStorage on page load
- Resume polling for non-terminal jobs (pending/running)
- Filter out jobs older than 24 hours
- Add remove button (X) to each job card
- Clean up localStorage when jobs are removed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:47:01 +00:00
Alejandro Gutiérrez
0c8da54045 fix: Center confirmation modal properly
- Remove w-full that caused alignment issues
- Use fixed width (400px) for consistent centering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:40:54 +00:00
Alejandro Gutiérrez
ccfe00cebe fix: Properly center map click modal
- Remove w-full and mx-auto that caused alignment issues
- Use fixed width (280px) instead of max-w-xs
- Let flex container handle centering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:40:12 +00:00
Alejandro Gutiérrez
956d5dacda fix: Center map click modal with proper padding
- Center modal properly within map preview area
- Add 24px padding from map edges
- Make modal more compact (max-w-xs)
- Reduce text and element sizes for better fit

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:38:49 +00:00
Alejandro Gutiérrez
d4c3018429 refactor: Change search fields to horizontal layout
- Place Business Name, Location, and Validate button in same row
- Reduce padding and font sizes for compact inline layout
- Show abbreviated text on mobile (responsive)
- Use checkmark indicator for auto-detected location

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:37:08 +00:00
Alejandro Gutiérrez
82b2c51e4e feat: Split search into Business Name + Location fields
- Split single search input into two fields: Business Name (required)
  and Location (auto-detected from IP geolocation)
- Auto-fill location field with city/country from IP on page load
- Add click overlay on map iframe to prevent interaction
- Add warning modal when user clicks map, directing them to use search
- Update test URLs to use split format
- Make Validate button full-width for better UX

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:35:15 +00:00
Alejandro Gutiérrez
afab5127b3 Restore Google Maps iframe preview
- Restore original Google Maps embed iframe approach
- URL: maps.google.com/maps?q=...&output=embed&z=15
- Add "Open in Maps" overlay button on the map
- Height 300px for better visibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:29:33 +00:00
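
Note on afab5127b3: the embed URL pattern quoted in this commit could be assembled as in the sketch below. The `https` scheme and the helper name `maps_embed_url` are assumptions; only the host, path, and the `q`, `output`, and `z` parameters come from the commit message.

```python
from urllib.parse import urlencode


def maps_embed_url(query: str, zoom: int = 15) -> str:
    """Build an embed URL of the form maps.google.com/maps?q=...&output=embed&z=15.

    Hypothetical helper; urlencode takes care of escaping the search query.
    """
    params = urlencode({"q": query, "output": "embed", "z": zoom})
    return f"https://maps.google.com/maps?{params}"
```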
Alejandro Gutiérrez
43fd1515d2 Align artifacts with canonical URT v5.1 specification
Fixes inconsistencies discovered during audit against urt-taxonomy/:

- urt_profile ENUM: Add 'lite' and 'core' profiles (were missing)
- USN format: Use canonical regex from spec (was non-compliant)
- USN valence encoding: Add V0 (0) and V± (±) support
- USN grammar: Add Lite (URT:L:) and Core (URT:C:) formats
- Dimension codes: Fix temporal (TC/TR/TH/TF), evidence (ES/EI/EC),
  comparative (CR-N/CR-B/CR-W/CR-S) in decisions doc
- LLM contract: Full USN regex validation pattern

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:21:21 +00:00
Alejandro Gutiérrez
7666b7aea2 Fix: Replace broken Google Maps iframe with interactive preview + add scraper type selection
- Replace non-working Google Maps embed iframe with animated location preview
- Add "Open in Google Maps" button to open location in new tab
- Add scraper type selection dropdown fetching from /api/admin/scrapers
- Show selected scraper info with formatted labels (Google Reviews v1.0.0)
- Include scraper_version and scraper_variant in job submission

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:15:58 +00:00
Alejandro Gutiérrez
46cd54e275 Add LLM Classification Contract v1.0
Defines prompt, output schema, and validation rules for span-level
URT classification:

- System prompt with span extraction rules
- JSON schema for structured output
- 4 few-shot examples (multi-span, temporal, comparative)
- Structural and semantic validation rules
- Error handling with retry + fallback
- Performance considerations (token budget, batching, caching)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:07:31 +00:00
Alejandro Gutiérrez
3317553658 Wire frontend to real API endpoints
Dashboard page:
- Fetch top clients from /api/dashboard/by-client
- Show loading state while fetching
- Display empty state when no client data
- Show real client_id, job count, and success rate

Scrapers page:
- Fetch versions from /api/admin/scrapers
- Wire promote/deprecate buttons to real API calls
- Wire add version form to POST /api/admin/scrapers
- Wire traffic allocation to PUT /api/admin/scrapers/{id}/traffic
- Add loading and error states

Dockerfile:
- Add COPY commands for new directories (api/, core/, scrapers/, etc.)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 16:05:29 +00:00
Alejandro Gutiérrez
39c80fc8be Phases 5-7: Dashboard UI, Admin API, and Auth middleware
Phase 5 - Main Dashboard:
- Dashboard overview page with system health stats
- Jobs by status breakdown, success rates, top clients
- Dashboard API (/api/dashboard/overview, by-client, problems, by-version)

Phase 6 - Admin/Scraper Management:
- Scrapers management page with traffic allocation UI
- Admin API for scraper CRUD operations
- Traffic percentage updates for A/B testing
- Promote/deprecate scraper versions

Phase 7 - Authentication:
- API key authentication middleware
- SHA-256 key hashing (keys never stored in plain text)
- Scope-based authorization (jobs:read, jobs:write, admin)
- Rate limiting per API key

Also:
- Updated api_server_production.py to include new routers
- Extended core/database.py with dashboard query methods
- Added dashboard link to sidebar navigation
- Updated CONTEXT-KEEPER.md to mark all phases complete

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:43:00 +00:00
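
Note on 39c80fc8be: the Phase 7 idea of storing only SHA-256 hashes of API keys and checking scopes can be illustrated with a minimal sketch. Everything here is hypothetical (the in-memory `KEY_STORE`, `issue_key`, `is_authorized`); the real middleware, table layout, and rate limiting are not shown in this log.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Set

# In-memory stand-in for an api_keys table: key hash -> granted scopes.
KEY_STORE: Dict[str, Set[str]] = {}


def hash_key(raw_key: str) -> str:
    """Return the hex SHA-256 digest of a raw API key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


def issue_key(scopes: Set[str]) -> str:
    """Create a key, persist only its hash, and return the raw key once."""
    raw_key = secrets.token_urlsafe(32)
    KEY_STORE[hash_key(raw_key)] = set(scopes)
    return raw_key


def is_authorized(raw_key: str, required_scope: str) -> bool:
    """Check that the presented key is known and carries the required scope."""
    candidate = hash_key(raw_key)
    for stored_hash, scopes in KEY_STORE.items():
        # Constant-time comparison of the two hex digests.
        if hmac.compare_digest(stored_hash, candidate):
            return required_scope in scopes
    return False
```

Because only the digest is stored, a leaked table does not directly reveal usable keys; the raw key exists only in the response that issues it.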
Alejandro Gutiérrez
788ef84756 Phases 2-4: Requester support, batches, webhooks, scraper registry
Phase 2 - Requester & Batch Support:
- core/database.py: Added create_job params (requester_*, batch_*, priority, callback_*)
- core/database.py: Added batch methods (create_batch, get_batch, update_batch_progress, get_batches)
- core/database.py: Added update_job_callback for tracking webhook delivery
- api/routes/batches.py: New endpoints:
  - POST /api/scrape/google-reviews/batch (submit batch)
  - GET /api/batches (list batches)
  - GET /api/batches/{id} (batch detail)
  - DELETE /api/batches/{id} (cancel batch)
- api_server_production.py: Updated /api/scrape with requester, priority, callback fields
- api_server_production.py: New primary endpoint POST /api/scrape/google-reviews

Phase 3 - Webhooks:
- services/job_callback_service.py: New service with:
  - JobCallbackService: send_job_callback, send_batch_callback, retry_failed_callbacks
  - JobCallbackDispatcher: Background worker for callback monitoring
  - Payload formats per spec (job.completed, job.failed, batch.completed)
  - Exponential backoff for retries
  - Error classification for failure payloads
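
Illustration of the exponential backoff mentioned above, as a minimal sketch. The base delay, factor, cap, and the function name `backoff_delays` are assumptions; the actual retry schedule in job_callback_service.py is not stated in this message.

```python
from typing import List


def backoff_delays(max_attempts: int, base_seconds: float = 1.0,
                   factor: float = 2.0, cap_seconds: float = 60.0) -> List[float]:
    """Return the wait before each retry: base * factor**attempt, capped.

    Hypothetical schedule; attempt numbering starts at 0.
    """
    return [min(cap_seconds, base_seconds * factor ** attempt)
            for attempt in range(max_attempts)]
```

In practice a small random jitter is often added to each delay so that many failed callbacks do not retry at the same instant.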

Phase 4 - Scraper Registry:
- scrapers/registry.py: Database-backed version routing:
  - get_scraper(): Version/variant/A/B routing
  - _get_weighted_scraper(): Traffic-weighted random selection
  - 60-second TTL cache for performance
  - register_scraper, deprecate_scraper, update_traffic_allocation
  - LegacyScraperRegistry preserved for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:35:58 +00:00
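
Note on 788ef84756: the Phase 4 combination of traffic-weighted random selection and a 60-second TTL cache might look like the sketch below. `RegistryCache`, `pick_weighted`, and the `(version, traffic_pct)` row shape are hypothetical stand-ins, not the contents of scrapers/registry.py.

```python
import random
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, int]  # (version, traffic_pct), an assumed row shape


class RegistryCache:
    """Cache registry rows and reload them once the TTL has elapsed."""

    def __init__(self, loader: Callable[[], List[Row]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._rows: Optional[List[Row]] = None
        self._loaded_at = 0.0

    def rows(self) -> List[Row]:
        now = self._clock()
        if self._rows is None or now - self._loaded_at >= self._ttl:
            self._rows = self._loader()  # e.g. a database query
            self._loaded_at = now
        return self._rows


def pick_weighted(rows: List[Row], rng: random.Random) -> str:
    """Pick a version with probability proportional to its traffic_pct."""
    candidates = [(version, pct) for version, pct in rows if pct > 0]
    if not candidates:
        raise LookupError("no scraper version has traffic allocated")
    versions = [version for version, _ in candidates]
    weights = [pct for _, pct in candidates]
    return rng.choices(versions, weights=weights, k=1)[0]
```

Passing the clock and the random generator in as parameters keeps the sketch easy to test with a fake time source and a seeded generator.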
Alejandro Gutiérrez
2412996c54 Phase 1: Database migrations for platform features
Migrations created:
- 001_add_job_platform_fields.sql: Add 15 new columns to jobs table
  - Requester tracking (client_id, source, purpose, metadata)
  - Batch support (batch_id, batch_index)
  - Execution tracking (job_type, scraper_version, variant, priority)
  - Webhook callbacks (url, status, sent_at, attempts)
  - Result summary (JSONB for cross-type dashboard)
  - 7 indexes for query performance
  - 5 CHECK constraints for data validation

- 002_create_batches_table.sql: Batch job grouping
  - Tracks batch progress (total/completed/failed)
  - Batch-level callbacks
  - Requester association

- 003_create_scraper_registry.sql: Scraper version management
  - Version routing (stable/beta/canary variants)
  - A/B traffic splitting (traffic_pct)
  - Priority-based routing
  - Seeds google_reviews v1.0.0 as stable default

- 004_create_api_keys.sql: API authentication
  - Secure key storage (SHA-256 hashes, not plaintext)
  - Scopes-based permissions
  - Rate limiting support
  - Key lifecycle (expiry, active status)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:24:28 +00:00
Alejandro Gutiérrez
544e028c3f Phase 0: Project restructure to ReviewIQ platform architecture
New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:22:08 +00:00
Alejandro Gutiérrez
bb0291f265 Add CONTEXT-KEEPER.md for conversation continuity
Quick-reference document for resuming work after context compaction.
Contains: project overview, current state, spec summary, phases,
key decisions, file locations, and resumption instructions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:14:01 +00:00
Alejandro Gutiérrez
12d37e350b Fix JobDevTools contrast + log normalization, add Platform Spec
- Fix contrast issues in JobDevTools (level badges, text colors, timestamps)
- Make log normalization more robust (handles old/new formats, edge cases)
- Add ReviewIQ Platform Spec v1.2 defining:
  - Multi-tenant scraping-as-a-service architecture
  - Requester metadata, batches, webhooks, priority
  - Scraper versioning with A/B testing (stable/beta/canary)
  - API endpoints for job types, dashboard, admin
  - Output schemas for external service integration
  - Project structure reorganization plan

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:13:19 +00:00
Alejandro Gutiérrez
1e5401a9d1 Fix: Handle undefined rating_snapshot in job detail page
2026-01-24 13:15:14 +00:00
Alejandro Gutiérrez
eab0b4a7e9 Fix: Maximum update depth exceeded in NewScrapePage
Wrap handleJobsChange in useCallback to prevent infinite re-renders
caused by onJobsChange dependency changing on every render.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 13:14:23 +00:00
Alejandro Gutiérrez
cd9639f3b1 Wave 7: Integrate JobDevTools into job detail page (FINAL)
- Task #18: Complete integration of all JobDevTools components
  - Updated job detail page (/jobs/[id]) with full JobDevTools UI
  - Connected SSE stream for real-time structured logs + metrics
  - Added crash-report and retry API routes for Next.js
  - Added format conversion for old/new log formats
  - Added DevTools links to JobsView modal and actions column
  - Wired up CrashReport retry with auto-fix parameters
  - Integrated SessionPanel for fingerprint display
  - Integrated MetricsDashboard for real-time charts

Job DevTools implementation complete: 18/18 tasks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 13:11:19 +00:00
Alejandro Gutiérrez
f99827717f Final polish: v3.1.2 operational safety constraints
- Add chk_dedup_scoped constraint enforcing tenant-scoped dedup format
- Filter location_type='owned' in populate_facts() for 'ALL' rollup
- Document competitor exclusion from 'ALL' sentinel rollups
- Add explicit comments in aggregation code for maintainability

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:55:31 +00:00
Alejandro Gutiérrez
c6443166b2 Wave 6: CopyToolbar utilities and LogEntry row component
- Task #7: Create CopyToolbar and copy utilities
  (copy-utils.ts with text/JSON/CSV formatting, clipboard API with fallback)
  (CopyToolbar with copy all/selected, format dropdown, download export)
- Task #8: Create LogEntry row component
  (click-to-copy with visual feedback, expandable metrics view)
  (level/category badges, search highlighting, shift+click selection)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:51:48 +00:00
Alejandro Gutiérrez
3987a9ab4e Document v3.1.2 conventions: dedup scoping and sentinel values
Two micro-risk mitigations documented:

1. dedup_group_id: Format "{business_id}:{hash}" to prevent
   cross-tenant collision on similar reviews.

2. Sentinel conventions: 'ALL' (spatial) vs 'all' (semantic).
   Case matters — do not normalize.

Spec frozen as v3.1.2.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:50:29 +00:00
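
Note on 3987a9ab4e: the two conventions documented here can be sketched as follows. The "{business_id}:{hash}" format and the 'ALL' / 'all' sentinels come from the commit; the choice of SHA-256, the 16-character truncation, the text normalization, and the names below are assumptions for illustration.

```python
import hashlib


def dedup_group_id(business_id: str, review_text: str) -> str:
    """Build a tenant-scoped dedup id in the form "{business_id}:{hash}".

    Hypothetical helper: the hash covers whitespace-collapsed, lowercased
    text, so similar reviews share a hash but never share an id across tenants.
    """
    normalized = " ".join(review_text.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{business_id}:{digest}"


# Sentinels are case-sensitive and must not be normalized.
SPATIAL_ALL = "ALL"   # spatial rollup sentinel
SEMANTIC_ALL = "all"  # semantic sentinel
```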
Alejandro Gutiérrez
5ce3248efd Wave 5: LogViewer virtualized list and CrashReport component
- Task #6: Create LogViewer with react-window virtualization
  (search with highlighting, auto-scroll toggle, timestamp format toggle)
  (shift+click range selection, level/category color badges)
- Task #12: Create CrashReport frontend component
  (crash timeline SVG, pattern analysis with confidence bar)
  (auto-fix params display, retry API integration)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:44:35 +00:00
Alejandro Gutiérrez
2637d982e0 Wave 4: JobDevTools UI components and crash report API
- Task #5: Create JobDevTools container component
  (tabs: All/Scraper/Browser/Network/System, level filters, count badges)
- Task #11: Add crash report API endpoints
  (GET /jobs/{id}/crash-report, POST /jobs/{id}/retry?apply_fix=true, GET /crashes/stats)
- Task #14: Create SessionPanel component
  (fingerprint display, bot detection indicators, collapsible sections)
- Task #15: Create MetricsDashboard with recharts
  (extraction rate, cumulative reviews, memory usage, scroll progress)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:37:56 +00:00
Alejandro Gutiérrez
9515dd2d42 Polish ReviewIQ v3.1.2: tenant-scoping and FK integrity
Final fixes for production-ready spec:

1. locations.location_type: Added 'owned'|'competitor' flag.
   Competitors now inserted into locations (preserves FK integrity).

2. Competitor fact query: Added business_id filter to prevent
   cross-tenant contamination when same competitor tracked by
   multiple customers.

3. issue_events versioning: Added source + review_version columns
   for complete review reference in audit log.

4. Enrichment tenant-scoping: business_id now passed from ingest
   job (not looked up). Validates place_id exists under tenant.

5. Footer: Fixed version string v3.1.1 → v3.1.2.

Status: Ship-ready specification.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:34:35 +00:00
Alejandro Gutiérrez
f4ca60349e Wave 3: SSE structured logs, crash analyzer, session fingerprint
- Task #3: Update SSE stream to emit structured log events
  (type: "log" for entries, type: "metrics" every 5s, ?format=legacy for backward compat)
- Task #10: Create crash pattern analyzer module
  (6 patterns: memory_exhaustion, dom_bloat, rate_limited, consent_loop, scroll_timeout, element_stale)
  (confidence scoring, auto-fix params, summarize_crash_patterns for recurring issues)
- Task #13: Capture session fingerprint in backend
  (user_agent, platform, timezone, webgl, canvas, bot_detection_tests)
  (saved on success and failure for debugging)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:34:17 +00:00
Alejandro Gutiérrez
44d017b3f7 Finalize ReviewIQ Architecture v3.1.2 (production-ready)
Three final fixes applied:

1. issue_spans versioning: Added source + review_version columns
   with FK to reviews_enriched(source, review_id, review_version).
   Spans now correctly reference the exact review version.

2. Competitor business_id rule: Clarified that competitor reviews
   use customer's business_id + competitor's place_id (not NULL).
   Keeps facts and joins working without special-case logic.

3. Trust-weighted facts: Clarified trust_weighted_* columns are
   reserved but not populated in v3.1. Trust scoring applies to
   issue priority only. Aggregation deferred to v3.2.

Status: Production-grade architecture specification.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:31:16 +00:00
Alejandro Gutiérrez
d43c574b0c Add ReviewIQ Architecture v3.1.1 specification
Complete pipeline architecture for Google Reviews intelligence:
- Versioned reviews_enriched with (source, review_id, version) PK
- Tenant-scoped locations with (business_id, place_id) PK
- Relational issue_spans replacing array aggregation
- Unified fact_timeseries spine with 'ALL' sentinel for rollups
- Clean competitor model (separate table, no fake business_ids)
- Trust scoring and dedup support
- KPI-ready join keys

Reviewed and fixed: PK for edited reviews, multi-tenant overlap,
param ordering bugs, fact population scope, entity field deferral.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:25:46 +00:00
Alejandro Gutiérrez
9e1bcde981 Wave 2: Migrate scraper to StructuredLogger, add crash detection & topic tags
- Task #2: Migrate scraper_clean.py to use StructuredLogger with categories
  (37 log calls with metrics across browser/scraper/network/system)
- Task #4: Add crash_reports table schema and database methods
  (save_crash_report, get_crash_report, get_crash_stats)
- Task #9: Implement crash detection wrapper with metrics sampling
  (get_chrome_memory, get_dom_node_count, classify_crash)
- Task #17: Add topic tags to frontend ReviewAnalytics
  (topic filter UI, tags on cards, topics in modal)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:17:23 +00:00
Alejandro Gutiérrez
313e32f358 Wave 1: Add StructuredLogger and review topics inference
Task #1: StructuredLogger class (modules/structured_logger.py)
- LogEntry dataclass with timestamp, level, category, metrics, network
- Thread-safe storage with automatic pruning at 10k entries
- Level methods: debug(), info(), warn(), error(), fatal()
- Backward-compatible log() method for migration
- Filter methods: get_logs_by_category(), get_logs_by_level()
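
Illustration of thread-safe storage with automatic pruning, as a minimal sketch. `BoundedLogStore` and its simplified `LogEntry` are hypothetical and omit fields such as `network`; a `deque` with `maxlen` is one way to prune at 10k entries, not necessarily the one used in modules/structured_logger.py.

```python
import threading
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List


@dataclass
class LogEntry:
    timestamp: float
    level: str
    category: str
    message: str
    metrics: Dict[str, Any] = field(default_factory=dict)


class BoundedLogStore:
    """Keep at most max_entries log entries, dropping the oldest first."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._entries: Deque[LogEntry] = deque(maxlen=max_entries)
        self._lock = threading.Lock()

    def add(self, level: str, category: str, message: str, **metrics: Any) -> None:
        entry = LogEntry(time.time(), level, category, message, dict(metrics))
        with self._lock:  # serialize writers from multiple threads
            self._entries.append(entry)

    def by_category(self, category: str) -> List[LogEntry]:
        with self._lock:
            return [e for e in self._entries if e.category == category]
```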

Task #16: Review topics inference (modules/scraper_clean.py)
- get_topic_variants(): Generate word variants (plural, -ing, -ed forms)
- infer_review_topics(): Match review text to topic keywords
- Word boundary matching to avoid false positives
- Integrated into scrape_reviews() to add 'topics' field to reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:27:32 +00:00
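
Note on 313e32f358: the Task #16 approach (word variants plus word-boundary matching) can be sketched as below. The function names mirror those in the commit message, but the signatures, the naive suffix rules, and the topic dictionary shape are assumptions, not the code in modules/scraper_clean.py.

```python
import re
from typing import Dict, List, Set


def get_topic_variants(word: str) -> Set[str]:
    """Generate naive variants of a keyword: plural, -ing, and -ed forms."""
    return {word, word + "s", word + "ing", word + "ed"}


def infer_review_topics(text: str, topics: Dict[str, List[str]]) -> List[str]:
    """Return the topics whose keyword variants appear as whole words in text."""
    matched: List[str] = []
    for topic, keywords in topics.items():
        variants: Set[str] = set()
        for keyword in keywords:
            variants |= get_topic_variants(keyword.lower())
        if not variants:
            continue  # an empty alternation would match everywhere
        alternation = "|".join(re.escape(v) for v in sorted(variants))
        # \b on both sides keeps "park" from matching inside "spark".
        if re.search(r"\b(?:" + alternation + r")\b", text, flags=re.IGNORECASE):
            matched.append(topic)
    return matched
```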
Alejandro Gutiérrez
3da243be79 Add ReviewIQ pipeline spec and metadata extraction test
- reviewiq-pipeline-v1-final.md: Earlier pipeline specification
- test_metadata_extraction.py: Test script for metadata extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:21:33 +00:00
Alejandro Gutiérrez
59368a5bd5 Add Job DevTools implementation task breakdown
18 tasks organized in 5 parallel tracks:
- Track A: Backend logging infrastructure (4 tasks)
- Track B: Frontend log viewer (5 tasks)
- Track C: Crash analysis (4 tasks)
- Track D: Session & metrics (3 tasks)
- Track E: Review topics (2 tasks)

Includes dependency graph and 7-wave execution plan
for parallel AI agent workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:14:02 +00:00
Alejandro Gutiérrez
65fcaf43e8 Add Job DevTools specification document
Comprehensive spec for observability suite including:
- Structured logging system with categories
- Crash intelligence and pattern analysis
- Copy/export functionality
- Session fingerprint panel
- Real-time metrics dashboard
- Review topics inference

Organized by priority (P0-P3) with parallel implementation tracks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:10:34 +00:00
Alejandro Gutiérrez
b1296059a9 Add URL-based routing with sidebar navigation
Replace client-side state switching with proper Next.js routes:
- /new - New scrape form
- /jobs - Jobs list with table view
- /jobs/[id] - Individual job details and logs
- /analytics - Analytics overview (completed jobs)
- /analytics/[id] - Analytics for specific job

Add JobsContext for shared state across routes. Update Sidebar
to use next/link with pathname matching. Root page redirects to /new.

Also adds partial job status styling to JobsView.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:58:48 +00:00
Alejandro Gutiérrez
3eda9bdbfa Add complete URT v5.1 taxonomy framework (11 artifacts)
Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:51:41 +00:00
Alejandro Gutiérrez
a540ab97b1 Add browser fingerprint support and analytics metadata display
- Transfer user's browser fingerprint (user-agent, viewport, timezone,
  language, geolocation) to Chrome for more authentic scraping
- Display review topics from Google Maps in analytics dashboard
- Show business category badge in analytics header
- Fix date_text null handling in analytics (handle undefined/timestamp fields)
- Add review_topics and business_category to JobStatus interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:36:06 +00:00
Alejandro Gutiérrez
1bd30c0789 Fix get_business_card_info for pooled workers
- Clear cookies and navigate to about:blank before loading URL
  (ensures clean state when reusing pooled driver)
- Simplified regex patterns for rating/reviews extraction
- Uses partial word matching like scrape_reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 18:09:51 +00:00
Alejandro Gutiérrez
e3136281b8 Remove fast_scraper.py - consolidated into scraper_clean
All functionality now in scraper_clean.py:
- fast_scrape_reviews (main scraper)
- get_business_card_info (validation)

Updated health_checks.py to import from scraper_clean.

Removes 1,935 lines of duplicate/obsolete code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:59:09 +00:00
Alejandro Gutiérrez
0682c0ec61 Add get_business_card_info to scraper_clean with multilingual support
Replaces fast_scraper validation with efficient polling-based extraction
using the same navigation pattern as scrape_reviews:
- 10ms polling for consent handling (no fixed waits)
- 100ms polling for data extraction
- Exits early when data found

Supports multiple languages:
- Rating: stars/estrellas/étoiles/sterne/stelle
- Reviews: reviews/reseñas/avis/bewertungen/recensioni
- Handles comma decimals (4,8 -> 4.8)

Result: 6.3s to extract name, address, rating, total_reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:52:06 +00:00
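
Note on 0682c0ec61: the multilingual rating extraction with comma decimals could be approximated as in this sketch. The regular expression and the name `parse_rating` are assumptions built from the word list in the commit message; the actual patterns in scraper_clean.py are not shown here.

```python
import re
from typing import Optional

# Number with "." or "," as decimal separator, followed by a rating word.
RATING_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(?:stars?|estrellas?|étoiles?|sterne?|stell[ae])",
    re.IGNORECASE,
)


def parse_rating(text: str) -> Optional[float]:
    """Extract a rating such as "4,8 estrellas" and return it as 4.8."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))
```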
Alejandro Gutiérrez
47bb032011 Clean up project root - remove 51 obsolete files
Deleted:
- 26 old markdown summary/documentation files
- 16 debug/test Python scripts (debug_*, test_*, diagnose_*)
- 10 untracked JSON files from api_response_samples
- terms-of-usage.md, pane_not_found.png

Also includes pending web app changes:
- Jobs management UI (JobsView, Sidebar components)
- API routes for job streaming and comparison
- Enhanced ReviewAnalytics and ScraperTest components

Final clean structure:
├── api_server_production.py  (main entry)
├── modules/                  (core Python)
├── web/                      (Next.js frontend)
├── tests/                    (test suite)
├── docs/                     (documentation)
└── examples/                 (usage examples)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:31:53 +00:00
Alejandro Gutiérrez
8ccf72a489 Remove old scraper files - consolidate to scraper_clean
Production (api_server_production.py) only uses:
- modules/scraper_clean.py - main scraping logic
- modules/fast_scraper.py - validation helpers
- modules/database.py, webhooks.py, health_checks.py, chrome_pool.py

Deleted 33 unused Python files including:
- Old API server (api_server.py)
- 14 start*.py experimental scrapers
- 7 *_scraper.py variants
- Old modules: scraper.py, api_interceptor.py, job_manager.py, cli.py
- Various debug/test/utility scripts

Saves ~11,000 lines of unmaintained code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:25:00 +00:00
Alejandro Gutiérrez
80e7771c00 Fix DOM cleanup: hide cards from API interception too
The continue statement was skipping the card.style.display='none'
and card.innerHTML='' cleanup for cards already seen via API
interception. This caused the DOM to grow without bound during long scrapes.

Now ALL processed cards are hidden regardless of data source.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:23:51 +00:00
Alejandro Gutiérrez
01ea18d91d Add test URL quick-select buttons to frontend
- Small (~79 reviews): R. Fleitas Peluqueros
- Medium (~589 reviews): ClickRent Gran Canaria
- Large (~2000+ reviews): Hospital Doctor Negrín

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:20:54 +00:00
Alejandro Gutiérrez
8b36850838 Switch Docker production API to use scraper_clean
- Import fast_scrape_reviews from scraper_clean instead of fast_scraper
- Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper
- Production now uses clean scraper with hard refresh recovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:19:40 +00:00
Alejandro Gutiérrez
a6d6531543 Switch production to scraper_clean with hard refresh recovery
- Add fast_scrape_reviews() wrapper to scraper_clean.py for API compatibility
- Set window size (1200x900) in wrapper to ensure proper Google Maps rendering
- Update job_manager.py to import from scraper_clean instead of fast_scraper
- Production now uses clean scraper with:
  - Hard refresh recovery when stuck after 8+ soft recovery attempts
  - API interception + DOM parsing for complete data collection
  - Automatic deduplication across refreshes

Tested: 589/589 reviews collected in 55s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:18:10 +00:00