docs: Add pipeline development artifacts for parallel implementation

New artifacts:
- ReviewIQ-Pipeline-DevGuide.md: Entry point for pipeline work
- ReviewIQ-Pipeline-Contracts-v1.md: Stage I/O specs, validation rules, test fixtures
- ReviewIQ-Pipeline-Checklist.md: Per-stage implementation checklists
- ReviewIQ-Codebase-Overview.md: File structure, integration points
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md: Taxonomy versioning addendum

Updated:
- ReviewIQ-v32-Decisions.md: Added B2 audit findings, taxonomy versioning decisions, pipeline status

These artifacts enable parallel development of pipeline stages 1-4 with:
- Independent validation (35 rules across stages)
- Clear input/output contracts
- Test fixtures for each stage
- Definition of done criteria

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:08:40 +00:00
parent c2996bef1e
commit acd3b22e88
6 changed files with 3600 additions and 4 deletions
--- a/.artifacts/ReviewIQ-Codebase-Overview.md
+++ b/.artifacts/ReviewIQ-Codebase-Overview.md
@@ -0,0 +1,450 @@
+# ReviewIQ Codebase Overview
+
+**Purpose**: Map existing code for agents starting fresh
+**Last Updated**: 2026-01-24
+**Status**: ~55% implemented (scraping done, enrichment pipeline missing)
+
+---
+
+## Quick Start for New Agents
+
+1. **Read first**: `ReviewIQ-v32-Decisions.md` (context recovery)
+2. **For implementation**: `ReviewIQ-Pipeline-Contracts-v1.md` + `ReviewIQ-Pipeline-Checklist.md`
+3. **For schema**: `ReviewIQ-Architecture-v3.2.md` § Part 2
+4. **For LLM prompts**: `LLM-Classification-Contract-v1.md`
+
+---
+
+## Implementation Status Summary
+
+```
+PIPELINE COMPLETION: ~55%
+
+✅ COMPLETE (Working in Production)
+├── Google Maps scraping (v1.0.0)
+├── Job orchestration & queuing
+├── Chrome worker pool
+├── Webhook delivery
+├── SSE real-time streaming
+├── Frontend job management
+└── Basic analytics dashboard
+
+❌ NOT IMPLEMENTED (Spec'd Only)
+├── Stage 1: Normalization
+├── Stage 2: LLM Classification
+├── Stage 3: Issue Routing
+├── Stage 4: Fact Aggregation
+├── Enrichment database schema
+└── Advanced analytics UI
+```
+
+---
+
+## Directory Structure
+
+```
+google-reviews-scraper-pro/
+│
+├── .artifacts/                    # Design documents (YOU ARE HERE)
+│   ├── ReviewIQ-v32-Decisions.md         # START HERE - context recovery
+│   ├── ReviewIQ-Architecture-v3.2.md     # Full v3.2 spec
+│   ├── ReviewIQ-Pipeline-Contracts-v1.md # Stage I/O contracts
+│   ├── ReviewIQ-Pipeline-Checklist.md    # Implementation checklist
+│   ├── LLM-Classification-Contract-v1.md # LLM prompt spec
+│   ├── URT-v5.1-Reference.md             # URT dimension codes
+│   └── ReviewIQ-Codebase-Overview.md     # THIS FILE
+│
+├── api/                           # ✅ API routes (FastAPI)
+│   ├── routes/
+│   │   ├── admin.py              # Scraper management endpoints
+│   │   ├── dashboard.py          # Analytics endpoints
+│   │   └── batches.py            # Batch job endpoints
+│   └── __init__.py
+│
+├── core/                          # ✅ Core services
+│   ├── database.py               # AsyncPG database layer (~1200 lines)
+│   ├── config.py                 # Configuration management
+│   └── logging.py                # Structured logging
+│
+├── services/                      # ✅ Background services
+│   ├── webhook_service.py        # Async webhook delivery
+│   └── job_callback_service.py   # Callback handling
+│
+├── workers/                       # ✅ Worker pool
+│   └── chrome_pool.py            # Chrome instance pooling
+│
+├── scrapers/                      # ✅ Scraper implementations
+│   └── google_reviews/
+│       └── v1_0_0.py             # Main scraper (~2000 lines)
+│
+├── pipeline/                      # ❌ TO BE CREATED
+│   ├── stage1_normalize.py       # TODO
+│   ├── stage2_classify.py        # TODO
+│   ├── stage3_route.py           # TODO
+│   ├── stage4_aggregate.py       # TODO
+│   └── tests/                    # TODO
+│
+├── migrations/                    # Database migrations
+│   ├── 001_add_job_platform_fields.sql    # ✅ Deployed
+│   ├── 002_create_batches_table.sql       # ✅ Deployed
+│   ├── 003_create_scraper_registry.sql    # ✅ Deployed
+│   ├── 004_create_api_keys.sql            # ✅ Deployed
+│   ├── 005_create_reviews_tables.sql      # ❌ TODO
+│   ├── 006_create_spans_table.sql         # ❌ TODO
+│   ├── 007_create_urt_enums.sql           # ❌ TODO
+│   ├── 008_create_issues_tables.sql       # ❌ TODO
+│   └── 009_create_facts_table.sql         # ❌ TODO
+│
+├── frontend/                      # ✅ Next.js frontend
+│   ├── app/
+│   │   ├── dashboard/            # System overview
+│   │   ├── jobs/                 # Job list & detail
+│   │   ├── analytics/            # Basic charts
+│   │   └── new/                  # Job submission forms
+│   └── components/
+│
+├── api_server_production.py      # ✅ Main API server (~1920 lines)
+├── Dockerfile                    # ✅ Production container
+├── docker-compose.production.yml # ✅ Docker orchestration
+├── requirements-production.txt   # ✅ Python dependencies
+└── package.json                  # ✅ Node.js dependencies
+```
+
+---
+
+## Key Files to Understand
+
+### 1. API Server Entry Point
+**File**: `api_server_production.py`
+**Lines**: ~1920
+**What it does**:
+- FastAPI application setup
+- All endpoint definitions
+- Job submission and management
+- SSE streaming
+- Health checks
+
+**Key endpoints**:
+```
+POST /api/scrape/google-reviews  # Submit scrape job
+GET  /jobs/{job_id}              # Get job status
+GET  /jobs/{job_id}/reviews      # Get scraped reviews
+GET  /jobs/{job_id}/stream       # SSE real-time updates
+```
+
+### 2. Database Layer
+**File**: `core/database.py`
+**Lines**: ~1200
+**What it does**:
+- AsyncPG connection pooling
+- Job CRUD operations
+- Review storage (currently JSONB blob)
+- Webhook tracking
+
+**Key functions**:
+```python
+async def create_job(job_data: dict) -> str
+async def update_job_status(job_id: str, status: str, ...)
+async def get_job(job_id: str) -> Optional[dict]  # None when the job does not exist
+async def store_reviews(job_id: str, reviews: list) -> int
+```
+
+**Note**: Currently stores reviews as JSONB in `jobs.reviews_data`.
+The enrichment pipeline will need to:
+1. Read from `jobs.reviews_data`
+2. Write to `reviews_raw` and `reviews_enriched` tables
+
+### 3. Google Scraper
+**File**: `scrapers/google_reviews/v1_0_0.py`
+**Lines**: ~2000
+**What it does**:
+- SeleniumBase Chrome automation
+- DOM scraping + API interception
+- Review extraction (text, rating, author, date)
+- Business metadata extraction
+- Pagination handling
+
+**Output format** (stored in `jobs.reviews_data`):
+```json
+{
+  "business_info": {...},
+  "reviews": [
+    {
+      "review_id": "...",
+      "author_name": "...",
+      "rating": 4,
+      "text": "...",
+      "review_time": "2026-01-20T14:30:00Z"
+    }
+  ]
+}
+```
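The payload above can be given a typed shape before Stage 1 consumes it. This is a minimal sketch that assumes only the five review fields shown in the sample; `ScrapedReview`, `ScrapedPayload`, and `non_empty_reviews` are hypothetical names and do not exist in the current codebase.

```python
from typing import Any, TypedDict


class ScrapedReview(TypedDict):
    review_id: str
    author_name: str
    rating: int
    text: str
    review_time: str  # ISO 8601 timestamp, as in the sample above


class ScrapedPayload(TypedDict):
    business_info: dict[str, Any]
    reviews: list[ScrapedReview]


def non_empty_reviews(payload: ScrapedPayload) -> list[ScrapedReview]:
    """Return reviews whose text is present and not just whitespace."""
    return [r for r in payload["reviews"] if (r.get("text") or "").strip()]
```

Typing the blob this way would let a type checker flag misspelled keys in Stage 1 before any database write happens.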
+
+### 4. Chrome Worker Pool
+**File**: `workers/chrome_pool.py`
+**What it does**:
+- Pre-warms Chrome instances
+- Manages concurrent scraping jobs
+- Handles resource cleanup
+
+---
+
+## Database Schema (Current State)
+
+### Deployed Tables
+
+```sql
+-- Core job tracking
+CREATE TABLE jobs (
+    job_id UUID PRIMARY KEY,
+    status VARCHAR(50),
+    url TEXT,
+    reviews_data JSONB,          -- Raw scraped reviews live here
+    reviews_count INTEGER,
+    started_at TIMESTAMP,
+    completed_at TIMESTAMP,
+    -- ... 20+ more columns
+);
+
+-- Batch processing
+CREATE TABLE batches (...);
+
+-- Webhook tracking
+CREATE TABLE webhook_attempts (...);
+
+-- Scraper versioning
+CREATE TABLE scraper_registry (...);
+
+-- API authentication (not enforced)
+CREATE TABLE api_keys (...);
+```
+
+### NOT Deployed (Defined in v3.2 Spec)
+
+```sql
+-- These tables need to be created via migrations 005-009
+
+CREATE TABLE locations (...);           -- Multi-tenant locations
+CREATE TABLE reviews_raw (...);         -- Immutable raw storage
+CREATE TABLE reviews_enriched (...);    -- Classified reviews
+CREATE TABLE review_spans (...);        -- Span-level classification
+CREATE TABLE urt_codes (...);           -- URT reference data
+CREATE TABLE issues (...);              -- Aggregated issues
+CREATE TABLE issue_spans (...);         -- Issue-span links
+CREATE TABLE issue_events (...);        -- Audit log
+CREATE TABLE fact_timeseries (...);     -- Pre-aggregated analytics
+```
+
+---
+
+## Integration Points
+
+### Where New Pipeline Code Connects
+
+```
+                    EXISTING CODE                    NEW CODE
+                    ════════════                    ════════
+
+api_server_production.py
+        │
+        ▼
+    jobs table
+    (reviews_data JSONB) ──────────────────▶ pipeline/stage1_normalize.py
+                                                    │
+                                                    ▼
+                                            reviews_raw table
+                                            reviews_enriched table
+                                                    │
+                                                    ▼
+                                            pipeline/stage2_classify.py
+                                                    │
+                                                    ▼
+                                            review_spans table
+                                                    │
+                                                    ▼
+                                            pipeline/stage3_route.py
+                                                    │
+                                                    ▼
+                                            issues table
+                                            issue_spans table
+                                                    │
+                                                    ▼
+                                            pipeline/stage4_aggregate.py
+                                                    │
+                                                    ▼
+                                            fact_timeseries table
+```
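The diagram above implies a strictly ordered run in which each stage reads what the previous one wrote. A rough sketch of such a runner follows, assuming each stage is exposed as an async callable that takes a `job_id` and returns a stats dict; `run_pipeline` and the `Stage` alias are hypothetical names, not existing code.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stage signature: takes a job_id, returns a stats dict.
Stage = Callable[[str], Awaitable[dict]]


async def run_pipeline(job_id: str, stages: list[tuple[str, Stage]]) -> dict[str, dict]:
    """Run stages in order and collect their stats.

    An exception in any stage propagates immediately, so later stages
    never run against partially written data.
    """
    results: dict[str, dict] = {}
    for name, stage in stages:
        results[name] = await stage(job_id)
    return results
```

Passing the stages in as a list keeps the runner testable with fakes while the real stage modules are still being written in parallel.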
+
+### How to Trigger Pipeline
+
+**Option A: Post-scrape hook** (recommended)
+```python
+# In api_server_production.py, after job completes:
+async def on_job_complete(job_id: str):
+    # Existing: send webhook
+    await webhook_service.dispatch(job_id)
+
+    # NEW: trigger enrichment pipeline
+    await pipeline.stage1.process_job(job_id)
+```
+
+**Option B: Background worker**
+```python
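+# New file: workers/enrichment_worker.py
+import asyncio
+import logging
+
+logger = logging.getLogger(__name__)
+
+async def enrichment_loop():
+    while True:
+        jobs = await db.query("""
+            SELECT job_id FROM jobs
+            WHERE status = 'completed'
+              AND enrichment_status IS NULL
+            LIMIT 10
+        """)
+        for job in jobs:
+            try:
+                # pipeline.process must set enrichment_status, or the job is re-selected on every pass
+                await pipeline.process(job['job_id'])
+            except Exception:
+                # One failing job should not stop the worker
+                logger.exception("Enrichment failed for job %s", job['job_id'])
+        await asyncio.sleep(60)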
+```
+
+**Option C: Manual trigger via API**
+```python
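+# New endpoint in api_server_production.py
+from fastapi import BackgroundTasks
+
+@app.post("/api/jobs/{job_id}/enrich", status_code=202)
+async def trigger_enrichment(job_id: str, background_tasks: BackgroundTasks):
+    # Run the pipeline after the response is sent, so "processing" is accurate
+    background_tasks.add_task(pipeline.process, job_id)
+    return {"status": "processing"}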
+```
+
+---
+
+## Environment Setup
+
+### Required Environment Variables
+
+```bash
+# Database (required)
+DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq
+
+# LLM Provider (for Stage 2)
+OPENAI_API_KEY=sk-...
+# OR
+ANTHROPIC_API_KEY=sk-ant-...
+
+# Embedding model (for Stage 2)
+EMBEDDING_MODEL=all-MiniLM-L6-v2
+
+# Taxonomy version
+DEFAULT_TAXONOMY_VERSION=v5.1
+```
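These variables could be read once at startup and validated before any stage runs. The sketch below assumes the values shown above double as defaults and that OpenAI is preferred when both keys are set; both are assumptions, and `PipelineSettings` and `load_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Build settings from environment variables, failing fast on missing ones."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")
    # Assumption: prefer OpenAI when both provider keys are present.
    if env.get("OPENAI_API_KEY"):
        provider, key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")
    return PipelineSettings(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=key,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )
```

Accepting an optional mapping keeps the loader testable without touching the real process environment.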
+
+### Local Development
+
+```bash
+# Start database
+docker-compose -f docker-compose.production.yml up -d postgres
+
+# Run migrations
+psql $DATABASE_URL -f migrations/001_add_job_platform_fields.sql
+# ... repeat for all migrations
+
+# Start API server
+python api_server_production.py
+
+# Start frontend (separate terminal)
+cd frontend && npm run dev
+```
+
+### Running Tests
+
+```bash
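+# Unit tests (skip the integration directory, which lives under pipeline/tests/)
+pytest pipeline/tests/ -v --ignore=pipeline/tests/integration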
+
+# Integration tests
+pytest pipeline/tests/integration/ -v
+
+# Full E2E validation
+python -m pipeline.validate --job-id <JOB_ID>
+```
+
+---
+
+## Tech Stack
+
+| Component | Technology | Version |
+|-----------|------------|---------|
+| API | FastAPI | 0.100+ |
+| Database | PostgreSQL | 15+ |
+| DB Driver | asyncpg | 0.28+ |
+| Scraping | SeleniumBase | 4.20+ |
+| Browser | Chrome (headless) | 120+ |
+| Frontend | Next.js | 16.1.3 |
+| UI | React | 19.2.3 |
+| Charts | Recharts | 2.x |
+| LLM | OpenAI / Anthropic | Latest |
+| Embeddings | sentence-transformers | 2.x |
+| Vectors | pgvector | 0.5+ |
+
+---
+
+## Common Patterns
+
+### Database Queries (asyncpg)
+
+```python
+# In core/database.py pattern:
+from typing import Optional
+
+async def get_job(job_id: str) -> Optional[dict]:
+    async with pool.acquire() as conn:
+        row = await conn.fetchrow(
+            "SELECT * FROM jobs WHERE job_id = $1",
+            job_id
+        )
+        # fetchrow returns None when no row matches
+        return dict(row) if row else None
+```
+
+### API Endpoints (FastAPI)
+
+```python
+# In api_server_production.py pattern:
+@app.get("/jobs/{job_id}")
+async def get_job(job_id: str):
+    job = await db.get_job(job_id)
+    if not job:
+        raise HTTPException(404, "Job not found")
+    return job
+```
+
+### Background Tasks
+
+```python
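+# Pattern for async processing:
+import asyncio
+
+_background_tasks: set[asyncio.Task] = set()
+
+async def process_in_background(job_id: str):
+    # Hold a strong reference so the task is not garbage-collected mid-run
+    task = asyncio.create_task(do_heavy_work(job_id))
+    _background_tasks.add(task)
+    task.add_done_callback(_background_tasks.discard)
+    return {"status": "processing"}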
+```
+
+---
+
+## Gotchas & Notes
+
+1. **Reviews are JSONB blobs** - Currently in `jobs.reviews_data`, not normalized tables
+2. **No auth enforcement** - `api_keys` table exists but is not used
+3. **CORS is wide open** - Set to `*` in production (fix before launch)
+4. **Scraper is single-threaded per job** - Chrome pool handles concurrency
+5. **Webhooks have retry logic** - 3 attempts with exponential backoff
+6. **SSE streaming works** - Real-time job updates via `/jobs/{job_id}/stream`
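Gotcha 5 describes three attempts with exponential backoff. A minimal sketch of that retry shape follows, assuming a one-second base delay that doubles after each failure; `retry_with_backoff` and its parameters are hypothetical, and this is not the code in `services/webhook_service.py`.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    send: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call send() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return await send()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("attempts must be at least 1")
```

The same helper shape could wrap Stage 2 LLM calls, although the checklist does not specify the retry parameters for those.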
+
+---
+
+## Next Steps for Implementation
+
+1. **Create migrations 005-009** - Deploy enrichment schema
+2. **Create `pipeline/` directory** - New code goes here
+3. **Implement Stage 1** - Read from jobs.reviews_data, write to reviews_raw/enriched
+4. **Implement Stage 2** - LLM classification with span extraction
+5. **Implement Stage 3** - Issue routing
+6. **Implement Stage 4** - Fact aggregation
+7. **Add pipeline trigger** - Hook into job completion or create worker
+8. **Update frontend** - Add enrichment views
+
+---
+
+*This document should be updated when significant code changes occur.*
--- a/.artifacts/ReviewIQ-Pipeline-Checklist.md
+++ b/.artifacts/ReviewIQ-Pipeline-Checklist.md
@@ -0,0 +1,371 @@
+# ReviewIQ Pipeline Implementation Checklist
+
+**Purpose**: Quick reference for agents to verify stage completion
+**Reference**: `ReviewIQ-Pipeline-Contracts-v1.md` for full specs
+
+---
+
+## Pipeline Overview
+
+```
+Stage 0 ──▶ Stage 1 ──▶ Stage 2 ──▶ Stage 3 ──▶ Stage 4
+Scrape      Normalize   Classify    Route       Aggregate
+✅ DONE     ❌ TODO     ❌ TODO     ❌ TODO     ❌ TODO
+```
+
+---
+
+## Stage 0: Raw Ingestion ✅ COMPLETE
+
+No action needed. Already implemented in `scrapers/google_reviews/v1_0_0.py`.
+
+**Output Location**: `jobs.reviews_data` (JSONB)
+
+---
+
+## Stage 1: Normalization ❌ TODO
+
+### Files to Create
+- [ ] `pipeline/stage1_normalize.py` - Main normalization logic
+- [ ] `pipeline/tests/test_stage1.py` - Unit tests
+- [ ] `migrations/005_create_reviews_tables.sql` - Schema migration
+
+### Database Schema Required
+```sql
+-- Must exist before Stage 1 can write
+CREATE TABLE reviews_raw (...);
+CREATE TABLE reviews_enriched (...);
+```
+
+### Implementation Checklist
+- [ ] Read from `jobs.reviews_data` where `status = 'completed'`
+- [ ] Filter out empty/null review texts
+- [ ] Normalize text (lowercase, whitespace, emoji)
+- [ ] Detect language (ISO 639-1)
+- [ ] Compute content hash (SHA256)
+- [ ] Check for duplicates within business
+- [ ] Insert into `reviews_raw` (immutable)
+- [ ] Insert stub into `reviews_enriched` (classification fields NULL)
+- [ ] Return `Stage1Output` with stats
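The normalization, hashing, and duplicate steps in the checklist above can be sketched as follows. This is one plausible reading, not the contract: it assumes NFKC normalization, hashing of the normalized text, and in-memory duplicate detection, and it leaves emoji untouched because the checklist does not say how to handle them. All function names are hypothetical.

```python
import hashlib
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, apply NFKC, replace control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Tabs, newlines, and other Cc characters become spaces before collapsing.
    text = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def content_hash(normalized: str) -> str:
    """SHA-256 of the normalized text as a 64-character hex string."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(texts: list) -> list:
    """Return (normalized_text, hash) pairs, skipping empty texts and repeats."""
    seen = set()
    out = []
    for raw in texts:
        normalized = normalize_text(raw or "")
        if not normalized:
            continue  # empty or whitespace-only review
        digest = content_hash(normalized)
        if digest in seen:
            continue  # duplicate within this batch
        seen.add(digest)
        out.append((normalized, digest))
    return out
```

In the real stage the duplicate check would query existing hashes for the business rather than a local set.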
+
+### Validation (run after implementation)
+```bash
+python -m pytest pipeline/tests/test_stage1.py -v
+```
+
+### Definition of Done
+- [ ] All V1.1-V1.6 validation rules pass
+- [ ] `reviews_raw` populated with immutable records
+- [ ] `reviews_enriched` has stubs ready for Stage 2
+- [ ] Integration test: Stage 0 output → Stage 1 input passes
+- [ ] No empty texts in output
+- [ ] Duplicate detection working
+
+---
+
+## Stage 2: LLM Classification ❌ TODO
+
+### Files to Create
+- [ ] `pipeline/stage2_classify.py` - LLM classification logic
+- [ ] `pipeline/llm_client.py` - LLM provider abstraction
+- [ ] `pipeline/span_extractor.py` - Span boundary detection
+- [ ] `pipeline/tests/test_stage2.py` - Unit tests
+- [ ] `migrations/006_create_spans_table.sql` - Schema migration
+- [ ] `migrations/007_create_urt_enums.sql` - ENUM types
+
+### Database Schema Required
+```sql
+-- ENUM types
+CREATE TYPE urt_valence AS ENUM (...);
+CREATE TYPE urt_intensity AS ENUM (...);
+-- ... all 12 ENUMs from v3.2 spec
+
+-- Spans table
+CREATE TABLE review_spans (...);
+```
+
+### Implementation Checklist
+- [ ] Query unclassified reviews from `reviews_enriched`
+- [ ] Build LLM prompt per `LLM-Classification-Contract-v1.md`
+- [ ] Call LLM API (support GPT-4o-mini, Claude-3-haiku)
+- [ ] Parse structured JSON response
+- [ ] Extract spans with character offsets
+- [ ] Validate span_text matches original text substring
+- [ ] Check spans don't overlap
+- [ ] Select primary span (I3 > I2 > I1, V- > V± > V0 > V+)
+- [ ] Generate embeddings (384-dim)
+- [ ] Compute trust score (0.2 floor)
+- [ ] Build USN string per profile
+- [ ] Update `reviews_enriched` with classification
+- [ ] Insert spans into `review_spans`
+- [ ] Return `Stage2Output` with stats
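The span checks and primary-span selection in the checklist above might look like the sketch below. It assumes half-open character offsets (`text[start:end]`), reads the ordering as intensity first and valence second, and breaks remaining ties by earliest start; the tie-break and the `Span` class are assumptions rather than anything stated in the contracts.

```python
from dataclasses import dataclass

# Lower rank wins: I3 > I2 > I1, then V- > V± > V0 > V+.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


@dataclass
class Span:
    start: int
    end: int
    text: str
    valence: str
    intensity: str


def validate_spans(review_text: str, spans: list) -> list:
    """Return the IDs of violated rules among V2.5, V2.6, and V2.7."""
    errors = []
    for s in spans:
        if s.end <= s.start:
            errors.append("V2.5")
        elif review_text[s.start:s.end] != s.text:
            errors.append("V2.6")
    ordered = sorted(spans, key=lambda s: s.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start < prev.end:
            errors.append("V2.7")
    return errors


def primary_index(spans: list) -> int:
    """Index of the primary span under the ranking described above."""
    return min(
        range(len(spans)),
        key=lambda i: (
            INTENSITY_RANK[spans[i].intensity],
            VALENCE_RANK[spans[i].valence],
            spans[i].start,
        ),
    )
```

Returning rule IDs instead of raising makes it easy to report every violation for a review in one pass.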
+
+### Validation (run after implementation)
+```bash
+python -m pytest pipeline/tests/test_stage2.py -v
+```
+
+### Definition of Done
+- [ ] All V2.1-V2.12 validation rules pass
+- [ ] LLM calls working with retry logic
+- [ ] Span offsets correct (text substring matches)
+- [ ] No overlapping spans
+- [ ] Exactly one primary span per review
+- [ ] Embeddings are 384-dim vectors
+- [ ] Trust scores clamped to [0.2, 1.0]
+- [ ] USN format valid per profile
+- [ ] Integration test: Stage 1 output → Stage 2 input passes
+
+---
+
+## Stage 3: Issue Routing ❌ TODO
+
+### Files to Create
+- [ ] `pipeline/stage3_route.py` - Issue routing logic
+- [ ] `pipeline/issue_manager.py` - Issue create/update logic
+- [ ] `pipeline/tests/test_stage3.py` - Unit tests
+- [ ] `migrations/008_create_issues_tables.sql` - Schema migration
+
+### Database Schema Required
+```sql
+CREATE TABLE issues (...);
+CREATE TABLE issue_spans (...);
+CREATE TABLE issue_events (...);
+```
+
+### Implementation Checklist
+- [ ] Query unrouted spans where `valence IN ('V-', 'V±')`
+- [ ] Generate deterministic `issue_id` from routing key
+- [ ] Check if issue exists
+- [ ] Create new issue OR update existing counters
+- [ ] Insert `issue_spans` link (enforce 1:1 with UNIQUE)
+- [ ] Log event to `issue_events`
+- [ ] Recalculate priority score
+- [ ] Return `Stage3Output` with stats
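A deterministic `issue_id` can be derived by hashing the routing key. The sketch below assumes SHA-256 truncated to 16 hex characters, which satisfies the `^ISS-[a-f0-9]{16}$` shape from rule V3.1; the hash choice and the routing-key composition in `make_routing_key` are hypothetical, and the contracts document remains the authority on both.

```python
import hashlib
import re

ISSUE_ID_PATTERN = re.compile(r"^ISS-[a-f0-9]{16}$")
ROUTABLE_VALENCES = {"V-", "V±"}


def make_routing_key(business_id: str, place_id: str, urt_primary: str) -> str:
    # Hypothetical composition; the real key fields come from the contracts doc.
    return "|".join((business_id, place_id, urt_primary))


def make_issue_id(routing_key: str) -> str:
    """Same routing key always yields the same ID."""
    if not routing_key:
        raise ValueError("routing_key must be non-empty (V3.2)")
    digest = hashlib.sha256(routing_key.encode("utf-8")).hexdigest()
    return f"ISS-{digest[:16]}"


def is_routable(valence: str) -> bool:
    """Only negative and mixed spans create or update issues (V3.5)."""
    return valence in ROUTABLE_VALENCES
```

Because the ID is a pure function of the key, two workers routing the same span concurrently would target the same issue row, which pairs naturally with the UNIQUE constraint on `issue_spans`.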
+
+### Validation (run after implementation)
+```bash
+python -m pytest pipeline/tests/test_stage3.py -v
+```
+
+### Definition of Done
+- [ ] All V3.1-V3.5 validation rules pass
+- [ ] Issue IDs are deterministic (same key = same ID)
+- [ ] 1:1 span-to-issue mapping enforced
+- [ ] Only V-/V± spans create issues
+- [ ] Issue counters updated correctly
+- [ ] Events logged for audit
+- [ ] Integration test: Stage 2 output → Stage 3 input passes
+
+---
+
+## Stage 4: Fact Aggregation ❌ TODO
+
+### Files to Create
+- [ ] `pipeline/stage4_aggregate.py` - Fact aggregation logic
+- [ ] `pipeline/tests/test_stage4.py` - Unit tests
+- [ ] `migrations/009_create_facts_table.sql` - Schema migration
+
+### Database Schema Required
+```sql
+CREATE TABLE fact_timeseries (...);
+```
+
+### Implementation Checklist
+- [ ] Accept business_id, date, bucket_types
+- [ ] Query spans joined with reviews for the period
+- [ ] Aggregate by URT code (per location + 'ALL' rollup)
+- [ ] Compute: review_count, span_count, valence counts
+- [ ] Compute: strength_score, negative_strength, positive_strength
+- [ ] Compute: intensity distribution (I1/I2/I3)
+- [ ] Compute: CR counts (better/worse/same)
+- [ ] Compute: trust-weighted metrics
+- [ ] UPSERT into `fact_timeseries`
+- [ ] Return `Stage4Output` with stats
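The counting part of the checklist above can be sketched in memory as follows. It covers only review, span, valence, and intensity counts per URT code; strength scores, CR counts, trust weighting, and the 'ALL' rollup are omitted because their formulas are not given here. `ClassifiedSpan` and `aggregate` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedSpan:
    review_id: str
    urt_code: str
    valence: str
    intensity: str


def aggregate(spans: list) -> dict:
    """Group spans by URT code and count reviews, spans, valence, and intensity."""
    facts = {}
    for code in sorted({s.urt_code for s in spans}):
        group = [s for s in spans if s.urt_code == code]
        facts[code] = {
            "review_count": len({s.review_id for s in group}),
            "span_count": len(group),
            "valence_counts": dict(Counter(s.valence for s in group)),
            "intensity_counts": dict(Counter(s.intensity for s in group)),
        }
    return facts
```

Since the result depends only on the input spans, recomputing a period and upserting it yields the same rows, which is the idempotency property the Definition of Done asks for.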
+
+### Validation (run after implementation)
+```bash
+python -m pytest pipeline/tests/test_stage4.py -v
+```
+
+### Definition of Done
+- [ ] All V4.1-V4.7 validation rules pass
+- [ ] Valence counts sum to span_count
+- [ ] Intensity counts sum to span_count
+- [ ] 'ALL' rollup includes owned locations only
+- [ ] Facts are idempotent (re-run produces same result)
+- [ ] Integration test: Full pipeline E2E passes
+
+---
+
+## Integration Tests
+
+### Handoff Tests (run after each stage)
+```bash
+# Stage 0 → 1
+python -m pytest pipeline/tests/integration/test_stage0_to_1.py
+
+# Stage 1 → 2
+python -m pytest pipeline/tests/integration/test_stage1_to_2.py
+
+# Stage 2 → 3
+python -m pytest pipeline/tests/integration/test_stage2_to_3.py
+
+# Full E2E
+python -m pytest pipeline/tests/integration/test_e2e.py
+```
+
+### E2E Validation Command
+```bash
+# Run full pipeline validation
+python -m pipeline.validate --job-id <JOB_ID> --verbose
+```
+
+Expected output:
+```
+Stage 0: ✅ PASS (5/5 rules)
+Stage 1: ✅ PASS (6/6 rules)
+Stage 2: ✅ PASS (12/12 rules)
+Stage 3: ✅ PASS (5/5 rules)
+Stage 4: ✅ PASS (7/7 rules)
+
+E2E Integration: ✅ PASS
+- Reviews scraped: 47
+- Reviews normalized: 45 (2 empty filtered)
+- Spans extracted: 127
+- Issues created: 23
+- Facts written: 156
+```
+
+---
+
+## Quick Reference: Validation Rules
+
+### Stage 1
+| Rule | Description |
+|------|-------------|
+| V1.1 | `text` is non-empty |
+| V1.2 | `text_normalized` has no control chars |
+| V1.3 | `content_hash` is 64-char hex |
+| V1.4 | `review_version` >= 1 |
+| V1.5 | `text_language` is valid ISO 639-1 |
+| V1.6 | `raw_id` references valid row |
+
+### Stage 2
+| Rule | Description |
+|------|-------------|
+| V2.1 | `urt_primary` matches `^[OPJEAVR][1-4]\.[0-9]{2}$` |
+| V2.2 | `urt_secondary` max 2 elements |
+| V2.3 | `valence` is valid enum |
+| V2.4 | `intensity` is valid enum |
+| V2.5 | `span_end > span_start` |
+| V2.6 | `span_text == text[start:end]` |
+| V2.7 | Spans don't overlap |
+| V2.8 | Exactly one `is_primary = true` |
+| V2.9 | `trust_score` in [0.2, 1.0] |
+| V2.10 | `embedding` is 384-dim |
+| V2.11 | `usn` matches profile regex |
+| V2.12 | `related_span_index` valid if set |
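Several of the Stage 2 rules above are cheap to check on a single enriched row. A minimal sketch covering V2.1, V2.2, V2.9, and V2.10 follows, assuming the row arrives as a plain dict keyed by the column names in the table; `stage2_review_violations` and `clamp_trust` are hypothetical helpers.

```python
import re

URT_CODE = re.compile(r"^[OPJEAVR][1-4]\.[0-9]{2}$")
EMBEDDING_DIM = 384


def stage2_review_violations(row: dict) -> list:
    """Return the IDs of the failed rules among V2.1, V2.2, V2.9, V2.10."""
    failed = []
    if not URT_CODE.match(row.get("urt_primary") or ""):
        failed.append("V2.1")
    if len(row.get("urt_secondary") or []) > 2:
        failed.append("V2.2")
    trust = row.get("trust_score")
    if not isinstance(trust, (int, float)) or not 0.2 <= trust <= 1.0:
        failed.append("V2.9")
    if len(row.get("embedding") or []) != EMBEDDING_DIM:
        failed.append("V2.10")
    return failed


def clamp_trust(raw_score: float) -> float:
    """Apply the 0.2 floor and 1.0 ceiling before storing a trust score."""
    return max(0.2, min(1.0, raw_score))
```

Checks that need the database or the full span list, such as V2.6 to V2.8, would live in a separate validator.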
+
+### Stage 3
+| Rule | Description |
+|------|-------------|
+| V3.1 | `issue_id` matches `^ISS-[a-f0-9]{16}$` |
+| V3.2 | `routing_key` non-empty |
+| V3.3 | Span not already linked elsewhere |
+| V3.4 | Issue exists in `issues` table |
+| V3.5 | Only V-/V± spans routed |
+
+### Stage 4
+| Rule | Description |
+|------|-------------|
+| V4.1 | `place_id` valid or 'ALL' |
+| V4.2 | `period_date` matches bucket |
+| V4.3 | `span_count >= review_count` |
+| V4.4 | Valence counts sum correctly |
+| V4.5 | Intensity counts sum correctly |
+| V4.6 | `strength_score >= 0` |
+| V4.7 | `avg_rating` in [1.0, 5.0] or NULL |
+
+---
+
+## Migration Execution Order
+
+```bash
+# Run in sequence (if 006 references the ENUM types created in 007, apply 007 first)
+psql $DATABASE_URL -f migrations/005_create_reviews_tables.sql
+psql $DATABASE_URL -f migrations/006_create_spans_table.sql
+psql $DATABASE_URL -f migrations/007_create_urt_enums.sql
+psql $DATABASE_URL -f migrations/008_create_issues_tables.sql
+psql $DATABASE_URL -f migrations/009_create_facts_table.sql
+```
+
+---
+
+## Environment Variables Required
+
+```bash
+# LLM Provider (Stage 2)
+OPENAI_API_KEY=sk-...
+# OR
+ANTHROPIC_API_KEY=sk-ant-...
+
+# Embedding Model (Stage 2)
+EMBEDDING_MODEL=all-MiniLM-L6-v2
+
+# Database
+DATABASE_URL=postgresql://...
+
+# Taxonomy
+DEFAULT_TAXONOMY_VERSION=v5.1
+```
+
+---
+
+## File Structure After Implementation
+
+```
+pipeline/
+├── __init__.py
+├── stage1_normalize.py      # ❌ TODO
+├── stage2_classify.py       # ❌ TODO
+├── stage3_route.py          # ❌ TODO
+├── stage4_aggregate.py      # ❌ TODO
+├── llm_client.py            # ❌ TODO
+├── span_extractor.py        # ❌ TODO
+├── issue_manager.py         # ❌ TODO
+├── validate.py              # ❌ TODO (CLI validator)
+├── contracts.py             # ❌ TODO (TypedDict definitions)
+└── tests/
+    ├── __init__.py
+    ├── test_stage1.py       # ❌ TODO
+    ├── test_stage2.py       # ❌ TODO
+    ├── test_stage3.py       # ❌ TODO
+    ├── test_stage4.py       # ❌ TODO
+    ├── fixtures/
+    │   ├── stage0_output.json
+    │   ├── stage1_output.json
+    │   ├── stage2_output.json
+    │   ├── stage3_output.json
+    │   └── stage4_output.json
+    └── integration/
+        ├── test_stage0_to_1.py
+        ├── test_stage1_to_2.py
+        ├── test_stage2_to_3.py
+        └── test_e2e.py
+
+migrations/
+├── 001_add_job_platform_fields.sql    # ✅ EXISTS
+├── 002_create_batches_table.sql       # ✅ EXISTS
+├── 003_create_scraper_registry.sql    # ✅ EXISTS
+├── 004_create_api_keys.sql            # ✅ EXISTS
+├── 005_create_reviews_tables.sql      # ❌ TODO
+├── 006_create_spans_table.sql         # ❌ TODO
+├── 007_create_urt_enums.sql           # ❌ TODO
+├── 008_create_issues_tables.sql       # ❌ TODO
+└── 009_create_facts_table.sql         # ❌ TODO
+```
+
+---
+
+*Last Updated: 2026-01-24*
--- a/.artifacts/ReviewIQ-Pipeline-Contracts-v1.md
+++ b/.artifacts/ReviewIQ-Pipeline-Contracts-v1.md
--- a/.artifacts/ReviewIQ-Pipeline-DevGuide.md
+++ b/.artifacts/ReviewIQ-Pipeline-DevGuide.md
@@ -0,0 +1,312 @@
+# ReviewIQ Pipeline Development Guide
+
+**Purpose**: Entry point for agents implementing the enrichment pipeline
+**Last Updated**: 2026-01-24
+
+---
+
+## TL;DR - Current State
+
+**Pipeline Implementation: ~55% complete**
+
+```
+✅ WORKING                          ❌ NOT IMPLEMENTED
+──────────                          ──────────────────
+Google Maps scraping                Stage 1: Normalization
+Job orchestration                   Stage 2: LLM Classification
+Chrome worker pool                  Stage 3: Issue Routing
+Webhook delivery                    Stage 4: Fact Aggregation
+SSE streaming                       Enrichment database schema
+Frontend (job management)           Advanced analytics UI
+```
+
+**Estimated effort to 100%**: 6-8 weeks
+
+---
+
+## Cold Start Instructions
+
+A new agent should:
+
+| Step | Action | Time |
+|------|--------|------|
+| 1 | Read this file (`ReviewIQ-Pipeline-DevGuide.md`) | 2 min |
+| 2 | Read `ReviewIQ-v32-Decisions.md` | 5 min |
+| 3 | Read `ReviewIQ-Codebase-Overview.md` | 10 min |
+| 4 | Read assigned stage in `ReviewIQ-Pipeline-Contracts-v1.md` | 15 min |
+| 5 | Use `ReviewIQ-Pipeline-Checklist.md` to verify completion | Reference |
+
+---
+
+## Document Map
+
+```
+                       ┌─────────────────────────────────────┐
+                       │  ReviewIQ-Pipeline-DevGuide.md      │
+                       │         (YOU ARE HERE)              │
+                       └─────────────────┬───────────────────┘
+                                         │
+           ┌─────────────────────────────┼─────────────────────────────┐
+           │                             │                             │
+           ▼                             ▼                             ▼
+┌─────────────────────┐    ┌─────────────────────────┐    ┌─────────────────────┐
+│ CONTEXT RECOVERY    │    │    IMPLEMENTATION       │    │    REFERENCE        │
+├─────────────────────┤    ├─────────────────────────┤    ├─────────────────────┤
+│                     │    │                         │    │                     │
+│ ReviewIQ-v32-       │    │ Pipeline-Contracts-v1   │    │ Architecture-v3.2   │
+│ Decisions.md        │    │ (I/O specs, validation) │    │ (full DDL spec)     │
+│ (key decisions,     │    │                         │    │                     │
+│ markpoint)          │    │ Pipeline-Checklist      │    │ v3.2.1-Taxonomy-    │
+│                     │    │ (implementation tasks)  │    │ Versioning          │
+│ Codebase-Overview   │    │                         │    │ (versioning spec)   │
+│ (file structure,    │    │ LLM-Classification-     │    │                     │
+│ integration points) │    │ Contract-v1             │    │ URT-v5.1-Reference  │
+│                     │    │ (prompt engineering)    │    │ (dimension codes)   │
+└─────────────────────┘    └─────────────────────────┘    └─────────────────────┘
+```
+
+---
+
+## Core Documents
+
+### Context & Status (Read First)
+
+| File | Purpose | Est. Read Time |
+|------|---------|----------------|
+| `ReviewIQ-Pipeline-DevGuide.md` | Entry point, document map | 2 min |
+| `ReviewIQ-v32-Decisions.md` | Key decisions, current markpoint | 5 min |
+| `ReviewIQ-Codebase-Overview.md` | File structure, what code exists, integration points | 10 min |
+
+### Implementation Guides (For Building)
+
+| File | Purpose | Est. Read Time |
+|------|---------|----------------|
+| `ReviewIQ-Pipeline-Contracts-v1.md` | Stage I/O specs, validation rules, test fixtures | 15 min |
+| `ReviewIQ-Pipeline-Checklist.md` | Per-stage implementation checklist, definition of done | 5 min |
+| `LLM-Classification-Contract-v1.md` | LLM prompt engineering spec (Stage 2) | 10 min |
+
+### Full Specifications (Reference)
+
+| File | Purpose | When to Read |
+|------|---------|--------------|
+| `ReviewIQ-Architecture-v3.2.md` | Complete v3.2 spec with DDL | Schema details |
+| `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` | Taxonomy versioning addendum | Future-proofing |
+| `URT-v5.1-Reference.md` | URT dimension codes reference | Classification reference |
+
+### Legacy (Superseded - Reference Only)
+
+| File | Note |
+|------|------|
+| `ReviewIQ-Architecture-v2.md` | Superseded by v3.2 |
+| `ReviewIQ-Architecture-v3.md` | Superseded by v3.2 |
+| `ReviewIQ-Architecture-v3.1.md` | Superseded by v3.2 |
+| `CONTEXT-KEEPER.md` | Use `ReviewIQ-v32-Decisions.md` instead |
+
+---
+
+## What's Captured in Artifacts
+
+| Context | Document |
+|---------|----------|
+| Key architectural decisions | `ReviewIQ-v32-Decisions.md` |
+| Current implementation status (~55%) | `ReviewIQ-Codebase-Overview.md` |
+| Existing file structure | `ReviewIQ-Codebase-Overview.md` |
+| Integration points (where new code connects) | `ReviewIQ-Codebase-Overview.md` |
+| Stage input/output contracts | `ReviewIQ-Pipeline-Contracts-v1.md` |
+| Validation rules (35 total across stages) | `ReviewIQ-Pipeline-Contracts-v1.md` |
+| Test fixtures (5 sample JSON payloads) | `ReviewIQ-Pipeline-Contracts-v1.md` |
+| Implementation checklists | `ReviewIQ-Pipeline-Checklist.md` |
+| Definition of done per stage | `ReviewIQ-Pipeline-Checklist.md` |
+| LLM prompt specification | `LLM-Classification-Contract-v1.md` |
+| URT taxonomy codes | `URT-v5.1-Reference.md` |
+| Full database DDL | `ReviewIQ-Architecture-v3.2.md` |
+| Taxonomy versioning schema | `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` |
+
+---
+
+## Pipeline Stages
+
+| Stage | Name | Status | Contract Section | Validation Rules |
+|-------|------|--------|------------------|------------------|
+| 0 | Raw Ingestion | ✅ Done | Pipeline-Contracts § Stage 0 | V0.1-V0.5 |
+| 1 | Normalization | ❌ TODO | Pipeline-Contracts § Stage 1 | V1.1-V1.6 |
+| 2 | LLM Classification | ❌ TODO | Pipeline-Contracts § Stage 2 | V2.1-V2.12 |
+| 3 | Issue Routing | ❌ TODO | Pipeline-Contracts § Stage 3 | V3.1-V3.5 |
+| 4 | Fact Aggregation | ❌ TODO | Pipeline-Contracts § Stage 4 | V4.1-V4.7 |
+
+---
+
+## Parallel Development Assignment
+
+### Agent 1 - Stage 1 (Normalization)
+```
+Read:
+  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 1
+  - ReviewIQ-Codebase-Overview.md (integration points)
+
+Create:
+  - pipeline/stage1_normalize.py
+  - migrations/005_create_reviews_tables.sql
+  - pipeline/tests/test_stage1.py
+
+Validate:
+  - V1.1-V1.6 rules pass
+  - Integration test: Stage 0 → Stage 1 passes
+```
+
+### Agent 2 - Stage 2 (LLM Classification)
+```
+Read:
+  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 2
+  - LLM-Classification-Contract-v1.md
+  - URT-v5.1-Reference.md
+
+Create:
+  - pipeline/stage2_classify.py
+  - pipeline/llm_client.py
+  - pipeline/span_extractor.py
+  - migrations/006_create_spans_table.sql
+  - migrations/007_create_urt_enums.sql
+  - pipeline/tests/test_stage2.py
+
+Validate:
+  - V2.1-V2.12 rules pass
+  - Integration test: Stage 1 → Stage 2 passes
+```
+
+### Agent 3 - Stage 3 (Issue Routing)
+```
+Read:
+  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 3
+  - ReviewIQ-Architecture-v3.2.md § Part 5 (issue lifecycle)
+
+Create:
+  - pipeline/stage3_route.py
+  - pipeline/issue_manager.py
+  - migrations/008_create_issues_tables.sql
+  - pipeline/tests/test_stage3.py
+
+Validate:
+  - V3.1-V3.5 rules pass
+  - Integration test: Stage 2 → Stage 3 passes
+```
+
+### Agent 4 - Stage 4 (Fact Aggregation)
+```
+Read:
+  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 4
+  - ReviewIQ-Architecture-v3.2.md § Part 6 (analytics)
+
+Create:
+  - pipeline/stage4_aggregate.py
+  - migrations/009_create_facts_table.sql
+  - pipeline/tests/test_stage4.py
+
+Validate:
+  - V4.1-V4.7 rules pass
+  - E2E pipeline test passes
+```
+
+---
+
+## Success Criteria
+
+Pipeline is complete when:
+
+```bash
+# Set JOB_ID to the job being validated
+python -m pipeline.validate --job-id "$JOB_ID" --verbose
+
+# Expected output:
+# Stage 0: ✅ PASS (5/5 rules)
+# Stage 1: ✅ PASS (6/6 rules)
+# Stage 2: ✅ PASS (12/12 rules)
+# Stage 3: ✅ PASS (5/5 rules)
+# Stage 4: ✅ PASS (7/7 rules)
+# E2E Integration: ✅ PASS
+```
+
+---
+
+## Quick Commands
+
+```bash
+# Check current branch
+git branch --show-current
+# Expected: feature/platform-restructure
+
+# View recent commits
+git log --oneline -5
+
+# Start database
+docker-compose -f docker-compose.production.yml up -d postgres
+
+# Run API server
+python api_server_production.py
+
+# Run frontend
+cd frontend && npm run dev
+
+# Run migrations (when created; repeat for 006-009)
+psql "$DATABASE_URL" -f migrations/005_create_reviews_tables.sql
+
+# Run tests
+pytest pipeline/tests/ -v
+
+# Validate pipeline (set JOB_ID to the job being validated)
+python -m pipeline.validate --job-id "$JOB_ID"
+```
+
+---
+
+## Environment Variables
+
+```bash
+# Database (required)
+DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq
+
+# LLM Provider (Stage 2)
+OPENAI_API_KEY=sk-...
+# OR
+ANTHROPIC_API_KEY=sk-ant-...
+
+# Embedding model (Stage 2)
+EMBEDDING_MODEL=all-MiniLM-L6-v2
+
+# Taxonomy version
+DEFAULT_TAXONOMY_VERSION=v5.1
+```
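
A pipeline entry point could read these variables once at startup and fail fast when something is missing. The following is a minimal sketch, not existing code: `PipelineConfig` and `load_config` are hypothetical names. It also assumes that the example values shown for `EMBEDDING_MODEL` and `DEFAULT_TAXONOMY_VERSION` double as defaults, that OpenAI takes precedence when both keys are set, and that an LLM key is always required (the guide only ties it to Stage 2).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineConfig:
    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Build a config from environment variables, failing fast on gaps."""
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")

    # Assumption: if both keys are set, OpenAI wins; the guide only says "OR".
    if env.get("OPENAI_API_KEY"):
        provider, key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")

    return PipelineConfig(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=key,
        # Assumption: the example values in the guide are used as defaults.
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )


# Usage with an explicit mapping instead of the real process environment.
config = load_config(
    {
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/reviewiq",
        "ANTHROPIC_API_KEY": "sk-ant-example",
    }
)
print(config.llm_provider, config.taxonomy_version)  # anthropic v5.1
```

Passing the environment in as a mapping keeps the loader testable without touching the real process environment.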
+
+---
+
+## File Structure After Implementation
+
+```
+google-reviews-scraper-pro/
+├── .artifacts/                    # ← Design documents
+│   ├── ReviewIQ-Pipeline-DevGuide.md  # ← START HERE (for pipeline work)
+│   ├── ReviewIQ-v32-Decisions.md
+│   ├── ReviewIQ-Codebase-Overview.md
+│   ├── ReviewIQ-Pipeline-Contracts-v1.md
+│   ├── ReviewIQ-Pipeline-Checklist.md
+│   └── ...
+│
+├── api_server_production.py       # ✅ Exists - Main API
+├── core/database.py               # ✅ Exists - DB layer
+├── scrapers/google_reviews/       # ✅ Exists - Scraper
+│
+├── pipeline/                      # ❌ TO CREATE
+│   ├── stage1_normalize.py
+│   ├── stage2_classify.py
+│   ├── stage3_route.py
+│   ├── stage4_aggregate.py
+│   ├── llm_client.py
+│   ├── span_extractor.py
+│   ├── issue_manager.py
+│   └── tests/
+│
+└── migrations/
+    ├── 001-004                    # ✅ Exists
+    └── 005-009                    # ❌ TO CREATE
+```
+
+---
+
+*Keep this guide updated when adding new artifacts or completing stages.*
--- a/.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md
+++ b/.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md
--- a/.artifacts/ReviewIQ-v32-Decisions.md
+++ b/.artifacts/ReviewIQ-v32-Decisions.md
@@ -7,9 +7,18 @@
 ## 1. Markpoint

 ```
-ID:       reviewiq-v32-span-layer-2026-01-24-001
-Status:   v3.2 span layer complete
-Based on: v3.1.2 (commit f998277)
+ID:       reviewiq-v32-span-layer-2026-01-24-004
+Status:   Pipeline contracts defined, ready for parallel implementation
+Based on: v3.2 (commit 43fd151)
+
+START HERE: ReviewIQ-Pipeline-DevGuide.md (for pipeline implementation)
+
+Key Documents:
+  - ReviewIQ-Pipeline-DevGuide.md      (entry point for pipeline work)
+  - ReviewIQ-Codebase-Overview.md      (file structure, what exists)
+  - ReviewIQ-Pipeline-Contracts-v1.md  (stage I/O contracts, validation)
+  - ReviewIQ-Pipeline-Checklist.md     (implementation checklist)
+  - ReviewIQ-v3.2.1-Taxonomy-Versioning.md (taxonomy versioning spec)
 ```

 ---
@@ -152,6 +161,98 @@ Full:     URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
 | Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
 | Reprocessing strategy? | **Soft-switch** with is_active flag |
 | TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
+| Taxonomy evolution tracking? | **Yes** — versioned codes with explicit mappings (v3.2.1) |
+| B2 schema vs v3.2 divergence? | **Documented** — B2 is canonical URT, v3.2 is app layer |
+| Taxonomy versioning? | **Yes** — `taxonomy_version` column on spans, versioned code tables |
+
+---
+
+## 13. B2 Schema Audit Findings
+
+**Audit Date**: 2026-01-24
+
+`B2-database-schema.sql` (canonical URT v5.1) and the ReviewIQ v3.2 spec diverge deliberately:
+
+| Aspect | B2 (URT v5.1) | v3.2 (ReviewIQ) | Resolution |
+|--------|---------------|-----------------|------------|
+| Purpose | Source-agnostic taxonomy | Google Reviews app layer | Keep both |
+| ID strategy | UUIDs + sequential | Deterministic SHA256 | v3.2 choice |
+| Type safety | VARCHAR + CHECK | Postgres ENUMs | v3.2 choice |
+| Span table | `spans` | `review_spans` | v3.2 naming |
+| Offset columns | `char_start/char_end` | `span_start/span_end` | Document divergence |
+| Tenant model | Single-tenant | Multi-tenant (business_id) | v3.2 requirement |
+| Issue-span mapping | Many-to-many | One-to-one | v3.2 choice |
+| Causal chain | Normalized table | JSONB column | v3.2 flexibility |
+| Reprocessing | Not supported | Soft-switch pattern | v3.2 innovation |
+
+**Action Items**:
+1. Import reference data (domains, categories, subcodes) from B2 INSERTs
+2. Seed `urt_codes` / `urt_codes_versioned` from B1-urt-codes.yaml
+3. Do NOT adopt B2 structure directly — v3.2 has specific app requirements
+
+---
+
+## 14. Taxonomy Versioning (v3.2.1)
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Track taxonomy version | Required column on spans | Classifications only meaningful in version context |
+| Version ID format | `v{major}.{minor}` | Human-readable, matches URT releases |
+| Code FK strategy | Composite `(code, version_id)` | Prevents orphaned classifications |
+| Cross-version mappings | Explicit mapping table | Enables normalized trend queries |
+| Mapping direction | Forward only (old→new) | Simpler model, matches time flow |
+| Default version | `'v5.1'` hardcoded | Safe baseline, explicit upgrade path |
+| Fact table versioning | Per-row `taxonomy_version` | Enables version-specific aggregation |
+
+**Key Tables Added**:
+- `urt_taxonomy_versions` — Version registry with validity periods
+- `urt_codes_versioned` — Full code definitions per version (SCD Type 2)
+- `urt_code_mappings` — Cross-version translation rules
+
+**Key Functions Added**:
+- `translate_urt_code(code, from_version, to_version)` — Single code translation
+- `get_code_lineage(code, version)` — Full historical lineage
+- `detect_taxonomy_drift(from_version, to_version)` — Impact analysis
+- `aggregate_spans_normalized(...)` — Version-normalized aggregation
+
+**Principle**: Facts are immutable. A span classified as `J1.01` in v5.1 stays that way forever. Translation is explicit and auditable.
+
+See: `.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md`
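
The forward-only mapping decision and the immutability principle can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the `translate_urt_code` function from the spec: `translate_code` is a Python stand-in, `CODE_MAPPINGS` stands in for rows of `urt_code_mappings`, and the `v5.2` version and `J1.05` code are invented purely for illustration. It also assumes that a code with no mapping row has no translation, which the spec may define differently.

```python
import re
from typing import Dict, Optional, Tuple

# Version IDs follow the v{major}.{minor} format.
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)$")

# Hypothetical mapping rows: (code, from_version) -> (new_code, to_version).
# "v5.2" and "J1.05" are invented for illustration only.
CODE_MAPPINGS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("J1.01", "v5.1"): ("J1.05", "v5.2"),
}


def parse_version(version_id: str) -> Tuple[int, int]:
    """Turn 'v5.1' into (5, 1) so versions compare numerically."""
    match = VERSION_RE.match(version_id)
    if match is None:
        raise ValueError(f"bad version id: {version_id!r}")
    return int(match.group(1)), int(match.group(2))


def translate_code(code: str, from_version: str, to_version: str) -> Optional[str]:
    """Follow forward mappings until to_version is reached.

    Returns None when no mapping path exists. The stored classification is
    never modified; the translated code is only returned to the caller.
    """
    target = parse_version(to_version)
    if parse_version(from_version) > target:
        raise ValueError("mappings are forward only (old -> new)")

    current_code, current_version = code, from_version
    while current_version != to_version:
        step = CODE_MAPPINGS.get((current_code, current_version))
        if step is None:
            return None
        next_code, next_version = step
        # Guard against cycles and against overshooting the target version.
        if not parse_version(current_version) < parse_version(next_version) <= target:
            return None
        current_code, current_version = next_code, next_version
    return current_code


print(translate_code("J1.01", "v5.1", "v5.2"))  # J1.05
print(translate_code("J1.01", "v5.1", "v5.1"))  # J1.01
```

Because translation happens at read time, a span stored as `J1.01` under v5.1 keeps that value, and trend queries can still compare it against newer versions.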
+
+---
+
+## 15. Pipeline Implementation Status
+
+**Overall: ~55% Complete** (as of 2026-01-24)
+
+| Stage | Name | Status | Owner |
+|-------|------|--------|-------|
+| 0 | Raw Ingestion | ✅ DONE | Scraper Team |
+| 1 | Normalization | ❌ TODO | TBD |
+| 2 | LLM Classification | ❌ TODO | TBD |
+| 3 | Issue Routing | ❌ TODO | TBD |
+| 4 | Fact Aggregation | ❌ TODO | TBD |
+
+**What's Working**:
+- Google Maps scraping (v1.0.0)
+- Job orchestration & queuing
+- Webhook delivery
+- Frontend job management
+- Real-time SSE streaming
+
+**What's Missing**:
+- Entire enrichment pipeline (Stages 1-4)
+- LLM integration
+- Span extraction
+- Issue routing
+- Analytics aggregation
+
+**Parallel Development**:
+Each stage can be implemented independently using the contracts defined in:
+- `ReviewIQ-Pipeline-Contracts-v1.md` — Full I/O specs, validation rules, test fixtures
+- `ReviewIQ-Pipeline-Checklist.md` — Implementation checklist, definition of done
+
+**Estimated Effort to 100%**: 6-8 weeks

 ---

@@ -180,4 +281,4 @@ GREATEST(0.2, base_trust * modifiers)  -- Floor prevents collapse

 ---

-*Last updated: 2026-01-24*
+*Last updated: 2026-01-24 (pipeline contracts + codebase overview)*