# ReviewIQ Pipeline Development Guide

**Purpose**: Entry point for agents implementing the enrichment pipeline
**Last Updated**: 2026-01-24

---

## TL;DR - Current State

**Pipeline Implementation: ~55% complete**

```
✅ WORKING                          ❌ NOT IMPLEMENTED
──────────                          ──────────────────
Google Maps scraping                Stage 1: Normalization
Job orchestration                   Stage 2: LLM Classification
Chrome worker pool                  Stage 3: Issue Routing
Webhook delivery                    Stage 4: Fact Aggregation
SSE streaming                       Enrichment database schema
Frontend (job management)           Advanced analytics UI
```

**Estimated effort to 100%**: 6-8 weeks

---

## Cold Start Instructions

A new agent should work through these steps in order:

| Step | Action | Time |
|------|--------|------|
| 1 | Read this file (`ReviewIQ-Pipeline-DevGuide.md`) | 2 min |
| 2 | Read `ReviewIQ-v32-Decisions.md` | 5 min |
| 3 | Read `ReviewIQ-Codebase-Overview.md` | 10 min |
| 4 | Read assigned stage in `ReviewIQ-Pipeline-Contracts-v1.md` | 15 min |
| 5 | Use `ReviewIQ-Pipeline-Checklist.md` to verify completion | Reference |

---

## Document Map

```
                       ┌─────────────────────────────────────┐
                       │  ReviewIQ-Pipeline-DevGuide.md      │
                       │         (YOU ARE HERE)              │
                       └─────────────────┬───────────────────┘
                                         │
           ┌─────────────────────────────┼─────────────────────────────┐
           │                             │                             │
           ▼                             ▼                             ▼
┌─────────────────────┐    ┌─────────────────────────┐    ┌─────────────────────┐
│ CONTEXT RECOVERY    │    │    IMPLEMENTATION       │    │    REFERENCE        │
├─────────────────────┤    ├─────────────────────────┤    ├─────────────────────┤
│                     │    │                         │    │                     │
│ ReviewIQ-v32-       │    │ Pipeline-Contracts-v1   │    │ Architecture-v3.2   │
│ Decisions.md        │    │ (I/O specs, validation) │    │ (full DDL spec)     │
│ (key decisions,     │    │                         │    │                     │
│ markpoint)          │    │ Pipeline-Checklist      │    │ v3.2.1-Taxonomy-    │
│                     │    │ (implementation tasks)  │    │ Versioning          │
│ Codebase-Overview   │    │                         │    │ (versioning spec)   │
│ (file structure,    │    │ LLM-Classification-     │    │                     │
│ integration points) │    │ Contract-v1             │    │ URT-v5.1-Reference  │
│                     │    │ (prompt engineering)    │    │ (dimension codes)   │
└─────────────────────┘    └─────────────────────────┘    └─────────────────────┘
```

---

## Core Documents

### Context & Status (Read First)

| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-DevGuide.md` | Entry point, document map | 2 min |
| `ReviewIQ-v32-Decisions.md` | Key decisions, current markpoint | 5 min |
| `ReviewIQ-Codebase-Overview.md` | File structure, what code exists, integration points | 10 min |

### Implementation Guides (For Building)

| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-Contracts-v1.md` | Stage I/O specs, validation rules, test fixtures | 15 min |
| `ReviewIQ-Pipeline-Checklist.md` | Per-stage implementation checklist, definition of done | 5 min |
| `LLM-Classification-Contract-v1.md` | LLM prompt engineering spec (Stage 2) | 10 min |

### Full Specifications (Reference)

| File | Purpose | When to Read |
|------|---------|--------------|
| `ReviewIQ-Architecture-v3.2.md` | Complete v3.2 spec with DDL | Schema details |
| `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` | Taxonomy versioning addendum | Future-proofing |
| `URT-v5.1-Reference.md` | URT dimension codes reference | Classification reference |

### Legacy (Superseded - Reference Only)

| File | Note |
|------|------|
| `ReviewIQ-Architecture-v2.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.1.md` | Superseded by v3.2 |
| `CONTEXT-KEEPER.md` | Use `ReviewIQ-v32-Decisions.md` instead |

---

## What's Captured in Artifacts

| Context | Document |
|---------|----------|
| Key architectural decisions | `ReviewIQ-v32-Decisions.md` |
| Current implementation status (~55%) | `ReviewIQ-Codebase-Overview.md` |
| Existing file structure | `ReviewIQ-Codebase-Overview.md` |
| Integration points (where new code connects) | `ReviewIQ-Codebase-Overview.md` |
| Stage input/output contracts | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Validation rules (35 total across stages) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Test fixtures (5 sample JSON payloads) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Implementation checklists | `ReviewIQ-Pipeline-Checklist.md` |
| Definition of done per stage | `ReviewIQ-Pipeline-Checklist.md` |
| LLM prompt specification | `LLM-Classification-Contract-v1.md` |
| URT taxonomy codes | `URT-v5.1-Reference.md` |
| Full database DDL | `ReviewIQ-Architecture-v3.2.md` |
| Taxonomy versioning schema | `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` |

---

## Pipeline Stages

| Stage | Name | Status | Contract Section | Validation Rules |
|-------|------|--------|------------------|------------------|
| 0 | Raw Ingestion | ✅ Done | Pipeline-Contracts § Stage 0 | V0.1-V0.5 |
| 1 | Normalization | ❌ TODO | Pipeline-Contracts § Stage 1 | V1.1-V1.6 |
| 2 | LLM Classification | ❌ TODO | Pipeline-Contracts § Stage 2 | V2.1-V2.12 |
| 3 | Issue Routing | ❌ TODO | Pipeline-Contracts § Stage 3 | V3.1-V3.5 |
| 4 | Fact Aggregation | ❌ TODO | Pipeline-Contracts § Stage 4 | V4.1-V4.7 |
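The rule ranges in the table can be expanded into individual rule IDs, which is handy for test parametrization or for checking coverage against the 35-rule total. The following is a minimal sketch; `STAGE_RULE_COUNTS`, `rule_ids`, and `ALL_RULE_IDS` are hypothetical names that do not exist in the codebase, and the counts are simply read off the Validation Rules column.

```python
# Rule counts per stage, read off the Validation Rules column above
# (e.g. Stage 2 spans V2.1-V2.12, so its count is 12).
# All names in this block are illustrative, not part of the codebase.
STAGE_RULE_COUNTS = {0: 5, 1: 6, 2: 12, 3: 5, 4: 7}


def rule_ids(stage: int) -> list[str]:
    """Expand a stage's range (e.g. V1.1-V1.6) into individual rule IDs."""
    return [f"V{stage}.{n}" for n in range(1, STAGE_RULE_COUNTS[stage] + 1)]


# Every rule ID across all stages, in stage order.
ALL_RULE_IDS = [rid for stage in sorted(STAGE_RULE_COUNTS) for rid in rule_ids(stage)]

if __name__ == "__main__":
    print(rule_ids(1))
    print(len(ALL_RULE_IDS))
```

The total matches the "35 total across stages" figure quoted for `ReviewIQ-Pipeline-Contracts-v1.md`; the meaning of each rule is defined there, not here.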

---

## Parallel Development Assignment

### Agent 1 - Stage 1 (Normalization)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 1
  - ReviewIQ-Codebase-Overview.md (integration points)

Create:
  - pipeline/stage1_normalize.py
  - migrations/005_create_reviews_tables.sql
  - pipeline/tests/test_stage1.py

Validate:
  - V1.1-V1.6 rules pass
  - Integration test: Stage 0 → Stage 1 passes
```

### Agent 2 - Stage 2 (LLM Classification)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 2
  - LLM-Classification-Contract-v1.md
  - URT-v5.1-Reference.md

Create:
  - pipeline/stage2_classify.py
  - pipeline/llm_client.py
  - pipeline/span_extractor.py
  - migrations/006_create_spans_table.sql
  - migrations/007_create_urt_enums.sql
  - pipeline/tests/test_stage2.py

Validate:
  - V2.1-V2.12 rules pass
  - Integration test: Stage 1 → Stage 2 passes
```

### Agent 3 - Stage 3 (Issue Routing)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 3
  - ReviewIQ-Architecture-v3.2.md § Part 5 (issue lifecycle)

Create:
  - pipeline/stage3_route.py
  - pipeline/issue_manager.py
  - migrations/008_create_issues_tables.sql
  - pipeline/tests/test_stage3.py

Validate:
  - V3.1-V3.5 rules pass
  - Integration test: Stage 2 → Stage 3 passes
```

### Agent 4 - Stage 4 (Fact Aggregation)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 4
  - ReviewIQ-Architecture-v3.2.md § Part 6 (analytics)

Create:
  - pipeline/stage4_aggregate.py
  - migrations/009_create_facts_table.sql
  - pipeline/tests/test_stage4.py

Validate:
  - V4.1-V4.7 rules pass
  - E2E pipeline test passes
```

---

## Success Criteria

Pipeline is complete when:

```bash
python -m pipeline.validate --job-id <JOB_ID> --verbose

# Expected output:
# Stage 0: ✅ PASS (5/5 rules)
# Stage 1: ✅ PASS (6/6 rules)
# Stage 2: ✅ PASS (12/12 rules)
# Stage 3: ✅ PASS (5/5 rules)
# Stage 4: ✅ PASS (7/7 rules)
# E2E Integration: ✅ PASS
```
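Since `pipeline.validate` is not yet implemented, the sketch below shows one way the report above could be produced from per-stage rule counts. It is an assumption-laden illustration, not the intended implementation: `format_stage_line`, `format_report`, and `pipeline_complete` are hypothetical names, and the `❌ FAIL` wording for a failing stage is a guess, since only the passing output is specified.

```python
def format_stage_line(stage: int, passed: int, total: int) -> str:
    """Render one report line in the format shown in the expected output."""
    # The FAIL wording is an assumption; the guide only shows the PASS case.
    status = "✅ PASS" if passed == total else "❌ FAIL"
    return f"Stage {stage}: {status} ({passed}/{total} rules)"


def format_report(results: dict[int, tuple[int, int]], e2e_ok: bool) -> str:
    """Build the full report from {stage: (rules_passed, rules_total)}."""
    lines = [format_stage_line(stage, *results[stage]) for stage in sorted(results)]
    lines.append(f"E2E Integration: {'✅ PASS' if e2e_ok else '❌ FAIL'}")
    return "\n".join(lines)


def pipeline_complete(results: dict[int, tuple[int, int]], e2e_ok: bool) -> bool:
    """True only when every stage passes all its rules and the E2E test passes."""
    return e2e_ok and all(passed == total for passed, total in results.values())


if __name__ == "__main__":
    all_green = {0: (5, 5), 1: (6, 6), 2: (12, 12), 3: (5, 5), 4: (7, 7)}
    print(format_report(all_green, e2e_ok=True))
```

A command-line wrapper would typically exit non-zero when `pipeline_complete` returns `False`, so the check can gate CI.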

---

## Quick Commands

```bash
# Check current branch
git branch --show-current
# Expected: feature/platform-restructure

# View recent commits
git log --oneline -5

# Start database
docker-compose -f docker-compose.production.yml up -d postgres

# Run API server
python api_server_production.py

# Run frontend
cd frontend && npm run dev

# Run a migration (005-009, once created)
psql "$DATABASE_URL" -f migrations/005_create_reviews_tables.sql

# Run tests
pytest pipeline/tests/ -v

# Validate pipeline
python -m pipeline.validate --job-id <JOB_ID>
```
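The migration command above covers a single file. To apply all of 005-009, a small helper could build the `psql` invocations in numeric order. This is a sketch under stated assumptions: the file names are the ones listed in the agent assignments, running them in prefix order is assumed to be the intended sequence, and `ordered_migrations` and `psql_commands` are hypothetical helpers. `ON_ERROR_STOP` is a standard `psql` variable that makes a script abort on the first SQL error.

```python
import re
import subprocess

# File names as listed under "Create" in the agent assignments.
PLANNED_MIGRATIONS = [
    "migrations/005_create_reviews_tables.sql",
    "migrations/006_create_spans_table.sql",
    "migrations/007_create_urt_enums.sql",
    "migrations/008_create_issues_tables.sql",
    "migrations/009_create_facts_table.sql",
]


def ordered_migrations(paths: list[str]) -> list[str]:
    """Sort migration paths by the numeric prefix of their file name."""
    def prefix(path: str) -> int:
        match = re.match(r"(\d+)_", path.rsplit("/", 1)[-1])
        if match is None:
            raise ValueError(f"no numeric prefix in {path!r}")
        return int(match.group(1))

    return sorted(paths, key=prefix)


def psql_commands(paths: list[str], database_url: str) -> list[list[str]]:
    """Build one psql argv per migration; nothing is executed here."""
    return [
        ["psql", database_url, "-v", "ON_ERROR_STOP=1", "-f", path]
        for path in ordered_migrations(paths)
    ]


def apply_all(paths: list[str], database_url: str) -> None:
    """Run each migration in order, stopping at the first failure."""
    for argv in psql_commands(paths, database_url):
        subprocess.run(argv, check=True)
```

Keeping command construction separate from execution means the ordering can be unit-tested without a database.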

---

## Environment Variables

```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq

# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...

# Embedding model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
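A stage that needs these settings could load and validate them once at startup. The sketch below is illustrative only: `PipelineConfig` and `load_config` are hypothetical names, preferring OpenAI when both keys are set is an arbitrary choice, and using the sample values above as defaults for `EMBEDDING_MODEL` and `DEFAULT_TAXONOMY_VERSION` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Read pipeline settings from the environment, failing fast on gaps."""
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")

    # Either provider key is acceptable; preferring OpenAI when both are
    # present is an arbitrary choice made for this sketch.
    if env.get("OPENAI_API_KEY"):
        provider, api_key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, api_key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")

    return PipelineConfig(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=api_key,
        # Defaults mirror the sample values above (an assumption).
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )
```

Passing a plain dictionary as `env` keeps the loader testable without touching the real process environment.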

---

## File Structure After Implementation

```
google-reviews-scraper-pro/
├── .artifacts/                    # ← Design documents
│   ├── ReviewIQ-Pipeline-DevGuide.md  # ← START HERE (for pipeline work)
│   ├── ReviewIQ-v32-Decisions.md
│   ├── ReviewIQ-Codebase-Overview.md
│   ├── ReviewIQ-Pipeline-Contracts-v1.md
│   ├── ReviewIQ-Pipeline-Checklist.md
│   └── ...
│
├── api_server_production.py       # ✅ Exists - Main API
├── core/database.py               # ✅ Exists - DB layer
├── scrapers/google_reviews/       # ✅ Exists - Scraper
│
├── pipeline/                      # ❌ TO CREATE
│   ├── stage1_normalize.py
│   ├── stage2_classify.py
│   ├── stage3_route.py
│   ├── stage4_aggregate.py
│   ├── llm_client.py
│   ├── span_extractor.py
│   ├── issue_manager.py
│   └── tests/
│
└── migrations/
    ├── 001-004                    # ✅ Exists
    └── 005-009                    # ❌ TO CREATE
```

---

*Keep this guide updated when adding new artifacts or completing stages.*