whyrating-engine-legacy/.artifacts/ReviewIQ-Pipeline-DevGuide.md
Alejandro Gutiérrez acd3b22e88 docs: Add pipeline development artifacts for parallel implementation
New artifacts:
- ReviewIQ-Pipeline-DevGuide.md: Entry point for pipeline work
- ReviewIQ-Pipeline-Contracts-v1.md: Stage I/O specs, validation rules, test fixtures
- ReviewIQ-Pipeline-Checklist.md: Per-stage implementation checklists
- ReviewIQ-Codebase-Overview.md: File structure, integration points
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md: Taxonomy versioning addendum

Updated:
- ReviewIQ-v32-Decisions.md: Added B2 audit findings, taxonomy versioning decisions, pipeline status

These artifacts enable parallel development of pipeline stages 1-4 with:
- Independent validation (35 rules across stages)
- Clear input/output contracts
- Test fixtures for each stage
- Definition of done criteria

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:08:40 +00:00


# ReviewIQ Pipeline Development Guide
**Purpose**: Entry point for agents implementing the enrichment pipeline
**Last Updated**: 2026-01-24
---
## TL;DR - Current State
**Pipeline Implementation: ~55% complete**
```
✅ WORKING                    ❌ NOT IMPLEMENTED
──────────                    ──────────────────
Google Maps scraping          Stage 1: Normalization
Job orchestration             Stage 2: LLM Classification
Chrome worker pool            Stage 3: Issue Routing
Webhook delivery              Stage 4: Fact Aggregation
SSE streaming                 Enrichment database schema
Frontend (job management)     Advanced analytics UI
```
**Estimated effort to 100%**: 6-8 weeks
---
## Cold Start Instructions
A new agent should work through these steps in order:
| Step | Action | Time |
|------|--------|------|
| 1 | Read this file (`ReviewIQ-Pipeline-DevGuide.md`) | 2 min |
| 2 | Read `ReviewIQ-v32-Decisions.md` | 5 min |
| 3 | Read `ReviewIQ-Codebase-Overview.md` | 10 min |
| 4 | Read assigned stage in `ReviewIQ-Pipeline-Contracts-v1.md` | 15 min |
| 5 | Use `ReviewIQ-Pipeline-Checklist.md` to verify completion | Reference |
---
## Document Map
```
                       ┌─────────────────────────────────────┐
                       │    ReviewIQ-Pipeline-DevGuide.md    │
                       │           (YOU ARE HERE)            │
                       └─────────────────┬───────────────────┘
           ┌─────────────────────────────┼─────────────────────────────┐
           │                             │                             │
           ▼                             ▼                             ▼
┌─────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────┐
│ CONTEXT RECOVERY    │     │ IMPLEMENTATION          │     │ REFERENCE           │
├─────────────────────┤     ├─────────────────────────┤     ├─────────────────────┤
│                     │     │                         │     │                     │
│ ReviewIQ-v32-       │     │ Pipeline-Contracts-v1   │     │ Architecture-v3.2   │
│ Decisions.md        │     │ (I/O specs, validation) │     │ (full DDL spec)     │
│ (key decisions,     │     │                         │     │                     │
│ markpoint)          │     │ Pipeline-Checklist      │     │ v3.2.1-Taxonomy-    │
│                     │     │ (implementation tasks)  │     │ Versioning          │
│ Codebase-Overview   │     │                         │     │ (versioning spec)   │
│ (file structure,    │     │ LLM-Classification-     │     │                     │
│ integration points) │     │ Contract-v1             │     │ URT-v5.1-Reference  │
│                     │     │ (prompt engineering)    │     │ (dimension codes)   │
└─────────────────────┘     └─────────────────────────┘     └─────────────────────┘
```
---
## Core Documents
### Context & Status (Read First)
| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-DevGuide.md` | Entry point, document map | 2 min |
| `ReviewIQ-v32-Decisions.md` | Key decisions, current markpoint | 5 min |
| `ReviewIQ-Codebase-Overview.md` | File structure, what code exists, integration points | 10 min |
### Implementation Guides (For Building)
| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-Contracts-v1.md` | Stage I/O specs, validation rules, test fixtures | 15 min |
| `ReviewIQ-Pipeline-Checklist.md` | Per-stage implementation checklist, definition of done | 5 min |
| `LLM-Classification-Contract-v1.md` | LLM prompt engineering spec (Stage 2) | 10 min |
### Full Specifications (Reference)
| File | Purpose | When to Read |
|------|---------|--------------|
| `ReviewIQ-Architecture-v3.2.md` | Complete v3.2 spec with DDL | Schema details |
| `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` | Taxonomy versioning addendum | Future-proofing |
| `URT-v5.1-Reference.md` | URT dimension codes reference | Classification reference |
### Legacy (Superseded - Reference Only)
| File | Note |
|------|------|
| `ReviewIQ-Architecture-v2.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.1.md` | Superseded by v3.2 |
| `CONTEXT-KEEPER.md` | Use `ReviewIQ-v32-Decisions.md` instead |
---
## What's Captured in Artifacts
| Context | Document |
|---------|----------|
| Key architectural decisions | `ReviewIQ-v32-Decisions.md` |
| Current implementation status (~55%) | `ReviewIQ-Codebase-Overview.md` |
| Existing file structure | `ReviewIQ-Codebase-Overview.md` |
| Integration points (where new code connects) | `ReviewIQ-Codebase-Overview.md` |
| Stage input/output contracts | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Validation rules (35 total across stages) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Test fixtures (5 sample JSON payloads) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Implementation checklists | `ReviewIQ-Pipeline-Checklist.md` |
| Definition of done per stage | `ReviewIQ-Pipeline-Checklist.md` |
| LLM prompt specification | `LLM-Classification-Contract-v1.md` |
| URT taxonomy codes | `URT-v5.1-Reference.md` |
| Full database DDL | `ReviewIQ-Architecture-v3.2.md` |
| Taxonomy versioning schema | `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` |
---
## Pipeline Stages
| Stage | Name | Status | Contract Section | Validation Rules |
|-------|------|--------|------------------|------------------|
| 0 | Raw Ingestion | ✅ Done | Pipeline-Contracts § Stage 0 | V0.1-V0.5 |
| 1 | Normalization | ❌ TODO | Pipeline-Contracts § Stage 1 | V1.1-V1.6 |
| 2 | LLM Classification | ❌ TODO | Pipeline-Contracts § Stage 2 | V2.1-V2.12 |
| 3 | Issue Routing | ❌ TODO | Pipeline-Contracts § Stage 3 | V3.1-V3.5 |
| 4 | Fact Aggregation | ❌ TODO | Pipeline-Contracts § Stage 4 | V4.1-V4.7 |
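
The rule ranges in the table above add up to the 35 validation rules quoted elsewhere in this guide (5 + 6 + 12 + 5 + 7). A minimal sketch of that arithmetic follows. The ranges are copied from the table; the names `STAGE_RULE_RANGES` and `expand_rule_ids` are hypothetical and are not part of the contracts document.

```python
from __future__ import annotations

# Inclusive rule ID ranges, copied from the "Validation Rules" column above.
STAGE_RULE_RANGES = {
    0: ("V0.1", "V0.5"),
    1: ("V1.1", "V1.6"),
    2: ("V2.1", "V2.12"),
    3: ("V3.1", "V3.5"),
    4: ("V4.1", "V4.7"),
}


def expand_rule_ids(first: str, last: str) -> list[str]:
    """Expand an inclusive range such as ("V2.1", "V2.12") into rule IDs."""
    prefix, start = first.rsplit(".", 1)
    last_prefix, end = last.rsplit(".", 1)
    if prefix != last_prefix:
        raise ValueError(f"range spans more than one stage: {first}..{last}")
    return [f"{prefix}.{n}" for n in range(int(start), int(end) + 1)]


# Hypothetical lookup: stage number -> list of individual rule IDs.
RULES_BY_STAGE = {
    stage: expand_rule_ids(first, last)
    for stage, (first, last) in STAGE_RULE_RANGES.items()
}

# Total across stages 0-4; stages 1-4 alone account for 30 of these.
TOTAL_RULES = sum(len(ids) for ids in RULES_BY_STAGE.values())
```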
---
## Parallel Development Assignment
### Agent 1 - Stage 1 (Normalization)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 1
- ReviewIQ-Codebase-Overview.md (integration points)
Create:
- pipeline/stage1_normalize.py
- migrations/005_create_reviews_tables.sql
- pipeline/tests/test_stage1.py
Validate:
- V1.1-V1.6 rules pass
- Integration test: Stage 0 → Stage 1 passes
```
### Agent 2 - Stage 2 (LLM Classification)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 2
- LLM-Classification-Contract-v1.md
- URT-v5.1-Reference.md
Create:
- pipeline/stage2_classify.py
- pipeline/llm_client.py
- pipeline/span_extractor.py
- migrations/006_create_spans_table.sql
- migrations/007_create_urt_enums.sql
- pipeline/tests/test_stage2.py
Validate:
- V2.1-V2.12 rules pass
- Integration test: Stage 1 → Stage 2 passes
```
### Agent 3 - Stage 3 (Issue Routing)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 3
- ReviewIQ-Architecture-v3.2.md § Part 5 (issue lifecycle)
Create:
- pipeline/stage3_route.py
- pipeline/issue_manager.py
- migrations/008_create_issues_tables.sql
- pipeline/tests/test_stage3.py
Validate:
- V3.1-V3.5 rules pass
- Integration test: Stage 2 → Stage 3 passes
```
### Agent 4 - Stage 4 (Fact Aggregation)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 4
- ReviewIQ-Architecture-v3.2.md § Part 6 (analytics)
Create:
- pipeline/stage4_aggregate.py
- migrations/009_create_facts_table.sql
- pipeline/tests/test_stage4.py
Validate:
- V4.1-V4.7 rules pass
- E2E pipeline test passes
```
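Because the four assignments touch disjoint files, a small script can report which deliverables are still missing for each stage. The sketch below is illustrative only. The file paths are copied from the "Create:" lists above, while `STAGE_DELIVERABLES` and `missing_deliverables` are hypothetical names. File existence is also only a rough progress signal; the definition of done lives in `ReviewIQ-Pipeline-Checklist.md`.

```python
from __future__ import annotations

from pathlib import Path

# Deliverables per stage, copied from the "Create:" lists above.
STAGE_DELIVERABLES = {
    1: [
        "pipeline/stage1_normalize.py",
        "migrations/005_create_reviews_tables.sql",
        "pipeline/tests/test_stage1.py",
    ],
    2: [
        "pipeline/stage2_classify.py",
        "pipeline/llm_client.py",
        "pipeline/span_extractor.py",
        "migrations/006_create_spans_table.sql",
        "migrations/007_create_urt_enums.sql",
        "pipeline/tests/test_stage2.py",
    ],
    3: [
        "pipeline/stage3_route.py",
        "pipeline/issue_manager.py",
        "migrations/008_create_issues_tables.sql",
        "pipeline/tests/test_stage3.py",
    ],
    4: [
        "pipeline/stage4_aggregate.py",
        "migrations/009_create_facts_table.sql",
        "pipeline/tests/test_stage4.py",
    ],
}


def missing_deliverables(repo_root: Path, stage: int) -> list[str]:
    """Return the stage's deliverables that do not yet exist under repo_root."""
    return [
        rel for rel in STAGE_DELIVERABLES[stage]
        if not (repo_root / rel).is_file()
    ]


# Example use from the repository root:
#   for stage in sorted(STAGE_DELIVERABLES):
#       print(stage, missing_deliverables(Path.cwd(), stage))
```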
---
## Success Criteria
Pipeline is complete when:
```bash
python -m pipeline.validate --job-id <JOB_ID> --verbose
# Expected output:
#   Stage 0: ✅ PASS (5/5 rules)
#   Stage 1: ✅ PASS (6/6 rules)
#   Stage 2: ✅ PASS (12/12 rules)
#   Stage 3: ✅ PASS (5/5 rules)
#   Stage 4: ✅ PASS (7/7 rules)
#   E2E Integration: ✅ PASS
```
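The `pipeline.validate` module does not exist yet, so the following is only a sketch of how its report could be assembled from per-stage rule counts. `StageResult`, `format_report`, and `pipeline_complete` are hypothetical names, and the `❌ FAIL` wording for a failing stage is an assumption, since this guide only shows the passing case.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StageResult:
    """Outcome of running one stage's validation rules (hypothetical shape)."""
    stage: int
    passed: int
    total: int

    @property
    def ok(self) -> bool:
        return self.passed == self.total


def format_report(results: list[StageResult], e2e_ok: bool) -> str:
    """Render one line per stage, in stage order, plus the E2E line."""
    lines = []
    for result in sorted(results, key=lambda r: r.stage):
        mark = "✅ PASS" if result.ok else "❌ FAIL"  # FAIL wording is assumed
        lines.append(
            f"Stage {result.stage}: {mark} ({result.passed}/{result.total} rules)"
        )
    lines.append(f"E2E Integration: {'✅ PASS' if e2e_ok else '❌ FAIL'}")
    return "\n".join(lines)


def pipeline_complete(results: list[StageResult], e2e_ok: bool) -> bool:
    """Complete only if every stage passes all its rules and E2E passes."""
    return bool(results) and e2e_ok and all(r.ok for r in results)
```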
---
## Quick Commands
```bash
# Check current branch
git branch --show-current
# Expected: feature/platform-restructure
# View recent commits
git log --oneline -5
# Start database
docker-compose -f docker-compose.production.yml up -d postgres
# Run API server
python api_server_production.py
# Run frontend
cd frontend && npm run dev
# Run migrations (when created)
psql "$DATABASE_URL" -f migrations/005_create_reviews_tables.sql
# Run tests
pytest pipeline/tests/ -v
# Validate pipeline
python -m pipeline.validate --job-id <JOB_ID>
```
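Once all of migrations 005-009 exist, they can be applied in numeric order rather than one `psql` call at a time. The sketch below assumes the `NNN_description.sql` naming used in this guide and that `psql` is on the `PATH`. The names `pending_migrations` and `apply_migrations` are hypothetical, and the project may already have its own migration runner.

```python
from __future__ import annotations

import re
import subprocess
from pathlib import Path

# Matches names such as "005_create_reviews_tables.sql".
MIGRATION_NAME = re.compile(r"^(\d{3})_.+\.sql$")


def pending_migrations(
    migrations_dir: Path, first: int = 5, last: int = 9
) -> list[Path]:
    """Return migration files numbered first..last, sorted by numeric prefix."""
    found = []
    for path in migrations_dir.iterdir():
        match = MIGRATION_NAME.match(path.name)
        if match and first <= int(match.group(1)) <= last:
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]


def apply_migrations(paths: list[Path], database_url: str) -> None:
    """Run each file through psql, stopping at the first SQL error."""
    for path in paths:
        subprocess.run(
            ["psql", database_url, "-v", "ON_ERROR_STOP=1", "-f", str(path)],
            check=True,  # raise CalledProcessError if psql exits non-zero
        )


# Example use from the repository root:
#   import os
#   apply_migrations(pending_migrations(Path("migrations")),
#                    os.environ["DATABASE_URL"])
```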
---
## Environment Variables
```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq
# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...
# Embedding model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
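Stage code will need these values at startup, so it helps to fail fast when a required one is missing. The following is a minimal sketch, not existing project code. `PipelineSettings` and `load_settings` are hypothetical names, the defaults simply mirror the example values above, and preferring OpenAI when both keys are set is an arbitrary assumption.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Read pipeline settings from the environment, failing fast on gaps."""
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")

    # Exactly one provider key is needed; OpenAI wins if both are set (assumed).
    if env.get("OPENAI_API_KEY"):
        provider, api_key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, api_key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")

    return PipelineSettings(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )
```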
---
## File Structure After Implementation
```
google-reviews-scraper-pro/
├── .artifacts/                          # ← Design documents
│   ├── ReviewIQ-Pipeline-DevGuide.md    # ← START HERE (for pipeline work)
│   ├── ReviewIQ-v32-Decisions.md
│   ├── ReviewIQ-Codebase-Overview.md
│   ├── ReviewIQ-Pipeline-Contracts-v1.md
│   ├── ReviewIQ-Pipeline-Checklist.md
│   └── ...
├── api_server_production.py             # ✅ Exists - Main API
├── core/database.py                     # ✅ Exists - DB layer
├── scrapers/google_reviews/             # ✅ Exists - Scraper
├── pipeline/                            # ❌ TO CREATE
│   ├── stage1_normalize.py
│   ├── stage2_classify.py
│   ├── stage3_route.py
│   ├── stage4_aggregate.py
│   ├── llm_client.py
│   └── tests/
└── migrations/
    ├── 001-004                          # ✅ Exists
    └── 005-009                          # ❌ TO CREATE
```
---
*Keep this guide updated when adding new artifacts or completing stages.*