docs: Add pipeline development artifacts for parallel implementation

New artifacts:
- ReviewIQ-Pipeline-DevGuide.md: Entry point for pipeline work
- ReviewIQ-Pipeline-Contracts-v1.md: Stage I/O specs, validation rules, test fixtures
- ReviewIQ-Pipeline-Checklist.md: Per-stage implementation checklists
- ReviewIQ-Codebase-Overview.md: File structure, integration points
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md: Taxonomy versioning addendum

Updated:
- ReviewIQ-v32-Decisions.md: Added B2 audit findings, taxonomy versioning decisions, pipeline status

These artifacts enable parallel development of pipeline stages 1-4 with:
- Independent validation (35 rules across stages)
- Clear input/output contracts
- Test fixtures for each stage
- Definition of done criteria

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-24 17:08:40 +00:00
parent c2996bef1e
commit acd3b22e88
6 changed files with 3600 additions and 4 deletions

@@ -0,0 +1,450 @@
# ReviewIQ Codebase Overview
**Purpose**: Map existing code for agents starting fresh
**Last Updated**: 2026-01-24
**Status**: ~55% implemented (scraping done, enrichment pipeline missing)
---
## Quick Start for New Agents
1. **Read first**: `ReviewIQ-v32-Decisions.md` (context recovery)
2. **For implementation**: `ReviewIQ-Pipeline-Contracts-v1.md` + `ReviewIQ-Pipeline-Checklist.md`
3. **For schema**: `ReviewIQ-Architecture-v3.2.md` § Part 2
4. **For LLM prompts**: `LLM-Classification-Contract-v1.md`
---
## Implementation Status Summary
```
PIPELINE COMPLETION: ~55%
✅ COMPLETE (Working in Production)
├── Google Maps scraping (v1.0.0)
├── Job orchestration & queuing
├── Chrome worker pool
├── Webhook delivery
├── SSE real-time streaming
├── Frontend job management
└── Basic analytics dashboard
❌ NOT IMPLEMENTED (Spec'd Only)
├── Stage 1: Normalization
├── Stage 2: LLM Classification
├── Stage 3: Issue Routing
├── Stage 4: Fact Aggregation
├── Enrichment database schema
└── Advanced analytics UI
```
---
## Directory Structure
```
google-reviews-scraper-pro/
├── .artifacts/ # Design documents (YOU ARE HERE)
│ ├── ReviewIQ-v32-Decisions.md # START HERE - context recovery
│ ├── ReviewIQ-Architecture-v3.2.md # Full v3.2 spec
│ ├── ReviewIQ-Pipeline-Contracts-v1.md # Stage I/O contracts
│ ├── ReviewIQ-Pipeline-Checklist.md # Implementation checklist
│ ├── LLM-Classification-Contract-v1.md # LLM prompt spec
│ ├── URT-v5.1-Reference.md # URT dimension codes
│ └── ReviewIQ-Codebase-Overview.md # THIS FILE
├── api/ # ✅ API routes (FastAPI)
│ ├── routes/
│ │ ├── admin.py # Scraper management endpoints
│ │ ├── dashboard.py # Analytics endpoints
│ │ └── batches.py # Batch job endpoints
│ └── __init__.py
├── core/ # ✅ Core services
│ ├── database.py # AsyncPG database layer (~1200 lines)
│ ├── config.py # Configuration management
│ └── logging.py # Structured logging
├── services/ # ✅ Background services
│ ├── webhook_service.py # Async webhook delivery
│ └── job_callback_service.py # Callback handling
├── workers/ # ✅ Worker pool
│ └── chrome_pool.py # Chrome instance pooling
├── scrapers/ # ✅ Scraper implementations
│ └── google_reviews/
│ └── v1_0_0.py # Main scraper (~2000 lines)
├── pipeline/ # ❌ TO BE CREATED
│ ├── stage1_normalize.py # TODO
│ ├── stage2_classify.py # TODO
│ ├── stage3_route.py # TODO
│ ├── stage4_aggregate.py # TODO
│ └── tests/ # TODO
├── migrations/ # Database migrations
│ ├── 001_add_job_platform_fields.sql # ✅ Deployed
│ ├── 002_create_batches_table.sql # ✅ Deployed
│ ├── 003_create_scraper_registry.sql # ✅ Deployed
│ ├── 004_create_api_keys.sql # ✅ Deployed
│ ├── 005_create_reviews_tables.sql # ❌ TODO
│ ├── 006_create_spans_table.sql # ❌ TODO
│ ├── 007_create_urt_enums.sql # ❌ TODO
│ ├── 008_create_issues_tables.sql # ❌ TODO
│ └── 009_create_facts_table.sql # ❌ TODO
├── frontend/ # ✅ Next.js frontend
│ ├── app/
│ │ ├── dashboard/ # System overview
│ │ ├── jobs/ # Job list & detail
│ │ ├── analytics/ # Basic charts
│ │ └── new/ # Job submission forms
│ └── components/
├── api_server_production.py # ✅ Main API server (~1920 lines)
├── Dockerfile # ✅ Production container
├── docker-compose.production.yml # ✅ Docker orchestration
├── requirements-production.txt # ✅ Python dependencies
└── package.json # ✅ Node.js dependencies
```
---
## Key Files to Understand
### 1. API Server Entry Point
**File**: `api_server_production.py`
**Lines**: ~1920
**What it does**:
- FastAPI application setup
- All endpoint definitions
- Job submission and management
- SSE streaming
- Health checks
**Key endpoints**:
```
POST /api/scrape/google-reviews # Submit scrape job
GET /jobs/{job_id} # Get job status
GET /jobs/{job_id}/reviews # Get scraped reviews
GET /jobs/{job_id}/stream # SSE real-time updates
```
### 2. Database Layer
**File**: `core/database.py`
**Lines**: ~1200
**What it does**:
- AsyncPG connection pooling
- Job CRUD operations
- Review storage (currently JSONB blob)
- Webhook tracking
**Key functions**:
```python
async def create_job(job_data: dict) -> str
async def update_job_status(job_id: str, status: str, ...)
async def get_job(job_id: str) -> dict
async def store_reviews(job_id: str, reviews: list) -> int
```
**Note**: Currently stores reviews as JSONB in `jobs.reviews_data`.
The enrichment pipeline will need to:
1. Read from `jobs.reviews_data`
2. Write to `reviews_raw` and `reviews_enriched` tables
### 3. Google Scraper
**File**: `scrapers/google_reviews/v1_0_0.py`
**Lines**: ~2000
**What it does**:
- SeleniumBase Chrome automation
- DOM scraping + API interception
- Review extraction (text, rating, author, date)
- Business metadata extraction
- Pagination handling
**Output format** (stored in `jobs.reviews_data`):
```json
{
  "business_info": {...},
  "reviews": [
    {
      "review_id": "...",
      "author_name": "...",
      "rating": 4,
      "text": "...",
      "review_time": "2026-01-20T14:30:00Z"
    }
  ]
}
```
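Stage 1 consumes this payload, so it may help to pin its shape down in code. The following is a minimal sketch that assumes only the fields shown in the sample above. The names `ScrapedReview`, `ScraperOutput`, and `iter_reviews` are hypothetical, and `business_info` is left as an untyped dict because its fields are elided in the sample.

```python
from typing import Any, Iterator, TypedDict


class ScrapedReview(TypedDict):
    """One review as shown in the sample payload (hypothetical name)."""
    review_id: str
    author_name: str
    rating: int
    text: str
    review_time: str  # ISO 8601 timestamp string in the sample


class ScraperOutput(TypedDict):
    """Top-level shape of jobs.reviews_data (hypothetical name)."""
    business_info: dict[str, Any]  # fields are elided in the sample
    reviews: list[ScrapedReview]


def iter_reviews(payload: ScraperOutput) -> Iterator[ScrapedReview]:
    """Yield reviews, tolerating a missing or null 'reviews' key."""
    yield from payload.get("reviews") or []
```

Definitions like these could live in the planned `pipeline/contracts.py`, although the authoritative field list is the one in `ReviewIQ-Pipeline-Contracts-v1.md`.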
### 4. Chrome Worker Pool
**File**: `workers/chrome_pool.py`
**What it does**:
- Pre-warms Chrome instances
- Manages concurrent scraping jobs
- Handles resource cleanup
---
## Database Schema (Current State)
### Deployed Tables
```sql
-- Core job tracking
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(50),
    url TEXT,
    reviews_data JSONB, -- Raw scraped reviews live here
    reviews_count INTEGER,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    -- ... 20+ more columns
);
-- Batch processing
CREATE TABLE batches (...);
-- Webhook tracking
CREATE TABLE webhook_attempts (...);
-- Scraper versioning
CREATE TABLE scraper_registry (...);
-- API authentication (not enforced)
CREATE TABLE api_keys (...);
```
### NOT Deployed (Defined in v3.2 Spec)
```sql
-- These tables need to be created via migrations 005-009
CREATE TABLE locations (...); -- Multi-tenant locations
CREATE TABLE reviews_raw (...); -- Immutable raw storage
CREATE TABLE reviews_enriched (...); -- Classified reviews
CREATE TABLE review_spans (...); -- Span-level classification
CREATE TABLE urt_codes (...); -- URT reference data
CREATE TABLE issues (...); -- Aggregated issues
CREATE TABLE issue_spans (...); -- Issue-span links
CREATE TABLE issue_events (...); -- Audit log
CREATE TABLE fact_timeseries (...); -- Pre-aggregated analytics
```
---
## Integration Points
### Where New Pipeline Code Connects
```
EXISTING CODE                            NEW CODE
════════════                             ════════
api_server_production.py
          │
          ▼
jobs table
(reviews_data JSONB) ──────────────────▶ pipeline/stage1_normalize.py
                                                   │
                                                   ▼
                                         reviews_raw table
                                         reviews_enriched table
                                                   │
                                                   ▼
                                         pipeline/stage2_classify.py
                                                   │
                                                   ▼
                                         review_spans table
                                                   │
                                                   ▼
                                         pipeline/stage3_route.py
                                                   │
                                                   ▼
                                         issues table
                                         issue_spans table
                                                   │
                                                   ▼
                                         pipeline/stage4_aggregate.py
                                                   │
                                                   ▼
                                         fact_timeseries table
```
### How to Trigger Pipeline
**Option A: Post-scrape hook** (recommended)
```python
# In api_server_production.py, after job completes:
async def on_job_complete(job_id: str):
    # Existing: send webhook
    await webhook_service.dispatch(job_id)
    # NEW: trigger enrichment pipeline
    await pipeline.stage1.process_job(job_id)
```
**Option B: Background worker**
```python
# New file: workers/enrichment_worker.py
import asyncio

async def enrichment_loop():
    while True:
        # Poll for completed jobs that have not been enriched yet
        jobs = await db.query("""
            SELECT job_id FROM jobs
            WHERE status = 'completed'
            AND enrichment_status IS NULL
            LIMIT 10
        """)
        for job in jobs:
            await pipeline.process(job['job_id'])
        await asyncio.sleep(60)
```
**Option C: Manual trigger via API**
```python
# New endpoint in api_server_production.py
@app.post("/api/jobs/{job_id}/enrich")
async def trigger_enrichment(job_id: str):
    await pipeline.process(job_id)
    return {"status": "processing"}
```
---
## Environment Setup
### Required Environment Variables
```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq
# LLM Provider (for Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...
# Embedding model (for Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
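A small loader can fail fast when a required variable is missing. This is a sketch only: `PipelineSettings` and `load_settings` are hypothetical names, treating the embedding model and taxonomy version shown above as defaults is an assumption, and so is preferring OpenAI when both keys are set (the list above only says "OR").

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Read the variables listed above; raise early if a required one is missing."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")
    # Assumption: OpenAI wins if both keys are present.
    if env.get("OPENAI_API_KEY"):
        provider, key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")
    return PipelineSettings(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=key,
        # Assumption: the values shown in the listing act as defaults.
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to unit test.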
### Local Development
```bash
# Start database
docker-compose -f docker-compose.production.yml up -d postgres
# Run migrations
psql $DATABASE_URL -f migrations/001_add_job_platform_fields.sql
# ... repeat for all migrations
# Start API server
python api_server_production.py
# Start frontend (separate terminal)
cd frontend && npm run dev
```
### Running Tests
```bash
# Unit tests
pytest pipeline/tests/ -v
# Integration tests
pytest pipeline/tests/integration/ -v
# Full E2E validation
python -m pipeline.validate --job-id <JOB_ID>
```
---
## Tech Stack
| Component | Technology | Version |
|-----------|------------|---------|
| API | FastAPI | 0.100+ |
| Database | PostgreSQL | 15+ |
| DB Driver | asyncpg | 0.28+ |
| Scraping | SeleniumBase | 4.20+ |
| Browser | Chrome (headless) | 120+ |
| Frontend | Next.js | 16.1.3 |
| UI | React | 19.2.3 |
| Charts | Recharts | 2.x |
| LLM | OpenAI / Anthropic | Latest |
| Embeddings | sentence-transformers | 2.x |
| Vectors | pgvector | 0.5+ |
---
## Common Patterns
### Database Queries (asyncpg)
```python
# In core/database.py pattern:
from typing import Optional

async def get_job(job_id: str) -> Optional[dict]:
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            "SELECT * FROM jobs WHERE job_id = $1",
            job_id
        )
        # fetchrow returns None when no row matches
        return dict(row) if row else None
```
### API Endpoints (FastAPI)
```python
# In api_server_production.py pattern:
@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    job = await db.get_job(job_id)
    if not job:
        raise HTTPException(404, "Job not found")
    return job
```
### Background Tasks
```python
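# Pattern for async processing (keep a reference so the task is not garbage-collected):
import asyncio
_background_tasks: set = set()
async def process_in_background(job_id: str):
    task = asyncio.create_task(do_heavy_work(job_id))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return {"status": "processing"}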
```
---
## Gotchas & Notes
1. **Reviews are JSONB blobs** - Currently in `jobs.reviews_data`, not normalized tables
2. **No auth enforcement** - `api_keys` table exists but is not used
3. **CORS is wide open** - Set to `*` in production (fix before launch)
4. **Scraper is single-threaded per job** - Chrome pool handles concurrency
5. **Webhooks have retry logic** - 3 attempts with exponential backoff
6. **SSE streaming works** - Real-time job updates via `/jobs/{job_id}/stream`
---
## Next Steps for Implementation
1. **Create migrations 005-009** - Deploy enrichment schema
2. **Create `pipeline/` directory** - New code goes here
3. **Implement Stage 1** - Read from jobs.reviews_data, write to reviews_raw/enriched
4. **Implement Stage 2** - LLM classification with span extraction
5. **Implement Stage 3** - Issue routing
6. **Implement Stage 4** - Fact aggregation
7. **Add pipeline trigger** - Hook into job completion or create worker
8. **Update frontend** - Add enrichment views
---
*This document should be updated when significant code changes occur.*

@@ -0,0 +1,371 @@
# ReviewIQ Pipeline Implementation Checklist
**Purpose**: Quick reference for agents to verify stage completion
**Reference**: `ReviewIQ-Pipeline-Contracts-v1.md` for full specs
---
## Pipeline Overview
```
Stage 0 ──▶ Stage 1 ──▶ Stage 2 ──▶ Stage 3 ──▶ Stage 4
Scrape      Normalize   Classify    Route       Aggregate
✅ DONE     ❌ TODO     ❌ TODO     ❌ TODO     ❌ TODO
```
---
## Stage 0: Raw Ingestion ✅ COMPLETE
No action needed. Already implemented in `scrapers/google_reviews/v1_0_0.py`.
**Output Location**: `jobs.reviews_data` (JSONB)
---
## Stage 1: Normalization ❌ TODO
### Files to Create
- [ ] `pipeline/stage1_normalize.py` - Main normalization logic
- [ ] `pipeline/tests/test_stage1.py` - Unit tests
- [ ] `migrations/005_create_reviews_tables.sql` - Schema migration
### Database Schema Required
```sql
-- Must exist before Stage 1 can write
CREATE TABLE reviews_raw (...);
CREATE TABLE reviews_enriched (...);
```
### Implementation Checklist
- [ ] Read from `jobs.reviews_data` where `status = 'completed'`
- [ ] Filter out empty/null review texts
- [ ] Normalize text (lowercase, whitespace, emoji)
- [ ] Detect language (ISO 639-1)
- [ ] Compute content hash (SHA256)
- [ ] Check for duplicates within business
- [ ] Insert into `reviews_raw` (immutable)
- [ ] Insert stub into `reviews_enriched` (classification fields NULL)
- [ ] Return `Stage1Output` with stats
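The text-handling steps in this checklist can be sketched as below. This is an illustration under stated assumptions, not the contract: the helper names are hypothetical, NFKC folding is an added choice, deduplication is shown for a single business held in memory, and emoji handling and language detection are left out.

```python
import hashlib
import re
import unicodedata
from typing import Iterable

# Control characters that are not ordinary whitespace (tab, newline, etc.).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Fold Unicode (NFKC), drop control chars, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = _CONTROL_CHARS.sub("", text)
    text = _WHITESPACE.sub(" ", text)  # tabs/newlines become single spaces
    return text.strip().lower()


def content_hash(normalized: str) -> str:
    """SHA256 hex digest (64 chars) of the normalized text."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_reviews(reviews: Iterable[dict]) -> list[dict]:
    """Drop empty texts and repeated content within one business."""
    seen: set[str] = set()
    kept = []
    for review in reviews:
        normalized = normalize_text(review.get("text") or "")
        if not normalized:
            continue  # empty/null review text is filtered out
        digest = content_hash(normalized)
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append({**review, "text_normalized": normalized, "content_hash": digest})
    return kept
```

In the real stage the duplicate check would more likely be a lookup against `reviews_raw` than an in-memory set.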
### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage1.py -v
```
### Definition of Done
- [ ] All V1.1-V1.6 validation rules pass
- [ ] `reviews_raw` populated with immutable records
- [ ] `reviews_enriched` has stubs ready for Stage 2
- [ ] Integration test: Stage 0 output → Stage 1 input passes
- [ ] No empty texts in output
- [ ] Duplicate detection working
---
## Stage 2: LLM Classification ❌ TODO
### Files to Create
- [ ] `pipeline/stage2_classify.py` - LLM classification logic
- [ ] `pipeline/llm_client.py` - LLM provider abstraction
- [ ] `pipeline/span_extractor.py` - Span boundary detection
- [ ] `pipeline/tests/test_stage2.py` - Unit tests
- [ ] `migrations/006_create_spans_table.sql` - Schema migration
- [ ] `migrations/007_create_urt_enums.sql` - ENUM types
### Database Schema Required
```sql
-- ENUM types
CREATE TYPE urt_valence AS ENUM (...);
CREATE TYPE urt_intensity AS ENUM (...);
-- ... all 12 ENUMs from v3.2 spec
-- Spans table
CREATE TABLE review_spans (...);
```
### Implementation Checklist
- [ ] Query unclassified reviews from `reviews_enriched`
- [ ] Build LLM prompt per `LLM-Classification-Contract-v1.md`
- [ ] Call LLM API (support GPT-4o-mini, Claude-3-haiku)
- [ ] Parse structured JSON response
- [ ] Extract spans with character offsets
- [ ] Validate span_text matches original text substring
- [ ] Check spans don't overlap
- [ ] Select primary span (I3 > I2 > I1, V- > V± > V0 > V+)
- [ ] Generate embeddings (384-dim)
- [ ] Compute trust score (0.2 floor)
- [ ] Build USN string per profile
- [ ] Update `reviews_enriched` with classification
- [ ] Insert spans into `review_spans`
- [ ] Return `Stage2Output` with stats
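The span checks and the primary-span ordering above can be sketched as follows. The function names are hypothetical, and several details are assumptions: offsets are half-open `[start, end)`, enum labels are plain strings such as `"I3"` and `"V-"`, and ties go to the earliest span.

```python
from typing import Sequence

# Ranking from the checklist: I3 > I2 > I1, then V- > V± > V0 > V+.
_INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
_VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def validate_spans(text: str, spans: Sequence[dict]) -> list[str]:
    """Return human-readable violations of V2.5, V2.6 and V2.7."""
    errors = []
    for i, span in enumerate(spans):
        start, end = span["span_start"], span["span_end"]
        if end <= start:
            errors.append(f"span {i}: span_end must be > span_start (V2.5)")
        elif text[start:end] != span["span_text"]:
            errors.append(f"span {i}: span_text != text[start:end] (V2.6)")
    ordered = sorted(spans, key=lambda s: s["span_start"])
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["span_start"] < prev["span_end"]:
            errors.append("overlapping spans (V2.7)")
            break
    return errors


def pick_primary(spans: Sequence[dict]) -> int:
    """Index of the primary span; ties go to the earliest span (an assumption)."""
    return min(
        range(len(spans)),
        key=lambda i: (
            _INTENSITY_RANK[spans[i]["intensity"]],
            _VALENCE_RANK[spans[i]["valence"]],
            spans[i]["span_start"],
        ),
    )
```

`pick_primary` raises `ValueError` on an empty list, so a caller would need to handle reviews with no spans separately.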
### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage2.py -v
```
### Definition of Done
- [ ] All V2.1-V2.12 validation rules pass
- [ ] LLM calls working with retry logic
- [ ] Span offsets correct (text substring matches)
- [ ] No overlapping spans
- [ ] Exactly one primary span per review
- [ ] Embeddings are 384-dim vectors
- [ ] Trust scores clamped to [0.2, 1.0]
- [ ] USN format valid per profile
- [ ] Integration test: Stage 1 output → Stage 2 input passes
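For the "retry logic" item, one possible shape is a small async wrapper with exponential backoff. This is a sketch: `call_with_retry` is a hypothetical helper, three attempts and a one-second base delay are assumptions, and the retryable exception types would depend on the LLM client actually used.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Await call(); on a retryable error wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return await call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("attempts must be >= 1")
```

Errors outside `retry_on` (for example, a malformed-response error) propagate immediately, which keeps non-transient failures visible.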
---
## Stage 3: Issue Routing ❌ TODO
### Files to Create
- [ ] `pipeline/stage3_route.py` - Issue routing logic
- [ ] `pipeline/issue_manager.py` - Issue create/update logic
- [ ] `pipeline/tests/test_stage3.py` - Unit tests
- [ ] `migrations/008_create_issues_tables.sql` - Schema migration
### Database Schema Required
```sql
CREATE TABLE issues (...);
CREATE TABLE issue_spans (...);
CREATE TABLE issue_events (...);
```
### Implementation Checklist
- [ ] Query unrouted spans where `valence IN ('V-', 'V±')`
- [ ] Generate deterministic `issue_id` from routing key
- [ ] Check if issue exists
- [ ] Create new issue OR update existing counters
- [ ] Insert `issue_spans` link (enforce 1:1 with UNIQUE)
- [ ] Log event to `issue_events`
- [ ] Recalculate priority score
- [ ] Return `Stage3Output` with stats
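A deterministic `issue_id` could be derived by hashing the routing key. The sketch below assumes a truncated SHA256 digest, which satisfies the `^ISS-[a-f0-9]{16}$` format from rule V3.1 but is not confirmed by the contracts. `build_routing_key` and its field layout are purely hypothetical.

```python
import hashlib
import re

ISSUE_ID_PATTERN = re.compile(r"^ISS-[a-f0-9]{16}$")  # rule V3.1


def make_issue_id(routing_key: str) -> str:
    """Derive a stable issue ID: same routing key, same ID."""
    if not routing_key:
        raise ValueError("routing_key must be non-empty (V3.2)")
    digest = hashlib.sha256(routing_key.encode("utf-8")).hexdigest()
    return f"ISS-{digest[:16]}"


def build_routing_key(business_id: str, place_id: str, urt_primary: str) -> str:
    """Hypothetical key layout; the real composition is defined in the contracts doc."""
    return "|".join((business_id, place_id, urt_primary))
```

Because the ID depends only on the key, "check if issue exists" reduces to a primary-key lookup on `issues`.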
### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage3.py -v
```
### Definition of Done
- [ ] All V3.1-V3.5 validation rules pass
- [ ] Issue IDs are deterministic (same key = same ID)
- [ ] 1:1 span-to-issue mapping enforced
- [ ] Only V-/V± spans create issues
- [ ] Issue counters updated correctly
- [ ] Events logged for audit
- [ ] Integration test: Stage 2 output → Stage 3 input passes
---
## Stage 4: Fact Aggregation ❌ TODO
### Files to Create
- [ ] `pipeline/stage4_aggregate.py` - Fact aggregation logic
- [ ] `pipeline/tests/test_stage4.py` - Unit tests
- [ ] `migrations/009_create_facts_table.sql` - Schema migration
### Database Schema Required
```sql
CREATE TABLE fact_timeseries (...);
```
### Implementation Checklist
- [ ] Accept business_id, date, bucket_types
- [ ] Query spans joined with reviews for the period
- [ ] Aggregate by URT code (per location + 'ALL' rollup)
- [ ] Compute: review_count, span_count, valence counts
- [ ] Compute: strength_score, negative_strength, positive_strength
- [ ] Compute: intensity distribution (I1/I2/I3)
- [ ] Compute: CR counts (better/worse/same)
- [ ] Compute: trust-weighted metrics
- [ ] UPSERT into `fact_timeseries`
- [ ] Return `Stage4Output` with stats
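The counting part of this checklist can be sketched in memory as shown below. Assumptions: each span row already carries `place_id`, `urt_primary`, `review_id`, `valence`, and `intensity`; the function name is hypothetical; and the "owned locations only" filter for the 'ALL' rollup, the strength scores, and the trust-weighted metrics are omitted.

```python
from collections import Counter, defaultdict
from typing import Iterable


def aggregate_spans(spans: Iterable[dict]) -> dict[tuple[str, str], dict]:
    """Group spans by (place_id, urt_primary) and add an 'ALL' rollup per code."""
    buckets: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for span in spans:
        buckets[(span["place_id"], span["urt_primary"])].append(span)
        buckets[("ALL", span["urt_primary"])].append(span)

    facts = {}
    for key, rows in buckets.items():
        facts[key] = {
            "span_count": len(rows),
            "review_count": len({r["review_id"] for r in rows}),
            "valence_counts": dict(Counter(r["valence"] for r in rows)),
            "intensity_counts": dict(Counter(r["intensity"] for r in rows)),
        }
    return facts
```

Since every span contributes exactly one valence and one intensity, the sums checked by V4.4 and V4.5 hold by construction here. A production version would more likely run as a SQL `GROUP BY` feeding the UPSERT.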
### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage4.py -v
```
### Definition of Done
- [ ] All V4.1-V4.7 validation rules pass
- [ ] Valence counts sum to span_count
- [ ] Intensity counts sum to span_count
- [ ] 'ALL' rollup includes owned locations only
- [ ] Facts are idempotent (re-run produces same result)
- [ ] Integration test: Full pipeline E2E passes
---
## Integration Tests
### Handoff Tests (run after each stage)
```bash
# Stage 0 → 1
python -m pytest pipeline/tests/integration/test_stage0_to_1.py
# Stage 1 → 2
python -m pytest pipeline/tests/integration/test_stage1_to_2.py
# Stage 2 → 3
python -m pytest pipeline/tests/integration/test_stage2_to_3.py
# Full E2E
python -m pytest pipeline/tests/integration/test_e2e.py
```
### E2E Validation Command
```bash
# Run full pipeline validation
python -m pipeline.validate --job-id <JOB_ID> --verbose
```
Expected output:
```
Stage 0: ✅ PASS (5/5 rules)
Stage 1: ✅ PASS (6/6 rules)
Stage 2: ✅ PASS (12/12 rules)
Stage 3: ✅ PASS (5/5 rules)
Stage 4: ✅ PASS (7/7 rules)
E2E Integration: ✅ PASS
- Reviews scraped: 47
- Reviews normalized: 45 (2 empty filtered)
- Spans extracted: 127
- Issues created: 23
- Facts written: 156
```
---
## Quick Reference: Validation Rules
### Stage 1
| Rule | Description |
|------|-------------|
| V1.1 | `text` is non-empty |
| V1.2 | `text_normalized` has no control chars |
| V1.3 | `content_hash` is 64-char hex |
| V1.4 | `review_version` >= 1 |
| V1.5 | `text_language` is valid ISO 639-1 |
| V1.6 | `raw_id` references valid row |
### Stage 2
| Rule | Description |
|------|-------------|
| V2.1 | `urt_primary` matches `^[OPJEAVR][1-4]\.[0-9]{2}$` |
| V2.2 | `urt_secondary` max 2 elements |
| V2.3 | `valence` is valid enum |
| V2.4 | `intensity` is valid enum |
| V2.5 | `span_end > span_start` |
| V2.6 | `span_text == text[start:end]` |
| V2.7 | Spans don't overlap |
| V2.8 | Exactly one `is_primary = true` |
| V2.9 | `trust_score` in [0.2, 1.0] |
| V2.10 | `embedding` is 384-dim |
| V2.11 | `usn` matches profile regex |
| V2.12 | `related_span_index` valid if set |
### Stage 3
| Rule | Description |
|------|-------------|
| V3.1 | `issue_id` matches `^ISS-[a-f0-9]{16}$` |
| V3.2 | `routing_key` non-empty |
| V3.3 | Span not already linked elsewhere |
| V3.4 | Issue exists in `issues` table |
| V3.5 | Only V-/V± spans routed |
### Stage 4
| Rule | Description |
|------|-------------|
| V4.1 | `place_id` valid or 'ALL' |
| V4.2 | `period_date` matches bucket |
| V4.3 | `span_count >= review_count` |
| V4.4 | Valence counts sum correctly |
| V4.5 | Intensity counts sum correctly |
| V4.6 | `strength_score >= 0` |
| V4.7 | `avg_rating` in [1.0, 5.0] or NULL |
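Several of the rules in these tables reduce to regex or range checks. The sketch below covers a small subset and is only illustrative: `check_review` is a hypothetical helper operating on a flat dict, and lowercase hex for `content_hash` is an assumption. `fullmatch` is used so that a trailing newline cannot slip past the `$` anchor.

```python
import re

URT_CODE = re.compile(r"^[OPJEAVR][1-4]\.[0-9]{2}$")  # V2.1
SHA256_HEX = re.compile(r"^[a-f0-9]{64}$")            # V1.3 (lowercase hex assumed)


def check_review(row: dict) -> list[str]:
    """Return the IDs of the rules this row violates (subset of the tables above)."""
    failed = []
    if not (row.get("text") or "").strip():
        failed.append("V1.1")
    if not SHA256_HEX.fullmatch(row.get("content_hash") or ""):
        failed.append("V1.3")
    if not URT_CODE.fullmatch(row.get("urt_primary") or ""):
        failed.append("V2.1")
    if len(row.get("urt_secondary") or []) > 2:
        failed.append("V2.2")
    trust = row.get("trust_score")
    if trust is None or not 0.2 <= trust <= 1.0:
        failed.append("V2.9")
    return failed
```

Returning rule IDs rather than raising makes it straightforward to print a per-stage pass count like the expected output shown earlier.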
---
## Migration Execution Order
```bash
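# Run in sequence.
# NOTE (assumption): if review_spans in 006 uses the URT ENUM types, apply 007 before 006 or renumber the two files.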
psql $DATABASE_URL -f migrations/005_create_reviews_tables.sql
psql $DATABASE_URL -f migrations/006_create_spans_table.sql
psql $DATABASE_URL -f migrations/007_create_urt_enums.sql
psql $DATABASE_URL -f migrations/008_create_issues_tables.sql
psql $DATABASE_URL -f migrations/009_create_facts_table.sql
```
---
## Environment Variables Required
```bash
# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...
# Embedding Model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Database
DATABASE_URL=postgresql://...
# Taxonomy
DEFAULT_TAXONOMY_VERSION=v5.1
```
---
## File Structure After Implementation
```
pipeline/
├── __init__.py
├── stage1_normalize.py # ❌ TODO
├── stage2_classify.py # ❌ TODO
├── stage3_route.py # ❌ TODO
├── stage4_aggregate.py # ❌ TODO
├── llm_client.py # ❌ TODO
├── span_extractor.py # ❌ TODO
├── issue_manager.py # ❌ TODO
├── validate.py # ❌ TODO (CLI validator)
├── contracts.py # ❌ TODO (TypedDict definitions)
└── tests/
├── __init__.py
├── test_stage1.py # ❌ TODO
├── test_stage2.py # ❌ TODO
├── test_stage3.py # ❌ TODO
├── test_stage4.py # ❌ TODO
├── fixtures/
│ ├── stage0_output.json
│ ├── stage1_output.json
│ ├── stage2_output.json
│ ├── stage3_output.json
│ └── stage4_output.json
└── integration/
├── test_stage0_to_1.py
├── test_stage1_to_2.py
├── test_stage2_to_3.py
└── test_e2e.py
migrations/
├── 001_add_job_platform_fields.sql # ✅ EXISTS
├── 002_create_batches_table.sql # ✅ EXISTS
├── 003_create_scraper_registry.sql # ✅ EXISTS
├── 004_create_api_keys.sql # ✅ EXISTS
├── 005_create_reviews_tables.sql # ❌ TODO
├── 006_create_spans_table.sql # ❌ TODO
├── 007_create_urt_enums.sql # ❌ TODO
├── 008_create_issues_tables.sql # ❌ TODO
└── 009_create_facts_table.sql # ❌ TODO
```
---
*Last Updated: 2026-01-24*

File diff suppressed because it is too large

@@ -0,0 +1,312 @@
# ReviewIQ Pipeline Development Guide
**Purpose**: Entry point for agents implementing the enrichment pipeline
**Last Updated**: 2026-01-24
---
## TL;DR - Current State
**Pipeline Implementation: ~55% complete**
```
✅ WORKING                    ❌ NOT IMPLEMENTED
──────────                    ──────────────────
Google Maps scraping          Stage 1: Normalization
Job orchestration             Stage 2: LLM Classification
Chrome worker pool            Stage 3: Issue Routing
Webhook delivery              Stage 4: Fact Aggregation
SSE streaming                 Enrichment database schema
Frontend (job management)     Advanced analytics UI
```
**Estimated effort to 100%**: 6-8 weeks
---
## Cold Start Instructions
A new agent should:
| Step | Action | Time |
|------|--------|------|
| 1 | Read this file (`ReviewIQ-Pipeline-DevGuide.md`) | 2 min |
| 2 | Read `ReviewIQ-v32-Decisions.md` | 5 min |
| 3 | Read `ReviewIQ-Codebase-Overview.md` | 10 min |
| 4 | Read assigned stage in `ReviewIQ-Pipeline-Contracts-v1.md` | 15 min |
| 5 | Use `ReviewIQ-Pipeline-Checklist.md` to verify completion | Reference |
---
## Document Map
```
                       ┌─────────────────────────────────────┐
                       │    ReviewIQ-Pipeline-DevGuide.md    │
                       │           (YOU ARE HERE)            │
                       └─────────────────┬───────────────────┘
           ┌─────────────────────────────┼─────────────────────────────┐
           │                             │                             │
           ▼                             ▼                             ▼
┌─────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────┐
│  CONTEXT RECOVERY   │     │     IMPLEMENTATION      │     │      REFERENCE      │
├─────────────────────┤     ├─────────────────────────┤     ├─────────────────────┤
│                     │     │                         │     │                     │
│ ReviewIQ-v32-       │     │ Pipeline-Contracts-v1   │     │ Architecture-v3.2   │
│ Decisions.md        │     │ (I/O specs, validation) │     │ (full DDL spec)     │
│ (key decisions,     │     │                         │     │                     │
│ markpoint)          │     │ Pipeline-Checklist      │     │ v3.2.1-Taxonomy-    │
│                     │     │ (implementation tasks)  │     │ Versioning          │
│ Codebase-Overview   │     │                         │     │ (versioning spec)   │
│ (file structure,    │     │ LLM-Classification-     │     │                     │
│ integration points) │     │ Contract-v1             │     │ URT-v5.1-Reference  │
│                     │     │ (prompt engineering)    │     │ (dimension codes)   │
└─────────────────────┘     └─────────────────────────┘     └─────────────────────┘
```
---
## Core Documents
### Context & Status (Read First)
| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-DevGuide.md` | Entry point, document map | 2 min |
| `ReviewIQ-v32-Decisions.md` | Key decisions, current markpoint | 5 min |
| `ReviewIQ-Codebase-Overview.md` | File structure, what code exists, integration points | 10 min |
### Implementation Guides (For Building)
| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-Contracts-v1.md` | Stage I/O specs, validation rules, test fixtures | 15 min |
| `ReviewIQ-Pipeline-Checklist.md` | Per-stage implementation checklist, definition of done | 5 min |
| `LLM-Classification-Contract-v1.md` | LLM prompt engineering spec (Stage 2) | 10 min |
### Full Specifications (Reference)
| File | Purpose | When to Read |
|------|---------|--------------|
| `ReviewIQ-Architecture-v3.2.md` | Complete v3.2 spec with DDL | Schema details |
| `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` | Taxonomy versioning addendum | Future-proofing |
| `URT-v5.1-Reference.md` | URT dimension codes reference | Classification reference |
### Legacy (Superseded - Reference Only)
| File | Note |
|------|------|
| `ReviewIQ-Architecture-v2.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.1.md` | Superseded by v3.2 |
| `CONTEXT-KEEPER.md` | Use `ReviewIQ-v32-Decisions.md` instead |
---
## What's Captured in Artifacts
| Context | Document |
|---------|----------|
| Key architectural decisions | `ReviewIQ-v32-Decisions.md` |
| Current implementation status (~55%) | `ReviewIQ-Codebase-Overview.md` |
| Existing file structure | `ReviewIQ-Codebase-Overview.md` |
| Integration points (where new code connects) | `ReviewIQ-Codebase-Overview.md` |
| Stage input/output contracts | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Validation rules (35 total across stages) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Test fixtures (5 sample JSON payloads) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Implementation checklists | `ReviewIQ-Pipeline-Checklist.md` |
| Definition of done per stage | `ReviewIQ-Pipeline-Checklist.md` |
| LLM prompt specification | `LLM-Classification-Contract-v1.md` |
| URT taxonomy codes | `URT-v5.1-Reference.md` |
| Full database DDL | `ReviewIQ-Architecture-v3.2.md` |
| Taxonomy versioning schema | `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` |
---
## Pipeline Stages
| Stage | Name | Status | Contract Section | Validation Rules |
|-------|------|--------|------------------|------------------|
| 0 | Raw Ingestion | ✅ Done | Pipeline-Contracts § Stage 0 | V0.1-V0.5 |
| 1 | Normalization | ❌ TODO | Pipeline-Contracts § Stage 1 | V1.1-V1.6 |
| 2 | LLM Classification | ❌ TODO | Pipeline-Contracts § Stage 2 | V2.1-V2.12 |
| 3 | Issue Routing | ❌ TODO | Pipeline-Contracts § Stage 3 | V3.1-V3.5 |
| 4 | Fact Aggregation | ❌ TODO | Pipeline-Contracts § Stage 4 | V4.1-V4.7 |
---
## Parallel Development Assignment
### Agent 1 - Stage 1 (Normalization)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 1
- ReviewIQ-Codebase-Overview.md (integration points)
Create:
- pipeline/stage1_normalize.py
- migrations/005_create_reviews_tables.sql
- pipeline/tests/test_stage1.py
Validate:
- V1.1-V1.6 rules pass
- Integration test: Stage 0 → Stage 1 passes
```
### Agent 2 - Stage 2 (LLM Classification)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 2
- LLM-Classification-Contract-v1.md
- URT-v5.1-Reference.md
Create:
- pipeline/stage2_classify.py
- pipeline/llm_client.py
- pipeline/span_extractor.py
- migrations/006_create_spans_table.sql
- migrations/007_create_urt_enums.sql
- pipeline/tests/test_stage2.py
Validate:
- V2.1-V2.12 rules pass
- Integration test: Stage 1 → Stage 2 passes
```
### Agent 3 - Stage 3 (Issue Routing)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 3
- ReviewIQ-Architecture-v3.2.md § Part 5 (issue lifecycle)
Create:
- pipeline/stage3_route.py
- pipeline/issue_manager.py
- migrations/008_create_issues_tables.sql
- pipeline/tests/test_stage3.py
Validate:
- V3.1-V3.5 rules pass
- Integration test: Stage 2 → Stage 3 passes
```
### Agent 4 - Stage 4 (Fact Aggregation)
```
Read:
- ReviewIQ-Pipeline-Contracts-v1.md § Stage 4
- ReviewIQ-Architecture-v3.2.md § Part 6 (analytics)
Create:
- pipeline/stage4_aggregate.py
- migrations/009_create_facts_table.sql
- pipeline/tests/test_stage4.py
Validate:
- V4.1-V4.7 rules pass
- E2E pipeline test passes
```
---
## Success Criteria
Pipeline is complete when:
```bash
python -m pipeline.validate --job-id <JOB_ID> --verbose
# Expected output:
Stage 0: ✅ PASS (5/5 rules)
Stage 1: ✅ PASS (6/6 rules)
Stage 2: ✅ PASS (12/12 rules)
Stage 3: ✅ PASS (5/5 rules)
Stage 4: ✅ PASS (7/7 rules)
E2E Integration: ✅ PASS
```
---
## Quick Commands
```bash
# Check current branch
git branch --show-current
# Expected: feature/platform-restructure
# View recent commits
git log --oneline -5
# Start database
docker-compose -f docker-compose.production.yml up -d postgres
# Run API server
python api_server_production.py
# Run frontend
cd frontend && npm run dev
# Run migrations (when created)
psql $DATABASE_URL -f migrations/005_create_reviews_tables.sql
# Run tests
pytest pipeline/tests/ -v
# Validate pipeline
python -m pipeline.validate --job-id <JOB_ID>
```
---
## Environment Variables
```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq
# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...
# Embedding model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2
# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
---
## File Structure After Implementation
```
google-reviews-scraper-pro/
├── .artifacts/ # ← Design documents
│ ├── ReviewIQ-Pipeline-DevGuide.md # ← START HERE (for pipeline work)
│ ├── ReviewIQ-v32-Decisions.md
│ ├── ReviewIQ-Codebase-Overview.md
│ ├── ReviewIQ-Pipeline-Contracts-v1.md
│ ├── ReviewIQ-Pipeline-Checklist.md
│ └── ...
├── api_server_production.py # ✅ Exists - Main API
├── core/database.py # ✅ Exists - DB layer
├── scrapers/google_reviews/ # ✅ Exists - Scraper
├── pipeline/ # ❌ TO CREATE
│ ├── stage1_normalize.py
│ ├── stage2_classify.py
│ ├── stage3_route.py
│ ├── stage4_aggregate.py
│ ├── llm_client.py
│ └── tests/
└── migrations/
├── 001-004 # ✅ Exists
└── 005-009 # ❌ TO CREATE
```
---
*Keep this guide updated when adding new artifacts or completing stages.*

File diff suppressed because it is too large

@@ -7,9 +7,18 @@
## 1. Markpoint
```
-ID: reviewiq-v32-span-layer-2026-01-24-001
-Status: v3.2 span layer complete
-Based on: v3.1.2 (commit f998277)
+ID: reviewiq-v32-span-layer-2026-01-24-004
+Status: Pipeline contracts defined, ready for parallel implementation
+Based on: v3.2 (commit 43fd151)
START HERE: ReviewIQ-Pipeline-DevGuide.md (for pipeline implementation)
Key Documents:
- ReviewIQ-Pipeline-DevGuide.md (entry point for pipeline work)
- ReviewIQ-Codebase-Overview.md (file structure, what exists)
- ReviewIQ-Pipeline-Contracts-v1.md (stage I/O contracts, validation)
- ReviewIQ-Pipeline-Checklist.md (implementation checklist)
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md (taxonomy versioning spec)
```
---
@@ -152,6 +161,98 @@ Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
| Reprocessing strategy? | **Soft-switch** with is_active flag |
| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
| Taxonomy evolution tracking? | **Yes** — versioned codes with explicit mappings (v3.2.1) |
| B2 schema vs v3.2 divergence? | **Documented** — B2 is canonical URT, v3.2 is app layer |
| Taxonomy versioning? | **Yes** — `taxonomy_version` column on spans, versioned code tables |
---
## 13. B2 Schema Audit Findings
**Audit Date**: 2026-01-24
`B2-database-schema.sql` (canonical URT v5.1) and the ReviewIQ v3.2 spec diverge deliberately:
| Aspect | B2 (URT v5.1) | v3.2 (ReviewIQ) | Resolution |
|--------|---------------|-----------------|------------|
| Purpose | Source-agnostic taxonomy | Google Reviews app layer | Keep both |
| ID strategy | UUIDs + sequential | Deterministic SHA256 | v3.2 choice |
| Type safety | VARCHAR + CHECK | Postgres ENUMs | v3.2 choice |
| Span table | `spans` | `review_spans` | v3.2 naming |
| Offset columns | `char_start/char_end` | `span_start/span_end` | Document divergence |
| Tenant model | Single-tenant | Multi-tenant (business_id) | v3.2 requirement |
| Issue-span mapping | Many-to-many | One-to-one | v3.2 choice |
| Causal chain | Normalized table | JSONB column | v3.2 flexibility |
| Reprocessing | Not supported | Soft-switch pattern | v3.2 innovation |
**Action Items**:
1. Import reference data (domains, categories, subcodes) from B2 INSERTs
2. Seed `urt_codes` / `urt_codes_versioned` from B1-urt-codes.yaml
3. Do NOT adopt B2 structure directly — v3.2 has specific app requirements
---
## 14. Taxonomy Versioning (v3.2.1)
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Track taxonomy version | Required column on spans | Classifications only meaningful in version context |
| Version ID format | `v{major}.{minor}` | Human-readable, matches URT releases |
| Code FK strategy | Composite `(code, version_id)` | Prevents orphaned classifications |
| Cross-version mappings | Explicit mapping table | Enables normalized trend queries |
| Mapping direction | Forward only (old→new) | Simpler model, matches time flow |
| Default version | `'v5.1'` hardcoded | Safe baseline, explicit upgrade path |
| Fact table versioning | Per-row `taxonomy_version` | Enables version-specific aggregation |
**Key Tables Added**:
- `urt_taxonomy_versions` — Version registry with validity periods
- `urt_codes_versioned` — Full code definitions per version (SCD Type 2)
- `urt_code_mappings` — Cross-version translation rules
**Key Functions Added**:
- `translate_urt_code(code, from_version, to_version)` — Single code translation
- `get_code_lineage(code, version)` — Full historical lineage
- `detect_taxonomy_drift(from_version, to_version)` — Impact analysis
- `aggregate_spans_normalized(...)` — Version-normalized aggregation
**Principle**: Facts are immutable. A span classified as `J1.01` in v5.1 stays that way forever. Translation is explicit and auditable.
See: `.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md`
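To illustrate the "explicit, forward-only" principle, here is a Python mirror of the `translate_urt_code` signature listed above. It is a sketch, not the actual function: the mapping row, the target code, and the version `v5.2` are made up for illustration, and returning `None` for an unmapped pair is an assumption.

```python
from typing import Optional

# Hypothetical forward-only mapping rows: (code, from_version, to_version) -> new code.
# The target code and the version "v5.2" are invented for this example.
CODE_MAPPINGS: dict[tuple[str, str, str], str] = {
    ("J1.01", "v5.1", "v5.2"): "J1.02",
}


def translate_urt_code(code: str, from_version: str, to_version: str) -> Optional[str]:
    """Look up an explicit mapping; never guess when none is recorded."""
    if from_version == to_version:
        return code  # nothing to translate within one version
    return CODE_MAPPINGS.get((code, from_version, to_version))
```

Because only old-to-new rows exist, a backward lookup returns `None` instead of silently inverting a mapping, and the stored classification itself is never rewritten.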
---
## 15. Pipeline Implementation Status
**Overall: ~55% Complete** (as of 2026-01-24)
| Stage | Name | Status | Owner |
|-------|------|--------|-------|
| 0 | Raw Ingestion | ✅ DONE | Scraper Team |
| 1 | Normalization | ❌ TODO | TBD |
| 2 | LLM Classification | ❌ TODO | TBD |
| 3 | Issue Routing | ❌ TODO | TBD |
| 4 | Fact Aggregation | ❌ TODO | TBD |
**What's Working**:
- Google Maps scraping (v1.0.0)
- Job orchestration & queuing
- Webhook delivery
- Frontend job management
- Real-time SSE streaming
**What's Missing**:
- Entire enrichment pipeline (Stages 1-4)
- LLM integration
- Span extraction
- Issue routing
- Analytics aggregation
**Parallel Development**:
Each stage can be implemented independently using the contracts defined in:
- `ReviewIQ-Pipeline-Contracts-v1.md` — Full I/O specs, validation rules, test fixtures
- `ReviewIQ-Pipeline-Checklist.md` — Implementation checklist, definition of done
**Estimated Effort to 100%**: 6-8 weeks
---
@@ -180,4 +281,4 @@ GREATEST(0.2, base_trust * modifiers) -- Floor prevents collapse
---
-*Last updated: 2026-01-24*
+*Last updated: 2026-01-24 (pipeline contracts + codebase overview)*