docs: Add pipeline development artifacts for parallel implementation

New artifacts:
- ReviewIQ-Pipeline-DevGuide.md: Entry point for pipeline work
- ReviewIQ-Pipeline-Contracts-v1.md: Stage I/O specs, validation rules, test fixtures
- ReviewIQ-Pipeline-Checklist.md: Per-stage implementation checklists
- ReviewIQ-Codebase-Overview.md: File structure, integration points
- ReviewIQ-v3.2.1-Taxonomy-Versioning.md: Taxonomy versioning addendum

Updated:
- ReviewIQ-v32-Decisions.md: Added B2 audit findings, taxonomy versioning decisions, pipeline status

These artifacts enable parallel development of pipeline stages 1-4 with:
- Independent validation (35 rules across stages)
- Clear input/output contracts
- Test fixtures for each stage
- Definition of done criteria

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
450
.artifacts/ReviewIQ-Codebase-Overview.md
Normal file
@@ -0,0 +1,450 @@
# ReviewIQ Codebase Overview

**Purpose**: Map existing code for agents starting fresh
**Last Updated**: 2026-01-24
**Status**: ~55% implemented (scraping done, enrichment pipeline missing)

---

## Quick Start for New Agents

1. **Read first**: `ReviewIQ-v32-Decisions.md` (context recovery)
2. **For implementation**: `ReviewIQ-Pipeline-Contracts-v1.md` + `ReviewIQ-Pipeline-Checklist.md`
3. **For schema**: `ReviewIQ-Architecture-v3.2.md` § Part 2
4. **For LLM prompts**: `LLM-Classification-Contract-v1.md`

---

## Implementation Status Summary

```
PIPELINE COMPLETION: ~55%

✅ COMPLETE (Working in Production)
├── Google Maps scraping (v1.0.0)
├── Job orchestration & queuing
├── Chrome worker pool
├── Webhook delivery
├── SSE real-time streaming
├── Frontend job management
└── Basic analytics dashboard

❌ NOT IMPLEMENTED (Spec'd Only)
├── Stage 1: Normalization
├── Stage 2: LLM Classification
├── Stage 3: Issue Routing
├── Stage 4: Fact Aggregation
├── Enrichment database schema
└── Advanced analytics UI
```

---

## Directory Structure

```
google-reviews-scraper-pro/
│
├── .artifacts/  # Design documents (YOU ARE HERE)
│   ├── ReviewIQ-v32-Decisions.md  # START HERE - context recovery
│   ├── ReviewIQ-Architecture-v3.2.md  # Full v3.2 spec
│   ├── ReviewIQ-Pipeline-Contracts-v1.md  # Stage I/O contracts
│   ├── ReviewIQ-Pipeline-Checklist.md  # Implementation checklist
│   ├── LLM-Classification-Contract-v1.md  # LLM prompt spec
│   ├── URT-v5.1-Reference.md  # URT dimension codes
│   └── ReviewIQ-Codebase-Overview.md  # THIS FILE
│
├── api/  # ✅ API routes (FastAPI)
│   ├── routes/
│   │   ├── admin.py  # Scraper management endpoints
│   │   ├── dashboard.py  # Analytics endpoints
│   │   └── batches.py  # Batch job endpoints
│   └── __init__.py
│
├── core/  # ✅ Core services
│   ├── database.py  # AsyncPG database layer (~1200 lines)
│   ├── config.py  # Configuration management
│   └── logging.py  # Structured logging
│
├── services/  # ✅ Background services
│   ├── webhook_service.py  # Async webhook delivery
│   └── job_callback_service.py  # Callback handling
│
├── workers/  # ✅ Worker pool
│   └── chrome_pool.py  # Chrome instance pooling
│
├── scrapers/  # ✅ Scraper implementations
│   └── google_reviews/
│       └── v1_0_0.py  # Main scraper (~2000 lines)
│
├── pipeline/  # ❌ TO BE CREATED
│   ├── stage1_normalize.py  # TODO
│   ├── stage2_classify.py  # TODO
│   ├── stage3_route.py  # TODO
│   ├── stage4_aggregate.py  # TODO
│   └── tests/  # TODO
│
├── migrations/  # Database migrations
│   ├── 001_add_job_platform_fields.sql  # ✅ Deployed
│   ├── 002_create_batches_table.sql  # ✅ Deployed
│   ├── 003_create_scraper_registry.sql  # ✅ Deployed
│   ├── 004_create_api_keys.sql  # ✅ Deployed
│   ├── 005_create_reviews_tables.sql  # ❌ TODO
│   ├── 006_create_spans_table.sql  # ❌ TODO
│   ├── 007_create_urt_enums.sql  # ❌ TODO
│   ├── 008_create_issues_tables.sql  # ❌ TODO
│   └── 009_create_facts_table.sql  # ❌ TODO
│
├── frontend/  # ✅ Next.js frontend
│   ├── app/
│   │   ├── dashboard/  # System overview
│   │   ├── jobs/  # Job list & detail
│   │   ├── analytics/  # Basic charts
│   │   └── new/  # Job submission forms
│   └── components/
│
├── api_server_production.py  # ✅ Main API server (~1920 lines)
├── Dockerfile  # ✅ Production container
├── docker-compose.production.yml  # ✅ Docker orchestration
├── requirements-production.txt  # ✅ Python dependencies
└── package.json  # ✅ Node.js dependencies
```

---

## Key Files to Understand

### 1. API Server Entry Point
**File**: `api_server_production.py`
**Lines**: ~1920
**What it does**:
- FastAPI application setup
- All endpoint definitions
- Job submission and management
- SSE streaming
- Health checks

**Key endpoints**:
```
POST /api/scrape/google-reviews  # Submit scrape job
GET /jobs/{job_id}               # Get job status
GET /jobs/{job_id}/reviews       # Get scraped reviews
GET /jobs/{job_id}/stream        # SSE real-time updates
```

### 2. Database Layer
**File**: `core/database.py`
**Lines**: ~1200
**What it does**:
- AsyncPG connection pooling
- Job CRUD operations
- Review storage (currently JSONB blob)
- Webhook tracking

**Key functions**:
```python
async def create_job(job_data: dict) -> str
async def update_job_status(job_id: str, status: str, ...)
async def get_job(job_id: str) -> dict
async def store_reviews(job_id: str, reviews: list) -> int
```

**Note**: Currently stores reviews as JSONB in `jobs.reviews_data`.
The enrichment pipeline will need to:
1. Read from `jobs.reviews_data`
2. Write to `reviews_raw` and `reviews_enriched` tables

### 3. Google Scraper
**File**: `scrapers/google_reviews/v1_0_0.py`
**Lines**: ~2000
**What it does**:
- SeleniumBase Chrome automation
- DOM scraping + API interception
- Review extraction (text, rating, author, date)
- Business metadata extraction
- Pagination handling

**Output format** (stored in `jobs.reviews_data`):
```json
{
  "business_info": {...},
  "reviews": [
    {
      "review_id": "...",
      "author_name": "...",
      "rating": 4,
      "text": "...",
      "review_time": "2026-01-20T14:30:00Z"
    }
  ]
}
```
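
The payload above is the input contract for Stage 1, so it may help to pin it down as types. The following is a minimal sketch only: the class and function names (`ScrapedReview`, `ScrapeOutput`, `missing_review_keys`) are hypothetical, the field types are inferred from the single example above, and the shape of `business_info` is left open because the example elides it.

```python
from typing import Any, TypedDict


class ScrapedReview(TypedDict):
    """One entry of the `reviews` array in the example payload."""
    review_id: str
    author_name: str
    rating: int
    text: str
    review_time: str  # ISO 8601 timestamp, e.g. "2026-01-20T14:30:00Z"


class ScrapeOutput(TypedDict):
    """Top-level payload stored in `jobs.reviews_data`."""
    business_info: dict[str, Any]  # shape is elided in the example
    reviews: list[ScrapedReview]


# Field names taken from the TypedDict so the two cannot drift apart.
REQUIRED_REVIEW_KEYS = frozenset(ScrapedReview.__annotations__)


def missing_review_keys(payload: dict[str, Any]) -> list[tuple[int, list[str]]]:
    """Return (index, missing keys) for every review lacking a documented field."""
    problems = []
    for index, review in enumerate(payload.get("reviews", [])):
        missing = sorted(REQUIRED_REVIEW_KEYS - set(review))
        if missing:
            problems.append((index, missing))
    return problems
```

A check like this could run at the Stage 0 to Stage 1 handoff so that malformed scrape output is reported before normalization starts.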

### 4. Chrome Worker Pool
**File**: `workers/chrome_pool.py`
**What it does**:
- Pre-warms Chrome instances
- Manages concurrent scraping jobs
- Handles resource cleanup

---

## Database Schema (Current State)

### Deployed Tables

```sql
-- Core job tracking
CREATE TABLE jobs (
    job_id UUID PRIMARY KEY,
    status VARCHAR(50),
    url TEXT,
    reviews_data JSONB,  -- Raw scraped reviews live here
    reviews_count INTEGER,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    -- ... 20+ more columns
);

-- Batch processing
CREATE TABLE batches (...);

-- Webhook tracking
CREATE TABLE webhook_attempts (...);

-- Scraper versioning
CREATE TABLE scraper_registry (...);

-- API authentication (not enforced)
CREATE TABLE api_keys (...);
```

### NOT Deployed (Defined in v3.2 Spec)

```sql
-- These tables need to be created via migrations 005-009

CREATE TABLE locations (...);         -- Multi-tenant locations
CREATE TABLE reviews_raw (...);       -- Immutable raw storage
CREATE TABLE reviews_enriched (...);  -- Classified reviews
CREATE TABLE review_spans (...);      -- Span-level classification
CREATE TABLE urt_codes (...);         -- URT reference data
CREATE TABLE issues (...);            -- Aggregated issues
CREATE TABLE issue_spans (...);       -- Issue-span links
CREATE TABLE issue_events (...);      -- Audit log
CREATE TABLE fact_timeseries (...);   -- Pre-aggregated analytics
```

---

## Integration Points

### Where New Pipeline Code Connects

```
EXISTING CODE                            NEW CODE
════════════                             ════════

api_server_production.py
        │
        ▼
    jobs table
(reviews_data JSONB) ──────────────────▶ pipeline/stage1_normalize.py
                                                     │
                                                     ▼
                                         reviews_raw table
                                         reviews_enriched table
                                                     │
                                                     ▼
                                         pipeline/stage2_classify.py
                                                     │
                                                     ▼
                                         review_spans table
                                                     │
                                                     ▼
                                         pipeline/stage3_route.py
                                                     │
                                                     ▼
                                         issues table
                                         issue_spans table
                                                     │
                                                     ▼
                                         pipeline/stage4_aggregate.py
                                                     │
                                                     ▼
                                         fact_timeseries table
```

### How to Trigger Pipeline

**Option A: Post-scrape hook** (recommended)
```python
# In api_server_production.py, after job completes:
async def on_job_complete(job_id: str):
    # Existing: send webhook
    await webhook_service.dispatch(job_id)

    # NEW: trigger enrichment pipeline
    await pipeline.stage1.process_job(job_id)
```

**Option B: Background worker**
```python
# New file: workers/enrichment_worker.py
import asyncio
import logging

logger = logging.getLogger(__name__)

async def enrichment_loop():
    # `db` and `pipeline` stand for the project's database layer and the
    # pipeline package that is still to be created.
    while True:
        jobs = await db.query("""
            SELECT job_id FROM jobs
            WHERE status = 'completed'
            AND enrichment_status IS NULL
            LIMIT 10
        """)
        for job in jobs:
            try:
                # process() must set enrichment_status, otherwise the job
                # is selected again on the next cycle
                await pipeline.process(job['job_id'])
            except Exception:
                # One failing job must not stop the polling loop
                logger.exception("Enrichment failed for job %s", job['job_id'])
        await asyncio.sleep(60)
```

**Option C: Manual trigger via API**
```python
# New endpoint in api_server_production.py
from fastapi import BackgroundTasks

@app.post("/api/jobs/{job_id}/enrich")
async def trigger_enrichment(job_id: str, background_tasks: BackgroundTasks):
    # Run the pipeline after the response is sent, so "processing" is
    # returned immediately instead of after the pipeline has finished
    background_tasks.add_task(pipeline.process, job_id)
    return {"status": "processing"}
```

---

## Environment Setup

### Required Environment Variables

```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq

# LLM Provider (for Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...

# Embedding model (for Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
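
One way the pipeline could read these variables is sketched below. This is an illustration under assumptions, not the project's `core/config.py`: `PipelineSettings` and `load_settings` are hypothetical names, the provider labels are invented for the sketch, and falling back to the values shown above when `EMBEDDING_MODEL` or `DEFAULT_TAXONOMY_VERSION` is unset is an assumed policy.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PipelineSettings:
    database_url: str
    llm_provider: str  # "openai" or "anthropic" (labels are assumptions)
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> PipelineSettings:
    """Build settings from the environment, failing fast on missing values."""
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL")
    if not database_url:
        raise RuntimeError("DATABASE_URL is required")

    # Exactly one provider key is needed; OpenAI wins if both are set
    # (an arbitrary choice made for this sketch).
    if env.get("OPENAI_API_KEY"):
        provider, api_key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, api_key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY for Stage 2")

    return PipelineSettings(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=env.get("DEFAULT_TAXONOMY_VERSION", "v5.1"),
    )
```

Passing a plain mapping instead of `os.environ` keeps the loader easy to exercise in unit tests.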

### Local Development

```bash
# Start database
docker-compose -f docker-compose.production.yml up -d postgres

# Run migrations
psql $DATABASE_URL -f migrations/001_add_job_platform_fields.sql
# ... repeat for all migrations

# Start API server
python api_server_production.py

# Start frontend (separate terminal)
cd frontend && npm run dev
```

### Running Tests

```bash
# pipeline/tests/ does not exist yet (see Directory Structure); these
# commands apply once the pipeline stages and their tests are in place.

# Unit tests
pytest pipeline/tests/ -v

# Integration tests
pytest pipeline/tests/integration/ -v

# Full E2E validation
python -m pipeline.validate --job-id <JOB_ID>
```

---

## Tech Stack

| Component | Technology | Version |
|-----------|------------|---------|
| API | FastAPI | 0.100+ |
| Database | PostgreSQL | 15+ |
| DB Driver | asyncpg | 0.28+ |
| Scraping | SeleniumBase | 4.20+ |
| Browser | Chrome (headless) | 120+ |
| Frontend | Next.js | 16.1.3 |
| UI | React | 19.2.3 |
| Charts | Recharts | 2.x |
| LLM | OpenAI / Anthropic | Latest |
| Embeddings | sentence-transformers | 2.x |
| Vectors | pgvector | 0.5+ |

---

## Common Patterns

### Database Queries (asyncpg)

```python
# In core/database.py pattern:
from typing import Optional

async def get_job(job_id: str) -> Optional[dict]:
    # `pool` is the asyncpg connection pool owned by core/database.py
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            "SELECT * FROM jobs WHERE job_id = $1",
            job_id
        )
        # None when no job matches, hence the Optional return type
        return dict(row) if row else None
```

### API Endpoints (FastAPI)

```python
# In api_server_production.py pattern:
from fastapi import HTTPException

@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    job = await db.get_job(job_id)
    if not job:
        raise HTTPException(status_code=404, detail="Job not found")
    return job
```

### Background Tasks

```python
# Pattern for async processing:
import asyncio

_tasks = set()  # strong references; the event loop only keeps weak ones

async def process_in_background(job_id: str):
    task = asyncio.create_task(do_heavy_work(job_id))
    _tasks.add(task)  # keep the task alive until it finishes
    task.add_done_callback(_tasks.discard)
    return {"status": "processing"}
```

---

## Gotchas & Notes

1. **Reviews are JSONB blobs** - Currently stored in `jobs.reviews_data`, not in normalized tables
2. **No auth enforcement** - The `api_keys` table exists but is not used
3. **CORS is wide open** - Set to `*` in production (fix before launch)
4. **Scraper is single-threaded per job** - The Chrome pool handles concurrency
5. **Webhooks have retry logic** - 3 attempts with exponential backoff
6. **SSE streaming works** - Real-time job updates via `/jobs/{job_id}/stream`
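
Gotcha 5 describes the webhook retry policy only in passing. The actual logic lives in `services/webhook_service.py` and is not reproduced in this document; the sketch below just illustrates the general shape of "3 attempts with exponential backoff". The function name, the `send` callable, and the one-second base delay are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def deliver_with_retry(
    send: Callable[[], Awaitable[bool]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Call `send` up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        try:
            if await send():
                return True
        except Exception:
            # A raised error is treated the same as a failed delivery.
            pass
        if attempt < attempts - 1:
            # Waits base_delay, then 2x, then 4x, ... between attempts.
            await sleep(base_delay * 2 ** attempt)
    return False
```

The `sleep` parameter is injectable so tests can record the delays instead of waiting for them.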

---

## Next Steps for Implementation

1. **Create migrations 005-009** - Deploy enrichment schema
2. **Create `pipeline/` directory** - New code goes here
3. **Implement Stage 1** - Read from `jobs.reviews_data`, write to `reviews_raw` and `reviews_enriched`
4. **Implement Stage 2** - LLM classification with span extraction
5. **Implement Stage 3** - Issue routing
6. **Implement Stage 4** - Fact aggregation
7. **Add pipeline trigger** - Hook into job completion or create worker
8. **Update frontend** - Add enrichment views

---

*This document should be updated when significant code changes occur.*
371
.artifacts/ReviewIQ-Pipeline-Checklist.md
Normal file
@@ -0,0 +1,371 @@
# ReviewIQ Pipeline Implementation Checklist

**Purpose**: Quick reference for agents to verify stage completion
**Reference**: `ReviewIQ-Pipeline-Contracts-v1.md` for full specs

---

## Pipeline Overview

```
Stage 0 ──▶ Stage 1 ──▶ Stage 2 ──▶ Stage 3 ──▶ Stage 4
Scrape      Normalize   Classify    Route       Aggregate
✅ DONE     ❌ TODO     ❌ TODO     ❌ TODO     ❌ TODO
```

---

## Stage 0: Raw Ingestion ✅ COMPLETE

No action needed. Already implemented in `scrapers/google_reviews/v1_0_0.py`.

**Output Location**: `jobs.reviews_data` (JSONB)

---

## Stage 1: Normalization ❌ TODO

### Files to Create
- [ ] `pipeline/stage1_normalize.py` - Main normalization logic
- [ ] `pipeline/tests/test_stage1.py` - Unit tests
- [ ] `migrations/005_create_reviews_tables.sql` - Schema migration

### Database Schema Required
```sql
-- Must exist before Stage 1 can write
CREATE TABLE reviews_raw (...);
CREATE TABLE reviews_enriched (...);
```

### Implementation Checklist
- [ ] Read from `jobs.reviews_data` where `status = 'completed'`
- [ ] Filter out empty/null review texts
- [ ] Normalize text (lowercase, whitespace, emoji)
- [ ] Detect language (ISO 639-1)
- [ ] Compute content hash (SHA-256)
- [ ] Check for duplicates within business
- [ ] Insert into `reviews_raw` (immutable)
- [ ] Insert stub into `reviews_enriched` (classification fields NULL)
- [ ] Return `Stage1Output` with stats
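
The in-memory part of this checklist (filtering, normalization, hashing, duplicate detection) can be sketched as follows. This is a sketch under stated assumptions rather than the contract's definition: the function names are hypothetical, emoji are left untouched because the policy is not spelled out here, language detection and database writes are omitted, and hashing only the normalized text is an assumed choice that `ReviewIQ-Pipeline-Contracts-v1.md` may define differently.

```python
import hashlib
import re
import unicodedata
from typing import Iterable, Optional

_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Lowercase, drop control characters, and collapse whitespace."""
    cleaned = "".join(
        " " if ch.isspace() else ch  # tabs/newlines become plain spaces
        for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    return _WHITESPACE.sub(" ", cleaned).strip().lower()


def content_hash(normalized: str) -> str:
    """SHA-256 of the normalized text as 64 lowercase hex characters."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def prepare_reviews(
    reviews: Iterable[dict], seen_hashes: Optional[set] = None
) -> tuple[list[dict], dict]:
    """Filter, normalize, and de-duplicate the reviews of one business."""
    seen = set() if seen_hashes is None else seen_hashes
    rows = []
    stats = {"input": 0, "empty": 0, "duplicate": 0, "kept": 0}
    for review in reviews:
        stats["input"] += 1
        normalized = normalize_text(review.get("text") or "")
        if not normalized:
            stats["empty"] += 1
            continue
        digest = content_hash(normalized)
        if digest in seen:
            stats["duplicate"] += 1
            continue
        seen.add(digest)
        rows.append({**review, "text_normalized": normalized, "content_hash": digest})
        stats["kept"] += 1
    return rows, stats
```

Passing in `seen_hashes` loaded from existing rows would extend duplicate detection across jobs for the same business.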

### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage1.py -v
```

### Definition of Done
- [ ] All V1.1-V1.6 validation rules pass
- [ ] `reviews_raw` populated with immutable records
- [ ] `reviews_enriched` has stubs ready for Stage 2
- [ ] Integration test: Stage 0 output → Stage 1 input passes
- [ ] No empty texts in output
- [ ] Duplicate detection working

---

## Stage 2: LLM Classification ❌ TODO

### Files to Create
- [ ] `pipeline/stage2_classify.py` - LLM classification logic
- [ ] `pipeline/llm_client.py` - LLM provider abstraction
- [ ] `pipeline/span_extractor.py` - Span boundary detection
- [ ] `pipeline/tests/test_stage2.py` - Unit tests
- [ ] `migrations/006_create_spans_table.sql` - Schema migration
- [ ] `migrations/007_create_urt_enums.sql` - ENUM types

### Database Schema Required
```sql
-- ENUM types
CREATE TYPE urt_valence AS ENUM (...);
CREATE TYPE urt_intensity AS ENUM (...);
-- ... all 12 ENUMs from v3.2 spec

-- Spans table
CREATE TABLE review_spans (...);
```

### Implementation Checklist
- [ ] Query unclassified reviews from `reviews_enriched`
- [ ] Build LLM prompt per `LLM-Classification-Contract-v1.md`
- [ ] Call LLM API (support GPT-4o-mini, Claude-3-haiku)
- [ ] Parse structured JSON response
- [ ] Extract spans with character offsets
- [ ] Validate span_text matches original text substring
- [ ] Check spans don't overlap
- [ ] Select primary span (I3 > I2 > I1, V- > V± > V0 > V+)
- [ ] Generate embeddings (384-dim)
- [ ] Compute trust score (0.2 floor)
- [ ] Build USN string per profile
- [ ] Update `reviews_enriched` with classification
- [ ] Insert spans into `review_spans`
- [ ] Return `Stage2Output` with stats
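
The span checks and the primary-span rule from this checklist can be expressed without any LLM or database access. The sketch below assumes spans arrive as dicts with `span_start`, `span_end`, `span_text`, `intensity`, and `valence` keys (names taken from the validation rules in this file), reads the ordering as "intensity first, then valence", and breaks remaining ties by position; the tie-break and the function names are assumptions.

```python
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def validate_spans(text: str, spans: list[dict]) -> list[str]:
    """Return human-readable errors for offset, substring, and overlap problems."""
    errors = []
    for i, span in enumerate(spans):
        start, end = span["span_start"], span["span_end"]
        if end <= start:
            errors.append(f"span {i}: span_end must be greater than span_start")
        elif text[start:end] != span["span_text"]:
            errors.append(f"span {i}: span_text does not match text[{start}:{end}]")
    # Sorting by start offset means only neighbours need comparing.
    ordered = sorted(spans, key=lambda s: s["span_start"])
    for prev, cur in zip(ordered, ordered[1:]):
        if cur["span_start"] < prev["span_end"]:
            errors.append(f"spans overlap at offset {cur['span_start']}")
    return errors


def select_primary(spans: list[dict]) -> int:
    """Return the index of the primary span (strongest, then most negative)."""
    return min(
        range(len(spans)),
        key=lambda i: (
            INTENSITY_RANK[spans[i]["intensity"]],
            VALENCE_RANK[spans[i]["valence"]],
            i,  # assumed tie-break: earliest span wins
        ),
    )
```

Because `select_primary` always returns exactly one index, marking only that span as `is_primary` yields the "exactly one primary span per review" property, provided the review has at least one span.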

### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage2.py -v
```

### Definition of Done
- [ ] All V2.1-V2.12 validation rules pass
- [ ] LLM calls working with retry logic
- [ ] Span offsets correct (text substring matches)
- [ ] No overlapping spans
- [ ] Exactly one primary span per review
- [ ] Embeddings are 384-dim vectors
- [ ] Trust scores clamped to [0.2, 1.0]
- [ ] USN format valid per profile
- [ ] Integration test: Stage 1 output → Stage 2 input passes

---

## Stage 3: Issue Routing ❌ TODO

### Files to Create
- [ ] `pipeline/stage3_route.py` - Issue routing logic
- [ ] `pipeline/issue_manager.py` - Issue create/update logic
- [ ] `pipeline/tests/test_stage3.py` - Unit tests
- [ ] `migrations/008_create_issues_tables.sql` - Schema migration

### Database Schema Required
```sql
CREATE TABLE issues (...);
CREATE TABLE issue_spans (...);
CREATE TABLE issue_events (...);
```

### Implementation Checklist
- [ ] Query unrouted spans where `valence IN ('V-', 'V±')`
- [ ] Generate deterministic `issue_id` from routing key
- [ ] Check if issue exists
- [ ] Create new issue OR update existing counters
- [ ] Insert `issue_spans` link (enforce 1:1 with UNIQUE)
- [ ] Log event to `issue_events`
- [ ] Recalculate priority score
- [ ] Return `Stage3Output` with stats
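
A deterministic `issue_id` can be obtained by hashing the routing key. The sketch below satisfies the `ISS-` plus 16 hex characters format from rule V3.1, but the hash function (truncated SHA-256) and the composition of the routing key in `build_routing_key` are assumptions made for illustration; the authoritative definitions are in `ReviewIQ-Pipeline-Contracts-v1.md`.

```python
import hashlib

ROUTABLE_VALENCES = frozenset({"V-", "V±"})


def build_routing_key(business_id: str, place_id: str, urt_primary: str) -> str:
    """Hypothetical routing key: the real fields are defined by the contract."""
    return "|".join((business_id, place_id, urt_primary))


def issue_id_for(routing_key: str) -> str:
    """Map a routing key to a stable ID of the form ISS-<16 hex chars>."""
    if not routing_key:
        raise ValueError("routing_key must be non-empty")
    digest = hashlib.sha256(routing_key.encode("utf-8")).hexdigest()
    return f"ISS-{digest[:16]}"


def should_route(span: dict) -> bool:
    """Only negative and mixed spans create or update issues."""
    return span.get("valence") in ROUTABLE_VALENCES
```

Since the ID depends only on the key, re-running Stage 3 over the same spans resolves to the same issues, which is what makes the "create new issue OR update existing counters" step safe to repeat.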

### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage3.py -v
```

### Definition of Done
- [ ] All V3.1-V3.5 validation rules pass
- [ ] Issue IDs are deterministic (same key = same ID)
- [ ] 1:1 span-to-issue mapping enforced
- [ ] Only V-/V± spans create issues
- [ ] Issue counters updated correctly
- [ ] Events logged for audit
- [ ] Integration test: Stage 2 output → Stage 3 input passes

---

## Stage 4: Fact Aggregation ❌ TODO

### Files to Create
- [ ] `pipeline/stage4_aggregate.py` - Fact aggregation logic
- [ ] `pipeline/tests/test_stage4.py` - Unit tests
- [ ] `migrations/009_create_facts_table.sql` - Schema migration

### Database Schema Required
```sql
CREATE TABLE fact_timeseries (...);
```

### Implementation Checklist
- [ ] Accept business_id, date, bucket_types
- [ ] Query spans joined with reviews for the period
- [ ] Aggregate by URT code (per location + 'ALL' rollup)
- [ ] Compute: review_count, span_count, valence counts
- [ ] Compute: strength_score, negative_strength, positive_strength
- [ ] Compute: intensity distribution (I1/I2/I3)
- [ ] Compute: CR counts (better/worse/same)
- [ ] Compute: trust-weighted metrics
- [ ] UPSERT into `fact_timeseries`
- [ ] Return `Stage4Output` with stats
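
The counting part of this checklist (per location plus the `'ALL'` rollup) is sketched below for spans already loaded into memory. It is a simplified illustration: the input keys (`review_id`, `place_id`, `urt_primary`, `valence`, `intensity`) and the output layout are assumptions, strength scores, CR counts, and trust weighting are left out, and the caller is assumed to have filtered the spans to owned locations and to one period already.

```python
from collections import defaultdict


def _empty_fact() -> dict:
    return {
        "span_count": 0,
        "review_count": 0,
        "valence": {"V+": 0, "V-": 0, "V±": 0, "V0": 0},
        "intensity": {"I1": 0, "I2": 0, "I3": 0},
    }


def aggregate_facts(spans: list[dict]) -> dict[tuple[str, str], dict]:
    """Count spans per (place_id, urt_primary), plus an 'ALL' rollup per code."""
    facts: dict[tuple[str, str], dict] = {}
    review_ids: dict[tuple[str, str], set] = defaultdict(set)
    for span in spans:
        # Every span contributes once to its own location and once to 'ALL'.
        for place in (span["place_id"], "ALL"):
            key = (place, span["urt_primary"])
            fact = facts.setdefault(key, _empty_fact())
            fact["span_count"] += 1
            fact["valence"][span["valence"]] += 1
            fact["intensity"][span["intensity"]] += 1
            review_ids[key].add(span["review_id"])
    for key, fact in facts.items():
        # Distinct reviews, so span_count >= review_count always holds.
        fact["review_count"] = len(review_ids[key])
    return facts
```

Recomputing the facts from the spans and then writing them with an UPSERT, rather than incrementing stored counters, is one way to get the idempotency required by the Definition of Done.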

### Validation (run after implementation)
```bash
python -m pytest pipeline/tests/test_stage4.py -v
```

### Definition of Done
- [ ] All V4.1-V4.7 validation rules pass
- [ ] Valence counts sum to span_count
- [ ] Intensity counts sum to span_count
- [ ] 'ALL' rollup includes owned locations only
- [ ] Facts are idempotent (re-run produces same result)
- [ ] Integration test: Full pipeline E2E passes

---

## Integration Tests

### Handoff Tests (run after each stage)
```bash
# Stage 0 → 1
python -m pytest pipeline/tests/integration/test_stage0_to_1.py

# Stage 1 → 2
python -m pytest pipeline/tests/integration/test_stage1_to_2.py

# Stage 2 → 3
python -m pytest pipeline/tests/integration/test_stage2_to_3.py

# Full E2E
python -m pytest pipeline/tests/integration/test_e2e.py
```

### E2E Validation Command
```bash
# Run full pipeline validation
python -m pipeline.validate --job-id <JOB_ID> --verbose
```

Expected output:
```
Stage 0: ✅ PASS (5/5 rules)
Stage 1: ✅ PASS (6/6 rules)
Stage 2: ✅ PASS (12/12 rules)
Stage 3: ✅ PASS (5/5 rules)
Stage 4: ✅ PASS (7/7 rules)

E2E Integration: ✅ PASS
  - Reviews scraped: 47
  - Reviews normalized: 45 (2 empty filtered)
  - Spans extracted: 127
  - Issues created: 23
  - Facts written: 156
```

---

## Quick Reference: Validation Rules

### Stage 1
| Rule | Description |
|------|-------------|
| V1.1 | `text` is non-empty |
| V1.2 | `text_normalized` has no control chars |
| V1.3 | `content_hash` is 64-char hex |
| V1.4 | `review_version` >= 1 |
| V1.5 | `text_language` is valid ISO 639-1 |
| V1.6 | `raw_id` references valid row |

### Stage 2
| Rule | Description |
|------|-------------|
| V2.1 | `urt_primary` matches `^[OPJEAVR][1-4]\.[0-9]{2}$` |
| V2.2 | `urt_secondary` max 2 elements |
| V2.3 | `valence` is valid enum |
| V2.4 | `intensity` is valid enum |
| V2.5 | `span_end > span_start` |
| V2.6 | `span_text == text[start:end]` |
| V2.7 | Spans don't overlap |
| V2.8 | Exactly one `is_primary = true` |
| V2.9 | `trust_score` in [0.2, 1.0] |
| V2.10 | `embedding` is 384-dim |
| V2.11 | `usn` matches profile regex |
| V2.12 | `related_span_index` valid if set |

### Stage 3
| Rule | Description |
|------|-------------|
| V3.1 | `issue_id` matches `^ISS-[a-f0-9]{16}$` |
| V3.2 | `routing_key` non-empty |
| V3.3 | Span not already linked elsewhere |
| V3.4 | Issue exists in `issues` table |
| V3.5 | Only V-/V± spans routed |

### Stage 4
| Rule | Description |
|------|-------------|
| V4.1 | `place_id` valid or 'ALL' |
| V4.2 | `period_date` matches bucket |
| V4.3 | `span_count >= review_count` |
| V4.4 | Valence counts sum correctly |
| V4.5 | Intensity counts sum correctly |
| V4.6 | `strength_score >= 0` |
| V4.7 | `avg_rating` in [1.0, 5.0] or NULL |
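
Several of these rules are purely structural and can be checked on a single row without touching the database. As a rough sketch, assuming an enriched row is available as a dict with the field names used in the tables above (the function name, the lowercase-hex reading of V1.3, and the choice of which rules to cover are assumptions), a checker might look like:

```python
import re

CONTENT_HASH = re.compile(r"^[a-f0-9]{64}$")          # V1.3 (lowercase hex assumed)
URT_CODE = re.compile(r"^[OPJEAVR][1-4]\.[0-9]{2}$")  # V2.1


def check_enriched_row(row: dict) -> list[str]:
    """Return the IDs of the structural rules that the row violates."""
    failed = []
    if not CONTENT_HASH.match(row.get("content_hash", "")):
        failed.append("V1.3")
    if not URT_CODE.match(row.get("urt_primary", "")):
        failed.append("V2.1")
    if len(row.get("urt_secondary", [])) > 2:
        failed.append("V2.2")
    trust = row.get("trust_score")
    if trust is None or not 0.2 <= trust <= 1.0:
        failed.append("V2.9")
    if len(row.get("embedding", [])) != 384:
        failed.append("V2.10")
    return failed
```

Returning rule IDs rather than raising on the first failure fits the `N/N rules` style of report shown in the expected validator output.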

---

## Migration Execution Order

```bash
# Run in sequence
# (check the v3.2 DDL first: if review_spans columns use the URT ENUM
#  types, 007 has to be applied before 006)
psql $DATABASE_URL -f migrations/005_create_reviews_tables.sql
psql $DATABASE_URL -f migrations/006_create_spans_table.sql
psql $DATABASE_URL -f migrations/007_create_urt_enums.sql
psql $DATABASE_URL -f migrations/008_create_issues_tables.sql
psql $DATABASE_URL -f migrations/009_create_facts_table.sql
```

---

## Environment Variables Required

```bash
# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...

# Embedding Model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Database
DATABASE_URL=postgresql://...

# Taxonomy
DEFAULT_TAXONOMY_VERSION=v5.1
```

---

## File Structure After Implementation

```
pipeline/
├── __init__.py
├── stage1_normalize.py  # ❌ TODO
├── stage2_classify.py  # ❌ TODO
├── stage3_route.py  # ❌ TODO
├── stage4_aggregate.py  # ❌ TODO
├── llm_client.py  # ❌ TODO
├── span_extractor.py  # ❌ TODO
├── issue_manager.py  # ❌ TODO
├── validate.py  # ❌ TODO (CLI validator)
├── contracts.py  # ❌ TODO (TypedDict definitions)
└── tests/
    ├── __init__.py
    ├── test_stage1.py  # ❌ TODO
    ├── test_stage2.py  # ❌ TODO
    ├── test_stage3.py  # ❌ TODO
    ├── test_stage4.py  # ❌ TODO
    ├── fixtures/
    │   ├── stage0_output.json
    │   ├── stage1_output.json
    │   ├── stage2_output.json
    │   ├── stage3_output.json
    │   └── stage4_output.json
    └── integration/
        ├── test_stage0_to_1.py
        ├── test_stage1_to_2.py
        ├── test_stage2_to_3.py
        └── test_e2e.py

migrations/
├── 001_add_job_platform_fields.sql  # ✅ EXISTS
├── 002_create_batches_table.sql  # ✅ EXISTS
├── 003_create_scraper_registry.sql  # ✅ EXISTS
├── 004_create_api_keys.sql  # ✅ EXISTS
├── 005_create_reviews_tables.sql  # ❌ TODO
├── 006_create_spans_table.sql  # ❌ TODO
├── 007_create_urt_enums.sql  # ❌ TODO
├── 008_create_issues_tables.sql  # ❌ TODO
└── 009_create_facts_table.sql  # ❌ TODO
```

---

*Last Updated: 2026-01-24*
1255
.artifacts/ReviewIQ-Pipeline-Contracts-v1.md
Normal file
File diff suppressed because it is too large
312
.artifacts/ReviewIQ-Pipeline-DevGuide.md
Normal file
@@ -0,0 +1,312 @@
# ReviewIQ Pipeline Development Guide

**Purpose**: Entry point for agents implementing the enrichment pipeline
**Last Updated**: 2026-01-24

---

## TL;DR - Current State

**Pipeline Implementation: ~55% complete**

```
✅ WORKING                          ❌ NOT IMPLEMENTED
──────────                          ──────────────────
Google Maps scraping                Stage 1: Normalization
Job orchestration                   Stage 2: LLM Classification
Chrome worker pool                  Stage 3: Issue Routing
Webhook delivery                    Stage 4: Fact Aggregation
SSE streaming                       Enrichment database schema
Frontend (job management)           Advanced analytics UI
```

**Estimated effort to 100%**: 6-8 weeks

---

## Cold Start Instructions

A new agent should:

| Step | Action | Time |
|------|--------|------|
| 1 | Read this file (`ReviewIQ-Pipeline-DevGuide.md`) | 2 min |
| 2 | Read `ReviewIQ-v32-Decisions.md` | 5 min |
| 3 | Read `ReviewIQ-Codebase-Overview.md` | 10 min |
| 4 | Read assigned stage in `ReviewIQ-Pipeline-Contracts-v1.md` | 15 min |
| 5 | Use `ReviewIQ-Pipeline-Checklist.md` to verify completion | Reference |

---

## Document Map

```
                       ┌─────────────────────────────────────┐
                       │    ReviewIQ-Pipeline-DevGuide.md    │
                       │           (YOU ARE HERE)            │
                       └─────────────────┬───────────────────┘
                                         │
           ┌─────────────────────────────┼─────────────────────────────┐
           │                             │                             │
           ▼                             ▼                             ▼
┌─────────────────────┐     ┌─────────────────────────┐     ┌─────────────────────┐
│ CONTEXT RECOVERY    │     │ IMPLEMENTATION          │     │ REFERENCE           │
├─────────────────────┤     ├─────────────────────────┤     ├─────────────────────┤
│                     │     │                         │     │                     │
│ ReviewIQ-v32-       │     │ Pipeline-Contracts-v1   │     │ Architecture-v3.2   │
│ Decisions.md        │     │ (I/O specs, validation) │     │ (full DDL spec)     │
│ (key decisions,     │     │                         │     │                     │
│ markpoint)          │     │ Pipeline-Checklist      │     │ v3.2.1-Taxonomy-    │
│                     │     │ (implementation tasks)  │     │ Versioning          │
│ Codebase-Overview   │     │                         │     │ (versioning spec)   │
│ (file structure,    │     │ LLM-Classification-     │     │                     │
│ integration points) │     │ Contract-v1             │     │ URT-v5.1-Reference  │
│                     │     │ (prompt engineering)    │     │ (dimension codes)   │
└─────────────────────┘     └─────────────────────────┘     └─────────────────────┘
```

---

## Core Documents

### Context & Status (Read First)

| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-DevGuide.md` | Entry point, document map | 2 min |
| `ReviewIQ-v32-Decisions.md` | Key decisions, current markpoint | 5 min |
| `ReviewIQ-Codebase-Overview.md` | File structure, what code exists, integration points | 10 min |

### Implementation Guides (For Building)

| File | Purpose | Est. Read Time |
|------|---------|----------------|
| `ReviewIQ-Pipeline-Contracts-v1.md` | Stage I/O specs, validation rules, test fixtures | 15 min |
| `ReviewIQ-Pipeline-Checklist.md` | Per-stage implementation checklist, definition of done | 5 min |
| `LLM-Classification-Contract-v1.md` | LLM prompt engineering spec (Stage 2) | 10 min |

### Full Specifications (Reference)

| File | Purpose | When to Read |
|------|---------|--------------|
| `ReviewIQ-Architecture-v3.2.md` | Complete v3.2 spec with DDL | Schema details |
| `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` | Taxonomy versioning addendum | Future-proofing |
| `URT-v5.1-Reference.md` | URT dimension codes reference | Classification reference |

### Legacy (Superseded - Reference Only)

| File | Note |
|------|------|
| `ReviewIQ-Architecture-v2.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.md` | Superseded by v3.2 |
| `ReviewIQ-Architecture-v3.1.md` | Superseded by v3.2 |
| `CONTEXT-KEEPER.md` | Use `ReviewIQ-v32-Decisions.md` instead |

---

## What's Captured in Artifacts

| Context | Document |
|---------|----------|
| Key architectural decisions | `ReviewIQ-v32-Decisions.md` |
| Current implementation status (~55%) | `ReviewIQ-Codebase-Overview.md` |
| Existing file structure | `ReviewIQ-Codebase-Overview.md` |
| Integration points (where new code connects) | `ReviewIQ-Codebase-Overview.md` |
| Stage input/output contracts | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Validation rules (35 total across stages) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Test fixtures (5 sample JSON payloads) | `ReviewIQ-Pipeline-Contracts-v1.md` |
| Implementation checklists | `ReviewIQ-Pipeline-Checklist.md` |
| Definition of done per stage | `ReviewIQ-Pipeline-Checklist.md` |
| LLM prompt specification | `LLM-Classification-Contract-v1.md` |
| URT taxonomy codes | `URT-v5.1-Reference.md` |
| Full database DDL | `ReviewIQ-Architecture-v3.2.md` |
| Taxonomy versioning schema | `ReviewIQ-v3.2.1-Taxonomy-Versioning.md` |

---

## Pipeline Stages

| Stage | Name | Status | Contract Section | Validation Rules |
|-------|------|--------|------------------|------------------|
| 0 | Raw Ingestion | ✅ Done | Pipeline-Contracts § Stage 0 | V0.1-V0.5 |
| 1 | Normalization | ❌ TODO | Pipeline-Contracts § Stage 1 | V1.1-V1.6 |
| 2 | LLM Classification | ❌ TODO | Pipeline-Contracts § Stage 2 | V2.1-V2.12 |
| 3 | Issue Routing | ❌ TODO | Pipeline-Contracts § Stage 3 | V3.1-V3.5 |
| 4 | Fact Aggregation | ❌ TODO | Pipeline-Contracts § Stage 4 | V4.1-V4.7 |
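
The rule ranges in the table can be cross-checked against the 35-rule total quoted for the contracts. The following is a minimal Python sketch, assuming each range such as `V2.1-V2.12` is contiguous and inclusive; `RULE_RANGES` and `rules_in_range` are illustrative names, not part of the codebase.

```python
# Rule ranges copied from the Pipeline Stages table. Assumption: each range is
# contiguous and inclusive, so V2.1-V2.12 stands for 12 rules.
RULE_RANGES = {
    0: ("V0.1", "V0.5"),
    1: ("V1.1", "V1.6"),
    2: ("V2.1", "V2.12"),
    3: ("V3.1", "V3.5"),
    4: ("V4.1", "V4.7"),
}


def rules_in_range(first: str, last: str) -> int:
    """Count the rules in an inclusive range such as ('V2.1', 'V2.12')."""
    first_stage, first_idx = first.lstrip("V").split(".")
    last_stage, last_idx = last.lstrip("V").split(".")
    if first_stage != last_stage:
        raise ValueError(f"range spans more than one stage: {first}-{last}")
    return int(last_idx) - int(first_idx) + 1


counts = {stage: rules_in_range(*bounds) for stage, bounds in RULE_RANGES.items()}
total = sum(counts.values())

print(counts)  # per-stage rule counts
print(total)   # 5 + 6 + 12 + 5 + 7 = 35
```

The per-stage counts are the same denominators that appear in the expected validator output (5/5, 6/6, 12/12, 5/5, 7/7).
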

---

## Parallel Development Assignment

### Agent 1 - Stage 1 (Normalization)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 1
  - ReviewIQ-Codebase-Overview.md (integration points)

Create:
  - pipeline/stage1_normalize.py
  - migrations/005_create_reviews_tables.sql
  - pipeline/tests/test_stage1.py

Validate:
  - V1.1-V1.6 rules pass
  - Integration test: Stage 0 → Stage 1 passes
```

### Agent 2 - Stage 2 (LLM Classification)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 2
  - LLM-Classification-Contract-v1.md
  - URT-v5.1-Reference.md

Create:
  - pipeline/stage2_classify.py
  - pipeline/llm_client.py
  - pipeline/span_extractor.py
  - migrations/006_create_spans_table.sql
  - migrations/007_create_urt_enums.sql
  - pipeline/tests/test_stage2.py

Validate:
  - V2.1-V2.12 rules pass
  - Integration test: Stage 1 → Stage 2 passes
```

### Agent 3 - Stage 3 (Issue Routing)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 3
  - ReviewIQ-Architecture-v3.2.md § Part 5 (issue lifecycle)

Create:
  - pipeline/stage3_route.py
  - pipeline/issue_manager.py
  - migrations/008_create_issues_tables.sql
  - pipeline/tests/test_stage3.py

Validate:
  - V3.1-V3.5 rules pass
  - Integration test: Stage 2 → Stage 3 passes
```

### Agent 4 - Stage 4 (Fact Aggregation)
```
Read:
  - ReviewIQ-Pipeline-Contracts-v1.md § Stage 4
  - ReviewIQ-Architecture-v3.2.md § Part 6 (analytics)

Create:
  - pipeline/stage4_aggregate.py
  - migrations/009_create_facts_table.sql
  - pipeline/tests/test_stage4.py

Validate:
  - V4.1-V4.7 rules pass
  - E2E pipeline test passes
```
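
The "Create" lists above can double as a quick progress check. The following is a minimal Python sketch, assuming the paths are relative to the repository root; `STAGE_DELIVERABLES` and `missing_deliverables` are hypothetical names introduced only for illustration, and the check covers file presence only, not the validation rules.

```python
from pathlib import Path

# Files each agent is asked to create, copied from the assignment blocks.
STAGE_DELIVERABLES = {
    1: [
        "pipeline/stage1_normalize.py",
        "migrations/005_create_reviews_tables.sql",
        "pipeline/tests/test_stage1.py",
    ],
    2: [
        "pipeline/stage2_classify.py",
        "pipeline/llm_client.py",
        "pipeline/span_extractor.py",
        "migrations/006_create_spans_table.sql",
        "migrations/007_create_urt_enums.sql",
        "pipeline/tests/test_stage2.py",
    ],
    3: [
        "pipeline/stage3_route.py",
        "pipeline/issue_manager.py",
        "migrations/008_create_issues_tables.sql",
        "pipeline/tests/test_stage3.py",
    ],
    4: [
        "pipeline/stage4_aggregate.py",
        "migrations/009_create_facts_table.sql",
        "pipeline/tests/test_stage4.py",
    ],
}


def missing_deliverables(repo_root: str, stage: int) -> list[str]:
    """Return the stage's expected files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in STAGE_DELIVERABLES[stage] if not (root / rel).is_file()]


# Report outstanding files for every stage, relative to the current directory.
for stage in sorted(STAGE_DELIVERABLES):
    print(stage, missing_deliverables(".", stage))
```

An empty list for a stage only means its files exist; the stage still has to pass its validation rules and integration test.
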

---

## Success Criteria

The pipeline is complete when the validator reports every stage passing:

```bash
python -m pipeline.validate --job-id <JOB_ID> --verbose

# Expected output:
# Stage 0: ✅ PASS (5/5 rules)
# Stage 1: ✅ PASS (6/6 rules)
# Stage 2: ✅ PASS (12/12 rules)
# Stage 3: ✅ PASS (5/5 rules)
# Stage 4: ✅ PASS (7/7 rules)
# E2E Integration: ✅ PASS
```

---

## Quick Commands

```bash
# Check current branch
git branch --show-current
# Expected: feature/platform-restructure

# View recent commits
git log --oneline -5

# Start database
docker-compose -f docker-compose.production.yml up -d postgres

# Run API server
python api_server_production.py

# Run frontend
cd frontend && npm run dev

# Run migrations (when created)
psql "$DATABASE_URL" -f migrations/005_create_reviews_tables.sql

# Run tests
pytest pipeline/tests/ -v

# Validate pipeline
python -m pipeline.validate --job-id <JOB_ID>
```

---

## Environment Variables

```bash
# Database (required)
DATABASE_URL=postgresql://user:pass@localhost:5432/reviewiq

# LLM Provider (Stage 2)
OPENAI_API_KEY=sk-...
# OR
ANTHROPIC_API_KEY=sk-ant-...

# Embedding model (Stage 2)
EMBEDDING_MODEL=all-MiniLM-L6-v2

# Taxonomy version
DEFAULT_TAXONOMY_VERSION=v5.1
```
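
One way pipeline code might read these variables is sketched below in Python. This is an illustration under stated assumptions, not the project's configuration layer: `PipelineConfig` and `load_pipeline_config` are hypothetical names, the values shown above for `EMBEDDING_MODEL` and `DEFAULT_TAXONOMY_VERSION` are treated as defaults, the `v{major}.{minor}` check follows the version ID format recorded in the decisions document, and preferring OpenAI when both keys are set is an arbitrary choice.

```python
import os
import re
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: taxonomy version IDs follow the v{major}.{minor} format.
_VERSION_RE = re.compile(r"^v\d+\.\d+$")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the settings listed above."""

    database_url: str
    llm_provider: str  # "openai" or "anthropic"
    llm_api_key: str
    embedding_model: str
    taxonomy_version: str


def load_pipeline_config(env: Optional[Mapping[str, str]] = None) -> PipelineConfig:
    """Build a config from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env

    database_url = env.get("DATABASE_URL", "")
    if not database_url:
        raise ValueError("DATABASE_URL is required")

    # Either provider key is enough; this sketch arbitrarily prefers OpenAI.
    if env.get("OPENAI_API_KEY"):
        provider, api_key = "openai", env["OPENAI_API_KEY"]
    elif env.get("ANTHROPIC_API_KEY"):
        provider, api_key = "anthropic", env["ANTHROPIC_API_KEY"]
    else:
        raise ValueError("Stage 2 needs OPENAI_API_KEY or ANTHROPIC_API_KEY")

    taxonomy_version = env.get("DEFAULT_TAXONOMY_VERSION", "v5.1")
    if not _VERSION_RE.match(taxonomy_version):
        raise ValueError(f"unexpected taxonomy version: {taxonomy_version!r}")

    return PipelineConfig(
        database_url=database_url,
        llm_provider=provider,
        llm_api_key=api_key,
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        taxonomy_version=taxonomy_version,
    )


example = load_pipeline_config(
    {
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/reviewiq",
        "ANTHROPIC_API_KEY": "sk-ant-placeholder",
    }
)
print(example.llm_provider, example.taxonomy_version)  # anthropic v5.1
```

Passing an explicit mapping keeps the loader testable without touching the real environment; calling it with no argument reads `os.environ`.
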

---

## File Structure After Implementation

```
google-reviews-scraper-pro/
├── .artifacts/                          # ← Design documents
│   ├── ReviewIQ-Pipeline-DevGuide.md    # ← START HERE (for pipeline work)
│   ├── ReviewIQ-v32-Decisions.md
│   ├── ReviewIQ-Codebase-Overview.md
│   ├── ReviewIQ-Pipeline-Contracts-v1.md
│   ├── ReviewIQ-Pipeline-Checklist.md
│   └── ...
│
├── api_server_production.py             # ✅ Exists - Main API
├── core/database.py                     # ✅ Exists - DB layer
├── scrapers/google_reviews/             # ✅ Exists - Scraper
│
├── pipeline/                            # ❌ TO CREATE
│   ├── stage1_normalize.py
│   ├── stage2_classify.py
│   ├── stage3_route.py
│   ├── stage4_aggregate.py
│   ├── llm_client.py
│   └── tests/
│
└── migrations/
    ├── 001-004                          # ✅ Exists
    └── 005-009                          # ❌ TO CREATE
```

---

*Keep this guide updated when adding new artifacts or completing stages.*
1107
.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md
Normal file
File diff suppressed because it is too large
@@ -7,9 +7,18 @@
 ## 1. Markpoint
 
 ```
-ID: reviewiq-v32-span-layer-2026-01-24-001
-Status: v3.2 span layer complete
-Based on: v3.1.2 (commit f998277)
+ID: reviewiq-v32-span-layer-2026-01-24-004
+Status: Pipeline contracts defined, ready for parallel implementation
+Based on: v3.2 (commit 43fd151)
+
+START HERE: ReviewIQ-Pipeline-DevGuide.md (for pipeline implementation)
+
+Key Documents:
+- ReviewIQ-Pipeline-DevGuide.md (entry point for pipeline work)
+- ReviewIQ-Codebase-Overview.md (file structure, what exists)
+- ReviewIQ-Pipeline-Contracts-v1.md (stage I/O contracts, validation)
+- ReviewIQ-Pipeline-Checklist.md (implementation checklist)
+- ReviewIQ-v3.2.1-Taxonomy-Versioning.md (taxonomy versioning spec)
 ```
 
 ---
@@ -152,6 +161,98 @@ Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
 | Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
 | Reprocessing strategy? | **Soft-switch** with is_active flag |
 | TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
+| Taxonomy evolution tracking? | **Yes** — versioned codes with explicit mappings (v3.2.1) |
+| B2 schema vs v3.2 divergence? | **Documented** — B2 is canonical URT, v3.2 is app layer |
+| Taxonomy versioning? | **Yes** — `taxonomy_version` column on spans, versioned code tables |
+
+---
+
+## 13. B2 Schema Audit Findings
+
+**Audit Date**: 2026-01-24
+
+The B2-database-schema.sql (canonical URT v5.1) and ReviewIQ v3.2 spec have deliberate divergences:
+
+| Aspect | B2 (URT v5.1) | v3.2 (ReviewIQ) | Resolution |
+|--------|---------------|-----------------|------------|
+| Purpose | Source-agnostic taxonomy | Google Reviews app layer | Keep both |
+| ID strategy | UUIDs + sequential | Deterministic SHA256 | v3.2 choice |
+| Type safety | VARCHAR + CHECK | Postgres ENUMs | v3.2 choice |
+| Span table | `spans` | `review_spans` | v3.2 naming |
+| Offset columns | `char_start/char_end` | `span_start/span_end` | Document divergence |
+| Tenant model | Single-tenant | Multi-tenant (business_id) | v3.2 requirement |
+| Issue-span mapping | Many-to-many | One-to-one | v3.2 choice |
+| Causal chain | Normalized table | JSONB column | v3.2 flexibility |
+| Reprocessing | Not supported | Soft-switch pattern | v3.2 innovation |
+
+**Action Items**:
+1. Import reference data (domains, categories, subcodes) from B2 INSERTs
+2. Seed `urt_codes` / `urt_codes_versioned` from B1-urt-codes.yaml
+3. Do NOT adopt B2 structure directly — v3.2 has specific app requirements
+
+---
+
+## 14. Taxonomy Versioning (v3.2.1)
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Track taxonomy version | Required column on spans | Classifications only meaningful in version context |
+| Version ID format | `v{major}.{minor}` | Human-readable, matches URT releases |
+| Code FK strategy | Composite `(code, version_id)` | Prevents orphaned classifications |
+| Cross-version mappings | Explicit mapping table | Enables normalized trend queries |
+| Mapping direction | Forward only (old→new) | Simpler model, matches time flow |
+| Default version | `'v5.1'` hardcoded | Safe baseline, explicit upgrade path |
+| Fact table versioning | Per-row `taxonomy_version` | Enables version-specific aggregation |
+
+**Key Tables Added**:
+- `urt_taxonomy_versions` — Version registry with validity periods
+- `urt_codes_versioned` — Full code definitions per version (SCD Type 2)
+- `urt_code_mappings` — Cross-version translation rules
+
+**Key Functions Added**:
+- `translate_urt_code(code, from_version, to_version)` — Single code translation
+- `get_code_lineage(code, version)` — Full historical lineage
+- `detect_taxonomy_drift(from_version, to_version)` — Impact analysis
+- `aggregate_spans_normalized(...)` — Version-normalized aggregation
+
+**Principle**: Facts are immutable. A span classified as `J1.01` in v5.1 stays that way forever. Translation is explicit and auditable.
+
+See: `.artifacts/ReviewIQ-v3.2.1-Taxonomy-Versioning.md`
+
+---
+
+## 15. Pipeline Implementation Status
+
+**Overall: ~55% Complete** (as of 2026-01-24)
+
+| Stage | Name | Status | Owner |
+|-------|------|--------|-------|
+| 0 | Raw Ingestion | ✅ DONE | Scraper Team |
+| 1 | Normalization | ❌ TODO | TBD |
+| 2 | LLM Classification | ❌ TODO | TBD |
+| 3 | Issue Routing | ❌ TODO | TBD |
+| 4 | Fact Aggregation | ❌ TODO | TBD |
+
+**What's Working**:
+- Google Maps scraping (v1.0.0)
+- Job orchestration & queuing
+- Webhook delivery
+- Frontend job management
+- Real-time SSE streaming
+
+**What's Missing**:
+- Entire enrichment pipeline (Stages 1-4)
+- LLM integration
+- Span extraction
+- Issue routing
+- Analytics aggregation
+
+**Parallel Development**:
+Each stage can be implemented independently using the contracts defined in:
+- `ReviewIQ-Pipeline-Contracts-v1.md` — Full I/O specs, validation rules, test fixtures
+- `ReviewIQ-Pipeline-Checklist.md` — Implementation checklist, definition of done
+
+**Estimated Effort to 100%**: 6-8 weeks
 
 ---
 
@@ -180,4 +281,4 @@ GREATEST(0.2, base_trust * modifiers) -- Floor prevents collapse
 
 ---
 
-*Last updated: 2026-01-24*
+*Last updated: 2026-01-24 (pipeline contracts + codebase overview)*