feat: Add reviewiq-pipeline package for LLM-powered review classification

Implement a standalone Python package for processing customer reviews through
a 4-stage pipeline using URT (Universal Review Taxonomy) v5.1:

- Stage 1: Normalization (text cleaning, language detection, deduplication)
- Stage 2: LLM Classification (OpenAI/Anthropic span extraction with URT codes)
- Stage 3: Issue Routing (deterministic issue ID generation, span linking)
- Stage 4: Fact Aggregation (time series metrics for dashboards)

Package includes:
- TypedDict contracts matching Pipeline-Contracts-v1.md
- Async database layer with asyncpg and 5 SQL migrations
- LLM client abstraction supporting both OpenAI and Anthropic
- Sentence-transformers integration for embeddings
- Validation rules V1.x through V4.x
- CLI commands: migrate, run, validate, check
- 55 unit and integration tests (all passing)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-24 18:07:11 +00:00
parent b780a23b66
commit 7d720f5378
34 changed files with 7222 additions and 0 deletions


@@ -0,0 +1,97 @@
# ReviewIQ Pipeline
LLM-powered review classification and analysis pipeline using URT (Universal Review Taxonomy) v5.1.
## Features
- **Stage 1: Normalization** - Text cleaning, language detection, deduplication
- **Stage 2: LLM Classification** - Span extraction with URT codes using OpenAI/Anthropic
- **Stage 3: Issue Routing** - Route negative spans to issues for tracking
- **Stage 4: Fact Aggregation** - Pre-aggregate metrics for dashboard queries
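To make the Stage 3 idea concrete, here is a minimal sketch of routing negative spans to issues with a deterministic issue ID. It is not the package's implementation: the helper names `issue_id` and `route_negative_spans`, the `ISS-` prefix, the span fields (`span_id`, `business_id`, `urt_code`, `valence`), and the choice to key issues on business plus URT code are all assumptions made for illustration.

```python
import hashlib


def issue_id(business_id: str, urt_code: str) -> str:
    """Derive a stable ID from a (business, URT code) pair (hypothetical scheme)."""
    # Normalize both parts so casing or stray whitespace cannot create a second ID.
    key = f"{business_id.strip().lower()}|{urt_code.strip().upper()}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"ISS-{digest[:16]}"


def route_negative_spans(spans):
    """Group the IDs of negative spans under their issue ID; skip other spans."""
    issues = {}
    for span in spans:
        if span["valence"] != "negative":
            continue
        key = issue_id(span["business_id"], span["urt_code"])
        issues.setdefault(key, []).append(span["span_id"])
    return issues


# "CODE_A" and "CODE_B" are placeholders, not real URT codes.
spans = [
    {"span_id": "s1", "business_id": "biz-1", "urt_code": "CODE_A", "valence": "negative"},
    {"span_id": "s2", "business_id": "biz-1", "urt_code": "code_a", "valence": "negative"},
    {"span_id": "s3", "business_id": "biz-1", "urt_code": "CODE_B", "valence": "positive"},
]
issues = route_negative_spans(spans)
```

Because the ID is a pure function of its inputs, reprocessing the same reviews links spans to the same issue instead of creating duplicates, which is the main reason to prefer a hash over a database sequence here.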
## Installation
```bash
pip install reviewiq-pipeline
```
Or install from source:
```bash
pip install -e packages/reviewiq-pipeline
```
## Quick Start
### Python API
```python
import asyncio

from reviewiq_pipeline import Config, Pipeline


async def main(scraper_output, business_id, date, job_id):
    # Initialize
    config = Config(
        database_url="postgresql://...",
        llm_provider="openai",
        llm_api_key="sk-...",
        taxonomy_version="v5.1",
    )
    pipeline = Pipeline(config)

    # Run full pipeline
    result = await pipeline.process(scraper_output)

    # Or run individual stages
    stage1_result = await pipeline.normalize(scraper_output)
    stage2_result = await pipeline.classify(stage1_result)
    stage3_result = await pipeline.route(stage2_result)
    stage4_result = await pipeline.aggregate(business_id, date)

    # Validate
    validation = await pipeline.validate(job_id)


# The pipeline methods are coroutines, so drive them from an event loop:
# asyncio.run(main(scraper_output, business_id, date, job_id))
```
### CLI
```bash
# Run migrations
reviewiq-pipeline migrate --database-url $DATABASE_URL
# Process a job
reviewiq-pipeline run --job-id <UUID> --stages 1,2,3,4
# Validate pipeline output
reviewiq-pipeline validate --job-id <UUID>
```
## Configuration
Environment variables:
- `DATABASE_URL` - PostgreSQL connection string
- `LLM_PROVIDER` - `openai` or `anthropic`
- `OPENAI_API_KEY` - OpenAI API key (if using OpenAI)
- `ANTHROPIC_API_KEY` - Anthropic API key (if using Anthropic)
- `TAXONOMY_VERSION` - URT taxonomy version (default: `v5.1`)
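As an illustration of how these variables fit together, the sketch below reads them into a small settings object and picks the API key that matches the selected provider. The `Settings` class and `load_settings` helper are hypothetical and not part of the package; only the variable names and the `v5.1` default come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    llm_provider: str
    llm_api_key: str
    taxonomy_version: str


# Which variable holds the API key for each supported provider.
_API_KEY_VARS = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in _API_KEY_VARS:
        raise ValueError("LLM_PROVIDER must be 'openai' or 'anthropic'")
    key_var = _API_KEY_VARS[provider]
    # Only the key for the selected provider is required.
    missing = [name for name in ("DATABASE_URL", key_var) if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["DATABASE_URL"],
        llm_provider=provider,
        llm_api_key=env[key_var],
        taxonomy_version=env.get("TAXONOMY_VERSION", "v5.1"),
    )
```

The resulting fields line up with the `Config` arguments shown in the Quick Start, so a caller could pass them through directly. Failing early on a missing variable keeps misconfiguration from surfacing later as an opaque database or LLM error.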
## Development
```bash
# Install with dev dependencies
pip install -e "packages/reviewiq-pipeline[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=reviewiq_pipeline
# Type checking
mypy src/reviewiq_pipeline
# Linting
ruff check src/reviewiq_pipeline
```
## License
MIT