docs: Add Classification System & Primitives Taxonomy documentation

Comprehensive documentation covering:
- Actual production primitives (37 primitives across 5 domains)
  - O: TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY
  - P: MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION
  - J: SPEED, FRICTION, RELIABILITY, AVAILABILITY
  - E: CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX
  - V: PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY
  - meta: HONESTY, ETHICS, PROMISES, etc. + UNMAPPED, NON_INFORMATIVE
- Classification pipeline with config resolution
- Non-informative detection (skip LLM for junk content)
- Language detection and per-language UNMAPPED tracking
- Database schema for detected_spans_v2
- Evaluation tooling and quality metrics

Note: A larger taxonomy (~150 primitives) exists in gbp_primitive_prompts.py for future expansion. The production system uses the subset above.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 23:50:01 +00:00
parent ee596c7969
commit 0543a08242
1 changed files with 502 additions and 0 deletions
--- a/packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md
+++ b/packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md
@@ -0,0 +1,502 @@
+# Classification System & Primitives Taxonomy
+
+**Version:** 2.0
+**Status:** Production
+**Location:** `packages/reviewiq-pipeline/scripts/run_classification_v2.py`
+
+## Overview
+
+The Classification System transforms raw customer reviews into structured, actionable data by:
+
+1. **Extracting spans** - Identifying semantically meaningful segments within review text
+2. **Classifying primitives** - Mapping each span to a primitive (e.g., `MANNER`, `SPEED`, `VALUE_FOR_MONEY`)
+3. **Scoring** - Assigning valence, intensity, detail, and confidence to each span
+4. **Filtering** - Detecting non-informative content (emoji-only, translation artifacts) before the LLM is called, so junk reviews skip classification
+
+The output is stored in `pipeline.detected_spans_v2` and powers downstream analytics, issue routing, and the Reputation Report.
+
+> **Note:** There is a legacy system (`stage2_classify.py`) that uses URT codes (`J1.01`, `O1.01`). The current production system uses **primitives** with descriptive names.
+
+## Quick Start
+
+```bash
+# Classify reviews for a business (dry run)
+python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --dry-run
+
+# Real LLM classification
+python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --use-llm
+
+# Evaluate classification quality
+python scripts/run_classification_v2.py --evaluate "Go Karts Mar Menor"
+
+# Language analysis across all data
+python scripts/run_classification_v2.py --language-analysis
+```
+
+---
+
+## Primitives Taxonomy
+
+The production system uses **37 primitives**: 26 across **5 domains** plus 11 meta primitives (including the special primitives `UNMAPPED` and `NON_INFORMATIVE`). These are defined in `reputation_report.py`'s `DOMAIN_MAP`.
+
+> **Note:** A larger taxonomy of ~150 primitives exists in `gbp_primitive_prompts.py` for future expansion and business-specific configuration. The production system currently uses the subset below.
+
+### Domain Structure
+
+| Domain | Code | Primitives |
+|--------|------|------------|
+| **Output** | O | TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY |
+| **People** | P | MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION |
+| **Journey** | J | SPEED, FRICTION, RELIABILITY, AVAILABILITY |
+| **Environment** | E | CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX |
+| **Value** | V | PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY |
+| **Meta** | meta | HONESTY, ETHICS, PROMISES, ACKNOWLEDGMENT, RESPONSE_QUALITY, RECOVERY, RETURN_INTENT, RECOMMEND, RECOGNITION, UNMAPPED, NON_INFORMATIVE |
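
The table above can be written down as a lookup structure. The following is a minimal sketch only: the names `PRIMITIVES_BY_DOMAIN`, `DOMAIN_OF`, and `domain_of` are hypothetical, and the actual `DOMAIN_MAP` in `reputation_report.py` may be shaped differently.

```python
# Hypothetical sketch of the taxonomy table; not the real DOMAIN_MAP.
PRIMITIVES_BY_DOMAIN: dict[str, list[str]] = {
    "O": ["TASTE", "CRAFT", "FRESHNESS", "TEMPERATURE",
          "EFFECTIVENESS", "ACCURACY", "CONDITION", "CONSISTENCY"],
    "P": ["MANNER", "COMPETENCE", "ATTENTIVENESS", "COMMUNICATION"],
    "J": ["SPEED", "FRICTION", "RELIABILITY", "AVAILABILITY"],
    "E": ["CLEANLINESS", "COMFORT", "SAFETY", "AMBIANCE",
          "ACCESSIBILITY", "DIGITAL_UX"],
    "V": ["PRICE_LEVEL", "PRICE_FAIRNESS", "PRICE_TRANSPARENCY",
          "VALUE_FOR_MONEY"],
    "meta": ["HONESTY", "ETHICS", "PROMISES", "ACKNOWLEDGMENT",
             "RESPONSE_QUALITY", "RECOVERY", "RETURN_INTENT", "RECOMMEND",
             "RECOGNITION", "UNMAPPED", "NON_INFORMATIVE"],
}

# Invert the table into primitive -> domain code for quick lookups.
DOMAIN_OF: dict[str, str] = {
    primitive: domain
    for domain, primitives in PRIMITIVES_BY_DOMAIN.items()
    for primitive in primitives
}


def domain_of(primitive: str) -> str:
    """Return the domain code of a primitive; raises KeyError if unknown."""
    return DOMAIN_OF[primitive]
```

Inverting the table once keeps lookups cheap and makes it easy to check that the primitive count matches the 37 stated above.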
+
+### Special Primitives
+
+| Primitive | Purpose |
+|-----------|---------|
+| `UNMAPPED` | Could not classify to any primitive (target: <10%) |
+| `NON_INFORMATIVE` | No actionable content (emoji-only, translation artifacts) |
+
+---
+
+## Full Primitive Reference
+
+### OUTPUT (O) - Product/Service Quality
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `TASTE` | Flavor quality (food/beverage) | "delicious", "bland", "amazing flavor" |
+| `CRAFT` | Skill of execution | "expertly made", "sloppy work", "quality craftsmanship" |
+| `FRESHNESS` | How fresh/new the product is | "fresh ingredients", "stale", "just made" |
+| `TEMPERATURE` | Serving temperature | "served hot", "cold food", "perfect temperature" |
+| `EFFECTIVENESS` | Does it work/achieve purpose | "works great", "didn't work", "effective" |
+| `ACCURACY` | Correct execution of order | "exactly as ordered", "wrong order", "got it right" |
+| `CONDITION` | State at delivery | "arrived perfect", "damaged", "pristine condition" |
+| `CONSISTENCY` | Same quality each time | "always consistent", "hit or miss", "reliable quality" |
+
+### PEOPLE (P) - Staff Interactions
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `MANNER` | Friendliness and warmth | "so friendly", "rude", "welcoming" |
+| `COMPETENCE` | Knowledge and skill | "very knowledgeable", "clueless", "professional" |
+| `ATTENTIVENESS` | Being present and responsive | "attentive staff", "ignored us", "checked on us" |
+| `COMMUNICATION` | Clarity and updates | "kept us informed", "no updates", "explained clearly" |
+
+### JOURNEY (J) - Process and Timing
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `SPEED` | How fast things happen | "quick service", "took forever", "fast" |
+| `FRICTION` | Ease of process | "smooth process", "complicated", "hassle-free" |
+| `RELIABILITY` | Dependable service | "always reliable", "unreliable", "consistent" |
+| `AVAILABILITY` | Access to service/staff | "always available", "never open", "hard to reach" |
+
+### ENVIRONMENT (E) - Physical/Digital Space
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `CLEANLINESS` | Hygiene and tidiness | "spotless", "dirty", "very clean" |
+| `COMFORT` | Physical ease | "comfortable", "cramped", "cozy seating" |
+| `SAFETY` | Physical safety | "felt safe", "dangerous", "secure" |
+| `AMBIANCE` | Overall mood/atmosphere | "great vibe", "loud", "nice atmosphere" |
+| `ACCESSIBILITY` | Ease of access (physical/digital) | "wheelchair accessible", "hard to find", "easy to navigate" |
+| `DIGITAL_UX` | Digital experience | "easy to use app", "website broken", "smooth online booking" |
+
+### VALUE (V) - Cost and Worth
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `PRICE_LEVEL` | Absolute cost | "affordable", "expensive", "cheap" |
+| `PRICE_FAIRNESS` | Fair for what you get | "fair price", "overpriced", "worth every penny" |
+| `PRICE_TRANSPARENCY` | Clear about costs | "no hidden fees", "surprise charges", "upfront pricing" |
+| `VALUE_FOR_MONEY` | Overall value assessment | "great value", "not worth it", "bang for buck" |
+
+### META - Trust and Sentiment
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `HONESTY` | Truthfulness | "honest", "lied to us", "transparent" |
+| `ETHICS` | Moral conduct | "ethical", "scam", "trustworthy" |
+| `PROMISES` | Keeping commitments | "kept their word", "broke promises", "reliable" |
+| `ACKNOWLEDGMENT` | Recognizing issues | "admitted mistake", "denied problem", "apologized" |
+| `RESPONSE_QUALITY` | How business responds | "great response", "ignored complaint", "resolved quickly" |
+| `RECOVERY` | Making amends | "made it right", "no compensation", "fixed the issue" |
+| `RETURN_INTENT` | Would come back | "will be back", "never again", "definitely returning" |
+| `RECOMMEND` | Would suggest to others | "highly recommend", "don't go", "tell your friends" |
+| `RECOGNITION` | Customer acknowledgment | "remembered us", "treated like strangers", "knew our name" |
+
+---
+
+## Span Classification
+
+### What is a Span?
+
+A **span** is a contiguous segment of review text that expresses a single semantic unit about the customer experience.
+
+```
+Review: "The food was delicious but we waited 45 minutes for a table."
+
+Span 1: "The food was delicious"
+  → Primitive: TASTE (O)
+  → Valence: + (positive)
+  → Intensity: 2 (moderate)
+
+Span 2: "we waited 45 minutes for a table"
+  → Primitive: SPEED (J)
+  → Valence: - (negative)
+  → Intensity: 3 (high - specific number)
+```
+
+### Span Fields
+
+```typescript
+interface ClassificationSpan {
+  // Position
+  text: string;           // Extracted text from review
+  start: number;          // Character offset start
+  end: number;            // Character offset end
+
+  // Classification
+  primitive: string;      // e.g., "MANNER", "SPEED", "VALUE_FOR_MONEY", "UNMAPPED"
+  valence: "+" | "-" | "0" | "±";
+  intensity: 1 | 2 | 3;   // 1=low, 2=moderate, 3=high
+  detail: 1 | 2 | 3;      // 1=vague, 2=some detail, 3=specific
+  confidence: number;     // 0.0 - 1.0
+
+  // Entity extraction (optional)
+  entity?: string;        // Named entity (e.g., "John", "Room 302")
+  entity_type?: "staff" | "location" | "product" | "process" | "time";
+
+  // For UNMAPPED spans
+  unmapped_keywords?: string[];  // Keywords that couldn't be mapped
+}
+```
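
For Python consumers, the same shape could be mirrored as a dataclass that rejects out-of-range values. This is a sketch under assumptions, not the pipeline's actual model: the validation rules are inferred from the field comments above, and the interface does not say whether `end` is inclusive or exclusive, so only `end >= start` is checked.

```python
from dataclasses import dataclass, field
from typing import Optional

VALENCES = {"+", "-", "0", "±"}
ENTITY_TYPES = {"staff", "location", "product", "process", "time"}


@dataclass
class ClassificationSpan:
    # Position
    text: str
    start: int
    end: int
    # Classification
    primitive: str
    valence: str
    intensity: int
    detail: int
    confidence: float
    # Entity extraction (optional)
    entity: Optional[str] = None
    entity_type: Optional[str] = None
    # Only meaningful for UNMAPPED spans
    unmapped_keywords: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.start < 0 or self.end < self.start:
            raise ValueError(f"bad offsets: {self.start}..{self.end}")
        if self.valence not in VALENCES:
            raise ValueError(f"bad valence: {self.valence!r}")
        if self.intensity not in (1, 2, 3):
            raise ValueError(f"bad intensity: {self.intensity}")
        if self.detail not in (1, 2, 3):
            raise ValueError(f"bad detail: {self.detail}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(f"bad confidence: {self.confidence}")
        if self.entity_type is not None and self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"bad entity_type: {self.entity_type!r}")
```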
+
+### Valence Types
+
+| Code | Meaning | Example |
+|------|---------|---------|
+| `+` | Positive sentiment | "excellent service" |
+| `-` | Negative sentiment | "terrible wait" |
+| `0` | Neutral/factual | "open until 9pm" |
+| `±` | Mixed sentiment | "good but expensive" |
+
+### Intensity Levels
+
+| Value | Level | Signals |
+|-------|-------|---------|
+| `1` | Low | Generic mentions, implied sentiment |
+| `2` | Moderate | Clear opinion, adjectives |
+| `3` | High | Strong language, specifics, numbers |
+
+### Detail Levels
+
+| Value | Level | Description |
+|-------|-------|-------------|
+| `1` | Vague | General statement, no specifics |
+| `2` | Some detail | Has some context or explanation |
+| `3` | Specific | Actionable detail, names, numbers |
+
+### Confidence
+
+A float from `0.0` to `1.0` indicating how confident the classifier is:
+
+- `≥ 0.8`: High confidence, clear signal
+- `0.5` to `< 0.8`: Medium confidence, reasonable inference
+- `< 0.5`: Low confidence; below the 0.5 threshold the classifier assigns `UNMAPPED` instead of a primitive
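
The bands can be sketched as two small helpers. The function names are hypothetical; the 0.5 cut-off is taken from rule 3 of the classification prompt, and whether production code applies the fallback in this exact way is an assumption.

```python
HIGH_CONFIDENCE = 0.8
UNMAPPED_THRESHOLD = 0.5  # from rule 3 of the classification prompt


def confidence_band(confidence: float) -> str:
    """Map a confidence score to the band names used in this document."""
    if confidence >= HIGH_CONFIDENCE:
        return "high"
    if confidence >= UNMAPPED_THRESHOLD:
        return "medium"
    return "low"


def apply_unmapped_fallback(primitive: str, confidence: float) -> str:
    """Keep the primitive only if confidence reaches the threshold."""
    return primitive if confidence >= UNMAPPED_THRESHOLD else "UNMAPPED"
```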
+
+---
+
+## Classification Pipeline
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    Classification V2                             │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐ │
+│  │  Config Resolver│ ─→ │  LLM Classifier │ ─→ │  Store      │ │
+│  │                 │    │                 │    │             │ │
+│  │  • GBP path     │    │  • OpenAI API   │    │  • spans_v2 │ │
+│  │  • Sector brief │    │  • Primitives   │    │  • run_id   │ │
+│  │  • Enabled prims│    │  • Language det │    │  • audit    │ │
+│  └────────┬────────┘    └────────┬────────┘    └──────┬──────┘ │
+│           │                      │                     │        │
+│           │         ┌────────────┴────────────┐        │        │
+│           │         │   Non-Informative       │        │        │
+│           │         │   Detection (skip LLM)  │        │        │
+│           │         └─────────────────────────┘        │        │
+│           ▼                      ▼                     ▼        │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │              pipeline.detected_spans_v2                     ││
+│  │    (primitive, valence, intensity, detail, confidence)      ││
+│  └─────────────────────────────────────────────────────────────┘│
+│                                                                  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Non-Informative Detection
+
+Before calling the LLM, reviews are checked for non-informative content to save cost:
+
+```python
+# Conservative detection - only skip when VERY sure
+def is_non_informative(text: str) -> tuple[bool, str]:
+    """
+    Returns (is_non_informative, reason).
+    Reasons: 'empty', 'junk_pattern', 'no_content', 'pure_repetition'
+    """
+```
+
+**Detection Rules:**
+- Empty text
+- Emoji-only content: `^[\U0001F300-\U0001F9FF\s\.\!\?]+$`
+- Translation artifacts: `"translated by google"`
+- No alphanumeric content
+- Pure repetition: `"good good good good"`
+
+Reviews that are not flagged as non-informative proceed to the LLM.
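
The rules listed above could be implemented roughly as follows. This is an illustrative sketch, not the production function: the order of checks, the handling of the translation marker (flagged only when nothing else remains), the three-word minimum for repetition, and the empty reason string for informative text are all assumptions.

```python
import re

# Emoji-only pattern from the detection rules above.
EMOJI_ONLY = re.compile(r"^[\U0001F300-\U0001F9FF\s\.\!\?]+$")
TRANSLATION_MARKER = "translated by google"


def is_non_informative(text: str) -> tuple[bool, str]:
    """Return (is_non_informative, reason); reason is '' for real content."""
    stripped = (text or "").strip()
    if not stripped:
        return True, "empty"
    if EMOJI_ONLY.match(stripped):
        return True, "junk_pattern"

    lowered = stripped.lower()
    if TRANSLATION_MARKER in lowered:
        # Assumption: only a junk pattern when nothing but the marker is left.
        remainder = lowered.replace(TRANSLATION_MARKER, "")
        if not any(ch.isalnum() for ch in remainder):
            return True, "junk_pattern"

    if not any(ch.isalnum() for ch in stripped):
        return True, "no_content"

    words = re.findall(r"\w+", lowered)
    # Assumption: repetition needs at least three copies of one word.
    if len(words) >= 3 and len(set(words)) == 1:
        return True, "pure_repetition"

    return False, ""
```

Each check only fires on unambiguous input, in keeping with the conservative intent stated above: a false negative costs one LLM call, while a false positive discards a real review.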
+
+### Config Resolution
+
+Each business gets a resolved config based on its GBP (Google Business Profile) category:
+
+```python
+resolver = ConfigResolver()
+config = await resolver.resolve("Go Karts Mar Menor", pool)
+
+# Returns:
+{
+    "business_id": "Go Karts Mar Menor",
+    "sector_code": "recreation",
+    "gbp_path": "Recreation.Amusement_Parks.Go_Karts",
+    "config_version": "v2.1-2026-01-15",
+    "enabled_primitives": ["SPEED", "SAFETY", "VALUE_FOR_MONEY", ...],
+    "weights": {"SAFETY": 1.5, "SPEED": 1.2, ...},
+    "brief": {"what_customers_judge": [...]}
+}
+```
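
One way such a config might feed the classifier is by rendering its `enabled_primitives` and `weights` into the prompt's primitive list. The helper below is hypothetical: `render_enabled_primitives` and the `descriptions` mapping are not part of the documented API, and omitting a weight of 1.0 is an assumption.

```python
def render_enabled_primitives(config: dict, descriptions: dict[str, str]) -> str:
    """Render the '## ENABLED PRIMITIVES' prompt section from a resolved config."""
    weights = config.get("weights", {})
    lines = ["## ENABLED PRIMITIVES (use ONLY these)"]
    for name in config["enabled_primitives"]:
        line = f"- {name}: {descriptions.get(name, name)}"
        weight = weights.get(name)
        # Only mention weights that differ from the neutral 1.0.
        if weight is not None and weight != 1.0:
            line += f" (weight: {weight:g}x)"
        lines.append(line)
    return "\n".join(lines)
```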
+
+### LLM Classification Prompt
+
+The classifier uses a structured prompt with business-specific primitives:
+
+```
+You are a review classifier using primitive-based analysis.
+
+## ENABLED PRIMITIVES (use ONLY these)
+- MANNER: Friendliness and warmth of staff (weight: 1.2x)
+- SPEED: How fast things happen
+- SAFETY: Physical safety and protection
+...
+
+## RULES
+1. Extract 1-5 spans per review
+2. Each span gets exactly ONE primitive
+3. If nothing fits with confidence ≥ 0.5, use UNMAPPED
+4. Valence: + (positive), - (negative), 0 (neutral), ± (mixed)
+5. Intensity: 1 (low), 2 (moderate), 3 (high)
+6. Detail: 1 (vague), 2 (some detail), 3 (specific)
+
+## OUTPUT FORMAT (JSON)
+{
+  "spans": [
+    {
+      "text": "exact text from review",
+      "primitive": "MANNER",
+      "valence": "+",
+      "intensity": 2,
+      "detail": 2,
+      "confidence": 0.85
+    }
+  ]
+}
+```
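
Because the prompt asks for the exact text of each span, character offsets can be recovered after the fact by searching the review. The sketch below shows one possible post-processing step. `parse_spans` is a hypothetical helper, and dropping spans whose text is not found verbatim, the exclusive `end` offset, and forcing disabled primitives to `UNMAPPED` are assumptions rather than documented behavior.

```python
import json


def parse_spans(review: str, raw_response: str, enabled: set[str]) -> list[dict]:
    """Parse the LLM JSON output and attach character offsets to each span."""
    payload = json.loads(raw_response)
    spans = []
    for item in payload.get("spans", [])[:5]:  # rule 1: at most 5 spans
        text = item["text"]
        start = review.find(text)
        if start == -1:
            continue  # not verbatim text from the review; skip it
        primitive = item["primitive"]
        # Rules 2-3: one enabled primitive per span, else UNMAPPED.
        if primitive not in enabled or float(item["confidence"]) < 0.5:
            primitive = "UNMAPPED"
        spans.append({**item, "primitive": primitive,
                      "start": start, "end": start + len(text)})
    return spans
```

`str.find` returns the first occurrence, so a phrase repeated within one review would need extra disambiguation.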
+
+### Language Detection
+
+The LLM classifier auto-detects review language and returns it with confidence. This enables:
+
+- Per-language UNMAPPED rate tracking
+- Identification of languages needing better signal coverage
+- Multilingual analytics (7+ languages: Spanish, English, Dutch, German, Polish, Finnish, Danish)
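
Per-language UNMAPPED tracking amounts to a grouped ratio. A minimal sketch, assuming each row is a `(language, primitive)` pair and that `NON_INFORMATIVE` spans are excluded from the denominator (the function name and both choices are assumptions):

```python
from collections import Counter


def unmapped_rate_by_language(rows: list[tuple[str, str]]) -> dict[str, float]:
    """Return the share of UNMAPPED spans per detected language."""
    totals: Counter = Counter()
    unmapped: Counter = Counter()
    for language, primitive in rows:
        if primitive == "NON_INFORMATIVE":
            continue  # assumption: junk content is not counted as a content span
        totals[language] += 1
        if primitive == "UNMAPPED":
            unmapped[language] += 1
    return {lang: unmapped[lang] / totals[lang] for lang in totals}
```

A language whose rate sits well above the others is a candidate for better signal coverage.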
+
+---
+
+## Database Schema
+
+### `pipeline.detected_spans_v2`
+
+```sql
+CREATE TABLE pipeline.detected_spans_v2 (
+    id BIGSERIAL PRIMARY KEY,
+
+    -- Context
+    job_id VARCHAR(50),                   -- Scraper job ID
+    business_id VARCHAR(255) NOT NULL,
+    review_id VARCHAR(255) NOT NULL,
+    gbp_path ltree,                       -- e.g., 'Recreation.Amusement_Parks.Go_Karts' (requires the ltree extension)
+    sector_code VARCHAR(50),              -- e.g., 'recreation'
+    config_version VARCHAR(100),          -- Config version used
+    run_id UUID,                          -- Classification run ID
+
+    -- Classification (primitives-based)
+    primitive VARCHAR(50) NOT NULL,       -- e.g., "MANNER", "SPEED", "UNMAPPED"
+    valence VARCHAR(5) NOT NULL,          -- +, -, 0, ±
+    intensity INTEGER,                    -- 1, 2, 3
+    detail INTEGER,                       -- 1, 2, 3
+    mode VARCHAR(50),                     -- e.g., "dine_in", "delivery"
+    confidence FLOAT NOT NULL,            -- 0.0 - 1.0
+
+    -- Span position
+    span_text TEXT NOT NULL,
+    span_start INTEGER,
+    span_end INTEGER,
+
+    -- Entity extraction
+    entity VARCHAR(255),
+    entity_type VARCHAR(50),
+    unmapped_keywords TEXT[],             -- Keywords for UNMAPPED spans
+
+    -- Audit trail
+    model VARCHAR(100),                   -- e.g., "gpt-4o-mini"
+    raw_response JSONB,                   -- Full LLM response
+    review_hash VARCHAR(32),              -- For deduplication
+    language VARCHAR(10),                 -- Detected language
+
+    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
+);
+
+-- Key indexes
+CREATE INDEX idx_spans_v2_business_id ON pipeline.detected_spans_v2(business_id);
+CREATE INDEX idx_spans_v2_primitive ON pipeline.detected_spans_v2(primitive);
+CREATE INDEX idx_spans_v2_valence ON pipeline.detected_spans_v2(valence);
+CREATE INDEX idx_spans_v2_run_id ON pipeline.detected_spans_v2(run_id);
+CREATE INDEX idx_spans_v2_language ON pipeline.detected_spans_v2(language);
+```
+
+### Key Queries
+
+**Get all spans for a business in a time window:**
+```sql
+SELECT s.*, f.review_time_utc, f.rating
+FROM pipeline.detected_spans_v2 s
+JOIN pipeline.review_facts_v1 f
+  ON f.review_id = s.review_id
+ AND f.business_id = s.business_id    -- CRITICAL: join on both!
+WHERE s.business_id = $1
+  AND f.review_time_utc >= $2
+  AND f.review_time_utc < $3
+ORDER BY f.review_time_utc DESC;
+```
+
+**Aggregate by primitive:**
+```sql
+SELECT
+    primitive,
+    valence,
+    COUNT(*) as span_count,
+    AVG(confidence) as avg_confidence,
+    AVG(intensity) as avg_intensity
+FROM pipeline.detected_spans_v2
+WHERE business_id = $1
+  AND created_at >= $2
+GROUP BY primitive, valence
+ORDER BY span_count DESC;
+```
+
+---
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `OPENAI_API_KEY` | Yes | For LLM classification |
+| `DATABASE_URL` | Yes | PostgreSQL connection |
+
+### CLI Options
+
+```bash
+python scripts/run_classification_v2.py [OPTIONS]
+
+Options:
+  --business TEXT       Business name or pattern (required for classify/evaluate)
+  --limit INT           Max reviews to process (default: 100)
+  --dry-run             Don't store results to database
+  --evaluate BUSINESS   Evaluate existing classification quality
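+  --language-analysis   Analyze UNMAPPED rates by language across all data
+  --ignore-legacy-language  With --language-analysis: exclude legacy data (auto/unknown language)
+  --latest-hours INT    With --language-analysis: only include recent data from the last N hours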
+  --use-llm             Use real LLM classification (default: mock)
+  --model TEXT          Model for LLM (default: gpt-4o-mini)
+```
+
+### Models
+
+| Model | Cost | Use Case |
+|-------|------|----------|
+| `gpt-4o-mini` | Low | Default, good balance |
+| `gpt-4o` | High | Complex reviews, higher accuracy |
+
+---
+
+## Evaluation
+
+The classifier includes built-in evaluation to measure quality:
+
+```bash
+# Evaluate classification quality for a business
+python scripts/run_classification_v2.py --evaluate "Go Karts Mar Menor"
+
+# Output includes:
+# - UNMAPPED rate (target: < 10%)
+# - UNMAPPED rate by language
+# - Top primitives distribution
+# - Contradiction detection (positive text + negative valence)
+# - Confidence distribution
+```
+
+### Quality Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| UNMAPPED rate | < 10% | Share of content spans that couldn't be classified |
+| NON_INFORMATIVE rate | < 30% | Share of reviews with no actionable content |
+| Avg confidence | > 0.7 | Average classifier confidence |
+| Contradictions | < 5% | Valence mismatches (e.g., "great" → negative) |
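
The targets in this table can be checked mechanically once the metrics are computed. The sketch below is illustrative only: the metric keys and the `check_targets` helper are hypothetical and are not the evaluator's actual output format.

```python
# Targets from the Quality Metrics table: (comparison, threshold).
TARGETS = {
    "unmapped_rate": ("<", 0.10),
    "non_informative_rate": ("<", 0.30),
    "avg_confidence": (">", 0.70),
    "contradiction_rate": ("<", 0.05),
}


def check_targets(metrics: dict[str, float]) -> dict[str, bool]:
    """Return, per metric, whether the measured value meets its target."""
    results = {}
    for name, (op, target) in TARGETS.items():
        value = metrics[name]
        results[name] = value < target if op == "<" else value > target
    return results
```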
+
+### Language Analysis
+
+```bash
+# Analyze UNMAPPED rates across all languages and sectors
+python scripts/run_classification_v2.py --language-analysis
+
+# Exclude legacy data (auto/unknown language)
+python scripts/run_classification_v2.py --language-analysis --ignore-legacy-language
+
+# Only recent data
+python scripts/run_classification_v2.py --language-analysis --latest-hours 24
+```
+
+---
+
+## Changelog
+
+### v2.0 (2026-01)
+- New primitives-based taxonomy (MANNER, SPEED, etc.)
+- Config resolution from GBP category hierarchy
+- Sector-specific enabled primitives and weights
+- Language detection with per-language UNMAPPED tracking
+- Non-informative detection to skip LLM for junk content
+- run_id for tracking classification runs
+- Evaluation tooling built-in
+
+### v1.0 (Legacy)
+- URT code-based classification (J1.01, O1.01)
+- Stored in `review_spans` table
+- Part of original pipeline package