From 0543a0824248aff446dbedef6a12831288b0302e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Alejandro=20Guti=C3=A9rrez?= <35082514+alezmad@users.noreply.github.com>
Date: Sat, 31 Jan 2026 23:50:01 +0000
Subject: [PATCH] docs: Add Classification System & Primitives Taxonomy documentation

Comprehensive documentation covering:
- Actual production primitives (37 primitives across 5 domains)
  - O: TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY
  - P: MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION
  - J: SPEED, FRICTION, RELIABILITY, AVAILABILITY
  - E: CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX
  - V: PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY
  - meta: HONESTY, ETHICS, PROMISES, etc. + UNMAPPED, NON_INFORMATIVE
- Classification pipeline with config resolution
- Non-informative detection (skip LLM for junk content)
- Language detection and per-language UNMAPPED tracking
- Database schema for detected_spans_v2
- Evaluation tooling and quality metrics

Note: A larger taxonomy (~150 primitives) exists in gbp_primitive_prompts.py
for future expansion. The production system uses the subset above.

Co-Authored-By: Claude Opus 4.5
---
 .../docs/CLASSIFICATION_SYSTEM.md | 502 ++++++++++++++++++
 1 file changed, 502 insertions(+)
 create mode 100644 packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md

diff --git a/packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md b/packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md
new file mode 100644
index 0000000..990be21
--- /dev/null
+++ b/packages/reviewiq-pipeline/docs/CLASSIFICATION_SYSTEM.md
@@ -0,0 +1,502 @@
+# Classification System & Primitives Taxonomy
+
+**Version:** 2.0
+**Status:** Production
+**Location:** `packages/reviewiq-pipeline/scripts/run_classification_v2.py`
+
+## Overview
+
+The Classification System transforms raw customer reviews into structured, actionable data by:
+
+1. **Extracting spans** - Identifying semantically meaningful segments within review text
+2. **Classifying primitives** - Mapping each span to a primitive (e.g., `MANNER`, `SPEED`, `VALUE_FOR_MONEY`)
+3. **Scoring** - Assigning valence, intensity, detail, and confidence to each span
+4. **Filtering** - Detecting non-informative content (emoji-only, translation artifacts)
+
+The output is stored in `pipeline.detected_spans_v2` and powers downstream analytics, issue routing, and the Reputation Report.
+
+> **Note:** There is a legacy system (`stage2_classify.py`) that uses URT codes (`J1.01`, `O1.01`). The current production system uses **primitives** with descriptive names.
+
+## Quick Start
+
+```bash
+# Classify reviews for a business (dry run)
+python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --dry-run
+
+# Real LLM classification
+python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --use-llm
+
+# Evaluate classification quality
+python scripts/run_classification_v2.py --evaluate "Go Karts Mar Menor"
+
+# Language analysis across all data
+python scripts/run_classification_v2.py --language-analysis
+```
+
+---
+
+## Primitives Taxonomy
+
+The production system uses **37 primitives**: 26 across **5 domains** plus 11 meta primitives (including the special `UNMAPPED` and `NON_INFORMATIVE`). These are defined in `reputation_report.py`'s `DOMAIN_MAP`.
+
+> **Note:** A larger taxonomy of ~150 primitives exists in `gbp_primitive_prompts.py` for future expansion and business-specific configuration. The production system currently uses the subset below.
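
A minimal sketch of how a primitive-to-domain map could be laid out, using only the primitive names listed in this document. The names `DOMAIN_PRIMITIVES`, `PRIMITIVE_TO_DOMAIN`, and `domain_of` are hypothetical; the actual `DOMAIN_MAP` in `reputation_report.py` may be shaped differently.

```python
# Hypothetical layout; the real DOMAIN_MAP in reputation_report.py may differ.
# Primitive names and domain codes are taken from this document.
DOMAIN_PRIMITIVES: dict[str, list[str]] = {
    "O": ["TASTE", "CRAFT", "FRESHNESS", "TEMPERATURE",
          "EFFECTIVENESS", "ACCURACY", "CONDITION", "CONSISTENCY"],
    "P": ["MANNER", "COMPETENCE", "ATTENTIVENESS", "COMMUNICATION"],
    "J": ["SPEED", "FRICTION", "RELIABILITY", "AVAILABILITY"],
    "E": ["CLEANLINESS", "COMFORT", "SAFETY", "AMBIANCE",
          "ACCESSIBILITY", "DIGITAL_UX"],
    "V": ["PRICE_LEVEL", "PRICE_FAIRNESS", "PRICE_TRANSPARENCY",
          "VALUE_FOR_MONEY"],
    "meta": ["HONESTY", "ETHICS", "PROMISES", "ACKNOWLEDGMENT",
             "RESPONSE_QUALITY", "RECOVERY", "RETURN_INTENT", "RECOMMEND",
             "RECOGNITION", "UNMAPPED", "NON_INFORMATIVE"],
}

# Invert once so each span's primitive resolves to a domain code directly.
PRIMITIVE_TO_DOMAIN: dict[str, str] = {
    primitive: domain
    for domain, primitives in DOMAIN_PRIMITIVES.items()
    for primitive in primitives
}


def domain_of(primitive: str) -> str:
    """Return the domain code for a primitive; raises KeyError if unknown."""
    return PRIMITIVE_TO_DOMAIN[primitive]


if __name__ == "__main__":
    core = sum(len(v) for k, v in DOMAIN_PRIMITIVES.items() if k != "meta")
    # 26 domain primitives + 11 meta primitives = 37 in total
    print(core, len(DOMAIN_PRIMITIVES["meta"]), len(PRIMITIVE_TO_DOMAIN))
```

Inverting the map up front keeps the per-span lookup to a single dictionary access, and a `KeyError` surfaces any primitive name that is not part of the production subset.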
+
+### Domain Structure
+
+| Domain | Code | Primitives |
+|--------|------|------------|
+| **Output** | O | TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY |
+| **People** | P | MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION |
+| **Journey** | J | SPEED, FRICTION, RELIABILITY, AVAILABILITY |
+| **Environment** | E | CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX |
+| **Value** | V | PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY |
+| **Meta** | meta | HONESTY, ETHICS, PROMISES, ACKNOWLEDGMENT, RESPONSE_QUALITY, RECOVERY, RETURN_INTENT, RECOMMEND, RECOGNITION, UNMAPPED, NON_INFORMATIVE |
+
+### Special Primitives
+
+| Primitive | Purpose |
+|-----------|---------|
+| `UNMAPPED` | Could not classify to any primitive (target: <10%) |
+| `NON_INFORMATIVE` | No actionable content (emoji-only, translation artifacts) |
+
+---
+
+## Full Primitive Reference
+
+### OUTPUT (O) - Product/Service Quality
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `TASTE` | Flavor quality (food/beverage) | "delicious", "bland", "amazing flavor" |
+| `CRAFT` | Skill of execution | "expertly made", "sloppy work", "quality craftsmanship" |
+| `FRESHNESS` | How fresh/new the product is | "fresh ingredients", "stale", "just made" |
+| `TEMPERATURE` | Serving temperature | "served hot", "cold food", "perfect temperature" |
+| `EFFECTIVENESS` | Does it work/achieve purpose | "works great", "didn't work", "effective" |
+| `ACCURACY` | Correct execution of order | "exactly as ordered", "wrong order", "got it right" |
+| `CONDITION` | State at delivery | "arrived perfect", "damaged", "pristine condition" |
+| `CONSISTENCY` | Same quality each time | "always consistent", "hit or miss", "reliable quality" |
+
+### PEOPLE (P) - Staff Interactions
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `MANNER` | Friendliness and warmth | "so friendly", "rude", "welcoming" |
+| `COMPETENCE` | Knowledge and skill | "very knowledgeable", "clueless", "professional" |
+| `ATTENTIVENESS` | Being present and responsive | "attentive staff", "ignored us", "checked on us" |
+| `COMMUNICATION` | Clarity and updates | "kept us informed", "no updates", "explained clearly" |
+
+### JOURNEY (J) - Process and Timing
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `SPEED` | How fast things happen | "quick service", "took forever", "fast" |
+| `FRICTION` | Ease of process | "smooth process", "complicated", "hassle-free" |
+| `RELIABILITY` | Dependable service | "always reliable", "unreliable", "consistent" |
+| `AVAILABILITY` | Access to service/staff | "always available", "never open", "hard to reach" |
+
+### ENVIRONMENT (E) - Physical/Digital Space
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `CLEANLINESS` | Hygiene and tidiness | "spotless", "dirty", "very clean" |
+| `COMFORT` | Physical ease | "comfortable", "cramped", "cozy seating" |
+| `SAFETY` | Physical safety | "felt safe", "dangerous", "secure" |
+| `AMBIANCE` | Overall mood/atmosphere | "great vibe", "loud", "nice atmosphere" |
+| `ACCESSIBILITY` | Ease of access (physical/digital) | "wheelchair accessible", "hard to find", "easy to navigate" |
+| `DIGITAL_UX` | Digital experience | "easy to use app", "website broken", "smooth online booking" |
+
+### VALUE (V) - Cost and Worth
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `PRICE_LEVEL` | Absolute cost | "affordable", "expensive", "cheap" |
+| `PRICE_FAIRNESS` | Fair for what you get | "fair price", "overpriced", "worth every penny" |
+| `PRICE_TRANSPARENCY` | Clear about costs | "no hidden fees", "surprise charges", "upfront pricing" |
+| `VALUE_FOR_MONEY` | Overall value assessment | "great value", "not worth it", "bang for buck" |
+
+### META - Trust and Sentiment
+
+| Primitive | Description | Example Signals |
+|-----------|-------------|-----------------|
+| `HONESTY` | Truthfulness | "honest", "lied to us", "transparent" |
+| `ETHICS` | Moral conduct | "ethical", "scam", "trustworthy" |
+| `PROMISES` | Keeping commitments | "kept their word", "broke promises", "reliable" |
+| `ACKNOWLEDGMENT` | Recognizing issues | "admitted mistake", "denied problem", "apologized" |
+| `RESPONSE_QUALITY` | How business responds | "great response", "ignored complaint", "resolved quickly" |
+| `RECOVERY` | Making amends | "made it right", "no compensation", "fixed the issue" |
+| `RETURN_INTENT` | Would come back | "will be back", "never again", "definitely returning" |
+| `RECOMMEND` | Would suggest to others | "highly recommend", "don't go", "tell your friends" |
+| `RECOGNITION` | Customer acknowledgment | "remembered us", "treated like strangers", "knew our name" |
+
+---
+
+## Span Classification
+
+### What is a Span?
+
+A **span** is a contiguous segment of review text that expresses a single semantic unit about the customer experience.
+
+```
+Review: "The food was delicious but we waited 45 minutes for a table."
+
+Span 1: "The food was delicious"
+  → Primitive: TASTE (O)
+  → Valence: + (positive)
+  → Intensity: 2 (moderate)
+
+Span 2: "we waited 45 minutes for a table"
+  → Primitive: SPEED (J)
+  → Valence: - (negative)
+  → Intensity: 3 (high - specific number)
+```
+
+### Span Fields
+
+```typescript
+interface ClassificationSpan {
+  // Position
+  text: string;              // Extracted text from review
+  start: number;             // Character offset start
+  end: number;               // Character offset end
+
+  // Classification
+  primitive: string;         // e.g., "MANNER", "SPEED", "VALUE_FOR_MONEY", "UNMAPPED"
+  valence: "+" | "-" | "0" | "±";
+  intensity: 1 | 2 | 3;      // 1=low, 2=moderate, 3=high
+  detail: 1 | 2 | 3;         // 1=vague, 2=some detail, 3=specific
+  confidence: number;        // 0.0 - 1.0
+
+  // Entity extraction (optional)
+  entity?: string;           // Named entity (e.g., "John", "Room 302")
+  entity_type?: "staff" | "location" | "product" | "process" | "time";
+
+  // For UNMAPPED spans
+  unmapped_keywords?: string[];  // Keywords that couldn't be mapped
+}
+```
+
+### Valence Types
+
+| Code | Meaning | Example |
+|------|---------|---------|
+| `+` | Positive sentiment | "excellent service" |
+| `-` | Negative sentiment | "terrible wait" |
+| `0` | Neutral/factual | "open until 9pm" |
+| `±` | Mixed sentiment | "good but expensive" |
+
+### Intensity Levels
+
+| Value | Level | Signals |
+|-------|-------|---------|
+| `1` | Low | Generic mentions, implied sentiment |
+| `2` | Moderate | Clear opinion, adjectives |
+| `3` | High | Strong language, specifics, numbers |
+
+### Detail Levels
+
+| Value | Level | Description |
+|-------|-------|-------------|
+| `1` | Vague | General statement, no specifics |
+| `2` | Some detail | Has some context or explanation |
+| `3` | Specific | Actionable detail, names, numbers |
+
+### Confidence
+
+A float from `0.0` to `1.0` indicating how confident the classifier is:
+
+- `≥ 0.8`: High confidence, clear signal
+- `0.5 - 0.8`: Medium confidence, reasonable inference
+- `< 0.5`: Low confidence; below this threshold the span is classified as `UNMAPPED`
+
+---
+
+## Classification Pipeline
+
+### Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                        Classification V2                        │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                 │
+│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐  │
+│  │  Config Resolver│ ─→ │  LLM Classifier │ ─→ │   Store     │  │
+│  │                 │    │                 │    │             │  │
+│  │  • GBP path     │    │  • OpenAI API   │    │  • spans_v2 │  │
+│  │  • Sector brief │    │  • Primitives   │    │  • run_id   │  │
+│  │  • Enabled prims│    │  • Language det │    │  • audit    │  │
+│  └────────┬────────┘    └────────┬────────┘    └──────┬──────┘  │
+│           │                      │                    │         │
+│           │         ┌────────────┴────────────┐       │         │
+│           │         │   Non-Informative       │       │         │
+│           │         │   Detection (skip LLM)  │       │         │
+│           │         └─────────────────────────┘       │         │
+│           ▼                      ▼                    ▼         │
+│  ┌─────────────────────────────────────────────────────────────┐│
+│  │                 pipeline.detected_spans_v2                  ││
+│  │     (primitive, valence, intensity, detail, confidence)     ││
+│  └─────────────────────────────────────────────────────────────┘│
+│                                                                 │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Non-Informative Detection
+
+Before calling the LLM, reviews are checked for non-informative content to save cost:
+
+```python
+# Conservative detection - only skip when VERY sure
+def is_non_informative(text: str) -> tuple[bool, str]:
+    """
+    Returns (is_non_informative, reason).
+    Reasons: 'empty', 'junk_pattern', 'no_content', 'pure_repetition'
+    """
+```
+
+**Detection Rules:**
+- Empty text
+- Emoji-only content: `^[\U0001F300-\U0001F9FF\s\.\!\?]+$`
+- Translation artifacts: `"translated by google"`
+- No alphanumeric content
+- Pure repetition: `"good good good good"`
+
+Reviews that are not flagged as non-informative go to the LLM.
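
The rules above could be sketched roughly as follows. This is an illustrative reimplementation, not the production function: the rule order, the handling of the translation marker (flagged only when nothing else remains), the three-word minimum for repetition, and the empty reason string for informative text are all assumptions.

```python
import re

# Emoji-only pattern quoted from the detection rules above.
EMOJI_ONLY = re.compile(r"^[\U0001F300-\U0001F9FF\s\.\!\?]+$")
# Assumption: the artifact is matched case-insensitively anywhere in the text.
TRANSLATION_MARKER = re.compile(r"translated by google", re.IGNORECASE)


def is_non_informative(text: str) -> tuple[bool, str]:
    """Return (is_non_informative, reason); reason is '' for informative text."""
    stripped = text.strip()
    if not stripped:
        return True, "empty"

    if EMOJI_ONLY.match(stripped):
        return True, "junk_pattern"

    # Conservative: only flag a translation artifact when removing the marker
    # leaves no letters or digits behind (assumed behaviour).
    without_marker = TRANSLATION_MARKER.sub("", stripped)
    if without_marker != stripped and not any(c.isalnum() for c in without_marker):
        return True, "junk_pattern"

    if not any(c.isalnum() for c in stripped):
        return True, "no_content"

    # Assumed threshold: the same word repeated at least three times.
    words = stripped.lower().split()
    if len(words) >= 3 and len(set(words)) == 1:
        return True, "pure_repetition"

    return False, ""


if __name__ == "__main__":
    samples = ["", "👍👍!!", "(Translated by Google)", "---",
               "good good good good", "Great karts, friendly staff"]
    for sample in samples:
        print(repr(sample), is_non_informative(sample))
```

Checking the cheapest conditions first and returning a reason string makes it possible to count how many reviews each rule skips, which helps when tuning the rules against the NON_INFORMATIVE rate.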
+
+### Config Resolution
+
+Each business gets a resolved config based on its GBP (Google Business Profile) category:
+
+```python
+resolver = ConfigResolver()
+config = await resolver.resolve("Go Karts Mar Menor", pool)
+
+# Returns:
+{
+    "business_id": "Go Karts Mar Menor",
+    "sector_code": "recreation",
+    "gbp_path": "Recreation.Amusement_Parks.Go_Karts",
+    "config_version": "v2.1-2026-01-15",
+    "enabled_primitives": ["SPEED", "SAFETY", "VALUE_FOR_MONEY", ...],
+    "weights": {"SAFETY": 1.5, "SPEED": 1.2, ...},
+    "brief": {"what_customers_judge": [...]}
+}
+```
+
+### LLM Classification Prompt
+
+The classifier uses a structured prompt with business-specific primitives:
+
+```
+You are a review classifier using primitive-based analysis.
+
+## ENABLED PRIMITIVES (use ONLY these)
+- MANNER: Friendliness and warmth of staff (weight: 1.2x)
+- SPEED: How fast things happen
+- SAFETY: Physical safety and protection
+...
+
+## RULES
+1. Extract 1-5 spans per review
+2. Each span gets exactly ONE primitive
+3. If nothing fits with confidence ≥ 0.5, use UNMAPPED
+4. Valence: + (positive), - (negative), 0 (neutral), ± (mixed)
+5. Intensity: 1 (low), 2 (moderate), 3 (high)
+6. Detail: 1 (vague), 2 (some detail), 3 (specific)
+
+## OUTPUT FORMAT (JSON)
+{
+  "spans": [
+    {
+      "text": "exact text from review",
+      "primitive": "MANNER",
+      "valence": "+",
+      "intensity": 2,
+      "detail": 2,
+      "confidence": 0.85
+    }
+  ]
+}
+```
+
+### Language Detection
+
+The LLM classifier auto-detects review language and returns it with confidence. This enables:
+
+- Per-language UNMAPPED rate tracking
+- Identification of languages needing better signal coverage
+- Multilingual analytics (7+ languages: Spanish, English, Dutch, German, Polish, Finnish, Danish)
+
+---
+
+## Database Schema
+
+### `pipeline.detected_spans_v2`
+
+```sql
+CREATE TABLE pipeline.detected_spans_v2 (
+    id BIGSERIAL PRIMARY KEY,
+
+    -- Context
+    job_id VARCHAR(50),                 -- Scraper job ID
+    business_id VARCHAR(255) NOT NULL,
+    review_id VARCHAR(255) NOT NULL,
+    gbp_path ltree,                     -- e.g., 'Recreation.Go_Karts'
+    sector_code VARCHAR(50),            -- e.g., 'recreation'
+    config_version VARCHAR(100),        -- Config version used
+    run_id UUID,                        -- Classification run ID
+
+    -- Classification (primitives-based)
+    primitive VARCHAR(50) NOT NULL,     -- e.g., "MANNER", "SPEED", "UNMAPPED"
+    valence VARCHAR(5) NOT NULL,        -- +, -, 0, ±
+    intensity INTEGER,                  -- 1, 2, 3
+    detail INTEGER,                     -- 1, 2, 3
+    mode VARCHAR(50),                   -- e.g., "dine_in", "delivery"
+    confidence FLOAT NOT NULL,          -- 0.0 - 1.0
+
+    -- Span position
+    span_text TEXT NOT NULL,
+    span_start INTEGER,
+    span_end INTEGER,
+
+    -- Entity extraction
+    entity VARCHAR(255),
+    entity_type VARCHAR(50),
+    unmapped_keywords TEXT[],           -- Keywords for UNMAPPED spans
+
+    -- Audit trail
+    model VARCHAR(100),                 -- e.g., "gpt-4o-mini"
+    raw_response JSONB,                 -- Full LLM response
+    review_hash VARCHAR(32),            -- For deduplication
+    language VARCHAR(10),               -- Detected language
+
+    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
+);
+
+-- Key indexes (schema-qualified so they resolve without relying on search_path)
+CREATE INDEX idx_spans_v2_business_id ON pipeline.detected_spans_v2(business_id);
+CREATE INDEX idx_spans_v2_primitive ON pipeline.detected_spans_v2(primitive);
+CREATE INDEX idx_spans_v2_valence ON pipeline.detected_spans_v2(valence);
+CREATE INDEX idx_spans_v2_run_id ON pipeline.detected_spans_v2(run_id);
+CREATE INDEX idx_spans_v2_language ON pipeline.detected_spans_v2(language);
+```
+
+### Key Queries
+
+**Get all spans for a business in a time window:**
+```sql
+SELECT s.*, f.review_time_utc, f.rating
+FROM pipeline.detected_spans_v2 s
+JOIN pipeline.review_facts_v1 f
+  ON f.review_id = s.review_id
+  AND f.business_id = s.business_id  -- CRITICAL: join on both!
+WHERE s.business_id = $1
+  AND f.review_time_utc >= $2
+  AND f.review_time_utc < $3
+ORDER BY f.review_time_utc DESC;
+```
+
+**Aggregate by primitive:**
+```sql
+SELECT
+  primitive,
+  valence,
+  COUNT(*) as span_count,
+  AVG(confidence) as avg_confidence,
+  AVG(intensity) as avg_intensity
+FROM pipeline.detected_spans_v2
+WHERE business_id = $1
+  AND created_at >= $2
+GROUP BY primitive, valence
+ORDER BY span_count DESC;
+```
+
+---
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `OPENAI_API_KEY` | Yes | For LLM classification |
+| `DATABASE_URL` | Yes | PostgreSQL connection |
+
+### CLI Options
+
+```bash
+python run_classification_v2.py [OPTIONS]
+
+Options:
+  --business TEXT       Business name or pattern (required for classify/evaluate)
+  --limit INT           Max reviews to process (default: 100)
+  --dry-run             Don't store results to database
+  --evaluate BUSINESS   Evaluate existing classification quality
+  --language-analysis   Analyze UNMAPPED rates by language across all data
+  --use-llm             Use real LLM classification (default: mock)
+  --model TEXT          Model for LLM (default: gpt-4o-mini)
+```
+
+### Models
+
+| Model | Cost | Use Case |
+|-------|------|----------|
+| `gpt-4o-mini` | Low | Default, good balance |
+| `gpt-4o` | High | Complex reviews, higher accuracy |
+
+---
+
+## Evaluation
+
+The classifier includes built-in evaluation to measure quality:
+
+```bash
+# Evaluate classification quality for a business
+python run_classification_v2.py --evaluate "Go Karts Mar Menor"
+
+# Output includes:
+# - UNMAPPED rate (target: < 10%)
+# - UNMAPPED rate by language
+# - Top primitives distribution
+# - Contradiction detection (positive text + negative valence)
+# - Confidence distribution
+```
+
+### Quality Metrics
+
+| Metric | Target | Description |
+|--------|--------|-------------|
+| UNMAPPED rate | < 10% | Content spans that couldn't be classified |
+| NON_INFORMATIVE rate | < 30% | Reviews with no actionable content |
+| Avg confidence | > 0.7 | Average classifier confidence |
+| Contradictions | < 5% | Valence mismatches (e.g., "great" → negative) |
+
+### Language Analysis
+
+```bash
+# Analyze UNMAPPED rates across all languages and sectors
+python run_classification_v2.py --language-analysis
+
+# Exclude legacy data (auto/unknown language)
+python run_classification_v2.py --language-analysis --ignore-legacy-language
+
+# Only recent data
+python run_classification_v2.py --language-analysis --latest-hours 24
+```
+
+---
+
+## Changelog
+
+### v2.0 (2026-01)
+- New primitives-based taxonomy (MANNER, SPEED, etc.)
+- Config resolution from GBP category hierarchy
+- Sector-specific enabled primitives and weights
+- Language detection with per-language UNMAPPED tracking
+- Non-informative detection to skip LLM for junk content
+- run_id for tracking classification runs
+- Evaluation tooling built-in
+
+### v1.0 (Legacy)
+- URT code-based classification (J1.01, O1.01)
+- Stored in `review_spans` table
+- Part of original pipeline package