
Alejandro Gutiérrez 0543a08242 docs: Add Classification System & Primitives Taxonomy documentation

Comprehensive documentation covering:
- Actual production primitives (37 primitives across 5 domains)
  - O: TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY
  - P: MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION
  - J: SPEED, FRICTION, RELIABILITY, AVAILABILITY
  - E: CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX
  - V: PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY
  - meta: HONESTY, ETHICS, PROMISES, etc. + UNMAPPED, NON_INFORMATIVE
- Classification pipeline with config resolution
- Non-informative detection (skip LLM for junk content)
- Language detection and per-language UNMAPPED tracking
- Database schema for detected_spans_v2
- Evaluation tooling and quality metrics

Note: A larger taxonomy (~150 primitives) exists in gbp_primitive_prompts.py
for future expansion. The production system uses the subset above.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-01 00:35:46 +00:00

19 KiB


Classification System & Primitives Taxonomy

Version: 2.0
Status: Production
Location: packages/reviewiq-pipeline/scripts/run_classification_v2.py

Overview

The Classification System transforms raw customer reviews into structured, actionable data by:

Extracting spans - Identifying semantically meaningful segments within review text
Classifying primitives - Mapping each span to a primitive (e.g., MANNER, SPEED, VALUE_FOR_MONEY)
Scoring - Assigning valence, intensity, detail, and confidence to each span
Filtering - Detecting non-informative content (emoji-only, translation artifacts)

The output is stored in pipeline.detected_spans_v2 and powers downstream analytics, issue routing, and the Reputation Report.

Note: There is a legacy system (stage2_classify.py) that uses URT codes (J1.01, O1.01). The current production system uses primitives with descriptive names.

Quick Start

# Classify reviews for a business (dry run)
python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --dry-run

# Real LLM classification
python scripts/run_classification_v2.py --business "Go Karts Mar Menor" --limit 100 --use-llm

# Evaluate classification quality
python scripts/run_classification_v2.py --evaluate "Go Karts Mar Menor"

# Language analysis across all data
python scripts/run_classification_v2.py --language-analysis

Primitives Taxonomy

The production system uses 37 primitives: 26 across 5 domains, plus 11 meta primitives (including the special UNMAPPED and NON_INFORMATIVE). These are defined in DOMAIN_MAP in reputation_report.py.

Note: A larger taxonomy of ~150 primitives exists in gbp_primitive_prompts.py for future expansion and business-specific configuration. The production system currently uses the subset below.

Domain Structure

Domain	Code	Primitives
Output	O	TASTE, CRAFT, FRESHNESS, TEMPERATURE, EFFECTIVENESS, ACCURACY, CONDITION, CONSISTENCY
People	P	MANNER, COMPETENCE, ATTENTIVENESS, COMMUNICATION
Journey	J	SPEED, FRICTION, RELIABILITY, AVAILABILITY
Environment	E	CLEANLINESS, COMFORT, SAFETY, AMBIANCE, ACCESSIBILITY, DIGITAL_UX
Value	V	PRICE_LEVEL, PRICE_FAIRNESS, PRICE_TRANSPARENCY, VALUE_FOR_MONEY
Meta	meta	HONESTY, ETHICS, PROMISES, ACKNOWLEDGMENT, RESPONSE_QUALITY, RECOVERY, RETURN_INTENT, RECOMMEND, RECOGNITION, UNMAPPED, NON_INFORMATIVE

Special Primitives

Primitive	Purpose
`UNMAPPED`	Span could not be mapped to any enabled primitive (target rate: < 10%)
`NON_INFORMATIVE`	No actionable content (emoji-only, translation artifacts)

Full Primitive Reference

OUTPUT (O) - Product/Service Quality

Primitive	Description	Example Signals
`TASTE`	Flavor quality (food/beverage)	"delicious", "bland", "amazing flavor"
`CRAFT`	Skill of execution	"expertly made", "sloppy work", "quality craftsmanship"
`FRESHNESS`	How fresh/new the product is	"fresh ingredients", "stale", "just made"
`TEMPERATURE`	Serving temperature	"served hot", "cold food", "perfect temperature"
`EFFECTIVENESS`	Does it work/achieve purpose	"works great", "didn't work", "effective"
`ACCURACY`	Correct execution of order	"exactly as ordered", "wrong order", "got it right"
`CONDITION`	State at delivery	"arrived perfect", "damaged", "pristine condition"
`CONSISTENCY`	Same quality each time	"always consistent", "hit or miss", "reliable quality"

PEOPLE (P) - Staff Interactions

Primitive	Description	Example Signals
`MANNER`	Friendliness and warmth	"so friendly", "rude", "welcoming"
`COMPETENCE`	Knowledge and skill	"very knowledgeable", "clueless", "professional"
`ATTENTIVENESS`	Being present and responsive	"attentive staff", "ignored us", "checked on us"
`COMMUNICATION`	Clarity and updates	"kept us informed", "no updates", "explained clearly"

JOURNEY (J) - Process and Timing

Primitive	Description	Example Signals
`SPEED`	How fast things happen	"quick service", "took forever", "fast"
`FRICTION`	Ease of process	"smooth process", "complicated", "hassle-free"
`RELIABILITY`	Dependable service	"always reliable", "unreliable", "consistent"
`AVAILABILITY`	Access to service/staff	"always available", "never open", "hard to reach"

ENVIRONMENT (E) - Physical/Digital Space

Primitive	Description	Example Signals
`CLEANLINESS`	Hygiene and tidiness	"spotless", "dirty", "very clean"
`COMFORT`	Physical ease	"comfortable", "cramped", "cozy seating"
`SAFETY`	Physical safety	"felt safe", "dangerous", "secure"
`AMBIANCE`	Overall mood/atmosphere	"great vibe", "loud", "nice atmosphere"
`ACCESSIBILITY`	Ease of access (physical/digital)	"wheelchair accessible", "hard to find", "easy to navigate"
`DIGITAL_UX`	Digital experience	"easy to use app", "website broken", "smooth online booking"

VALUE (V) - Cost and Worth

Primitive	Description	Example Signals
`PRICE_LEVEL`	Absolute cost	"affordable", "expensive", "cheap"
`PRICE_FAIRNESS`	Fair for what you get	"fair price", "overpriced", "worth every penny"
`PRICE_TRANSPARENCY`	Clear about costs	"no hidden fees", "surprise charges", "upfront pricing"
`VALUE_FOR_MONEY`	Overall value assessment	"great value", "not worth it", "bang for buck"

META - Trust and Sentiment

Primitive	Description	Example Signals
`HONESTY`	Truthfulness	"honest", "lied to us", "transparent"
`ETHICS`	Moral conduct	"ethical", "scam", "trustworthy"
`PROMISES`	Keeping commitments	"kept their word", "broke promises", "reliable"
`ACKNOWLEDGMENT`	Recognizing issues	"admitted mistake", "denied problem", "apologized"
`RESPONSE_QUALITY`	How business responds	"great response", "ignored complaint", "resolved quickly"
`RECOVERY`	Making amends	"made it right", "no compensation", "fixed the issue"
`RETURN_INTENT`	Would come back	"will be back", "never again", "definitely returning"
`RECOMMEND`	Would suggest to others	"highly recommend", "don't go", "tell your friends"
`RECOGNITION`	Customer acknowledgment	"remembered us", "treated like strangers", "knew our name"

Span Classification

What is a Span?

A span is a contiguous segment of review text that expresses a single semantic unit about the customer experience.

Review: "The food was delicious but we waited 45 minutes for a table."

Span 1: "The food was delicious"
  → Primitive: TASTE (O)
  → Valence: + (positive)
  → Intensity: 2 (moderate)

Span 2: "we waited 45 minutes for a table"
  → Primitive: SPEED (J)
  → Valence: - (negative)
  → Intensity: 3 (high - specific number)

Span Fields

interface ClassificationSpan {
  // Position
  text: string;           // Extracted text from review
  start: number;          // Character offset start
  end: number;            // Character offset end

  // Classification
  primitive: string;      // e.g., "MANNER", "SPEED", "VALUE_FOR_MONEY", "UNMAPPED"
  valence: "+" | "-" | "0" | "±";
  intensity: 1 | 2 | 3;   // 1=low, 2=moderate, 3=high
  detail: 1 | 2 | 3;      // 1=vague, 2=some detail, 3=specific
  confidence: number;     // 0.0 - 1.0

  // Entity extraction (optional)
  entity?: string;        // Named entity (e.g., "John", "Room 302")
  entity_type?: "staff" | "location" | "product" | "process" | "time";

  // For UNMAPPED spans
  unmapped_keywords?: string[];  // Keywords that couldn't be mapped
}
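
For Python consumers, the span fields above can be mirrored as a dataclass with basic validation. This is a minimal sketch, not the pipeline's actual class: the name `ClassificationSpan` is borrowed from the interface, the validation rules are inferred from the field comments, and the `detail` and `confidence` values in the usage lines are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

VALENCES = {"+", "-", "0", "±"}
ENTITY_TYPES = {"staff", "location", "product", "process", "time"}


@dataclass
class ClassificationSpan:
    """Illustrative Python mirror of the span fields (assumed, not the real class)."""
    text: str
    start: int
    end: int
    primitive: str
    valence: str
    intensity: int
    detail: int
    confidence: float
    entity: Optional[str] = None
    entity_type: Optional[str] = None
    unmapped_keywords: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values outside the ranges listed in the field comments.
        if self.valence not in VALENCES:
            raise ValueError(f"invalid valence: {self.valence!r}")
        if self.intensity not in (1, 2, 3) or self.detail not in (1, 2, 3):
            raise ValueError("intensity and detail must be 1, 2, or 3")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within 0.0 and 1.0")
        if self.entity_type is not None and self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"invalid entity_type: {self.entity_type!r}")
        if not 0 <= self.start <= self.end:
            raise ValueError("offsets must satisfy 0 <= start <= end")


# Usage: build Span 2 from the example review, deriving character offsets.
review = "The food was delicious but we waited 45 minutes for a table."
text = "we waited 45 minutes for a table"
start = review.index(text)
span = ClassificationSpan(
    text=text, start=start, end=start + len(text),
    primitive="SPEED", valence="-", intensity=3,
    detail=3, confidence=0.9,  # illustrative values
)
```

Treating `end` as an exclusive offset keeps `review[span.start:span.end] == span.text`, which makes offsets easy to verify; whether the pipeline uses inclusive or exclusive ends is not stated here.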

Valence Types

Code	Meaning	Example
`+`	Positive sentiment	"excellent service"
`-`	Negative sentiment	"terrible wait"
`0`	Neutral/factual	"open until 9pm"
`±`	Mixed sentiment	"good but expensive"

Intensity Levels

Value	Level	Signals
`1`	Low	Generic mentions, implied sentiment
`2`	Moderate	Clear opinion, adjectives
`3`	High	Strong language, specifics, numbers

Detail Levels

Value	Level	Description
`1`	Vague	General statement, no specifics
`2`	Some detail	Has some context or explanation
`3`	Specific	Actionable detail, names, numbers

Confidence

A float from 0.0 to 1.0 indicating how confident the classifier is in the assigned primitive:

≥ 0.8: High confidence, clear signal
0.5 to < 0.8: Medium confidence, reasonable inference
< 0.5: Low confidence; when no primitive reaches the 0.5 threshold, the span is assigned UNMAPPED
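
The confidence bands can be expressed as a small helper. This is a sketch under assumptions: the function names `confidence_band` and `resolve_primitive` are hypothetical, and the 0.5 cutoff is taken from the classifier prompt rule "If nothing fits with confidence ≥ 0.5, use UNMAPPED".

```python
UNMAPPED_THRESHOLD = 0.5  # from the prompt rule; not read from real config


def confidence_band(confidence: float) -> str:
    """Map a confidence score to the high/medium/low bands."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0 and 1.0")
    if confidence >= 0.8:
        return "high"
    if confidence >= UNMAPPED_THRESHOLD:
        return "medium"
    return "low"


def resolve_primitive(primitive: str, confidence: float) -> str:
    """Fall back to UNMAPPED when confidence is below the threshold."""
    return primitive if confidence >= UNMAPPED_THRESHOLD else "UNMAPPED"


print(confidence_band(0.85), resolve_primitive("MANNER", 0.3))
```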

Classification Pipeline

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Classification V2                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────┐ │
│  │  Config Resolver│ ─→ │  LLM Classifier │ ─→ │  Store      │ │
│  │                 │    │                 │    │             │ │
│  │  • GBP path     │    │  • OpenAI API   │    │  • spans_v2 │ │
│  │  • Sector brief │    │  • Primitives   │    │  • run_id   │ │
│  │  • Enabled prims│    │  • Language det │    │  • audit    │ │
│  └────────┬────────┘    └────────┬────────┘    └──────┬──────┘ │
│           │                      │                     │        │
│           │         ┌────────────┴────────────┐        │        │
│           │         │   Non-Informative       │        │        │
│           │         │   Detection (skip LLM)  │        │        │
│           │         └─────────────────────────┘        │        │
│           ▼                      ▼                     ▼        │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │              pipeline.detected_spans_v2                     ││
│  │    (primitive, valence, intensity, detail, confidence)      ││
│  └─────────────────────────────────────────────────────────────┘│
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Non-Informative Detection

Before calling the LLM, reviews are checked for non-informative content to save cost:

# Conservative detection - only skip when VERY sure
def is_non_informative(text: str) -> tuple[bool, str]:
    """
    Returns (is_non_informative, reason).
    Reasons: 'empty', 'junk_pattern', 'no_content', 'pure_repetition'
    """

Detection Rules:

Empty text
Emoji-only content: ^[\U0001F300-\U0001F9FF\s\.\!\?]+$
Translation artifacts: "translated by google"
No alphanumeric content
Pure repetition: "good good good good"

Reviews that are not flagged as non-informative go to the LLM.
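
The detection rules above could be implemented roughly as follows. This is a hedged sketch, not the production function body: which reason string each rule returns, the empty reason for informative text, the three-token minimum for repetition, and the choice to flag a translation artifact only when nothing else remains are all assumptions made to keep the check conservative.

```python
import re

# Emoji-only pattern copied from the rules above (also matches . ! ? and spaces).
EMOJI_ONLY = re.compile(r"^[\U0001F300-\U0001F9FF\s\.\!\?]+$")
TRANSLATION_ARTIFACT = re.compile(r"\(?translated by google\)?", re.IGNORECASE)


def has_alnum(text: str) -> bool:
    return any(ch.isalnum() for ch in text)


def is_non_informative(text: str) -> tuple[bool, str]:
    """Return (is_non_informative, reason); reason is '' for informative text."""
    stripped = (text or "").strip()
    if not stripped:
        return True, "empty"
    if EMOJI_ONLY.match(stripped):
        return True, "junk_pattern"
    # Assumption: an artifact is junk only if no other content remains.
    remainder = TRANSLATION_ARTIFACT.sub("", stripped)
    if remainder != stripped and not has_alnum(remainder):
        return True, "junk_pattern"
    if not has_alnum(stripped):
        return True, "no_content"
    # Assumption: repetition means 3+ tokens that are all the same word.
    tokens = re.findall(r"\w+", stripped.lower())
    if len(tokens) >= 3 and len(set(tokens)) == 1:
        return True, "pure_repetition"
    return False, ""


for sample in ["", "good good good good", "The food was delicious"]:
    print(repr(sample), is_non_informative(sample))
```

Ordering the cheap checks first and returning on the first match keeps the function easy to audit when a review is skipped unexpectedly.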

Config Resolution

Each business gets a resolved config based on its GBP (Google Business Profile) category:

resolver = ConfigResolver()
config = await resolver.resolve("Go Karts Mar Menor", pool)

# Returns:
{
    "business_id": "Go Karts Mar Menor",
    "sector_code": "recreation",
    "gbp_path": "Recreation.Amusement_Parks.Go_Karts",
    "config_version": "v2.1-2026-01-15",
    "enabled_primitives": ["SPEED", "SAFETY", "VALUE_FOR_MONEY", ...],
    "weights": {"SAFETY": 1.5, "SPEED": 1.2, ...},
    "brief": {"what_customers_judge": [...]}
}
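
How `ConfigResolver` derives a config from the GBP path is not shown here. One plausible approach, sketched below purely as an assumption, is to merge settings from the root category down to the most specific path so that deeper categories override their parents. The `CONFIGS` table and `resolve_for_path` function are hypothetical and do not reflect the real resolver's storage or API.

```python
from typing import Optional

# Hypothetical per-path settings; the real source of configs is not documented here.
CONFIGS = {
    "Recreation": {
        "sector_code": "recreation",
        "enabled_primitives": ["SPEED", "SAFETY", "VALUE_FOR_MONEY"],
        "weights": {},
    },
    "Recreation.Amusement_Parks.Go_Karts": {
        "weights": {"SAFETY": 1.5, "SPEED": 1.2},
    },
}


def resolve_for_path(gbp_path: str, configs: dict) -> Optional[dict]:
    """Merge configs from the root label down to the full path (shallow merge)."""
    labels = gbp_path.split(".")
    merged: dict = {}
    found = False
    for depth in range(1, len(labels) + 1):
        prefix = ".".join(labels[:depth])
        if prefix in configs:
            # Later (more specific) entries replace earlier keys wholesale.
            merged.update(configs[prefix])
            found = True
    if not found:
        return None
    merged["gbp_path"] = gbp_path
    return merged


resolved = resolve_for_path("Recreation.Amusement_Parks.Go_Karts", CONFIGS)
print(resolved)
```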

LLM Classification Prompt

The classifier uses a structured prompt with business-specific primitives:

You are a review classifier using primitive-based analysis.

## ENABLED PRIMITIVES (use ONLY these)
- MANNER: Friendliness and warmth of staff (weight: 1.2x)
- SPEED: How fast things happen
- SAFETY: Physical safety and protection
...

## RULES
1. Extract 1-5 spans per review
2. Each span gets exactly ONE primitive
3. If nothing fits with confidence ≥ 0.5, use UNMAPPED
4. Valence: + (positive), - (negative), 0 (neutral), ± (mixed)
5. Intensity: 1 (low), 2 (moderate), 3 (high)
6. Detail: 1 (vague), 2 (some detail), 3 (specific)

## OUTPUT FORMAT (JSON)
{
  "spans": [
    {
      "text": "exact text from review",
      "primitive": "MANNER",
      "valence": "+",
      "intensity": 2,
      "detail": 2,
      "confidence": 0.85
    }
  ]
}
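
Because the model's JSON is untrusted input, it helps to validate it against the prompt rules before storing anything. The sketch below is an assumption about how that could look, not the pipeline's parser: `parse_llm_response` is a hypothetical name, spans whose text is not found verbatim in the review are dropped, and out-of-set or low-confidence primitives are rewritten to UNMAPPED.

```python
import json

VALENCES = {"+", "-", "0", "±"}
MAX_SPANS = 5          # rule 1: extract 1-5 spans per review
MIN_CONFIDENCE = 0.5   # rule 3: below this, use UNMAPPED


def parse_llm_response(raw: str, review: str, enabled: set[str]) -> list[dict]:
    """Validate LLM spans against the prompt rules and add character offsets."""
    spans = json.loads(raw).get("spans", [])
    cleaned = []
    for span in spans[:MAX_SPANS]:
        text = span.get("text", "")
        start = review.find(text) if text else -1
        if start < 0:
            continue  # text is not an exact quote from the review
        if span.get("valence") not in VALENCES:
            continue
        if span.get("intensity") not in (1, 2, 3) or span.get("detail") not in (1, 2, 3):
            continue
        confidence = float(span.get("confidence", 0.0))
        primitive = span.get("primitive", "UNMAPPED")
        if primitive not in enabled or confidence < MIN_CONFIDENCE:
            primitive = "UNMAPPED"
        cleaned.append({**span, "primitive": primitive, "confidence": confidence,
                        "start": start, "end": start + len(text)})
    return cleaned


review = "Staff were so friendly. Parking was a nightmare."
raw = json.dumps({"spans": [
    {"text": "Staff were so friendly", "primitive": "MANNER", "valence": "+",
     "intensity": 2, "detail": 2, "confidence": 0.85},
    {"text": "Parking was a nightmare", "primitive": "PARKING", "valence": "-",
     "intensity": 3, "detail": 1, "confidence": 0.7},
    {"text": "not in the review", "primitive": "SPEED", "valence": "-",
     "intensity": 1, "detail": 1, "confidence": 0.9},
]})
result = parse_llm_response(raw, review, {"MANNER", "SPEED", "SAFETY"})
```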

Language Detection

The LLM classifier auto-detects review language and returns it with confidence. This enables:

Per-language UNMAPPED rate tracking
Identification of languages needing better signal coverage
Multilingual analytics (7+ languages: Spanish, English, Dutch, German, Polish, Finnish, Danish)
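
Per-language UNMAPPED tracking amounts to a grouped ratio. The following sketch assumes spans are available as dictionaries with `primitive` and `language` keys, and that NON_INFORMATIVE spans are excluded from the denominator because the metric is described in terms of content spans; the function name is hypothetical.

```python
from collections import Counter


def unmapped_rate_by_language(spans: list[dict]) -> dict[str, float]:
    """Return the share of UNMAPPED spans per detected language."""
    totals: Counter = Counter()
    unmapped: Counter = Counter()
    for span in spans:
        if span["primitive"] == "NON_INFORMATIVE":
            continue  # assumption: only content spans count toward the rate
        language = span.get("language") or "unknown"
        totals[language] += 1
        if span["primitive"] == "UNMAPPED":
            unmapped[language] += 1
    return {language: unmapped[language] / totals[language] for language in totals}


sample_spans = [
    {"primitive": "MANNER", "language": "es"},
    {"primitive": "UNMAPPED", "language": "es"},
    {"primitive": "SPEED", "language": "en"},
    {"primitive": "NON_INFORMATIVE", "language": "en"},
]
rates = unmapped_rate_by_language(sample_spans)
print(rates)
```

A language whose rate sits well above the overall target is a candidate for better signal coverage.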

Database Schema

`pipeline.detected_spans_v2`

CREATE TABLE pipeline.detected_spans_v2 (
    id BIGSERIAL PRIMARY KEY,

    -- Context
    job_id VARCHAR(50),                   -- Scraper job ID
    business_id VARCHAR(255) NOT NULL,
    review_id VARCHAR(255) NOT NULL,
    gbp_path ltree,                       -- e.g., 'Recreation.Amusement_Parks.Go_Karts'
    sector_code VARCHAR(50),              -- e.g., 'recreation'
    config_version VARCHAR(100),          -- Config version used
    run_id UUID,                          -- Classification run ID

    -- Classification (primitives-based)
    primitive VARCHAR(50) NOT NULL,       -- e.g., "MANNER", "SPEED", "UNMAPPED"
    valence VARCHAR(5) NOT NULL,          -- +, -, 0, ±
    intensity INTEGER,                    -- 1, 2, 3
    detail INTEGER,                       -- 1, 2, 3
    mode VARCHAR(50),                     -- e.g., "dine_in", "delivery"
    confidence FLOAT NOT NULL,            -- 0.0 - 1.0

    -- Span position
    span_text TEXT NOT NULL,
    span_start INTEGER,
    span_end INTEGER,

    -- Entity extraction
    entity VARCHAR(255),
    entity_type VARCHAR(50),
    unmapped_keywords TEXT[],             -- Keywords for UNMAPPED spans

    -- Audit trail
    model VARCHAR(100),                   -- e.g., "gpt-4o-mini"
    raw_response JSONB,                   -- Full LLM response
    review_hash VARCHAR(32),              -- For deduplication
    language VARCHAR(10),                 -- Detected language

    created_at TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW()
);

-- Key indexes
CREATE INDEX idx_spans_v2_business_id ON pipeline.detected_spans_v2(business_id);
CREATE INDEX idx_spans_v2_primitive ON pipeline.detected_spans_v2(primitive);
CREATE INDEX idx_spans_v2_valence ON pipeline.detected_spans_v2(valence);
CREATE INDEX idx_spans_v2_run_id ON pipeline.detected_spans_v2(run_id);
CREATE INDEX idx_spans_v2_language ON pipeline.detected_spans_v2(language);

Key Queries

Get all spans for a business in a time window:

SELECT s.*, f.review_time_utc, f.rating
FROM pipeline.detected_spans_v2 s
JOIN pipeline.review_facts_v1 f
  ON f.review_id = s.review_id
 AND f.business_id = s.business_id    -- CRITICAL: join on both!
WHERE s.business_id = $1
  AND f.review_time_utc >= $2
  AND f.review_time_utc < $3
ORDER BY f.review_time_utc DESC;

Aggregate by primitive:

SELECT
    primitive,
    valence,
    COUNT(*) as span_count,
    AVG(confidence) as avg_confidence,
    AVG(intensity) as avg_intensity
FROM pipeline.detected_spans_v2
WHERE business_id = $1
  AND created_at >= $2
GROUP BY primitive, valence
ORDER BY span_count DESC;

Configuration

Environment Variables

Variable	Required	Description
`OPENAI_API_KEY`	Yes	For LLM classification
`DATABASE_URL`	Yes	PostgreSQL connection

CLI Options

python run_classification_v2.py [OPTIONS]

Options:
  --business TEXT            Business name or pattern (required for classification)
  --limit INT                Max reviews to process (default: 100)
  --dry-run                  Don't store results to database
  --evaluate BUSINESS        Evaluate existing classification quality
  --language-analysis        Analyze UNMAPPED rates by language across all data
  --ignore-legacy-language   Exclude legacy data (auto/unknown language) from language analysis
  --latest-hours N           Restrict language analysis to the last N hours
  --use-llm                  Use real LLM classification (default: mock)
  --model TEXT               Model for LLM (default: gpt-4o-mini)

Models

Model	Cost	Use Case
`gpt-4o-mini`	Low	Default, good balance
`gpt-4o`	High	Complex reviews, higher accuracy

Evaluation

The classifier includes built-in evaluation to measure quality:

# Evaluate classification quality for a business
python run_classification_v2.py --evaluate "Go Karts Mar Menor"

# Output includes:
# - UNMAPPED rate (target: < 10%)
# - UNMAPPED rate by language
# - Top primitives distribution
# - Contradiction detection (positive text + negative valence)
# - Confidence distribution

Quality Metrics

Metric	Target	Description
UNMAPPED rate	< 10%	Content spans that couldn't be classified
NON_INFORMATIVE rate	< 30%	Reviews with no actionable content
Avg confidence	> 0.7	Average classifier confidence
Contradictions	< 5%	Valence mismatches (e.g., "great" → negative)
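
The targets in this table can be checked mechanically once the metrics are computed. The sketch below is illustrative only: the metric key names and the `check_targets` helper are hypothetical, and the thresholds are copied from the table rather than read from the evaluation tool.

```python
import operator

# (comparison, limit) per metric, copied from the Quality Metrics table.
TARGETS = {
    "unmapped_rate": (operator.lt, 0.10),
    "non_informative_rate": (operator.lt, 0.30),
    "avg_confidence": (operator.gt, 0.7),
    "contradiction_rate": (operator.lt, 0.05),
}


def check_targets(metrics: dict[str, float]) -> dict[str, bool]:
    """Return whether each supplied metric meets its target."""
    return {
        name: compare(metrics[name], limit)
        for name, (compare, limit) in TARGETS.items()
        if name in metrics
    }


report = check_targets({
    "unmapped_rate": 0.08,
    "non_informative_rate": 0.35,
    "avg_confidence": 0.74,
    "contradiction_rate": 0.05,
})
print(report)
```

The comparisons are strict, so a value exactly at the limit (for example a contradiction rate of 0.05) counts as missing the target.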

Language Analysis

# Analyze UNMAPPED rates across all languages and sectors
python run_classification_v2.py --language-analysis

# Exclude legacy data (auto/unknown language)
python run_classification_v2.py --language-analysis --ignore-legacy-language

# Only recent data
python run_classification_v2.py --language-analysis --latest-hours 24

Changelog

v2.0 (2026-01)

New primitives-based taxonomy (MANNER, SPEED, etc.)
Config resolution from GBP category hierarchy
Sector-specific enabled primitives and weights
Language detection with per-language UNMAPPED tracking
Non-informative detection to skip LLM for junk content
run_id for tracking classification runs
Evaluation tooling built-in

v1.0 (Legacy)

URT code-based classification (J1.01, O1.01)
Stored in review_spans table
Part of original pipeline package
