# ReviewIQ Pipeline v1 — Final Architecture

**Design principle**: Minimum state, defensible stats, multilingual, robust to messy mobile text, 1 LLM call per report, <$0.30/report.

**Core decision**: Do not persist topics. Persist only enriched spans. Build topics at report time via clustering and match across periods for trends.

---

## A. Architecture Overview

```
INGEST (continuous, stateless, ~$0.00)
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Raw Review   │────▶│ Span         │────▶│ Embed +      │────▶│ Store        │
│ (text,rating,│     │ Splitter     │     │ Sentiment    │     │ Enriched     │
│ date, lang)  │     │              │     │ + NER        │     │ Spans        │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

No topic assignment at ingest. Just store enriched spans.

REPORT (per request, ~$0.20)
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Fetch        │────▶│ Cluster      │────▶│ Stats +      │────▶│ LLM          │
│ Spans        │     │ (HDBSCAN)    │     │ Labels +     │     │ Narrate      │
│              │     │              │     │ Quotes       │     │ (1 call)     │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘

Topics are ephemeral. They exist only for this report.
Trends are computed by matching clusters across periods via centroid similarity.
```

### Cost Model

| Stage | When | Cost | Notes |
|-------|------|------|-------|
| Span splitting | Per review ingested | $0.00 | Regex only |
| Embedding | Per span ingested | $0.00 | Local model, batched |
| Sentiment | Per span ingested | $0.00 | Embedding math (EN/ES/DE multi-anchor) |
| NER (staff) | Per span ingested | $0.00 | spaCy, guarded |
| Clustering | Per report | $0.00 | HDBSCAN up to 4k spans, PCA+KMeans fallback above that |
| Stats + labels | Per report | $0.00 | Python/SQL |
| LLM narration | Per report | ~$0.15-0.25 | Single API call |

**Total: ~$0.20/report** (dominated by LLM)

---

## B. Data Model (Only What Persists)

### 1. Raw Reviews

```sql
CREATE TABLE reviews (
    review_id    TEXT PRIMARY KEY,
    business_id  TEXT NOT NULL,
    text         TEXT NOT NULL,
    rating       INT NOT NULL,
    date         TIMESTAMP,
    source       TEXT DEFAULT 'google',
    ingested_at  TIMESTAMP DEFAULT NOW()
);
```

### 2. Enriched Spans (The Only ML Artifact)

```sql
CREATE TABLE spans (
    span_id          TEXT PRIMARY KEY,
    review_id        TEXT REFERENCES reviews(review_id),
    business_id      TEXT NOT NULL,
    span_index       INT NOT NULL,
    text             TEXT NOT NULL,
    embedding        VECTOR(384),
    sentiment        TEXT,       -- 'positive', 'negative', 'neutral'
    sentiment_score  FLOAT,
    staff_mentions   TEXT[],     -- guarded extraction
    date             TIMESTAMP,
    created_at       TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_spans_business_date ON spans(business_id, date);

-- Embedding index: prefer HNSW if available (pgvector 0.5+), otherwise ivfflat
-- HNSW: no training required, better query performance
CREATE INDEX idx_spans_embedding ON spans USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Alternative for older pgvector (requires ANALYZE after bulk inserts):
-- CREATE INDEX idx_spans_embedding ON spans USING ivfflat (embedding vector_cosine_ops)
--     WITH (lists = 100);
-- ANALYZE spans;  -- Required after bulk insert for ivfflat to work correctly
```

### 3. Review-Topic Presence (Computed at Report Time, Not Stored)

Topics are ephemeral. Presence is computed per report, not persisted.

---

## C. Ingest Pipeline

### Step 1: Span Splitting

Split on punctuation. Fallback split on contrast markers. Merge tiny fragments.

```python
import re

CONTRAST_RE = re.compile(
    r'\b(?:but|pero|aber|aunque|however|though|although|yet|still|sin embargo)\b',
    re.IGNORECASE
)

def split_spans(text: str) -> list[str]:
    # Split on punctuation (good enough for most text, with contrast fallback)
    parts = re.split(r'[.!?;:,]\s*|\s{2,}', text)
    parts = [p.strip() for p in parts if len(p.strip()) >= 12]

    # Fallback split on contrast markers
    refined = []
    for p in parts:
        if CONTRAST_RE.search(p):
            sub = [s.strip() for s in CONTRAST_RE.split(p)]
            # Merge tiny fragments back
            merged = []
            for s in sub:
                if not s:
                    continue
                if len(s) < 12 and merged:
                    merged[-1] = merged[-1] + ' ' + s
                else:
                    merged.append(s)
            refined.extend([m for m in merged if len(m) >= 12])
        else:
            refined.append(p)

    return refined
```
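
To make the length filter concrete, the sketch below reproduces only the first stage of `split_spans` (the punctuation split plus the 12-character minimum) on a made-up review. The sample text and the `MIN_SPAN_CHARS` name are illustrative assumptions, not part of the pipeline.

```python
import re

MIN_SPAN_CHARS = 12  # hypothetical name for the literal 12 used in split_spans

def punctuation_split(text: str) -> list[str]:
    """First stage only: split on punctuation, drop fragments under the minimum."""
    parts = re.split(r'[.!?;:,]\s*|\s{2,}', text)
    return [p.strip() for p in parts if len(p.strip()) >= MIN_SPAN_CHARS]

sample = "Great food. Loved it! Service was painfully slow"
kept = punctuation_split(sample)
print(kept)
```

Only the last fragment survives here: "Great food" and "Loved it" fall under the minimum and are dropped, so very terse praise never reaches clustering. That is a property of the threshold worth keeping in mind when reading positive rates.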

**Note**: Do NOT split on "and/y/und" by default. These often connect positive qualities ("friendly and fast").

### Step 2: Embedding

Use a multilingual model. No translation needed.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-small')

def embed_spans(spans: list[str]) -> np.ndarray:
    # Unit-length output, so a dot product between embeddings is cosine similarity
    return model.encode(spans, normalize_embeddings=True)
```

### Step 3: Sentiment (Anchor-Based)

Score sentiment via embedding similarity to polar anchors. Works across all languages.

**Note**: Encode multiple short anchors separately, normalize, then average. This gives better multilingual alignment than a single "bag sentence".

```python
# Multiple short anchors for better multilingual alignment
# Include ES/DE anchors for improved cross-language recall
POSITIVE_WORDS = [
    # English
    "excellent", "wonderful", "amazing", "great", "fantastic",
    "delicious", "friendly", "helpful", "perfect", "outstanding",
    # Spanish
    "excelente", "increíble", "delicioso", "amable", "rápido",
    # German
    "toll", "lecker", "freundlich", "schnell", "perfekt",
]
NEGATIVE_WORDS = [
    # English
    "terrible", "awful", "horrible", "bad", "disgusting",
    "rude", "slow", "dirty", "broken", "disappointing",
    # Spanish
    "horrible", "sucio", "lento", "grosero", "caro",
    # German
    "schlecht", "langsam", "unhöflich", "dreckig", "teuer",
]

def _compute_anchor(words: list[str]) -> np.ndarray:
    """Encode multiple anchors, normalize each, then average.

    Deduplicates words to avoid implicit weighting.
    """
    unique_words = list(dict.fromkeys(words))  # Preserve order, remove dupes
    embeddings = model.encode(unique_words, normalize_embeddings=True)
    avg = embeddings.mean(axis=0)
    return avg / np.linalg.norm(avg)  # Re-normalize the average

POSITIVE_ANCHOR = _compute_anchor(POSITIVE_WORDS)
NEGATIVE_ANCHOR = _compute_anchor(NEGATIVE_WORDS)

def score_sentiment(embedding: np.ndarray) -> tuple[str, float]:
    """Return (label, confidence); confidence is the magnitude of the polarity score."""
    pos_sim = embedding @ POSITIVE_ANCHOR
    neg_sim = embedding @ NEGATIVE_ANCHOR

    # Relative difference between the two anchor similarities
    score = (pos_sim - neg_sim) / (pos_sim + neg_sim + 1e-6)

    if score > 0.15:
        return ('positive', float(score))
    elif score < -0.15:
        return ('negative', float(abs(score)))
    else:
        return ('neutral', 0.0)
```
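
As a worked illustration of the score formula above, the sketch below uses made-up 2-D unit vectors in place of real embeddings. The anchor vectors, the `toy_score` and `unit` helpers, and the sample inputs are assumptions chosen for readability. Real multilingual-e5 similarities will look different, so the ±0.15 thresholds should be checked against actual data.

```python
import numpy as np

# Hypothetical 2-D anchors (real anchors are 384-dimensional averages)
POS = np.array([1.0, 0.0])
NEG = np.array([0.0, 1.0])

def toy_score(embedding: np.ndarray) -> float:
    """Same formula as score_sentiment: relative difference of anchor similarities."""
    pos_sim = float(embedding @ POS)
    neg_sim = float(embedding @ NEG)
    return (pos_sim - neg_sim) / (pos_sim + neg_sim + 1e-6)

def unit(values: list[float]) -> np.ndarray:
    """Scale a vector to unit length, as the embedding step does."""
    arr = np.array(values, dtype=float)
    return arr / np.linalg.norm(arr)

leaning_positive = toy_score(unit([2.0, 1.0]))  # closer to POS
balanced = toy_score(unit([1.0, 1.0]))          # equally close to both anchors
leaning_negative = toy_score(unit([1.0, 2.0]))  # closer to NEG

print(round(leaning_positive, 3), round(balanced, 3), round(leaning_negative, 3))
```

With these toy vectors the leaning cases land near ±0.33, outside the ±0.15 neutral band, while the balanced vector scores 0 and would be labeled neutral.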

### Step 4: Staff Extraction (Guarded)

Use spaCy NER, but count a name as staff only when at least one guard passes:

```python
import spacy

nlp = spacy.load('xx_ent_wiki_sm')  # multilingual

# xx_ent_wiki_sm tags people as 'PER'; 'PERSON' is accepted too in case a
# language-specific model with that label scheme is swapped in
PERSON_LABELS = {'PER', 'PERSON'}

ROLE_WORDS = {'server', 'waiter', 'waitress', 'manager', 'chef', 'doctor',
              'nurse', 'receptionist', 'mesero', 'gerente', 'kellner'}

def extract_staff(text: str, business_history: dict | None = None) -> list[str]:
    doc = nlp(text)
    staff = []

    for ent in doc.ents:
        if ent.label_ not in PERSON_LABELS:
            continue

        name = ent.text.strip()
        normalized = normalize_name(name)  # Normalize early for consistent lookup
        context = text[max(0, ent.start_char-30):ent.end_char+30].lower()

        # Guard 1: Near role word
        if any(role in context for role in ROLE_WORDS):
            staff.append(normalized)
            continue

        # Guard 2: Appears in thanks pattern
        if any(p in context for p in ['thank', 'gracias', 'danke', 'shout out', 'kudos']):
            staff.append(normalized)
            continue

        # Guard 3: Frequent across reviews (if history available)
        # Use normalized name for lookup (history keys are also normalized)
        if business_history and business_history.get(normalized, 0) >= 3:
            staff.append(normalized)

    return sorted(set(staff))  # Deduplicate with a stable order

def normalize_name(name: str) -> str:
    return ' '.join(name.strip().title().split())
```

### Full Ingest Function

```python
def ingest_review(review: dict) -> list[dict]:
    spans = split_spans(review['text'])
    if not spans:
        return []

    embeddings = embed_spans(spans)

    enriched = []
    for i, (text, emb) in enumerate(zip(spans, embeddings)):
        sentiment, confidence = score_sentiment(emb)
        staff = extract_staff(text)

        enriched.append({
            'span_id': f"{review['review_id']}_{i}",
            'review_id': review['review_id'],
            'business_id': review['business_id'],
            'span_index': i,
            'text': text,
            'embedding': emb,
            'sentiment': sentiment,
            'sentiment_score': confidence,
            'staff_mentions': staff if staff else None,
            'date': review['date'],
        })

    return enriched
```

---

## D. Report Generation

### Step 1: Fetch Spans

```python
from datetime import date

# `db` is assumed to be an application-level database handle whose query()
# returns rows as dicts; it is not defined in this document.
def fetch_spans(business_id: str, start: date, end: date) -> list[dict]:
    return db.query("""
        SELECT span_id, review_id, text, embedding, sentiment,
               sentiment_score, staff_mentions, date
        FROM spans
        WHERE business_id = %s AND date >= %s AND date < %s
    """, [business_id, start, end])
```

### Step 2: Cluster Spans (Ephemeral Topics)

Cluster ALL spans together (not pos/neg separately). Compute sentiment breakdown within each cluster.

**Scalability note**: The full distance matrix is O(n²) in memory and time. For large span counts, we fall back to PCA + MiniBatchKMeans.

```python
import hdbscan
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

MAX_SPANS_FOR_HDBSCAN = 4000  # Beyond this, O(n²) distance matrix is too expensive

def cluster_spans(spans: list[dict]) -> tuple[list[dict], list[dict]]:
    """Returns (topics, noise_spans)

    Uses HDBSCAN for small datasets, falls back to PCA+KMeans for large ones.
    """

    if len(spans) > MAX_SPANS_FOR_HDBSCAN:
        return _cluster_spans_fallback(spans)

    # float64: hdbscan's precomputed metric expects a double-precision matrix
    embeddings = np.array([s['embedding'] for s in spans], dtype=np.float64)

    # L2-normalize and compute cosine distance matrix
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist_matrix = np.clip(1 - (normed @ normed.T), 0, None)  # Clip float error below 0
    np.fill_diagonal(dist_matrix, 0)

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=10,  # In spans; the publish gate (k >= 8) is in reviews
        min_samples=5,
        metric='precomputed'
    )
    labels = clusterer.fit_predict(dist_matrix)

    # Group spans by cluster
    topics = {}
    noise_spans = []

    for span, label in zip(spans, labels):
        if label == -1:
            # Keep high-confidence noise for quotes
            if abs(span['sentiment_score']) > 0.5:
                noise_spans.append(span)
            continue

        if label not in topics:
            topics[label] = {'spans': [], 'embeddings': []}
        topics[label]['spans'].append(span)
        topics[label]['embeddings'].append(span['embedding'])

    # Compute centroids
    result = []
    for label, data in topics.items():
        embs = np.array(data['embeddings'])
        centroid = embs.mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)

        result.append({
            'cluster_id': int(label),
            'spans': data['spans'],
            'embeddings': embs,
            'centroid': centroid,
        })

    return result, noise_spans


def _cluster_spans_fallback(spans: list[dict]) -> tuple[list[dict], list[dict]]:
    """Fallback clustering for large datasets using PCA + MiniBatchKMeans.

    Trades cluster quality for O(n) scalability.
    Generates pseudo-noise from spans far from their cluster centroid.

    Requires: Each span must have 'embedding' and 'sentiment_score' populated.
    """

    embeddings = np.array([s['embedding'] for s in spans])

    # Reduce dimensionality
    pca = PCA(n_components=50)
    reduced = pca.fit_transform(embeddings)

    # Estimate k (heuristic: sqrt(n/10), clamped)
    k = max(5, min(50, int(np.sqrt(len(spans) / 10))))

    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=256, n_init=3)
    labels = kmeans.fit_predict(reduced)

    # Group spans by cluster
    topics = {}
    for span, emb, label in zip(spans, embeddings, labels):
        if label not in topics:
            topics[label] = {'spans': [], 'embeddings': []}
        topics[label]['spans'].append(span)
        topics[label]['embeddings'].append(emb)

    # Compute centroids and identify pseudo-noise (bottom 3% by similarity)
    result = []
    all_distances = []  # (distance, span) tuples for pseudo-noise selection

    for label, data in topics.items():
        embs = np.array(data['embeddings'])
        centroid = embs.mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)

        # Compute similarities to centroid
        normed_embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sims = normed_embs @ centroid

        # Track distances for pseudo-noise
        for span, sim in zip(data['spans'], sims):
            all_distances.append((1 - sim, span))

        result.append({
            'cluster_id': int(label),
            'spans': data['spans'],
            'embeddings': embs,
            'centroid': centroid,
        })

    # Pseudo-noise: bottom 3% by similarity (farthest from their own centroid)
    # Only include high-confidence sentiment spans (same as HDBSCAN noise handling)
    all_distances.sort(key=lambda x: x[0], reverse=True)
    noise_cutoff = int(len(all_distances) * 0.03)
    pseudo_noise = [
        span for _, span in all_distances[:noise_cutoff]
        if abs(span['sentiment_score']) > 0.5
    ]

    return result, pseudo_noise
```
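
To give a feel for the numbers behind the switch, the sketch below evaluates the fallback's `k` heuristic at a few span counts and estimates the size of a dense float64 distance matrix. The helper names `fallback_k` and `dense_matrix_mb` are hypothetical; the formulas are taken from the code above.

```python
import math

def fallback_k(n_spans: int) -> int:
    """Cluster count used by the KMeans fallback: sqrt(n/10), clamped to [5, 50]."""
    return max(5, min(50, int(math.sqrt(n_spans / 10))))

def dense_matrix_mb(n_spans: int, bytes_per_value: int = 8) -> float:
    """Memory for an n x n distance matrix (float64 by default), in megabytes."""
    return n_spans * n_spans * bytes_per_value / 1e6

for n in (4_001, 10_000, 250_000):
    print(n, fallback_k(n))

print(dense_matrix_mb(4_000))
```

At the 4,000-span cutoff a single float64 matrix is about 128 MB, before counting the intermediate arrays created while building it. Because the fallback only runs above 4,000 spans, the clamp effectively keeps `k` between 20 and 50.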

### Step 3: Compute Review-Level Stats

Stats are review-level presence (not span counts). This is critical for defensible claims.

```python
def compute_topic_stats(topic: dict, all_review_ids: set) -> dict:
    """Compute review-level presence stats."""

    spans = topic['spans']
    n = len(all_review_ids)

    # Review-level presence
    reviews_any = set(s['review_id'] for s in spans)
    reviews_neg = set(s['review_id'] for s in spans if s['sentiment'] == 'negative')
    reviews_pos = set(s['review_id'] for s in spans if s['sentiment'] == 'positive')

    k_neg = len(reviews_neg)
    k_pos = len(reviews_pos)

    return {
        'k_any': len(reviews_any),
        'k_neg': k_neg,
        'k_pos': k_pos,
        'n': n,
        'rate_neg': k_neg / n if n > 0 else 0,
        'rate_pos': k_pos / n if n > 0 else 0,
        'ci_neg': wilson_interval(k_neg, n),
        'ci_pos': wilson_interval(k_pos, n),
    }

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    if n == 0:
        return (0.0, 1.0)

    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2*n)) / denom
    margin = (z / denom) * np.sqrt(p*(1-p)/n + z**2/(4*n**2))

    return (max(0, center - margin), min(1, center + margin))
```
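
The difference between span counts and review-level presence can be illustrated with a toy topic. In the sketch below, the span dicts and review IDs are made up: one verbose reviewer contributes three negative spans, which would inflate a span-based count but registers once at review level.

```python
# Hypothetical topic: review "r1" complains three times, "r2" once
topic_spans = [
    {'review_id': 'r1', 'sentiment': 'negative'},
    {'review_id': 'r1', 'sentiment': 'negative'},
    {'review_id': 'r1', 'sentiment': 'negative'},
    {'review_id': 'r2', 'sentiment': 'negative'},
    {'review_id': 'r3', 'sentiment': 'positive'},
]
all_review_ids = {'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10'}

negative_spans = [s for s in topic_spans if s['sentiment'] == 'negative']
negative_reviews = {s['review_id'] for s in negative_spans}

span_count = len(negative_spans)  # counts r1 three times
review_rate = len(negative_reviews) / len(all_review_ids)

print(span_count, len(negative_reviews), review_rate)
```

Four negative spans come from only two reviews, so the defensible claim is "2 of 10 reviews", which corresponds to what `compute_topic_stats` reports as `rate_neg`.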

### Step 4: Label Topics (Representative Spans, No Stopwords)

Topic identity = centroid (for matching). Display label = cleaned representative span (for UI).

```python
import re
import unicodedata

EMAIL_RE = re.compile(r'\b\S+@\S+\.\S+\b')
URL_RE = re.compile(r'\b(?:https?://|www\.)\S+\b', re.I)
PHONE_RE = re.compile(r'\b(?:\+?\d[\d .()-]{7,}\d)\b')
LONGDIG_RE = re.compile(r'\b\d{8,}\b')

def beautify_label(text: str) -> str:
    """Clean PII and noise from label text."""
    text = ' '.join(text.split())
    text = EMAIL_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = PHONE_RE.sub('', text)
    text = LONGDIG_RE.sub('', text)
    text = re.sub(r'([!?.]){2,}', r'\1', text)
    return text.strip()

def norm_for_dedup(text: str) -> str:
    """Normalize for near-duplicate detection. Unicode-safe for multilingual."""
    # Casefold (stronger than lower() for Unicode)
    t = text.casefold()

    # Normalize Unicode (NFC form)
    t = unicodedata.normalize('NFC', t)

    # Replace digits with placeholder
    t = re.sub(r'\d+', '#', t)

    # Remove punctuation but keep letters from any alphabet (\w includes Unicode letters)
    t = re.sub(r'[^\w\s#]+', ' ', t, flags=re.UNICODE)

    # Collapse whitespace
    t = ' '.join(t.split())

    return t

def select_label(topic: dict, used_labels: set) -> str:
    """Select clean, unique display label from representative spans."""

    spans = topic['spans']
    embeddings = np.array(topic['embeddings'])
    centroid = topic['centroid']

    # Rank by similarity to centroid
    sims = embeddings @ centroid
    ranked = np.argsort(sims)[::-1]

    for idx in ranked[:15]:
        cleaned = beautify_label(spans[idx]['text'])

        if not (15 <= len(cleaned) <= 80):
            continue

        key = norm_for_dedup(cleaned)
        if key in used_labels:
            continue

        used_labels.add(key)
        return cleaned

    # Fallback: truncate best match
    best = beautify_label(spans[ranked[0]]['text'])
    return best[:60].rstrip() + ("..." if len(best) > 60 else "")
```
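
The choice of `casefold()` over `lower()` in `norm_for_dedup` matters for German text. The sketch below is a minimal illustration with made-up label strings; it exercises only the standard library, not the pipeline itself.

```python
a = "Parken an der Straße"
b = "PARKEN AN DER STRASSE"

# lower() leaves "ß" untouched, so the two labels still differ
same_with_lower = a.lower() == b.lower()

# casefold() maps "ß" to "ss", so the two labels collapse to one key
same_with_casefold = a.casefold() == b.casefold()

print(same_with_lower, same_with_casefold)
```

With `lower()` these two labels would be treated as distinct; with `casefold()` they dedupe to the same key.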

### Step 5: Trend Matching (Centroid-Based)

Match current topics to prior topics by centroid similarity. Never use label text for matching.

**v1 decision**: Compute separate trends for negative and positive rates. This ensures strengths get correct trend values (not reusing negative-only logic).

```python
def match_trends(current_topics: list, prior_topics: list,
                 threshold: float = 0.70, margin: float = 0.05,
                 min_k: int = 8, min_n: int = 20):
    """Match topics across periods for trend computation.

    Computes both trend_neg and trend_pos separately.
    """

    for curr in current_topics:
        stats = curr['stats']
        curr['trend_neg'] = None
        curr['trend_pos'] = None
        curr['trend_match_sim'] = None

        if not prior_topics:
            continue

        # Find best and second-best match by centroid similarity
        sims = [(p, float(curr['centroid'] @ p['centroid'])) for p in prior_topics]
        sims.sort(key=lambda x: x[1], reverse=True)

        best, best_sim = sims[0]
        second_sim = sims[1][1] if len(sims) > 1 else 0

        # Gate: match must be confident AND clearly better than alternatives
        if best_sim < threshold or (best_sim - second_sim) < margin:
            continue

        curr['trend_match_sim'] = best_sim

        # Compute trend for negatives (if both periods have enough data)
        if (stats['k_neg'] >= min_k and stats['n'] >= min_n and
                best['stats']['k_neg'] >= min_k and best['stats']['n'] >= min_n):
            curr['trend_neg'] = stats['rate_neg'] - best['stats']['rate_neg']

        # Compute trend for positives (if both periods have enough data)
        if (stats['k_pos'] >= min_k and stats['n'] >= min_n and
                best['stats']['k_pos'] >= min_k and best['stats']['n'] >= min_n):
            curr['trend_pos'] = stats['rate_pos'] - best['stats']['rate_pos']
```
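
The two-part gate (absolute similarity plus a margin over the runner-up) is easiest to see on toy data. The sketch below uses made-up 2-D unit centroids and a hypothetical `gate_match` helper that mirrors the condition in `match_trends`; it is an illustration, not a replacement for that function.

```python
import numpy as np

def gate_match(curr: np.ndarray, priors: list[np.ndarray],
               threshold: float = 0.70, margin: float = 0.05):
    """Return the index of the accepted prior centroid, or None if the gate rejects."""
    if not priors:
        return None
    sims = sorted(((float(curr @ p), i) for i, p in enumerate(priors)), reverse=True)
    best_sim, best_idx = sims[0]
    second_sim = sims[1][0] if len(sims) > 1 else 0.0
    if best_sim < threshold or (best_sim - second_sim) < margin:
        return None
    return best_idx

def unit(angle_deg: float) -> np.ndarray:
    """Unit vector at the given angle; cosine similarity is cos(angle difference)."""
    rad = np.deg2rad(angle_deg)
    return np.array([np.cos(rad), np.sin(rad)])

current = unit(0)

clear = gate_match(current, [unit(10), unit(80)])      # one close prior, one far
ambiguous = gate_match(current, [unit(10), unit(12)])  # two near-identical priors
weak = gate_match(current, [unit(60), unit(90)])       # best match below threshold

print(clear, ambiguous, weak)
```

Only the first case produces a match. The second is rejected by the margin even though both priors are very similar to the current centroid, and the third is rejected by the absolute threshold. In both rejected cases the trend is omitted rather than guessed.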

### Step 6: Quote Selection

Pick representative + sharp quotes. Include high-confidence noise spans.

- **Representative**: closest span to centroid (within topic, matching sentiment)
- **Sharp**: highest |sentiment_score| among topic spans + high-confidence noise

```python
def pick_quotes(topic: dict, noise_spans: list, sentiment_filter: str,
                k: int = 2) -> list[dict]:
    """Select diverse, high-quality quotes: 1 representative + 1 sharp."""

    topic_spans = [s for s in topic['spans'] if s['sentiment'] == sentiment_filter]
    centroid = topic['centroid']

    quotes = []
    seen_reviews = set()

    # 1. Representative: closest to centroid
    if topic_spans:
        embeddings = np.array([s['embedding'] for s in topic_spans])
        sims = embeddings @ centroid
        ranked_idx = np.argsort(sims)[::-1]

        for idx in ranked_idx:
            span = topic_spans[idx]
            if span['review_id'] in seen_reviews:
                continue
            if len(span['text']) > 200:
                continue

            quotes.append({
                'text': span['text'],
                'sentiment': span['sentiment'],
                'date': span['date'],
                'type': 'representative',
            })
            seen_reviews.add(span['review_id'])
            break

    # 2. Sharp: highest confidence from topic + noise
    sharp_candidates = topic_spans + [s for s in noise_spans
                                      if s['sentiment'] == sentiment_filter
                                      and abs(s['sentiment_score']) > 0.5]
    sharp_candidates.sort(key=lambda s: abs(s['sentiment_score']), reverse=True)

    for span in sharp_candidates:
        if span['review_id'] in seen_reviews:
            continue
        if len(span['text']) > 200:
            continue

        quotes.append({
            'text': span['text'],
            'sentiment': span['sentiment'],
            'date': span['date'],
            'type': 'sharp',
        })
        seen_reviews.add(span['review_id'])

        if len(quotes) >= k:
            break

    return quotes
```

### Step 7: Staff Aggregation

```python
def aggregate_staff(spans: list[dict], all_review_ids: set) -> dict:
    """Aggregate staff mentions with review-level presence."""

    staff_data = {}

    for span in spans:
        if not span['staff_mentions']:
            continue

        for name in span['staff_mentions']:
            if name not in staff_data:
                staff_data[name] = {'pos_reviews': set(), 'neg_reviews': set(), 'quotes': []}

            if span['sentiment'] == 'positive':
                staff_data[name]['pos_reviews'].add(span['review_id'])
                staff_data[name]['quotes'].append(span['text'])
            elif span['sentiment'] == 'negative':
                staff_data[name]['neg_reviews'].add(span['review_id'])
                staff_data[name]['quotes'].append(span['text'])

    # Build heroes and concerns
    heroes, concerns = [], []

    for name, data in staff_data.items():
        pos = len(data['pos_reviews'])
        neg = len(data['neg_reviews'])
        total = pos + neg

        if total < 3:  # Minimum mentions
            continue

        entry = {
            'name': name,
            'positive': pos,
            'negative': neg,
            'total': total,
            'quote': data['quotes'][0] if data['quotes'] else None,
        }

        if pos > neg and pos >= 3:
            heroes.append(entry)
        elif neg > pos and neg >= 3:
            concerns.append(entry)

    heroes.sort(key=lambda x: x['positive'], reverse=True)
    concerns.sort(key=lambda x: x['negative'], reverse=True)

    return {'heroes': heroes[:3], 'concerns': concerns[:3]}
```

### Step 8: Build LLM Payload

```python
def build_payload(business_id: str, current_period: tuple,
                  topics: list, noise_spans: list, staff: dict,
                  review_count: int) -> dict:
    """Build structured payload for LLM narration.

    Args:
        noise_spans: High-confidence spans not assigned to any cluster.
            Used for quote selection.
    """

    issues = []
    strengths = []

    for topic in topics:
        stats = topic['stats']

        # Issue: significant negative presence
        if stats['k_neg'] >= 8 and stats['n'] >= 20:
            ci = stats['ci_neg']
            if ci[1] - ci[0] <= 0.30:  # CI not too wide
                trend = topic.get('trend_neg')
                issues.append({
                    'label': topic['label'],
                    'rate': round(stats['rate_neg'], 3),
                    'ci': [round(ci[0], 3), round(ci[1], 3)],
                    'n': stats['k_neg'],  # Number of reviews with a negative mention
                    # `is not None` so a flat trend of 0.0 is reported, not dropped
                    'trend': round(trend, 3) if trend is not None else None,
                    'quotes': pick_quotes(topic, noise_spans, 'negative', k=2),
                })

        # Strength: significant positive presence
        if stats['k_pos'] >= 8 and stats['n'] >= 20:
            ci = stats['ci_pos']
            if ci[1] - ci[0] <= 0.30:
                trend = topic.get('trend_pos')
                strengths.append({
                    'label': topic['label'],
                    'rate': round(stats['rate_pos'], 3),
                    'ci': [round(ci[0], 3), round(ci[1], 3)],
                    'n': stats['k_pos'],  # Number of reviews with a positive mention
                    'trend': round(trend, 3) if trend is not None else None,
                    'quotes': pick_quotes(topic, noise_spans, 'positive', k=2),
                })

    # Sort by rate
    issues.sort(key=lambda x: x['rate'], reverse=True)
    strengths.sort(key=lambda x: x['rate'], reverse=True)

    return {
        'business_id': business_id,
        'period': f"{current_period[0]} to {current_period[1]}",
        'total_reviews': review_count,
        'issues': issues[:5],
        'strengths': strengths[:5],
        'staff': staff,
    }
```

### Step 9: LLM Narration (Single Call)

```python
import json

# `llm_client` is assumed to be an application-level wrapper around the chat API;
# it is not defined in this document.

SYSTEM_PROMPT = """You are a business consultant analyzing customer review data.
Write a clear, actionable report for a small business owner.

RULES:
1. Use ONLY the statistics provided. Never invent numbers.
2. Include confidence intervals when stating percentages.
3. Be direct and actionable. The owner is busy.
4. Prioritize issues by frequency and trend direction.
5. Each recommendation must reference a specific issue from the data."""

def generate_report(payload: dict) -> str:
    # default=str serializes the datetime values carried in each quote's 'date' field
    user_prompt = f"""Based on this review analysis, write a consultant report.

DATA:
{json.dumps(payload, indent=2, default=str)}

SECTIONS:
1. Executive Summary (3 sentences max)
2. Top Strengths (what's working, with stats)
3. Critical Issues (what needs attention, with stats and trends)
4. Staff Performance (heroes and concerns if present)
5. Recommended Actions (3-5 specific steps, prioritized)

Keep total length under 600 words."""

    response = llm_client.chat(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        max_tokens=1500
    )
    return response.content
```

### Full Report Generation Function

```python
def generate_full_report(business_id: str,
                         current_start: date, current_end: date,
                         prior_start: date, prior_end: date) -> str:
    """Generate complete report for a business."""

    # Fetch spans
    current_spans = fetch_spans(business_id, current_start, current_end)
    prior_spans = fetch_spans(business_id, prior_start, prior_end)

    if not current_spans:
        return "Insufficient data for report."

    # Get unique review IDs
    current_reviews = set(s['review_id'] for s in current_spans)
    prior_reviews = set(s['review_id'] for s in prior_spans)

    # Cluster current period
    current_topics, noise_spans = cluster_spans(current_spans)

    # Compute stats for current topics
    for topic in current_topics:
        topic['stats'] = compute_topic_stats(topic, current_reviews)

    # Label topics (with deduplication)
    used_labels = set()
    for topic in current_topics:
        topic['label'] = select_label(topic, used_labels)

    # Cluster and compute stats for prior period
    prior_topics = []
    if prior_spans:
        prior_topics, _ = cluster_spans(prior_spans)
        for topic in prior_topics:
            topic['stats'] = compute_topic_stats(topic, prior_reviews)

    # Match trends
    match_trends(current_topics, prior_topics)

    # Aggregate staff
    staff = aggregate_staff(current_spans, current_reviews)

    # Build payload (include noise_spans for quote selection)
    payload = build_payload(
        business_id,
        (current_start, current_end),
        current_topics,
        noise_spans,  # Pass noise spans for quote selection
        staff,
        len(current_reviews)
    )

    # Generate report
    return generate_report(payload)
```

---

## E. Summary of Design Decisions

### What We Do

| Decision | Rationale |
|----------|-----------|
| Ephemeral topics (no persistent catalog) | Eliminates drift, merge logic, thresholds |
| Cluster all spans together | One topic can have pos/neg breakdown; avoids duplicates |
| Fallback clustering for large datasets | PCA + KMeans when >4000 spans (O(n) vs O(n²)) |
| Review-level presence for stats | Defensible claims ("X% of customers") |
| Wilson intervals + publish gates | Statistical rigor |
| Centroid-based trend matching | Stable identity regardless of label changes |
| Separate trend_neg/trend_pos | Correct trends for both issues and strengths |
| Representative + sharp quotes | Best of both: centroid-closest + highest confidence |
| Representative span labels | Human-readable, no stopwords/NLP needed |
| Unicode-safe label dedup | Works for Spanish, German, etc. |
| Multi-anchor sentiment | Better multilingual alignment than bag sentence |
| Guarded staff extraction | Reduces false positives |
| Single LLM call | Cost control |

### What We Don't Do

| Avoided | Why |
|---------|-----|
| Persistent topic catalog | Adds state, drift, merge complexity |
| Topic assignment at ingest | Unnecessary; cluster at report time |
| Span-count stats | Inflates rates; review-level is correct |
| TF-IDF with stopwords | Brittle; representative spans are better |
| Split on "and/y/und" | Over-splits positive phrases |
| POS tagging for labels | Heavy dependency; regex cleanup is sufficient |
| Translation | Multilingual embeddings + multi-language anchors handle it |
| Sentiment classifier | Multi-anchor approach works across languages |

### Statistical Gates

| Gate | Threshold | Purpose |
|------|-----------|---------|
| Minimum k | 8 | Topic must have enough mentions |
| Minimum n | 20 | Period must have enough reviews |
| CI width | ≤ 0.30 | Reject imprecise estimates |
| Trend match sim | ≥ 0.70 | Confident topic match |
| Trend margin | ≥ 0.05 | Clear winner vs alternatives |
| Both periods min | k≥8, n≥20 | Trend requires data on both sides |
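
A quick calculation shows how these gates interact. The sketch below re-implements the Wilson interval with the standard library (the `wilson_width` helper is hypothetical) and checks the CI-width gate at a few (k, n) pairs; the pairs themselves are made-up inputs.

```python
import math

def wilson_width(k: int, n: int, z: float = 1.96) -> float:
    """Width of the Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    low, high = max(0.0, center - margin), min(1.0, center + margin)
    return high - low

for k, n in [(8, 20), (8, 40), (8, 80)]:
    width = wilson_width(k, n)
    # Third column: interval width; fourth: whether the CI-width gate passes
    print(k, n, round(width, 3), width <= 0.30)
```

At exactly the minimum gates (k = 8, n = 20) the interval is roughly 0.39 wide, so the CI-width gate rejects it. In this calculation a topic at k = 8 passes once the period has more reviews (n = 40 gives a width of about 0.24), which suggests the width gate is the binding constraint for small periods.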

### Trend Handling

- **Accurate when**: Topic structure is stable (most real issues)
- **Omitted when**: Match confidence is low
- **Separate trends**: `trend_neg` and `trend_pos` computed independently
- **Never**: Show confidently wrong trends

---

## F. Implementation Plan

| Day | Deliverable |
|-----|-------------|
| 1-2 | Span splitter + embedding service |
| 3-4 | Sentiment scoring + staff extraction |
| 5-6 | Database schema + ingest pipeline |
| 7-8 | Clustering + stats + labeling |
| 9-10 | Trend matching + quote selection |
| 11-12 | LLM integration + end-to-end testing |

**Total: ~12 days for a competent engineer**

---

## G. What's NOT in v1

| Feature | Rationale | v2 Trigger |
|---------|-----------|------------|
| Token-window segmentation | Punctuation split is good enough | Run-on reviews cause quality issues |
| Many-to-many trend matching | Best-match is good enough | Trend accuracy complaints |
| Owner-driven topic editing | Not needed yet | Users want to rename/merge topics |
| Multi-location rollup | Different product | Chain restaurants sign up |
| Anomaly detection | Different product | Fraud complaints |
| Response templates | Low value | User requests |

---

## H. Known Limitations / Future Improvements

| Limitation | Impact | v2 Consideration |
|------------|--------|------------------|
| Sentiment anchors cover EN/ES/DE only | Other languages (FR, PT, IT, etc.) rely on multilingual-e5 alignment | Add 5-10 anchors per new language as user base grows |
| KMeans fallback uses pseudo-noise heuristic | Sharp quotes may be slightly less sharp for >4k span reports | Consider HDBSCAN with approximate nearest neighbors (pynndescent) |
| No streaming for very large reports | Memory pressure if report spans exceed 10k | Paginate or sample spans for extreme cases |

---

## I. Final Checklist Before Ship

- [ ] Span splitter handles mobile text (no punctuation edge case)
- [ ] Embeddings are L2-normalized before clustering
- [ ] HDBSCAN uses precomputed cosine distance matrix
- [ ] Clustering has fallback for >4000 spans (PCA + KMeans)
- [ ] KMeans fallback generates pseudo-noise (bottom 3% by centroid distance)
- [ ] Stats are review-level presence (not span counts)
- [ ] Labels are deduplicated across topics (Unicode-safe)
- [ ] Trends computed separately for neg/pos (trend_neg, trend_pos)
- [ ] Trends require min support in BOTH periods
- [ ] Sentiment anchors are multi-word averaged (not bag sentence)
- [ ] Sentiment anchors include EN/ES/DE words
- [ ] Staff history lookup uses normalized names
- [ ] noise_spans passed to quote selection
- [ ] pgvector index uses HNSW (or ivfflat with ANALYZE documented)
- [ ] LLM prompt enforces "only use provided numbers"
- [ ] Cost per report < $0.30

---

**Document version**: v1-final-reviewed

**Status**: Ready for implementation (with reviewer fixes applied)