Initial commit - WhyRating Engine (Google Reviews Scraper)

2026-02-02 18:19:00 +00:00
parent 0543a08242
commit 2206ddeff2
136 changed files with 51138 additions and 855 deletions
--- a/packages/reviewiq-pipeline/IMPROVEMENTS.md
+++ b/packages/reviewiq-pipeline/IMPROVEMENTS.md
@@ -0,0 +1,311 @@
+# ReviewIQ Pipeline Improvement Suggestions
+
+These suggestions are based on validation testing and analysis of the classification pipeline.
+
+---
+
+## 🔴 High Priority (Quality & Cost Impact)
+
+### 1. Multi-Aspect Detection Gap
+**Problem**: LLM misses secondary codes in multi-aspect reviews.
+- "not too expensive" → V4.01 missed
+- "easy and fast" → J1.01 missed
+
+**Solution**: Update classification prompt to:
+```
+For reviews with multiple distinct topics:
+1. Extract ALL aspects, not just the dominant one
+2. Assign urt_secondary codes for each additional aspect
+3. Flag reviews with 3+ aspects as "complex"
+```
+
+**Impact**: ~15-20% of reviews contain multiple aspects that are currently only partially captured.
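
As a rough illustration of rule 3 in the prompt, the sketch below flags a classified review as "complex" on the pipeline side. Only `urt_secondary` appears in the prompt above; the `urt_primary` field name and the result shape are assumptions made for this example.

```python
def count_aspects(result: dict) -> int:
    """Count distinct codes assigned to one review (primary plus secondary)."""
    codes = set(result.get("urt_secondary", []))
    primary = result.get("urt_primary")  # hypothetical field name
    if primary:
        codes.add(primary)
    return len(codes)


def is_complex(result: dict, threshold: int = 3) -> bool:
    """Mirror the prompt rule: 3+ aspects marks a review as complex."""
    return count_aspects(result) >= threshold


example = {"urt_primary": "P1.01", "urt_secondary": ["V4.01", "J1.01"]}
print(is_complex(example))
```

Checking the flag in code as well as in the prompt gives a count that does not depend on the model remembering to set it.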
+
+---
+
+### 2. Enable Smart Router (Cost Savings)
+**Problem**: All reviews go through expensive Sonnet model.
+
+**Solution**: Enable the implemented router:
+```python
+Config(
+    router_enabled=True,
+    router_conservative=True,
+    router_cheap_model="claude-3-5-haiku-20241022",
+)
+```
+
+**Impact**:
+- SKIP (1.6%): $0 cost (was ~$0.05)
+- CHEAP (31.4%): ~10x cheaper with Haiku
+- **Estimated 25-30% cost reduction**
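
The estimate can be sanity-checked with a short calculation. This sketch assumes the remaining 67.0% of reviews stay on the full-price model and treats "~10x cheaper" as exactly one tenth of the per-review cost; both are simplifications.

```python
# Route shares taken from the figures above; the remainder is assumed
# to stay on the full-price model.
skip_share = 0.016
cheap_share = 0.314
full_share = 1.0 - skip_share - cheap_share

# Relative per-review cost of each route (full-price model = 1.0).
# Treating "~10x cheaper" as exactly 0.1 is an assumption.
relative_cost = (
    skip_share * 0.0
    + cheap_share * 0.1
    + full_share * 1.0
)
reduction = 1.0 - relative_cost

print(f"relative cost: {relative_cost:.4f}, reduction: {reduction:.1%}")
```

Under these assumptions the blended cost comes out at roughly 70% of the current cost, which sits at the upper end of the 25-30% range.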
+
+---
+
+### 3. JSON Truncation Recovery
+**Problem**: ~33% of batches hit JSON truncation, causing partial failures.
+
+**Current State**: Partial recovery implemented but still loses some reviews.
+
+**Solution**:
+1. Reduce batch size when reviews are long
+2. Add `max_tokens` buffer based on expected output
+3. Implement streaming JSON parser for real-time recovery
+
+```python
+# Dynamic batch sizing based on review length
+if avg_review_length > 200:
+    batch_size = min(batch_size, 15)
+if avg_review_length > 500:
+    batch_size = min(batch_size, 8)
+```
+
+**Impact**: Reduce fallback processing by ~50%, saving time and cost.
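
One way to recover reviews from a truncated response is to decode the array element by element and keep whatever parsed cleanly. The sketch below uses the standard library's `json.JSONDecoder.raw_decode`. It assumes the model returns a JSON array of objects, and `salvage_json_array` is a hypothetical helper, not the pipeline's existing recovery code.

```python
import json


def salvage_json_array(raw: str) -> list[dict]:
    """Return every complete object found before the truncation point."""
    decoder = json.JSONDecoder()
    items: list[dict] = []
    pos = raw.find("[")
    if pos == -1:
        return items
    pos += 1
    while pos < len(raw):
        # raw_decode does not skip leading whitespace or separators
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            break
        try:
            obj, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            break  # hit the truncated element; keep what we have
        if not isinstance(obj, dict):
            break  # truncated scalars are ambiguous, so only objects are kept
        items.append(obj)
    return items


truncated = '[{"id": 1, "code": "V4.01"}, {"id": 2, "code": "J1.01"}, {"id": 3, "co'
print(salvage_json_array(truncated))
```

Only the reviews missing from the salvaged list would then need to go through fallback processing.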
+
+---
+
+## 🟡 Medium Priority (Reliability & Accuracy)
+
+### 4. LLM Response Caching
+**Problem**: Retries reprocess already-classified reviews.
+
+**Solution**: Cache successful LLM responses by content hash:
+```python
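+import json
+
+# Assumes `redis` is an async Redis client created elsewhere
+class ResponseCache:
+    async def get(self, text_hash: str) -> dict | None:
+        raw = await redis.get(f"llm:classify:{text_hash}")
+        # Redis returns the serialized string (or None on a miss)
+        return json.loads(raw) if raw is not None else None
+
+    async def set(self, text_hash: str, response: dict, ttl: int = 86400):
+        await redis.setex(f"llm:classify:{text_hash}", ttl, json.dumps(response))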
+```
+
+**Impact**:
+- Zero LLM cost for re-runs on the same reviews
+- Faster pipeline retries
+- Useful for A/B testing prompts
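
The source does not say how `text_hash` is derived. A minimal sketch, assuming SHA-256 over whitespace-normalized text, is shown below. The `prompt_version` parameter is an assumption added so that results from different prompt variants never share a cache entry.

```python
import hashlib


def cache_key(text: str, prompt_version: str = "v1") -> str:
    """Build a cache key from normalized review text and a prompt version."""
    # Collapse whitespace so trivially different copies share one entry
    normalized = " ".join(text.split())
    # Include the prompt version so prompt variants never share entries
    payload = f"{prompt_version}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


same = cache_key("Great  service,\nfast checkout") == cache_key("Great service, fast checkout")
print(same)
```

Keying on the prompt version matters for the A/B testing use case: without it, the treatment variant would read back the control variant's cached answers.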
+
+---
+
+### 5. Confidence-Based Routing
+**Problem**: LLM assigns codes even when uncertain.
+
+**Solution**: Add confidence threshold in prompt:
+```
+If confidence < 70%:
+  - Set confidence: "low"
+  - Use generic code (V4.03) instead of guessing
+  - Flag for human review
+```
+
+**Impact**: Reduces misclassifications, improves data quality.
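
The prompt asks the model to apply this rule itself. A pipeline-side check could enforce the same rule in case the model does not follow it. The sketch below assumes the model also returns a numeric `confidence_score`; that field, `urt_primary`, and `needs_human_review` are hypothetical names.

```python
LOW_CONFIDENCE_THRESHOLD = 0.70  # matches the 70% rule in the prompt
GENERIC_CODE = "V4.03"           # generic code named in the prompt


def apply_confidence_policy(result: dict) -> dict:
    """Return a copy of the result with the low-confidence rule enforced."""
    checked = dict(result)
    score = checked.get("confidence_score", 0.0)  # hypothetical field
    if score < LOW_CONFIDENCE_THRESHOLD:
        checked["confidence"] = "low"
        checked["urt_primary"] = GENERIC_CODE      # hypothetical field
        checked["needs_human_review"] = True       # hypothetical field
    else:
        checked["needs_human_review"] = False
    return checked


uncertain = apply_confidence_policy({"urt_primary": "P1.01", "confidence_score": 0.55})
print(uncertain["urt_primary"], uncertain["needs_human_review"])
```

A missing score is treated as low confidence here, which errs on the side of human review.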
+
+---
+
+### 6. Post-Classification Validation
+**Problem**: Some classifications don't match review content.
+
+**Solution**: Add rule-based validation layer:
+```python
+def validate_classification(text: str, urt_code: str) -> tuple[bool, str | None]:
+    """Return (is_valid, suggested_code); suggested_code is None when valid."""
+    # Price mentioned but not V4.xx code?
+    if has_price_mention(text) and not urt_code.startswith("V4"):
+        return False, "V4.01"  # Suggest correction
+
+    # Staff mentioned but not P1.xx code?
+    if has_staff_mention(text) and not urt_code.startswith("P1"):
+        return False, "P1.01"
+
+    return True, None
+```
+
+**Impact**: Catch ~5-10% of obvious misclassifications.
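
The helpers `has_price_mention` and `has_staff_mention` are not defined in this document. One possible shape, assuming simple keyword matching with word boundaries, is sketched below. The keyword lists are illustrative and would need tuning against real reviews.

```python
import re

# Illustrative keyword lists (assumptions, not the pipeline's actual rules)
PRICE_PATTERN = re.compile(
    r"\b(price[sd]?|pricing|cost(s|ly)?|expensive|cheap|overpriced)\b",
    re.IGNORECASE,
)
STAFF_PATTERN = re.compile(
    r"\b(staff|employees?|waiter|waitress|cashier|manager)\b",
    re.IGNORECASE,
)


def has_price_mention(text: str) -> bool:
    """True if the text contains a whole-word price keyword."""
    return PRICE_PATTERN.search(text) is not None


def has_staff_mention(text: str) -> bool:
    """True if the text contains a whole-word staff keyword."""
    return STAFF_PATTERN.search(text) is not None


print(has_price_mention("Good food and not too expensive"))
print(has_staff_mention("The room was clean"))
```

Word boundaries keep words such as "priceless" from counting as price mentions, which plain substring checks would get wrong.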
+
+---
+
+### 7. Span Coverage Validation
+**Problem**: Some review text not covered by any span.
+
+**Solution**: Track span coverage percentage:
+```python
+def calculate_coverage(text: str, spans: list) -> float:
+    if not text:
+        return 0.0  # Avoid division by zero on empty reviews
+    covered_chars = set()
+    for span in spans:
+        covered_chars.update(range(span['start'], span['end']))
+    return len(covered_chars) / len(text)
+
+# Flag if coverage < 60%
+coverage = calculate_coverage(text, spans)
+if coverage < 0.6:
+    log.warning(f"Low span coverage: {coverage:.0%}")
+```
+
+**Impact**: Identify reviews where LLM skipped important content.
+
+---
+
+## 🟢 Lower Priority (Optimization & Monitoring)
+
+### 8. Taxonomy Alignment Scoring
+**Problem**: Hard to measure classification quality at scale.
+
+**Solution**: Build automated taxonomy alignment checker:
+```python
+# Check if keywords in text match expected domain
+DOMAIN_KEYWORDS = {
+    "V4": ["price", "money", "worth", "cost", "expensive", "cheap"],
+    "P1": ["staff", "employee", "service", "friendly", "rude"],
+    "J1": ["wait", "fast", "slow", "quick", "time", "minutes"],
+    "E1": ["clean", "dirty", "comfortable", "space", "room"],
+}
+
+def alignment_score(text: str, urt_code: str) -> float:
+    domain = urt_code[0:2]
+    keywords = DOMAIN_KEYWORDS.get(domain, [])
+    matches = sum(1 for kw in keywords if kw in text.lower())
+    return matches / len(keywords) if keywords else 0.5
+```
+
+**Impact**: Quality dashboard, regression detection.
+
+---
+
+### 9. Batch Size Auto-Tuning
+**Problem**: Fixed batch size doesn't adapt to review complexity.
+
+**Solution**: Implement adaptive batch sizing:
+```python
+class AdaptiveBatchSizer:
+    def __init__(self):
+        # Each entry: {'size': int, 'success_rate': float, 'avg_len': float}
+        self.history = []
+
+    def recommend_size(self, reviews: list) -> int:
+        avg_length = sum(len(r['text']) for r in reviews) / len(reviews)
+
+        # Learn from history: largest batch size that succeeded reliably
+        # on reviews of similar length
+        reliable = [
+            h['size'] for h in self.history
+            if abs(h['avg_len'] - avg_length) < 50 and h['success_rate'] > 0.95
+        ]
+        if reliable:
+            return max(reliable)
+
+        # Default heuristics (also used when no similar batch was reliable)
+        if avg_length > 300:
+            return 10
+        elif avg_length > 150:
+            return 20
+        else:
+            return 30
+```
+
+---
+
+### 10. Cost Tracking Dashboard
+**Problem**: No visibility into per-job, per-stage costs.
+
+**Solution**: Add cost tracking to pipeline output:
+```python
+from dataclasses import dataclass
+
+@dataclass
+class CostBreakdown:
+    stage: str
+    model: str
+    input_tokens: int
+    output_tokens: int
+    cached_tokens: int
+    cost_usd: float
+    reviews_processed: int
+    cost_per_review: float
+
+# Store in database (DDL kept as a string so the block is valid Python)
+CREATE_COST_TRACKING_TABLE = """
+CREATE TABLE pipeline.cost_tracking (
+    id SERIAL PRIMARY KEY,
+    execution_id UUID,
+    job_id UUID,
+    stage VARCHAR(50),
+    model VARCHAR(100),
+    input_tokens INT,
+    output_tokens INT,
+    cached_tokens INT,
+    cost_usd DECIMAL(10, 6),
+    reviews_processed INT,
+    created_at TIMESTAMP DEFAULT NOW()
+);
+"""
+```
+
+---
+
+### 11. Streaming Classification
+**Problem**: Large batches block until complete.
+
+**Solution**: Implement streaming for real-time progress:
+```python
+async def classify_streaming(reviews: list):
+    async for partial_result in llm_client.stream_batch(reviews):
+        # Persist first, so the result is saved even if the consumer
+        # stops iterating after receiving it
+        await persist_classification(partial_result)
+
+        # Yield each review as it completes
+        yield partial_result
+```
+
+**Impact**: Better UX, faster partial results, resilience to failures.
+
+---
+
+### 12. A/B Testing Framework
+**Problem**: Hard to compare prompt/model changes.
+
+**Solution**: Built-in A/B testing:
+```python
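+from dataclasses import dataclass, field
+
+@dataclass
+class ABTestConfig:
+    test_name: str
+    variant_a: ClassificationConfig  # Control
+    variant_b: ClassificationConfig  # Treatment
+    split_ratio: float = 0.1  # 10% to treatment
+    metrics: list[str] = field(default_factory=lambda: ["accuracy", "cost", "latency"])
+
+# `test` is an ABTestConfig instance. Control runs on every review;
+# treatment re-runs the first `split_ratio` share of the same reviews
+n_treatment = int(len(reviews) * test.split_ratio)
+results_a = await classify(reviews, test.variant_a)
+results_b = await classify(reviews[:n_treatment], test.variant_b)
+
+# Compare metrics
+compare_results(results_a, results_b)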
+```
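
Slicing the first 10% of a list can bias the treatment group when reviews arrive in a meaningful order, for example sorted by date. A common alternative is to assign each review by hashing a stable identifier. The sketch below is one way to do that; `assign_variant` and the `review_id` argument are hypothetical and not part of the existing pipeline.

```python
import hashlib


def assign_variant(review_id: str, test_name: str, split_ratio: float = 0.1) -> str:
    """Deterministically map a review to 'a' (control) or 'b' (treatment)."""
    digest = hashlib.sha256(f"{test_name}:{review_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "b" if bucket < split_ratio else "a"


assignments = [assign_variant(f"review-{i}", "prompt-v2") for i in range(10_000)]
treatment_share = assignments.count("b") / len(assignments)
print(f"treatment share: {treatment_share:.3f}")
```

Because the assignment depends only on the test name and the review identifier, a retried job places each review in the same variant as before.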
+
+---
+
+## Implementation Priority Matrix
+
+| Improvement | Effort | Impact | Priority |
+|-------------|--------|--------|----------|
+| 1. Multi-Aspect Detection | Medium | High | 🔴 P1 |
+| 2. Enable Smart Router | Low | High | 🔴 P1 |
+| 3. JSON Truncation Fix | Medium | High | 🔴 P1 |
+| 4. Response Caching | Medium | Medium | 🟡 P2 |
+| 5. Confidence Routing | Medium | Medium | 🟡 P2 |
+| 6. Post-Classification Validation | Low | Medium | 🟡 P2 |
+| 7. Span Coverage Validation | Low | Low | 🟢 P3 |
+| 8. Taxonomy Alignment | Medium | Low | 🟢 P3 |
+| 9. Adaptive Batch Sizing | High | Medium | 🟢 P3 |
+| 10. Cost Dashboard | Medium | Low | 🟢 P3 |
+| 11. Streaming Classification | High | Medium | 🟢 P3 |
+| 12. A/B Testing | High | Low | 🟢 P3 |
+
+---
+
+## Quick Wins (Can implement today)
+
+1. **Enable router** - Already implemented, just needs config flag
+2. **Reduce batch size** - Set `classification_batch_size=15` for long reviews
+3. **Add span coverage logging** - Simple metric to track quality
+4. **Post-classification keyword check** - Basic validation rules
+
+---
+
+## Estimated Impact Summary
+
+| Area | Current | After Improvements |
+|------|---------|-------------------|
+| Cost per 1000 reviews | ~$3.40 | ~$2.40 (-30%) |
+| Classification accuracy | ~85% | ~92% |
+| Multi-aspect capture | ~65% | ~90% |
+| Batch failure rate | ~33% | ~10% |
+| Pipeline retry cost | 100% | ~20% (with caching) |