# ReviewIQ Pipeline Improvement Suggestions Based on validation testing and analysis of the classification pipeline. --- ## 🔴 High Priority (Quality & Cost Impact) ### 1. Multi-Aspect Detection Gap **Problem**: LLM misses secondary codes in multi-aspect reviews. - "not too expensive" → V4.01 missed - "easy and fast" → J1.01 missed **Solution**: Update classification prompt to: ``` For reviews with multiple distinct topics: 1. Extract ALL aspects, not just the dominant one 2. Assign urt_secondary codes for each additional aspect 3. Flag reviews with 3+ aspects as "complex" ``` **Impact**: ~15-20% of reviews have multiple aspects being partially captured. --- ### 2. Enable Smart Router (Cost Savings) **Problem**: All reviews go through expensive Sonnet model. **Solution**: Enable the implemented router: ```python Config( router_enabled=True, router_conservative=True, router_cheap_model="claude-3-5-haiku-20241022", ) ``` **Impact**: - SKIP (1.6%): $0 cost (was ~$0.05) - CHEAP (31.4%): ~10x cheaper with Haiku - **Estimated 25-30% cost reduction** --- ### 3. JSON Truncation Recovery **Problem**: ~33% of batches hit JSON truncation, causing partial failures. **Current State**: Partial recovery implemented but still loses some reviews. **Solution**: 1. Reduce batch size when reviews are long 2. Add `max_tokens` buffer based on expected output 3. Implement streaming JSON parser for real-time recovery ```python # Dynamic batch sizing based on review length if avg_review_length > 200: batch_size = min(batch_size, 15) if avg_review_length > 500: batch_size = min(batch_size, 8) ``` **Impact**: Reduce fallback processing by ~50%, saving time and cost. --- ## 🟡 Medium Priority (Reliability & Accuracy) ### 4. LLM Response Caching **Problem**: Retries reprocess already-classified reviews. **Solution**: Cache successful LLM responses by content hash: ```python class ResponseCache: async def get(self, text_hash: str) -> dict | None: return await redis.get(f"llm:classify:{text_hash}") async def set(self, text_hash: str, response: dict, ttl: int = 86400): await redis.setex(f"llm:classify:{text_hash}", ttl, json.dumps(response)) ``` **Impact**: - Zero cost for re-runs on same reviews - Faster pipeline retries - Useful for A/B testing prompts --- ### 5. Confidence-Based Routing **Problem**: LLM assigns codes even when uncertain. **Solution**: Add confidence threshold in prompt: ``` If confidence < 70%: - Set confidence: "low" - Use generic code (V4.03) instead of guessing - Flag for human review ``` **Impact**: Reduces misclassifications, improves data quality. --- ### 6. Post-Classification Validation **Problem**: Some classifications don't match review content. **Solution**: Add rule-based validation layer: ```python def validate_classification(text: str, urt_code: str) -> bool: # Price mentioned but not V4.xx code? if has_price_mention(text) and not urt_code.startswith("V4"): return False, "V4.01" # Suggest correction # Staff mentioned but not P1.xx code? if has_staff_mention(text) and not urt_code.startswith("P1"): return False, "P1.01" return True, None ``` **Impact**: Catch ~5-10% of obvious misclassifications. --- ### 7. Span Coverage Validation **Problem**: Some review text not covered by any span. **Solution**: Track span coverage percentage: ```python def calculate_coverage(text: str, spans: list) -> float: covered_chars = set() for span in spans: covered_chars.update(range(span['start'], span['end'])) return len(covered_chars) / len(text) # Flag if coverage < 60% if coverage < 0.6: log.warning(f"Low span coverage: {coverage:.0%}") ``` **Impact**: Identify reviews where LLM skipped important content. --- ## 🟢 Lower Priority (Optimization & Monitoring) ### 8. Taxonomy Alignment Scoring **Problem**: Hard to measure classification quality at scale. **Solution**: Build automated taxonomy alignment checker: ```python # Check if keywords in text match expected domain DOMAIN_KEYWORDS = { "V4": ["price", "money", "worth", "cost", "expensive", "cheap"], "P1": ["staff", "employee", "service", "friendly", "rude"], "J1": ["wait", "fast", "slow", "quick", "time", "minutes"], "E1": ["clean", "dirty", "comfortable", "space", "room"], } def alignment_score(text: str, urt_code: str) -> float: domain = urt_code[0:2] keywords = DOMAIN_KEYWORDS.get(domain, []) matches = sum(1 for kw in keywords if kw in text.lower()) return matches / len(keywords) if keywords else 0.5 ``` **Impact**: Quality dashboard, regression detection. --- ### 9. Batch Size Auto-Tuning **Problem**: Fixed batch size doesn't adapt to review complexity. **Solution**: Implement adaptive batch sizing: ```python class AdaptiveBatchSizer: def __init__(self): self.history = [] # (batch_size, success_rate, avg_tokens) def recommend_size(self, reviews: list) -> int: avg_length = sum(len(r['text']) for r in reviews) / len(reviews) # Learn from history if self.history: # Find optimal size for similar review lengths similar = [h for h in self.history if abs(h['avg_len'] - avg_length) < 50] if similar: return max(h['size'] for h in similar if h['success_rate'] > 0.95) # Default heuristics if avg_length > 300: return 10 elif avg_length > 150: return 20 else: return 30 ``` --- ### 10. Cost Tracking Dashboard **Problem**: No visibility into per-job, per-stage costs. **Solution**: Add cost tracking to pipeline output: ```python @dataclass class CostBreakdown: stage: str model: str input_tokens: int output_tokens: int cached_tokens: int cost_usd: float reviews_processed: int cost_per_review: float # Store in database CREATE TABLE pipeline.cost_tracking ( id SERIAL PRIMARY KEY, execution_id UUID, job_id UUID, stage VARCHAR(50), model VARCHAR(100), input_tokens INT, output_tokens INT, cached_tokens INT, cost_usd DECIMAL(10, 6), reviews_processed INT, created_at TIMESTAMP DEFAULT NOW() ); ``` --- ### 11. Streaming Classification **Problem**: Large batches block until complete. **Solution**: Implement streaming for real-time progress: ```python async def classify_streaming(reviews: list): async for partial_result in llm_client.stream_batch(reviews): # Yield each review as it completes yield partial_result # Persist immediately await persist_classification(partial_result) ``` **Impact**: Better UX, faster partial results, resilience to failures. --- ### 12. A/B Testing Framework **Problem**: Hard to compare prompt/model changes. **Solution**: Built-in A/B testing: ```python class ABTestConfig: test_name: str variant_a: ClassificationConfig # Control variant_b: ClassificationConfig # Treatment split_ratio: float = 0.1 # 10% to treatment metrics: list[str] = ["accuracy", "cost", "latency"] # Run both variants on same reviews results_a = await classify(reviews, config_a) results_b = await classify(reviews[:int(len(reviews)*0.1)], config_b) # Compare metrics compare_results(results_a, results_b) ``` --- ## Implementation Priority Matrix | Improvement | Effort | Impact | Priority | |-------------|--------|--------|----------| | 1. Multi-Aspect Detection | Medium | High | 🔴 P1 | | 2. Enable Smart Router | Low | High | 🔴 P1 | | 3. JSON Truncation Fix | Medium | High | 🔴 P1 | | 4. Response Caching | Medium | Medium | 🟡 P2 | | 5. Confidence Routing | Medium | Medium | 🟡 P2 | | 6. Post-Classification Validation | Low | Medium | 🟡 P2 | | 7. Span Coverage Validation | Low | Low | 🟢 P3 | | 8. Taxonomy Alignment | Medium | Low | 🟢 P3 | | 9. Adaptive Batch Sizing | High | Medium | 🟢 P3 | | 10. Cost Dashboard | Medium | Low | 🟢 P3 | | 11. Streaming Classification | High | Medium | 🟢 P3 | | 12. A/B Testing | High | Low | 🟢 P3 | --- ## Quick Wins (Can implement today) 1. **Enable router** - Already implemented, just needs config flag 2. **Reduce batch size** - Change `classification_batch_size=15` for long reviews 3. **Add span coverage logging** - Simple metric to track quality 4. **Post-classification keyword check** - Basic validation rules --- ## Estimated Impact Summary | Area | Current | After Improvements | |------|---------|-------------------| | Cost per 1000 reviews | ~$3.40 | ~$2.40 (-30%) | | Classification accuracy | ~85% | ~92% | | Multi-aspect capture | ~65% | ~90% | | Batch failure rate | ~33% | ~10% | | Pipeline retry cost | 100% | ~20% (with caching) |