Initial commit - WhyRating Engine (Google Reviews Scraper)
packages/reviewiq-pipeline/IMPROVEMENTS.md
@@ -0,0 +1,311 @@
# ReviewIQ Pipeline Improvement Suggestions

Based on validation testing and analysis of the classification pipeline.

---

## 🔴 High Priority (Quality & Cost Impact)

### 1. Multi-Aspect Detection Gap
**Problem**: LLM misses secondary codes in multi-aspect reviews.
- "not too expensive" → V4.01 missed
- "easy and fast" → J1.01 missed

**Solution**: Update classification prompt to:
```
For reviews with multiple distinct topics:
1. Extract ALL aspects, not just the dominant one
2. Assign urt_secondary codes for each additional aspect
3. Flag reviews with 3+ aspects as "complex"
```

**Impact**: ~15-20% of reviews contain multiple aspects, and these are currently only partially captured.
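
The prompt rules above can also be enforced on the pipeline side. A minimal sketch, assuming each classification is returned as a dict with a `urt_primary` string and a `urt_secondary` list; both field shapes and the `flag_complex` helper are hypothetical and only illustrate the idea:

```python
def flag_complex(classification: dict, threshold: int = 3) -> dict:
    """Count distinct aspect codes and mark reviews with `threshold`+ aspects as complex."""
    # Assumed shape: {"urt_primary": "P1.01", "urt_secondary": ["V4.01", ...]}
    codes = [classification["urt_primary"], *classification.get("urt_secondary", [])]
    # Deduplicate while keeping the primary code first
    unique_codes = list(dict.fromkeys(codes))
    return {
        **classification,
        "urt_secondary": unique_codes[1:],
        "aspect_count": len(unique_codes),
        "complex": len(unique_codes) >= threshold,
    }
```

Counting in code rather than trusting the model's own "complex" flag keeps the threshold adjustable without another prompt change.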

---

### 2. Enable Smart Router (Cost Savings)
**Problem**: All reviews go through the expensive Sonnet model.

**Solution**: Enable the implemented router:
```python
Config(
    router_enabled=True,
    router_conservative=True,
    router_cheap_model="claude-3-5-haiku-20241022",
)
```

**Impact**:
- SKIP (1.6%): $0 cost (was ~$0.05)
- CHEAP (31.4%): ~10x cheaper with Haiku
- **Estimated 25-30% cost reduction**
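
The estimate can be checked from the routing shares above. This is a rough calculation that assumes every review costs about the same on the expensive model; that uniformity is an assumption, since the source gives only the shares and the ~10x ratio:

```python
def estimated_savings(skip_share: float, cheap_share: float, cheap_ratio: float) -> float:
    """Fraction of the original spend saved by routing.

    skip_share:  fraction of reviews that need no LLM call at all
    cheap_share: fraction of reviews sent to the cheap model
    cheap_ratio: cheap-model cost as a fraction of the expensive-model cost
    """
    return skip_share + cheap_share * (1.0 - cheap_ratio)


savings = estimated_savings(skip_share=0.016, cheap_share=0.314, cheap_ratio=0.1)
print(f"{savings:.1%}")  # 29.9%
```

The result sits at the upper end of the quoted range. Longer reviews are more likely to stay on the expensive model, so the realized saving may be lower.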

---

### 3. JSON Truncation Recovery
**Problem**: ~33% of batches hit JSON truncation, causing partial failures.

**Current State**: Partial recovery is implemented but still loses some reviews.

**Solution**:
1. Reduce batch size when reviews are long
2. Add a `max_tokens` buffer based on expected output
3. Implement a streaming JSON parser for real-time recovery

```python
# Dynamic batch sizing based on review length
if avg_review_length > 200:
    batch_size = min(batch_size, 15)
if avg_review_length > 500:
    batch_size = min(batch_size, 8)
```

**Impact**: Reduce fallback processing by ~50%, saving time and cost.
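
For step 3, one way to salvage a truncated response is to decode the array element by element and stop at the first incomplete one. This is a sketch under the assumption that the model returns a single JSON array of objects; `recover_complete_objects` is a hypothetical helper, not the recovery code already in the pipeline:

```python
import json


def recover_complete_objects(raw: str) -> list:
    """Return every complete element of a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    results = []
    pos = raw.find("[")
    if pos == -1:
        return results
    pos += 1  # step past the opening bracket
    while pos < len(raw):
        # raw_decode does not skip separators, so do it by hand
        while pos < len(raw) and raw[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(raw) or raw[pos] == "]":
            break
        try:
            obj, pos = decoder.raw_decode(raw, pos)
        except json.JSONDecodeError:
            break  # the remaining text is a truncated element
        results.append(obj)
    return results
```

Reviews missing from the recovered list can then be retried in a smaller batch instead of reprocessing the whole batch.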

---

## 🟡 Medium Priority (Reliability & Accuracy)

### 4. LLM Response Caching
**Problem**: Retries reprocess already-classified reviews.

**Solution**: Cache successful LLM responses by content hash:
```python
import json

# `redis` is assumed to be an async client (e.g. redis.asyncio.Redis) created elsewhere
class ResponseCache:
    async def get(self, text_hash: str) -> dict | None:
        raw = await redis.get(f"llm:classify:{text_hash}")
        # Stored as a JSON string by set(), so decode on the way out
        return json.loads(raw) if raw is not None else None

    async def set(self, text_hash: str, response: dict, ttl: int = 86400):
        await redis.setex(f"llm:classify:{text_hash}", ttl, json.dumps(response))
```

**Impact**:
- Zero LLM cost for re-runs on the same reviews
- Faster pipeline retries
- Useful for A/B testing prompts
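
The cache is only safe if the hash covers everything that influences the response. A minimal sketch of a key function, assuming the prompt carries a version string; the `cache_key` name and its `prompt_version` and `model` parameters are hypothetical:

```python
import hashlib


def cache_key(review_text: str, prompt_version: str, model: str) -> str:
    """Build a stable hash for one (review, prompt, model) combination."""
    # Collapse runs of whitespace so trivially different copies share a key
    normalized = " ".join(review_text.split())
    # The unit separator keeps the three fields from running into each other
    payload = "\x1f".join([model, prompt_version, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the prompt version and model in the key means two prompt variants under test never read each other's cached responses.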

---

### 5. Confidence-Based Routing
**Problem**: LLM assigns codes even when uncertain.

**Solution**: Add confidence threshold in prompt:
```
If confidence < 70%:
- Set confidence: "low"
- Use generic code (V4.03) instead of guessing
- Flag for human review
```

**Impact**: Reduces misclassifications, improves data quality.
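
On the pipeline side, the flag set by the prompt still has to be acted on. A small sketch, assuming each result is a dict whose `confidence` field holds the string the prompt asks for; `split_by_confidence` is a hypothetical helper:

```python
def split_by_confidence(results: list) -> tuple:
    """Separate results the model marked as low confidence from the rest."""
    accepted, needs_review = [], []
    for result in results:
        # Assumed field: "confidence" is "low" when the model was unsure
        if result.get("confidence") == "low":
            needs_review.append(result)
        else:
            accepted.append(result)
    return accepted, needs_review
```

The `needs_review` list can feed a human review queue, while `accepted` continues through the normal persistence path.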

---

### 6. Post-Classification Validation
**Problem**: Some classifications don't match review content.

**Solution**: Add rule-based validation layer:
```python
def validate_classification(text: str, urt_code: str) -> tuple[bool, str | None]:
    """Return (is_valid, suggested_code); suggested_code is None when valid."""
    # Price mentioned but not V4.xx code?
    if has_price_mention(text) and not urt_code.startswith("V4"):
        return False, "V4.01"  # Suggest correction

    # Staff mentioned but not P1.xx code?
    if has_staff_mention(text) and not urt_code.startswith("P1"):
        return False, "P1.01"

    return True, None
```

**Impact**: Catch ~5-10% of obvious misclassifications.
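
The two helper predicates are not defined in this document. One possible shape, using word-boundary regexes so that "priceless" does not count as a price mention; the keyword lists are illustrative assumptions, not the pipeline's actual rules:

```python
import re

# Illustrative keyword lists (assumptions); tune them against real reviews
PRICE_PATTERN = re.compile(
    r"\b(price[sd]?|pricing|cost(s|ly)?|expensive|cheap|overpriced|affordable)\b",
    re.IGNORECASE,
)
STAFF_PATTERN = re.compile(
    r"\b(staff|employees?|waiter|waitress|manager|cashier|receptionist)\b",
    re.IGNORECASE,
)


def has_price_mention(text: str) -> bool:
    """True if the text contains a whole-word price keyword."""
    return PRICE_PATTERN.search(text) is not None


def has_staff_mention(text: str) -> bool:
    """True if the text contains a whole-word staff keyword."""
    return STAFF_PATTERN.search(text) is not None
```

A review that mentions both price and staff will fail one of the two rules whichever primary code it gets, so these checks are best treated as signals for review rather than automatic corrections.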

---

### 7. Span Coverage Validation
**Problem**: Some review text is not covered by any span.

**Solution**: Track span coverage percentage:
```python
def calculate_coverage(text: str, spans: list) -> float:
    """Fraction of characters in `text` covered by at least one span."""
    if not text:
        return 0.0  # Avoid dividing by zero on empty reviews
    covered_chars = set()
    for span in spans:
        covered_chars.update(range(span['start'], span['end']))
    return len(covered_chars) / len(text)

# Flag if coverage < 60%
coverage = calculate_coverage(text, spans)
if coverage < 0.6:
    log.warning(f"Low span coverage: {coverage:.0%}")
```

**Impact**: Identify reviews where LLM skipped important content.
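
A percentage says how much was skipped but not what. A complementary sketch that returns the skipped text itself, assuming the same `start`/`end` span dicts with an exclusive `end`; `uncovered_segments` is a hypothetical helper:

```python
def uncovered_segments(text: str, spans: list) -> list:
    """Return the stretches of `text` that no span covers, in order."""
    segments = []
    cursor = 0  # first character not yet known to be covered
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] > cursor:
            segments.append(text[cursor:span["start"]])
        cursor = max(cursor, span["end"])
    if cursor < len(text):
        segments.append(text[cursor:])
    # Drop gaps that contain only whitespace or punctuation
    return [s.strip() for s in segments if any(ch.isalnum() for ch in s)]
```

For example, with the text "Great food but the parking was terrible" and a single span covering characters 0 to 10, the function returns the unclassified remainder about parking.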

---

## 🟢 Lower Priority (Optimization & Monitoring)

### 8. Taxonomy Alignment Scoring
**Problem**: Hard to measure classification quality at scale.

**Solution**: Build automated taxonomy alignment checker:
```python
# Check if keywords in text match expected domain
DOMAIN_KEYWORDS = {
    "V4": ["price", "money", "worth", "cost", "expensive", "cheap"],
    "P1": ["staff", "employee", "service", "friendly", "rude"],
    "J1": ["wait", "fast", "slow", "quick", "time", "minutes"],
    "E1": ["clean", "dirty", "comfortable", "space", "room"],
}

def alignment_score(text: str, urt_code: str) -> float:
    """Fraction of the domain's keywords found in the text (0.5 for unknown domains)."""
    domain = urt_code[0:2]
    keywords = DOMAIN_KEYWORDS.get(domain, [])
    lowered = text.lower()
    # Substring match, so "time" also matches "sometimes"
    matches = sum(1 for kw in keywords if kw in lowered)
    return matches / len(keywords) if keywords else 0.5
```

**Impact**: Quality dashboard, regression detection.

---

### 9. Batch Size Auto-Tuning
**Problem**: Fixed batch size doesn't adapt to review complexity.

**Solution**: Implement adaptive batch sizing:
```python
class AdaptiveBatchSizer:
    def __init__(self):
        # Each entry: {'size': int, 'success_rate': float, 'avg_len': float}
        self.history = []

    def recommend_size(self, reviews: list) -> int:
        if not reviews:
            return 30  # Nothing to measure, use the largest default
        avg_length = sum(len(r['text']) for r in reviews) / len(reviews)

        # Learn from history: largest size that succeeded on similar review lengths
        similar = [h for h in self.history if abs(h['avg_len'] - avg_length) < 50]
        reliable = [h['size'] for h in similar if h['success_rate'] > 0.95]
        if reliable:
            return max(reliable)

        # Default heuristics
        if avg_length > 300:
            return 10
        elif avg_length > 150:
            return 20
        else:
            return 30
```
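
The sizer only learns if something feeds its history after each batch. A minimal sketch of that step, assuming history entries are dicts with the `size`, `success_rate`, and `avg_len` keys that `recommend_size` reads; `record_batch_outcome` is a hypothetical helper:

```python
def record_batch_outcome(history: list, reviews: list, succeeded: int) -> None:
    """Append one batch result to the sizer history."""
    size = len(reviews)
    if size == 0:
        return  # An empty batch carries no signal
    history.append({
        "size": size,
        "success_rate": succeeded / size,
        "avg_len": sum(len(r["text"]) for r in reviews) / size,
    })
```

It would be called with the sizer's own list, for example `record_batch_outcome(sizer.history, batch, succeeded=len(parsed))`, where `parsed` stands for the reviews that came back successfully classified.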

---

### 10. Cost Tracking Dashboard
**Problem**: No visibility into per-job, per-stage costs.

**Solution**: Add cost tracking to pipeline output:
```python
from dataclasses import dataclass

@dataclass
class CostBreakdown:
    stage: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    reviews_processed: int
    cost_per_review: float

# Store in database (SQL DDL, kept as a string so this block stays valid Python)
CREATE_COST_TRACKING_TABLE = """
CREATE TABLE pipeline.cost_tracking (
    id SERIAL PRIMARY KEY,
    execution_id UUID,
    job_id UUID,
    stage VARCHAR(50),
    model VARCHAR(100),
    input_tokens INT,
    output_tokens INT,
    cached_tokens INT,
    cost_usd DECIMAL(10, 6),
    reviews_processed INT,
    created_at TIMESTAMP DEFAULT NOW()
);
"""
```
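
The `cost_usd` and `cost_per_review` fields have to be derived from token counts. A sketch of that arithmetic, with prices passed in as parameters because the document does not state them; both function names are hypothetical, and cached tokens are ignored here for simplicity:

```python
def stage_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one stage given token counts and per-million-token prices."""
    return (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    ) / 1_000_000


def cost_per_review(cost_usd: float, reviews_processed: int) -> float:
    """Average cost per review; 0.0 when nothing was processed."""
    return cost_usd / reviews_processed if reviews_processed else 0.0
```

Storing only `cost_usd` and `reviews_processed` in the table, as the schema above does, keeps the per-review figure derivable without risking the two values drifting apart.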

---

### 11. Streaming Classification
**Problem**: Large batches block until complete.

**Solution**: Implement streaming for real-time progress:
```python
async def classify_streaming(reviews: list):
    async for partial_result in llm_client.stream_batch(reviews):
        # Persist immediately, before handing the result to the caller,
        # so it is saved even if the consumer stops iterating
        await persist_classification(partial_result)

        # Yield each review as it completes
        yield partial_result
```

**Impact**: Better UX, faster partial results, resilience to failures.

---

### 12. A/B Testing Framework
**Problem**: Hard to compare prompt/model changes.

**Solution**: Built-in A/B testing:
```python
from dataclasses import dataclass, field

@dataclass
class ABTestConfig:
    test_name: str
    variant_a: ClassificationConfig  # Control
    variant_b: ClassificationConfig  # Treatment
    split_ratio: float = 0.1  # 10% to treatment
    metrics: list[str] = field(default_factory=lambda: ["accuracy", "cost", "latency"])

async def run_ab_test(reviews: list, test: ABTestConfig):
    # Control runs on everything; treatment runs on the first split_ratio share
    sample_size = int(len(reviews) * test.split_ratio)
    results_a = await classify(reviews, test.variant_a)
    results_b = await classify(reviews[:sample_size], test.variant_b)

    # Compare metrics on the reviews both variants saw
    # (assumes classify() returns results in input order)
    return compare_results(results_a[:sample_size], results_b)
```
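
Taking the first 10% of a list can bias the sample if reviews arrive in a meaningful order, such as newest first. An alternative sketch assigns each review by hashing its ID, which gives a stable split across reruns; `assign_variant` and the `review_id` argument are hypothetical:

```python
import hashlib


def assign_variant(review_id: str, test_name: str, split_ratio: float = 0.1) -> str:
    """Deterministically map a review to 'a' (control) or 'b' (treatment)."""
    digest = hashlib.sha256(f"{test_name}:{review_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "b" if bucket < split_ratio else "a"
```

Including the test name in the hash gives each experiment its own independent split, so a review in the treatment group of one test is not automatically in the treatment group of the next.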

---

## Implementation Priority Matrix

| Improvement | Effort | Impact | Priority |
|-------------|--------|--------|----------|
| 1. Multi-Aspect Detection | Medium | High | 🔴 P1 |
| 2. Enable Smart Router | Low | High | 🔴 P1 |
| 3. JSON Truncation Fix | Medium | High | 🔴 P1 |
| 4. Response Caching | Medium | Medium | 🟡 P2 |
| 5. Confidence Routing | Medium | Medium | 🟡 P2 |
| 6. Post-Classification Validation | Low | Medium | 🟡 P2 |
| 7. Span Coverage Validation | Low | Low | 🟢 P3 |
| 8. Taxonomy Alignment | Medium | Low | 🟢 P3 |
| 9. Adaptive Batch Sizing | High | Medium | 🟢 P3 |
| 10. Cost Dashboard | Medium | Low | 🟢 P3 |
| 11. Streaming Classification | High | Medium | 🟢 P3 |
| 12. A/B Testing | High | Low | 🟢 P3 |

---

## Quick Wins (Can implement today)

1. **Enable router** - Already implemented, just needs the config flag
2. **Reduce batch size** - Set `classification_batch_size=15` for long reviews
3. **Add span coverage logging** - Simple metric to track quality
4. **Post-classification keyword check** - Basic validation rules

---

## Estimated Impact Summary

| Area | Current | After Improvements |
|------|---------|-------------------|
| Cost per 1000 reviews | ~$3.40 | ~$2.40 (-30%) |
| Classification accuracy | ~85% | ~92% |
| Multi-aspect capture | ~65% | ~90% |
| Batch failure rate | ~33% | ~10% |
| Pipeline retry cost | 100% | ~20% (with caching) |