alezmad/whyrating-engine-legacy

Alejandro Gutiérrez 2206ddeff2 Initial commit - WhyRating Engine (Google Reviews Scraper)

2026-02-02 18:19:00 +00:00

ReviewIQ Pipeline Improvement Suggestions

Based on validation testing and analysis of the classification pipeline.

🔴 High Priority (Quality & Cost Impact)

1. Multi-Aspect Detection Gap

Problem: LLM misses secondary codes in multi-aspect reviews.

"not too expensive" → V4.01 missed
"easy and fast" → J1.01 missed

Solution: Update classification prompt to:

For reviews with multiple distinct topics:
1. Extract ALL aspects, not just the dominant one
2. Assign urt_secondary codes for each additional aspect
3. Flag reviews with 3+ aspects as "complex"

Impact: An estimated 15-20% of reviews contain multiple aspects that are currently only partially captured.
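The three prompt rules above imply a result shape roughly like the one below. This is a minimal sketch: the `AspectResult` class, the `urt_primary` and `complex` field names, and the `build_result` helper are hypothetical and not taken from the pipeline; only `urt_secondary` and the "3+ aspects" rule come from the text.

```python
from dataclasses import dataclass, field

COMPLEX_THRESHOLD = 3  # "3+ aspects" from the prompt rule above


@dataclass
class AspectResult:
    urt_primary: str
    urt_secondary: list[str] = field(default_factory=list)
    complex: bool = False


def build_result(codes: list[str]) -> AspectResult:
    """Treat the first code as dominant and the rest as secondary aspects."""
    if not codes:
        raise ValueError("at least one code is required")
    # Drop duplicate codes while keeping the order the model returned them in
    unique = list(dict.fromkeys(codes))
    return AspectResult(
        urt_primary=unique[0],
        urt_secondary=unique[1:],
        complex=len(unique) >= COMPLEX_THRESHOLD,
    )


result = build_result(["P1.01", "V4.01", "J1.01"])
print(result.urt_secondary, result.complex)
```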

2. Enable Smart Router (Cost Savings)

Problem: All reviews go through the expensive Sonnet model.

Solution: Enable the implemented router:

Config(
    router_enabled=True,
    router_conservative=True,
    router_cheap_model="claude-3-5-haiku-20241022",
)

Impact:

SKIP (1.6%): $0 cost (was ~$0.05)
CHEAP (31.4%): ~10x cheaper with Haiku
Estimated 25-30% cost reduction
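The 25-30% estimate can be roughly reproduced from the two tier shares. The sketch below assumes every review costs the same on the default model regardless of tier, which is a simplification; under that assumption the figures land at the top of the quoted range.

```python
SKIP_SHARE = 0.016       # SKIP tier: no LLM call at all
CHEAP_SHARE = 0.314      # CHEAP tier: routed to Haiku
CHEAP_COST_RATIO = 0.1   # "~10x cheaper" than the default model


def estimated_saving(skip_share: float, cheap_share: float, cheap_cost_ratio: float) -> float:
    """Fraction of baseline cost removed, assuming a uniform per-review cost."""
    return skip_share + cheap_share * (1 - cheap_cost_ratio)


saving = estimated_saving(SKIP_SHARE, CHEAP_SHARE, CHEAP_COST_RATIO)
print(f"Estimated saving: {saving:.1%}")  # Estimated saving: 29.9%
```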

3. JSON Truncation Recovery

Problem: ~33% of batches hit JSON truncation, causing partial failures.

Current State: Partial recovery implemented but still loses some reviews.

Solution:

Reduce batch size when reviews are long
Add max_tokens buffer based on expected output
Implement streaming JSON parser for real-time recovery

# Dynamic batch sizing based on review length
if avg_review_length > 200:
    batch_size = min(batch_size, 15)
if avg_review_length > 500:
    batch_size = min(batch_size, 8)

Impact: Reduce fallback processing by ~50%, saving time and cost.
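One possible shape for the recovery step is to salvage every complete object from a truncated response instead of discarding the whole batch. The sketch below is not a true streaming parser; it uses the standard library's `json.JSONDecoder.raw_decode` after the fact, and it assumes the model returns a JSON array of objects. The function name is hypothetical.

```python
import json


def recover_complete_objects(truncated: str) -> list:
    """Return every fully formed element of a possibly truncated JSON array."""
    decoder = json.JSONDecoder()
    results = []
    pos = truncated.find("[")
    if pos == -1:
        return results
    pos += 1
    while pos < len(truncated):
        # Skip whitespace and commas between elements
        while pos < len(truncated) and truncated[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(truncated) or truncated[pos] == "]":
            break
        try:
            obj, pos = decoder.raw_decode(truncated, pos)
        except json.JSONDecodeError:
            break  # the remainder is the cut-off tail
        results.append(obj)
    return results


cut_off = '[{"id": 1, "code": "V4.01"}, {"id": 2, "code": "J1.01"}, {"id": 3, "co'
recovered = recover_complete_objects(cut_off)
print([item["id"] for item in recovered])  # [1, 2]
```

Reviews whose objects were cut off (id 3 here) can then be re-queued on their own rather than rerunning the whole batch.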

🟡 Medium Priority (Reliability & Accuracy)

4. LLM Response Caching

Problem: Retries reprocess already-classified reviews.

Solution: Cache successful LLM responses by content hash:

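import json

class ResponseCache:
    async def get(self, text_hash: str) -> dict | None:
        # Redis returns the stored JSON string (or None on a miss), so decode it
        raw = await redis.get(f"llm:classify:{text_hash}")
        return json.loads(raw) if raw is not None else None

    async def set(self, text_hash: str, response: dict, ttl: int = 86400) -> None:
        # Default TTL is 24 hours
        await redis.setex(f"llm:classify:{text_hash}", ttl, json.dumps(response))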

Impact:

Zero cost for re-runs on same reviews
Faster pipeline retries
Useful for A/B testing prompts
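How the content hash is built matters for the A/B testing use case: if the key depends only on the review text, a changed prompt would be served stale results. A minimal sketch of one way to build the key follows; the function name and the `model` and `prompt_version` parameters are assumptions, not part of the existing pipeline.

```python
import hashlib


def classification_cache_key(text: str, model: str, prompt_version: str) -> str:
    """Build a cache key that changes when the text, model, or prompt changes."""
    # Collapse whitespace so trivially different copies of a review share a key
    normalized = " ".join(text.split())
    # Use a separator that cannot appear in the normalized text
    payload = "\x1f".join((model, prompt_version, normalized))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"llm:classify:{digest}"


key_a = classification_cache_key("easy  and fast", "sonnet", "v1")
key_b = classification_cache_key("easy and fast", "sonnet", "v2")
print(key_a != key_b)  # True: a new prompt version misses the old cache entry
```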

5. Confidence-Based Routing

Problem: LLM assigns codes even when uncertain.

Solution: Add confidence threshold in prompt:

If confidence < 70%:
  - Set confidence: "low"
  - Use generic code (V4.03) instead of guessing
  - Flag for human review

Impact: Reduces misclassifications, improves data quality.
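The same rule can also be enforced on the pipeline side, in case the model does not follow the prompt instruction. The sketch below assumes the model returns a numeric score under a hypothetical `confidence_score` key; the `needs_review` and `original_code` keys are likewise illustrative.

```python
GENERIC_CODE = "V4.03"        # generic code named in the prompt rule above
CONFIDENCE_THRESHOLD = 0.70   # "confidence < 70%"


def apply_confidence_policy(result: dict) -> dict:
    """Return a copy of `result` with the low-confidence rules applied."""
    # A missing score is treated as low confidence
    score = result.get("confidence_score", 0.0)
    if score >= CONFIDENCE_THRESHOLD:
        return {**result, "needs_review": False}
    return {
        **result,
        "urt_code": GENERIC_CODE,
        "original_code": result.get("urt_code"),  # keep the guess for reviewers
        "confidence": "low",
        "needs_review": True,
    }


print(apply_confidence_policy({"urt_code": "J1.01", "confidence_score": 0.55}))
```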

6. Post-Classification Validation

Problem: Some classifications don't match review content.

Solution: Add rule-based validation layer:

def validate_classification(text: str, urt_code: str) -> tuple[bool, str | None]:
    """Return (is_valid, suggested_code); suggested_code is None when valid."""
    # Price mentioned but not a V4.xx code?
    if has_price_mention(text) and not urt_code.startswith("V4"):
        return False, "V4.01"  # Suggest correction

    # Staff mentioned but not a P1.xx code?
    if has_staff_mention(text) and not urt_code.startswith("P1"):
        return False, "P1.01"

    return True, None

Impact: Catch ~5-10% of obvious misclassifications.
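The validator relies on `has_price_mention` and `has_staff_mention`, which are not defined in this document. A minimal regex-based sketch is shown below; the word lists are illustrative assumptions and would need tuning against real reviews.

```python
import re

# Illustrative patterns only; the word lists are assumptions, not pipeline config
PRICE_PATTERN = re.compile(
    r"\b(price[sd]?|pricing|cost(s|ly)?|expensive|cheap|overpriced|affordable)\b"
    r"|[$€£]\s?\d",
    re.IGNORECASE,
)
STAFF_PATTERN = re.compile(
    r"\b(staff|employees?|waiter|waitress|server|manager|cashier)\b",
    re.IGNORECASE,
)


def has_price_mention(text: str) -> bool:
    """True if the text contains a price word or a currency amount."""
    return PRICE_PATTERN.search(text) is not None


def has_staff_mention(text: str) -> bool:
    """True if the text names staff or a staff role."""
    return STAFF_PATTERN.search(text) is not None


print(has_price_mention("not too expensive"), has_staff_mention("not too expensive"))
```

Word boundaries keep "priceless" from counting as a price mention, which plain substring checks would get wrong.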

7. Span Coverage Validation

Problem: Some review text not covered by any span.

Solution: Track span coverage percentage:

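def calculate_coverage(text: str, spans: list) -> float:
    if not text:
        return 0.0  # avoid dividing by zero on empty reviews
    covered_chars = set()
    for span in spans:
        # Clamp to the text bounds so bad offsets cannot push coverage above 100%
        covered_chars.update(range(max(0, span['start']), min(len(text), span['end'])))
    return len(covered_chars) / len(text)

# Flag if coverage < 60%
coverage = calculate_coverage(text, spans)
if coverage < 0.6:
    log.warning(f"Low span coverage: {coverage:.0%}")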

Impact: Identify reviews where LLM skipped important content.
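A coverage percentage says how much was skipped but not what. A complementary sketch that extracts the uncovered text itself is shown below; the function name and the `min_length` cutoff are assumptions, and spans are assumed to use the same `start`/`end` character offsets (end exclusive) as above.

```python
def uncovered_segments(text: str, spans: list, min_length: int = 10) -> list:
    """Return stretches of text not covered by any span, ignoring short gaps."""
    covered = [False] * len(text)
    for span in spans:
        # Clamp offsets to the text bounds
        for i in range(max(0, span["start"]), min(len(text), span["end"])):
            covered[i] = True

    segments, current = [], []
    for char, is_covered in zip(text, covered):
        if is_covered:
            if current:
                segments.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        segments.append("".join(current))

    # Drop gaps that are only whitespace or punctuation-sized
    return [s.strip() for s in segments if len(s.strip()) >= min_length]


review = "Great food. The parking lot was a nightmare."
print(uncovered_segments(review, [{"start": 0, "end": 11}]))
```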

🟢 Lower Priority (Optimization & Monitoring)

8. Taxonomy Alignment Scoring

Problem: Hard to measure classification quality at scale.

Solution: Build automated taxonomy alignment checker:

# Check if keywords in text match expected domain
DOMAIN_KEYWORDS = {
    "V4": ["price", "money", "worth", "cost", "expensive", "cheap"],
    "P1": ["staff", "employee", "service", "friendly", "rude"],
    "J1": ["wait", "fast", "slow", "quick", "time", "minutes"],
    "E1": ["clean", "dirty", "comfortable", "space", "room"],
}

def alignment_score(text: str, urt_code: str) -> float:
    domain = urt_code[0:2]
    keywords = DOMAIN_KEYWORDS.get(domain, [])
    matches = sum(1 for kw in keywords if kw in text.lower())
    return matches / len(keywords) if keywords else 0.5

Impact: Quality dashboard, regression detection.
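For the regression-detection use, per-review scores need to be aggregated and compared with a baseline. One way this might look is sketched below; the function names, the baseline format, and the 0.05 tolerance are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_domain(scored) -> dict:
    """Average alignment scores per domain from (urt_code, score) pairs."""
    buckets = defaultdict(list)
    for urt_code, score in scored:
        buckets[urt_code[:2]].append(score)
    return {domain: mean(values) for domain, values in buckets.items()}


def regressed_domains(current: dict, baseline: dict, tolerance: float = 0.05) -> list:
    """Domains whose mean score fell more than `tolerance` below the baseline."""
    return sorted(
        domain
        for domain, score in current.items()
        if domain in baseline and score < baseline[domain] - tolerance
    )


current = mean_score_by_domain([("V4.01", 0.5), ("V4.02", 0.3), ("P1.01", 0.2)])
print(regressed_domains(current, {"V4": 0.42, "P1": 0.40}))  # ['P1']
```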

9. Batch Size Auto-Tuning

Problem: Fixed batch size doesn't adapt to review complexity.

Solution: Implement adaptive batch sizing:

class AdaptiveBatchSizer:
    def __init__(self):
        # Each entry: {'size': int, 'avg_len': float, 'success_rate': float}
        self.history = []

    def recommend_size(self, reviews: list) -> int:
        if not reviews:
            return 30  # nothing to measure; fall back to the largest default
        avg_length = sum(len(r['text']) for r in reviews) / len(reviews)

        # Learn from history: largest size that succeeded on similar review lengths
        proven = [
            h['size'] for h in self.history
            if abs(h['avg_len'] - avg_length) < 50 and h['success_rate'] > 0.95
        ]
        if proven:
            return max(proven)

        # Default heuristics
        if avg_length > 300:
            return 10
        elif avg_length > 150:
            return 20
        else:
            return 30
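The sizer can only learn if each finished batch is written back to its history. A minimal sketch of that step follows, using the same `size`, `avg_len`, and `success_rate` keys that `recommend_size` reads; the `record_outcome` function and the `max_entries` cap are hypothetical.

```python
def record_outcome(history: list, size: int, avg_len: float,
                   succeeded: int, max_entries: int = 500) -> None:
    """Append one batch outcome and keep only the most recent entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    history.append({
        "size": size,
        "avg_len": avg_len,
        "success_rate": succeeded / size,  # share of reviews parsed successfully
    })
    # Trim from the front so old behaviour does not dominate (max_entries > 0)
    del history[:-max_entries]


history = []
record_outcome(history, size=20, avg_len=180.0, succeeded=19)
print(history[0]["success_rate"])
```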

10. Cost Tracking Dashboard

Problem: No visibility into per-job, per-stage costs.

Solution: Add cost tracking to pipeline output:

from dataclasses import dataclass

@dataclass
class CostBreakdown:
    stage: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cost_usd: float
    reviews_processed: int
    cost_per_review: float

-- Store in database (SQL)
CREATE TABLE pipeline.cost_tracking (
    id SERIAL PRIMARY KEY,
    execution_id UUID,
    job_id UUID,
    stage VARCHAR(50),
    model VARCHAR(100),
    input_tokens INT,
    output_tokens INT,
    cached_tokens INT,
    cost_usd DECIMAL(10, 6),
    reviews_processed INT,
    created_at TIMESTAMP DEFAULT NOW()
);
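The `cost_usd` and `cost_per_review` fields have to be derived from token counts. A sketch of that calculation is given below. It assumes prices are quoted per million tokens and that `cached_tokens` is counted separately from `input_tokens`; the prices in the usage line are placeholders, not real rates.

```python
from decimal import Decimal, ROUND_HALF_UP

TOKENS_PER_MTOK = Decimal(1_000_000)
MICRO_USD = Decimal("0.000001")  # six decimal places, matching DECIMAL(10, 6)


def compute_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int, *,
                     input_price: Decimal, output_price: Decimal,
                     cached_price: Decimal) -> Decimal:
    """Cost of one stage, with prices given in USD per million tokens."""
    total = (
        Decimal(input_tokens) * input_price
        + Decimal(output_tokens) * output_price
        + Decimal(cached_tokens) * cached_price
    ) / TOKENS_PER_MTOK
    return total.quantize(MICRO_USD, rounding=ROUND_HALF_UP)


def cost_per_review(cost_usd: Decimal, reviews_processed: int) -> Decimal:
    """Average cost per review; zero when nothing was processed."""
    if reviews_processed <= 0:
        return Decimal("0")
    return (cost_usd / reviews_processed).quantize(MICRO_USD, rounding=ROUND_HALF_UP)


# Placeholder prices for illustration only
cost = compute_cost_usd(100_000, 20_000, 50_000,
                        input_price=Decimal("2.00"),
                        output_price=Decimal("10.00"),
                        cached_price=Decimal("0.20"))
print(cost, cost_per_review(cost, 100))
```

Using `Decimal` rather than `float` keeps the stored value consistent with the `DECIMAL(10, 6)` column.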

11. Streaming Classification

Problem: Large batches block until complete.

Solution: Implement streaming for real-time progress:

async def classify_streaming(reviews: list):
    async for partial_result in llm_client.stream_batch(reviews):
        # Persist first, so the result is saved even if the consumer stops iterating
        await persist_classification(partial_result)

        # Then yield each review as it completes
        yield partial_result

Impact: Better UX, faster partial results, resilience to failures.
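The resilience claim can be illustrated with a self-contained simulation. In the sketch below, `fake_stream_batch` is a stand-in for `llm_client.stream_batch` (whose real interface is not shown in this document), and the generator persists each result before yielding it so that a consumer that stops early does not lose completed work.

```python
import asyncio


async def fake_stream_batch(reviews):
    """Stand-in for llm_client.stream_batch: yields one result per review."""
    for review in reviews:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield {"review_id": review["id"], "urt_code": "V4.01"}


async def classify_streaming(reviews, persist):
    async for result in fake_stream_batch(reviews):
        await persist(result)  # save before handing the result to the caller
        yield result


async def main():
    stored = []

    async def persist(result):
        stored.append(result)

    seen = []
    async for result in classify_streaming([{"id": 1}, {"id": 2}, {"id": 3}], persist):
        seen.append(result["review_id"])
        if len(seen) == 2:
            break  # simulate a consumer that stops partway through
    return seen, stored


seen, stored = asyncio.run(main())
print(seen, len(stored))  # [1, 2] 2
```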

12. A/B Testing Framework

Problem: Hard to compare prompt/model changes.

Solution: Built-in A/B testing:

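from dataclasses import dataclass, field

@dataclass
class ABTestConfig:
    test_name: str
    variant_a: ClassificationConfig  # Control
    variant_b: ClassificationConfig  # Treatment
    split_ratio: float = 0.1  # 10% to treatment
    # Mutable defaults need a factory in a dataclass
    metrics: list[str] = field(default_factory=lambda: ["accuracy", "cost", "latency"])

# Run the control on all reviews and the treatment on the first 10% of them
results_a = await classify(reviews, config_a)
results_b = await classify(reviews[:int(len(reviews)*0.1)], config_b)

# Compare metrics (only the shared 10% is directly comparable)
compare_results(results_a, results_b)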
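Taking the first 10% of a list can bias the treatment group if reviews are ordered, for example by date. A common alternative is to assign each review by hashing its ID, which is stable across reruns. The sketch below shows that approach; it is a swapped-in technique rather than the document's method, and `assign_variant` and its parameters are hypothetical.

```python
import hashlib


def assign_variant(review_id: str, test_name: str, split_ratio: float = 0.1) -> str:
    """Deterministically assign a review to control 'a' or treatment 'b'."""
    digest = hashlib.sha256(f"{test_name}:{review_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "b" if bucket < split_ratio else "a"


assignments = [assign_variant(f"review-{n}", "prompt-v2") for n in range(10_000)]
print(assignments.count("b") / len(assignments))
```

Including the test name in the hash means a review that lands in the treatment group for one test is not automatically in the treatment group for the next.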

Implementation Priority Matrix

Improvement	Effort	Impact	Priority
1. Multi-Aspect Detection	Medium	High	🔴 P1
2. Enable Smart Router	Low	High	🔴 P1
3. JSON Truncation Fix	Medium	High	🔴 P1
4. Response Caching	Medium	Medium	🟡 P2
5. Confidence Routing	Medium	Medium	🟡 P2
6. Post-Classification Validation	Low	Medium	🟡 P2
7. Span Coverage Validation	Low	Low	🟢 P3
8. Taxonomy Alignment	Medium	Low	🟢 P3
9. Adaptive Batch Sizing	High	Medium	🟢 P3
10. Cost Dashboard	Medium	Low	🟢 P3
11. Streaming Classification	High	Medium	🟢 P3
12. A/B Testing	High	Low	🟢 P3

Quick Wins (Can implement today)

Enable router - Already implemented, just needs config flag
Reduce batch size - Set classification_batch_size=15 for long reviews
Add span coverage logging - Simple metric to track quality
Post-classification keyword check - Basic validation rules

Estimated Impact Summary

Area	Current	After Improvements
Cost per 1000 reviews	~$3.40	~$2.40 (-30%)
Classification accuracy	~85%	~92%
Multi-aspect capture	~65%	~90%
Batch failure rate	~33%	~10%
Pipeline retry cost	100%	~20% (with caching)
