whyrating-engine-legacy/urt-taxonomy/track-a-training/A2-QA-Protocol.md
Alejandro Gutiérrez 3eda9bdbfa Add complete URT v5.1 taxonomy framework (11 artifacts)
Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:51:41 +00:00


# A2: QA Protocol
## Universal Review Taxonomy (URT) v5.1
**Purpose**: Define quality assurance processes for URT annotation
**Version**: 5.1 | **Status**: Production Ready | **Date**: 2026-01-23
---
## Table of Contents
1. [Inter-Annotator Agreement (IAA) Metrics](#1-inter-annotator-agreement-iaa-metrics)
2. [Calibration Sessions](#2-calibration-sessions)
3. [Error Categories and Severity](#3-error-categories-and-severity)
4. [Audit Procedures](#4-audit-procedures)
5. [Gold Standard Management](#5-gold-standard-management)
6. [Quality Gates](#6-quality-gates)
7. [Feedback Loop](#7-feedback-loop)
8. [Metrics Dashboard](#8-metrics-dashboard)
---
## 1. Inter-Annotator Agreement (IAA) Metrics
### 1.1 Cohen's Kappa Thresholds by Code Tier
| Code Tier | Minimum Kappa | Target Kappa |
|-----------|---------------|--------------|
| **Domain (Tier 1)** | 0.80 | 0.90+ |
| **Category (Tier 2)** | 0.75 | 0.85+ |
| **Subcode (Tier 3)** | 0.70 | 0.80+ |
| **Valence** | 0.85 | 0.92+ |
| **Intensity** | 0.70 | 0.80+ |
| **Comparative Reference** | 0.75 | 0.85+ |
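
The thresholds above could be checked programmatically along these lines. This is a minimal sketch: the dictionary keys, the function name `kappa_status`, and the three status labels are illustrative assumptions, not part of the URT specification.

```python
# (minimum, target) Cohen's kappa per code tier, copied from the table in 1.1.
# The lowercase keys are hypothetical identifiers chosen for this sketch.
KAPPA_THRESHOLDS = {
    "domain": (0.80, 0.90),
    "category": (0.75, 0.85),
    "subcode": (0.70, 0.80),
    "valence": (0.85, 0.92),
    "intensity": (0.70, 0.80),
    "comparative_reference": (0.75, 0.85),
}


def kappa_status(tier: str, kappa: float) -> str:
    """Return 'target', 'acceptable', or 'below_minimum' for a measured kappa."""
    minimum, target = KAPPA_THRESHOLDS[tier]
    if kappa >= target:
        return "target"
    if kappa >= minimum:
        return "acceptable"
    return "below_minimum"
```

For example, `kappa_status("subcode", 0.72)` falls between the Tier 3 minimum and target, so it is reported as acceptable but not yet at target.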
### 1.2 Krippendorff's Alpha (3+ Annotators)
| Scenario | Minimum Alpha | Target Alpha |
|----------|---------------|--------------|
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |
### 1.3 Agreement Interpretation Scale
| Range | Interpretation | Action |
|-------|----------------|--------|
| **0.90 - 1.00** | Almost Perfect | Maintain standards |
| **0.80 - 0.89** | Excellent | Minor calibration |
| **0.70 - 0.79** | Good | Schedule calibration |
| **0.60 - 0.69** | Moderate | Mandatory retraining |
| **< 0.60** | Poor | Suspend, reassess |
### 1.4 Profile-Specific Requirements
| Profile | Domain | Category | Subcode | Overall Target |
|---------|--------|----------|---------|----------------|
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |
---
## 2. Calibration Sessions
### 2.1 Session Frequency
| Team Size | Daily | Weekly | Monthly |
|-----------|-------|--------|---------|
| 1-3 annotators | -- | 30min | 2hr |
| 4-10 annotators | 15min | 1hr | 3hr |
| 11+ annotators | 15min | 2x 1hr | 4hr |
### 2.2 Weekly Session Structure (60 min)
| Time | Activity |
|------|----------|
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |
### 2.3 Materials Checklist
- [ ] 5-10 pre-selected review spans
- [ ] IAA metrics report
- [ ] Top disagreement patterns list
- [ ] Gold standard examples
- [ ] A1 Quickstart Guide
- [ ] Session notes template
### 2.4 Outcome Documentation
```
## Calibration Session: [DATE]
### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]
### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|
### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example
```
---
## 3. Error Categories and Severity
### 3.1 Error Severity Matrix
| Severity | Weight | Description |
|----------|--------|-------------|
| **Critical** | 1.0 | Fundamentally wrong |
| **Major** | 0.5 | Significant deviation |
| **Minor** | 0.25 | Suboptimal but defensible |
| **Slip** | 0.1 | Typo/formatting |
### 3.2 Critical Errors (Weight: 1.0)
| Error Type | Example |
|------------|---------|
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode assigned under URT-Lite (a domain-only profile) |
### 3.3 Major Errors (Weight: 0.5)
| Error Type | Example |
|------------|---------|
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged into one span, or a single issue split into two |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |
### 3.4 Minor Errors (Weight: 0.25)
| Error Type | Example |
|------------|---------|
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |
### 3.5 Slips (Weight: 0.1)
Typos, formatting errors, and span boundaries off by fewer than 5 characters.
### 3.6 Error Severity Decision Tree
```
Is DOMAIN wrong? --> YES: CRITICAL
Is VALENCE wrong/missing? --> YES: CRITICAL
Is CATEGORY wrong? --> YES: MAJOR
Is INTENSITY off by 2? --> YES: MAJOR
Is SUBCODE wrong? --> YES: MINOR
Is it formatting/typo? --> YES: SLIP
```
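The tree above can be sketched as a function that compares an annotator's codes with a reference coding. The dictionary keys (`domain`, `category`, `subcode`, `valence`, `intensity`) and the function name are assumptions made for illustration. The off-by-one intensity check is taken from table 3.4 rather than from the tree itself, and slips are left to human review because they cannot be detected from the codes alone.

```python
from typing import Optional


def classify_error(gold: dict, coded: dict) -> Optional[str]:
    """Return the most severe applicable severity, or None if the codes agree.

    Both dicts use hypothetical keys; 'intensity' is an int such as 1, 2, or 3.
    """
    if coded["domain"] != gold["domain"]:
        return "CRITICAL"
    if coded.get("valence") != gold["valence"]:  # wrong or missing valence
        return "CRITICAL"
    if coded["category"] != gold["category"]:
        return "MAJOR"
    intensity_gap = abs(coded["intensity"] - gold["intensity"])
    if intensity_gap >= 2:
        return "MAJOR"
    if coded["subcode"] != gold["subcode"]:
        return "MINOR"
    if intensity_gap == 1:  # from table 3.4, not shown in the tree
        return "MINOR"
    return None
```

Checks run from most to least severe, so a span with both a wrong domain and a wrong subcode is reported once, as CRITICAL.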
### 3.7 Accuracy Calculation
```
Error Score = Sum(error_weight * count) / total_spans * 100%
Accuracy    = 100% - Error Score

Thresholds:
  > 95%   = Excellent
  90-95%  = Good
  85-90%  = Acceptable
  < 85%   = Below Standard
```
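As a worked sketch of the calculation, the snippet below applies the severity weights from 3.1. The function names are illustrative, and the handling of shared boundaries (exactly 95% counts as Good, exactly 90% as Good, exactly 85% as Acceptable) is an assumption, since the threshold bands do not say which side a boundary value belongs to.

```python
# Severity weights from the matrix in 3.1.
WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}


def accuracy_pct(error_counts: dict, total_spans: int) -> float:
    """Weighted accuracy in percent for a batch of audited spans."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    weighted_errors = sum(WEIGHTS[sev] * n for sev, n in error_counts.items())
    error_score = weighted_errors / total_spans
    return 100.0 * (1.0 - error_score)


def rating(accuracy: float) -> str:
    """Map an accuracy percentage to its band (boundary handling is assumed)."""
    if accuracy > 95:
        return "Excellent"
    if accuracy >= 90:
        return "Good"
    if accuracy >= 85:
        return "Acceptable"
    return "Below Standard"
```

For 200 audited spans with 2 critical, 6 major, 8 minor errors and 10 slips, the weighted error total is 2 + 3 + 2 + 1 = 8, giving an error score of 4% and an accuracy of about 96%.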
---
## 4. Audit Procedures
### 4.1 Sampling Methodology
**Random**: Equal probability selection for general monitoring
**Stratified**: Ensure representation across domains, annotators, edge cases
| Stratum | Minimum Sample |
|---------|----------------|
| Each Domain (O, P, J, E, A, V, R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |
### 4.2 Sample Size by Volume
| Daily Volume | Audit Rate |
|--------------|------------|
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |
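
The volume bands can be turned into a daily sample size as sketched below. The function names are hypothetical. Because adjacent bands share endpoints (500 and 2000 appear in two rows), this sketch assumes a shared endpoint belongs to the higher-volume band, and it rounds the sample size up so that small batches are never under-sampled.

```python
def audit_rate_pct(daily_spans: int) -> int:
    """Audit rate in whole percent for a given daily volume (see 4.2)."""
    if daily_spans < 100:
        return 30
    if daily_spans < 500:
        return 20
    if daily_spans < 2000:
        return 15
    if daily_spans <= 10000:
        return 10
    return 7


def audit_sample_size(daily_spans: int) -> int:
    """Number of spans to audit, rounded up, using integer arithmetic."""
    pct = audit_rate_pct(daily_spans)
    return (daily_spans * pct + 99) // 100  # integer ceiling of spans * pct / 100
```

Percentages are kept as integers so the rounding step is exact and does not depend on floating-point behavior.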
### 4.3 Audit Frequency
| Type | Frequency | Owner |
|------|-----------|-------|
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |
### 4.4 Audit Workflow
```
[Daily Output] --> [Sample] --> [Blind Re-code]
                                      |
                                      v
                                  [Compare] --> Match: [Log]
                                      |
                                      +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]
```
### 4.5 Escalation Paths
| Error Score | Level | Action |
|-------------|-------|--------|
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |
---
## 5. Gold Standard Management
### 5.1 Corpus Requirements
| Metric | Minimum | Target |
|--------|---------|--------|
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |
### 5.2 Creation Process
```
[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                    Alpha >= 0.85: [Add to Gold]
                                    Alpha <  0.85: [Discuss/Reject]
```
### 5.3 Gold Standard Documentation
```json
{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}
```
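A record in this shape could be sanity-checked before it enters the corpus. The sketch below is not a schema defined by the URT specification: which fields are mandatory is an assumption (here `common_mistakes` is treated as optional), and `agreement_score` is assumed to be the Krippendorff's Alpha from 5.2, so it is compared against the 0.85 admission threshold.

```python
import json

# Assumed mandatory fields; the specification may define these differently.
REQUIRED_FIELDS = {"gold_id", "span_text", "classification", "rationale",
                   "agreement_score", "version", "status"}
REQUIRED_CLASSIFICATION = {"primary_code", "valence", "intensity"}
MIN_ALPHA = 0.85  # gold admission threshold from 5.2


def validate_gold_record(raw: str) -> list:
    """Return a list of problems; an empty list means the record passed."""
    record = json.loads(raw)
    problems = [f"missing field: {name}"
                for name in sorted(REQUIRED_FIELDS - set(record))]
    classification = record.get("classification", {})
    problems += [f"missing classification field: {name}"
                 for name in sorted(REQUIRED_CLASSIFICATION - set(classification))]
    score = record.get("agreement_score")
    if isinstance(score, (int, float)) and score < MIN_ALPHA:
        problems.append(f"agreement_score {score} is below {MIN_ALPHA}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reviewer fix every issue in a candidate record in a single pass.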
### 5.4 Version Control
| Change Type | Version Bump |
|-------------|--------------|
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |
### 5.5 Retirement Criteria
- A spec change invalidates the example
- Systematic annotator confusion is traced to the example
- Industry shifts make the example obsolete
---
## 6. Quality Gates
### 6.1 New Annotator Qualification
```
Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
            |
            Pass (>= 85%): Production + 30-day probation
            Fail: Remediation + retake
```
**Passing Criteria**:
- Overall Accuracy >= 85%
- Domain Accuracy >= 90%
- Critical Errors = 0
- Major Errors <= 3
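
The four passing criteria must all hold at once, which a single predicate can express. The function and parameter names below are illustrative; accuracies are assumed to be given in percent.

```python
def passes_qualification(overall_accuracy: float, domain_accuracy: float,
                         critical_errors: int, major_errors: int) -> bool:
    """True only if every passing criterion from 6.1 is met."""
    return (overall_accuracy >= 85.0
            and domain_accuracy >= 90.0
            and critical_errors == 0
            and major_errors <= 3)
```

Note that a candidate with 93% overall accuracy still fails if a single critical error was recorded, because the criteria are conjunctive.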
### 6.2 Production Annotator Requirements
| Requirement | Frequency | Threshold |
|-------------|-----------|-----------|
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |
### 6.3 Annotator Tiers
| Tier | Accuracy | Audit Rate |
|------|----------|------------|
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |
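
Tier assignment and the matching audit rate can be looked up from the accuracy figure, as in this sketch. The names are hypothetical, and because adjacent tiers share a boundary value (for example 92% appears in both Senior and Standard), the sketch assumes a boundary value earns the higher tier, in line with the ">=" used for Expert.

```python
# (lower accuracy bound in %, tier name, audit rate in %) from the table in 6.3.
TIERS = [
    (95.0, "Expert", 5),
    (92.0, "Senior", 10),
    (88.0, "Standard", 15),
    (85.0, "Developing", 25),
]


def annotator_tier(accuracy: float):
    """Return (tier, audit_rate_pct) for an accuracy given in percent."""
    for lower_bound, tier, audit_rate in TIERS:
        if accuracy >= lower_bound:
            return tier, audit_rate
    return "Probation", 50
```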
### 6.4 Automated System Thresholds
| Metric | Minimum | Production | Best |
|--------|---------|------------|------|
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |
### 6.5 Release Criteria
- [ ] All accuracy metrics meet thresholds
- [ ] Gold standard test documented
- [ ] Error analysis completed
- [ ] Rollback plan in place
- [ ] Stakeholder sign-off
---
## 7. Feedback Loop
### 7.1 Error Reporting
```
ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]
```
### 7.2 Triage Process
```
[Error Submitted] --> [QA Lead (24hr)]
                            |
                            +--> Spec Issue --> PM
                            +--> Gold Issue --> QA Team
                            +--> Tool Bug --> Engineering
                            +--> Training Gap --> QA Lead
```
### 7.3 Spec Clarification Process
```
[Ambiguity] --> Check A1 Guide --> Found: Apply
                      |
                Not Found: Submit Request
                      |
                PM + QA Review
                      |
                Accept: Update A1/Spec
                Reject: Document Rationale
```
### 7.4 Training Update Triggers
| Trigger | Action | Timeline |
|---------|--------|----------|
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator < 85% | Individual coaching | Immediate |
### 7.5 Response SLAs
| Urgency | Response | Resolution |
|---------|----------|------------|
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |
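
Response and resolution deadlines can be derived from the report timestamp. The sketch below assumes the SLA clock runs in wall-clock time from the moment of reporting (the table does not say whether business hours apply), and it returns no fixed resolution deadline for Low urgency because "Next sprint" depends on the team's sprint calendar. The names are illustrative.

```python
from datetime import datetime, timedelta

# (response window, resolution window) per urgency, from the table in 7.5.
SLAS = {
    "Critical": (timedelta(hours=2), timedelta(hours=24)),
    "High": (timedelta(hours=24), timedelta(weeks=1)),
    "Medium": (timedelta(hours=48), timedelta(weeks=2)),
    "Low": (timedelta(weeks=1), None),  # "Next sprint" has no fixed duration
}


def sla_deadlines(urgency: str, reported_at: datetime):
    """Return (respond_by, resolve_by); resolve_by is None for Low urgency."""
    response, resolution = SLAS[urgency]
    respond_by = reported_at + response
    resolve_by = reported_at + resolution if resolution is not None else None
    return respond_by, resolve_by
```

A Critical report filed at 09:00 would therefore need a response by 11:00 the same day and a resolution by 09:00 the next day.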
---
## 8. Metrics Dashboard
### 8.1 Key QA KPIs
| KPI | Target | Alert |
|-----|--------|-------|
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |
### 8.2 Reporting Frequency
| Report | Frequency | Audience |
|--------|-----------|----------|
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |
### 8.3 Alert Configuration
```
               Green      Yellow       Red
Accuracy       >= 92%     88-92%       < 88%
IAA            >= 0.80    0.75-0.80    < 0.75
Critical Err   < 2/1K     2-5/1K       >= 5/1K
Coverage       >= 10%     7-10%        < 7%

Yellow: Notify QA Lead
Red:    Escalate + Immediate Action
```
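A generic traffic-light check covers both kinds of metric: those where higher is better (accuracy, IAA) and those where lower is better (critical error rate). The function name and signature are assumptions for this sketch. A value exactly at the red cut-off of a lower-is-better metric is treated as red, following the ">= 5/1K" alert in 8.1.

```python
def traffic_light(value: float, green: float, red: float,
                  higher_is_better: bool = True) -> str:
    """Classify a metric as 'green', 'yellow', or 'red'.

    green and red are the cut-offs of the Green and Red bands; anything
    between them is Yellow.
    """
    if higher_is_better:
        if value >= green:
            return "green"
        return "red" if value < red else "yellow"
    if value < green:
        return "green"
    return "red" if value >= red else "yellow"
```

For instance, an IAA of 0.78 checked with `traffic_light(0.78, 0.80, 0.75)` lands in the yellow band, which would notify the QA Lead.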
### 8.4 Dashboard Panels
1. **Accuracy Trend**: Line chart, 30-day rolling
2. **IAA Heatmap**: Annotator pairwise Kappa
3. **Error Distribution**: Stacked bar by severity
4. **Domain Performance**: Radar chart (O-P-J-E-A-V-R)
5. **Annotator Leaderboard**: Table with tiers
6. **Alert Status**: Traffic light indicators
### 8.5 Metric Formulas
```
Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
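The Kappa and F1 formulas can be computed directly from two annotators' label lists, as in the minimal sketch below. Function names are illustrative. Krippendorff's Alpha is left out because it needs a distance metric and handling of missing ratings. Returning 1.0 when both annotators use one identical label throughout (where the formula would divide by zero) is a convention chosen for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labelled the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Po: share of spans where both annotators chose the same label.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pe == 1.0:
        return 1.0  # degenerate case: a single label used by both annotators
    return (po - pe) / (1 - pe)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a worked case, annotator A codes four spans as P, P, O, O and annotator B codes them as P, O, O, O. They agree on 3 of 4 spans (Po = 0.75), chance agreement is (2*1 + 2*3) / 16 = 0.5, and Kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.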
---
## Document References
| Document | Location |
|----------|----------|
| URT-Specification-v5.1.md | `/urt-taxonomy/spec/` |
| A1-Annotator-Quickstart.md | `/urt-taxonomy/track-a-training/` |
| Gold Standard Corpus | `/urt-taxonomy/gold-standard/` |
---
*URT v5.1 QA Protocol | Track A: Training Materials*