# A2: QA Protocol
## Universal Review Taxonomy (URT) v5.1

**Purpose**: Define quality assurance processes for URT annotation
**Version**: 5.1 | **Status**: Production Ready | **Date**: 2026-01-23

---

## Table of Contents

1. [Inter-Annotator Agreement (IAA) Metrics](#1-inter-annotator-agreement-iaa-metrics)
2. [Calibration Sessions](#2-calibration-sessions)
3. [Error Categories and Severity](#3-error-categories-and-severity)
4. [Audit Procedures](#4-audit-procedures)
5. [Gold Standard Management](#5-gold-standard-management)
6. [Quality Gates](#6-quality-gates)
7. [Feedback Loop](#7-feedback-loop)
8. [Metrics Dashboard](#8-metrics-dashboard)

---

## 1. Inter-Annotator Agreement (IAA) Metrics

### 1.1 Cohen's Kappa Thresholds by Code Tier

| Code Tier | Minimum Kappa | Target Kappa |
|-----------|---------------|--------------|
| **Domain (Tier 1)** | 0.80 | 0.90+ |
| **Category (Tier 2)** | 0.75 | 0.85+ |
| **Subcode (Tier 3)** | 0.70 | 0.80+ |
| **Valence** | 0.85 | 0.92+ |
| **Intensity** | 0.70 | 0.80+ |
| **Comparative Reference** | 0.75 | 0.85+ |

### 1.2 Krippendorff's Alpha (3+ Annotators)

| Scenario | Minimum Alpha | Target Alpha |
|----------|---------------|--------------|
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |

### 1.3 Agreement Interpretation Scale

| Range | Interpretation | Action |
|-------|----------------|--------|
| **0.90 - 1.00** | Almost Perfect | Maintain standards |
| **0.80 - 0.89** | Excellent | Minor calibration |
| **0.70 - 0.79** | Good | Schedule calibration |
| **0.60 - 0.69** | Moderate | Mandatory retraining |
| **< 0.60** | Poor | Suspend, reassess |
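
The scale above can be applied mechanically when generating reports. The following is a minimal sketch, not part of the protocol: the names `AGREEMENT_BANDS` and `interpret_agreement` are hypothetical, and lower bounds are assumed to be inclusive (so a score between 0.89 and 0.90 is treated as "Excellent").

```python
from typing import Tuple

# (inclusive lower bound, interpretation, action), highest band first.
# Inclusive lower bounds are an assumption; the table does not state them.
AGREEMENT_BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]

def interpret_agreement(score: float) -> Tuple[str, str]:
    """Return (interpretation, action) for a Kappa or Alpha score."""
    for lower_bound, interpretation, action in AGREEMENT_BANDS:
        if score >= lower_bound:
            return interpretation, action
    return "Poor", "Suspend, reassess"

print(interpret_agreement(0.83))  # ('Excellent', 'Minor calibration')
```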

### 1.4 Profile-Specific Requirements

| Profile | Domain | Category | Subcode | Overall Target |
|---------|--------|----------|---------|----------------|
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |
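
One way to check a team's scores against these profile requirements is sketched below. The dictionary layout and the function name `failing_tiers` are illustrative assumptions; the thresholds are copied from the table, and a missing score is assumed to count as a failure.

```python
# Required agreement per tier, copied from the table above.
# Tiers marked N/A for a profile are simply absent from its entry.
PROFILE_REQUIREMENTS = {
    "URT-Lite": {"domain": 0.85, "overall": 0.85},
    "URT-Core": {"domain": 0.85, "category": 0.80, "overall": 0.80},
    "URT-Standard/Full": {
        "domain": 0.85, "category": 0.80, "subcode": 0.75, "overall": 0.78,
    },
}

def failing_tiers(profile, scores):
    """Return the tiers whose score falls below the profile requirement."""
    required = PROFILE_REQUIREMENTS[profile]
    # Assumption: a tier with no reported score is treated as failing.
    return [tier for tier, minimum in required.items()
            if scores.get(tier, 0.0) < minimum]

print(failing_tiers("URT-Core", {"domain": 0.90, "category": 0.78, "overall": 0.82}))
# ['category']
```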

---

## 2. Calibration Sessions

### 2.1 Session Frequency

| Team Size | Daily | Weekly | Monthly |
|-----------|-------|--------|---------|
| 1-3 annotators | -- | 30min | 2hr |
| 4-10 annotators | 15min | 1hr | 3hr |
| 11+ annotators | 15min | 2x 1hr | 4hr |

### 2.2 Weekly Session Structure (60 min)

| Time | Activity |
|------|----------|
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |

### 2.3 Materials Checklist

- [ ] 5-10 pre-selected review spans
- [ ] IAA metrics report
- [ ] Top disagreement patterns list
- [ ] Gold standard examples
- [ ] A1 Quickstart Guide
- [ ] Session notes template

### 2.4 Outcome Documentation

```
## Calibration Session: [DATE]

### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]

### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|

### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example
```

---

## 3. Error Categories and Severity

### 3.1 Error Severity Matrix

| Severity | Weight | Description |
|----------|--------|-------------|
| **Critical** | 1.0 | Fundamentally wrong |
| **Major** | 0.5 | Significant deviation |
| **Minor** | 0.25 | Suboptimal but defensible |
| **Slip** | 0.1 | Typo/formatting |

### 3.2 Critical Errors (Weight: 1.0)

| Error Type | Example |
|------------|---------|
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode in URT-Lite |

### 3.3 Major Errors (Weight: 0.5)

| Error Type | Example |
|------------|---------|
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged into one span, or a single issue split into two |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |

### 3.4 Minor Errors (Weight: 0.25)

| Error Type | Example |
|------------|---------|
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |

### 3.5 Slips (Weight: 0.1)

Typos, formatting errors, and span boundaries off by fewer than 5 characters.

### 3.6 Error Severity Decision Tree

```
Evaluate top to bottom; stop at the first YES.

Is DOMAIN wrong?            --> YES: CRITICAL
Is VALENCE wrong/missing?   --> YES: CRITICAL
Is CATEGORY wrong?          --> YES: MAJOR
Is INTENSITY off by 2?      --> YES: MAJOR
Is SUBCODE wrong?           --> YES: MINOR
Is INTENSITY off by 1?      --> YES: MINOR
Is it formatting/typo?      --> YES: SLIP
```
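
The decision tree can be expressed as an ordered series of checks. This is a sketch under stated assumptions: the `Severity` enum and `classify_error` signature are hypothetical, the "intensity off by 1" rule is taken from Section 3.4, and error types listed in 3.2-3.4 but absent from the tree (for example Profile Violation or CR direction) are not modeled.

```python
from enum import Enum

class Severity(Enum):
    """Severity levels; each value is the weight from Section 3.1."""
    CRITICAL = 1.0
    MAJOR = 0.5
    MINOR = 0.25
    SLIP = 0.1

def classify_error(domain_wrong=False, valence_wrong=False,
                   category_wrong=False, intensity_gap=0,
                   subcode_wrong=False, formatting_issue=False):
    """Return the most severe applicable level, or None if nothing is wrong.

    intensity_gap is the absolute difference between the assigned and
    the reference intensity level (e.g. I1 vs I3 gives 2).
    """
    if domain_wrong or valence_wrong:
        return Severity.CRITICAL
    if category_wrong or intensity_gap >= 2:
        return Severity.MAJOR
    if subcode_wrong or intensity_gap == 1:
        return Severity.MINOR
    if formatting_issue:
        return Severity.SLIP
    return None

print(classify_error(category_wrong=True, subcode_wrong=True))  # Severity.MAJOR
```

Checking the conditions in descending severity means a span with several problems is scored once, at its worst level.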

### 3.7 Accuracy Calculation

```
Error Score = Sum(error_weight * count) / total_spans
Accuracy = (1 - Error Score) * 100%

Thresholds:
  > 95%   = Excellent
  90-95%  = Good
  85-90%  = Acceptable
  < 85%   = Below Standard
```
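
As a worked example of the calculation, the sketch below computes weighted accuracy from per-severity error counts. The function name and the dictionary keys are illustrative assumptions; the weights come from Section 3.1.

```python
# Severity weights from Section 3.1.
ERROR_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}

def weighted_accuracy(error_counts, total_spans):
    """Return accuracy as a percentage (0-100) for an audited batch."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    error_score = sum(
        ERROR_WEIGHTS[severity] * count
        for severity, count in error_counts.items()
    ) / total_spans
    return (1.0 - error_score) * 100.0

# 200 audited spans: 2 critical, 6 major, 8 minor, 10 slips.
# Weighted errors = 2*1.0 + 6*0.5 + 8*0.25 + 10*0.1 = 8, so 8/200 = 4% error.
accuracy = weighted_accuracy(
    {"critical": 2, "major": 6, "minor": 8, "slip": 10}, total_spans=200
)
print(round(accuracy, 2))  # 96.0
```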

---

## 4. Audit Procedures

### 4.1 Sampling Methodology

- **Random**: Equal-probability selection for general monitoring
- **Stratified**: Guaranteed representation across domains, annotators, and edge cases

| Stratum | Minimum Sample |
|---------|----------------|
| Each Domain (O, P, J, E, A, V, R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |
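
Stratified sampling with a per-stratum minimum can be sketched as follows. The span representation (a dict with an `annotator` key), the function name, and the rounding rule (always round up, so small strata still contribute at least one span) are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(spans, stratum_of, percent, seed=None):
    """Sample at least `percent`% of the spans in every stratum.

    stratum_of maps a span to its stratum key (annotator, domain, ...).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for span in spans:
        strata[stratum_of(span)].append(span)

    sample = []
    for members in strata.values():
        # Integer ceiling of percent% of the stratum size.
        quota = (len(members) * percent + 99) // 100
        sample.extend(rng.sample(members, quota))
    return sample

# Hypothetical output: 20 spans from one annotator, 5 from another.
spans = [{"id": i, "annotator": "ann_a" if i < 20 else "ann_b"} for i in range(25)]
audit = stratified_sample(spans, lambda s: s["annotator"], percent=10, seed=7)
print(len(audit))  # 3 (2 from ann_a, 1 from ann_b)
```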

### 4.2 Sample Size by Volume

| Daily Volume | Audit Rate |
|--------------|------------|
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |
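
A small helper can turn the volume table into a daily audit quota. This is only a sketch: the function names are hypothetical, and because the table's ranges share their endpoints, the code assumes that 100, 500, and 2000 belong to the higher-volume band while 10000 stays in the 10% band.

```python
def audit_rate_percent(daily_volume: int) -> int:
    """Return the audit rate (in percent) for a given daily span volume."""
    # Boundary handling is an assumption; see the note above the code.
    if daily_volume < 100:
        return 30
    if daily_volume < 500:
        return 20
    if daily_volume < 2000:
        return 15
    if daily_volume <= 10000:
        return 10
    return 7

def audit_sample_size(daily_volume: int) -> int:
    """Return the number of spans to audit, rounded up to a whole span."""
    rate = audit_rate_percent(daily_volume)
    return (daily_volume * rate + 99) // 100  # integer ceiling

print(audit_sample_size(1200))  # 180
```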

### 4.3 Audit Frequency

| Type | Frequency | Owner |
|------|-----------|-------|
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |

### 4.4 Audit Workflow

```
[Daily Output] --> [Sample] --> [Blind Re-code] --> [Compare]
                                                        |
                                                        +--> Match: [Log]
                                                        |
                                                        +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]
```

### 4.5 Escalation Paths

| Error Score | Level | Action |
|-------------|-------|--------|
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |

---

## 5. Gold Standard Management

### 5.1 Corpus Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |

### 5.2 Creation Process

```
[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                   Alpha >= 0.85: [Add to Gold]
                                   Alpha < 0.85: [Discuss/Reject]
```
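
The admission gate depends on computing Krippendorff's Alpha over the annotators' labels. Below is a minimal sketch for nominal labels using the standard coincidence-matrix formulation; it assumes Alpha is computed over a batch of candidate spans (it is undefined when only one label value appears), and the function names and the input layout (one list of labels per span) are hypothetical.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Alpha for nominal labels; `units` holds one list of labels per span."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a span with a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Both terms omit the common 1/n factor, which cancels in Do / De.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected

def gold_decision(units, minimum_alpha=0.85):
    """Apply the 0.85 gate from the creation process."""
    alpha = krippendorff_alpha_nominal(units)
    return "add to gold" if alpha >= minimum_alpha else "discuss/reject"

batch = [["P", "P", "P"], ["O", "O", "O"], ["P", "P", "O"], ["J", "J", "J"]]
print(round(krippendorff_alpha_nominal(batch), 3), gold_decision(batch))
# 0.766 discuss/reject
```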

### 5.3 Gold Standard Documentation

```json
{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}
```

### 5.4 Version Control

| Change Type | Version Bump |
|-------------|--------------|
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |

### 5.5 Retirement Criteria

- A spec change invalidates the example
- Systematic annotator confusion is traced to the example
- Industry shifts make the example obsolete

---

## 6. Quality Gates

### 6.1 New Annotator Qualification

```
Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
        |
        Pass (>= 85%): Production + 30-day probation
        Fail: Remediation + retake
```

**Passing Criteria**:
- Overall Accuracy >= 85%
- Domain Accuracy >= 90%
- Critical Errors = 0
- Major Errors <= 3
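
The four passing criteria must all hold at once. A hedged sketch of that check follows; the function name and parameter names are illustrative, and accuracies are assumed to be percentages on a 0-100 scale.

```python
def failed_criteria(overall_accuracy, domain_accuracy,
                    critical_errors, major_errors):
    """Return the passing criteria an exam result does not meet."""
    checks = [
        ("Overall Accuracy >= 85%", overall_accuracy >= 85.0),
        ("Domain Accuracy >= 90%", domain_accuracy >= 90.0),
        ("Critical Errors = 0", critical_errors == 0),
        ("Major Errors <= 3", major_errors <= 3),
    ]
    return [name for name, passed in checks if not passed]

# A candidate passes only when the returned list is empty.
print(failed_criteria(88.0, 89.0, critical_errors=1, major_errors=2))
# ['Domain Accuracy >= 90%', 'Critical Errors = 0']
```

Returning the failed criteria, rather than a bare pass/fail flag, gives the remediation step something concrete to target.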

### 6.2 Production Annotator Requirements

| Requirement | Frequency | Threshold |
|-------------|-----------|-----------|
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | >= 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |

### 6.3 Annotator Tiers

| Tier | Accuracy | Audit Rate |
|------|----------|------------|
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |
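
Tier assignment and the resulting audit rate can be looked up together. In this sketch the names are hypothetical, and because adjacent tiers share an endpoint in the table, each tier's lower bound is assumed to be inclusive (92% is Senior, 88% is Standard, 85% is Developing).

```python
from typing import Tuple

# (inclusive lower bound in percent, tier name, audit rate in percent)
ANNOTATOR_TIERS = [
    (95.0, "Expert", 5),
    (92.0, "Senior", 10),
    (88.0, "Standard", 15),
    (85.0, "Developing", 25),
]

def annotator_tier(accuracy_pct: float) -> Tuple[str, int]:
    """Return (tier, audit rate %) for an annotator's accuracy."""
    for lower_bound, tier, audit_rate in ANNOTATOR_TIERS:
        if accuracy_pct >= lower_bound:
            return tier, audit_rate
    return "Probation", 50

print(annotator_tier(93.4))  # ('Senior', 10)
```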

### 6.4 Automated System Thresholds

| Metric | Minimum | Production | Best |
|--------|---------|------------|------|
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |

### 6.5 Release Criteria

- [ ] All accuracy metrics meet thresholds
- [ ] Gold standard test results documented
- [ ] Error analysis completed
- [ ] Rollback plan in place
- [ ] Stakeholder sign-off

---

## 7. Feedback Loop

### 7.1 Error Reporting

```
ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]
```

### 7.2 Triage Process

```
[Error Submitted] --> [QA Lead (24hr)]
        |
        +--> Spec Issue --> PM
        +--> Gold Issue --> QA Team
        +--> Tool Bug --> Engineering
        +--> Training Gap --> QA Lead
```

### 7.3 Spec Clarification Process

```
[Ambiguity] --> Check A1 Guide --> Found: Apply
                                   |
                                   Not Found: Submit Request
                                        |
                                   PM + QA Review
                                        |
                                   Accept: Update A1/Spec
                                   Reject: Document Rationale
```

### 7.4 Training Update Triggers

| Trigger | Action | Timeline |
|---------|--------|----------|
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator accuracy < 85% | Individual coaching | Immediate |

### 7.5 Response SLAs

| Urgency | Response | Resolution |
|---------|----------|------------|
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |

---

## 8. Metrics Dashboard

### 8.1 Key QA KPIs

| KPI | Target | Alert |
|-----|--------|-------|
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |

### 8.2 Reporting Frequency

| Report | Frequency | Audience |
|--------|-----------|----------|
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |

### 8.3 Alert Configuration

```
              Green      Yellow     Red
Accuracy      >= 92%     88-92%     < 88%
IAA           >= 0.80    0.75-0.80  < 0.75
Critical Err  < 2/1K     2-5/1K     > 5/1K
Coverage      >= 10%     7-10%      < 7%

Yellow: Notify QA Lead
Red: Escalate + Immediate Action
```
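
The traffic-light logic can be sketched with two small helpers, one for metrics where higher is better and one where lower is better. All names here are illustrative assumptions; the thresholds for Accuracy, IAA, and Critical Errors are taken from the configuration above, and Coverage would follow the same higher-is-better pattern with its own thresholds.

```python
def status_higher_is_better(value, green_min, red_below):
    """Green at or above green_min, red below red_below, else yellow."""
    if value >= green_min:
        return "green"
    if value < red_below:
        return "red"
    return "yellow"

def status_lower_is_better(value, green_below, red_above):
    """Green below green_below, red above red_above, else yellow."""
    if value < green_below:
        return "green"
    if value > red_above:
        return "red"
    return "yellow"

def dashboard_status(accuracy_pct, iaa_kappa, critical_per_1k):
    """Return the alert colour for each monitored metric."""
    return {
        "accuracy": status_higher_is_better(accuracy_pct, 92.0, 88.0),
        "iaa": status_higher_is_better(iaa_kappa, 0.80, 0.75),
        "critical_errors": status_lower_is_better(critical_per_1k, 2.0, 5.0),
    }

print(dashboard_status(93.0, 0.78, 6.0))
# {'accuracy': 'green', 'iaa': 'yellow', 'critical_errors': 'red'}
```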

### 8.4 Dashboard Panels

1. **Accuracy Trend**: Line chart, 30-day rolling
2. **IAA Heatmap**: Annotator pairwise Kappa
3. **Error Distribution**: Stacked bar by severity
4. **Domain Performance**: Radar chart (O-P-J-E-A-V-R)
5. **Annotator Leaderboard**: Table with tiers
6. **Alert Status**: Traffic light indicators

### 8.5 Metric Formulas

```
Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
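
For reference, the Kappa and F1 formulas can be computed directly from two annotators' label lists and from precision/recall values. This is a plain-Python sketch with hypothetical function names; in practice a vetted statistics library may be preferable.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Po: share of spans on which the two annotators agree.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pe == 1.0:
        raise ValueError("kappa is undefined when both use one identical label")
    return (po - pe) / (1 - pe)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

a = ["P", "P", "P", "P", "O", "O", "O", "J", "J", "J"]
b = ["P", "P", "P", "O", "O", "O", "O", "J", "J", "P"]
# Po = 8/10 = 0.80; Pe = (4*4 + 3*4 + 3*2) / 100 = 0.34
print(round(cohens_kappa(a, b), 3))  # 0.697
```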

---

## Document References

| Document | Location |
|----------|----------|
| URT-Specification-v5.1.md | `/urt-taxonomy/spec/` |
| A1-Annotator-Quickstart.md | `/urt-taxonomy/track-a-training/` |
| Gold Standard Corpus | `/urt-taxonomy/gold-standard/` |

---

*URT v5.1 QA Protocol | Track A: Training Materials*