# A2: QA Protocol

## Universal Review Taxonomy (URT) v5.1

**Purpose**: Define quality assurance processes for URT annotation

**Version**: 5.1 | **Status**: Production Ready | **Date**: 2026-01-23

---

## Table of Contents

1. [Inter-Annotator Agreement (IAA) Metrics](#1-inter-annotator-agreement-iaa-metrics)
2. [Calibration Sessions](#2-calibration-sessions)
3. [Error Categories and Severity](#3-error-categories-and-severity)
4. [Audit Procedures](#4-audit-procedures)
5. [Gold Standard Management](#5-gold-standard-management)
6. [Quality Gates](#6-quality-gates)
7. [Feedback Loop](#7-feedback-loop)
8. [Metrics Dashboard](#8-metrics-dashboard)

---

## 1. Inter-Annotator Agreement (IAA) Metrics

### 1.1 Cohen's Kappa Thresholds by Code Tier

| Code Tier | Minimum Kappa | Target Kappa |
|-----------|---------------|--------------|
| **Domain (Tier 1)** | 0.80 | 0.90+ |
| **Category (Tier 2)** | 0.75 | 0.85+ |
| **Subcode (Tier 3)** | 0.70 | 0.80+ |
| **Valence** | 0.85 | 0.92+ |
| **Intensity** | 0.70 | 0.80+ |
| **Comparative Reference** | 0.75 | 0.85+ |

### 1.2 Krippendorff's Alpha (3+ Annotators)

| Scenario | Minimum Alpha | Target Alpha |
|----------|---------------|--------------|
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |

### 1.3 Agreement Interpretation Scale

| Range | Interpretation | Action |
|-------|----------------|--------|
| **0.90 - 1.00** | Almost Perfect | Maintain standards |
| **0.80 - 0.89** | Excellent | Minor calibration |
| **0.70 - 0.79** | Good | Schedule calibration |
| **0.60 - 0.69** | Moderate | Mandatory retraining |
| **< 0.60** | Poor | Suspend, reassess |

### 1.4 Profile-Specific Requirements

| Profile | Domain | Category | Subcode | Overall Target |
|---------|--------|----------|---------|----------------|
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |

---

## 2. Calibration Sessions

### 2.1 Session Frequency

| Team Size | Daily | Weekly | Monthly |
|-----------|-------|--------|---------|
| 1-3 annotators | -- | 30min | 2hr |
| 4-10 annotators | 15min | 1hr | 3hr |
| 11+ annotators | 15min | 2x 1hr | 4hr |

### 2.2 Weekly Session Structure (60 min)

| Time | Activity |
|------|----------|
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |

### 2.3 Materials Checklist

- [ ] 5-10 pre-selected review spans
- [ ] IAA metrics report
- [ ] Top disagreement patterns list
- [ ] Gold standard examples
- [ ] A1 Quickstart Guide
- [ ] Session notes template

### 2.4 Outcome Documentation

```
## Calibration Session: [DATE]

### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]

### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|

### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example
```

---

## 3. Error Categories and Severity

### 3.1 Error Severity Matrix

| Severity | Weight | Description |
|----------|--------|-------------|
| **Critical** | 1.0 | Fundamentally wrong |
| **Major** | 0.5 | Significant deviation |
| **Minor** | 0.25 | Suboptimal but defensible |
| **Slip** | 0.1 | Typo/formatting |

### 3.2 Critical Errors (Weight: 1.0)

| Error Type | Example |
|------------|---------|
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode in URT-Lite |

### 3.3 Major Errors (Weight: 0.5)

| Error Type | Example |
|------------|---------|
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged OR single split |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |

### 3.4 Minor Errors (Weight: 0.25)

| Error Type | Example |
|------------|---------|
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |

### 3.5 Slips (Weight: 0.1)

Typos, formatting errors, boundary off by <5 chars

### 3.6 Error Severity Decision Tree

```
Is DOMAIN wrong?          --> YES: CRITICAL
Is VALENCE wrong/missing? --> YES: CRITICAL
Is CATEGORY wrong?        --> YES: MAJOR
Is INTENSITY off by 2?    --> YES: MAJOR
Is SUBCODE wrong?         --> YES: MINOR
Is it formatting/typo?    --> YES: SLIP
```

### 3.7 Accuracy Calculation

```
Error Score = Sum(error_weight * count) / total_spans
Accuracy = 100% - Error Score

Thresholds:
  > 95%  = Excellent
  90-95% = Good
  85-90% = Acceptable
  < 85%  = Below Standard
```

---

## 4. Audit Procedures

### 4.1 Sampling Methodology

**Random**: Equal probability selection for general monitoring

**Stratified**: Ensure representation across domains, annotators, edge cases

| Stratum | Minimum Sample |
|---------|----------------|
| Each Domain (O-R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |

### 4.2 Sample Size by Volume

| Daily Volume | Audit Rate |
|--------------|------------|
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |

### 4.3 Audit Frequency

| Type | Frequency | Owner |
|------|-----------|-------|
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |

### 4.4 Audit Workflow

```
[Daily Output] --> [Sample] --> [Blind Re-code]
                                      |
                                      v
                                 [Compare] --> Match: [Log]
                                      |
                                      +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]
```

### 4.5 Escalation Paths

| Error Score | Level | Action |
|-------------|-------|--------|
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |

---

## 5. Gold Standard Management

### 5.1 Corpus Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |

### 5.2 Creation Process

```
[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                  Alpha >= 0.85: [Add to Gold]
                                  Alpha <  0.85: [Discuss/Reject]
```

### 5.3 Gold Standard Documentation

```json
{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}
```

### 5.4 Version Control

| Change Type | Version Bump |
|-------------|--------------|
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |

### 5.5 Retirement Criteria

- Spec change invalidates example
- Systematic confusion traced to example
- Industry shifts make obsolete

---

## 6. Quality Gates

### 6.1 New Annotator Qualification

```
Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
          |
          Pass (>= 85%): Production + 30-day probation
          Fail: Remediation + retake
```

**Passing Criteria**:

- Overall Accuracy >= 85%
- Domain Accuracy >= 90%
- Critical Errors = 0
- Major Errors <= 3

### 6.2 Production Annotator Requirements

| Requirement | Frequency | Threshold |
|-------------|-----------|-----------|
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |

### 6.3 Annotator Tiers

| Tier | Accuracy | Audit Rate |
|------|----------|------------|
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |

### 6.4 Automated System Thresholds

| Metric | Minimum | Production | Best |
|--------|---------|------------|------|
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |

### 6.5 Release Criteria

- [ ] All accuracy metrics meet thresholds
- [ ] Gold standard test documented
- [ ] Error analysis completed
- [ ] Rollback plan in place
- [ ] Stakeholder sign-off

---

## 7. Feedback Loop

### 7.1 Error Reporting

```
ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]
```

### 7.2 Triage Process

```
[Error Submitted] --> [QA Lead (24hr)]
                            |
                            +--> Spec Issue   --> PM
                            +--> Gold Issue   --> QA Team
                            +--> Tool Bug     --> Engineering
                            +--> Training Gap --> QA Lead
```

### 7.3 Spec Clarification Process

```
[Ambiguity] --> Check A1 Guide --> Found: Apply
                      |
                Not Found: Submit Request
                      |
                PM + QA Review
                      |
                Accept: Update A1/Spec
                Reject: Document Rationale
```

### 7.4 Training Update Triggers

| Trigger | Action | Timeline |
|---------|--------|----------|
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator < 85% | Individual coaching | Immediate |

### 7.5 Response SLAs

| Urgency | Response | Resolution |
|---------|----------|------------|
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |

---

## 8. Metrics Dashboard

### 8.1 Key QA KPIs

| KPI | Target | Alert |
|-----|--------|-------|
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |

### 8.2 Reporting Frequency

| Report | Frequency | Audience |
|--------|-----------|----------|
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |

### 8.3 Alert Configuration

```
               Green     Yellow      Red
Accuracy       >= 92%    88-92%      < 88%
IAA            >= 0.80   0.75-0.80   < 0.75
Critical Err   < 2/1K    2-5/1K      > 5/1K
Coverage       >= 12%    10-12%      < 10%

Yellow: Notify QA Lead
Red: Escalate + Immediate Action
```

### 8.4 Dashboard Panels

1. **Accuracy Trend**: Line chart, 30-day rolling
2. **IAA Heatmap**: Annotator pairwise Kappa
3. **Error Distribution**: Stacked bar by severity
4. **Domain Performance**: Radar chart (O-P-J-E-A-V-R)
5. **Annotator Leaderboard**: Table with tiers
6. **Alert Status**: Traffic light indicators

### 8.5 Metric Formulas

```
Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

---

## Document References

| Document | Location |
|----------|----------|
| URT-Specification-v5.1.md | `/urt-taxonomy/spec/` |
| A1-Annotator-Quickstart.md | `/urt-taxonomy/track-a-training/` |
| Gold Standard Corpus | `/urt-taxonomy/gold-standard/` |

---

*URT v5.1 QA Protocol | Track A: Training Materials*
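
The formulas in Section 8.5 can be sketched in Python as shown below, using the severity weights from Section 3.1. This is an illustrative sketch, not part of the URT tooling: the function names (`weighted_accuracy`, `cohens_kappa`, `f1_score`), the lowercase severity keys, and the input shapes are all assumptions made for the example. Krippendorff's Alpha is omitted because it requires a choice of distance metric that this protocol does not specify.

```python
from collections import Counter

# Severity weights from Section 3.1 (lowercase keys are an assumption).
SEVERITY_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}


def weighted_accuracy(error_counts, total_spans):
    """Accuracy = 1 - Sum(error_weight * count) / total_spans.

    error_counts maps a severity name to the number of errors found.
    """
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    error_score = sum(
        SEVERITY_WEIGHTS[severity] * count
        for severity, count in error_counts.items()
    )
    return 1.0 - error_score / total_spans


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa = (Po - Pe) / (1 - Pe) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Po: fraction of spans on which both annotators chose the same label.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pe == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (po - pe) / (1.0 - pe)


def f1_score(precision, recall):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, 2 critical, 4 major, 8 minor, and 10 slip errors over 100 spans give an error score of 7 and an accuracy of 93%. Two annotators labelling four spans as `["P", "P", "O", "O"]` and `["P", "P", "O", "P"]` agree on 3 of 4 (Po = 0.75) with chance agreement Pe = 0.5, so Kappa = 0.5, which falls under "Poor" on the scale in Section 1.3.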