Add complete URT v5.1 taxonomy framework (11 artifacts)

Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:51:41 +00:00
parent a540ab97b1
commit 3eda9bdbfa
13 changed files with 21264 additions and 0 deletions
--- a/urt-taxonomy/track-a-training/A2-QA-Protocol.md
+++ b/urt-taxonomy/track-a-training/A2-QA-Protocol.md
@@ -0,0 +1,476 @@
+# A2: QA Protocol
+## Universal Review Taxonomy (URT) v5.1
+
+**Purpose**: Define quality assurance processes for URT annotation
+**Version**: 5.1 | **Status**: Production Ready | **Date**: 2026-01-23
+
+---
+
+## Table of Contents
+
+1. [Inter-Annotator Agreement (IAA) Metrics](#1-inter-annotator-agreement-iaa-metrics)
+2. [Calibration Sessions](#2-calibration-sessions)
+3. [Error Categories and Severity](#3-error-categories-and-severity)
+4. [Audit Procedures](#4-audit-procedures)
+5. [Gold Standard Management](#5-gold-standard-management)
+6. [Quality Gates](#6-quality-gates)
+7. [Feedback Loop](#7-feedback-loop)
+8. [Metrics Dashboard](#8-metrics-dashboard)
+
+---
+
+## 1. Inter-Annotator Agreement (IAA) Metrics
+
+### 1.1 Cohen's Kappa Thresholds by Code Tier
+
+| Code Tier | Minimum Kappa | Target Kappa |
+|-----------|---------------|--------------|
+| **Domain (Tier 1)** | 0.80 | 0.90+ |
+| **Category (Tier 2)** | 0.75 | 0.85+ |
+| **Subcode (Tier 3)** | 0.70 | 0.80+ |
+| **Valence** | 0.85 | 0.92+ |
+| **Intensity** | 0.70 | 0.80+ |
+| **Comparative Reference** | 0.75 | 0.85+ |
+
+### 1.2 Krippendorff's Alpha (3+ Annotators)
+
+| Scenario | Minimum Alpha | Target Alpha |
+|----------|---------------|--------------|
+| Initial Training | 0.67 | 0.75+ |
+| Production Quality | 0.75 | 0.85+ |
+| Gold Standard Creation | 0.85 | 0.90+ |
+
+### 1.3 Agreement Interpretation Scale
+
+| Range | Interpretation | Action |
+|-------|----------------|--------|
+| **0.90 - 1.00** | Almost Perfect | Maintain standards |
+| **0.80 - 0.89** | Excellent | Minor calibration |
+| **0.70 - 0.79** | Good | Schedule calibration |
+| **0.60 - 0.69** | Moderate | Mandatory retraining |
+| **< 0.60** | Poor | Suspend, reassess |
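A minimal Python sketch of the interpretation scale above. The function name `interpret_agreement` is hypothetical, and it assumes each band is closed at its lower bound, so a score such as 0.895 falls into the 0.80 band.

```python
# Lower bound of each band, with its interpretation and action (section 1.3).
AGREEMENT_BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]


def interpret_agreement(score):
    """Return (interpretation, action) for a Kappa or Alpha score."""
    for lower_bound, label, action in AGREEMENT_BANDS:
        if score >= lower_bound:
            return label, action
    # Anything below 0.60 is the lowest band.
    return "Poor", "Suspend, reassess"
```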
+
+### 1.4 Profile-Specific Requirements
+
+| Profile | Domain | Category | Subcode | Overall Target |
+|---------|--------|----------|---------|----------------|
+| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
+| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
+| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |
+
+---
+
+## 2. Calibration Sessions
+
+### 2.1 Session Frequency
+
+| Team Size | Daily | Weekly | Monthly |
+|-----------|-------|--------|---------|
+| 1-3 annotators | -- | 30min | 2hr |
+| 4-10 annotators | 15min | 1hr | 3hr |
+| 11+ annotators | 15min | 2x 1hr | 4hr |
+
+### 2.2 Weekly Session Structure (60 min)
+
+| Time | Activity |
+|------|----------|
+| 0-5 min | Review IAA metrics from past week |
+| 5-15 min | Discuss 3 highest-disagreement spans |
+| 15-35 min | Group annotation exercise (5 spans) |
+| 35-50 min | Compare results, discuss differences |
+| 50-60 min | Document decisions, update guidance |
+
+### 2.3 Materials Checklist
+
+- [ ] 5-10 pre-selected review spans
+- [ ] IAA metrics report
+- [ ] Top disagreement patterns list
+- [ ] Gold standard examples
+- [ ] A1 Quickstart Guide
+- [ ] Session notes template
+
+### 2.4 Outcome Documentation
+
+```
+## Calibration Session: [DATE]
+
+### Disagreement Patterns
+1. Pattern: [Description]
+   Resolution: [Decision]
+
+### Exercise Results
+| Span | Consensus | Votes | Notes |
+|------|-----------|-------|-------|
+
+### Action Items
+- [ ] Update A1 Section X
+- [ ] Add gold standard example
+```
+
+---
+
+## 3. Error Categories and Severity
+
+### 3.1 Error Severity Matrix
+
+| Severity | Weight | Description |
+|----------|--------|-------------|
+| **Critical** | 1.0 | Fundamentally wrong |
+| **Major** | 0.5 | Significant deviation |
+| **Minor** | 0.25 | Suboptimal but defensible |
+| **Slip** | 0.1 | Typo/formatting |
+
+### 3.2 Critical Errors (Weight: 1.0)
+
+| Error Type | Example |
+|------------|---------|
+| Wrong Domain | "Rude waiter" coded as O instead of P |
+| Wrong Valence | Complaint coded as V+ |
+| Valence Omission | No valence assigned |
+| Profile Violation | Subcode in URT-Lite |
+
+### 3.3 Major Errors (Weight: 0.5)
+
+| Error Type | Example |
+|------------|---------|
+| Wrong Category | J1 (Timing) vs J4 (Resolution) |
+| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
+| Wrong CR Direction | "Gone downhill" coded as CR-B |
+| Missed/Over Split | Two issues merged into one span, or a single issue split in two |
+| J4/R3 Confusion | Process vs Ownership |
+| V/R Confusion | "Total scam" coded as V4.01 |
+
+### 3.4 Minor Errors (Weight: 0.25)
+
+| Error Type | Example |
+|------------|---------|
+| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
+| Intensity Off by 1 | "pretty good" as I1 vs I2 |
+| Borderline Secondary | Questionable secondary code |
+
+### 3.5 Slips (Weight: 0.1)
+
+Typos, formatting errors, and span boundaries off by fewer than 5 characters.
+
+### 3.6 Error Severity Decision Tree
+
+```
+Is DOMAIN wrong?            --> YES: CRITICAL
+Is VALENCE wrong/missing?   --> YES: CRITICAL
+Is CATEGORY wrong?          --> YES: MAJOR
+Is INTENSITY off by 2?      --> YES: MAJOR
+Is SUBCODE wrong?           --> YES: MINOR
+Is it formatting/typo?      --> YES: SLIP
+```
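The decision tree can be read as an ordered rule list in which the first matching rule wins. The sketch below is one possible encoding, not part of the protocol: the function name and keyword arguments are hypothetical, and the "intensity off by 1 is Minor" rule is taken from the table in 3.4 rather than from the tree itself.

```python
def classify_error(*, domain_wrong=False, valence_wrong_or_missing=False,
                   category_wrong=False, intensity_off_by=0,
                   subcode_wrong=False, formatting_only=False):
    """Return the severity of an annotation error, or None if nothing is wrong.

    Checks run from most to least severe, so a span with several problems
    is scored once, at its highest severity.
    """
    if domain_wrong or valence_wrong_or_missing:
        return "critical"
    if category_wrong or intensity_off_by >= 2:
        return "major"
    if subcode_wrong or intensity_off_by == 1:
        return "minor"
    if formatting_only:
        return "slip"
    return None
```

For example, `classify_error(domain_wrong=True, subcode_wrong=True)` is scored as a single critical error rather than one critical plus one minor.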
+
+### 3.7 Accuracy Calculation
+
+```
+Error Score = Sum(error_weight * count) / total_spans * 100%
+Accuracy = 100% - Error Score
+
+Thresholds:
+  > 95%  = Excellent
+  90-95% = Good
+  85-90% = Acceptable
+  < 85%  = Below Standard
+```
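As a worked illustration of the accuracy calculation, here is a small Python sketch using the severity weights from 3.1. The names `SEVERITY_WEIGHTS`, `accuracy_percent`, and `accuracy_band` are hypothetical, and the handling of values that sit exactly on a shared boundary (90% counted as Good, 85% as Acceptable) is an assumption, since the threshold ranges overlap at their edges.

```python
# Severity weights from the matrix in section 3.1.
SEVERITY_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}


def accuracy_percent(error_counts, total_spans):
    """Weighted accuracy in percent: 100 * (1 - sum(weight * count) / total)."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    penalty = sum(SEVERITY_WEIGHTS[sev] * n for sev, n in error_counts.items())
    return 100.0 * (1.0 - penalty / total_spans)


def accuracy_band(accuracy):
    """Map an accuracy percentage to the threshold labels."""
    if accuracy > 95:
        return "Excellent"
    if accuracy >= 90:
        return "Good"
    if accuracy >= 85:
        return "Acceptable"
    return "Below Standard"
```

With 200 audited spans containing 2 critical, 4 major, 8 minor errors and 10 slips, the weighted penalty is 2 + 2 + 2 + 1 = 7, giving an accuracy of 96.5%.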
+
+---
+
+## 4. Audit Procedures
+
+### 4.1 Sampling Methodology
+
+- **Random**: Equal probability selection for general monitoring
+- **Stratified**: Ensure representation across domains, annotators, edge cases
+
+| Stratum | Minimum Sample |
+|---------|----------------|
+| Each Domain (O, P, J, E, A, V, R) | 5% of total |
+| Each Annotator | 10% of output |
+| High-Intensity (I3) | 15% of I3 spans |
+| Non-default CR | 25% of CR-B/W/S |
+
+### 4.2 Sample Size by Volume
+
+| Daily Volume | Audit Rate |
+|--------------|------------|
+| < 100 spans | 30% |
+| 100-500 | 20% |
+| 500-2000 | 15% |
+| 2000-10000 | 10% |
+| > 10000 | 7% |
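A sketch of how the volume table could translate into a daily sample size. The function names are hypothetical. Because the listed ranges share their endpoints (500 and 2000 each appear in two rows), the code assumes a boundary volume belongs to the smaller-volume row, which yields the higher audit rate. Rates are kept as integer percentages so the rounding-up step avoids floating-point surprises.

```python
def audit_rate_percent(daily_spans):
    """Audit rate (integer percent) for a given daily span volume."""
    if daily_spans < 100:
        return 30
    if daily_spans <= 500:
        return 20
    if daily_spans <= 2000:
        return 15
    if daily_spans <= 10000:
        return 10
    return 7


def audit_sample_size(daily_spans):
    """Number of spans to audit, rounded up to a whole span."""
    pct = audit_rate_percent(daily_spans)
    # Integer ceiling of daily_spans * pct / 100.
    return (daily_spans * pct + 99) // 100
```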
+
+### 4.3 Audit Frequency
+
+| Type | Frequency | Owner |
+|------|-----------|-------|
+| Spot Check | Daily | QA Lead |
+| Sample Audit | Weekly | QA Team |
+| Full Audit | Monthly | Senior QA |
+| External | Quarterly | External |
+
+### 4.4 Audit Workflow
+
+```
+[Daily Output] --> [Sample] --> [Blind Re-code]
+       |
+       v
+[Compare] --> Match: [Log]
+       |
+       +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]
+```
+
+### 4.5 Escalation Paths
+
+| Error Score | Level | Action |
+|-------------|-------|--------|
+| < 10% | 1 | Self-correction |
+| 10-15% | 2 | QA Lead review |
+| 15-20% | 3 | Team calibration |
+| > 20% | 4 | Management escalation |
+
+---
+
+## 5. Gold Standard Management
+
+### 5.1 Corpus Requirements
+
+| Metric | Minimum | Target |
+|--------|---------|--------|
+| Total Spans | 500 | 1000+ |
+| Per Domain | 50 | 100+ |
+| Per Category | 10 | 25+ |
+| Edge Cases | 100 | 200+ |
+
+### 5.2 Creation Process
+
+```
+[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
+                                                    |
+                                   Alpha >= 0.85: [Add to Gold]
+                                   Alpha < 0.85: [Discuss/Reject]
+```
+
+### 5.3 Gold Standard Documentation
+
+```json
+{
+  "gold_id": "GS-2026-001",
+  "span_text": "The waiter was incredibly rude",
+  "classification": {
+    "primary_code": "P1.02",
+    "valence": "V-",
+    "intensity": "I3"
+  },
+  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
+  "common_mistakes": ["P1.01 (Warmth)"],
+  "agreement_score": 0.92,
+  "version": "5.1",
+  "status": "active"
+}
+```
+
+### 5.4 Version Control
+
+| Change Type | Version Bump |
+|-------------|--------------|
+| Add example | Patch (5.1.1) |
+| Fix error | Patch |
+| Spec alignment | Minor (5.2) |
+| Taxonomy change | Major (6.0) |
+
+### 5.5 Retirement Criteria
+
+- Spec change invalidates example
+- Systematic confusion traced to example
+- Industry shifts make obsolete
+
+---
+
+## 6. Quality Gates
+
+### 6.1 New Annotator Qualification
+
+```
+Week 1: Training (A1 Guide, Spec, Videos)
+Week 2: Supervised Practice (100 spans, daily feedback)
+Week 3: Qualification Exam (50 gold spans, blind)
+        |
+        Pass (>= 85%): Production + 30-day probation
+        Fail: Remediation + retake
+```
+
+**Passing Criteria**:
+- Overall Accuracy >= 85%
+- Domain Accuracy >= 90%
+- Critical Errors = 0
+- Major Errors <= 3
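The four passing criteria are conjunctive: a candidate must meet all of them. A minimal sketch of that check follows, with a hypothetical `ExamResult` record whose field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class ExamResult:
    overall_accuracy: float  # percent, weighted as in section 3.7
    domain_accuracy: float   # percent of spans with the correct domain
    critical_errors: int
    major_errors: int


def passes_qualification(result):
    """True only if every passing criterion in 6.1 is met."""
    return (
        result.overall_accuracy >= 85
        and result.domain_accuracy >= 90
        and result.critical_errors == 0
        and result.major_errors <= 3
    )
```

Note that a single critical error fails the exam even when overall accuracy is well above 85%.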
+
+### 6.2 Production Annotator Requirements
+
+| Requirement | Frequency | Threshold |
+|-------------|-----------|-----------|
+| Accuracy Check | Weekly | >= 90% |
+| Calibration | Weekly | 90% attendance |
+| Gold Quiz | Monthly | >= 85% |
+| IAA with Peers | Bi-weekly | Kappa >= 0.75 |
+
+### 6.3 Annotator Tiers
+
+| Tier | Accuracy | Audit Rate |
+|------|----------|------------|
+| Expert | >= 95% | 5% |
+| Senior | 92-95% | 10% |
+| Standard | 88-92% | 15% |
+| Developing | 85-88% | 25% |
+| Probation | < 85% | 50% |
+
+### 6.4 Automated System Thresholds
+
+| Metric | Minimum | Production | Best |
+|--------|---------|------------|------|
+| Domain Accuracy | 85% | 90% | 95% |
+| Category Accuracy | 80% | 85% | 90% |
+| Subcode Accuracy | 75% | 80% | 85% |
+| Valence F1 | 0.88 | 0.92 | 0.96 |
+
+### 6.5 Release Criteria
+
+- [ ] All accuracy metrics meet thresholds
+- [ ] Gold standard test documented
+- [ ] Error analysis completed
+- [ ] Rollback plan in place
+- [ ] Stakeholder sign-off
+
+---
+
+## 7. Feedback Loop
+
+### 7.1 Error Reporting
+
+```
+ERROR REPORT FIELDS:
+- Reporter, Date, Span ID
+- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
+- Description
+- Suggested Resolution
+- Urgency: [Critical | High | Medium | Low]
+```
+
+### 7.2 Triage Process
+
+```
+[Error Submitted] --> [QA Lead (24hr)]
+        |
+        +--> Spec Issue --> PM
+        +--> Gold Issue --> QA Team
+        +--> Tool Bug --> Engineering
+        +--> Training Gap --> QA Lead
+```
+
+### 7.3 Spec Clarification Process
+
+```
+[Ambiguity] --> Check A1 Guide --> Found: Apply
+                                   |
+                                   Not Found: Submit Request
+                                        |
+                                   PM + QA Review
+                                        |
+                                   Accept: Update A1/Spec
+                                   Reject: Document Rationale
+```
+
+### 7.4 Training Update Triggers
+
+| Trigger | Action | Timeline |
+|---------|--------|----------|
+| IAA < 0.75 | Mandatory calibration | 48 hours |
+| New error pattern (3+) | Targeted training | 1 week |
+| Spec release | Full training | 2 weeks |
+| Annotator < 85% | Individual coaching | Immediate |
+
+### 7.5 Response SLAs
+
+| Urgency | Response | Resolution |
+|---------|----------|------------|
+| Critical | 2 hours | 24 hours |
+| High | 24 hours | 1 week |
+| Medium | 48 hours | 2 weeks |
+| Low | 1 week | Next sprint |
+
+---
+
+## 8. Metrics Dashboard
+
+### 8.1 Key QA KPIs
+
+| KPI | Target | Alert |
+|-----|--------|-------|
+| Overall Accuracy | >= 92% | < 88% |
+| IAA (Kappa) | >= 0.80 | < 0.75 |
+| Critical Error Rate | < 2/1K | >= 5/1K |
+| Audit Coverage | >= 10% | < 7% |
+| Calibration Attendance | >= 90% | < 80% |
+| Error Resolution Time | < 5 days | > 10 days |
+
+### 8.2 Reporting Frequency
+
+| Report | Frequency | Audience |
+|--------|-----------|----------|
+| Daily Snapshot | Daily | QA Lead |
+| Weekly Summary | Weekly | Team + Management |
+| Monthly Deep Dive | Monthly | Leadership |
+| Quarterly Review | Quarterly | Executives |
+
+### 8.3 Alert Configuration
+
+```
+              Green      Yellow     Red
+Accuracy      >= 92%     88-92%     < 88%
+IAA           >= 0.80    0.75-0.80  < 0.75
+Critical Err  < 2/1K     2-5/1K     > 5/1K
+Coverage      >= 12%     10-12%     < 10%
+
+Yellow: Notify QA Lead
+Red: Escalate + Immediate Action
+```
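The alert bands can be sketched as a small lookup. This is an illustration, not the dashboard's implementation: the names are hypothetical, the thresholds are copied from the table in 8.3 (including its coverage bands), and values exactly on a shared boundary are assumed to take the better colour for higher-is-better metrics and Yellow for the critical error rate.

```python
# (green at or above, red below) for metrics where higher is better.
HIGHER_IS_BETTER = {
    "accuracy": (92.0, 88.0),   # percent
    "iaa": (0.80, 0.75),        # Kappa
    "coverage": (12.0, 10.0),   # percent of spans audited
}


def metric_status(metric, value):
    """Traffic-light colour for a higher-is-better metric."""
    green_at, red_below = HIGHER_IS_BETTER[metric]
    if value >= green_at:
        return "green"
    if value < red_below:
        return "red"
    return "yellow"


def critical_error_status(errors_per_1k):
    """Traffic-light colour for critical errors per 1,000 spans."""
    if errors_per_1k < 2:
        return "green"
    if errors_per_1k > 5:
        return "red"
    return "yellow"
```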
+
+### 8.4 Dashboard Panels
+
+1. **Accuracy Trend**: Line chart, 30-day rolling
+2. **IAA Heatmap**: Annotator pairwise Kappa
+3. **Error Distribution**: Stacked bar by severity
+4. **Domain Performance**: Radar chart (O-P-J-E-A-V-R)
+5. **Annotator Leaderboard**: Table with tiers
+6. **Alert Status**: Traffic light indicators
+
+### 8.5 Metric Formulas
+
+```
+Accuracy = 1 - (Sum(error_weight * count) / total_spans)
+
+Cohen's Kappa = (Po - Pe) / (1 - Pe)
+  Po = Observed agreement
+  Pe = Expected agreement by chance
+
+Krippendorff's Alpha = 1 - (Do / De)
+  Do = Observed disagreement
+  De = Expected disagreement
+
+F1 = 2 * (Precision * Recall) / (Precision + Recall)
+```
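To make the Cohen's Kappa formula concrete, here is a plain-Python sketch for two annotators labelling the same spans. The function name is hypothetical. Pe is computed in the standard way from each annotator's marginal label frequencies, and the case Pe = 1 (both annotators used one identical label throughout) is returned as 1.0 by assumption, since the formula is undefined there.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa = (Po - Pe) / (1 - Pe) for two equal-length label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Po: share of spans where both annotators chose the same label.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pe = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pe == 1.0:
        return 1.0
    return (po - pe) / (1 - pe)
```

For two annotators who agree on 8 of 10 domain labels, each using P four times and O and J three times, Po = 0.80 and Pe = (16 + 9 + 9) / 100 = 0.34, so Kappa = 0.46 / 0.66, roughly 0.70. That sits right at the boundary between the "Good" and "Moderate" bands in 1.3.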
+
+---
+
+## Document References
+
+| Document | Location |
+|----------|----------|
+| URT-Specification-v5.1.md | `/urt-taxonomy/spec/` |
+| A1-Annotator-Quickstart.md | `/urt-taxonomy/track-a-training/` |
+| Gold Standard Corpus | `/urt-taxonomy/gold-standard/` |
+
+---
+
+*URT v5.1 QA Protocol | Track A: Training Materials*