Alejandro Gutiérrez 3eda9bdbfa Add complete URT v5.1 taxonomy framework (11 artifacts)
Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:51:41 +00:00

A2: QA Protocol

Universal Review Taxonomy (URT) v5.1

Purpose: Define quality assurance processes for URT annotation
Version: 5.1 | Status: Production Ready | Date: 2026-01-23


Table of Contents

  1. Inter-Annotator Agreement (IAA) Metrics
  2. Calibration Sessions
  3. Error Categories and Severity
  4. Audit Procedures
  5. Gold Standard Management
  6. Quality Gates
  7. Feedback Loop
  8. Metrics Dashboard

1. Inter-Annotator Agreement (IAA) Metrics

1.1 Cohen's Kappa Thresholds by Code Tier

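| Code Tier | Minimum Kappa | Target Kappa |
|-----------|---------------|--------------|
| Domain (Tier 1) | 0.80 | 0.90+ |
| Category (Tier 2) | 0.75 | 0.85+ |
| Subcode (Tier 3) | 0.70 | 0.80+ |
| Valence | 0.85 | 0.92+ |
| Intensity | 0.70 | 0.80+ |
| Comparative Reference | 0.75 | 0.85+ |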
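
As an illustration of how a pairwise kappa could be checked against the minimums above, here is a minimal Python sketch. The function names `cohens_kappa` and `meets_minimum` and the dictionary keys are hypothetical and not part of the URT tooling; returning 1.0 when chance agreement is total is an assumption of this sketch.

```python
from collections import Counter

# Minimum kappa per tier, copied from the table above (keys are hypothetical).
MIN_KAPPA = {
    "domain": 0.80, "category": 0.75, "subcode": 0.70,
    "valence": 0.85, "intensity": 0.70, "comparative_reference": 0.75,
}

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Po: share of spans where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: chance agreement derived from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # assumption: both annotators used one identical label
    return (p_o - p_e) / (1 - p_e)

def meets_minimum(tier, kappa):
    """True if kappa reaches the minimum listed for the given tier."""
    return kappa >= MIN_KAPPA[tier]
```

For example, `cohens_kappa(["P", "P", "O", "O"], ["P", "P", "O", "P"])` gives Po = 0.75 and Pe = 0.5, so kappa is 0.5, which is below the Domain minimum.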

1.2 Krippendorff's Alpha (3+ Annotators)

| Scenario | Minimum Alpha | Target Alpha |
|----------|---------------|--------------|
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |
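
A minimal sketch of Krippendorff's alpha for nominal labels, assuming each span carries a list of the labels its annotators assigned. The function name and input shape are hypothetical; the coincidence-matrix approach is the standard formulation, and returning 1.0 when only one label occurs is an assumption of this sketch.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: one list of labels per span, e.g. [["P", "P", "O"], ...]."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a span with one label contributes no pairs
        # Every ordered pair of labels within a span counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one span with two or more labels")
    # Do: observed disagreement; De: disagreement expected by chance.
    d_o = sum(w for (a, b), w in coincidences.items() if a != b) / n
    d_e = sum(totals[a] * totals[b]
              for a in totals for b in totals if a != b) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # assumption: only one label was ever used
    return 1 - d_o / d_e
```

The result can then be compared with the minimum for the relevant scenario, for example 0.85 for gold standard creation.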

1.3 Agreement Interpretation Scale

| Range | Interpretation | Action |
|-------|----------------|--------|
| 0.90 - 1.00 | Almost Perfect | Maintain standards |
| 0.80 - 0.89 | Excellent | Minor calibration |
| 0.70 - 0.79 | Good | Schedule calibration |
| 0.60 - 0.69 | Moderate | Mandatory retraining |
| < 0.60 | Poor | Suspend, reassess |
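
The scale above could be applied programmatically along these lines. This is a sketch only: the function name is hypothetical, and it assumes that values falling between the printed bands (for example 0.895) belong to the lower band.

```python
# (floor, interpretation, action) rows from the scale above, highest first.
BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]

def interpret_agreement(score):
    """Return (interpretation, action) for a kappa or alpha value."""
    for floor, label, action in BANDS:
        if score >= floor:
            return label, action
    return "Poor", "Suspend, reassess"
```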

1.4 Profile-Specific Requirements

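| Profile | Domain | Category | Subcode | Overall Target |
|---------|--------|----------|---------|----------------|
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |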

2. Calibration Sessions

2.1 Session Frequency

| Team Size | Daily | Weekly | Monthly |
|-----------|-------|--------|---------|
| 1-3 annotators | -- | 30 min | 2 hr |
| 4-10 annotators | 15 min | 1 hr | 3 hr |
| 11+ annotators | 15 min | 2 x 1 hr | 4 hr |

2.2 Weekly Session Structure (60 min)

| Time | Activity |
|------|----------|
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |

2.3 Materials Checklist

  • 5-10 pre-selected review spans
  • IAA metrics report
  • Top disagreement patterns list
  • Gold standard examples
  • A1 Quickstart Guide
  • Session notes template

2.4 Outcome Documentation

## Calibration Session: [DATE]

### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]

### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|

### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example

3. Error Categories and Severity

3.1 Error Severity Matrix

| Severity | Weight | Description |
|----------|--------|-------------|
| Critical | 1.0 | Fundamentally wrong |
| Major | 0.5 | Significant deviation |
| Minor | 0.25 | Suboptimal but defensible |
| Slip | 0.1 | Typo/formatting |

3.2 Critical Errors (Weight: 1.0)

| Error Type | Example |
|------------|---------|
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode in URT-Lite |

3.3 Major Errors (Weight: 0.5)

| Error Type | Example |
|------------|---------|
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged into one span, or a single issue split in two |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |

3.4 Minor Errors (Weight: 0.25)

| Error Type | Example |
|------------|---------|
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |

3.5 Slips (Weight: 0.1)

Typos, formatting errors, and span boundaries off by fewer than 5 characters.

3.6 Error Severity Decision Tree

Is DOMAIN wrong?            --> YES: CRITICAL
Is VALENCE wrong/missing?   --> YES: CRITICAL
Is CATEGORY wrong?          --> YES: MAJOR
Is INTENSITY off by 2?      --> YES: MAJOR
Is SUBCODE wrong?           --> YES: MINOR
Is it formatting/typo?      --> YES: SLIP
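
The tree above can be sketched as a function that returns the first matching severity. The function and parameter names are hypothetical, the top-down "first YES wins" reading is an assumption, and the off-by-1 intensity rule is taken from Section 3.4 rather than from the tree itself.

```python
def classify_error(domain_wrong=False, valence_wrong=False,
                   category_wrong=False, intensity_gap=0,
                   subcode_wrong=False, formatting_issue=False):
    """Return the severity of a mismatch, or None if nothing is wrong.

    valence_wrong covers both a wrong and a missing valence.
    intensity_gap is the absolute distance between coded and correct level.
    """
    if domain_wrong or valence_wrong:
        return "Critical"
    if category_wrong or intensity_gap >= 2:
        return "Major"
    if subcode_wrong or intensity_gap == 1:  # off-by-1 rule from Section 3.4
        return "Minor"
    if formatting_issue:
        return "Slip"
    return None
```

Because the checks run top-down, a span with both a wrong domain and a wrong subcode is counted once, as Critical.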

3.7 Accuracy Calculation

Error Score = Sum(error_weight * count) / total_spans, expressed as a percentage
Accuracy = 100% - Error Score

Thresholds:
  > 95%  = Excellent
  90-95% = Good
  85-90% = Acceptable
  < 85%  = Below Standard
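
A worked example of the calculation, as a sketch: with 200 audited spans containing 2 critical, 6 major, 8 minor errors and 10 slips, the weighted penalty is 2 + 3 + 2 + 1 = 8, so the error score is 4% and accuracy is 96%. The function names are hypothetical, and treating exactly 95% as "Good" and exactly 90% as "Good" rather than "Acceptable" is an assumption about the band edges.

```python
# Severity weights from Section 3.1.
WEIGHTS = {"Critical": 1.0, "Major": 0.5, "Minor": 0.25, "Slip": 0.1}

def accuracy_percent(error_counts, total_spans):
    """error_counts maps severity name to number of errors found."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    penalty = sum(WEIGHTS[sev] * count for sev, count in error_counts.items())
    return 100.0 - 100.0 * penalty / total_spans

def rating(accuracy):
    """Map an accuracy percentage to the threshold labels above."""
    if accuracy > 95:
        return "Excellent"
    if accuracy >= 90:
        return "Good"
    if accuracy >= 85:
        return "Acceptable"
    return "Below Standard"

counts = {"Critical": 2, "Major": 6, "Minor": 8, "Slip": 10}
print(rating(accuracy_percent(counts, 200)))
```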

4. Audit Procedures

4.1 Sampling Methodology

Random: Equal-probability selection for general monitoring
Stratified: Ensure representation across domains, annotators, and edge cases

| Stratum | Minimum Sample |
|---------|----------------|
| Each Domain (O-R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |

4.2 Sample Size by Volume

| Daily Volume | Audit Rate |
|--------------|------------|
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |
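
The volume bands above could be turned into a sample size as sketched below. The function names are hypothetical. Because adjacent bands share their edge values (500, 2000, 10000), this sketch assumes the upper bound of each band is inclusive, and it rounds the sample size up.

```python
def audit_rate_percent(daily_spans):
    """Audit rate, as a whole percentage, for a given daily span volume."""
    if daily_spans < 100:
        return 30
    if daily_spans <= 500:
        return 20
    if daily_spans <= 2000:
        return 15
    if daily_spans <= 10000:
        return 10
    return 7

def audit_sample_size(daily_spans):
    """Number of spans to audit, rounded up, using integer arithmetic."""
    return (daily_spans * audit_rate_percent(daily_spans) + 99) // 100
```

Under these assumptions, a day with 1,200 spans calls for 180 audited spans.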

4.3 Audit Frequency

| Type | Frequency | Owner |
|------|-----------|-------|
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |

4.4 Audit Workflow

[Daily Output] --> [Sample] --> [Blind Re-code]
       |
       v
[Compare] --> Match: [Log]
       |
       +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]

4.5 Escalation Paths

| Error Score | Level | Action |
|-------------|-------|--------|
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |

5. Gold Standard Management

5.1 Corpus Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |

5.2 Creation Process

[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                   Alpha >= 0.85: [Add to Gold]
                                   Alpha < 0.85: [Discuss/Reject]

5.3 Gold Standard Documentation

{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}
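
A record like the one above could be checked before it enters the corpus. The sketch below is illustrative only: the source shows a single example rather than a schema, so the list of required fields is an assumption, as are the function and constant names. The 0.85 floor comes from the creation process in Section 5.2.

```python
# Assumed required fields, inferred from the single example record.
REQUIRED_FIELDS = ("gold_id", "span_text", "classification", "rationale",
                   "agreement_score", "version", "status")
REQUIRED_CLASSIFICATION = ("primary_code", "valence", "intensity")
MIN_ALPHA = 0.85  # gold standard threshold from Section 5.2

def validate_gold_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if name not in record]
    classification = record.get("classification") or {}
    problems += [f"missing classification.{name}"
                 for name in REQUIRED_CLASSIFICATION
                 if name not in classification]
    score = record.get("agreement_score")
    if isinstance(score, (int, float)) and score < MIN_ALPHA:
        problems.append(f"agreement_score {score} is below {MIN_ALPHA}")
    return problems
```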

5.4 Version Control

| Change Type | Version Bump |
|-------------|--------------|
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |

5.5 Retirement Criteria

  • Spec change invalidates example
  • Systematic confusion traced to example
  • Industry shifts make the example obsolete

6. Quality Gates

6.1 New Annotator Qualification

Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
        |
        Pass (>= 85%): Production + 30-day probation
        Fail: Remediation + retake

Passing Criteria:

  • Overall Accuracy >= 85%
  • Domain Accuracy >= 90%
  • Critical Errors = 0
  • Major Errors <= 3
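
The four passing criteria combine with a logical AND, which a short sketch makes explicit. The function and parameter names are hypothetical; accuracies are given as percentages.

```python
def passes_qualification(overall_accuracy, domain_accuracy,
                         critical_errors, major_errors):
    """True only if every passing criterion listed above is met."""
    return (overall_accuracy >= 85
            and domain_accuracy >= 90
            and critical_errors == 0
            and major_errors <= 3)
```

A candidate with 93% overall accuracy still fails if the exam contains a single critical error.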

6.2 Production Annotator Requirements

| Requirement | Frequency | Threshold |
|-------------|-----------|-----------|
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |

6.3 Annotator Tiers

| Tier | Accuracy | Audit Rate |
|------|----------|------------|
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |

6.4 Automated System Thresholds

| Metric | Minimum | Production | Best |
|--------|---------|------------|------|
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |

6.5 Release Criteria

  • All accuracy metrics meet thresholds
  • Gold standard test documented
  • Error analysis completed
  • Rollback plan in place
  • Stakeholder sign-off

7. Feedback Loop

7.1 Error Reporting

ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]

7.2 Triage Process

[Error Submitted] --> [QA Lead (24hr)]
        |
        +--> Spec Issue --> PM
        +--> Gold Issue --> QA Team
        +--> Tool Bug --> Engineering
        +--> Training Gap --> QA Lead

7.3 Spec Clarification Process

[Ambiguity] --> Check A1 Guide --> Found: Apply
                                   |
                                   Not Found: Submit Request
                                        |
                                   PM + QA Review
                                        |
                                   Accept: Update A1/Spec
                                   Reject: Document Rationale

7.4 Training Update Triggers

| Trigger | Action | Timeline |
|---------|--------|----------|
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator < 85% | Individual coaching | Immediate |

7.5 Response SLAs

| Urgency | Response | Resolution |
|---------|----------|------------|
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |

8. Metrics Dashboard

8.1 Key QA KPIs

| KPI | Target | Alert |
|-----|--------|-------|
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |

8.2 Reporting Frequency

| Report | Frequency | Audience |
|--------|-----------|----------|
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |

8.3 Alert Configuration

              Green      Yellow     Red
Accuracy      >= 92%     88-92%     < 88%
IAA           >= 0.80    0.75-0.80  < 0.75
Critical Err  < 2/1K     2-5/1K     > 5/1K
Coverage      >= 12%     10-12%     < 10%

Yellow: Notify QA Lead
Red: Escalate + Immediate Action
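
The alert bands above could be evaluated as in the following sketch. The metric keys and the function name are hypothetical. Values that sit exactly on a shared edge are assumed to take the better colour, except for the critical error rate, where exactly 2/1K is treated as Yellow because Green is defined as strictly below 2/1K.

```python
# (green_floor, red_below) for metrics where higher is better.
FLOORS = {
    "accuracy_pct": (92, 88),
    "iaa": (0.80, 0.75),
    "coverage_pct": (12, 10),
}

def alert_status(metric, value):
    """Return "green", "yellow" or "red" for one dashboard metric."""
    if metric == "critical_err_per_1k":  # lower is better
        if value < 2:
            return "green"
        return "yellow" if value <= 5 else "red"
    green_floor, red_below = FLOORS[metric]
    if value >= green_floor:
        return "green"
    return "yellow" if value >= red_below else "red"
```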

8.4 Dashboard Panels

  1. Accuracy Trend: Line chart, 30-day rolling
  2. IAA Heatmap: Annotator pairwise Kappa
  3. Error Distribution: Stacked bar by severity
  4. Domain Performance: Radar chart (O-P-J-E-A-V-R)
  5. Annotator Leaderboard: Table with tiers
  6. Alert Status: Traffic light indicators

8.5 Metric Formulas

Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
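
As a small illustration of the F1 formula, the sketch below computes it from raw counts for one label, for example V- in the Valence F1 metric. The function name and the choice to return 0.0 when there are no true positives are assumptions of this sketch.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 from true positive, false positive and false negative counts."""
    if true_pos == 0:
        return 0.0  # assumption: no correct predictions means F1 of zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * (precision * recall) / (precision + recall)
```

With 8 true positives, 2 false positives and 2 false negatives, precision and recall are both 0.8, so F1 is also 0.8.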

Document References

| Document | Location |
|----------|----------|
| URT-Specification-v5.1.md | /urt-taxonomy/spec/ |
| A1-Annotator-Quickstart.md | /urt-taxonomy/track-a-training/ |
| Gold Standard Corpus | /urt-taxonomy/gold-standard/ |

URT v5.1 QA Protocol | Track A: Training Materials