A2: QA Protocol
Universal Review Taxonomy (URT) v5.1
Purpose: Define quality assurance processes for URT annotation
Version: 5.1 | Status: Production Ready | Date: 2026-01-23
Table of Contents
- Inter-Annotator Agreement (IAA) Metrics
- Calibration Sessions
- Error Categories and Severity
- Audit Procedures
- Gold Standard Management
- Quality Gates
- Feedback Loop
- Metrics Dashboard
1. Inter-Annotator Agreement (IAA) Metrics
1.1 Cohen's Kappa Thresholds by Code Tier
| Code Tier | Minimum Kappa | Target Kappa |
| --- | --- | --- |
| Domain (Tier 1) | 0.80 | 0.90+ |
| Category (Tier 2) | 0.75 | 0.85+ |
| Subcode (Tier 3) | 0.70 | 0.80+ |
| Valence | 0.85 | 0.92+ |
| Intensity | 0.70 | 0.80+ |
| Comparative Reference | 0.75 | 0.85+ |
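As an illustration of how the thresholds in 1.1 might be checked, the sketch below computes Cohen's kappa for two annotators and compares it with the tier minimum. The names `MIN_KAPPA`, `cohens_kappa`, and `meets_minimum` are assumptions made for this example, not part of the URT specification.

```python
from collections import Counter

# Minimum kappa per code tier, transcribed from section 1.1.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
MIN_KAPPA = {
    "domain": 0.80,
    "category": 0.75,
    "subcode": 0.70,
    "valence": 0.85,
    "intensity": 0.70,
    "comparative_reference": 0.75,
}


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of spans where both labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def meets_minimum(tier, kappa):
    """True if kappa reaches the minimum threshold for the given tier."""
    return kappa >= MIN_KAPPA[tier]
```

For example, `meets_minimum("domain", cohens_kappa(a, b))` would flag a pair of annotators whose domain-level agreement falls below 0.80.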
1.2 Krippendorff's Alpha (3+ Annotators)
| Scenario | Minimum Alpha | Target Alpha |
| --- | --- | --- |
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |
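A minimal sketch of Krippendorff's alpha for nominal labels follows, assuming each span is represented as the list of labels it received (annotators who skipped a span are simply omitted). The function name and input layout are assumptions for this example. Production use would more likely rely on a vetted statistics library.

```python
from collections import defaultdict
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha (nominal metric).

    units: list of lists; each inner list holds the labels one span
    received from the annotators who labeled it.
    """
    # Coincidence matrix: every ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidence = defaultdict(float)
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = defaultdict(float)
    for (a, _), weight in coincidence.items():
        totals[a] += weight
    n = sum(totals.values())
    if n == 0:
        raise ValueError("need at least one span with two or more labels")

    observed = sum(w for (a, b), w in coincidence.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Alpha is undefined when only one label ever appears;
        # returning 1.0 here is a convention assumed by this sketch.
        return 1.0
    return 1.0 - observed / expected
```

The result can then be compared with the scenario minimum, for example 0.75 for Production Quality.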
1.3 Agreement Interpretation Scale
| Range | Interpretation | Action |
| --- | --- | --- |
| 0.90 - 1.00 | Almost Perfect | Maintain standards |
| 0.80 - 0.89 | Excellent | Minor calibration |
| 0.70 - 0.79 | Good | Schedule calibration |
| 0.60 - 0.69 | Moderate | Mandatory retraining |
| < 0.60 | Poor | Suspend, reassess |
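The interpretation scale can be expressed as a simple lookup. The sketch below treats each band's lower bound as inclusive, so a score between printed bands (for example 0.895) falls into the lower band. That rule is an assumption, as are the names `AGREEMENT_BANDS` and `interpret_agreement`.

```python
# (lower bound, interpretation, action), highest band first.
# Values transcribed from section 1.3.
AGREEMENT_BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]


def interpret_agreement(score):
    """Map an agreement score to (interpretation, action)."""
    for lower, label, action in AGREEMENT_BANDS:
        if score >= lower:
            return label, action
    # Anything below 0.60 is the lowest band.
    return "Poor", "Suspend, reassess"
```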
1.4 Profile-Specific Requirements
| Profile | Domain | Category | Subcode | Overall Target |
| --- | --- | --- | --- | --- |
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |
2. Calibration Sessions
2.1 Session Frequency
| Team Size | Daily | Weekly | Monthly |
| --- | --- | --- | --- |
| 1-3 annotators | -- | 30min | 2hr |
| 4-10 annotators | 15min | 1hr | 3hr |
| 11+ annotators | 15min | 2x 1hr | 4hr |
2.2 Weekly Session Structure (60 min)
| Time | Activity |
| --- | --- |
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |
2.3 Materials Checklist
2.4 Outcome Documentation
3. Error Categories and Severity
3.1 Error Severity Matrix
| Severity | Weight | Description |
| --- | --- | --- |
| Critical | 1.0 | Fundamentally wrong |
| Major | 0.5 | Significant deviation |
| Minor | 0.25 | Suboptimal but defensible |
| Slip | 0.1 | Typo/formatting |
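To show how the severity weights might be aggregated, the sketch below sums the weights of the errors found in an audit. The normalized accuracy function is an assumed aggregation (one minus weighted errors per span reviewed) and is not taken from this protocol. All names are hypothetical.

```python
# Severity weights transcribed from section 3.1.
SEVERITY_WEIGHT = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}


def weighted_error_total(error_severities):
    """Sum of severity weights for a list such as ["major", "slip"]."""
    return sum(SEVERITY_WEIGHT[s] for s in error_severities)


def assumed_weighted_accuracy(error_severities, spans_reviewed):
    """ASSUMPTION: accuracy = 1 - weighted errors / spans reviewed.

    This formula is illustrative only; it is not the protocol's
    official accuracy calculation.
    """
    if spans_reviewed <= 0:
        raise ValueError("spans_reviewed must be positive")
    score = 1.0 - weighted_error_total(error_severities) / spans_reviewed
    return max(0.0, score)  # clamp so heavy error loads do not go negative
```

Under this assumed formula, two Major errors in 20 reviewed spans would give 0.95.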
3.2 Critical Errors (Weight: 1.0)
| Error Type | Example |
| --- | --- |
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode in URT-Lite |
3.3 Major Errors (Weight: 0.5)
| Error Type | Example |
| --- | --- |
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged OR single split |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |
3.4 Minor Errors (Weight: 0.25)
| Error Type | Example |
| --- | --- |
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |
3.5 Slips (Weight: 0.1)
Typos, formatting errors, span boundary off by fewer than 5 characters
3.6 Error Severity Decision Tree
3.7 Accuracy Calculation
4. Audit Procedures
4.1 Sampling Methodology
Random: Equal probability selection for general monitoring
Stratified: Ensure representation across domains, annotators, edge cases
| Stratum | Minimum Sample |
| --- | --- |
| Each Domain (O-R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |
4.2 Sample Size by Volume
| Daily Volume | Audit Rate |
| --- | --- |
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |
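The volume bands can be turned into a sample-size calculation. In the sketch below, volumes that sit on a shared boundary (500, 2000) are assigned to the lower-volume band, matching how 10000 belongs to the 2000-10000 row. That tie-break, rounding the sample size up, and the function names are all assumptions.

```python
def audit_rate_percent(daily_volume):
    """Audit rate (whole percent) for a daily span volume, per section 4.2.

    ASSUMPTION: upper bounds are inclusive, so 500 -> 20% and 2000 -> 15%.
    """
    if daily_volume < 0:
        raise ValueError("daily_volume must be non-negative")
    if daily_volume < 100:
        return 30
    if daily_volume <= 500:
        return 20
    if daily_volume <= 2000:
        return 15
    if daily_volume <= 10000:
        return 10
    return 7


def audit_sample_size(daily_volume):
    """Number of spans to audit, rounded up (integer arithmetic only)."""
    pct = audit_rate_percent(daily_volume)
    return (daily_volume * pct + 99) // 100
```

With these assumptions, a day with 1,000 spans would call for auditing 150 of them.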
4.3 Audit Frequency
| Type | Frequency | Owner |
| --- | --- | --- |
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |
4.4 Audit Workflow
4.5 Escalation Paths
| Error Score | Level | Action |
| --- | --- | --- |
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |
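A sketch of the escalation lookup follows. Because the printed bands share the boundary value 15%, this example assumes upper bounds are inclusive (15% maps to Level 2, 20% to Level 3), consistent with Level 4 starting strictly above 20%. The function name is hypothetical.

```python
def escalation_level(error_score_percent):
    """Return (level, action) for an audit error score given in percent.

    ASSUMPTION: upper bounds are inclusive, so exactly 15 -> level 2
    and exactly 20 -> level 3.
    """
    if error_score_percent < 0:
        raise ValueError("error score must be non-negative")
    if error_score_percent < 10:
        return 1, "Self-correction"
    if error_score_percent <= 15:
        return 2, "QA Lead review"
    if error_score_percent <= 20:
        return 3, "Team calibration"
    return 4, "Management escalation"
```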
5. Gold Standard Management
5.1 Corpus Requirements
| Metric | Minimum | Target |
| --- | --- | --- |
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |
5.2 Creation Process
5.3 Gold Standard Documentation
5.4 Version Control
| Change Type | Version Bump |
| --- | --- |
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |
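The version-bump rules can be sketched as a small helper. The change-type keys and the function name are assumptions for this example. The output formats follow the examples in the table (5.1.1, 5.2, 6.0).

```python
# Change type -> bump level, transcribed from section 5.4.
# The snake_case keys are hypothetical identifiers.
BUMP_FOR_CHANGE = {
    "add_example": "patch",
    "fix_error": "patch",
    "spec_alignment": "minor",
    "taxonomy_change": "major",
}


def bump_version(version, change_type):
    """Return the next gold-standard version string for a change type."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))  # treat "5.1" as 5.1.0
    major, minor, patch = parts[:3]
    bump = BUMP_FOR_CHANGE[change_type]
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    if bump == "minor":
        return f"{major}.{minor + 1}"
    return f"{major + 1}.0"
```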
5.5 Retirement Criteria
- Spec change invalidates example
- Systematic confusion traced to example
- Industry shifts make obsolete
6. Quality Gates
6.1 New Annotator Qualification
Passing Criteria:
- Overall Accuracy >= 85%
- Domain Accuracy >= 90%
- Critical Errors = 0
- Major Errors <= 3
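The passing criteria above can be sketched as a single check that also reports which criteria failed. The `QualificationResult` fields and function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class QualificationResult:
    """Hypothetical record of one qualification attempt."""
    overall_accuracy: float  # percent
    domain_accuracy: float   # percent
    critical_errors: int
    major_errors: int


def failed_criteria(result):
    """List the section 6.1 criteria that the result does not meet."""
    failures = []
    if result.overall_accuracy < 85:
        failures.append("overall accuracy below 85%")
    if result.domain_accuracy < 90:
        failures.append("domain accuracy below 90%")
    if result.critical_errors != 0:
        failures.append("critical errors present")
    if result.major_errors > 3:
        failures.append("more than 3 major errors")
    return failures


def qualifies(result):
    """True only when every passing criterion is met."""
    return not failed_criteria(result)
```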
6.2 Production Annotator Requirements
| Requirement | Frequency | Threshold |
| --- | --- | --- |
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |
6.3 Annotator Tiers
| Tier | Accuracy | Audit Rate |
| --- | --- | --- |
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |
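A sketch of the tier lookup follows. The printed bands share boundary values (92%, 88%), so this example assumes lower bounds are inclusive, consistent with Expert starting at exactly 95% and Probation ending strictly below 85%. Names are hypothetical.

```python
# (lower bound in percent, tier, audit rate in percent), highest tier first.
# Values transcribed from section 6.3.
TIERS = [
    (95.0, "Expert", 5),
    (92.0, "Senior", 10),
    (88.0, "Standard", 15),
    (85.0, "Developing", 25),
]


def annotator_tier(accuracy_percent):
    """Return (tier, audit_rate_percent) for an annotator's accuracy.

    ASSUMPTION: lower bounds are inclusive, so exactly 92 -> Senior.
    """
    for lower, tier, audit_rate in TIERS:
        if accuracy_percent >= lower:
            return tier, audit_rate
    return "Probation", 50
```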
6.4 Automated System Thresholds
| Metric | Minimum | Production | Best |
| --- | --- | --- | --- |
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |
6.5 Release Criteria
7. Feedback Loop
7.1 Error Reporting
7.2 Triage Process
7.3 Spec Clarification Process
7.4 Training Update Triggers
| Trigger | Action | Timeline |
| --- | --- | --- |
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator < 85% | Individual coaching | Immediate |
7.5 Response SLAs
| Urgency | Response | Resolution |
| --- | --- | --- |
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |
8. Metrics Dashboard
8.1 Key QA KPIs
| KPI | Target | Alert |
| --- | --- | --- |
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |
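The KPI targets and alert thresholds can be evaluated with a small rule table, sketched below. The KPI keys and the intermediate "watch" status (for values between target and alert) are assumptions for this example. The protocol itself defines only the Target and Alert columns.

```python
# KPI -> (on-target predicate, alert predicate), transcribed from section 8.1.
# Keys are hypothetical identifiers; units are encoded in the key names.
KPI_RULES = {
    "overall_accuracy_pct": (lambda v: v >= 92, lambda v: v < 88),
    "iaa_kappa": (lambda v: v >= 0.80, lambda v: v < 0.75),
    "critical_errors_per_1k": (lambda v: v < 2, lambda v: v >= 5),
    "audit_coverage_pct": (lambda v: v >= 10, lambda v: v < 7),
    "calibration_attendance_pct": (lambda v: v >= 90, lambda v: v < 80),
    "error_resolution_days": (lambda v: v < 5, lambda v: v > 10),
}


def kpi_status(name, value):
    """Classify a KPI reading as "alert", "on_target", or "watch"."""
    on_target, alert = KPI_RULES[name]
    if alert(value):
        return "alert"
    if on_target(value):
        return "on_target"
    # ASSUMPTION: readings between target and alert get a "watch" status.
    return "watch"
```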
8.2 Reporting Frequency
| Report | Frequency | Audience |
| --- | --- | --- |
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |
8.3 Alert Configuration
8.4 Dashboard Panels
- Accuracy Trend: Line chart, 30-day rolling
- IAA Heatmap: Annotator pairwise Kappa
- Error Distribution: Stacked bar by severity
- Domain Performance: Radar chart (O-P-J-E-A-V-R)
- Annotator Leaderboard: Table with tiers
- Alert Status: Traffic light indicators
8.5 Metric Formulas
Document References
| Document | Location |
| --- | --- |
| URT-Specification-v5.1.md | /urt-taxonomy/spec/ |
| A1-Annotator-Quickstart.md | /urt-taxonomy/track-a-training/ |
| Gold Standard Corpus | /urt-taxonomy/gold-standard/ |
URT v5.1 QA Protocol | Track A: Training Materials