Add complete URT v5.1 taxonomy framework (11 artifacts)
Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
urt-taxonomy/track-a-training/A2-QA-Protocol.md
@@ -0,0 +1,476 @@
# A2: QA Protocol
## Universal Review Taxonomy (URT) v5.1

**Purpose**: Define quality assurance processes for URT annotation
**Version**: 5.1 | **Status**: Production Ready | **Date**: 2026-01-23

---

## Table of Contents

1. [Inter-Annotator Agreement (IAA) Metrics](#1-inter-annotator-agreement-iaa-metrics)
2. [Calibration Sessions](#2-calibration-sessions)
3. [Error Categories and Severity](#3-error-categories-and-severity)
4. [Audit Procedures](#4-audit-procedures)
5. [Gold Standard Management](#5-gold-standard-management)
6. [Quality Gates](#6-quality-gates)
7. [Feedback Loop](#7-feedback-loop)
8. [Metrics Dashboard](#8-metrics-dashboard)

---

## 1. Inter-Annotator Agreement (IAA) Metrics

### 1.1 Cohen's Kappa Thresholds by Code Tier

| Code Tier | Minimum Kappa | Target Kappa |
|-----------|---------------|--------------|
| **Domain (Tier 1)** | 0.80 | 0.90+ |
| **Category (Tier 2)** | 0.75 | 0.85+ |
| **Subcode (Tier 3)** | 0.70 | 0.80+ |
| **Valence** | 0.85 | 0.92+ |
| **Intensity** | 0.70 | 0.80+ |
| **Comparative Reference** | 0.75 | 0.85+ |

### 1.2 Krippendorff's Alpha (3+ Annotators)

| Scenario | Minimum Alpha | Target Alpha |
|----------|---------------|--------------|
| Initial Training | 0.67 | 0.75+ |
| Production Quality | 0.75 | 0.85+ |
| Gold Standard Creation | 0.85 | 0.90+ |

### 1.3 Agreement Interpretation Scale

| Range | Interpretation | Action |
|-------|----------------|--------|
| **0.90 - 1.00** | Almost Perfect | Maintain standards |
| **0.80 - 0.89** | Excellent | Minor calibration |
| **0.70 - 0.79** | Good | Schedule calibration |
| **0.60 - 0.69** | Moderate | Mandatory retraining |
| **< 0.60** | Poor | Suspend, reassess |
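
As an illustration of how a pairwise score might be computed and then mapped onto the scale above, here is a minimal Python sketch. The function names (`cohens_kappa`, `interpret_agreement`) and the sample labels are hypothetical, and treating each band's lower bound as inclusive is an assumption, since the table leaves values such as 0.895 unassigned.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who coded the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Po: share of spans where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pe: agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Bands from Section 1.3; inclusive lower bounds are an assumption.
BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]


def interpret_agreement(score):
    """Return (interpretation, action) for an agreement score."""
    for floor, label, action in BANDS:
        if score >= floor:
            return label, action
    return "Poor", "Suspend, reassess"


# Hypothetical domain codes from two annotators over the same ten spans.
annotator_a = ["O", "P", "P", "J", "O", "P", "J", "O", "P", "J"]
annotator_b = ["O", "P", "J", "J", "O", "P", "J", "O", "P", "P"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3), interpret_agreement(kappa))
```

In this made-up sample the annotators agree on 8 of 10 spans, yet kappa comes out near 0.697 once chance agreement is removed, which lands in the Moderate band.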

### 1.4 Profile-Specific Requirements

| Profile | Domain | Category | Subcode | Overall Target |
|---------|--------|----------|---------|----------------|
| URT-Lite | >= 0.85 | N/A | N/A | >= 0.85 |
| URT-Core | >= 0.85 | >= 0.80 | N/A | >= 0.80 |
| URT-Standard/Full | >= 0.85 | >= 0.80 | >= 0.75 | >= 0.78 |

---

## 2. Calibration Sessions

### 2.1 Session Frequency

| Team Size | Daily | Weekly | Monthly |
|-----------|-------|--------|---------|
| 1-3 annotators | -- | 30min | 2hr |
| 4-10 annotators | 15min | 1hr | 3hr |
| 11+ annotators | 15min | 2x 1hr | 4hr |

### 2.2 Weekly Session Structure (60 min)

| Time | Activity |
|------|----------|
| 0-5 min | Review IAA metrics from past week |
| 5-15 min | Discuss 3 highest-disagreement spans |
| 15-35 min | Group annotation exercise (5 spans) |
| 35-50 min | Compare results, discuss differences |
| 50-60 min | Document decisions, update guidance |

### 2.3 Materials Checklist

- [ ] 5-10 pre-selected review spans
- [ ] IAA metrics report
- [ ] Top disagreement patterns list
- [ ] Gold standard examples
- [ ] A1 Quickstart Guide
- [ ] Session notes template

### 2.4 Outcome Documentation

```
## Calibration Session: [DATE]

### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]

### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|

### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example
```

---

## 3. Error Categories and Severity

### 3.1 Error Severity Matrix

| Severity | Weight | Description |
|----------|--------|-------------|
| **Critical** | 1.0 | Fundamentally wrong |
| **Major** | 0.5 | Significant deviation |
| **Minor** | 0.25 | Suboptimal but defensible |
| **Slip** | 0.1 | Typo/formatting |

### 3.2 Critical Errors (Weight: 1.0)

| Error Type | Example |
|------------|---------|
| Wrong Domain | "Rude waiter" coded as O instead of P |
| Wrong Valence | Complaint coded as V+ |
| Valence Omission | No valence assigned |
| Profile Violation | Subcode in URT-Lite |

### 3.3 Major Errors (Weight: 0.5)

| Error Type | Example |
|------------|---------|
| Wrong Category | J1 (Timing) vs J4 (Resolution) |
| Intensity Off by 2 | "TERRIBLE!!!" coded as I1 |
| Wrong CR Direction | "Gone downhill" coded as CR-B |
| Missed/Over Split | Two issues merged OR single split |
| J4/R3 Confusion | Process vs Ownership |
| V/R Confusion | "Total scam" coded as V4.01 |

### 3.4 Minor Errors (Weight: 0.25)

| Error Type | Example |
|------------|---------|
| Wrong Subcode (Same Category) | P1.01 vs P1.02 within P1 |
| Intensity Off by 1 | "pretty good" as I1 vs I2 |
| Borderline Secondary | Questionable secondary code |

### 3.5 Slips (Weight: 0.1)

Typos, formatting errors, boundary off by <5 chars

### 3.6 Error Severity Decision Tree

```
Is DOMAIN wrong?           --> YES: CRITICAL
Is VALENCE wrong/missing?  --> YES: CRITICAL
Is CATEGORY wrong?         --> YES: MAJOR
Is INTENSITY off by 2?     --> YES: MAJOR
Is SUBCODE wrong?          --> YES: MINOR
Is it formatting/typo?     --> YES: SLIP
```
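
The decision tree reads as a first-match rule list, which could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the protocol's reference implementation: the record layout (`primary_code`, `valence`, `intensity`) is borrowed from the gold standard example in Section 5.3, the domain and category are assumed to be derivable from the code string (`P1.02` gives domain `P` and category `P1`), the `formatting_issue` flag is hypothetical, and the off-by-one intensity rule is taken from Section 3.4 because the tree itself does not list it.

```python
def classify_error(gold, submitted, formatting_issue=False):
    """Return the severity of a mismatch, or None when the codes match.

    Rules are checked in the order of the Section 3.6 decision tree,
    so the most severe applicable rule wins.
    """
    gold_code, sub_code = gold["primary_code"], submitted["primary_code"]

    # Assumption: the first character of a code is its domain letter.
    if gold_code[0] != sub_code[0]:
        return "Critical"
    # A missing valence (None) counts the same as a wrong one.
    if submitted.get("valence") != gold["valence"]:
        return "Critical"
    # Assumption: the part before the dot is the category (e.g. "P1").
    if gold_code.split(".")[0] != sub_code.split(".")[0]:
        return "Major"
    # Intensities are assumed to look like "I1", "I2", "I3".
    gap = abs(int(gold["intensity"][1:]) - int(submitted["intensity"][1:]))
    if gap >= 2:
        return "Major"
    # Wrong subcode within the category, or intensity off by one (Section 3.4).
    if gold_code != sub_code or gap == 1:
        return "Minor"
    if formatting_issue:
        return "Slip"
    return None


gold = {"primary_code": "P1.02", "valence": "V-", "intensity": "I3"}
attempt = {"primary_code": "P1.01", "valence": "V-", "intensity": "I3"}
print(classify_error(gold, attempt))
```

Checking the rules top to bottom means a span with both a wrong domain and a wrong subcode is reported once, at the higher severity.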

### 3.7 Accuracy Calculation

```
Error Score = Sum(error_weight * count) / total_spans
Accuracy = 100% - Error Score

Thresholds:
  > 95%  = Excellent      85-90% = Acceptable
  90-95% = Good           < 85%  = Below Standard
```
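
A worked example may make the weighting concrete. The sketch below applies the formula with the Section 3.1 weights to a hypothetical audit of 200 spans. The function names and the error counts are made up for illustration, and assigning the shared edges (90% and 85%) to the better rating is an assumption the thresholds leave open.

```python
# Severity weights from Section 3.1.
WEIGHTS = {"Critical": 1.0, "Major": 0.5, "Minor": 0.25, "Slip": 0.1}


def weighted_accuracy(error_counts, total_spans):
    """Accuracy in percent: 100 * (1 - weighted errors / total spans)."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    error_score = sum(WEIGHTS[sev] * n for sev, n in error_counts.items()) / total_spans
    return 100.0 * (1.0 - error_score)


def rate_accuracy(accuracy_pct):
    """Map an accuracy percentage to the Section 3.7 rating."""
    if accuracy_pct > 95:
        return "Excellent"
    if accuracy_pct >= 90:   # assumption: 90% itself counts as Good
        return "Good"
    if accuracy_pct >= 85:   # assumption: 85% itself counts as Acceptable
        return "Acceptable"
    return "Below Standard"


# Hypothetical audit: 200 spans with 2 critical, 6 major, 8 minor, 10 slips.
# Weighted errors: 2*1.0 + 6*0.5 + 8*0.25 + 10*0.1 = 8, so error score = 8/200 = 4%.
counts = {"Critical": 2, "Major": 6, "Minor": 8, "Slip": 10}
accuracy = weighted_accuracy(counts, 200)
print(round(accuracy, 2), rate_accuracy(accuracy))
```

Note that 26 raw errors in 200 spans yield about 96% rather than 87%, because slips and minor errors are discounted heavily relative to critical ones.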

---

## 4. Audit Procedures

### 4.1 Sampling Methodology

**Random**: Equal probability selection for general monitoring
**Stratified**: Ensure representation across domains, annotators, edge cases

| Stratum | Minimum Sample |
|---------|----------------|
| Each Domain (O-R) | 5% of total |
| Each Annotator | 10% of output |
| High-Intensity (I3) | 15% of I3 spans |
| Non-default CR | 25% of CR-B/W/S |

### 4.2 Sample Size by Volume

| Daily Volume | Audit Rate |
|--------------|------------|
| < 100 spans | 30% |
| 100-500 | 20% |
| 500-2000 | 15% |
| 2000-10000 | 10% |
| > 10000 | 7% |
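
One way the volume table might translate into a daily random sample is sketched below. The helper names are hypothetical, the rounding-up of fractional sample sizes is an assumption, and so is the handling of the shared band edges: 500 and 2000 are assigned to the higher-volume band, while 10000 stays in the 10% band because the last row reads "> 10000".

```python
import random


def audit_rate_pct(daily_volume):
    """Audit rate in whole percent for a given daily span volume (Section 4.2)."""
    if daily_volume < 100:
        return 30
    if daily_volume < 500:     # assumption: 500 itself falls in the next band
        return 20
    if daily_volume < 2000:    # assumption: 2000 itself falls in the next band
        return 15
    if daily_volume <= 10000:
        return 10
    return 7


def audit_sample_size(daily_volume):
    """Number of spans to audit, rounded up; integer math avoids float drift."""
    return (daily_volume * audit_rate_pct(daily_volume) + 99) // 100


def draw_audit_sample(span_ids, seed=None):
    """Equal-probability sample of the day's spans (the 'Random' method in 4.1)."""
    rng = random.Random(seed)
    return rng.sample(span_ids, audit_sample_size(len(span_ids)))


todays_spans = list(range(250))          # hypothetical span IDs
sample = draw_audit_sample(todays_spans, seed=7)
print(len(sample))                       # 250 spans at 20% gives 50
```

Stratified minimums from Section 4.1 are not covered here; they would need a per-stratum top-up after the random draw.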

### 4.3 Audit Frequency

| Type | Frequency | Owner |
|------|-----------|-------|
| Spot Check | Daily | QA Lead |
| Sample Audit | Weekly | QA Team |
| Full Audit | Monthly | Senior QA |
| External | Quarterly | External |

### 4.4 Audit Workflow

```
[Daily Output] --> [Sample] --> [Blind Re-code]
                                      |
                                      v
                                 [Compare] --> Match: [Log]
                                      |
                                      +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]
```

### 4.5 Escalation Paths

| Error Score | Level | Action |
|-------------|-------|--------|
| < 10% | 1 | Self-correction |
| 10-15% | 2 | QA Lead review |
| 15-20% | 3 | Team calibration |
| > 20% | 4 | Management escalation |

---

## 5. Gold Standard Management

### 5.1 Corpus Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Total Spans | 500 | 1000+ |
| Per Domain | 50 | 100+ |
| Per Category | 10 | 25+ |
| Edge Cases | 100 | 200+ |

### 5.2 Creation Process

```
[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                     Alpha >= 0.85: [Add to Gold]
                                     Alpha <  0.85: [Discuss/Reject]
```

### 5.3 Gold Standard Documentation

```json
{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}
```

### 5.4 Version Control

| Change Type | Version Bump |
|-------------|--------------|
| Add example | Patch (5.1.1) |
| Fix error | Patch |
| Spec alignment | Minor (5.2) |
| Taxonomy change | Major (6.0) |

### 5.5 Retirement Criteria

- Spec change invalidates example
- Systematic confusion traced to example
- Industry shifts make obsolete

---

## 6. Quality Gates

### 6.1 New Annotator Qualification

```
Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
           |
           Pass (>= 85%): Production + 30-day probation
           Fail: Remediation + retake
```

**Passing Criteria**:
- Overall Accuracy >= 85%
- Domain Accuracy >= 90%
- Critical Errors = 0
- Major Errors <= 3
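
The four passing criteria are conjunctive, so an exam with a high overall score can still fail on a single critical error. A small Python sketch of that check follows; the `ExamResult` type and its field names are hypothetical and not part of the URT toolchain.

```python
from dataclasses import dataclass


@dataclass
class ExamResult:
    """Outcome of the 50-span qualification exam (field names are illustrative)."""
    overall_accuracy: float   # percent
    domain_accuracy: float    # percent
    critical_errors: int
    major_errors: int


def failed_criteria(result):
    """List every Section 6.1 passing criterion the result does not meet."""
    failures = []
    if result.overall_accuracy < 85:
        failures.append("Overall Accuracy >= 85%")
    if result.domain_accuracy < 90:
        failures.append("Domain Accuracy >= 90%")
    if result.critical_errors != 0:
        failures.append("Critical Errors = 0")
    if result.major_errors > 3:
        failures.append("Major Errors <= 3")
    return failures


def qualifies(result):
    """True only when all four criteria hold."""
    return not failed_criteria(result)


candidate = ExamResult(overall_accuracy=91.0, domain_accuracy=88.0,
                       critical_errors=1, major_errors=2)
print(qualifies(candidate), failed_criteria(candidate))
```

Returning the list of failed criteria, rather than a bare boolean, gives the remediation step something specific to target before the retake.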

### 6.2 Production Annotator Requirements

| Requirement | Frequency | Threshold |
|-------------|-----------|-----------|
| Accuracy Check | Weekly | >= 90% |
| Calibration | Weekly | 90% attendance |
| Gold Quiz | Monthly | >= 85% |
| IAA with Peers | Bi-weekly | Kappa >= 0.75 |

### 6.3 Annotator Tiers

| Tier | Accuracy | Audit Rate |
|------|----------|------------|
| Expert | >= 95% | 5% |
| Senior | 92-95% | 10% |
| Standard | 88-92% | 15% |
| Developing | 85-88% | 25% |
| Probation | < 85% | 50% |
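
Since each tier carries its own audit rate, tier assignment feeds directly into how much of an annotator's output gets re-coded. The lookup below is a minimal sketch; the function name is hypothetical, and sending shared edges (92%, 88%) to the higher tier is an assumption chosen to match the explicit ">= 95%" on the Expert row.

```python
# (accuracy floor in percent, tier name, audit rate in percent), best tier first.
TIERS = [
    (95.0, "Expert", 5),
    (92.0, "Senior", 10),
    (88.0, "Standard", 15),
    (85.0, "Developing", 25),
]


def assign_tier(accuracy_pct):
    """Return (tier, audit_rate_pct) for an annotator's measured accuracy."""
    for floor, tier, audit_rate in TIERS:
        if accuracy_pct >= floor:
            return tier, audit_rate
    # Anything below 85% falls through to Probation.
    return "Probation", 50


for acc in (96.5, 92.0, 86.3, 80.0):
    print(acc, assign_tier(acc))
```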

### 6.4 Automated System Thresholds

| Metric | Minimum | Production | Best |
|--------|---------|------------|------|
| Domain Accuracy | 85% | 90% | 95% |
| Category Accuracy | 80% | 85% | 90% |
| Subcode Accuracy | 75% | 80% | 85% |
| Valence F1 | 0.88 | 0.92 | 0.96 |

### 6.5 Release Criteria

- [ ] All accuracy metrics meet thresholds
- [ ] Gold standard test documented
- [ ] Error analysis completed
- [ ] Rollback plan in place
- [ ] Stakeholder sign-off

---

## 7. Feedback Loop

### 7.1 Error Reporting

```
ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]
```

### 7.2 Triage Process

```
[Error Submitted] --> [QA Lead (24hr)]
                           |
                           +--> Spec Issue --> PM
                           +--> Gold Issue --> QA Team
                           +--> Tool Bug --> Engineering
                           +--> Training Gap --> QA Lead
```

### 7.3 Spec Clarification Process

```
[Ambiguity] --> Check A1 Guide --> Found: Apply
                      |
                Not Found: Submit Request
                      |
                PM + QA Review
                      |
          Accept: Update A1/Spec
          Reject: Document Rationale
```

### 7.4 Training Update Triggers

| Trigger | Action | Timeline |
|---------|--------|----------|
| IAA < 0.75 | Mandatory calibration | 48 hours |
| New error pattern (3+) | Targeted training | 1 week |
| Spec release | Full training | 2 weeks |
| Annotator < 85% | Individual coaching | Immediate |

### 7.5 Response SLAs

| Urgency | Response | Resolution |
|---------|----------|------------|
| Critical | 2 hours | 24 hours |
| High | 24 hours | 1 week |
| Medium | 48 hours | 2 weeks |
| Low | 1 week | Next sprint |
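
For tooling that tracks error reports, the SLA table could be turned into concrete deadlines along these lines. This is a sketch only: the function name is hypothetical, durations are treated as calendar time rather than business hours (an assumption), and "Next sprint" is returned as `None` because the protocol does not define a sprint length.

```python
from datetime import datetime, timedelta

# (response window, resolution window) per urgency, from Section 7.5.
# None marks "Next sprint", which has no fixed duration in this document.
SLAS = {
    "Critical": (timedelta(hours=2), timedelta(hours=24)),
    "High": (timedelta(hours=24), timedelta(weeks=1)),
    "Medium": (timedelta(hours=48), timedelta(weeks=2)),
    "Low": (timedelta(weeks=1), None),
}


def sla_deadlines(urgency, submitted_at):
    """Return (respond_by, resolve_by); resolve_by is None for 'Next sprint'."""
    response, resolution = SLAS[urgency]
    respond_by = submitted_at + response
    resolve_by = (submitted_at + resolution) if resolution is not None else None
    return respond_by, resolve_by


submitted = datetime(2026, 1, 23, 9, 0)   # hypothetical submission time
respond_by, resolve_by = sla_deadlines("Critical", submitted)
print(respond_by.isoformat(), resolve_by.isoformat())
```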

---

## 8. Metrics Dashboard

### 8.1 Key QA KPIs

| KPI | Target | Alert |
|-----|--------|-------|
| Overall Accuracy | >= 92% | < 88% |
| IAA (Kappa) | >= 0.80 | < 0.75 |
| Critical Error Rate | < 2/1K | >= 5/1K |
| Audit Coverage | >= 10% | < 7% |
| Calibration Attendance | >= 90% | < 80% |
| Error Resolution Time | < 5 days | > 10 days |

### 8.2 Reporting Frequency

| Report | Frequency | Audience |
|--------|-----------|----------|
| Daily Snapshot | Daily | QA Lead |
| Weekly Summary | Weekly | Team + Management |
| Monthly Deep Dive | Monthly | Leadership |
| Quarterly Review | Quarterly | Executives |

### 8.3 Alert Configuration

```
              Green      Yellow       Red
Accuracy      >= 92%     88-92%       < 88%
IAA           >= 0.80    0.75-0.80    < 0.75
Critical Err  < 2/1K     2-5/1K       > 5/1K
Coverage      >= 12%     10-12%       < 10%

Yellow: Notify QA Lead
Red: Escalate + Immediate Action
```
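
The traffic-light bands above could be evaluated with a few comparisons, as in this sketch. The function and key names are hypothetical. It follows the Section 8.3 bands as written (which differ slightly from the Alert column in Section 8.1 for coverage and critical errors), and it resolves the band edges by reading "2-5/1K" and the lower yellow bounds as inclusive, which is an assumption.

```python
def higher_is_better(value, green_min, yellow_min):
    """Green at or above green_min, yellow down to yellow_min, red below."""
    if value >= green_min:
        return "green"
    if value >= yellow_min:
        return "yellow"
    return "red"


def critical_error_status(per_1k):
    """Lower is better: green under 2 per 1K spans, red above 5 per 1K."""
    if per_1k < 2:
        return "green"
    if per_1k <= 5:
        return "yellow"
    return "red"


def dashboard_status(accuracy_pct, iaa, critical_per_1k, coverage_pct):
    """Status per metric, using the Section 8.3 bands."""
    return {
        "accuracy": higher_is_better(accuracy_pct, 92, 88),
        "iaa": higher_is_better(iaa, 0.80, 0.75),
        "critical_errors": critical_error_status(critical_per_1k),
        "coverage": higher_is_better(coverage_pct, 12, 10),
    }


def overall(statuses):
    """Worst status wins: any red escalates, any yellow notifies the QA Lead."""
    for level in ("red", "yellow"):
        if level in statuses.values():
            return level
    return "green"


today = dashboard_status(accuracy_pct=93.0, iaa=0.78, critical_per_1k=1.0, coverage_pct=9.0)
print(today, overall(today))
```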

### 8.4 Dashboard Panels

1. **Accuracy Trend**: Line chart, 30-day rolling
2. **IAA Heatmap**: Annotator pairwise Kappa
3. **Error Distribution**: Stacked bar by severity
4. **Domain Performance**: Radar chart (O-P-J-E-A-V-R)
5. **Annotator Leaderboard**: Table with tiers
6. **Alert Status**: Traffic light indicators

### 8.5 Metric Formulas

```
Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
```
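
The alpha and F1 formulas above are compact, so a small Python sketch may help show what Do and De look like in practice. It covers the nominal-data case only (any two different labels count as a full disagreement), which is an assumption about how URT codes would be compared; the function names and sample labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` holds one list per span with the labels assigned by the
    annotators who coded it; spans with fewer than two labels are skipped.
    """
    # Coincidence matrix: each ordered pair of labels within a span
    # contributes 1 / (m - 1), where m is the number of labels on that span.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one span coded by two or more annotators")

    # Do / De simplifies to observed / expected with the terms below.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used, so no disagreement is possible
    return 1.0 - observed / expected


def f1_score(true_pos, false_pos, false_neg):
    """F1 as the harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical valence labels from three annotators on four spans.
spans = [["V+", "V+", "V+"], ["V+", "V+", "V-"], ["V-", "V-", "V-"], ["V-", "V-", "V-"]]
alpha = krippendorff_alpha_nominal(spans)
print(round(alpha, 3), round(f1_score(8, 2, 2), 3))
```

With one split vote across four spans, alpha comes out at 48/70 (about 0.686), which would sit below the 0.75 production minimum in Section 1.2. For production use, an established statistics package is likely a safer choice than hand-rolled code.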

---

## Document References

| Document | Location |
|----------|----------|
| URT-Specification-v5.1.md | `/urt-taxonomy/spec/` |
| A1-Annotator-Quickstart.md | `/urt-taxonomy/track-a-training/` |
| Gold Standard Corpus | `/urt-taxonomy/gold-standard/` |

---

*URT v5.1 QA Protocol | Track A: Training Materials*