
Alejandro Gutiérrez 3eda9bdbfa Add complete URT v5.1 taxonomy framework (11 artifacts)

Universal Review Taxonomy v5.1 implementation with:
- Track A (Training): A1 Quickstart, A2 QA Protocol, A3 Calibration Set, A4 Full Manual
- Track B (Engineering): B1 Code Registry, B2 Database Schema, B3 Owner Routing, B4 API Contract
- Track C (Analytics): C1 Issue Lifecycle, C2 KPI Mapping Guide
- Track D (Integration): D1 Dashboard Specification

Covers 7 domains, 28 categories, 138 subcodes, 16 causal codes, and 7 metadata dimensions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 10:51:41 +00:00


A2: QA Protocol

Universal Review Taxonomy (URT) v5.1

Purpose: Define quality assurance processes for URT annotation
Version: 5.1 | Status: Production Ready | Date: 2026-01-23

Inter-Annotator Agreement (IAA) Metrics
Calibration Sessions
Error Categories and Severity
Audit Procedures
Gold Standard Management
Quality Gates
Feedback Loop
Metrics Dashboard

1. Inter-Annotator Agreement (IAA) Metrics

1.1 Cohen's Kappa Thresholds by Code Tier

Code Tier	Minimum Kappa	Target Kappa
Domain (Tier 1)	0.80	0.90+
Category (Tier 2)	0.75	0.85+
Subcode (Tier 3)	0.70	0.80+
Valence	0.85	0.92+
Intensity	0.70	0.80+
Comparative Reference	0.75	0.85+
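
The thresholds above can be checked with a small helper. The following is a minimal sketch, assuming each annotator's labels for one tier arrive as equal-length lists; the names `cohens_kappa`, `meets_minimum`, and `MIN_KAPPA` are illustrative and not part of the URT tooling.

```python
from collections import Counter

# Minimum kappa per code tier, copied from the table in section 1.1.
MIN_KAPPA = {
    "domain": 0.80,
    "category": 0.75,
    "subcode": 0.70,
    "valence": 0.85,
    "intensity": 0.70,
    "comparative_reference": 0.75,
}


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same spans."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of spans where both annotators agree.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    pe = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if pe == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (po - pe) / (1 - pe)


def meets_minimum(tier, kappa):
    """True when kappa reaches the minimum for the given tier."""
    return kappa >= MIN_KAPPA[tier]
```

For example, `cohens_kappa(["P", "P", "O", "O"], ["P", "P", "O", "P"])` gives 0.5, which is below the Domain minimum of 0.80.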

1.2 Krippendorff's Alpha (3+ Annotators)

Scenario	Minimum Alpha	Target Alpha
Initial Training	0.67	0.75+
Production Quality	0.75	0.85+
Gold Standard Creation	0.85	0.90+
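
For three or more annotators, alpha can be computed from a coincidence matrix. The sketch below covers nominal labels only and assumes one list of labels per span, with `None` marking a missing annotation; the function name and input layout are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: one list per span holding the label from each annotator
    (None where an annotator skipped the span).
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # spans with fewer than two labels carry no pairing information
        # Every ordered pair of labels within a span counts 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one span with two or more labels")

    # Both terms omit the common factor 1/n, which cancels in the ratio Do / De.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label was ever used
    return 1 - observed / expected
```

With two annotators and the spans `[["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]`, this returns 8/15 (about 0.53), which falls below the 0.67 minimum for Initial Training.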

1.3 Agreement Interpretation Scale

Range	Interpretation	Action
0.90 - 1.00	Almost Perfect	Maintain standards
0.80 - 0.89	Excellent	Minor calibration
0.70 - 0.79	Good	Schedule calibration
0.60 - 0.69	Moderate	Mandatory retraining
< 0.60	Poor	Suspend, reassess
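
The scale above maps directly to a lookup. This is a minimal sketch; it assumes each band's lower bound is inclusive, so values that fall between the printed ranges (for example 0.895) are assigned to the lower band. The function name is illustrative.

```python
# (lower bound, interpretation, action), ordered from best to worst.
AGREEMENT_BANDS = [
    (0.90, "Almost Perfect", "Maintain standards"),
    (0.80, "Excellent", "Minor calibration"),
    (0.70, "Good", "Schedule calibration"),
    (0.60, "Moderate", "Mandatory retraining"),
]


def interpret_agreement(score):
    """Return (interpretation, action) for a kappa or alpha value."""
    for lower, label, action in AGREEMENT_BANDS:
        if score >= lower:
            return label, action
    return "Poor", "Suspend, reassess"
```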

1.4 Profile-Specific Requirements

Profile	Domain	Category	Subcode	Overall Target
URT-Lite	>= 0.85	N/A	N/A	>= 0.85
URT-Core	>= 0.85	>= 0.80	N/A	>= 0.80
URT-Standard/Full	>= 0.85	>= 0.80	>= 0.75	>= 0.78

2. Calibration Sessions

2.1 Session Frequency

Team Size	Daily	Weekly	Monthly
1-3 annotators	--	30min	2hr
4-10 annotators	15min	1hr	3hr
11+ annotators	15min	2x 1hr	4hr

2.2 Weekly Session Structure (60 min)

Time	Activity
0-5 min	Review IAA metrics from past week
5-15 min	Discuss 3 highest-disagreement spans
15-35 min	Group annotation exercise (5 spans)
35-50 min	Compare results, discuss differences
50-60 min	Document decisions, update guidance

2.3 Materials Checklist

5-10 pre-selected review spans
IAA metrics report
Top disagreement patterns list
Gold standard examples
A1 Quickstart Guide
Session notes template

2.4 Outcome Documentation

## Calibration Session: [DATE]

### Disagreement Patterns
1. Pattern: [Description]
   Resolution: [Decision]

### Exercise Results
| Span | Consensus | Votes | Notes |
|------|-----------|-------|-------|

### Action Items
- [ ] Update A1 Section X
- [ ] Add gold standard example

3. Error Categories and Severity

3.1 Error Severity Matrix

Severity	Weight	Description
Critical	1.0	Fundamentally wrong
Major	0.5	Significant deviation
Minor	0.25	Suboptimal but defensible
Slip	0.1	Typo/formatting

3.2 Critical Errors (Weight: 1.0)

Error Type	Example
Wrong Domain	"Rude waiter" coded as O instead of P
Wrong Valence	Complaint coded as V+
Valence Omission	No valence assigned
Profile Violation	Subcode in URT-Lite

3.3 Major Errors (Weight: 0.5)

Error Type	Example
Wrong Category	J1 (Timing) vs J4 (Resolution)
Intensity Off by 2	"TERRIBLE!!!" coded as I1
Wrong CR Direction	"Gone downhill" coded as CR-B
Missed/Over Split	Two issues merged OR single split
J4/R3 Confusion	Process vs Ownership
V/R Confusion	"Total scam" coded as V4.01

3.4 Minor Errors (Weight: 0.25)

Error Type	Example
Wrong Subcode (Same Category)	P1.01 vs P1.02 within P1
Intensity Off by 1	"pretty good" as I1 vs I2
Borderline Secondary	Questionable secondary code

3.5 Slips (Weight: 0.1)

Typos, formatting errors, span boundary off by fewer than 5 characters

3.6 Error Severity Decision Tree

Is DOMAIN wrong?            --> YES: CRITICAL
Is VALENCE wrong/missing?   --> YES: CRITICAL
Is CATEGORY wrong?          --> YES: MAJOR
Is INTENSITY off by 2?      --> YES: MAJOR
Is SUBCODE wrong?           --> YES: MINOR
Is it formatting/typo?      --> YES: SLIP
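
The decision tree can be read as a first-match rule list. The sketch below is one possible encoding; the parameter names are hypothetical, and the intensity-off-by-1 rule is taken from section 3.4 rather than from the tree itself.

```python
def classify_error(
    domain_wrong=False,
    valence_wrong_or_missing=False,
    category_wrong=False,
    intensity_gap=0,
    subcode_wrong=False,
    formatting_issue=False,
):
    """Return the severity of the most serious problem, or None if no error."""
    if domain_wrong:
        return "Critical"
    if valence_wrong_or_missing:
        return "Critical"
    if category_wrong:
        return "Major"
    if abs(intensity_gap) >= 2:
        return "Major"
    if subcode_wrong:
        return "Minor"
    if abs(intensity_gap) == 1:
        return "Minor"  # listed in section 3.4, not in the tree
    if formatting_issue:
        return "Slip"
    return None
```

Because the checks run top to bottom, a span with both a wrong domain and a wrong subcode is reported once, as Critical.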

3.7 Accuracy Calculation

Error Score = (Sum(error_weight * count) / total_spans) * 100%
Accuracy = 100% - Error Score

Thresholds:
  > 95%  = Excellent
  90-95% = Good
  85-90% = Acceptable
  < 85%  = Below Standard
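
As a worked example of the calculation, the sketch below applies the weights from section 3.1 to a set of error counts. The function names are illustrative, and the handling of boundary values (exactly 95%, 90%, 85%) is an assumption, since the thresholds do not say which band they belong to.

```python
# Severity weights from the matrix in section 3.1.
ERROR_WEIGHTS = {"critical": 1.0, "major": 0.5, "minor": 0.25, "slip": 0.1}


def accuracy_percent(error_counts, total_spans):
    """Weighted accuracy in percent for a batch of audited spans."""
    if total_spans <= 0:
        raise ValueError("total_spans must be positive")
    weighted = sum(ERROR_WEIGHTS[sev] * n for sev, n in error_counts.items())
    return 100.0 * (1 - weighted / total_spans)


def accuracy_rating(accuracy):
    """Map an accuracy percentage to its threshold label.

    Assumption: a value exactly on a boundary falls into the lower band
    at 95 and into the higher band at 90 and 85.
    """
    if accuracy > 95:
        return "Excellent"
    if accuracy >= 90:
        return "Good"
    if accuracy >= 85:
        return "Acceptable"
    return "Below Standard"
```

For 200 audited spans with 2 critical, 4 major, 8 minor errors and 10 slips, the weighted error sum is 2 + 2 + 2 + 1 = 7, so accuracy is 100 * (1 - 7/200) = 96.5%, rated Excellent.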

4. Audit Procedures

4.1 Sampling Methodology

Random: Equal probability selection for general monitoring
Stratified: Ensure representation across domains, annotators, edge cases

Stratum	Minimum Sample
Each Domain (O-R)	5% of total
Each Annotator	10% of output
High-Intensity (I3)	15% of I3 spans
Non-default CR	25% of CR-B/W/S

4.2 Sample Size by Volume

Daily Volume	Audit Rate
< 100 spans	30%
100-500	20%
500-2000	15%
2000-10000	10%
> 10000	7%
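
The volume bands translate into a simple lookup for the daily sample size. This is a sketch under one assumption: where two bands share an endpoint (500, 2000), the value is assigned to the higher-volume band, while exactly 10000 stays in the 10% band because the last row reads "> 10000". Function names are illustrative.

```python
def audit_rate_percent(daily_volume):
    """Audit rate in whole percent for a given daily span volume."""
    if daily_volume < 100:
        return 30
    if daily_volume < 500:
        return 20
    if daily_volume < 2000:
        return 15
    if daily_volume <= 10000:
        return 10
    return 7


def audit_sample_size(daily_volume):
    """Number of spans to audit, rounded up to a whole span."""
    rate = audit_rate_percent(daily_volume)
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-daily_volume * rate // 100)
```

For instance, a day with 1000 spans falls in the 15% band and yields a sample of 150 spans.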

4.3 Audit Frequency

Type	Frequency	Owner
Spot Check	Daily	QA Lead
Sample Audit	Weekly	QA Team
Full Audit	Monthly	Senior QA
External	Quarterly	External

4.4 Audit Workflow

[Daily Output] --> [Sample] --> [Blind Re-code]
       |
       v
[Compare] --> Match: [Log]
       |
       +--> Mismatch: [Classify Error] --> [Route] --> [Aggregate]

4.5 Escalation Paths

Error Score	Level	Action
< 10%	1	Self-correction
10-15%	2	QA Lead review
15-20%	3	Team calibration
> 20%	4	Management escalation
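
The escalation table can be expressed as a small function. In this sketch the Error Score is a percentage, and the shared endpoints are resolved by assumption: 10% and 15% escalate to the higher level, and exactly 20% remains at level 3 because level 4 is defined as "> 20%".

```python
def escalation_level(error_score_percent):
    """Return (level, action) for an Error Score given in percent."""
    if error_score_percent < 10:
        return 1, "Self-correction"
    if error_score_percent < 15:
        return 2, "QA Lead review"
    if error_score_percent <= 20:
        return 3, "Team calibration"
    return 4, "Management escalation"
```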

5. Gold Standard Management

5.1 Corpus Requirements

Metric	Minimum	Target
Total Spans	500	1000+
Per Domain	50	100+
Per Category	10	25+
Edge Cases	100	200+

5.2 Creation Process

[Candidate] --> [3+ Annotators Classify] --> [Calculate Alpha]
                                                    |
                                   Alpha >= 0.85: [Add to Gold]
                                   Alpha < 0.85: [Discuss/Reject]

5.3 Gold Standard Documentation

{
  "gold_id": "GS-2026-001",
  "span_text": "The waiter was incredibly rude",
  "classification": {
    "primary_code": "P1.02",
    "valence": "V-",
    "intensity": "I3"
  },
  "rationale": "Clear disrespect. 'Incredibly' indicates I3.",
  "common_mistakes": ["P1.01 (Warmth)"],
  "agreement_score": 0.92,
  "version": "5.1",
  "status": "active"
}

5.4 Version Control

Change Type	Version Bump
Add example	Patch (5.1.1)
Fix error	Patch
Spec alignment	Minor (5.2)
Taxonomy change	Major (6.0)

5.5 Retirement Criteria

Spec change invalidates example
Systematic confusion traced to example
Industry shifts make obsolete

6. Quality Gates

6.1 New Annotator Qualification

Week 1: Training (A1 Guide, Spec, Videos)
Week 2: Supervised Practice (100 spans, daily feedback)
Week 3: Qualification Exam (50 gold spans, blind)
        |
        Pass (>= 85%): Production + 30-day probation
        Fail: Remediation + retake

Passing Criteria:

Overall Accuracy >= 85%
Domain Accuracy >= 90%
Critical Errors = 0
Major Errors <= 3
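
All four passing criteria must hold at once. A minimal sketch of the check, with hypothetical parameter names and accuracies given in percent:

```python
def passes_qualification(overall_accuracy, domain_accuracy,
                         critical_errors, major_errors):
    """True only when every passing criterion from section 6.1 is met."""
    return (
        overall_accuracy >= 85
        and domain_accuracy >= 90
        and critical_errors == 0
        and major_errors <= 3
    )
```

A candidate with 90% overall accuracy still fails if a single critical error was recorded on the exam.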

6.2 Production Annotator Requirements

Requirement	Frequency	Threshold
Accuracy Check	Weekly	>= 90%
Calibration	Weekly	90% attendance
Gold Quiz	Monthly	>= 85%
IAA with Peers	Bi-weekly	Kappa >= 0.75

6.3 Annotator Tiers

Tier	Accuracy	Audit Rate
Expert	>= 95%	5%
Senior	92-95%	10%
Standard	88-92%	15%
Developing	85-88%	25%
Probation	< 85%	50%
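
Tier assignment follows from the accuracy bands. The sketch below assumes each band's lower bound is inclusive, so an annotator at exactly 92% is Senior rather than Standard; the table itself leaves shared endpoints open.

```python
# (minimum accuracy in percent, tier name, audit rate in percent)
TIERS = [
    (95, "Expert", 5),
    (92, "Senior", 10),
    (88, "Standard", 15),
    (85, "Developing", 25),
]


def annotator_tier(accuracy):
    """Return (tier, audit_rate_percent) for an accuracy percentage."""
    for minimum, tier, audit_rate in TIERS:
        if accuracy >= minimum:
            return tier, audit_rate
    return "Probation", 50
```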

6.4 Automated System Thresholds

Metric	Minimum	Production	Best
Domain Accuracy	85%	90%	95%
Category Accuracy	80%	85%	90%
Subcode Accuracy	75%	80%	85%
Valence F1	0.88	0.92	0.96

6.5 Release Criteria

All accuracy metrics meet thresholds
Gold standard test documented
Error analysis completed
Rollback plan in place
Stakeholder sign-off

7. Feedback Loop

7.1 Error Reporting

ERROR REPORT FIELDS:
- Reporter, Date, Span ID
- Type: [Spec Unclear | Gold Issue | Edge Case | Tool Bug]
- Description
- Suggested Resolution
- Urgency: [Critical | High | Medium | Low]

7.2 Triage Process

[Error Submitted] --> [QA Lead (24hr)]
        |
        +--> Spec Issue --> PM
        +--> Gold Issue --> QA Team
        +--> Tool Bug --> Engineering
        +--> Training Gap --> QA Lead

7.3 Spec Clarification Process

[Ambiguity] --> Check A1 Guide --> Found: Apply
                                   |
                                   Not Found: Submit Request
                                        |
                                   PM + QA Review
                                        |
                                   Accept: Update A1/Spec
                                   Reject: Document Rationale

7.4 Training Update Triggers

Trigger	Action	Timeline
IAA < 0.75	Mandatory calibration	48 hours
New error pattern (3+)	Targeted training	1 week
Spec release	Full training	2 weeks
Annotator < 85%	Individual coaching	Immediate

7.5 Response SLAs

Urgency	Response	Resolution
Critical	2 hours	24 hours
High	24 hours	1 week
Medium	48 hours	2 weeks
Low	1 week	Next sprint

8. Metrics Dashboard

8.1 Key QA KPIs

KPI	Target	Alert
Overall Accuracy	>= 92%	< 88%
IAA (Kappa)	>= 0.80	< 0.75
Critical Error Rate	< 2/1K	>= 5/1K
Audit Coverage	>= 10%	< 7%
Calibration Attendance	>= 90%	< 80%
Error Resolution Time	< 5 days	> 10 days

8.2 Reporting Frequency

Report	Frequency	Audience
Daily Snapshot	Daily	QA Lead
Weekly Summary	Weekly	Team + Management
Monthly Deep Dive	Monthly	Leadership
Quarterly Review	Quarterly	Executives

8.3 Alert Configuration

              Green      Yellow     Red
Accuracy      >= 92%     88-92%     < 88%
IAA           >= 0.80    0.75-0.80  < 0.75
Critical Err  < 2/1K     2-5/1K     > 5/1K
Coverage      >= 12%     10-12%     < 10%

Yellow: Notify QA Lead
Red: Escalate + Immediate Action
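
The traffic-light bands can be evaluated with one rule table. This is a sketch only: the metric keys are hypothetical, and values that sit exactly on a yellow endpoint are treated as shown in the comments.

```python
# metric -> (green threshold, red threshold, higher_is_better)
ALERT_RULES = {
    "accuracy": (92.0, 88.0, True),          # percent
    "iaa": (0.80, 0.75, True),               # kappa
    "critical_per_1k": (2.0, 5.0, False),    # critical errors per 1,000 spans
    "coverage": (12.0, 10.0, True),          # percent audited
}


def alert_status(metric, value):
    """Return 'green', 'yellow', or 'red' for a metric value."""
    green, red, higher_is_better = ALERT_RULES[metric]
    if higher_is_better:
        if value >= green:
            return "green"
        if value < red:
            return "red"
        return "yellow"  # includes a value exactly at the red threshold
    if value < green:
        return "green"
    if value > red:
        return "red"
    return "yellow"  # includes values exactly at either threshold
```

Under these rules, yellow results would be routed to the QA Lead and red results escalated, as stated above.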

8.4 Dashboard Panels

Accuracy Trend: Line chart, 30-day rolling
IAA Heatmap: Annotator pairwise Kappa
Error Distribution: Stacked bar by severity
Domain Performance: Radar chart (O-P-J-E-A-V-R)
Annotator Leaderboard: Table with tiers
Alert Status: Traffic light indicators

8.5 Metric Formulas

Accuracy = 1 - (Sum(error_weight * count) / total_spans)

Cohen's Kappa = (Po - Pe) / (1 - Pe)
  Po = Observed agreement
  Pe = Expected agreement by chance

Krippendorff's Alpha = 1 - (Do / De)
  Do = Observed disagreement
  De = Expected disagreement

F1 = 2 * (Precision * Recall) / (Precision + Recall)
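
The F1 formula is typically evaluated from true-positive, false-positive, and false-negative counts. A minimal sketch, for example for the Valence F1 thresholds in section 6.4; returning 0.0 when a denominator is zero is a convention chosen here, not something the protocol specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8, so F1 is also 0.8, below the 0.88 minimum for Valence.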

Document References

Document	Location
URT-Specification-v5.1.md	`/urt-taxonomy/spec/`
A1-Annotator-Quickstart.md	`/urt-taxonomy/track-a-training/`
Gold Standard Corpus	`/urt-taxonomy/gold-standard/`

URT v5.1 QA Protocol | Track A: Training Materials
