Phase 0: Project restructure to ReviewIQ platform architecture

New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:22:08 +00:00
parent bb0291f265
commit 544e028c3f
37 changed files with 5782 additions and 30 deletions
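For orientation, here is a rough Python sketch of how a versioned scraper registry of the kind named in the commit message could work. It is an illustration only: the contents of scrapers/base.py and scrapers/registry.py are not shown in this diff, so every attribute and method below (`source`, `version`, `scrape`, `register`, `get`) is a hypothetical stand-in, and "highest version wins when none is requested" is an assumed routing rule.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Hypothetical interface; the real scrapers/base.py may differ."""

    source: str   # e.g. "google_reviews"
    version: str  # dotted version string, e.g. "1.0.0"

    @abstractmethod
    def scrape(self, target: str) -> list[dict]:
        """Return raw review records for the given target."""


class ScraperRegistry:
    """Maps (source, version) pairs to scraper classes."""

    def __init__(self) -> None:
        self._scrapers: dict[tuple[str, str], type[BaseScraper]] = {}

    def register(self, scraper_cls: type[BaseScraper]) -> type[BaseScraper]:
        key = (scraper_cls.source, scraper_cls.version)
        if key in self._scrapers:
            raise ValueError(f"duplicate scraper registration: {key}")
        self._scrapers[key] = scraper_cls
        return scraper_cls  # returning the class lets register() act as a decorator

    def get(self, source: str, version: str | None = None) -> type[BaseScraper]:
        if version is not None:
            return self._scrapers[(source, version)]
        # No version requested: pick the highest registered one (assumed rule).
        versions = [v for (s, v) in self._scrapers if s == source]
        if not versions:
            raise KeyError(source)
        latest = max(versions, key=lambda v: tuple(int(p) for p in v.split(".")))
        return self._scrapers[(source, latest)]


registry = ScraperRegistry()


@registry.register
class GoogleReviewsV100(BaseScraper):
    source = "google_reviews"
    version = "1.0.0"

    def scrape(self, target: str) -> list[dict]:
        return []  # placeholder; the real logic lives in v1_0_0.py
```

Keying on the `(source, version)` pair keeps older scraper versions addressable after a newer one is registered.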
--- a/.artifacts/ReviewIQ-Architecture-v2.md
+++ b/.artifacts/ReviewIQ-Architecture-v2.md
--- a/.artifacts/ReviewIQ-Architecture-v3.2.md
+++ b/.artifacts/ReviewIQ-Architecture-v3.2.md
--- a/.artifacts/ReviewIQ-Architecture-v3.md
+++ b/.artifacts/ReviewIQ-Architecture-v3.md
--- a/.artifacts/ReviewIQ-v32-Decisions.md
+++ b/.artifacts/ReviewIQ-v32-Decisions.md
@@ -0,0 +1,183 @@
+# ReviewIQ v3.2 Design Decisions
+
+> Fast context-recovery document — all key decisions without the full spec.
+
+---
+
+## 1. Markpoint
+
+```
+ID:       reviewiq-v32-span-layer-2026-01-24-001
+Status:   v3.2 span layer complete
+Based on: v3.1.2 (commit f998277)
+```
+
+---
+
+## 2. Core Design Decisions
+
+| Decision | Choice | Rationale |
+|----------|--------|-----------|
+| Span granularity | Clause/topic-level | Preserves multi-domain signal |
+| span_id format | ULID (TEXT) | Survives re-segmentation |
+| Span offsets | Required (NOT NULL) | Deterministic reconstruction |
+| Offsets reference | reviews_enriched.text | Not text_normalized |
+| Span → Issue mapping | One-to-one (UNIQUE span_id) | Atomic unit per issue |
+| Primary span enforcement | Partial unique index | At most one primary per review version |
+| Primary selection | I3>I2>I1, V->V±>V0>V+, span_index | Deterministic, stable |
+| Reprocessing strategy | Soft-switch with is_active | No transient empty states |
+| Span overlap | GiST exclusion constraint | Non-overlapping ranges enforced |
+| Secondary codes | Array with cardinality ≤ 2 | Could normalize to link table later |
+| Causal chain storage | JSONB | Flexibility, normalize later if needed |
+| relation_type vs causal_chain | Separate concerns | relation = within-review, causal = root cause |
+| Dimension columns | Postgres ENUMs | Type safety, storage efficiency |
+| Trust score floor | 0.2 (GREATEST clamp) | Prevent multiplicative collapse |
+| Issue routing key | (business_id, place_id, urt_primary, entity_normalized) | Deterministic, entity-aware |
+| Issue ID generation | SHA256 via pgcrypto | Deterministic, collision-resistant |
+| Text validation trigger | Conditional via session setting | Performance: skip in bulk loads |
+| Relation validation | Application-level post-insert | Handles insertion order |
+
+---
+
+## 3. Extensions Required
+
+| Extension | Purpose |
+|-----------|---------|
+| `btree_gist` | Exclusion constraint for non-overlapping spans |
+| `pgcrypto` | SHA256-based issue ID generation |
+
+---
+
+## 4. New Tables
+
+| Table | Purpose |
+|-------|---------|
+| `review_spans` | Span-level URT classification |
+| `review_span_secondary_codes` | (Optional) Normalized secondary codes |
+
+---
+
+## 5. Modified Tables
+
+| Table | Changes |
+|-------|---------|
+| `issue_spans` | Added `span_id` FK (NOT NULL); the direct review FK is no longer the canonical link |
+
+---
+
+## 6. New ENUM Types
+
+**Valence & Intensity:**
+- `urt_valence` — V-, V±, V0, V+
+- `urt_intensity` — I1, I2, I3
+
+**Specificity & Actionability:**
+- `urt_specificity` — S1, S2, S3
+- `urt_actionability` — A1, A2, A3
+
+**Context & Evidence:**
+- `urt_temporal` — T1, T2, T3
+- `urt_evidence` — E1, E2, E3
+- `urt_comparative` — CR1, CR2, CR3
+
+**Classification:**
+- `urt_profile` — factual, emotional, comparative, etc.
+- `urt_confidence` — low, medium, high
+- `urt_relation` — elaborates, contrasts, causes, etc.
+- `urt_entity_type` — person, product, location, etc.
+
+---
+
+## 7. Key Functions
+
+| Function | Purpose |
+|----------|---------|
+| `urt_validate_causal_chain()` | Validates causal JSONB structure |
+| `validate_review_relations()` | Ensures `related_span_id` points to a span with the same parent review |
+| `validate_active_spans()` | Ensures valid active span set |
+| `set_primary_span()` | Deterministic primary selection |
+| `generate_issue_id()` | SHA256-based issue ID |
+
+---
+
+## 8. Key Triggers
+
+| Trigger | Purpose |
+|---------|---------|
+| `review_spans_validate_bounds` | span_end ≤ text length |
+| `review_spans_validate_text` | span_text matches substring |
+| `review_spans_validate_causal_chain` | causal_chain JSONB valid |
+
+---
+
+## 9. USN Format
+
+```
+Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
+Full:     URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
+```
+
+**Examples:**
+- `URT:S:SVC.SPD:V-I3:S3A3T2.E2.CR1` — Specific service speed complaint
+- `URT:F:PRD.QUA:V+I2:S2A1T1.E3.CR2:staff→training` — Product quality praise with causal chain
+
+---
+
+## 10. Span Boundary Rules
+
+1. **Split on contrasting conjunctions** — "but", "however", "although"
+2. **Split on topic/target change** — Different entity or aspect
+3. **Split on valence change** — Positive → Negative or vice versa
+4. **Split on domain change** — SVC → PRD → AMB
+5. **Keep cause→effect together** — Causal chain stays in one span
+
+---
+
+## 11. Deferred to v3.3+
+
+| Item | Reason |
+|------|--------|
+| Entity extraction implementation | Requires NER pipeline |
+| Trust-weighted fact aggregation | Needs more span data |
+| Secondary domain enforcement | App-level validation sufficient |
+| Span-based fact counting | Currently review-based, optimize later |
+
+---
+
+## 12. Open Questions Resolved
+
+| Question | Resolution |
+|----------|------------|
+| Span → Issue cardinality? | **One-to-one** (not many-to-many) |
+| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
+| Reprocessing strategy? | **Soft-switch** with is_active flag |
+| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
+
+---
+
+## Quick Reference
+
+### Primary Span Selection Algorithm
+
+```
+ORDER BY:
+  1. intensity DESC (I3 > I2 > I1)
+  2. valence ASC (V- > V± > V0 > V+)
+  3. span_index ASC (first wins ties)
+```
+
+### Issue Routing Key
+
+```sql
+(business_id, place_id, urt_primary, entity_normalized)
+```
+
+### Trust Score Calculation
+
+```sql
+GREATEST(0.2, base_trust * modifiers)  -- Floor prevents collapse
+```
+
+---
+
+*Last updated: 2026-01-24*
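The Quick Reference rules for primary span selection and the trust score floor can be sketched in Python as follows. This is a minimal illustration, not the schema's `set_primary_span()` implementation: `Span`, `select_primary`, and `trust_score` are hypothetical names, and treating `modifiers` as a list of multiplicative factors is an assumption based on the "multiplicative collapse" rationale.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence

# Lower rank sorts first, mirroring the documented ORDER BY.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}            # intensity DESC
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}     # valence ASC

TRUST_FLOOR = 0.2


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a row of review_spans."""
    span_index: int
    intensity: str
    valence: str


def select_primary(spans: Sequence[Span]) -> Span:
    """Pick the primary span: intensity, then valence, then earliest index."""
    if not spans:
        raise ValueError("a review version needs at least one span")
    return min(
        spans,
        key=lambda s: (
            INTENSITY_RANK[s.intensity],
            VALENCE_RANK[s.valence],
            s.span_index,
        ),
    )


def trust_score(base_trust: float, modifiers: Iterable[float]) -> float:
    """Multiply the modifiers into base_trust and clamp at the 0.2 floor."""
    score = base_trust
    for factor in modifiers:
        score *= factor
    return max(TRUST_FLOOR, score)


spans = [
    Span(span_index=0, intensity="I2", valence="V+"),
    Span(span_index=1, intensity="I3", valence="V+"),
    Span(span_index=2, intensity="I3", valence="V-"),
]
primary = select_primary(spans)
```

With these three spans, `select_primary` returns span 2: it ties with span 1 on intensity and wins on valence. Because every key component is total and `span_index` is unique within a review, the result does not depend on input order.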
--- a/.artifacts/URT-v5.1-Reference.md
+++ b/.artifacts/URT-v5.1-Reference.md
@@ -0,0 +1,331 @@
+# Universal Review Taxonomy (URT) v5.1 Reference
+
+## Overview
+
+The Universal Review Taxonomy (URT) is a classification system for customer feedback. It provides a structured approach to categorizing, annotating, and analyzing review content across any industry.
+
+### Key Characteristics
+
+- **Three Profiles**: Core, Standard, Full (increasing detail)
+- **Seven Domains**: Covering all aspects of customer experience
+- **Tier-3 Canonical Codes**: Format `X#.##` (e.g., J1.02, P2.15)
+- **Dimensional Annotation**: Valence, intensity, specificity, and more
+- **Causal Analysis**: Root cause chains (Full profile)
+
+---
+
+## Domain Codes
+
+URT organizes feedback into seven domains, each identified by a single letter.
+
+| Domain | Letter | Description |
+|--------|--------|-------------|
+| Offering | O | Product/service quality |
+| Price | P | Value, pricing, promotions |
+| Journey | J | Customer experience, timing, process |
+| Environment | E | Physical/digital space |
+| Attitude | A | Staff behavior, service attitude |
+| Voice | V | Brand, communication, marketing |
+| Relationship | R | Loyalty, trust, long-term relationship |
+
+### Tier-3 Code Format
+
+```
+Pattern: [OPJEAVR][1-4]\.[0-9]{2}
+```
+
+Examples:
+- `J1.02` - Journey domain, category 1, subcategory 02
+- `P2.15` - Price domain, category 2, subcategory 15
+- `A3.01` - Attitude domain, category 3, subcategory 01
+
+---
+
+## Dimension Codes
+
+### Valence
+
+Indicates the sentiment direction of the feedback.
+
+| Code | Meaning |
+|------|---------|
+| V+ | Positive |
+| V- | Negative |
+| V0 | Neutral |
+| V± | Mixed |
+
+### Intensity
+
+Indicates the strength of the expressed sentiment.
+
+| Code | Meaning |
+|------|---------|
+| I1 | Low intensity |
+| I2 | Moderate intensity |
+| I3 | High intensity |
+
+### Specificity (Standard+)
+
+Indicates how detailed the feedback is.
+
+| Code | Meaning |
+|------|---------|
+| S1 | Low - vague, general |
+| S2 | Medium - some detail |
+| S3 | High - specific, precise |
+
+### Actionability (Standard+)
+
+Indicates whether clear actions can be derived from the feedback.
+
+| Code | Meaning |
+|------|---------|
+| A1 | None - no clear action |
+| A2 | Unclear - possible actions |
+| A3 | Clear - specific actionable |
+
+### Temporal (Standard+)
+
+Indicates the time frame referenced in the feedback.
+
+| Code | Meaning | Markers |
+|------|---------|---------|
+| TC | Current - this visit | "today", "this time", "yesterday" |
+| TR | Recent - last few visits | "lately", "recently", "again" |
+| TH | Historical - long-standing | "for years", "always", "historically" |
+| TF | Future - expectations | "I won't come back", "next time" |
+
+**Default**: TC when no temporal language exists.
+
+### Evidence (Standard+)
+
+Indicates how the information was obtained from the text.
+
+| Code | Meaning | Example |
+|------|---------|---------|
+| ES | Stated - explicit in text | "Waited 45 minutes" |
+| EI | Inferred - logically entailed | "Took 3 weeks to reply" → slow response |
+| EC | Contextual - depends on context | "That happened again" |
+
+**Default**: ES. Use EI/EC only when needed.
+
+### Comparative
+
+Indicates whether the feedback compares to alternatives.
+
+| Code | Meaning |
+|------|---------|
+| CR-N | No comparison |
+| CR-B | Better than alternatives |
+| CR-W | Worse than alternatives |
+| CR-S | Same as alternatives |
+
+---
+
+## USN (URT String Notation)
+
+USN is a compact string encoding for URT annotations.
+
+### Grammar
+
+```
+Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
+Full:     URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
+```
+
+### Encoding Rules
+
+**Valence**:
+- `+` for V+
+- `-` for V-
+
+**Intensity**:
+- `1` for I1
+- `2` for I2
+- `3` for I3
+
+### Examples
+
+**Standard Profile**:
+```
+URT:S:J1.03:-2:22TC.ES.N
+```
+Decoded:
+- Profile: Standard
+- Code: J1.03
+- Valence: V- (negative)
+- Intensity: I2
+- Specificity: S2
+- Actionability: A2
+- Temporal: TC
+- Evidence: ES
+- Comparative: CR-N
+
+**Full Profile with Causal Chain**:
+```
+URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O
+```
+Decoded:
+- Profile: Full
+- Codes: J1.01, A1.04
+- Valence: V- (negative)
+- Intensity: I3
+- Specificity: S2
+- Actionability: A3
+- Temporal: TR
+- Evidence: EI
+- Comparative: CR-S
+- Causal: CD.O (`CD-O`, Conditions: Operational), MG.O (`MG-O`, Management: Oversight)
+
+---
+
+## Causal Chain (Full Profile Only)
+
+The causal chain identifies root causes across three layers, ordered from immediate to systemic.
+
+### Layers
+
+| Layer | Codes | Scope |
+|-------|-------|-------|
+| conditions | CD-S, CD-T, CD-E, CD-F, CD-O | Staff State, Team Dynamics, Equipment, Facility, Operational |
+| management | MG-P, MG-T, MG-O, MG-R, MG-C | Planning, Training, Oversight, Resources, Communication |
+| systemic | SY-R, SY-P, SY-C, SY-S, SY-H, SY-X | Resource Decisions, Policy, Culture, Standards, Human Capital, External |
+
+### Code Reference
+
+**Conditions Layer**:
+- `CD-S` - Staff State
+- `CD-T` - Team Dynamics
+- `CD-E` - Equipment
+- `CD-F` - Facility
+- `CD-O` - Operational
+
+**Management Layer**:
+- `MG-P` - Planning
+- `MG-T` - Training
+- `MG-O` - Oversight
+- `MG-R` - Resources
+- `MG-C` - Communication
+
+**Systemic Layer**:
+- `SY-R` - Resource Decisions
+- `SY-P` - Policy
+- `SY-C` - Culture
+- `SY-S` - Standards
+- `SY-H` - Human Capital
+- `SY-X` - External
+
+### JSONB Schema
+
+```json
+[
+  {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
+  {"layer": "management", "code": "MG-P", "evidence": "EI"}
+]
+```
+
+### Constraints
+
+- Maximum 3 entries (one per layer)
+- Only include when text explicitly supports it
+- Order: conditions → management → systemic
+
+---
+
+## Span Boundary Detection Rules
+
+Spans are detected at the clause/topic level, not sentence level.
+
+### Split Rules (in priority order)
+
+1. **Split on contrasting conjunctions**: but, however, although, despite, yet
+2. **Split when subject/target changes** (topic shift)
+3. **Split when valence changes** (positive ↔ negative)
+4. **Split when domain changes** (O/P/J/E/A/V/R)
+5. **Keep together** for cause→effect within same feedback unit
+
+### Guidelines
+
+- **Maximum**: ~3 spans per sentence
+- **Validation**: If 4+ spans detected, re-check for over-splitting
+
+### Example
+
+**Input**:
+> "The food was great but the service was slow and the bathroom was dirty."
+
+**Output**: 3 spans
+1. "The food was great" (Offering, positive)
+2. "the service was slow" (Journey/Attitude, negative)
+3. "the bathroom was dirty" (Environment, negative)
+
+**Reasoning**: Both boundaries involve a topic and domain shift; the first also falls on a contrasting conjunction ("but") with a valence change.
+
+---
+
+## Primary Span Selection
+
+When a review contains multiple spans, select the primary span using these criteria in order:
+
+### Selection Priority
+
+1. **Highest intensity** (I3 > I2 > I1)
+2. **Tie-break**: Negative over positive (V- > V± > V0 > V+)
+3. **Tie-break**: Earliest span_index
+
+### Example
+
+Given spans:
+- Span 0: I2, V+
+- Span 1: I3, V+
+- Span 2: I3, V-
+
+**Primary**: Span 2 (highest intensity I3, negative valence wins tie-break)
+
+---
+
+## Secondary Codes Rules
+
+Secondary codes capture additional topics mentioned in a span.
+
+### Constraints
+
+- **Maximum**: 2 secondary codes
+- **Format**: Must be Tier-3 (X#.##)
+- **Recommendation**: Should be different domain from primary
+
+### Example
+
+Primary: `J1.03` (Journey)
+Secondary: `A2.01`, `E1.05` (Attitude, Environment)
+
+---
+
+## Quick Reference Card
+
+### Profiles
+
+| Profile | Dimensions | Causal Chain |
+|---------|------------|--------------|
+| Core | V, I | No |
+| Standard | V, I, S, A, T, E, CR | No |
+| Full | V, I, S, A, T, E, CR | Yes |
+
+### USN Quick Format
+
+```
+URT:{S|F}:{tier3_codes}:{valence}{intensity}:{SAT}.{E}.{CR}[:{causal}]
+```
+
+### Domain Letters
+
+```
+O P J E A V R
+│ │ │ │ │ │ └─ Relationship
+│ │ │ │ │ └─── Voice
+│ │ │ │ └───── Attitude
+│ │ │ └─────── Environment
+│ │ └───────── Journey
+│ └─────────── Price
+└───────────── Offering
+```