Phase 0: Project restructure to ReviewIQ platform architecture

New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-24 15:22:08 +00:00
parent bb0291f265
commit 544e028c3f
37 changed files with 5782 additions and 30 deletions


@@ -0,0 +1,183 @@
# ReviewIQ v3.2 Design Decisions
> Fast context-recovery document — all key decisions without the full spec.
---
## 1. Markpoint
```
ID: reviewiq-v32-span-layer-2026-01-24-001
Status: v3.2 span layer complete
Based on: v3.1.2 (commit f998277)
```
---
## 2. Core Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Span granularity | Clause/topic-level | Preserves multi-domain signal |
| span_id format | ULID (TEXT) | Survives re-segmentation |
| Span offsets | Required (NOT NULL) | Deterministic reconstruction |
| Offsets reference | reviews_enriched.text | Not text_normalized |
| Span → Issue mapping | One-to-one (UNIQUE span_id) | Atomic unit per issue |
| Primary span enforcement | Partial unique index | Exactly one per review version |
| Primary selection | I3 > I2 > I1, then V- > V± > V0 > V+, then span_index | Deterministic, stable |
| Reprocessing strategy | Soft-switch with is_active | No transient empty states |
| Span overlap | GiST exclusion constraint | Non-overlapping ranges enforced |
| Secondary codes | Array with cardinality ≤ 2 | Could normalize to link table later |
| Causal chain storage | JSONB | Flexibility, normalize later if needed |
| relation_type vs causal_chain | Separate concerns | relation = within-review, causal = root cause |
| Dimension columns | Postgres ENUMs | Type safety, storage efficiency |
| Trust score floor | 0.2 (GREATEST clamp) | Prevent multiplicative collapse |
| Issue routing key | (business_id, place_id, urt_primary, entity_normalized) | Deterministic, entity-aware |
| Issue ID generation | SHA256 via pgcrypto | Deterministic, collision-resistant |
| Text validation trigger | Conditional via session setting | Performance: skip in bulk loads |
| Relation validation | Application-level post-insert | Handles insertion order |
---
## 3. Extensions Required
| Extension | Purpose |
|-----------|---------|
| `btree_gist` | Exclusion constraint for non-overlapping spans |
| `pgcrypto` | SHA256-based issue ID generation |
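
As an application-side illustration of what the exclusion constraint guards against, the sketch below checks a set of spans for overlap. The `(start, end)` tuple shape and the half-open `[start, end)` convention are assumptions made for this example; the actual constraint definition is not shown in this document.

```python
from typing import List, Tuple


def has_overlap(spans: List[Tuple[int, int]]) -> bool:
    """Return True if any two spans overlap.

    Spans are assumed to be half-open [start, end) character ranges,
    so two spans that merely touch (end == next start) do not overlap.
    """
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    for (_, prev_end), (cur_start, _) in zip(ordered, ordered[1:]):
        if cur_start < prev_end:
            return True
    return False
```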
---
## 4. New Tables
| Table | Purpose |
|-------|---------|
| `review_spans` | Span-level URT classification |
| `review_span_secondary_codes` | (Optional) Normalized secondary codes |
---
## 5. Modified Tables
| Table | Changes |
|-------|---------|
| `issue_spans` | Added `span_id` FK (NOT NULL); the direct review FK is no longer the canonical link |
---
## 6. New ENUM Types
**Valence & Intensity:**
- `urt_valence` — V-, V±, V0, V+
- `urt_intensity` — I1, I2, I3
**Specificity & Actionability:**
- `urt_specificity` — S1, S2, S3
- `urt_actionability` — A1, A2, A3
**Context & Evidence:**
- `urt_temporal` — T1, T2, T3
- `urt_evidence` — E1, E2, E3
- `urt_comparative` — CR1, CR2, CR3
**Classification:**
- `urt_profile` — factual, emotional, comparative, etc.
- `urt_confidence` — low, medium, high
- `urt_relation` — elaborates, contrasts, causes, etc.
- `urt_entity_type` — person, product, location, etc.
---
## 7. Key Functions
| Function | Purpose |
|----------|---------|
| `urt_validate_causal_chain()` | Validates causal JSONB structure |
| `validate_review_relations()` | Ensures `related_span_id` points to a span with the same parent |
| `validate_active_spans()` | Ensures valid active span set |
| `set_primary_span()` | Deterministic primary selection |
| `generate_issue_id()` | SHA256-based issue ID |
---
## 8. Key Triggers
| Trigger | Purpose |
|---------|---------|
| `review_spans_validate_bounds` | span_end ≤ text length |
| `review_spans_validate_text` | span_text matches substring |
| `review_spans_validate_causal_chain` | causal_chain JSONB valid |
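
The bounds and text checks can be mirrored on the application side before inserting. This is a minimal sketch, not the trigger code itself: the function name `check_span`, the 0-based end-exclusive offsets, and the extra `0 <= span_start < span_end` check are assumptions for illustration.

```python
def check_span(review_text: str, span_start: int, span_end: int, span_text: str) -> None:
    """Raise ValueError if the span is inconsistent with the review text.

    Offsets are assumed to be 0-based and end-exclusive (Python slicing).
    """
    # Assumed sanity check; not listed among the triggers above.
    if not 0 <= span_start < span_end:
        raise ValueError("span offsets must satisfy 0 <= start < end")
    # Mirrors review_spans_validate_bounds: span_end <= text length.
    if span_end > len(review_text):
        raise ValueError("span_end exceeds text length")
    # Mirrors review_spans_validate_text: span_text matches the substring.
    if review_text[span_start:span_end] != span_text:
        raise ValueError("span_text does not match the text at the given offsets")
```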
---
## 9. USN Format
```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```
**Examples:**
- `URT:S:SVC.SPD:V-I3:S3A3T2.E2.CR1` — Specific service speed complaint
- `URT:F:PRD.QUA:V+I2:S2A1T1.E3.CR2:staff→training` — Product quality praise with causal chain
---
## 10. Span Boundary Rules
1. **Split on contrasting conjunctions** — "but", "however", "although"
2. **Split on topic/target change** — Different entity or aspect
3. **Split on valence change** — Positive → Negative or vice versa
4. **Split on domain change** — SVC → PRD → AMB
5. **Keep cause→effect together** — Causal chain stays in one span
---
## 11. Deferred to v3.3+
| Item | Reason |
|------|--------|
| Entity extraction implementation | Requires NER pipeline |
| Trust-weighted fact aggregation | Needs more span data |
| Secondary domain enforcement | App-level validation sufficient |
| Span-based fact counting | Currently review-based, optimize later |
---
## 12. Open Questions Resolved
| Question | Resolution |
|----------|------------|
| Span → Issue cardinality? | **One-to-one** (not many-to-many) |
| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
| Reprocessing strategy? | **Soft-switch** with is_active flag |
| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
---
## Quick Reference
### Primary Span Selection Algorithm
```
ORDER BY:
1. intensity DESC (I3 > I2 > I1)
2. valence ASC (V- > V± > V0 > V+)
3. span_index ASC (first wins ties)
```
### Issue Routing Key
```sql
(business_id, place_id, urt_primary, entity_normalized)
```
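
To illustrate how a deterministic ID can be derived from this key, here is a Python sketch using `hashlib`. The real `generate_issue_id()` runs in Postgres via `pgcrypto` and is not shown in this document, so the function name `issue_id_sketch`, the `|` separator, the empty-string handling of a missing entity, and the hex output are all assumptions.

```python
import hashlib
from typing import Optional


def issue_id_sketch(
    business_id: str,
    place_id: str,
    urt_primary: str,
    entity_normalized: Optional[str],
) -> str:
    """Hash the routing key into a stable hex string (illustrative only)."""
    parts = [business_id, place_id, urt_primary, entity_normalized or ""]
    # Assumed serialisation: fields joined with "|" and encoded as UTF-8.
    key = "|".join(parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```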
### Trust Score Calculation
```sql
GREATEST(0.2, base_trust * modifiers) -- Floor prevents collapse
```
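
The same clamp in Python, as a minimal sketch. It assumes `modifiers` is a sequence of multiplicative factors; the names `clamp_trust` and `TRUST_FLOOR` are hypothetical. With two strong penalties the product falls below the floor and the floor wins.

```python
import math
from typing import Sequence

TRUST_FLOOR = 0.2  # floor value from the design decision table


def clamp_trust(base_trust: float, modifiers: Sequence[float]) -> float:
    """Multiply the base trust by all modifiers, never dropping below the floor."""
    return max(TRUST_FLOOR, base_trust * math.prod(modifiers))


# 0.9 * 0.5 * 0.3 is roughly 0.135, which is below the floor.
penalised = clamp_trust(0.9, [0.5, 0.3])
```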
---
*Last updated: 2026-01-24*


@@ -0,0 +1,331 @@
# Universal Review Taxonomy (URT) v5.1 Reference
## Overview
The Universal Review Taxonomy (URT) is a classification system for customer feedback. It provides a structured approach to categorizing, annotating, and analyzing review content across any industry.
### Key Characteristics
- **Three Profiles**: Core, Standard, Full (increasing detail)
- **Seven Domains**: Covering all aspects of customer experience
- **Tier-3 Canonical Codes**: Format `X#.##` (e.g., J1.02, P2.15)
- **Dimensional Annotation**: Valence, intensity, specificity, and more
- **Causal Analysis**: Root cause chains (Full profile)
---
## Domain Codes
URT organizes feedback into seven domains, each identified by a single letter.
| Domain | Letter | Description |
|--------|--------|-------------|
| Offering | O | Product/service quality |
| Price | P | Value, pricing, promotions |
| Journey | J | Customer experience, timing, process |
| Environment | E | Physical/digital space |
| Attitude | A | Staff behavior, service attitude |
| Voice | V | Brand, communication, marketing |
| Relationship | R | Loyalty, trust, long-term relationship |
### Tier-3 Code Format
```
Pattern: [OPJEAVR][1-4]\.[0-9]{2}
```
Examples:
- `J1.02` - Journey domain, category 1, subcategory 02
- `P2.15` - Price domain, category 2, subcategory 15
- `A3.01` - Attitude domain, category 3, subcategory 01
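
The pattern translates directly into a small validator. The sketch below is illustrative; the names `TIER3_RE`, `DOMAINS`, and `parse_tier3` are not part of the taxonomy.

```python
import re
from typing import Tuple

TIER3_RE = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")

DOMAINS = {
    "O": "Offering", "P": "Price", "J": "Journey", "E": "Environment",
    "A": "Attitude", "V": "Voice", "R": "Relationship",
}


def parse_tier3(code: str) -> Tuple[str, int, str]:
    """Split a Tier-3 code into (domain name, category, subcategory)."""
    if TIER3_RE.fullmatch(code) is None:
        raise ValueError(f"not a Tier-3 code: {code!r}")
    # code[0] is the domain letter, code[1] the category digit,
    # code[2] the dot, and code[3:] the two-digit subcategory.
    return DOMAINS[code[0]], int(code[1]), code[3:]
```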
---
## Dimension Codes
### Valence
Indicates the sentiment direction of the feedback.
| Code | Meaning |
|------|---------|
| V+ | Positive |
| V- | Negative |
| V0 | Neutral |
| V± | Mixed |
### Intensity
Indicates the strength of the expressed sentiment.
| Code | Meaning |
|------|---------|
| I1 | Low intensity |
| I2 | Moderate intensity |
| I3 | High intensity |
### Specificity (Standard+)
Indicates how detailed the feedback is.
| Code | Meaning |
|------|---------|
| S1 | Low - vague, general |
| S2 | Medium - some detail |
| S3 | High - specific, precise |
### Actionability (Standard+)
Indicates whether clear actions can be derived from the feedback.
| Code | Meaning |
|------|---------|
| A1 | None - no clear action |
| A2 | Unclear - possible actions |
| A3 | Clear - specific, actionable |
### Temporal (Standard+)
Indicates the time frame referenced in the feedback.
| Code | Meaning | Markers |
|------|---------|---------|
| TC | Current - this visit | "today", "this time", "yesterday" |
| TR | Recent - last few visits | "lately", "recently", "again" |
| TH | Historical - long-standing | "for years", "always", "historically" |
| TF | Future - expectations | "I won't come back", "next time" |
**Default**: TC when the text contains no temporal markers.
### Evidence (Standard+)
Indicates how the information was obtained from the text.
| Code | Meaning | Example |
|------|---------|---------|
| ES | Stated - explicit in text | "Waited 45 minutes" |
| EI | Inferred - logically entailed | "Took 3 weeks to reply" → slow response |
| EC | Contextual - depends on context | "That happened again" |
**Default**: ES. Use EI or EC only when the information is not stated explicitly in the text.
### Comparative (Standard+)
Indicates whether the feedback compares to alternatives.
| Code | Meaning |
|------|---------|
| CR-N | No comparison |
| CR-B | Better than alternatives |
| CR-W | Worse than alternatives |
| CR-S | Same as alternatives |
---
## USN (URT String Notation)
USN is a compact string encoding for URT annotations.
### Grammar
```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```
### Encoding Rules
**Valence**:
- `+` for V+
- `-` for V-
**Intensity**:
- `1` for I1
- `2` for I2
- `3` for I3
### Examples
**Standard Profile**:
```
URT:S:J1.03:-2:22TC.ES.N
```
Decoded:
- Profile: Standard
- Code: J1.03
- Valence: V- (negative)
- Intensity: I2
- Specificity: S2
- Actionability: A2
- Temporal: TC
- Evidence: ES
- Comparative: CR-N
**Full Profile with Causal Chain**:
```
URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O
```
Decoded:
- Profile: Full
- Codes: J1.01, A1.04
- Valence: V- (negative)
- Intensity: I3
- Specificity: S2
- Actionability: A3
- Temporal: TR
- Evidence: EI
- Comparative: CR-S
- Causal: CD.O (Conditions-Operational), MG.O (Management-Oversight)
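
The decoding above can be sketched as a small parser. This is an illustration built from the grammar and the two examples, not a reference implementation: `parse_usn` and the returned dictionary keys are hypothetical, only the documented valence symbols `+` and `-` are accepted (the encodings for V0 and V± are not given here), and rejecting a causal segment on a Standard string is an assumption.

```python
import re
from typing import Dict, List, Union

USN_RE = re.compile(
    r"URT:(?P<profile>[SF])"
    r":(?P<codes>[^:]+)"
    r":(?P<valence>[+-])(?P<intensity>[123])"
    r":(?P<spec>[123])(?P<act>[123])(?P<temporal>T[CRHF])"
    r"\.(?P<evidence>E[SIC])"
    r"\.(?P<cr>[NBWS])"
    r"(?::(?P<causal>[^:]+))?"
)


def parse_usn(usn: str) -> Dict[str, Union[str, List[str]]]:
    """Decode a USN string into its named parts."""
    m = USN_RE.fullmatch(usn)
    if m is None:
        raise ValueError(f"not a recognised USN string: {usn!r}")
    if m["profile"] == "S" and m["causal"] is not None:
        raise ValueError("a causal chain is only defined for the Full profile")
    return {
        "profile": {"S": "Standard", "F": "Full"}[m["profile"]],
        "codes": m["codes"].split("+"),      # multiple codes are joined with "+"
        "valence": "V" + m["valence"],
        "intensity": "I" + m["intensity"],
        "specificity": "S" + m["spec"],
        "actionability": "A" + m["act"],
        "temporal": m["temporal"],
        "evidence": m["evidence"],
        "comparative": "CR-" + m["cr"],
        "causal": m["causal"].split(",") if m["causal"] else [],
    }
```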
---
## Causal Chain (Full Profile Only)
The causal chain identifies root causes across three layers, ordered from immediate to systemic.
### Layers
| Layer | Codes | Scope |
|-------|-------|-------|
| conditions | CD-S, CD-T, CD-E, CD-F, CD-O | Staff State, Team Dynamics, Equipment, Facility, Operational |
| management | MG-P, MG-T, MG-O, MG-R, MG-C | Planning, Training, Oversight, Resources, Communication |
| systemic | SY-R, SY-P, SY-C, SY-S, SY-H, SY-X | Resource Decisions, Policy, Culture, Standards, Human Capital, External |
### Code Reference
**Conditions Layer**:
- `CD-S` - Staff State
- `CD-T` - Team Dynamics
- `CD-E` - Equipment
- `CD-F` - Facility
- `CD-O` - Operational
**Management Layer**:
- `MG-P` - Planning
- `MG-T` - Training
- `MG-O` - Oversight
- `MG-R` - Resources
- `MG-C` - Communication
**Systemic Layer**:
- `SY-R` - Resource Decisions
- `SY-P` - Policy
- `SY-C` - Culture
- `SY-S` - Standards
- `SY-H` - Human Capital
- `SY-X` - External
### JSONB Schema
```json
[
{"layer": "conditions", "code": "CD-O", "evidence": "ES"},
{"layer": "management", "code": "MG-P", "evidence": "EI"}
]
```
### Constraints
- Maximum 3 entries (one per layer)
- Only include when text explicitly supports it
- Order: conditions → management → systemic
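
The structural constraints can be checked mechanically; whether the text supports a cause still needs human or model judgment. The sketch below is an assumption-laden illustration, not the `urt_validate_causal_chain()` function: `check_causal_chain` is a hypothetical name, and it assumes every entry carries an `evidence` value drawn from the Evidence dimension (ES, EI, EC).

```python
from typing import Dict, List

LAYER_CODES = {
    "conditions": {"CD-S", "CD-T", "CD-E", "CD-F", "CD-O"},
    "management": {"MG-P", "MG-T", "MG-O", "MG-R", "MG-C"},
    "systemic": {"SY-R", "SY-P", "SY-C", "SY-S", "SY-H", "SY-X"},
}
LAYER_ORDER = ["conditions", "management", "systemic"]
EVIDENCE_CODES = {"ES", "EI", "EC"}  # assumed to match the Evidence dimension


def check_causal_chain(chain: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the chain is well formed."""
    errors = []
    if len(chain) > 3:
        errors.append("more than 3 entries")
    last_rank = -1
    for i, entry in enumerate(chain):
        layer = entry.get("layer")
        if layer not in LAYER_CODES:
            errors.append(f"entry {i}: unknown layer {layer!r}")
            continue
        if entry.get("code") not in LAYER_CODES[layer]:
            errors.append(f"entry {i}: code {entry.get('code')!r} not in layer {layer}")
        if entry.get("evidence") not in EVIDENCE_CODES:
            errors.append(f"entry {i}: invalid evidence {entry.get('evidence')!r}")
        # Strictly increasing rank enforces both ordering and one entry per layer.
        rank = LAYER_ORDER.index(layer)
        if rank <= last_rank:
            errors.append(f"entry {i}: layer {layer} is duplicated or out of order")
        last_rank = max(last_rank, rank)
    return errors
```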
---
## Span Boundary Detection Rules
Spans are detected at the clause/topic level, not sentence level.
### Split Rules (in priority order)
1. **Split on contrasting conjunctions**: but, however, although, despite, yet
2. **Split when subject/target changes** (topic shift)
3. **Split when valence changes** (positive ↔ negative)
4. **Split when domain changes** (O/P/J/E/A/V/R)
5. **Keep together** for cause→effect within same feedback unit
### Guidelines
- **Maximum**: ~3 spans per sentence
- **Validation**: If 4+ spans detected, re-check for over-splitting
### Example
**Input**:
> "The food was great but the service was slow and the bathroom was dirty."
**Output**: 3 spans
1. "The food was great" (Offering, positive)
2. "the service was slow" (Journey/Attitude, negative)
3. "the bathroom was dirty" (Environment, negative)
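**Reasoning**: The first boundary falls on the contrasting conjunction "but", which also marks a valence and domain change; the second is a topic and domain shift (service → bathroom) with no change in valence.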
---
## Primary Span Selection
When a review contains multiple spans, select the primary span using these criteria in order:
### Selection Priority
1. **Highest intensity** (I3 > I2 > I1)
2. **Tie-break**: Negative over positive (V- > V± > V0 > V+)
3. **Tie-break**: Earliest span_index
### Example
Given spans:
- Span 0: I2, V+
- Span 1: I3, V+
- Span 2: I3, V-
**Primary**: Span 2 (highest intensity I3, negative valence wins tie-break)
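
The selection priority maps onto a single sort key. In this sketch the dictionary shape of a span and the name `select_primary` are assumptions for illustration; the ranking tables follow the priority list above.

```python
from typing import Dict, List, Union

Span = Dict[str, Union[int, str]]

# Lower rank wins: highest intensity first, then most negative valence.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def select_primary(spans: List[Span]) -> Span:
    """Pick the primary span by intensity, then valence, then span_index."""
    return min(
        spans,
        key=lambda s: (
            INTENSITY_RANK[s["intensity"]],
            VALENCE_RANK[s["valence"]],
            s["span_index"],
        ),
    )


example = [
    {"span_index": 0, "intensity": "I2", "valence": "V+"},
    {"span_index": 1, "intensity": "I3", "valence": "V+"},
    {"span_index": 2, "intensity": "I3", "valence": "V-"},
]
primary = select_primary(example)  # the span with span_index 2
```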
---
## Secondary Codes Rules
Secondary codes capture additional topics mentioned in a span.
### Constraints
- **Maximum**: 2 secondary codes
- **Format**: Must be Tier-3 (X#.##)
- **Recommendation**: Should be different domain from primary
### Example
Primary: `J1.03` (Journey)
Secondary: `A2.01`, `E1.05` (Attitude, Environment)
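
These constraints can be checked with a few lines of Python. The sketch below treats the count and format rules as hard errors and the different-domain recommendation as a warning; that split, and the name `check_secondary_codes`, are assumptions for illustration.

```python
import re
from typing import List, Tuple

TIER3_RE = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")


def check_secondary_codes(primary: str, secondary: List[str]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a span's secondary codes."""
    errors, warnings = [], []
    if len(secondary) > 2:
        errors.append("more than 2 secondary codes")
    for code in secondary:
        if TIER3_RE.fullmatch(code) is None:
            errors.append(f"{code!r} is not a Tier-3 code")
        elif code[0] == primary[0]:
            # Same domain letter as the primary: allowed, but not recommended.
            warnings.append(f"{code} shares domain {primary[0]} with the primary code")
    return errors, warnings
```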
---
## Quick Reference Card
### Profiles
| Profile | Dimensions | Causal Chain |
|---------|------------|--------------|
| Core | V, I | No |
| Standard | V, I, S, A, T, E, CR | No |
| Full | V, I, S, A, T, E, CR | Yes |
### USN Quick Format
```
URT:{S|F}:{tier3_codes}:{valence}{intensity}:{SAT}.{E}.{CR}[:{causal}]
```
### Domain Letters
```
O P J E A V R
│ │ │ │ │ │ └─ Relationship
│ │ │ │ │ └─── Voice
│ │ │ │ └───── Attitude
│ │ │ └─────── Environment
│ │ └───────── Journey
│ └─────────── Price
└───────────── Offering
```