Phase 0: Project restructure to ReviewIQ platform architecture

New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1143
.artifacts/ReviewIQ-Architecture-v2.md
Normal file
File diff suppressed because it is too large
2306
.artifacts/ReviewIQ-Architecture-v3.2.md
Normal file
File diff suppressed because it is too large
1513
.artifacts/ReviewIQ-Architecture-v3.md
Normal file
File diff suppressed because it is too large
183
.artifacts/ReviewIQ-v32-Decisions.md
Normal file
@@ -0,0 +1,183 @@
# ReviewIQ v3.2 Design Decisions

> Fast context-recovery document — all key decisions without the full spec.

---

## 1. Markpoint

```
ID: reviewiq-v32-span-layer-2026-01-24-001
Status: v3.2 span layer complete
Based on: v3.1.2 (commit f998277)
```

---

## 2. Core Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Span granularity | Clause/topic-level | Preserves multi-domain signal |
| span_id format | ULID (TEXT) | Survives re-segmentation |
| Span offsets | Required (NOT NULL) | Deterministic reconstruction |
| Offsets reference | reviews_enriched.text | Not text_normalized |
| Span → Issue mapping | One-to-one (UNIQUE span_id) | Atomic unit per issue |
| Primary span enforcement | Partial unique index | Exactly one per review version |
| Primary selection | I3 > I2 > I1, V- > V± > V0 > V+, span_index | Deterministic, stable |
| Reprocessing strategy | Soft-switch with is_active | No transient empty states |
| Span overlap | GiST exclusion constraint | Non-overlapping ranges enforced |
| Secondary codes | Array with cardinality ≤ 2 | Could normalize to link table later |
| Causal chain storage | JSONB | Flexibility, normalize later if needed |
| relation_type vs causal_chain | Separate concerns | relation = within-review, causal = root cause |
| Dimension columns | Postgres ENUMs | Type safety, storage efficiency |
| Trust score floor | 0.2 (GREATEST clamp) | Prevent multiplicative collapse |
| Issue routing key | (business_id, place_id, urt_primary, entity_normalized) | Deterministic, entity-aware |
| Issue ID generation | SHA256 via pgcrypto | Deterministic, collision-resistant |
| Text validation trigger | Conditional via session setting | Performance: skip in bulk loads |
| Relation validation | Application-level post-insert | Handles insertion order |

---

## 3. Extensions Required

| Extension | Purpose |
|-----------|---------|
| `btree_gist` | Exclusion constraint for non-overlapping spans |
| `pgcrypto` | SHA256-based issue ID generation |

---

## 4. New Tables

| Table | Purpose |
|-------|---------|
| `review_spans` | Span-level URT classification |
| `review_span_secondary_codes` | (Optional) Normalized secondary codes |

---

## 5. Modified Tables

| Table | Changes |
|-------|---------|
| `issue_spans` | Added `span_id` FK (NOT NULL), removed direct review FK as canonical |

---

## 6. New ENUM Types

**Valence & Intensity:**
- `urt_valence` — V-, V±, V0, V+
- `urt_intensity` — I1, I2, I3

**Specificity & Actionability:**
- `urt_specificity` — S1, S2, S3
- `urt_actionability` — A1, A2, A3

**Context & Evidence:**
- `urt_temporal` — T1, T2, T3
- `urt_evidence` — E1, E2, E3
- `urt_comparative` — CR1, CR2, CR3

**Classification:**
- `urt_profile` — factual, emotional, comparative, etc.
- `urt_confidence` — low, medium, high
- `urt_relation` — elaborates, contrasts, causes, etc.
- `urt_entity_type` — person, product, location, etc.

---

## 7. Key Functions

| Function | Purpose |
|----------|---------|
| `urt_validate_causal_chain()` | Validates causal JSONB structure |
| `validate_review_relations()` | Ensures related_span_id same-parent |
| `validate_active_spans()` | Ensures valid active span set |
| `set_primary_span()` | Deterministic primary selection |
| `generate_issue_id()` | SHA256-based issue ID |

---

## 8. Key Triggers

| Trigger | Purpose |
|---------|---------|
| `review_spans_validate_bounds` | span_end ≤ text length |
| `review_spans_validate_text` | span_text matches substring |
| `review_spans_validate_causal_chain` | causal_chain JSONB valid |

---

## 9. USN Format

```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```

**Examples:**
- `URT:S:SVC.SPD:V-I3:S3A3T2.E2.CR1` — Specific service speed complaint
- `URT:F:PRD.QUA:V+I2:S2A1T1.E3.CR2:staff→training` — Product quality praise with causal chain

---

## 10. Span Boundary Rules

1. **Split on contrasting conjunctions** — "but", "however", "although"
2. **Split on topic/target change** — Different entity or aspect
3. **Split on valence change** — Positive → Negative or vice versa
4. **Split on domain change** — SVC → PRD → AMB
5. **Keep cause→effect together** — Causal chain stays in one span

---

## 11. Deferred to v3.3+

| Item | Reason |
|------|--------|
| Entity extraction implementation | Requires NER pipeline |
| Trust-weighted fact aggregation | Needs more span data |
| Secondary domain enforcement | App-level validation sufficient |
| Span-based fact counting | Currently review-based, optimize later |

---

## 12. Open Questions Resolved

| Question | Resolution |
|----------|------------|
| Span → Issue cardinality? | **One-to-one** (not many-to-many) |
| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
| Reprocessing strategy? | **Soft-switch** with is_active flag |
| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |

---

## Quick Reference

### Primary Span Selection Algorithm

```
ORDER BY:
1. intensity DESC (I3 > I2 > I1)
2. valence ASC (V- > V± > V0 > V+)
3. span_index ASC (first wins ties)
```

### Issue Routing Key

```sql
(business_id, place_id, urt_primary, entity_normalized)
```
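
A minimal Python sketch of how a deterministic issue ID could be derived from this routing key. The document only states that IDs are SHA256-based and produced via pgcrypto; the `|` separator, the whitespace trimming, and the helper name `issue_id_for` are assumptions made for illustration and are not the project's `generate_issue_id()` logic.

```python
import hashlib


def issue_id_for(business_id: str, place_id: str, urt_primary: str,
                 entity_normalized: str = "") -> str:
    """Hash the four routing-key fields into a stable hex digest (hypothetical helper)."""
    # Assumption: fields are trimmed and joined with "|" before hashing.
    parts = [business_id, place_id, urt_primary, entity_normalized]
    key = "|".join(p.strip() for p in parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# The same routing key always yields the same ID; a different entity yields another.
first = issue_id_for("biz-1", "place-9", "J1.03", "front desk")
second = issue_id_for("biz-1", "place-9", "J1.03", "front desk")
other = issue_id_for("biz-1", "place-9", "J1.03", "parking")
```

Because the digest depends only on the key fields, reprocessing a review routes its spans to the same issue as before.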

### Trust Score Calculation

```sql
GREATEST(0.2, base_trust * modifiers)  -- Floor prevents collapse
```
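
The clamp can be mirrored in application code. The sketch below assumes `modifiers` is a sequence of multiplicative factors; the SQL above shows it as a single term, so that shape and the function name are assumptions.

```python
from typing import Sequence

TRUST_FLOOR = 0.2  # floor value from the decisions table


def trust_score(base_trust: float, modifiers: Sequence[float]) -> float:
    """Multiply base trust by each modifier, then clamp at the floor."""
    score = base_trust
    for modifier in modifiers:
        score *= modifier
    # Equivalent of GREATEST(0.2, ...): several small modifiers cannot
    # drive the score toward zero.
    return max(TRUST_FLOOR, score)


clamped = trust_score(0.9, [0.5, 0.1])   # raw product is below the floor
unclamped = trust_score(0.8, [1.0])      # raw product is above the floor
```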

---

*Last updated: 2026-01-24*
331
.artifacts/URT-v5.1-Reference.md
Normal file
@@ -0,0 +1,331 @@
# Universal Review Taxonomy (URT) v5.1 Reference

## Overview

The Universal Review Taxonomy (URT) is a classification system for customer feedback. It provides a structured approach to categorizing, annotating, and analyzing review content across any industry.

### Key Characteristics

- **Three Profiles**: Core, Standard, Full (increasing detail)
- **Seven Domains**: Covering all aspects of customer experience
- **Tier-3 Canonical Codes**: Format `X#.##` (e.g., J1.02, P2.15)
- **Dimensional Annotation**: Valence, intensity, specificity, and more
- **Causal Analysis**: Root cause chains (Full profile)

---

## Domain Codes

URT organizes feedback into seven domains, each identified by a single letter.

| Domain | Letter | Description |
|--------|--------|-------------|
| Offering | O | Product/service quality |
| Price | P | Value, pricing, promotions |
| Journey | J | Customer experience, timing, process |
| Environment | E | Physical/digital space |
| Attitude | A | Staff behavior, service attitude |
| Voice | V | Brand, communication, marketing |
| Relationship | R | Loyalty, trust, long-term relationship |

### Tier-3 Code Format

```
Pattern: [OPJEAVR][1-4]\.[0-9]{2}
```

Examples:
- `J1.02` - Journey domain, category 1, subcategory 02
- `P2.15` - Price domain, category 2, subcategory 15
- `A3.01` - Attitude domain, category 3, subcategory 01
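
A short Python sketch of validating and splitting a Tier-3 code with the pattern above. The domain letters come from the Domain Codes table; the function name `parse_tier3` and the returned dictionary shape are assumptions for illustration.

```python
import re

# Pattern taken verbatim from the Tier-3 code format.
TIER3_RE = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")

DOMAINS = {
    "O": "Offering", "P": "Price", "J": "Journey", "E": "Environment",
    "A": "Attitude", "V": "Voice", "R": "Relationship",
}


def parse_tier3(code: str) -> dict:
    """Split a Tier-3 code into domain, category, and subcategory."""
    # fullmatch rejects trailing characters such as "J1.023".
    if TIER3_RE.fullmatch(code) is None:
        raise ValueError(f"not a Tier-3 code: {code!r}")
    return {
        "domain": DOMAINS[code[0]],
        "category": int(code[1]),
        "subcategory": code[3:],  # kept as text to preserve the leading zero
    }


journey = parse_tier3("J1.02")
```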

---

## Dimension Codes

### Valence

Indicates the sentiment direction of the feedback.

| Code | Meaning |
|------|---------|
| V+ | Positive |
| V- | Negative |
| V0 | Neutral |
| V± | Mixed |

### Intensity

Indicates the strength of the expressed sentiment.

| Code | Meaning |
|------|---------|
| I1 | Low intensity |
| I2 | Moderate intensity |
| I3 | High intensity |

### Specificity (Standard+)

Indicates how detailed the feedback is.

| Code | Meaning |
|------|---------|
| S1 | Low - vague, general |
| S2 | Medium - some detail |
| S3 | High - specific, precise |

### Actionability (Standard+)

Indicates whether clear actions can be derived from the feedback.

| Code | Meaning |
|------|---------|
| A1 | None - no clear action |
| A2 | Unclear - possible actions |
| A3 | Clear - specific actionable |

### Temporal (Standard+)

Indicates the time frame referenced in the feedback.

| Code | Meaning | Markers |
|------|---------|---------|
| TC | Current - this visit | "today", "this time", "yesterday" |
| TR | Recent - last few visits | "lately", "recently", "again" |
| TH | Historical - long-standing | "for years", "always", "historically" |
| TF | Future - expectations | "I won't come back", "next time" |

**Default**: TC when no temporal language exists.

### Evidence (Standard+)

Indicates how the information was obtained from the text.

| Code | Meaning | Example |
|------|---------|---------|
| ES | Stated - explicit in text | "Waited 45 minutes" |
| EI | Inferred - logically entailed | "Took 3 weeks to reply" → slow response |
| EC | Contextual - depends on context | "That happened again" |

**Default**: ES. Use EI/EC only when needed.

### Comparative

Indicates whether the feedback compares to alternatives.

| Code | Meaning |
|------|---------|
| CR-N | No comparison |
| CR-B | Better than alternatives |
| CR-W | Worse than alternatives |
| CR-S | Same as alternatives |

---

## USN (URT String Notation)

USN is a compact string encoding for URT annotations.

### Grammar

```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```

### Encoding Rules

**Valence**:
- `+` for V+
- `-` for V-

**Intensity**:
- `1` for I1
- `2` for I2
- `3` for I3

### Examples

**Standard Profile**:
```
URT:S:J1.03:-2:22TC.ES.N
```
Decoded:
- Profile: Standard
- Code: J1.03
- Valence: V- (negative)
- Intensity: I2
- Specificity: S2
- Actionability: A2
- Temporal: TC
- Evidence: ES
- Comparative: CR-N

**Full Profile with Causal Chain**:
```
URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O
```
Decoded:
- Profile: Full
- Codes: J1.01, A1.04
- Valence: V- (negative)
- Intensity: I3
- Specificity: S2
- Actionability: A3
- Temporal: TR
- Evidence: EI
- Comparative: CR-S
- Causal: CD.O (Conditions-Operational), MG.O (Management-Oversight)
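
The two decoded examples can be reproduced with a small parser. This is a sketch under stated assumptions: it handles only the `+` and `-` valence symbols listed in the encoding rules (the source does not give symbols for V0 or V±), it treats `+` as the separator between multiple codes and `,` as the separator between causal entries because the Full example uses them that way, and the name `parse_usn` is hypothetical.

```python
def parse_usn(usn: str) -> dict:
    """Decode a Standard or Full USN string into its labelled parts."""
    parts = usn.split(":")
    if len(parts) not in (5, 6) or parts[0] != "URT" or parts[1] not in ("S", "F"):
        raise ValueError(f"not a Standard/Full USN string: {usn!r}")

    valence_symbols = {"+": "V+", "-": "V-"}  # only these two are documented
    polarity = parts[3]                        # e.g. "-2"
    if (len(polarity) != 2 or polarity[0] not in valence_symbols
            or polarity[1] not in "123"):
        raise ValueError(f"unsupported valence/intensity: {polarity!r}")

    dims = parts[4].split(".")                 # e.g. ["22TC", "ES", "N"]
    if len(dims) != 3 or len(dims[0]) < 3:
        raise ValueError(f"malformed dimension block: {parts[4]!r}")
    sat, evidence, comparative = dims

    decoded = {
        "profile": "Standard" if parts[1] == "S" else "Full",
        "codes": parts[2].split("+"),
        "valence": valence_symbols[polarity[0]],
        "intensity": "I" + polarity[1],
        "specificity": "S" + sat[0],
        "actionability": "A" + sat[1],
        "temporal": sat[2:],
        "evidence": evidence,
        "comparative": "CR-" + comparative,
    }
    if len(parts) == 6:
        decoded["causal"] = parts[5].split(",")
    return decoded


standard = parse_usn("URT:S:J1.03:-2:22TC.ES.N")
full = parse_usn("URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O")
```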

---

## Causal Chain (Full Profile Only)

The causal chain identifies root causes across three layers, ordered from immediate to systemic.

### Layers

| Layer | Codes | Scope |
|-------|-------|-------|
| conditions | CD-S, CD-T, CD-E, CD-F, CD-O | Staff State, Team Dynamics, Equipment, Facility, Operational |
| management | MG-P, MG-T, MG-O, MG-R, MG-C | Planning, Training, Oversight, Resources, Communication |
| systemic | SY-R, SY-P, SY-C, SY-S, SY-H, SY-X | Resource Decisions, Policy, Culture, Standards, Human Capital, External |

### Code Reference

**Conditions Layer**:
- `CD-S` - Staff State
- `CD-T` - Team Dynamics
- `CD-E` - Equipment
- `CD-F` - Facility
- `CD-O` - Operational

**Management Layer**:
- `MG-P` - Planning
- `MG-T` - Training
- `MG-O` - Oversight
- `MG-R` - Resources
- `MG-C` - Communication

**Systemic Layer**:
- `SY-R` - Resource Decisions
- `SY-P` - Policy
- `SY-C` - Culture
- `SY-S` - Standards
- `SY-H` - Human Capital
- `SY-X` - External

### JSONB Schema

```json
[
  {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
  {"layer": "management", "code": "MG-P", "evidence": "EI"}
]
```

### Constraints

- Maximum 3 entries (one per layer)
- Only include when text explicitly supports it
- Order: conditions → management → systemic
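
The structural constraints above can be checked mechanically; whether the text supports an entry cannot. The sketch below is an illustrative application-level check, not the database function `urt_validate_causal_chain()`; the function name and error strings are assumptions.

```python
LAYER_ORDER = ["conditions", "management", "systemic"]
LAYER_CODES = {
    "conditions": {"CD-S", "CD-T", "CD-E", "CD-F", "CD-O"},
    "management": {"MG-P", "MG-T", "MG-O", "MG-R", "MG-C"},
    "systemic": {"SY-R", "SY-P", "SY-C", "SY-S", "SY-H", "SY-X"},
}


def causal_chain_errors(chain: list) -> list:
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    if len(chain) > 3:
        errors.append("more than 3 entries")
    seen = []
    for entry in chain:
        layer, code = entry.get("layer"), entry.get("code")
        if layer not in LAYER_CODES:
            errors.append(f"unknown layer: {layer!r}")
            continue
        if code not in LAYER_CODES[layer]:
            errors.append(f"code {code!r} does not belong to layer {layer!r}")
        if layer in seen:
            errors.append(f"duplicate layer: {layer!r}")
        seen.append(layer)
    # Layers must appear in the order conditions -> management -> systemic.
    positions = [LAYER_ORDER.index(layer) for layer in seen]
    if positions != sorted(positions):
        errors.append("layers out of order")
    return errors


valid = causal_chain_errors([
    {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
    {"layer": "management", "code": "MG-P", "evidence": "EI"},
])
reversed_order = causal_chain_errors([
    {"layer": "management", "code": "MG-P", "evidence": "EI"},
    {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
])
```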

---

## Span Boundary Detection Rules

Spans are detected at the clause/topic level, not sentence level.

### Split Rules (in priority order)

1. **Split on contrasting conjunctions**: but, however, although, despite, yet
2. **Split when subject/target changes** (topic shift)
3. **Split when valence changes** (positive ↔ negative)
4. **Split when domain changes** (O/P/J/E/A/V/R)
5. **Keep together** for cause→effect within same feedback unit

### Guidelines

- **Maximum**: ~3 spans per sentence
- **Validation**: If 4+ spans detected, re-check for over-splitting

### Example

**Input**:
> "The food was great but the service was slow and the bathroom was dirty."

**Output**: 3 spans
1. "The food was great" (Offering, positive)
2. "the service was slow" (Journey/Attitude, negative)
3. "the bathroom was dirty" (Environment, negative)

**Reasoning**: Topic shift + domain shift at each boundary.

---

## Primary Span Selection

When a review contains multiple spans, select the primary span using these criteria in order:

### Selection Priority

1. **Highest intensity** (I3 > I2 > I1)
2. **Tie-break**: Negative over positive (V- > V± > V0 > V+)
3. **Tie-break**: Earliest span_index

### Example

Given spans:
- Span 0: I2, V+
- Span 1: I3, V+
- Span 2: I3, V-

**Primary**: Span 2 (highest intensity I3, negative valence wins tie-break)
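
The selection priority maps directly onto a composite sort key. The following Python sketch replays the example; representing spans as dictionaries with `intensity` and `valence` keys is an assumption (only `span_index` is named in the source), and the function name is hypothetical.

```python
# Lower rank wins, so the strongest intensity and most negative valence rank 0.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def select_primary(spans: list) -> dict:
    """Pick the primary span: intensity, then valence, then earliest index."""
    return min(
        spans,
        key=lambda s: (
            INTENSITY_RANK[s["intensity"]],
            VALENCE_RANK[s["valence"]],
            s["span_index"],
        ),
    )


spans = [
    {"span_index": 0, "intensity": "I2", "valence": "V+"},
    {"span_index": 1, "intensity": "I3", "valence": "V+"},
    {"span_index": 2, "intensity": "I3", "valence": "V-"},
]
primary = select_primary(spans)
```

Because `span_index` is unique within a review, the key is a total order and the result does not depend on input ordering.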

---

## Secondary Codes Rules

Secondary codes capture additional topics mentioned in a span.

### Constraints

- **Maximum**: 2 secondary codes
- **Format**: Must be Tier-3 (X#.##)
- **Recommendation**: Should be different domain from primary

### Example

Primary: `J1.03` (Journey)
Secondary: `A2.01`, `E1.05` (Attitude, Environment)
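
A sketch of checking these rules in Python. The two hard constraints produce errors, while the same-domain recommendation produces only a warning, since the text phrases it as "should". The function name and the `(errors, warnings)` return shape are assumptions.

```python
import re

TIER3_RE = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")


def check_secondary_codes(primary: str, secondary: list) -> tuple:
    """Return (errors, warnings) for a span's secondary codes."""
    errors, warnings = [], []
    if len(secondary) > 2:
        errors.append("more than 2 secondary codes")
    for code in secondary:
        if TIER3_RE.fullmatch(code) is None:
            errors.append(f"not a Tier-3 code: {code!r}")
        elif code[0] == primary[0]:
            # Same domain letter as the primary: allowed, but discouraged.
            warnings.append(f"{code} shares domain {primary[0]} with primary")
    return errors, warnings


ok = check_secondary_codes("J1.03", ["A2.01", "E1.05"])
same_domain = check_secondary_codes("J1.03", ["J2.01"])
```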

---

## Quick Reference Card

### Profiles

| Profile | Dimensions | Causal Chain |
|---------|------------|--------------|
| Core | V, I | No |
| Standard | V, I, S, A, T, E, CR | No |
| Full | V, I, S, A, T, E, CR | Yes |

### USN Quick Format

```
URT:{S|F}:{tier3_codes}:{valence}{intensity}:{SAT}.{E}.{CR}[:{causal}]
```

### Domain Letters

```
O P J E A V R
│ │ │ │ │ │ └─ Relationship
│ │ │ │ │ └─── Voice
│ │ │ │ └───── Attitude
│ │ │ └─────── Environment
│ │ └───────── Journey
│ └─────────── Price
└───────────── Offering
```
0
api/__init__.py
Normal file
0
api/middleware/__init__.py
Normal file
0
api/routes/__init__.py
Normal file
@@ -20,13 +20,13 @@ from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel, HttpUrl, Field
 from fastapi.responses import JSONResponse, StreamingResponse
 
-from modules.database import DatabaseManager, JobStatus
-from modules.webhooks import WebhookDispatcher, WebhookManager
-from modules.health_checks import HealthCheckSystem
-from modules.scraper_clean import fast_scrape_reviews, LogCapture, get_business_card_info  # Clean scraper
-from modules.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
-from modules.structured_logger import StructuredLogger, LogEntry
-from modules.chrome_pool import (
+from core.database import DatabaseManager, JobStatus
+from services.webhook_service import WebhookDispatcher, WebhookManager
+from utils.health_checks import HealthCheckSystem
+from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews, LogCapture, get_business_card_info  # Clean scraper
+from utils.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
+from utils.logger import StructuredLogger, LogEntry
+from workers.chrome_pool import (
     start_worker_pools,
     stop_worker_pools,
     get_validation_worker,
0
core/__init__.py
Normal file
@@ -8,22 +8,13 @@ import json
 from datetime import datetime
 from typing import Optional, List, Dict, Any
 from uuid import UUID, uuid4
-from enum import Enum
 import logging
 
+from core.enums import JobStatus
+
 log = logging.getLogger(__name__)
 
 
-class JobStatus(str, Enum):
-    """Job status enumeration"""
-    PENDING = "pending"
-    RUNNING = "running"
-    COMPLETED = "completed"
-    FAILED = "failed"
-    CANCELLED = "cancelled"
-    PARTIAL = "partial"  # Job crashed but has partial reviews saved
-
-
 class DatabaseManager:
     """PostgreSQL database manager with connection pooling"""
 
14
core/enums.py
Normal file
@@ -0,0 +1,14 @@
"""
Enumerations for the ReviewIQ project.
"""
from enum import Enum


class JobStatus(str, Enum):
    """Job status enumeration"""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    PARTIAL = "partial"  # Job crashed but has partial reviews saved
@@ -6,7 +6,7 @@ from dataclasses import dataclass, field
 
 from selenium.webdriver.remote.webelement import WebElement
 
-from modules.utils import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)
+from utils.helpers import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)
 
 
 @dataclass
@@ -27,7 +27,7 @@ class RawReview:
     owner_date: str = ""
     owner_text: str = ""
     review_date: str = ""  # ISO format date
 
     # Translation fields
     translations: dict = field(default_factory=dict)  # Store translations by language code
 
10
scrapers/__init__.py
Normal file
@@ -0,0 +1,10 @@
"""
Scrapers Package

This package contains all scraper implementations for the ReviewIQ system.
"""

from scrapers.base import BaseScraper
from scrapers.registry import ScraperRegistry, registry

__all__ = ["BaseScraper", "ScraperRegistry", "registry"]
97
scrapers/base.py
Normal file
@@ -0,0 +1,97 @@
"""
Base Scraper Interface

This module defines the abstract base class that all scrapers must implement.
It ensures consistent interface across different scraper implementations.
"""

from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional


class BaseScraper(ABC):
    """
    Abstract base class for all scrapers in the ReviewIQ system.

    All concrete scraper implementations must inherit from this class
    and implement the required abstract methods.
    """

    @abstractmethod
    def scrape(
        self,
        driver: Any,
        url: str,
        max_reviews: int = 5000,
        timeout_no_new: int = 15,
        flush_callback: Optional[Callable[[List[Dict]], None]] = None,
        flush_batch_size: int = 500,
        progress_callback: Optional[Callable[[int, Optional[int]], None]] = None,
        validation_only: bool = False
    ) -> Dict[str, Any]:
        """
        Scrape reviews from the given URL.

        Args:
            driver: WebDriver instance (e.g., Selenium WebDriver)
            url: The URL to scrape reviews from
            max_reviews: Maximum number of reviews to collect
            timeout_no_new: Seconds to wait with no new reviews before stopping
            flush_callback: Optional callback called with reviews batches for streaming
            flush_batch_size: Number of reviews before triggering flush_callback
            progress_callback: Optional callback(current_count, total_count) for progress
            validation_only: If True, return early after extracting metadata only

        Returns:
            Dictionary containing:
                - reviews: List of review dictionaries
                - total: Total number of reviews collected
                - error: Error message if any, None otherwise
                - Additional scraper-specific metadata
        """
        pass

    @abstractmethod
    def validate_url(self, url: str) -> bool:
        """
        Validate if the given URL is supported by this scraper.

        Args:
            url: The URL to validate

        Returns:
            True if the URL is valid for this scraper, False otherwise
        """
        pass

    @abstractmethod
    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]:
        """
        Extract business information from the URL without scraping reviews.

        Args:
            driver: WebDriver instance
            url: The URL to extract info from

        Returns:
            Dictionary containing business metadata (name, rating, address, etc.)
        """
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Return the human-readable name of this scraper."""
        pass

    @property
    @abstractmethod
    def version(self) -> str:
        """Return the version string of this scraper."""
        pass

    @property
    @abstractmethod
    def supported_domains(self) -> List[str]:
        """Return list of domains this scraper supports."""
        pass
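
To illustrate how a concrete scraper might satisfy this interface, here is a self-contained sketch. It is not part of the commit: `ExampleScraper`, its `example.com` domain, and the canned review data are hypothetical, and the `BaseScraper` shown is a reduced stand-in for `scrapers.base.BaseScraper` (the `scrape` signature is shortened) so the block runs on its own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List
from urllib.parse import urlparse


class BaseScraper(ABC):
    """Reduced stand-in for scrapers.base.BaseScraper (assumption, for a runnable demo)."""

    @abstractmethod
    def scrape(self, driver: Any, url: str, max_reviews: int = 5000, **kwargs: Any) -> Dict[str, Any]: ...

    @abstractmethod
    def validate_url(self, url: str) -> bool: ...

    @abstractmethod
    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]: ...

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def version(self) -> str: ...

    @property
    @abstractmethod
    def supported_domains(self) -> List[str]: ...


class ExampleScraper(BaseScraper):
    """Hypothetical scraper that returns canned data and never touches a browser."""

    # Plain class attributes override the abstract properties, so the values
    # are readable without running __init__.
    name = "example"
    version = "0.1.0"
    supported_domains = ["example.com"]

    def validate_url(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in self.supported_domains)

    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]:
        return {"name": "Example Business", "url": url}

    def scrape(self, driver: Any, url: str, max_reviews: int = 5000, **kwargs: Any) -> Dict[str, Any]:
        if not self.validate_url(url):
            return {"reviews": [], "total": 0, "error": f"unsupported URL: {url}"}
        reviews = [{"author": "demo", "rating": 5, "text": "placeholder"}][:max_reviews]
        return {"reviews": reviews, "total": len(reviews), "error": None}


scraper = ExampleScraper()
result = scraper.scrape(driver=None, url="https://www.example.com/place/1")
rejected = scraper.scrape(driver=None, url="https://other.test/place/1")
```

Declaring `name`, `version`, and `supported_domains` as class attributes is one way to let a registry read them from an instance created without `__init__`.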
21
scrapers/google_reviews/__init__.py
Normal file
@@ -0,0 +1,21 @@
"""
Google Reviews Scraper Package

This package contains the Google Reviews scraper implementations.
"""

from scrapers.google_reviews.v1_0_0 import (
    scrape_reviews,
    fast_scrape_reviews,
    get_business_card_info,
    extract_about_info,
    LogCapture,
)

__all__ = [
    "scrape_reviews",
    "fast_scrape_reviews",
    "get_business_card_info",
    "extract_about_info",
    "LogCapture",
]
@@ -1,7 +1,12 @@
 """
-Clean Google Maps Reviews Scraper
+Google Reviews Scraper v1.0.0
+
+This module provides the core Google Maps reviews scraping functionality.
 - Simple down scrolling
 - DOM scraping + API interception
+
+Version: 1.0.0
+Migrated from: modules/scraper_clean.py
 """
 
 import re
@@ -12,7 +17,7 @@ from datetime import datetime
 from typing import List, Optional
 from selenium.webdriver.common.by import By
 
-from modules.structured_logger import StructuredLogger
+from utils.logger import StructuredLogger
 
 def get_chrome_memory(driver) -> Optional[int]:
     """Get Chrome memory usage in MB using CDP."""
138
scrapers/registry.py
Normal file
@@ -0,0 +1,138 @@
"""
Scraper Registry

This module provides a registry for managing and discovering scrapers.
It allows dynamic registration and lookup of scraper implementations.
"""

from typing import Dict, List, Optional, Type

from scrapers.base import BaseScraper


class ScraperRegistry:
    """
    Registry for managing scraper implementations.

    The registry allows:
    - Registering scrapers by name and version
    - Looking up scrapers by domain or name
    - Listing all available scrapers

    Usage:
        registry = ScraperRegistry()
        registry.register(GoogleReviewsScraper)
        scraper = registry.get_scraper_for_url("https://google.com/maps/place/...")
    """

    _instance: Optional["ScraperRegistry"] = None
    _scrapers: Dict[str, Type[BaseScraper]]

    def __new__(cls) -> "ScraperRegistry":
        """Singleton pattern to ensure one global registry."""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._scrapers = {}
            cls._instance._domain_map = {}
        return cls._instance

    def register(self, scraper_class: Type[BaseScraper], name: Optional[str] = None) -> None:
        """
        Register a scraper class with the registry.

        Args:
            scraper_class: The scraper class to register (must inherit from BaseScraper)
            name: Optional name override, defaults to scraper_class.name property
        """
        # Create a temporary instance to get properties
        # Note: In production, we might want scraper_class to have class-level properties
        instance = scraper_class.__new__(scraper_class)

        scraper_name = name or instance.name
        scraper_version = instance.version
        key = f"{scraper_name}:{scraper_version}"

        self._scrapers[key] = scraper_class

        # Map domains to this scraper
        for domain in instance.supported_domains:
            if domain not in self._domain_map:
                self._domain_map[domain] = []
            self._domain_map[domain].append(key)

    def get_scraper(self, name: str, version: Optional[str] = None) -> Optional[Type[BaseScraper]]:
        """
        Get a scraper class by name and optional version.

        Args:
            name: The scraper name
            version: Optional version string. If not provided, returns the latest.

        Returns:
            The scraper class, or None if not found
        """
        if version:
            key = f"{name}:{version}"
            return self._scrapers.get(key)

        # Find latest version for this name
        matching = [k for k in self._scrapers.keys() if k.startswith(f"{name}:")]
        if not matching:
            return None

        # Sort by parsed version, newest first. Numeric components compare as
        # integers, so "1.10.0" ranks above "1.9.0" (a plain string sort would not).
        def version_key(key: str) -> list:
            parts = key[len(name) + 1:].split(".")
            return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]

        matching.sort(key=version_key, reverse=True)
        return self._scrapers.get(matching[0])

    def get_scraper_for_url(self, url: str) -> Optional[Type[BaseScraper]]:
        """
        Find a suitable scraper for the given URL.

        Args:
|
url: The URL to find a scraper for
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
The scraper class that can handle this URL, or None if no match
|
||||||
|
"""
|
||||||
|
from urllib.parse import urlparse
|
||||||
|
|
||||||
|
parsed = urlparse(url)
|
||||||
|
domain = parsed.netloc.lower()
|
||||||
|
|
||||||
|
# Remove www. prefix for matching
|
||||||
|
if domain.startswith("www."):
|
||||||
|
domain = domain[4:]
|
||||||
|
|
||||||
|
scraper_keys = self._domain_map.get(domain, [])
|
||||||
|
if not scraper_keys:
|
||||||
|
return None
|
||||||
|
|
||||||
|
# Return the latest version
|
||||||
|
scraper_keys.sort(reverse=True)
|
||||||
|
return self._scrapers.get(scraper_keys[0])
|
||||||
|
|
||||||
|
def list_scrapers(self) -> List[Dict[str, str]]:
|
||||||
|
"""
|
||||||
|
List all registered scrapers.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of dictionaries with scraper info (name, version, domains)
|
||||||
|
"""
|
||||||
|
result = []
|
||||||
|
for key, scraper_class in self._scrapers.items():
|
||||||
|
instance = scraper_class.__new__(scraper_class)
|
||||||
|
result.append({
|
||||||
|
"name": instance.name,
|
||||||
|
"version": instance.version,
|
||||||
|
"domains": instance.supported_domains
|
||||||
|
})
|
||||||
|
return result
|
||||||
|
|
||||||
|
def clear(self) -> None:
|
||||||
|
"""Clear all registered scrapers. Useful for testing."""
|
||||||
|
self._scrapers.clear()
|
||||||
|
self._domain_map.clear()
|
||||||
|
|
||||||
|
|
||||||
|
# Global registry instance
|
||||||
|
registry = ScraperRegistry()
|
||||||
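Review note on version selection in `get_scraper` and `get_scraper_for_url`: both pick the "latest" scraper by sorting the `name:version` keys as strings in descending order. A string sort ranks `1.9.0` above `1.10.0`, and the in-place `sort()` in `get_scraper_for_url` also reorders the list stored in `_domain_map`. The sketch below shows one possible alternative, assuming version strings are dotted integers such as `1.0.0` (the diff does not show the actual format). The helpers `version_tuple` and `latest_key` are hypothetical names and are not part of this commit.

```python
from typing import List, Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0); non-numeric parts count as 0 (assumption)."""
    return tuple(int(part) if part.isdigit() else 0 for part in version.split("."))


def latest_key(keys: List[str]) -> Optional[str]:
    """Return the "name:version" key with the highest numeric version.

    Uses max() rather than list.sort(), so the caller's list is not mutated.
    """
    if not keys:
        return None
    # rpartition splits on the last ":", so the version is always the tail.
    return max(keys, key=lambda k: version_tuple(k.rpartition(":")[2]))


keys = ["google_reviews:1.9.0", "google_reviews:1.10.0"]
print(sorted(keys, reverse=True)[0])  # string ordering, as in the registry
print(latest_key(keys))               # numeric ordering
```

If versions ever carry pre-release suffixes, a dedicated version parser would likely be a better fit than this integer split.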
0
services/__init__.py
Normal file
0
tests/api/__init__.py
Normal file
0
tests/integration/__init__.py
Normal file
0
tests/scrapers/__init__.py
Normal file
0
tests/scrapers/google_reviews/__init__.py
Normal file
0
tests/services/__init__.py
Normal file
0
utils/__init__.py
Normal file
@@ -67,7 +67,7 @@ class CanaryMonitor:
|
|||||||
# Alert if multiple consecutive failures
|
# Alert if multiple consecutive failures
|
||||||
if self.consecutive_failures >= 3:
|
if self.consecutive_failures >= 3:
|
||||||
await self.send_alert(
|
await self.send_alert(
|
||||||
f"🚨 CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
|
f"CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
|
||||||
f"Last error: {str(e)[:200]}"
|
f"Last error: {str(e)[:200]}"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -90,7 +90,7 @@ class CanaryMonitor:
|
|||||||
- Scrape time is reasonable
|
- Scrape time is reasonable
|
||||||
- Data structure is valid
|
- Data structure is valid
|
||||||
"""
|
"""
|
||||||
from modules.scraper_clean import fast_scrape_reviews
|
from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews
|
||||||
|
|
||||||
log.info(f"Running canary scrape test on {self.test_url[:60]}...")
|
log.info(f"Running canary scrape test on {self.test_url[:60]}...")
|
||||||
self.last_run = datetime.now()
|
self.last_run = datetime.now()
|
||||||
@@ -121,7 +121,7 @@ class CanaryMonitor:
|
|||||||
if all_passed:
|
if all_passed:
|
||||||
# Success!
|
# Success!
|
||||||
log.info(
|
log.info(
|
||||||
f"✅ Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
|
f"Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
|
||||||
)
|
)
|
||||||
self.consecutive_failures = 0
|
self.consecutive_failures = 0
|
||||||
self.last_success = datetime.now()
|
self.last_success = datetime.now()
|
||||||
@@ -144,7 +144,7 @@ class CanaryMonitor:
|
|||||||
# Validation failed
|
# Validation failed
|
||||||
failed_checks = [k for k, v in checks.items() if not v]
|
failed_checks = [k for k, v in checks.items() if not v]
|
||||||
log.error(
|
log.error(
|
||||||
f"❌ Canary test FAILED: validation failed on {failed_checks}"
|
f"Canary test FAILED: validation failed on {failed_checks}"
|
||||||
)
|
)
|
||||||
self.consecutive_failures += 1
|
self.consecutive_failures += 1
|
||||||
self.last_result = {
|
self.last_result = {
|
||||||
@@ -167,12 +167,12 @@ class CanaryMonitor:
|
|||||||
# Alert on failure
|
# Alert on failure
|
||||||
if self.consecutive_failures >= 3:
|
if self.consecutive_failures >= 3:
|
||||||
await self.send_alert(
|
await self.send_alert(
|
||||||
f"🚨 CRITICAL: Canary validation failed {self.consecutive_failures} times! "
|
f"CRITICAL: Canary validation failed {self.consecutive_failures} times! "
|
||||||
f"Failed checks: {failed_checks}"
|
f"Failed checks: {failed_checks}"
|
||||||
)
|
)
|
||||||
|
|
||||||
except asyncio.TimeoutError:
|
except asyncio.TimeoutError:
|
||||||
log.error("❌ Canary test TIMEOUT (>60s)")
|
log.error("Canary test TIMEOUT (>60s)")
|
||||||
self.consecutive_failures += 1
|
self.consecutive_failures += 1
|
||||||
self.last_result = {
|
self.last_result = {
|
||||||
"status": "timeout",
|
"status": "timeout",
|
||||||
@@ -186,11 +186,11 @@ class CanaryMonitor:
|
|||||||
|
|
||||||
if self.consecutive_failures >= 3:
|
if self.consecutive_failures >= 3:
|
||||||
await self.send_alert(
|
await self.send_alert(
|
||||||
f"🚨 CRITICAL: Canary timeout {self.consecutive_failures} times!"
|
f"CRITICAL: Canary timeout {self.consecutive_failures} times!"
|
||||||
)
|
)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
log.error(f"❌ Canary test ERROR: {e}")
|
log.error(f"Canary test ERROR: {e}")
|
||||||
self.consecutive_failures += 1
|
self.consecutive_failures += 1
|
||||||
self.last_result = {
|
self.last_result = {
|
||||||
"status": "error",
|
"status": "error",
|
||||||
0
workers/__init__.py
Normal file