Phase 0: Project restructure to ReviewIQ platform architecture

New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1143
.artifacts/ReviewIQ-Architecture-v2.md
Normal file
File diff suppressed because it is too large
2306
.artifacts/ReviewIQ-Architecture-v3.2.md
Normal file
File diff suppressed because it is too large
1513
.artifacts/ReviewIQ-Architecture-v3.md
Normal file
File diff suppressed because it is too large
183
.artifacts/ReviewIQ-v32-Decisions.md
Normal file
@@ -0,0 +1,183 @@
# ReviewIQ v3.2 Design Decisions

> Fast context-recovery document — all key decisions without the full spec.

---

## 1. Markpoint

```
ID: reviewiq-v32-span-layer-2026-01-24-001
Status: v3.2 span layer complete
Based on: v3.1.2 (commit f998277)
```

---

## 2. Core Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Span granularity | Clause/topic-level | Preserves multi-domain signal |
| span_id format | ULID (TEXT) | Survives re-segmentation |
| Span offsets | Required (NOT NULL) | Deterministic reconstruction |
| Offsets reference | reviews_enriched.text | Not text_normalized |
| Span → Issue mapping | One-to-one (UNIQUE span_id) | Atomic unit per issue |
| Primary span enforcement | Partial unique index | Exactly one per review version |
| Primary selection | I3 > I2 > I1, V- > V± > V0 > V+, span_index | Deterministic, stable |
| Reprocessing strategy | Soft-switch with is_active | No transient empty states |
| Span overlap | GiST exclusion constraint | Non-overlapping ranges enforced |
| Secondary codes | Array with cardinality ≤ 2 | Could normalize to link table later |
| Causal chain storage | JSONB | Flexibility, normalize later if needed |
| relation_type vs causal_chain | Separate concerns | relation = within-review, causal = root cause |
| Dimension columns | Postgres ENUMs | Type safety, storage efficiency |
| Trust score floor | 0.2 (GREATEST clamp) | Prevent multiplicative collapse |
| Issue routing key | (business_id, place_id, urt_primary, entity_normalized) | Deterministic, entity-aware |
| Issue ID generation | SHA256 via pgcrypto | Deterministic, collision-resistant |
| Text validation trigger | Conditional via session setting | Performance: skip in bulk loads |
| Relation validation | Application-level post-insert | Handles insertion order |

---

## 3. Extensions Required

| Extension | Purpose |
|-----------|---------|
| `btree_gist` | Exclusion constraint for non-overlapping spans |
| `pgcrypto` | SHA256-based issue ID generation |

---

## 4. New Tables

| Table | Purpose |
|-------|---------|
| `review_spans` | Span-level URT classification |
| `review_span_secondary_codes` | (Optional) Normalized secondary codes |

---

## 5. Modified Tables

| Table | Changes |
|-------|---------|
| `issue_spans` | Added `span_id` FK (NOT NULL), removed direct review FK as canonical |

---

## 6. New ENUM Types

**Valence & Intensity:**
- `urt_valence` — V-, V±, V0, V+
- `urt_intensity` — I1, I2, I3

**Specificity & Actionability:**
- `urt_specificity` — S1, S2, S3
- `urt_actionability` — A1, A2, A3

**Context & Evidence:**
- `urt_temporal` — T1, T2, T3
- `urt_evidence` — E1, E2, E3
- `urt_comparative` — CR1, CR2, CR3

**Classification:**
- `urt_profile` — factual, emotional, comparative, etc.
- `urt_confidence` — low, medium, high
- `urt_relation` — elaborates, contrasts, causes, etc.
- `urt_entity_type` — person, product, location, etc.

---

## 7. Key Functions

| Function | Purpose |
|----------|---------|
| `urt_validate_causal_chain()` | Validates causal JSONB structure |
| `validate_review_relations()` | Ensures related_span_id same-parent |
| `validate_active_spans()` | Ensures valid active span set |
| `set_primary_span()` | Deterministic primary selection |
| `generate_issue_id()` | SHA256-based issue ID |

---

## 8. Key Triggers

| Trigger | Purpose |
|---------|---------|
| `review_spans_validate_bounds` | span_end ≤ text length |
| `review_spans_validate_text` | span_text matches substring |
| `review_spans_validate_causal_chain` | causal_chain JSONB valid |

---

## 9. USN Format

```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```

**Examples:**
- `URT:S:SVC.SPD:V-I3:S3A3T2.E2.CR1` — Specific service speed complaint
- `URT:F:PRD.QUA:V+I2:S2A1T1.E3.CR2:staff→training` — Product quality praise with causal chain

---

## 10. Span Boundary Rules

1. **Split on contrasting conjunctions** — "but", "however", "although"
2. **Split on topic/target change** — Different entity or aspect
3. **Split on valence change** — Positive → Negative or vice versa
4. **Split on domain change** — SVC → PRD → AMB
5. **Keep cause→effect together** — Causal chain stays in one span

---

## 11. Deferred to v3.3+

| Item | Reason |
|------|--------|
| Entity extraction implementation | Requires NER pipeline |
| Trust-weighted fact aggregation | Needs more span data |
| Secondary domain enforcement | App-level validation sufficient |
| Span-based fact counting | Currently review-based, optimize later |

---

## 12. Open Questions Resolved

| Question | Resolution |
|----------|------------|
| Span → Issue cardinality? | **One-to-one** (not many-to-many) |
| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
| Reprocessing strategy? | **Soft-switch** with is_active flag |
| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |

---

## Quick Reference

### Primary Span Selection Algorithm

```
ORDER BY:
1. intensity DESC (I3 > I2 > I1)
2. valence ASC (V- > V± > V0 > V+)
3. span_index ASC (first wins ties)
```

### Issue Routing Key

```sql
(business_id, place_id, urt_primary, entity_normalized)
```
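
The routing key above feeds the SHA256-based issue ID listed in the decisions table. The body of `generate_issue_id()` is not shown in this document, so the following is only a sketch of the idea in Python with `hashlib`. The function name `sketch_issue_id`, the unit-separator join, and the full-length hex digest are assumptions for illustration, not the project's actual encoding. The point it demonstrates is that equal routing keys always hash to the same ID, while changing any one field yields a different one.

```python
import hashlib


def sketch_issue_id(business_id: str, place_id: str,
                    urt_primary: str, entity_normalized: str) -> str:
    """Derive a deterministic ID from the four routing-key fields.

    Assumption: fields are joined with the ASCII unit separator so that
    ("ab", "c") and ("a", "bc") cannot collide; the real SQL function may
    join or normalize the fields differently.
    """
    key = "\x1f".join([business_id, place_id, urt_primary, entity_normalized])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical values, used only to show determinism.
first = sketch_issue_id("biz-1", "place-9", "J1.03", "front desk")
second = sketch_issue_id("biz-1", "place-9", "J1.03", "front desk")
other = sketch_issue_id("biz-1", "place-9", "J1.03", "parking")
```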

### Trust Score Calculation

```sql
GREATEST(0.2, base_trust * modifiers) -- Floor prevents collapse
```
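
To see why the floor matters, here is a small Python equivalent of the `GREATEST` clamp. It assumes `modifiers` is a list of multiplicative factors, which the SQL expression does not spell out, and the name `trust_score` is hypothetical. With two modest penalties, 0.9 × 0.5 × 0.3 = 0.135 would fall below the floor, so the clamp returns 0.2 instead.

```python
from math import prod
from typing import Sequence

TRUST_FLOOR = 0.2  # floor value taken from the decisions table


def trust_score(base_trust: float, modifiers: Sequence[float]) -> float:
    """Multiply the base trust by every modifier, then clamp at the floor."""
    # Assumption: modifiers combine by plain multiplication.
    return max(TRUST_FLOOR, base_trust * prod(modifiers))


clamped = trust_score(0.9, [0.5, 0.3])   # raw product 0.135 is below the floor
unclamped = trust_score(0.9, [1.0])      # stays at 0.9
```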

---

*Last updated: 2026-01-24*
331
.artifacts/URT-v5.1-Reference.md
Normal file
@@ -0,0 +1,331 @@
# Universal Review Taxonomy (URT) v5.1 Reference

## Overview

The Universal Review Taxonomy (URT) is a classification system for customer feedback. It provides a structured approach to categorizing, annotating, and analyzing review content across any industry.

### Key Characteristics

- **Three Profiles**: Core, Standard, Full (increasing detail)
- **Seven Domains**: Covering all aspects of customer experience
- **Tier-3 Canonical Codes**: Format `X#.##` (e.g., J1.02, P2.15)
- **Dimensional Annotation**: Valence, intensity, specificity, and more
- **Causal Analysis**: Root cause chains (Full profile)

---

## Domain Codes

URT organizes feedback into seven domains, each identified by a single letter.

| Domain | Letter | Description |
|--------|--------|-------------|
| Offering | O | Product/service quality |
| Price | P | Value, pricing, promotions |
| Journey | J | Customer experience, timing, process |
| Environment | E | Physical/digital space |
| Attitude | A | Staff behavior, service attitude |
| Voice | V | Brand, communication, marketing |
| Relationship | R | Loyalty, trust, long-term relationship |

### Tier-3 Code Format

```
Pattern: [OPJEAVR][1-4]\.[0-9]{2}
```

Examples:
- `J1.02` - Journey domain, category 1, subcategory 02
- `P2.15` - Price domain, category 2, subcategory 15
- `A3.01` - Attitude domain, category 3, subcategory 01
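
As a quick illustration of the pattern above, the sketch below validates a Tier-3 code and breaks it into its parts using Python's `re` module. The helper name `decompose_code` and the shape of the returned dictionary are assumptions made for this example; only the regular expression and the domain letters come from this reference.

```python
import re

# Domain letters as listed in the Domain Codes table.
DOMAINS = {
    "O": "Offering", "P": "Price", "J": "Journey", "E": "Environment",
    "A": "Attitude", "V": "Voice", "R": "Relationship",
}

# Same pattern as above, with capture groups added for each part.
TIER3_RE = re.compile(r"([OPJEAVR])([1-4])\.([0-9]{2})")


def decompose_code(code: str) -> dict:
    """Split a Tier-3 code into domain, category, and subcategory."""
    match = TIER3_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a Tier-3 code: {code!r}")
    letter, category, subcategory = match.groups()
    return {
        "domain": DOMAINS[letter],
        "category": int(category),
        "subcategory": subcategory,  # kept as text to preserve the leading zero
    }


journey = decompose_code("J1.02")
```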

---

## Dimension Codes

### Valence

Indicates the sentiment direction of the feedback.

| Code | Meaning |
|------|---------|
| V+ | Positive |
| V- | Negative |
| V0 | Neutral |
| V± | Mixed |

### Intensity

Indicates the strength of the expressed sentiment.

| Code | Meaning |
|------|---------|
| I1 | Low intensity |
| I2 | Moderate intensity |
| I3 | High intensity |

### Specificity (Standard+)

Indicates how detailed the feedback is.

| Code | Meaning |
|------|---------|
| S1 | Low - vague, general |
| S2 | Medium - some detail |
| S3 | High - specific, precise |

### Actionability (Standard+)

Indicates whether clear actions can be derived from the feedback.

| Code | Meaning |
|------|---------|
| A1 | None - no clear action |
| A2 | Unclear - possible actions |
| A3 | Clear - specific actionable |

### Temporal (Standard+)

Indicates the time frame referenced in the feedback.

| Code | Meaning | Markers |
|------|---------|---------|
| TC | Current - this visit | "today", "this time", "yesterday" |
| TR | Recent - last few visits | "lately", "recently", "again" |
| TH | Historical - long-standing | "for years", "always", "historically" |
| TF | Future - expectations | "I won't come back", "next time" |

**Default**: TC when no temporal language exists.

### Evidence (Standard+)

Indicates how the information was obtained from the text.

| Code | Meaning | Example |
|------|---------|---------|
| ES | Stated - explicit in text | "Waited 45 minutes" |
| EI | Inferred - logically entailed | "Took 3 weeks to reply" → slow response |
| EC | Contextual - depends on context | "That happened again" |

**Default**: ES. Use EI/EC only when needed.

### Comparative

Indicates whether the feedback compares to alternatives.

| Code | Meaning |
|------|---------|
| CR-N | No comparison |
| CR-B | Better than alternatives |
| CR-W | Worse than alternatives |
| CR-S | Same as alternatives |

---

## USN (URT String Notation)

USN is a compact string encoding for URT annotations.

### Grammar

```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```

### Encoding Rules

**Valence**:
- `+` for V+
- `-` for V-

**Intensity**:
- `1` for I1
- `2` for I2
- `3` for I3

### Examples

**Standard Profile**:
```
URT:S:J1.03:-2:22TC.ES.N
```
Decoded:
- Profile: Standard
- Code: J1.03
- Valence: V- (negative)
- Intensity: I2
- Specificity: S2
- Actionability: A2
- Temporal: TC
- Evidence: ES
- Comparative: CR-N

**Full Profile with Causal Chain**:
```
URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O
```
Decoded:
- Profile: Full
- Codes: J1.01, A1.04
- Valence: V- (negative)
- Intensity: I3
- Specificity: S2
- Actionability: A3
- Temporal: TR
- Evidence: EI
- Comparative: CR-S
- Causal: CD.O (Conditions-Operational), MG.O (Management-Oversight)
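
The two decoded examples above can be reproduced with a small parser. This is a sketch under stated assumptions rather than a reference implementation: the names `USN` and `parse_usn` are hypothetical, only the `+` and `-` valence symbols listed under Encoding Rules are accepted (the encoding of V0 and V± is not given here), and the causal segment is kept as raw comma-separated tokens without further validation.

```python
import re
from dataclasses import dataclass, field
from typing import List

_TIER3 = r"[OPJEAVR][1-4]\.[0-9]{2}"
_USN_RE = re.compile(
    r"URT:(?P<profile>[SF])"
    r":(?P<codes>" + _TIER3 + r"(?:\+" + _TIER3 + r")*)"
    r":(?P<valence>[+-])(?P<intensity>[123])"
    r":(?P<spec>[123])(?P<act>[123])(?P<temporal>TC|TR|TH|TF)"
    r"\.(?P<evidence>ES|EI|EC)\.(?P<comparative>[NBWS])"
    r"(?::(?P<causal>.+))?"
)


@dataclass
class USN:
    profile: str
    codes: List[str]
    valence: str
    intensity: str
    specificity: str
    actionability: str
    temporal: str
    evidence: str
    comparative: str
    causal: List[str] = field(default_factory=list)


def parse_usn(text: str) -> USN:
    """Decode a Standard or Full USN string into its labelled dimensions."""
    m = _USN_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not a recognised USN string: {text!r}")
    if m["causal"] is not None and m["profile"] != "F":
        raise ValueError("a causal segment is only valid in the Full profile")
    return USN(
        profile={"S": "Standard", "F": "Full"}[m["profile"]],
        codes=m["codes"].split("+"),
        valence="V" + m["valence"],
        intensity="I" + m["intensity"],
        specificity="S" + m["spec"],
        actionability="A" + m["act"],
        temporal=m["temporal"],
        evidence=m["evidence"],
        comparative="CR-" + m["comparative"],
        causal=m["causal"].split(",") if m["causal"] else [],
    )


standard = parse_usn("URT:S:J1.03:-2:22TC.ES.N")
full = parse_usn("URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O")
```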

---

## Causal Chain (Full Profile Only)

The causal chain identifies root causes across three layers, ordered from immediate to systemic.

### Layers

| Layer | Codes | Scope |
|-------|-------|-------|
| conditions | CD-S, CD-T, CD-E, CD-F, CD-O | Staff State, Team Dynamics, Equipment, Facility, Operational |
| management | MG-P, MG-T, MG-O, MG-R, MG-C | Planning, Training, Oversight, Resources, Communication |
| systemic | SY-R, SY-P, SY-C, SY-S, SY-H, SY-X | Resource Decisions, Policy, Culture, Standards, Human Capital, External |

### Code Reference

**Conditions Layer**:
- `CD-S` - Staff State
- `CD-T` - Team Dynamics
- `CD-E` - Equipment
- `CD-F` - Facility
- `CD-O` - Operational

**Management Layer**:
- `MG-P` - Planning
- `MG-T` - Training
- `MG-O` - Oversight
- `MG-R` - Resources
- `MG-C` - Communication

**Systemic Layer**:
- `SY-R` - Resource Decisions
- `SY-P` - Policy
- `SY-C` - Culture
- `SY-S` - Standards
- `SY-H` - Human Capital
- `SY-X` - External

### JSONB Schema

```json
[
{"layer": "conditions", "code": "CD-O", "evidence": "ES"},
{"layer": "management", "code": "MG-P", "evidence": "EI"}
]
```

### Constraints

- Maximum 3 entries (one per layer)
- Only include when text explicitly supports it
- Order: conditions → management → systemic
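
The structural constraints above (at most three entries, one per layer, fixed layer order, codes that belong to their layer) can be checked mechanically. The sketch below does so in Python on the decoded JSON. It is not the `urt_validate_causal_chain()` database function, whose body is not shown here; the name `causal_chain_errors` is hypothetical, and treating all three evidence codes (ES, EI, EC) as valid in a chain is an assumption. The "text explicitly supports it" rule is a judgment call and is not checked.

```python
from typing import Any, List

LAYER_ORDER = ["conditions", "management", "systemic"]
LAYER_CODES = {
    "conditions": {"CD-S", "CD-T", "CD-E", "CD-F", "CD-O"},
    "management": {"MG-P", "MG-T", "MG-O", "MG-R", "MG-C"},
    "systemic": {"SY-R", "SY-P", "SY-C", "SY-S", "SY-H", "SY-X"},
}
EVIDENCE_CODES = {"ES", "EI", "EC"}  # assumption: all three are allowed


def causal_chain_errors(chain: Any) -> List[str]:
    """Return a list of constraint violations; an empty list means valid."""
    if not isinstance(chain, list):
        return ["chain must be a JSON array"]
    errors: List[str] = []
    if len(chain) > 3:
        errors.append("more than 3 entries")
    last_rank = -1
    for i, entry in enumerate(chain):
        if not isinstance(entry, dict):
            errors.append(f"entry {i}: not an object")
            continue
        layer = entry.get("layer")
        if layer not in LAYER_CODES:
            errors.append(f"entry {i}: unknown layer {layer!r}")
            continue
        if entry.get("code") not in LAYER_CODES[layer]:
            errors.append(f"entry {i}: code does not belong to layer {layer}")
        if entry.get("evidence") not in EVIDENCE_CODES:
            errors.append(f"entry {i}: unknown evidence code")
        rank = LAYER_ORDER.index(layer)
        if rank <= last_rank:
            # Covers both a repeated layer and an out-of-order layer.
            errors.append(f"entry {i}: layer {layer} repeated or out of order")
        last_rank = max(last_rank, rank)
    return errors


valid_chain = [
    {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
    {"layer": "management", "code": "MG-P", "evidence": "EI"},
]
reversed_chain = list(reversed(valid_chain))
```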

---

## Span Boundary Detection Rules

Spans are detected at the clause/topic level, not sentence level.

### Split Rules (in priority order)

1. **Split on contrasting conjunctions**: but, however, although, despite, yet
2. **Split when subject/target changes** (topic shift)
3. **Split when valence changes** (positive ↔ negative)
4. **Split when domain changes** (O/P/J/E/A/V/R)
5. **Keep together** for cause→effect within same feedback unit

### Guidelines

- **Maximum**: ~3 spans per sentence
- **Validation**: If 4+ spans detected, re-check for over-splitting

### Example

**Input**:
> "The food was great but the service was slow and the bathroom was dirty."

**Output**: 3 spans
1. "The food was great" (Offering, positive)
2. "the service was slow" (Journey/Attitude, negative)
3. "the bathroom was dirty" (Environment, negative)

**Reasoning**: Topic shift + domain shift at each boundary.
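
Only the first split rule is purely lexical, so it is the only one sketched here. The snippet below is a naive illustration of rule 1 plus the 4+ spans guideline, not a segmentation pipeline: the names `split_on_contrast` and `needs_recheck` are hypothetical, and word matching alone will mis-split phrases such as "not yet". On the example sentence it produces two pieces, because the second boundary (before "and the bathroom") comes from the topic and domain rules, which need classification rather than keyword matching.

```python
import re
from typing import List

# Conjunction list copied from split rule 1.
CONTRAST_WORDS = ("but", "however", "although", "despite", "yet")
_CONTRAST_RE = re.compile(
    r"[\s,;]*\b(?:" + "|".join(CONTRAST_WORDS) + r")\b[\s,]*",
    re.IGNORECASE,
)


def split_on_contrast(sentence: str) -> List[str]:
    """Apply rule 1 only: cut the sentence at contrasting conjunctions."""
    pieces = _CONTRAST_RE.split(sentence)
    return [piece.strip() for piece in pieces if piece.strip()]


def needs_recheck(spans: List[str]) -> bool:
    """Guideline: 4 or more spans in one sentence suggests over-splitting."""
    return len(spans) >= 4


pieces = split_on_contrast(
    "The food was great but the service was slow and the bathroom was dirty."
)
```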

---

## Primary Span Selection

When a review contains multiple spans, select the primary span using these criteria in order:

### Selection Priority

1. **Highest intensity** (I3 > I2 > I1)
2. **Tie-break**: Negative over positive (V- > V± > V0 > V+)
3. **Tie-break**: Earliest span_index

### Example

Given spans:
- Span 0: I2, V+
- Span 1: I3, V+
- Span 2: I3, V-

**Primary**: Span 2 (highest intensity I3, negative valence wins tie-break)
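
The selection priority maps directly onto a composite sort key. The sketch below applies it to the three spans from the example; the dictionary field names and the function name `select_primary` are assumptions for illustration, and the database-side `set_primary_span()` function may be implemented differently.

```python
from typing import Dict, List

# Lower rank wins: I3 before I2 before I1, V- before V± before V0 before V+.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def select_primary(spans: List[Dict]) -> Dict:
    """Pick the primary span by intensity, then valence, then span_index."""
    return min(
        spans,
        key=lambda s: (
            INTENSITY_RANK[s["intensity"]],
            VALENCE_RANK[s["valence"]],
            s["span_index"],
        ),
    )


example_spans = [
    {"span_index": 0, "intensity": "I2", "valence": "V+"},
    {"span_index": 1, "intensity": "I3", "valence": "V+"},
    {"span_index": 2, "intensity": "I3", "valence": "V-"},
]
primary = select_primary(example_spans)
```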

---

## Secondary Codes Rules

Secondary codes capture additional topics mentioned in a span.

### Constraints

- **Maximum**: 2 secondary codes
- **Format**: Must be Tier-3 (X#.##)
- **Recommendation**: Should be different domain from primary

### Example

Primary: `J1.03` (Journey)
Secondary: `A2.01`, `E1.05` (Attitude, Environment)

---

## Quick Reference Card

### Profiles

| Profile | Dimensions | Causal Chain |
|---------|------------|--------------|
| Core | V, I | No |
| Standard | V, I, S, A, T, E, CR | No |
| Full | V, I, S, A, T, E, CR | Yes |

### USN Quick Format

```
URT:{S|F}:{tier3_codes}:{valence}{intensity}:{SAT}.{E}.{CR}[:{causal}]
```

### Domain Letters

```
O P J E A V R
│ │ │ │ │ │ └─ Relationship
│ │ │ │ │ └─── Voice
│ │ │ │ └───── Attitude
│ │ │ └─────── Environment
│ │ └───────── Journey
│ └─────────── Price
└───────────── Offering
```
0
api/__init__.py
Normal file
0
api/middleware/__init__.py
Normal file
0
api/routes/__init__.py
Normal file
@@ -20,13 +20,13 @@ from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel, HttpUrl, Field
 from fastapi.responses import JSONResponse, StreamingResponse

-from modules.database import DatabaseManager, JobStatus
-from modules.webhooks import WebhookDispatcher, WebhookManager
-from modules.health_checks import HealthCheckSystem
-from modules.scraper_clean import fast_scrape_reviews, LogCapture, get_business_card_info  # Clean scraper
-from modules.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
-from modules.structured_logger import StructuredLogger, LogEntry
-from modules.chrome_pool import (
+from core.database import DatabaseManager, JobStatus
+from services.webhook_service import WebhookDispatcher, WebhookManager
+from utils.health_checks import HealthCheckSystem
+from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews, LogCapture, get_business_card_info  # Clean scraper
+from utils.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
+from utils.logger import StructuredLogger, LogEntry
+from workers.chrome_pool import (
     start_worker_pools,
     stop_worker_pools,
     get_validation_worker,
0
core/__init__.py
Normal file
@@ -8,22 +8,13 @@ import json
 from datetime import datetime
 from typing import Optional, List, Dict, Any
 from uuid import UUID, uuid4
-from enum import Enum
 import logging

+from core.enums import JobStatus
+
 log = logging.getLogger(__name__)


-class JobStatus(str, Enum):
-    """Job status enumeration"""
-    PENDING = "pending"
-    RUNNING = "running"
-    COMPLETED = "completed"
-    FAILED = "failed"
-    CANCELLED = "cancelled"
-    PARTIAL = "partial"  # Job crashed but has partial reviews saved
-
-
 class DatabaseManager:
     """PostgreSQL database manager with connection pooling"""
14
core/enums.py
Normal file
@@ -0,0 +1,14 @@
"""
Enumerations for the ReviewIQ project.
"""
from enum import Enum


class JobStatus(str, Enum):
    """Job status enumeration"""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    PARTIAL = "partial"  # Job crashed but has partial reviews saved
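
Because `JobStatus` mixes in `str`, its members behave like plain strings wherever a status value crosses a boundary such as a database row or a JSON response. The short sketch below re-declares the enum so it runs on its own and shows three consequences: lookup by value, equality with raw strings, and direct JSON serialization. The variable names are illustrative only.

```python
import json
from enum import Enum


class JobStatus(str, Enum):
    """Copy of the enum above so the example is self-contained."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    PARTIAL = "partial"


# Value lookup: turn a stored string back into the enum member.
restored = JobStatus("partial")

# str mixin: members compare equal to their raw string value.
matches_raw_string = JobStatus.FAILED == "failed"

# str mixin: json.dumps needs no custom encoder.
payload = json.dumps({"status": JobStatus.RUNNING})
```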
@@ -6,7 +6,7 @@ from dataclasses import dataclass, field

 from selenium.webdriver.remote.webelement import WebElement

-from modules.utils import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)
+from utils.helpers import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)


 @dataclass
10
scrapers/__init__.py
Normal file
@@ -0,0 +1,10 @@
"""
Scrapers Package

This package contains all scraper implementations for the ReviewIQ system.
"""

from scrapers.base import BaseScraper
from scrapers.registry import ScraperRegistry, registry

__all__ = ["BaseScraper", "ScraperRegistry", "registry"]
97
scrapers/base.py
Normal file
@@ -0,0 +1,97 @@
"""
Base Scraper Interface

This module defines the abstract base class that all scrapers must implement.
It ensures consistent interface across different scraper implementations.
"""

from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional


class BaseScraper(ABC):
    """
    Abstract base class for all scrapers in the ReviewIQ system.

    All concrete scraper implementations must inherit from this class
    and implement the required abstract methods.
    """

    @abstractmethod
    def scrape(
        self,
        driver: Any,
        url: str,
        max_reviews: int = 5000,
        timeout_no_new: int = 15,
        flush_callback: Optional[Callable[[List[Dict]], None]] = None,
        flush_batch_size: int = 500,
        progress_callback: Optional[Callable[[int, Optional[int]], None]] = None,
        validation_only: bool = False
    ) -> Dict[str, Any]:
        """
        Scrape reviews from the given URL.

        Args:
            driver: WebDriver instance (e.g., Selenium WebDriver)
            url: The URL to scrape reviews from
            max_reviews: Maximum number of reviews to collect
            timeout_no_new: Seconds to wait with no new reviews before stopping
            flush_callback: Optional callback called with reviews batches for streaming
            flush_batch_size: Number of reviews before triggering flush_callback
            progress_callback: Optional callback(current_count, total_count) for progress
            validation_only: If True, return early after extracting metadata only

        Returns:
            Dictionary containing:
            - reviews: List of review dictionaries
            - total: Total number of reviews collected
            - error: Error message if any, None otherwise
            - Additional scraper-specific metadata
        """
        pass

    @abstractmethod
    def validate_url(self, url: str) -> bool:
        """
        Validate if the given URL is supported by this scraper.

        Args:
            url: The URL to validate

        Returns:
            True if the URL is valid for this scraper, False otherwise
        """
        pass

    @abstractmethod
    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]:
        """
        Extract business information from the URL without scraping reviews.

        Args:
            driver: WebDriver instance
            url: The URL to extract info from

        Returns:
            Dictionary containing business metadata (name, rating, address, etc.)
        """
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Return the human-readable name of this scraper."""
        pass

    @property
    @abstractmethod
    def version(self) -> str:
        """Return the version string of this scraper."""
        pass

    @property
    @abstractmethod
    def supported_domains(self) -> List[str]:
        """Return list of domains this scraper supports."""
        pass
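
To make the contract concrete, here is a minimal sketch of a class that satisfies the interface. Everything in it is hypothetical: `ExampleScraper`, the `example.com` domain, and the canned reviews are placeholders, and the `BaseScraper` shown is a condensed stand-in for `scrapers/base.py` so the snippet runs on its own. It also shows that the abstract base cannot be instantiated directly.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List
from urllib.parse import urlparse


class BaseScraper(ABC):
    """Condensed stand-in for scrapers/base.py (same abstract members)."""

    @abstractmethod
    def scrape(self, driver: Any, url: str, max_reviews: int = 5000,
               timeout_no_new: int = 15, flush_callback=None,
               flush_batch_size: int = 500, progress_callback=None,
               validation_only: bool = False) -> Dict[str, Any]: ...

    @abstractmethod
    def validate_url(self, url: str) -> bool: ...

    @abstractmethod
    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]: ...

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def version(self) -> str: ...

    @property
    @abstractmethod
    def supported_domains(self) -> List[str]: ...


class ExampleScraper(BaseScraper):
    """Hypothetical scraper returning canned data instead of driving a browser."""

    @property
    def name(self) -> str:
        return "example"

    @property
    def version(self) -> str:
        return "0.0.1"

    @property
    def supported_domains(self) -> List[str]:
        return ["example.com"]

    def validate_url(self, url: str) -> bool:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        return host in self.supported_domains

    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]:
        return {"name": "Example Cafe", "url": url}

    def scrape(self, driver: Any, url: str, max_reviews: int = 5000,
               timeout_no_new: int = 15, flush_callback=None,
               flush_batch_size: int = 500, progress_callback=None,
               validation_only: bool = False) -> Dict[str, Any]:
        if not self.validate_url(url):
            return {"reviews": [], "total": 0, "error": "unsupported URL"}
        if validation_only:
            return {"reviews": [], "total": 0, "error": None,
                    "business": self.get_business_info(driver, url)}
        reviews = [{"text": f"review {i}"} for i in range(min(3, max_reviews))]
        if flush_callback is not None:
            flush_callback(reviews)  # single batch in this toy example
        if progress_callback is not None:
            progress_callback(len(reviews), len(reviews))
        return {"reviews": reviews, "total": len(reviews), "error": None}


flushed: List[Dict] = []
result = ExampleScraper().scrape(None, "https://www.example.com/place/1",
                                 flush_callback=flushed.extend)
```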
21
scrapers/google_reviews/__init__.py
Normal file
@@ -0,0 +1,21 @@
"""
Google Reviews Scraper Package

This package contains the Google Reviews scraper implementations.
"""

from scrapers.google_reviews.v1_0_0 import (
    scrape_reviews,
    fast_scrape_reviews,
    get_business_card_info,
    extract_about_info,
    LogCapture,
)

__all__ = [
    "scrape_reviews",
    "fast_scrape_reviews",
    "get_business_card_info",
    "extract_about_info",
    "LogCapture",
]
@@ -1,7 +1,12 @@
 """
-Clean Google Maps Reviews Scraper
+Google Reviews Scraper v1.0.0
+
+This module provides the core Google Maps reviews scraping functionality.
 - Simple down scrolling
 - DOM scraping + API interception
+
+Version: 1.0.0
+Migrated from: modules/scraper_clean.py
 """

 import re
@@ -12,7 +17,7 @@ from datetime import datetime
 from typing import List, Optional
 from selenium.webdriver.common.by import By

-from modules.structured_logger import StructuredLogger
+from utils.logger import StructuredLogger

 def get_chrome_memory(driver) -> Optional[int]:
     """Get Chrome memory usage in MB using CDP."""
138
scrapers/registry.py
Normal file
@@ -0,0 +1,138 @@
"""
Scraper Registry

This module provides a registry for managing and discovering scrapers.
It allows dynamic registration and lookup of scraper implementations.
"""

from typing import Dict, List, Optional, Type

from scrapers.base import BaseScraper


class ScraperRegistry:
    """
    Registry for managing scraper implementations.

    The registry allows:
    - Registering scrapers by name and version
    - Looking up scrapers by domain or name
    - Listing all available scrapers

    Usage:
        registry = ScraperRegistry()
        registry.register(GoogleReviewsScraper)
        scraper = registry.get_scraper_for_url("https://google.com/maps/place/...")
    """

    _instance: Optional["ScraperRegistry"] = None
    _scrapers: Dict[str, Type[BaseScraper]]

    def __new__(cls) -> "ScraperRegistry":
        """Singleton pattern to ensure one global registry."""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._scrapers = {}
            cls._instance._domain_map = {}
        return cls._instance

    def register(self, scraper_class: Type[BaseScraper], name: Optional[str] = None) -> None:
        """
        Register a scraper class with the registry.

        Args:
            scraper_class: The scraper class to register (must inherit from BaseScraper)
            name: Optional name override, defaults to scraper_class.name property
        """
        # Create a temporary instance to get properties
        # Note: In production, we might want scraper_class to have class-level properties
        instance = scraper_class.__new__(scraper_class)

        scraper_name = name or instance.name
        scraper_version = instance.version
        key = f"{scraper_name}:{scraper_version}"

        self._scrapers[key] = scraper_class

        # Map domains to this scraper
        for domain in instance.supported_domains:
            if domain not in self._domain_map:
                self._domain_map[domain] = []
            self._domain_map[domain].append(key)

    def get_scraper(self, name: str, version: Optional[str] = None) -> Optional[Type[BaseScraper]]:
        """
        Get a scraper class by name and optional version.

        Args:
            name: The scraper name
            version: Optional version string. If not provided, returns the latest.

        Returns:
            The scraper class, or None if not found
        """
        if version:
            key = f"{name}:{version}"
            return self._scrapers.get(key)

        # Find latest version for this name
        matching = [k for k in self._scrapers.keys() if k.startswith(f"{name}:")]
        if not matching:
            return None

        # Sort by version and return latest
        matching.sort(reverse=True)
        return self._scrapers.get(matching[0])

    def get_scraper_for_url(self, url: str) -> Optional[Type[BaseScraper]]:
        """
        Find a suitable scraper for the given URL.

        Args:
            url: The URL to find a scraper for

        Returns:
            The scraper class that can handle this URL, or None if no match
        """
        from urllib.parse import urlparse

        parsed = urlparse(url)
        domain = parsed.netloc.lower()

        # Remove www. prefix for matching
        if domain.startswith("www."):
            domain = domain[4:]

        scraper_keys = self._domain_map.get(domain, [])
        if not scraper_keys:
            return None

        # Return the latest version
        scraper_keys.sort(reverse=True)
        return self._scrapers.get(scraper_keys[0])

    def list_scrapers(self) -> List[Dict[str, str]]:
        """
        List all registered scrapers.

        Returns:
            List of dictionaries with scraper info (name, version, domains)
        """
        result = []
        for key, scraper_class in self._scrapers.items():
            instance = scraper_class.__new__(scraper_class)
            result.append({
                "name": instance.name,
                "version": instance.version,
                "domains": instance.supported_domains
            })
        return result

    def clear(self) -> None:
        """Clear all registered scrapers. Useful for testing."""
        self._scrapers.clear()
        self._domain_map.clear()


# Global registry instance
registry = ScraperRegistry()
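
One thing to watch in the registry above: "latest version" is chosen by sorting the `name:version` keys as strings, so `1.9.0` sorts after `1.10.0`. With only `1.0.0` registered this has no effect yet. The sketch below shows the difference and one possible numeric comparison. The helper names `version_key` and `latest_key` and the sample keys are hypothetical, and the approach assumes versions are purely numeric dotted strings (no pre-release tags).

```python
from typing import Iterable, Tuple


def version_key(registry_key: str) -> Tuple[int, ...]:
    """Turn 'name:1.10.0' into (1, 10, 0) for numeric comparison."""
    _, version = registry_key.rsplit(":", 1)
    return tuple(int(part) for part in version.split("."))


def latest_key(registry_keys: Iterable[str]) -> str:
    """Pick the key with the numerically highest version."""
    return max(registry_keys, key=version_key)


keys = ["google_reviews:1.9.0", "google_reviews:1.10.0", "google_reviews:1.2.0"]

# What a plain string sort (as in get_scraper) would pick.
lexicographic_latest = sorted(keys, reverse=True)[0]

# What a numeric comparison picks.
numeric_latest = latest_key(keys)
```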
0
services/__init__.py
Normal file
0
tests/api/__init__.py
Normal file
0
tests/integration/__init__.py
Normal file
0
tests/scrapers/__init__.py
Normal file
0
tests/scrapers/google_reviews/__init__.py
Normal file
0
tests/services/__init__.py
Normal file
0
utils/__init__.py
Normal file
@@ -67,7 +67,7 @@ class CanaryMonitor:
                 # Alert if multiple consecutive failures
                 if self.consecutive_failures >= 3:
                     await self.send_alert(
-                        f"🚨 CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
+                        f"CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
                         f"Last error: {str(e)[:200]}"
                     )

@@ -90,7 +90,7 @@ class CanaryMonitor:
         - Scrape time is reasonable
         - Data structure is valid
         """
-        from modules.scraper_clean import fast_scrape_reviews
+        from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews

         log.info(f"Running canary scrape test on {self.test_url[:60]}...")
         self.last_run = datetime.now()
@@ -121,7 +121,7 @@ class CanaryMonitor:
             if all_passed:
                 # Success!
                 log.info(
-                    f"✅ Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
+                    f"Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
                 )
                 self.consecutive_failures = 0
                 self.last_success = datetime.now()
@@ -144,7 +144,7 @@ class CanaryMonitor:
                 # Validation failed
                 failed_checks = [k for k, v in checks.items() if not v]
                 log.error(
-                    f"❌ Canary test FAILED: validation failed on {failed_checks}"
+                    f"Canary test FAILED: validation failed on {failed_checks}"
                 )
                 self.consecutive_failures += 1
                 self.last_result = {
@@ -167,12 +167,12 @@ class CanaryMonitor:
                 # Alert on failure
                 if self.consecutive_failures >= 3:
                     await self.send_alert(
-                        f"🚨 CRITICAL: Canary validation failed {self.consecutive_failures} times! "
+                        f"CRITICAL: Canary validation failed {self.consecutive_failures} times! "
                         f"Failed checks: {failed_checks}"
                     )

         except asyncio.TimeoutError:
-            log.error("❌ Canary test TIMEOUT (>60s)")
+            log.error("Canary test TIMEOUT (>60s)")
             self.consecutive_failures += 1
             self.last_result = {
                 "status": "timeout",
@@ -186,11 +186,11 @@ class CanaryMonitor:

             if self.consecutive_failures >= 3:
                 await self.send_alert(
-                    f"🚨 CRITICAL: Canary timeout {self.consecutive_failures} times!"
+                    f"CRITICAL: Canary timeout {self.consecutive_failures} times!"
                 )

         except Exception as e:
-            log.error(f"❌ Canary test ERROR: {e}")
+            log.error(f"Canary test ERROR: {e}")
             self.consecutive_failures += 1
             self.last_result = {
                 "status": "error",
0
workers/__init__.py
Normal file