Phase 0: Project restructure to ReviewIQ platform architecture

New structure:
- scrapers/google_reviews/v1_0_0.py (was modules/scraper_clean.py)
- scrapers/base.py (BaseScraper interface)
- scrapers/registry.py (ScraperRegistry for version routing)
- core/database.py, models.py, config.py, enums.py
- utils/logger.py, crash_analyzer.py, health_checks.py, helpers.py, date_converter.py
- workers/chrome_pool.py
- services/webhook_service.py
- api/ routes structure (empty, ready for Phase 2)
- tests/ structure mirroring source

All imports updated in:
- api_server_production.py (7 import paths updated)
- utils/health_checks.py (scraper import path)

Legacy modules moved to modules/_legacy/:
- data_storage.py, image_handler.py, s3_handler.py (unused)

Syntax verified, frontend build passing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-24 15:22:08 +00:00
parent bb0291f265
commit 544e028c3f
37 changed files with 5782 additions and 30 deletions

File diff suppressed because it is too large (3 files)

@@ -0,0 +1,183 @@
# ReviewIQ v3.2 Design Decisions
> Fast context-recovery document — all key decisions without the full spec.
---
## 1. Markpoint
```
ID: reviewiq-v32-span-layer-2026-01-24-001
Status: v3.2 span layer complete
Based on: v3.1.2 (commit f998277)
```
---
## 2. Core Design Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Span granularity | Clause/topic-level | Preserves multi-domain signal |
| span_id format | ULID (TEXT) | Survives re-segmentation |
| Span offsets | Required (NOT NULL) | Deterministic reconstruction |
| Offsets reference | reviews_enriched.text | Not text_normalized |
| Span → Issue mapping | One-to-one (UNIQUE span_id) | Atomic unit per issue |
| Primary span enforcement | Partial unique index | Exactly one per review version |
| Primary selection | I3>I2>I1, V->V±>V0>V+, span_index | Deterministic, stable |
| Reprocessing strategy | Soft-switch with is_active | No transient empty states |
| Span overlap | GiST exclusion constraint | Non-overlapping ranges enforced |
| Secondary codes | Array with cardinality ≤ 2 | Could normalize to link table later |
| Causal chain storage | JSONB | Flexibility, normalize later if needed |
| relation_type vs causal_chain | Separate concerns | relation = within-review, causal = root cause |
| Dimension columns | Postgres ENUMs | Type safety, storage efficiency |
| Trust score floor | 0.2 (GREATEST clamp) | Prevent multiplicative collapse |
| Issue routing key | (business_id, place_id, urt_primary, entity_normalized) | Deterministic, entity-aware |
| Issue ID generation | SHA256 via pgcrypto | Deterministic, collision-resistant |
| Text validation trigger | Conditional via session setting | Performance: skip in bulk loads |
| Relation validation | Application-level post-insert | Handles insertion order |
---
## 3. Extensions Required
| Extension | Purpose |
|-----------|---------|
| `btree_gist` | Exclusion constraint for non-overlapping spans |
| `pgcrypto` | SHA256-based issue ID generation |
---
## 4. New Tables
| Table | Purpose |
|-------|---------|
| `review_spans` | Span-level URT classification |
| `review_span_secondary_codes` | (Optional) Normalized secondary codes |
---
## 5. Modified Tables
| Table | Changes |
|-------|---------|
| `issue_spans` | Added `span_id` FK (NOT NULL); the direct review FK is no longer the canonical link |
---
## 6. New ENUM Types
**Valence & Intensity:**
- `urt_valence` — V-, V±, V0, V+
- `urt_intensity` — I1, I2, I3
**Specificity & Actionability:**
- `urt_specificity` — S1, S2, S3
- `urt_actionability` — A1, A2, A3
**Context & Evidence:**
- `urt_temporal` — T1, T2, T3
- `urt_evidence` — E1, E2, E3
- `urt_comparative` — CR1, CR2, CR3
**Classification:**
- `urt_profile` — factual, emotional, comparative, etc.
- `urt_confidence` — low, medium, high
- `urt_relation` — elaborates, contrasts, causes, etc.
- `urt_entity_type` — person, product, location, etc.
---
## 7. Key Functions
| Function | Purpose |
|----------|---------|
| `urt_validate_causal_chain()` | Validates causal JSONB structure |
| `validate_review_relations()` | Ensures related_span_id same-parent |
| `validate_active_spans()` | Ensures valid active span set |
| `set_primary_span()` | Deterministic primary selection |
| `generate_issue_id()` | SHA256-based issue ID |
---
## 8. Key Triggers
| Trigger | Purpose |
|---------|---------|
| `review_spans_validate_bounds` | span_end ≤ text length |
| `review_spans_validate_text` | span_text matches substring |
| `review_spans_validate_causal_chain` | causal_chain JSONB valid |
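
The bounds, text-match, and non-overlap rules can be illustrated outside the database. The following is a minimal Python sketch, not the trigger or GiST constraint itself; the helper name `span_errors`, the tuple layout, and the end-exclusive offset convention are assumptions made for illustration.

```python
from typing import List, Tuple

# (span_start, span_end, span_text); end-exclusive offsets are an assumption.
Span = Tuple[int, int, str]


def span_errors(text: str, spans: List[Span]) -> List[str]:
    """Collect violations of the bounds, text-match, and non-overlap rules."""
    errors = []
    for start, end, span_text in spans:
        if not (0 <= start < end <= len(text)):
            errors.append(f"bounds out of range: [{start}, {end})")
        elif text[start:end] != span_text:
            errors.append(f"span_text mismatch at [{start}, {end})")
    # Sort by start offset, then compare each span with its successor.
    ordered = sorted(spans)
    for (_, end_a, _), (start_b, _, _) in zip(ordered, ordered[1:]):
        if start_b < end_a:
            errors.append(f"overlap before offset {end_a}")
    return errors


review = "The food was great but the service was slow"
first = "The food was great"
second = "the service was slow"
second_start = review.index(second)
good = [(0, len(first), first), (second_start, second_start + len(second), second)]
print(span_errors(review, good))
```

Offsets here index the original review text, matching the decision that offsets reference `reviews_enriched.text` rather than a normalized copy.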
---
## 9. USN Format
```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```
**Examples:**
- `URT:S:SVC.SPD:V-I3:S3A3T2.E2.CR1` — Specific service speed complaint
- `URT:F:PRD.QUA:V+I2:S2A1T1.E3.CR2:staff→training` — Product quality praise with causal chain
---
## 10. Span Boundary Rules
1. **Split on contrasting conjunctions** — "but", "however", "although"
2. **Split on topic/target change** — Different entity or aspect
3. **Split on valence change** — Positive → Negative or vice versa
4. **Split on domain change** — SVC → PRD → AMB
5. **Keep cause→effect together** — Causal chain stays in one span
---
## 11. Deferred to v3.3+
| Item | Reason |
|------|--------|
| Entity extraction implementation | Requires NER pipeline |
| Trust-weighted fact aggregation | Needs more span data |
| Secondary domain enforcement | App-level validation sufficient |
| Span-based fact counting | Currently review-based, optimize later |
---
## 12. Open Questions Resolved
| Question | Resolution |
|----------|------------|
| Span → Issue cardinality? | **One-to-one** (not many-to-many) |
| Offsets nullable for LLM-inferred? | **No** — required, NOT NULL |
| Reprocessing strategy? | **Soft-switch** with is_active flag |
| TEXT vs ENUM for dimensions? | **ENUMs** — committed to Postgres |
---
## Quick Reference
### Primary Span Selection Algorithm
```
ORDER BY:
1. intensity DESC (I3 > I2 > I1)
2. valence ASC (V- > V± > V0 > V+)
3. span_index ASC (first wins ties)
```
### Issue Routing Key
```sql
(business_id, place_id, urt_primary, entity_normalized)
```
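
As a rough illustration of how a deterministic ID can be derived from this routing key, here is a Python sketch using `hashlib`. The real `generate_issue_id()` runs in Postgres via pgcrypto; the function name `issue_id_sketch`, the delimiter, and the hex encoding below are assumptions, not the actual implementation.

```python
import hashlib
from typing import Optional


def issue_id_sketch(
    business_id: str,
    place_id: str,
    urt_primary: str,
    entity_normalized: Optional[str],
) -> str:
    """Hash the routing key into a stable hex digest (hypothetical helper)."""
    parts = [business_id, place_id, urt_primary, entity_normalized or ""]
    # "\x1f" (ASCII unit separator) is an assumed delimiter; it keeps
    # ("ab", "c") and ("a", "bc") from producing the same key string.
    key = "\x1f".join(parts)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


print(issue_id_sketch("biz-1", "place-9", "J1.03", "front desk"))
```

The same four inputs always produce the same ID, so re-running classification routes a span to the existing issue instead of creating a new one.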
### Trust Score Calculation
```sql
GREATEST(0.2, base_trust * modifiers) -- Floor prevents collapse
```
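
The same clamp can be written in Python to show why the floor matters. This is a sketch under the assumption that modifiers combine by multiplication, as the SQL expression suggests; the names `trust_score` and `TRUST_FLOOR` are illustrative only.

```python
import math
from typing import Sequence

TRUST_FLOOR = 0.2  # floor value from the design decision table


def trust_score(base_trust: float, modifiers: Sequence[float]) -> float:
    """Multiply the modifiers into the base trust, then clamp at the floor."""
    raw = base_trust * math.prod(modifiers)
    return max(TRUST_FLOOR, raw)


# Two small modifiers push the raw product below the floor,
# so the floor is returned instead.
print(trust_score(0.9, [0.5, 0.1]))
```

Without the floor, a few small modifiers would drive the product toward zero and effectively discard the review.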
---
*Last updated: 2026-01-24*

@@ -0,0 +1,331 @@
# Universal Review Taxonomy (URT) v5.1 Reference
## Overview
The Universal Review Taxonomy (URT) is a classification system for customer feedback. It provides a structured approach to categorizing, annotating, and analyzing review content across any industry.
### Key Characteristics
- **Three Profiles**: Core, Standard, Full (increasing detail)
- **Seven Domains**: Covering all aspects of customer experience
- **Tier-3 Canonical Codes**: Format `X#.##` (e.g., J1.02, P2.15)
- **Dimensional Annotation**: Valence, intensity, specificity, and more
- **Causal Analysis**: Root cause chains (Full profile)
---
## Domain Codes
URT organizes feedback into seven domains, each identified by a single letter.
| Domain | Letter | Description |
|--------|--------|-------------|
| Offering | O | Product/service quality |
| Price | P | Value, pricing, promotions |
| Journey | J | Customer experience, timing, process |
| Environment | E | Physical/digital space |
| Attitude | A | Staff behavior, service attitude |
| Voice | V | Brand, communication, marketing |
| Relationship | R | Loyalty, trust, long-term relationship |
### Tier-3 Code Format
```
Pattern: [OPJEAVR][1-4]\.[0-9]{2}
```
Examples:
- `J1.02` - Journey domain, category 1, subcategory 02
- `P2.15` - Price domain, category 2, subcategory 15
- `A3.01` - Attitude domain, category 3, subcategory 01
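
A minimal sketch of checking codes against this pattern with Python's `re` module. The helper name `is_tier3_code` is hypothetical and not part of the ReviewIQ codebase.

```python
import re

# Pattern copied from the Tier-3 code format above.
TIER3_PATTERN = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")


def is_tier3_code(code: str) -> bool:
    """Return True only if the whole string is a well-formed Tier-3 code."""
    return TIER3_PATTERN.fullmatch(code) is not None


print(is_tier3_code("J1.02"))  # well-formed code
print(is_tier3_code("X1.02"))  # X is not a domain letter
```

`fullmatch` is used rather than `match` so that trailing characters (for example `J1.023`) are rejected.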
---
## Dimension Codes
### Valence
Indicates the sentiment direction of the feedback.
| Code | Meaning |
|------|---------|
| V+ | Positive |
| V- | Negative |
| V0 | Neutral |
| V± | Mixed |
### Intensity
Indicates the strength of the expressed sentiment.
| Code | Meaning |
|------|---------|
| I1 | Low intensity |
| I2 | Moderate intensity |
| I3 | High intensity |
### Specificity (Standard+)
Indicates how detailed the feedback is.
| Code | Meaning |
|------|---------|
| S1 | Low - vague, general |
| S2 | Medium - some detail |
| S3 | High - specific, precise |
### Actionability (Standard+)
Indicates whether clear actions can be derived from the feedback.
| Code | Meaning |
|------|---------|
| A1 | None - no clear action |
| A2 | Unclear - possible actions |
| A3 | Clear - specific actionable |
### Temporal (Standard+)
Indicates the time frame referenced in the feedback.
| Code | Meaning | Markers |
|------|---------|---------|
| TC | Current - this visit | "today", "this time", "yesterday" |
| TR | Recent - last few visits | "lately", "recently", "again" |
| TH | Historical - long-standing | "for years", "always", "historically" |
| TF | Future - expectations | "I won't come back", "next time" |
**Default**: TC when no temporal language exists.
### Evidence (Standard+)
Indicates how the information was obtained from the text.
| Code | Meaning | Example |
|------|---------|---------|
| ES | Stated - explicit in text | "Waited 45 minutes" |
| EI | Inferred - logically entailed | "Took 3 weeks to reply" → slow response |
| EC | Contextual - depends on context | "That happened again" |
**Default**: ES. Use EI/EC only when needed.
### Comparative
Indicates whether the feedback compares to alternatives.
| Code | Meaning |
|------|---------|
| CR-N | No comparison |
| CR-B | Better than alternatives |
| CR-W | Worse than alternatives |
| CR-S | Same as alternatives |
---
## USN (URT String Notation)
USN is a compact string encoding for URT annotations.
### Grammar
```
Standard: URT:S:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}
Full: URT:F:{codes}:{V}{I}:{S}{A}{T}.{E}.{CR}:{causal}
```
### Encoding Rules
**Valence**:
- `+` for V+
- `-` for V-
**Intensity**:
- `1` for I1
- `2` for I2
- `3` for I3
### Examples
**Standard Profile**:
```
URT:S:J1.03:-2:22TC.ES.N
```
Decoded:
- Profile: Standard
- Code: J1.03
- Valence: V- (negative)
- Intensity: I2
- Specificity: S2
- Actionability: A2
- Temporal: TC
- Evidence: ES
- Comparative: CR-N
**Full Profile with Causal Chain**:
```
URT:F:J1.01+A1.04:-3:23TR.EI.S:CD.O,MG.O
```
Decoded:
- Profile: Full
- Codes: J1.01, A1.04
- Valence: V- (negative)
- Intensity: I3
- Specificity: S2
- Actionability: A3
- Temporal: TR
- Evidence: EI
- Comparative: CR-S
- Causal: CD.O (Conditions-Operational), MG.O (Management-Oversight)
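
The two decodings above can be reproduced with a small parser. This is a sketch under stated assumptions: it handles only the `+` and `-` valence symbols listed under Encoding Rules, it treats the causal segment as an opaque comma-separated list, and the names `UsnAnnotation` and `parse_usn` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

PROFILES = {"S": "Standard", "F": "Full"}
VALENCE = {"+": "V+", "-": "V-"}  # only the symbols documented above


@dataclass
class UsnAnnotation:
    profile: str
    codes: List[str]
    valence: str
    intensity: str
    specificity: str
    actionability: str
    temporal: str
    evidence: str
    comparative: str
    causal: Optional[List[str]] = None


def parse_usn(usn: str) -> UsnAnnotation:
    """Decode a Standard or Full USN string into its dimensions."""
    parts = usn.split(":")
    if len(parts) not in (5, 6) or parts[0] != "URT" or parts[1] not in PROFILES:
        raise ValueError(f"not a USN string: {usn!r}")
    symbol, digit = parts[3][:-1], parts[3][-1]
    if symbol not in VALENCE or digit not in "123":
        raise ValueError(f"unsupported valence/intensity: {parts[3]!r}")
    # The dimension segment looks like "22TC.ES.N": S digit, A digit, T code.
    sat, evidence, comparative = parts[4].split(".")
    return UsnAnnotation(
        profile=PROFILES[parts[1]],
        codes=parts[2].split("+"),
        valence=VALENCE[symbol],
        intensity="I" + digit,
        specificity="S" + sat[0],
        actionability="A" + sat[1],
        temporal=sat[2:],
        evidence=evidence,
        comparative="CR-" + comparative,
        causal=parts[5].split(",") if len(parts) == 6 else None,
    )


print(parse_usn("URT:S:J1.03:-2:22TC.ES.N"))
```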
---
## Causal Chain (Full Profile Only)
The causal chain identifies root causes across three layers, ordered from immediate to systemic.
### Layers
| Layer | Codes | Scope |
|-------|-------|-------|
| conditions | CD-S, CD-T, CD-E, CD-F, CD-O | Staff State, Team Dynamics, Equipment, Facility, Operational |
| management | MG-P, MG-T, MG-O, MG-R, MG-C | Planning, Training, Oversight, Resources, Communication |
| systemic | SY-R, SY-P, SY-C, SY-S, SY-H, SY-X | Resource Decisions, Policy, Culture, Standards, Human Capital, External |
### Code Reference
**Conditions Layer**:
- `CD-S` - Staff State
- `CD-T` - Team Dynamics
- `CD-E` - Equipment
- `CD-F` - Facility
- `CD-O` - Operational
**Management Layer**:
- `MG-P` - Planning
- `MG-T` - Training
- `MG-O` - Oversight
- `MG-R` - Resources
- `MG-C` - Communication
**Systemic Layer**:
- `SY-R` - Resource Decisions
- `SY-P` - Policy
- `SY-C` - Culture
- `SY-S` - Standards
- `SY-H` - Human Capital
- `SY-X` - External
### JSONB Schema
```json
[
{"layer": "conditions", "code": "CD-O", "evidence": "ES"},
{"layer": "management", "code": "MG-P", "evidence": "EI"}
]
```
### Constraints
- Maximum 3 entries (one per layer)
- Only include when text explicitly supports it
- Order: conditions → management → systemic
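
The structural constraints can be checked mechanically; whether the text explicitly supports a cause still needs human or model judgment. Below is a Python sketch of such a check. It is a stand-in for illustration, not the database function that validates the JSONB, and the name `causal_chain_errors` is hypothetical.

```python
from typing import Any, Dict, List

LAYER_ORDER = ["conditions", "management", "systemic"]
LAYER_CODES = {
    "conditions": {"CD-S", "CD-T", "CD-E", "CD-F", "CD-O"},
    "management": {"MG-P", "MG-T", "MG-O", "MG-R", "MG-C"},
    "systemic": {"SY-R", "SY-P", "SY-C", "SY-S", "SY-H", "SY-X"},
}


def causal_chain_errors(chain: List[Dict[str, Any]]) -> List[str]:
    """Return a list of constraint violations; an empty list means valid."""
    errors = []
    if len(chain) > 3:
        errors.append("more than 3 entries")
    seen: List[str] = []
    for entry in chain:
        layer, code = entry.get("layer"), entry.get("code")
        if layer not in LAYER_CODES:
            errors.append(f"unknown layer: {layer!r}")
            continue
        if code not in LAYER_CODES[layer]:
            errors.append(f"code {code!r} is not in layer {layer!r}")
        if layer in seen:
            errors.append(f"duplicate layer: {layer!r}")
        seen.append(layer)
    # Layers must appear in conditions -> management -> systemic order.
    ranks = [LAYER_ORDER.index(layer) for layer in seen]
    if ranks != sorted(ranks):
        errors.append("layers out of order")
    return errors


chain = [
    {"layer": "conditions", "code": "CD-O", "evidence": "ES"},
    {"layer": "management", "code": "MG-P", "evidence": "EI"},
]
print(causal_chain_errors(chain))
```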
---
## Span Boundary Detection Rules
Spans are detected at the clause/topic level, not sentence level.
### Split Rules (in priority order)
1. **Split on contrasting conjunctions**: but, however, although, despite, yet
2. **Split when subject/target changes** (topic shift)
3. **Split when valence changes** (positive ↔ negative)
4. **Split when domain changes** (O/P/J/E/A/V/R)
5. **Keep together** for cause→effect within same feedback unit
### Guidelines
- **Maximum**: ~3 spans per sentence
- **Validation**: If 4+ spans detected, re-check for over-splitting
### Example
**Input**:
> "The food was great but the service was slow and the bathroom was dirty."
**Output**: 3 spans
1. "The food was great" (Offering, positive)
2. "the service was slow" (Journey/Attitude, negative)
3. "the bathroom was dirty" (Environment, negative)
**Reasoning**: Topic shift + domain shift at each boundary.
---
## Primary Span Selection
When a review contains multiple spans, select the primary span using these criteria in order:
### Selection Priority
1. **Highest intensity** (I3 > I2 > I1)
2. **Tie-break**: Negative over positive (V- > V± > V0 > V+)
3. **Tie-break**: Earliest span_index
### Example
Given spans:
- Span 0: I2, V+
- Span 1: I3, V+
- Span 2: I3, V-
**Primary**: Span 2 (highest intensity I3, negative valence wins tie-break)
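
The selection priority maps directly onto a composite sort key. The sketch below assumes each span is a dict with `span_index`, `intensity`, and `valence` keys; that shape and the function name `select_primary` are illustrative, not the stored schema.

```python
from typing import Dict, List

# Lower rank wins: higher intensity first, then more negative valence.
INTENSITY_RANK = {"I3": 0, "I2": 1, "I1": 2}
VALENCE_RANK = {"V-": 0, "V±": 1, "V0": 2, "V+": 3}


def select_primary(spans: List[Dict]) -> Dict:
    """Pick the primary span by intensity, then valence, then span_index."""
    return min(
        spans,
        key=lambda s: (
            INTENSITY_RANK[s["intensity"]],
            VALENCE_RANK[s["valence"]],
            s["span_index"],
        ),
    )


spans = [
    {"span_index": 0, "intensity": "I2", "valence": "V+"},
    {"span_index": 1, "intensity": "I3", "valence": "V+"},
    {"span_index": 2, "intensity": "I3", "valence": "V-"},
]
print(select_primary(spans)["span_index"])  # prints 2, matching the example above
```

Because the key is a total order over the three fields, the result does not depend on the input order of the spans.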
---
## Secondary Codes Rules
Secondary codes capture additional topics mentioned in a span.
### Constraints
- **Maximum**: 2 secondary codes
- **Format**: Must be Tier-3 (X#.##)
- **Recommendation**: Should be different domain from primary
### Example
Primary: `J1.03` (Journey)
Secondary: `A2.01`, `E1.05` (Attitude, Environment)
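
A short Python sketch of these rules. The hard constraints (count and format) are returned as errors, while the same-domain recommendation is returned separately as a warning; both helper names are hypothetical.

```python
import re
from typing import List

TIER3_PATTERN = re.compile(r"[OPJEAVR][1-4]\.[0-9]{2}")


def secondary_code_errors(secondary: List[str]) -> List[str]:
    """Check the hard constraints: at most 2 codes, each in Tier-3 format."""
    errors = []
    if len(secondary) > 2:
        errors.append("more than 2 secondary codes")
    for code in secondary:
        if not TIER3_PATTERN.fullmatch(code):
            errors.append(f"not a Tier-3 code: {code!r}")
    return errors


def same_domain_warnings(primary: str, secondary: List[str]) -> List[str]:
    """Flag secondary codes that share the primary code's domain letter."""
    return [code for code in secondary if code[:1] == primary[:1]]


print(secondary_code_errors(["A2.01", "E1.05"]))
print(same_domain_warnings("J1.03", ["A2.01", "E1.05"]))
```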
---
## Quick Reference Card
### Profiles
| Profile | Dimensions | Causal Chain |
|---------|------------|--------------|
| Core | V, I | No |
| Standard | V, I, S, A, T, E, CR | No |
| Full | V, I, S, A, T, E, CR | Yes |
### USN Quick Format
```
URT:{S|F}:{tier3_codes}:{valence}{intensity}:{SAT}.{E}.{CR}[:{causal}]
```
### Domain Letters
```
O P J E A V R
│ │ │ │ │ │ └─ Relationship
│ │ │ │ │ └─── Voice
│ │ │ │ └───── Attitude
│ │ │ └─────── Environment
│ │ └───────── Journey
│ └─────────── Price
└───────────── Offering
```

0
api/__init__.py Normal file

0
api/routes/__init__.py Normal file

@@ -20,13 +20,13 @@ from fastapi.middleware.cors import CORSMiddleware
 from pydantic import BaseModel, HttpUrl, Field
 from fastapi.responses import JSONResponse, StreamingResponse
-from modules.database import DatabaseManager, JobStatus
-from modules.webhooks import WebhookDispatcher, WebhookManager
-from modules.health_checks import HealthCheckSystem
-from modules.scraper_clean import fast_scrape_reviews, LogCapture, get_business_card_info # Clean scraper
-from modules.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
-from modules.structured_logger import StructuredLogger, LogEntry
-from modules.chrome_pool import (
+from core.database import DatabaseManager, JobStatus
+from services.webhook_service import WebhookDispatcher, WebhookManager
+from utils.health_checks import HealthCheckSystem
+from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews, LogCapture, get_business_card_info # Clean scraper
+from utils.crash_analyzer import analyze_crash, summarize_crash_patterns, apply_auto_fix
+from utils.logger import StructuredLogger, LogEntry
+from workers.chrome_pool import (
     start_worker_pools,
     stop_worker_pools,
     get_validation_worker,

0
core/__init__.py Normal file

@@ -8,22 +8,13 @@ import json
 from datetime import datetime
 from typing import Optional, List, Dict, Any
 from uuid import UUID, uuid4
-from enum import Enum
 import logging
+from core.enums import JobStatus
 log = logging.getLogger(__name__)
-class JobStatus(str, Enum):
-    """Job status enumeration"""
-    PENDING = "pending"
-    RUNNING = "running"
-    COMPLETED = "completed"
-    FAILED = "failed"
-    CANCELLED = "cancelled"
-    PARTIAL = "partial" # Job crashed but has partial reviews saved
 class DatabaseManager:
     """PostgreSQL database manager with connection pooling"""

14
core/enums.py Normal file

@@ -0,0 +1,14 @@
"""
Enumerations for the ReviewIQ project.
"""
from enum import Enum


class JobStatus(str, Enum):
    """Job status enumeration"""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    PARTIAL = "partial" # Job crashed but has partial reviews saved
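
Because `JobStatus` mixes in `str`, its members compare equal to their raw string values and serialize as plain strings. The snippet below reproduces the enum so it runs on its own; in the repository the import would presumably be `from core.enums import JobStatus`.

```python
import json
from enum import Enum


class JobStatus(str, Enum):
    """Copy of the enum above, repeated so this example is self-contained."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    PARTIAL = "partial"


# A value read back from the database maps onto the matching member.
status = JobStatus("partial")
print(status is JobStatus.PARTIAL)

# Members compare equal to plain strings and serialize without a custom encoder.
print(JobStatus.RUNNING == "running")
print(json.dumps({"status": JobStatus.FAILED}))
```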

@@ -6,7 +6,7 @@ from dataclasses import dataclass, field
 from selenium.webdriver.remote.webelement import WebElement
-from modules.utils import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)
+from utils.helpers import (try_find, first_text, first_attr, safe_int, detect_lang, parse_date_to_iso)
 @dataclass

10
scrapers/__init__.py Normal file

@@ -0,0 +1,10 @@
"""
Scrapers Package
This package contains all scraper implementations for the ReviewIQ system.
"""
from scrapers.base import BaseScraper
from scrapers.registry import ScraperRegistry, registry
__all__ = ["BaseScraper", "ScraperRegistry", "registry"]

97
scrapers/base.py Normal file

@@ -0,0 +1,97 @@
"""
Base Scraper Interface

This module defines the abstract base class that all scrapers must implement.
It ensures consistent interface across different scraper implementations.
"""

from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Optional


class BaseScraper(ABC):
    """
    Abstract base class for all scrapers in the ReviewIQ system.

    All concrete scraper implementations must inherit from this class
    and implement the required abstract methods.
    """

    @abstractmethod
    def scrape(
        self,
        driver: Any,
        url: str,
        max_reviews: int = 5000,
        timeout_no_new: int = 15,
        flush_callback: Optional[Callable[[List[Dict]], None]] = None,
        flush_batch_size: int = 500,
        progress_callback: Optional[Callable[[int, Optional[int]], None]] = None,
        validation_only: bool = False
    ) -> Dict[str, Any]:
        """
        Scrape reviews from the given URL.

        Args:
            driver: WebDriver instance (e.g., Selenium WebDriver)
            url: The URL to scrape reviews from
            max_reviews: Maximum number of reviews to collect
            timeout_no_new: Seconds to wait with no new reviews before stopping
            flush_callback: Optional callback called with reviews batches for streaming
            flush_batch_size: Number of reviews before triggering flush_callback
            progress_callback: Optional callback(current_count, total_count) for progress
            validation_only: If True, return early after extracting metadata only

        Returns:
            Dictionary containing:
                - reviews: List of review dictionaries
                - total: Total number of reviews collected
                - error: Error message if any, None otherwise
                - Additional scraper-specific metadata
        """
        pass

    @abstractmethod
    def validate_url(self, url: str) -> bool:
        """
        Validate if the given URL is supported by this scraper.

        Args:
            url: The URL to validate

        Returns:
            True if the URL is valid for this scraper, False otherwise
        """
        pass

    @abstractmethod
    def get_business_info(self, driver: Any, url: str) -> Dict[str, Any]:
        """
        Extract business information from the URL without scraping reviews.

        Args:
            driver: WebDriver instance
            url: The URL to extract info from

        Returns:
            Dictionary containing business metadata (name, rating, address, etc.)
        """
        pass

    @property
    @abstractmethod
    def name(self) -> str:
        """Return the human-readable name of this scraper."""
        pass

    @property
    @abstractmethod
    def version(self) -> str:
        """Return the version string of this scraper."""
        pass

    @property
    @abstractmethod
    def supported_domains(self) -> List[str]:
        """Return list of domains this scraper supports."""
        pass

@@ -0,0 +1,21 @@
"""
Google Reviews Scraper Package

This package contains the Google Reviews scraper implementations.
"""

from scrapers.google_reviews.v1_0_0 import (
    scrape_reviews,
    fast_scrape_reviews,
    get_business_card_info,
    extract_about_info,
    LogCapture,
)

__all__ = [
    "scrape_reviews",
    "fast_scrape_reviews",
    "get_business_card_info",
    "extract_about_info",
    "LogCapture",
]

@@ -1,7 +1,12 @@
"""
-Clean Google Maps Reviews Scraper
+Google Reviews Scraper v1.0.0
+This module provides the core Google Maps reviews scraping functionality.
 - Simple down scrolling
 - DOM scraping + API interception
+Version: 1.0.0
+Migrated from: modules/scraper_clean.py
 """
 import re
@@ -12,7 +17,7 @@ from datetime import datetime
 from typing import List, Optional
 from selenium.webdriver.common.by import By
-from modules.structured_logger import StructuredLogger
+from utils.logger import StructuredLogger
 def get_chrome_memory(driver) -> Optional[int]:
     """Get Chrome memory usage in MB using CDP."""

138
scrapers/registry.py Normal file

@@ -0,0 +1,138 @@
"""
Scraper Registry

This module provides a registry for managing and discovering scrapers.
It allows dynamic registration and lookup of scraper implementations.
"""

from typing import Dict, List, Optional, Type

from scrapers.base import BaseScraper


class ScraperRegistry:
    """
    Registry for managing scraper implementations.

    The registry allows:
    - Registering scrapers by name and version
    - Looking up scrapers by domain or name
    - Listing all available scrapers

    Usage:
        registry = ScraperRegistry()
        registry.register(GoogleReviewsScraper)
        scraper = registry.get_scraper_for_url("https://google.com/maps/place/...")
    """

    _instance: Optional["ScraperRegistry"] = None
    _scrapers: Dict[str, Type[BaseScraper]]

    def __new__(cls) -> "ScraperRegistry":
        """Singleton pattern to ensure one global registry."""
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._scrapers = {}
            cls._instance._domain_map = {}
        return cls._instance

    def register(self, scraper_class: Type[BaseScraper], name: Optional[str] = None) -> None:
        """
        Register a scraper class with the registry.

        Args:
            scraper_class: The scraper class to register (must inherit from BaseScraper)
            name: Optional name override, defaults to scraper_class.name property
        """
        # Create a temporary instance to get properties
        # Note: In production, we might want scraper_class to have class-level properties
        instance = scraper_class.__new__(scraper_class)
        scraper_name = name or instance.name
        scraper_version = instance.version
        key = f"{scraper_name}:{scraper_version}"
        self._scrapers[key] = scraper_class

        # Map domains to this scraper
        for domain in instance.supported_domains:
            if domain not in self._domain_map:
                self._domain_map[domain] = []
            self._domain_map[domain].append(key)

    def get_scraper(self, name: str, version: Optional[str] = None) -> Optional[Type[BaseScraper]]:
        """
        Get a scraper class by name and optional version.

        Args:
            name: The scraper name
            version: Optional version string. If not provided, returns the latest.

        Returns:
            The scraper class, or None if not found
        """
        if version:
            key = f"{name}:{version}"
            return self._scrapers.get(key)

        # Find latest version for this name
        matching = [k for k in self._scrapers.keys() if k.startswith(f"{name}:")]
        if not matching:
            return None

        # Sort by version and return latest
        matching.sort(reverse=True)
        return self._scrapers.get(matching[0])

    def get_scraper_for_url(self, url: str) -> Optional[Type[BaseScraper]]:
        """
        Find a suitable scraper for the given URL.

        Args:
            url: The URL to find a scraper for

        Returns:
            The scraper class that can handle this URL, or None if no match
        """
        from urllib.parse import urlparse

        parsed = urlparse(url)
        domain = parsed.netloc.lower()

        # Remove www. prefix for matching
        if domain.startswith("www."):
            domain = domain[4:]

        scraper_keys = self._domain_map.get(domain, [])
        if not scraper_keys:
            return None

        # Return the latest version
        scraper_keys.sort(reverse=True)
        return self._scrapers.get(scraper_keys[0])

    def list_scrapers(self) -> List[Dict[str, str]]:
        """
        List all registered scrapers.

        Returns:
            List of dictionaries with scraper info (name, version, domains)
        """
        result = []
        for key, scraper_class in self._scrapers.items():
            instance = scraper_class.__new__(scraper_class)
            result.append({
                "name": instance.name,
                "version": instance.version,
                "domains": instance.supported_domains
            })
        return result

    def clear(self) -> None:
        """Clear all registered scrapers. Useful for testing."""
        self._scrapers.clear()
        self._domain_map.clear()


# Global registry instance
registry = ScraperRegistry()
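
One caveat about "latest version" lookups: the registry sorts `name:version` keys as strings, so `1.9.0` ranks above `1.10.0`. With only `1.0.0` registered this has no effect yet. The sketch below shows one possible version-aware ordering, assuming versions are purely numeric dotted strings; `version_key`, `latest_key`, and the scraper name `google_reviews` are hypothetical and not part of this commit.

```python
from typing import List, Optional, Tuple


def version_key(key: str) -> Tuple[int, ...]:
    """Turn 'name:1.10.0' into (1, 10, 0) so versions compare numerically."""
    _, _, version = key.rpartition(":")
    return tuple(int(part) for part in version.split("."))


def latest_key(keys: List[str]) -> Optional[str]:
    """Return the key with the highest numeric version, or None if empty."""
    return max(keys, key=version_key) if keys else None


keys = ["google_reviews:1.9.0", "google_reviews:1.10.0"]
print(sorted(keys, reverse=True)[0])  # string sort picks 1.9.0
print(latest_key(keys))               # numeric sort picks 1.10.0
```

Pre-release suffixes such as `1.0.0-beta` would need a fuller version parser than this sketch provides.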

0
services/__init__.py Normal file

0
tests/api/__init__.py Normal file

0
utils/__init__.py Normal file

@@ -67,7 +67,7 @@ class CanaryMonitor:
 # Alert if multiple consecutive failures
 if self.consecutive_failures >= 3:
 await self.send_alert(
-f"🚨 CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
+f"CRITICAL: Scraper canary failed {self.consecutive_failures} times in a row! "
 f"Last error: {str(e)[:200]}"
 )
@@ -90,7 +90,7 @@ class CanaryMonitor:
 - Scrape time is reasonable
 - Data structure is valid
 """
-from modules.scraper_clean import fast_scrape_reviews
+from scrapers.google_reviews.v1_0_0 import fast_scrape_reviews
 log.info(f"Running canary scrape test on {self.test_url[:60]}...")
 self.last_run = datetime.now()
@@ -121,7 +121,7 @@ class CanaryMonitor:
 if all_passed:
 # Success!
 log.info(
-f"Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
+f"Canary test PASSED: {result['count']} reviews in {result['time']:.1f}s"
 )
 self.consecutive_failures = 0
 self.last_success = datetime.now()
@@ -144,7 +144,7 @@ class CanaryMonitor:
 # Validation failed
 failed_checks = [k for k, v in checks.items() if not v]
 log.error(
-f"Canary test FAILED: validation failed on {failed_checks}"
+f"Canary test FAILED: validation failed on {failed_checks}"
 )
 self.consecutive_failures += 1
 self.last_result = {
@@ -167,12 +167,12 @@ class CanaryMonitor:
 # Alert on failure
 if self.consecutive_failures >= 3:
 await self.send_alert(
-f"🚨 CRITICAL: Canary validation failed {self.consecutive_failures} times! "
+f"CRITICAL: Canary validation failed {self.consecutive_failures} times! "
 f"Failed checks: {failed_checks}"
 )
 except asyncio.TimeoutError:
-log.error("Canary test TIMEOUT (>60s)")
+log.error("Canary test TIMEOUT (>60s)")
 self.consecutive_failures += 1
 self.last_result = {
 "status": "timeout",
@@ -186,11 +186,11 @@ class CanaryMonitor:
 if self.consecutive_failures >= 3:
 await self.send_alert(
-f"🚨 CRITICAL: Canary timeout {self.consecutive_failures} times!"
+f"CRITICAL: Canary timeout {self.consecutive_failures} times!"
 )
 except Exception as e:
-log.error(f"Canary test ERROR: {e}")
+log.error(f"Canary test ERROR: {e}")
 self.consecutive_failures += 1
 self.last_result = {
 "status": "error",

0
workers/__init__.py Normal file