# ReviewIQ Scraping Platform - Specification

> **Purpose**: Define WHAT the platform should do, not HOW. This document serves as the source of truth during implementation.

---

## 1. Vision

Transform the current Google Reviews scraper into a **multi-tenant scraping-as-a-service platform** that:

- Serves external clients via API (initially veritasreview.com)
- Supports multiple scraping job types (reviews, business info, etc.)
- Provides full observability into system performance and problems
- Enables safe scraper iteration through versioning and A/B testing

---

## 2. Core Concepts

### 2.1 Job Types

The platform executes different types of scraping jobs:

- `google_reviews` (current, primary)
- Future: `yelp_reviews`, `tripadvisor_reviews`, `google_business_info`, etc.

Each job type has its own:

- Input parameters
- Output schema
- Scraper implementation(s)

### 2.2 Requesters

External systems that request scraping jobs:

- Identified by `client_id` (e.g., "veritas_client_123")
- Originate from a `source` (e.g., "veritasreview.com")
- Have a `purpose` for scraping:
  - `client_report` - generating reports for their clients
  - `prospect_screening` - evaluating potential clients
  - `market_research` - competitive/market analysis

### 2.3 Batches

Jobs can be grouped into batches:

- A batch is a collection of related jobs (e.g., "Q1 Prospect List")
- Batches have their own completion callback
- Dashboard shows batch progress and aggregate stats

### 2.4 Scraper Versions

Each job type can have multiple scraper versions:

- **Variants**: `stable`, `beta`, `canary`
- **Traffic routing**: A/B testing via percentage allocation
- **Version pinning**: Clients can request specific versions
- **Safe rollouts**: Promote canary → beta → stable

### 2.5 Priority Levels

Jobs have priority that affects execution order:

- `0` = normal
- `1` = high
- `2` = urgent

---

## 3. Features

### 3.1 API - Job Submission

**Single job submission:**

- Submit a scraping job for a specific job type
- Include requester identification
- Optionally specify priority, callback URL, scraper variant
- Returns job ID immediately

**Batch submission:**

- Submit multiple URLs as a single batch
- Batch has a name and optional batch-level callback
- Individual jobs track their position in batch
- Batch callback fires when all jobs complete

### 3.2 API - Job Management

- Get job status and results
- Cancel pending/running jobs
- Retry failed jobs
- List jobs with filtering (by client, status, date, batch, job type)

### 3.3 API - Webhooks

When a job completes (success or failure):

- POST to the provided `callback_url`
- Include job ID, status, summary results, error info if failed
- Track callback delivery status (pending, sent, failed)
- Retry failed callbacks

When a batch completes:

- POST to batch-level callback
- Include batch summary (total, succeeded, failed)

### 3.4 Main Dashboard

**System Overview:**

- Total jobs (24h / 7d / 30d)
- Success rate trend
- Currently running jobs
- Recent failures / problems requiring attention

**By Client/Source:**

- Jobs per client
- Top consumers (volume)
- Error rates by client
- Purpose breakdown per client

**By Job Type:**

- Volume per job type
- Success rate per type
- Average duration per type

**By Scraper Version:**

- Performance comparison across versions
- Success rate by version
- Duration by version
- Ability to identify when beta outperforms stable

**Problems & Alerts:**

- Recent failures with error types
- Slow jobs (exceeding expected duration)
- Callback delivery failures
- Clients with elevated error rates

### 3.5 Job Detail View (existing, enhanced)

Current functionality preserved, plus:

- Show requester info (client, source, purpose)
- Show batch membership if applicable
- Show scraper version that executed
- Link to related jobs (same batch, same client)

### 3.6 Analytics View

Per-job analytics (existing) remains for Google Reviews:

- Rating distribution
- Sentiment analysis
- Review topics
- Timeline

Future: type-specific analytics for other job types.

---

## 4. Data Model

### 4.1 Jobs (enhanced)

**Existing fields preserved.**

**New requester fields:**

- `requester_client_id` - which client requested this
- `requester_source` - origin system (veritasreview.com)
- `scrape_purpose` - why (client_report, prospect_screening, market_research)
- `requester_metadata` - flexible JSON for additional context

**New batch fields:**

- `batch_id` - links to batch if part of one
- `batch_index` - position in batch (1, 2, 3...)

**New execution fields:**

- `job_type` - type of scraping job (google_reviews, etc.)
- `scraper_version` - exact version that executed (1.2.0)
- `scraper_variant` - variant used (stable, beta, canary)
- `priority` - execution priority (0, 1, 2)

**New callback fields:**

- `callback_url` - where to POST on completion
- `callback_status` - pending, sent, failed
- `callback_sent_at` - when callback was delivered
- `callback_attempts` - retry count

### 4.2 Batches (new)

- `id` - unique identifier
- `name` - human readable name
- `requester_client_id` - client who submitted
- `requester_source` - origin system
- `scrape_purpose` - purpose for all jobs in batch
- `total_jobs` - count of jobs in batch
- `completed_jobs` - count finished (success or fail)
- `failed_jobs` - count failed
- `status` - pending, running, completed
- `callback_url` - batch completion webhook
- `callback_status` - pending, sent, failed
- `created_at` - when batch was created
- `completed_at` - when last job finished
- `metadata` - flexible JSON

### 4.3 Scraper Registry (new)

- `id` - unique identifier
- `job_type` - which job type this scraper handles
- `version` - semantic version (1.2.0, 2.0.0-beta)
- `variant` - stable, beta, canary
- `module_path` - Python module path
- `function_name` - entry function
- `is_default` - use if no version specified
- `traffic_pct` - percentage of traffic for A/B testing
- `min_priority` - only use for jobs at or above this priority
- `created_at` - when registered
- `deprecated_at` - when marked deprecated (null if active)
- `config` - version-specific configuration JSON

### 4.4 Generic Result Summary

Jobs have a `result_summary` JSON field for cross-type dashboard:

```json
{
  "item_count": 150,
  "primary_metric": 4.2,
  "primary_metric_label": "rating",
  "secondary_metrics": {
    "reviews_with_text": 120,
    "avg_review_length": 45
  }
}
```

This enables the dashboard to show unified metrics across job types.

---

## 5. API Endpoints

### 5.1 Scraping Endpoints

```
POST /api/scrape/google-reviews
POST /api/scrape/yelp-reviews          (future)
POST /api/scrape/tripadvisor-reviews   (future)
```

Each accepts type-specific parameters plus common fields:

- `requester` object (client_id, source, purpose, metadata)
- `priority` (0, 1, 2)
- `callback_url`
- `scraper_version` or `scraper_variant` (optional)

### 5.2 Batch Endpoint

```
POST /api/scrape/google-reviews/batch
```

Accepts:

- `name` - batch name
- `urls` - array of URLs
- `requester` object
- `priority`
- `callback_url` - called when entire batch completes

### 5.3 Management Endpoints

```
GET    /api/jobs                - list with filters
GET    /api/jobs/{id}           - job detail
DELETE /api/jobs/{id}           - cancel job
POST   /api/jobs/{id}/retry     - retry failed job

GET    /api/batches             - list batches
GET    /api/batches/{id}        - batch detail with job list
DELETE /api/batches/{id}        - cancel all pending jobs in batch
```

### 5.4 Dashboard Endpoints

```
GET /api/dashboard/overview     - system stats
GET /api/dashboard/by-client    - breakdown by client
GET /api/dashboard/by-job-type  - breakdown by job type
GET /api/dashboard/by-version   - scraper version comparison
GET /api/dashboard/problems     - recent failures, alerts
```

### 5.5 Admin Endpoints

```
GET  /api/admin/scrapers                  - list registered scrapers
POST /api/admin/scrapers                  - register new scraper version
PUT  /api/admin/scrapers/{id}/traffic     - update traffic percentage
POST /api/admin/scrapers/{id}/deprecate   - mark deprecated
POST /api/admin/scrapers/{id}/promote     - promote to stable
```

---

## 6. Output Schemas

Each job type has a defined output schema. External services (like veritasreview.com) consume this data to generate insights.

### 6.1 Google Reviews Output

**Business Summary:**

```json
{
  "business": {
    "name": "Acme Restaurant",
    "place_id": "ChIJ...",
    "address": "123 Main St, City, State",
    "category": "Restaurant",
    "total_reviews": 1250,
    "rating": 4.3,
    "rating_distribution": {
      "5": 720,
      "4": 280,
      "3": 120,
      "2": 80,
      "1": 50
    },
    "scraped_at": "2025-01-24T10:30:00Z"
  }
}
```

**Review Object:**

```json
{
  "review_id": "abc123",
  "author": {
    "name": "John D.",
    "profile_url": "https://...",
    "is_local_guide": true,
    "review_count": 42,
    "photo_count": 15
  },
  "rating": 4,
  "text": "Great food and service...",
  "language": "en",
  "published_at": "2025-01-15T14:30:00Z",
  "photos": [
    {
      "url": "https://...",
      "caption": null
    }
  ],
  "owner_response": {
    "text": "Thank you for your feedback...",
    "responded_at": "2025-01-16T09:00:00Z"
  },
  "metadata": {
    "source": "dom",
    "extracted_at": "2025-01-24T10:35:00Z"
  }
}
```

**Key fields for insights service:**

- `rating` + `text` → Sentiment analysis, rating correlation
- `published_at` → Trend analysis, seasonality
- `language` → Multi-language support
- `owner_response` → Engagement metrics, response rate
- `author.is_local_guide` → Review credibility weighting
- `rating_distribution` → Rating spread analysis

### 6.2 Future Job Types

Other scrapers (Yelp, TripAdvisor, etc.) will have their own schemas but follow similar patterns:

- Business summary with ratings
- Individual review objects
- Author metadata
- Timestamps for trend analysis

---

## 7. Webhook Payloads

### 7.1 Job Completion

```json
{
  "event": "job.completed",
  "job_id": "uuid",
  "job_type": "google_reviews",
  "status": "completed",
  "url": "https://google.com/maps/...",
  "result_summary": {
    "item_count": 150,
    "primary_metric": 4.2
  },
  "scraper_version": "1.2.0",
  "duration_seconds": 45.2,
  "completed_at": "2024-01-15T10:30:00Z"
}
```

### 7.2 Job Failed

```json
{
  "event": "job.failed",
  "job_id": "uuid",
  "job_type": "google_reviews",
  "status": "failed",
  "url": "https://google.com/maps/...",
  "error": {
    "type": "rate_limited",
    "message": "Google rate limit detected"
  },
  "scraper_version": "1.2.0",
  "duration_seconds": 12.5,
  "failed_at": "2024-01-15T10:30:00Z"
}
```

### 7.3 Batch Completion

```json
{
  "event": "batch.completed",
  "batch_id": "uuid",
  "name": "Q1 Prospects",
  "total_jobs": 50,
  "succeeded": 47,
  "failed": 3,
  "completed_at": "2024-01-15T10:30:00Z",
  "failed_job_ids": ["uuid1", "uuid2", "uuid3"]
}
```

---

## 8. UI Pages

### 8.1 Main Dashboard (`/dashboard`)

- System health at a glance
- Key metrics with trends
- Problem alerts
- Quick links to drill down

### 8.2 Clients View (`/dashboard/clients`)

- Table of clients with job counts, success rates
- Click to see client's jobs

### 8.3 Scrapers View (`/dashboard/scrapers`)

- Registered scraper versions
- Performance comparison
- Traffic allocation controls
- Promote/deprecate actions

### 8.4 Jobs View (`/jobs`) - enhanced

- Add filters: client, job type, batch, scraper version
- Show requester info in job cards

### 8.5 Batches View (`/batches`)

- List of batches with progress
- Click to see batch detail and jobs

---

## 9. Project Structure

### 9.1 Backend Structure

```
reviewiq/                          # Root (renamed from google-reviews-scraper-pro)
│
├── api/
│   ├── __init__.py
│   ├── server.py                  # FastAPI app, startup, middleware
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── scrape.py              # /api/scrape/* endpoints
│   │   ├── jobs.py                # /api/jobs/* endpoints
│   │   ├── batches.py             # /api/batches/* endpoints
│   │   ├── dashboard.py           # /api/dashboard/* endpoints
│   │   └── admin.py               # /api/admin/* endpoints
│   └── middleware/
│       ├── __init__.py
│       └── auth.py                # API key authentication
│
├── scrapers/
│   ├── __init__.py
│   ├── registry.py                # ScraperRegistry - version routing
│   ├── base.py                    # BaseScraper interface
│   │
│   ├── google_reviews/
│   │   ├── __init__.py
│   │   ├── v1_0_0.py              # Current stable (migrated from scraper_clean.py)
│   │   └── parsers.py             # Review parsing logic
│   │
│   └── yelp_reviews/              # Future
│       ├── __init__.py
│       └── v1_0_0.py
│
├── core/
│   ├── __init__.py
│   ├── database.py                # Database manager
│   ├── models.py                  # Pydantic models (Job, Batch, etc.)
│   ├── enums.py                   # JobStatus, JobType, Priority, etc.
│   └── config.py                  # Settings, environment variables
│
├── services/
│   ├── __init__.py
│   ├── job_service.py             # Job creation, management
│   ├── batch_service.py           # Batch operations
│   ├── webhook_service.py         # Callback delivery
│   └── dashboard_service.py       # Aggregate queries
│
├── workers/
│   ├── __init__.py
│   ├── chrome_pool.py             # Browser pool management
│   ├── job_executor.py            # Job execution orchestration
│   └── webhook_worker.py          # Async webhook delivery
│
├── utils/
│   ├── __init__.py
│   ├── logger.py                  # StructuredLogger
│   ├── crash_analyzer.py          # Crash detection
│   └── health_checks.py           # System health
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py                # Pytest fixtures
│   ├── api/                       # API route tests
│   ├── scrapers/                  # Scraper tests (mirrors scrapers/)
│   │   └── google_reviews/
│   │       └── test_v1_0_0.py
│   ├── services/                  # Service tests
│   └── integration/               # End-to-end tests
│
├── migrations/                    # Database migrations
│   └── versions/
│
├── web/                           # Next.js frontend (existing)
│   └── ...
│
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml                 # Python dependencies
└── README.md
```

### 9.2 Key Conventions

**Naming:**

- Scraper versions use underscores: `v1_0_0.py` (valid Python module names)
- Version strings use dots: `"1.0.0"` (semantic versioning in data)

**Imports:**

```python
from scrapers.google_reviews.v1_0_0 import GoogleReviewsScraper
from scrapers.registry import ScraperRegistry
from core.models import Job, Batch
from services.job_service import JobService
```

**Scraper Interface:**

Each scraper version implements:

```python
class GoogleReviewsScraper(BaseScraper):
    VERSION = "1.0.0"
    JOB_TYPE = "google_reviews"

    async def scrape(self, url: str, options: dict) -> ScraperResult:
        ...

    def validate_url(self, url: str) -> bool:
        ...
```

### 9.3 Frontend Structure (existing, minor additions)

```
web/
├── app/
│   ├── dashboard/                 # New main dashboard
│   │   ├── page.tsx               # Overview
│   │   ├── clients/page.tsx
│   │   ├── scrapers/page.tsx
│   │   └── problems/page.tsx
│   ├── batches/                   # New
│   │   ├── page.tsx
│   │   └── [id]/page.tsx
│   ├── jobs/                      # Enhanced
│   └── analytics/                 # Existing
├── components/
│   ├── dashboard/                 # Dashboard-specific components
│   └── ...
└── ...
```

---

## 10. Backwards Compatibility

### 10.1 Existing API

`POST /api/scrape` continues to work as-is:

- Defaults to `job_type: google_reviews`
- No requester required (legacy mode)
- No callback required
- Routes to the same scraper logic

### 10.2 Existing Database

- All new fields have defaults
- Existing jobs have null requester fields
- `job_type` defaults to `google_reviews`
- Migration adds columns without breaking existing data

### 10.3 Scraper Migration

- Current scraper code moves to `scrapers/google_reviews/v1_0_0.py`
- Registered in scraper_registry as `stable` with 100% traffic
- Old file `scraper_clean.py` deleted after migration
- All imports updated to new paths

---

## 11. Additional Considerations

### 11.1 Authentication

- External API clients authenticate via API keys
- API keys stored in `api_keys` table with `client_id` reference
- Keys can be scoped (read-only, submit jobs, admin)
- Rate limits can be per-key

### 11.2 Error Handling

- All API errors return consistent JSON structure:

  ```json
  {
    "error": {
      "code": "VALIDATION_ERROR",
      "message": "URL is required",
      "details": { ... }
    }
  }
  ```

- Scraper errors captured with crash analysis
- Failed webhooks retry with exponential backoff (max 5 attempts)

### 11.3 Logging

- All components use StructuredLogger
- Log levels: DEBUG, INFO, WARN, ERROR, FATAL
- Categories: api, scraper, webhook, system
- Logs include correlation IDs for tracing

### 11.4 Configuration

- Environment-based configuration via `core/config.py`
- Sensitive values from environment variables
- Per-scraper config in scraper_registry.config JSON

### 11.5 Monitoring

- Health check endpoint: `GET /health`
- Prometheus metrics endpoint: `GET /metrics` (future)
- Dashboard provides operational visibility

### 11.6 Data Retention

- Define retention policy for completed jobs
- Archive or delete old job data after N days
- Keep aggregate stats for historical reporting

---

## 12. Implementation Phases

### Phase 0: Project Restructure

- Reorganize files to new structure
- Move `scraper_clean.py` → `scrapers/google_reviews/v1_0_0.py`
- Update all imports
- Verify everything still works

### Phase 1: Data Model

- Add new fields to jobs table
- Create batches table
- Create scraper_registry table
- Create api_keys table
- Migration preserves existing data

### Phase 2: Requester & Batch Support

- Update API to accept requester info
- Implement batch submission endpoint
- Store and display requester/batch info

### Phase 3: Webhooks

- Implement callback delivery service
- Retry logic for failed callbacks
- Track delivery status

### Phase 4: Scraper Versioning

- Implement scraper registry
- Version routing logic
- Admin endpoints for management

### Phase 5: Main Dashboard

- Build dashboard pages
- Aggregate queries
- Real-time updates

### Phase 6: Traffic Management & A/B

- A/B test traffic splitting
- Promote/deprecate workflow
- Performance comparison views

### Phase 7: Authentication

- API key management
- Client authentication middleware
- Rate limiting (optional)

---

## 13. Success Metrics

- API response time < 200ms for job submission
- Webhook delivery within 5 seconds of job completion
- Dashboard loads in < 2 seconds
- Support 100+ concurrent scraping jobs
- 99% webhook delivery success rate
- Clear visibility into scraper version performance

---

## 14. Open Questions

1. ~~**Authentication**: How do external clients authenticate? API keys per client?~~ → Resolved: API keys
2. **Rate Limits**: Per-client rate limiting? (deferred to Phase 7)
3. **Retention**: How long to keep completed job data? (needs decision)
4. **Billing**: Track usage for billing purposes? (future consideration)
5. **Project Rename**: Rename folder from `google-reviews-scraper-pro` to `reviewiq`?

---

## 15. Glossary

| Term | Definition |
|------|------------|
| Job | A single scraping task for one URL |
| Batch | A collection of related jobs submitted together |
| Job Type | Category of scraping (google_reviews, yelp_reviews, etc.) |
| Requester | External client/system that requests jobs |
| Scraper Version | Specific implementation of a scraper (v1.0.0, v2.0.0) |
| Variant | Stability tier: stable, beta, canary |
| Callback/Webhook | HTTP POST to notify client of job completion |

---

*Document Version: 1.2*
*Last Updated: 2025-01-24*
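
The traffic routing and version pinning described under Scraper Versions and the Scraper Registry could be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `ScraperEntry` and `pick_scraper` are hypothetical names, the registry rows are assumed to be already loaded into memory for a single job type, `traffic_pct` is treated as a relative weight, and `min_priority` filtering is left out for brevity.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ScraperEntry:
    """Hypothetical in-memory view of one scraper registry row."""
    version: str            # semantic version string, e.g. "1.2.0"
    variant: str            # "stable", "beta", or "canary"
    traffic_pct: float      # share of traffic for A/B testing (0-100)
    is_default: bool = False
    deprecated: bool = False


def pick_scraper(
    entries: Sequence[ScraperEntry],
    pinned_version: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> ScraperEntry:
    """Choose a scraper version for one job (assumed routing rules).

    1. Deprecated entries are never routed to.
    2. A pinned version wins if it is still active.
    3. Otherwise pick at random, weighted by traffic_pct.
    4. If no entry carries traffic, fall back to the default entry.
    """
    rng = rng or random.Random()
    active = [e for e in entries if not e.deprecated]

    if pinned_version is not None:
        for entry in active:
            if entry.version == pinned_version:
                return entry
        raise LookupError(f"no active scraper with version {pinned_version!r}")

    weighted = [e for e in active if e.traffic_pct > 0]
    if weighted:
        weights = [e.traffic_pct for e in weighted]
        return rng.choices(weighted, weights=weights, k=1)[0]

    for entry in active:
        if entry.is_default:
            return entry
    raise LookupError("no routable scraper registered")


if __name__ == "__main__":
    registry = [
        ScraperEntry("1.0.0", "stable", 90, is_default=True),
        ScraperEntry("1.1.0", "canary", 10),
    ]
    chosen = pick_scraper(registry)
    print(chosen.version, chosen.variant)
```

Passing a seeded `random.Random` makes the split reproducible in tests, and raising on a pinned but deprecated version (rather than silently falling back) is one possible policy; the specification does not say which behaviour is intended.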