
Alejandro Gutiérrez 12d37e350b Fix JobDevTools contrast + log normalization, add Platform Spec

- Fix contrast issues in JobDevTools (level badges, text colors, timestamps)
- Make log normalization more robust (handles old/new formats, edge cases)
- Add ReviewIQ Platform Spec v1.2 defining:
  - Multi-tenant scraping-as-a-service architecture
  - Requester metadata, batches, webhooks, priority
  - Scraper versioning with A/B testing (stable/beta/canary)
  - API endpoints for job types, dashboard, admin
  - Output schemas for external service integration
  - Project structure reorganization plan

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 15:13:19 +00:00


ReviewIQ Scraping Platform - Specification

Purpose: Define WHAT the platform should do, not HOW. This document serves as the source of truth during implementation.

1. Vision

Transform the current Google Reviews scraper into a multi-tenant scraping-as-a-service platform that:

Serves external clients via API (initially veritasreview.com)
Supports multiple scraping job types (reviews, business info, etc.)
Provides full observability into system performance and problems
Enables safe scraper iteration through versioning and A/B testing

2. Core Concepts

2.1 Job Types

The platform executes different types of scraping jobs:

google_reviews (current, primary)
Future: yelp_reviews, tripadvisor_reviews, google_business_info, etc.

Each job type has its own:

Input parameters
Output schema
Scraper implementation(s)

2.2 Requesters

External systems that request scraping jobs:

Identified by client_id (e.g., "veritas_client_123")
Originate from a source (e.g., "veritasreview.com")
Have a purpose for scraping:
- client_report - generating reports for their clients
- prospect_screening - evaluating potential clients
- market_research - competitive/market analysis

2.3 Batches

Jobs can be grouped into batches:

A batch is a collection of related jobs (e.g., "Q1 Prospect List")
Batches have their own completion callback
Dashboard shows batch progress and aggregate stats

2.4 Scraper Versions

Each job type can have multiple scraper versions:

Variants: stable, beta, canary
Traffic routing: A/B testing via percentage allocation
Version pinning: Clients can request specific versions
Safe rollouts: Promote canary → beta → stable
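The percentage-based routing described above can be sketched as a weighted random choice. This is a minimal illustration, not the platform's routing code: the function name `pick_variant` and the 90/8/2 allocation are assumptions made for the example.

```python
import random


def pick_variant(allocations: dict[str, float], rng: random.Random) -> str:
    """Pick a variant with probability proportional to its traffic percentage."""
    if not allocations:
        raise ValueError("no variants registered")
    variants = list(allocations)
    weights = [allocations[name] for name in variants]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("traffic percentages must be non-negative with a positive total")
    return rng.choices(variants, weights=weights, k=1)[0]


# Hypothetical allocation: 90% stable, 8% beta, 2% canary.
ALLOCATION = {"stable": 90.0, "beta": 8.0, "canary": 2.0}

rng = random.Random(42)  # seeded only so the demo is repeatable
counts = {name: 0 for name in ALLOCATION}
for _ in range(10_000):
    counts[pick_variant(ALLOCATION, rng)] += 1
print(counts)
```

A client that pins a version would bypass this choice entirely; the random pick only applies when no version or variant is requested.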

2.5 Priority Levels

Jobs have priority that affects execution order:

0 = normal
1 = high
2 = urgent
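One way to turn these levels into an execution order is a heap keyed on priority and arrival order. The sketch below assumes that jobs of equal priority run first-in, first-out, which the spec does not state; `JobQueue` is a hypothetical name.

```python
import heapq
import itertools


class JobQueue:
    """Orders jobs by priority (2 before 1 before 0), then by arrival order."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()

    def push(self, job_id: str, priority: int = 0) -> None:
        if priority not in (0, 1, 2):
            raise ValueError("priority must be 0 (normal), 1 (high), or 2 (urgent)")
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = JobQueue()
queue.push("job-a", priority=0)
queue.push("job-b", priority=2)
queue.push("job-c", priority=1)
queue.push("job-d", priority=2)
order = [queue.pop() for _ in range(len(queue))]
print(order)  # ['job-b', 'job-d', 'job-c', 'job-a']
```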

3. Features

3.1 API - Job Submission

Single job submission:

Submit a scraping job for a specific job type
Include requester identification
Optionally specify priority, callback URL, scraper variant
Returns job ID immediately

Batch submission:

Submit multiple URLs as a single batch
Batch has a name and optional batch-level callback
Individual jobs track their position in batch
Batch callback fires when all jobs complete

3.2 API - Job Management

Get job status and results
Cancel pending/running jobs
Retry failed jobs
List jobs with filtering (by client, status, date, batch, job type)

3.3 API - Webhooks

When a job completes (success or failure):

POST to the provided callback_url
Include job ID, status, summary results, error info if failed
Track callback delivery status (pending, sent, failed)
Retry failed callbacks

When a batch completes:

POST to batch-level callback
Include batch summary (total, succeeded, failed)

3.4 Main Dashboard

System Overview:

Total jobs (24h / 7d / 30d)
Success rate trend
Currently running jobs
Recent failures / problems requiring attention

By Client/Source:

Jobs per client
Top consumers (volume)
Error rates by client
Purpose breakdown per client

By Job Type:

Volume per job type
Success rate per type
Average duration per type

By Scraper Version:

Performance comparison across versions
Success rate by version
Duration by version
Ability to identify when beta outperforms stable

Problems & Alerts:

Recent failures with error types
Slow jobs (exceeding expected duration)
Callback delivery failures
Clients with elevated error rates

3.5 Job Detail View (existing, enhanced)

Current functionality preserved, plus:

Show requester info (client, source, purpose)
Show batch membership if applicable
Show scraper version that executed
Link to related jobs (same batch, same client)

3.6 Analytics View

Per-job analytics (existing) remain available for Google Reviews:

Rating distribution
Sentiment analysis
Review topics
Timeline

Future: type-specific analytics for other job types.

4. Data Model

4.1 Jobs (enhanced)

Existing fields preserved.

New requester fields:

requester_client_id - which client requested this
requester_source - origin system (veritasreview.com)
scrape_purpose - why (client_report, prospect_screening, market_research)
requester_metadata - flexible JSON for additional context

New batch fields:

batch_id - links to batch if part of one
batch_index - position in batch (1, 2, 3...)

New execution fields:

job_type - type of scraping job (google_reviews, etc.)
scraper_version - exact version that executed (1.2.0)
scraper_variant - variant used (stable, beta, canary)
priority - execution priority (0, 1, 2)

New callback fields:

callback_url - where to POST on completion
callback_status - pending, sent, failed
callback_sent_at - when callback was delivered
callback_attempts - retry count

4.2 Batches (new)

id - unique identifier
name - human readable name
requester_client_id - client who submitted
requester_source - origin system
scrape_purpose - purpose for all jobs in batch
total_jobs - count of jobs in batch
completed_jobs - count finished (success or fail)
failed_jobs - count failed
status - pending, running, completed
callback_url - batch completion webhook
callback_status - pending, sent, failed
created_at - when batch was created
completed_at - when last job finished
metadata - flexible JSON
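The counters and status above imply a small state machine. The following sketch shows one possible way to keep them consistent as jobs finish; the class and method names are hypothetical, and it assumes a batch becomes completed exactly when completed_jobs reaches total_jobs.

```python
from dataclasses import dataclass


@dataclass
class BatchProgress:
    total_jobs: int
    completed_jobs: int = 0  # finished jobs, success or failure
    failed_jobs: int = 0
    status: str = "pending"

    def record_finished(self, succeeded: bool) -> bool:
        """Record one finished job; return True when the batch has just completed."""
        if self.status == "completed":
            raise ValueError("batch already completed")
        self.completed_jobs += 1
        if not succeeded:
            self.failed_jobs += 1
        if self.completed_jobs >= self.total_jobs:
            self.status = "completed"
            return True  # the caller would fire the batch callback here
        self.status = "running"
        return False


batch = BatchProgress(total_jobs=3)
signals = [batch.record_finished(ok) for ok in (True, False, True)]
print(signals, batch.status, batch.failed_jobs)  # [False, False, True] completed 1
```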

4.3 Scraper Registry (new)

id - unique identifier
job_type - which job type this scraper handles
version - semantic version (1.2.0, 2.0.0-beta)
variant - stable, beta, canary
module_path - Python module path
function_name - entry function
is_default - use if no version specified
traffic_pct - percentage of traffic for A/B testing
min_priority - only use for jobs at or above this priority
created_at - when registered
deprecated_at - when marked deprecated (null if active)
config - version-specific configuration JSON
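To illustrate how these fields might interact when a job is dispatched, here is a small resolution sketch. The precedence (pinned version, then requested variant, then the default entry) and the rule that min_priority also applies to pinned requests are assumptions, not decisions recorded in this spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScraperEntry:
    job_type: str
    version: str
    variant: str
    is_default: bool = False
    min_priority: int = 0
    deprecated: bool = False  # stands in for a non-null deprecated_at


def resolve(entries: list[ScraperEntry], job_type: str, priority: int = 0,
            version: Optional[str] = None, variant: Optional[str] = None) -> ScraperEntry:
    """Return the registry entry that should execute a job."""
    candidates = [e for e in entries
                  if e.job_type == job_type and not e.deprecated
                  and priority >= e.min_priority]
    if version is not None:
        matches = [e for e in candidates if e.version == version]
    elif variant is not None:
        matches = [e for e in candidates if e.variant == variant]
    else:
        matches = [e for e in candidates if e.is_default]
    if not matches:
        raise LookupError(f"no eligible scraper for {job_type}")
    return matches[0]


# Hypothetical registry contents.
REGISTRY = [
    ScraperEntry("google_reviews", "1.2.0", "stable", is_default=True),
    ScraperEntry("google_reviews", "2.0.0-beta", "beta"),
    ScraperEntry("google_reviews", "2.1.0", "canary", min_priority=1),
]

print(resolve(REGISTRY, "google_reviews").version)                  # 1.2.0
print(resolve(REGISTRY, "google_reviews", variant="beta").version)  # 2.0.0-beta
```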

4.4 Generic Result Summary

Each job has a result_summary JSON field that feeds the cross-type dashboard:

{
  "item_count": 150,
  "primary_metric": 4.2,
  "primary_metric_label": "rating",
  "secondary_metrics": {
    "reviews_with_text": 120,
    "avg_review_length": 45
  }
}

This enables the dashboard to show unified metrics across job types.
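As an illustration, a Google Reviews job could derive this summary from its scraped review objects roughly as follows. The function name is hypothetical, and measuring avg_review_length in characters is an assumption; the spec does not define the unit.

```python
def summarize_reviews(reviews: list[dict]) -> dict:
    """Build a result_summary dict from review objects with 'rating' and 'text' keys."""
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    texts = [r["text"] for r in reviews if r.get("text")]
    return {
        "item_count": len(reviews),
        "primary_metric": round(sum(ratings) / len(ratings), 1) if ratings else None,
        "primary_metric_label": "rating",
        "secondary_metrics": {
            "reviews_with_text": len(texts),
            # Assumption: length is counted in characters.
            "avg_review_length": round(sum(len(t) for t in texts) / len(texts)) if texts else 0,
        },
    }


sample = [
    {"rating": 5, "text": "Great"},
    {"rating": 4, "text": ""},
    {"rating": 3, "text": "Okay food"},
]
summary = summarize_reviews(sample)
print(summary)
```

Other job types would fill the same keys with their own primary metric, which is what lets the dashboard treat all job types uniformly.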

5. API Endpoints

5.1 Scraping Endpoints

POST /api/scrape/google-reviews
POST /api/scrape/yelp-reviews        (future)
POST /api/scrape/tripadvisor-reviews (future)

Each accepts type-specific parameters plus common fields:

requester object (client_id, source, purpose, metadata)
priority (0, 1, 2)
callback_url
scraper_version or scraper_variant (optional)
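The common fields could be validated along these lines. This sketch uses plain dataclasses so it runs without third-party packages (the project structure mentions Pydantic models, which would be the more natural home). The `url` field name and the rule that scraper_version and scraper_variant are mutually exclusive are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional
from urllib.parse import urlparse

PURPOSES = {"client_report", "prospect_screening", "market_research"}
VARIANTS = {"stable", "beta", "canary"}


def _is_http_url(value: str) -> bool:
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


@dataclass
class Requester:
    client_id: str
    source: str
    purpose: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ScrapeRequest:
    url: str  # hypothetical name for the type-specific target field
    requester: Requester
    priority: int = 0
    callback_url: Optional[str] = None
    scraper_version: Optional[str] = None
    scraper_variant: Optional[str] = None

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is acceptable."""
        errors = []
        if not _is_http_url(self.url):
            errors.append("url must be an http(s) URL")
        if self.requester.purpose not in PURPOSES:
            errors.append("requester.purpose is not a known purpose")
        if self.priority not in (0, 1, 2):
            errors.append("priority must be 0, 1, or 2")
        if self.callback_url is not None and not _is_http_url(self.callback_url):
            errors.append("callback_url must be an http(s) URL")
        if self.scraper_variant is not None and self.scraper_variant not in VARIANTS:
            errors.append("scraper_variant must be stable, beta, or canary")
        if self.scraper_version and self.scraper_variant:
            errors.append("specify scraper_version or scraper_variant, not both")
        return errors


requester = Requester("veritas_client_123", "veritasreview.com", "client_report")
ok = ScrapeRequest(url="https://google.com/maps/place/example", requester=requester, priority=1)
bad = ScrapeRequest(url="not-a-url", requester=requester, priority=5, scraper_variant="nightly")
print(ok.validate())  # []
print(bad.validate())
```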

5.2 Batch Endpoint

POST /api/scrape/google-reviews/batch

Accepts:

name - batch name
urls - array of URLs
requester object
priority
callback_url - called when entire batch completes

5.3 Management Endpoints

GET    /api/jobs                  - list with filters
GET    /api/jobs/{id}             - job detail
DELETE /api/jobs/{id}             - cancel job
POST   /api/jobs/{id}/retry       - retry failed job

GET    /api/batches               - list batches
GET    /api/batches/{id}          - batch detail with job list
DELETE /api/batches/{id}          - cancel all pending jobs in batch

5.4 Dashboard Endpoints

GET /api/dashboard/overview       - system stats
GET /api/dashboard/by-client      - breakdown by client
GET /api/dashboard/by-job-type    - breakdown by job type
GET /api/dashboard/by-version     - scraper version comparison
GET /api/dashboard/problems       - recent failures, alerts

5.5 Admin Endpoints

GET  /api/admin/scrapers                    - list registered scrapers
POST /api/admin/scrapers                    - register new scraper version
PUT  /api/admin/scrapers/{id}/traffic       - update traffic percentage
POST /api/admin/scrapers/{id}/deprecate     - mark deprecated
POST /api/admin/scrapers/{id}/promote       - promote to stable

6. Output Schemas

Each job type has a defined output schema. External services (like veritasreview.com) consume this data to generate insights.

6.1 Google Reviews Output

Business Summary:

{
  "business": {
    "name": "Acme Restaurant",
    "place_id": "ChIJ...",
    "address": "123 Main St, City, State",
    "category": "Restaurant",
    "total_reviews": 1250,
    "rating": 4.3,
    "rating_distribution": {
      "5": 720,
      "4": 280,
      "3": 120,
      "2": 80,
      "1": 50
    },
    "scraped_at": "2025-01-24T10:30:00Z"
  }
}

Review Object:

{
  "review_id": "abc123",
  "author": {
    "name": "John D.",
    "profile_url": "https://...",
    "is_local_guide": true,
    "review_count": 42,
    "photo_count": 15
  },
  "rating": 4,
  "text": "Great food and service...",
  "language": "en",
  "published_at": "2025-01-15T14:30:00Z",
  "photos": [
    { "url": "https://...", "caption": null }
  ],
  "owner_response": {
    "text": "Thank you for your feedback...",
    "responded_at": "2025-01-16T09:00:00Z"
  },
  "metadata": {
    "source": "dom",
    "extracted_at": "2025-01-24T10:35:00Z"
  }
}

Key fields for insights service:

rating + text → Sentiment analysis, rating correlation
published_at → Trend analysis, seasonality
language → Multi-language support
owner_response → Engagement metrics, response rate
author.is_local_guide → Review credibility weighting
rating_distribution → Rating spread analysis
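To make the mapping above concrete, here is a rough sketch of how a consuming service might compute two of these metrics from review objects. The function and metric names are illustrative only; the insights service's real calculations are outside this spec.

```python
def engagement_metrics(reviews: list[dict]) -> dict:
    """Compute owner response rate and local-guide figures from review objects."""
    total = len(reviews)
    responded = sum(1 for r in reviews if r.get("owner_response"))
    guide_ratings = [r["rating"] for r in reviews
                     if r.get("author", {}).get("is_local_guide")]
    return {
        "response_rate": responded / total if total else 0.0,
        "local_guide_share": len(guide_ratings) / total if total else 0.0,
        "local_guide_avg_rating": (sum(guide_ratings) / len(guide_ratings)
                                   if guide_ratings else None),
    }


sample = [
    {"rating": 4, "author": {"is_local_guide": True}, "owner_response": {"text": "Thanks"}},
    {"rating": 5, "author": {"is_local_guide": True}, "owner_response": None},
    {"rating": 2, "author": {"is_local_guide": False}, "owner_response": {"text": "Sorry"}},
    {"rating": 3, "author": {"is_local_guide": False}, "owner_response": None},
]
metrics = engagement_metrics(sample)
print(metrics)
```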

6.2 Future Job Types

Other scrapers (Yelp, TripAdvisor, etc.) will have their own schemas but follow similar patterns:

Business summary with ratings
Individual review objects
Author metadata
Timestamps for trend analysis

7. Webhook Payloads

7.1 Job Completion

{
  "event": "job.completed",
  "job_id": "uuid",
  "job_type": "google_reviews",
  "status": "completed",
  "url": "https://google.com/maps/...",
  "result_summary": {
    "item_count": 150,
    "primary_metric": 4.2
  },
  "scraper_version": "1.2.0",
  "duration_seconds": 45.2,
  "completed_at": "2024-01-15T10:30:00Z"
}

7.2 Job Failed

{
  "event": "job.failed",
  "job_id": "uuid",
  "job_type": "google_reviews",
  "status": "failed",
  "url": "https://google.com/maps/...",
  "error": {
    "type": "rate_limited",
    "message": "Google rate limit detected"
  },
  "scraper_version": "1.2.0",
  "duration_seconds": 12.5,
  "failed_at": "2024-01-15T10:30:00Z"
}

7.3 Batch Completion

{
  "event": "batch.completed",
  "batch_id": "uuid",
  "name": "Q1 Prospects",
  "total_jobs": 50,
  "succeeded": 47,
  "failed": 3,
  "completed_at": "2024-01-15T10:30:00Z",
  "failed_job_ids": ["uuid1", "uuid2", "uuid3"]
}
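A batch completion payload of this shape could be assembled from per-job statuses as sketched below. The function name and the job_statuses mapping are hypothetical, and the status strings are taken from the job payloads above.

```python
from datetime import datetime, timezone


def build_batch_payload(batch_id: str, name: str, job_statuses: dict[str, str],
                        completed_at: datetime) -> dict:
    """Assemble a batch.completed payload from a {job_id: status} mapping."""
    failed_ids = sorted(job_id for job_id, status in job_statuses.items()
                        if status == "failed")
    succeeded = sum(1 for status in job_statuses.values() if status == "completed")
    return {
        "event": "batch.completed",
        "batch_id": batch_id,
        "name": name,
        "total_jobs": len(job_statuses),
        "succeeded": succeeded,
        "failed": len(failed_ids),
        # completed_at must be timezone-aware; it is rendered in UTC.
        "completed_at": completed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "failed_job_ids": failed_ids,
    }


payload = build_batch_payload(
    "batch-1",
    "Q1 Prospects",
    {"job-1": "completed", "job-2": "failed", "job-3": "completed"},
    datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(payload)
```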

8. UI Pages

8.1 Main Dashboard (`/dashboard`)

System health at a glance
Key metrics with trends
Problem alerts
Quick links to drill down

8.2 Clients View (`/dashboard/clients`)

Table of clients with job counts, success rates
Click to see client's jobs

8.3 Scrapers View (`/dashboard/scrapers`)

Registered scraper versions
Performance comparison
Traffic allocation controls
Promote/deprecate actions

8.4 Jobs View (`/jobs`) - enhanced

Add filters: client, job type, batch, scraper version
Show requester info in job cards

8.5 Batches View (`/batches`)

List of batches with progress
Click to see batch detail and jobs

9. Project Structure

9.1 Backend Structure

reviewiq/                              # Root (renamed from google-reviews-scraper-pro)
│
├── api/
│   ├── __init__.py
│   ├── server.py                      # FastAPI app, startup, middleware
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── scrape.py                  # /api/scrape/* endpoints
│   │   ├── jobs.py                    # /api/jobs/* endpoints
│   │   ├── batches.py                 # /api/batches/* endpoints
│   │   ├── dashboard.py               # /api/dashboard/* endpoints
│   │   └── admin.py                   # /api/admin/* endpoints
│   └── middleware/
│       ├── __init__.py
│       └── auth.py                    # API key authentication
│
├── scrapers/
│   ├── __init__.py
│   ├── registry.py                    # ScraperRegistry - version routing
│   ├── base.py                        # BaseScraper interface
│   │
│   ├── google_reviews/
│   │   ├── __init__.py
│   │   ├── v1_0_0.py                  # Current stable (migrated from scraper_clean.py)
│   │   └── parsers.py                 # Review parsing logic
│   │
│   └── yelp_reviews/                  # Future
│       ├── __init__.py
│       └── v1_0_0.py
│
├── core/
│   ├── __init__.py
│   ├── database.py                    # Database manager
│   ├── models.py                      # Pydantic models (Job, Batch, etc.)
│   ├── enums.py                       # JobStatus, JobType, Priority, etc.
│   └── config.py                      # Settings, environment variables
│
├── services/
│   ├── __init__.py
│   ├── job_service.py                 # Job creation, management
│   ├── batch_service.py               # Batch operations
│   ├── webhook_service.py             # Callback delivery
│   └── dashboard_service.py           # Aggregate queries
│
├── workers/
│   ├── __init__.py
│   ├── chrome_pool.py                 # Browser pool management
│   ├── job_executor.py                # Job execution orchestration
│   └── webhook_worker.py              # Async webhook delivery
│
├── utils/
│   ├── __init__.py
│   ├── logger.py                      # StructuredLogger
│   ├── crash_analyzer.py              # Crash detection
│   └── health_checks.py               # System health
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py                    # Pytest fixtures
│   ├── api/                           # API route tests
│   ├── scrapers/                      # Scraper tests (mirrors scrapers/)
│   │   └── google_reviews/
│   │       └── test_v1_0_0.py
│   ├── services/                      # Service tests
│   └── integration/                   # End-to-end tests
│
├── migrations/                        # Database migrations
│   └── versions/
│
├── web/                               # Next.js frontend (existing)
│   └── ...
│
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml                     # Python dependencies
└── README.md

9.2 Key Conventions

Naming:

Scraper versions use underscores: v1_0_0.py (valid Python module names)
Version strings use dots: "1.0.0" (semantic versioning in data)
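The two naming forms can be converted mechanically. The helpers below are a sketch with hypothetical names; they accept only plain MAJOR.MINOR.PATCH versions, because the spec does not say how a pre-release version such as 2.0.0-beta would map to a module name.

```python
import re

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")
_MODULE_RE = re.compile(r"^v(\d+)_(\d+)_(\d+)$")


def version_to_module(version: str) -> str:
    """'1.0.0' -> 'v1_0_0'"""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return "v" + "_".join(match.groups())


def module_to_version(module: str) -> str:
    """'v1_0_0' -> '1.0.0'"""
    match = _MODULE_RE.match(module)
    if match is None:
        raise ValueError(f"unsupported module name: {module!r}")
    return ".".join(match.groups())


def module_path(job_type: str, version: str) -> str:
    """Dotted import path for a scraper version under the scrapers package."""
    return f"scrapers.{job_type}.{version_to_module(version)}"


print(version_to_module("1.0.0"))              # v1_0_0
print(module_to_version("v1_2_0"))             # 1.2.0
print(module_path("google_reviews", "1.0.0"))  # scrapers.google_reviews.v1_0_0
```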

Imports:

from scrapers.google_reviews.v1_0_0 import GoogleReviewsScraper
from scrapers.registry import ScraperRegistry
from core.models import Job, Batch
from services.job_service import JobService

Scraper Interface: Each scraper version implements:

class GoogleReviewsScraper(BaseScraper):
    VERSION = "1.0.0"
    JOB_TYPE = "google_reviews"

    async def scrape(self, url: str, options: dict) -> ScraperResult:
        ...

    def validate_url(self, url: str) -> bool:
        ...
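For context, the base interface that such a class inherits from might look like the sketch below. The contents of scrapers/base.py are not defined in this spec, so the ScraperResult fields and the EchoScraper demo class are assumptions made purely for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ScraperResult:
    # Hypothetical shape: the scraped items plus a result_summary-style dict.
    items: list = field(default_factory=list)
    summary: dict = field(default_factory=dict)


class BaseScraper(ABC):
    VERSION: str = ""
    JOB_TYPE: str = ""

    @abstractmethod
    async def scrape(self, url: str, options: dict) -> ScraperResult:
        """Run the scrape for one URL and return its result."""

    @abstractmethod
    def validate_url(self, url: str) -> bool:
        """Return True if this scraper can handle the URL."""


class EchoScraper(BaseScraper):
    """Toy implementation used only to exercise the interface."""

    VERSION = "0.0.1"
    JOB_TYPE = "echo"

    async def scrape(self, url: str, options: dict) -> ScraperResult:
        return ScraperResult(items=[{"url": url}], summary={"item_count": 1})

    def validate_url(self, url: str) -> bool:
        return url.startswith("https://")


scraper = EchoScraper()
result = asyncio.run(scraper.scrape("https://example.com", {}))
print(result.summary)  # {'item_count': 1}
```

Declaring both methods abstract means a version module that forgets one of them fails at instantiation time rather than in the middle of a job.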

9.3 Frontend Structure (existing, minor additions)

web/
├── app/
│   ├── dashboard/                     # New main dashboard
│   │   ├── page.tsx                   # Overview
│   │   ├── clients/page.tsx
│   │   ├── scrapers/page.tsx
│   │   └── problems/page.tsx
│   ├── batches/                       # New
│   │   ├── page.tsx
│   │   └── [id]/page.tsx
│   ├── jobs/                          # Enhanced
│   └── analytics/                     # Existing
├── components/
│   ├── dashboard/                     # Dashboard-specific components
│   └── ...
└── ...

10. Backwards Compatibility

10.1 Existing API

POST /api/scrape continues to work as-is:

Defaults to job_type: google_reviews
No requester required (legacy mode)
No callback required
Routes to the same scraper logic

10.2 Existing Database

All new fields have defaults
Existing jobs have null requester fields
job_type defaults to google_reviews
Migration adds columns without breaking existing data
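The additive migration described above can be demonstrated with an in-memory SQLite database. This is only an illustration of the principle: the platform's actual database engine, migration tool, and existing jobs columns are not specified here, so the table layout below is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical pre-migration table with one legacy row.
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, url TEXT NOT NULL, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs VALUES ('legacy-1', 'https://example.com', 'completed')")

# New columns either carry a default or are nullable, so existing rows stay valid.
MIGRATION = [
    "ALTER TABLE jobs ADD COLUMN job_type TEXT NOT NULL DEFAULT 'google_reviews'",
    "ALTER TABLE jobs ADD COLUMN priority INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE jobs ADD COLUMN requester_client_id TEXT",
    "ALTER TABLE jobs ADD COLUMN batch_id TEXT",
]
for statement in MIGRATION:
    conn.execute(statement)

row = conn.execute(
    "SELECT job_type, priority, requester_client_id, batch_id FROM jobs WHERE id = 'legacy-1'"
).fetchone()
print(row)  # ('google_reviews', 0, None, None)
```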

10.3 Scraper Migration

Current scraper code moves to scrapers/google_reviews/v1_0_0.py
Registered in scraper_registry as stable with 100% traffic
Old file scraper_clean.py deleted after migration
All imports updated to new paths

11. Additional Considerations

11.1 Authentication

External API clients authenticate via API keys
API keys stored in api_keys table with client_id reference
Keys can be scoped (read-only, submit jobs, admin)
Rate limits can be per-key
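A scope check for API keys might look like the following sketch. The scope names ("read", "submit", "admin"), the in-memory key table, and the choice to store SHA-256 hashes instead of raw keys are all assumptions made for the example.

```python
import hashlib


def hash_key(raw_key: str) -> str:
    """Hash a raw API key so the stored table never holds the key itself."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


# Hypothetical stand-in for the api_keys table, keyed by hashed key.
API_KEYS = {
    hash_key("demo-key-readonly"): {"client_id": "veritas_client_123", "scopes": {"read"}},
    hash_key("demo-key-submit"): {"client_id": "veritas_client_123", "scopes": {"read", "submit"}},
}


def authorize(raw_key: str, required_scope: str) -> str:
    """Return the client_id for a valid key with the required scope, else raise."""
    record = API_KEYS.get(hash_key(raw_key))
    if record is None:
        raise PermissionError("unknown API key")
    if required_scope not in record["scopes"]:
        raise PermissionError(f"key lacks the {required_scope!r} scope")
    return record["client_id"]


print(authorize("demo-key-submit", "submit"))  # veritas_client_123
```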

11.2 Error Handling

All API errors return consistent JSON structure:

{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "URL is required",
    "details": { ... }
  }
}

Scraper errors captured with crash analysis
Failed webhooks retry with exponential backoff (max 5 attempts)
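The retry rule could be implemented roughly as follows. The base delay, multiplier, and cap are illustrative values, and the sketch assumes that the limit of 5 attempts includes the first delivery try; neither point is settled by this spec.

```python
import time
from typing import Callable


def backoff_delays(max_attempts: int = 5, base_seconds: float = 1.0,
                   factor: float = 2.0, cap_seconds: float = 60.0) -> list[float]:
    """Delays between attempts; the first attempt is immediate."""
    return [min(cap_seconds, base_seconds * factor ** i) for i in range(max_attempts - 1)]


def deliver_with_retry(send: Callable[[], bool], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> tuple[str, int]:
    """Call send() until it succeeds or attempts run out.

    Returns (callback_status, callback_attempts).
    """
    delays = backoff_delays(max_attempts)
    for attempt in range(1, max_attempts + 1):
        if send():
            return "sent", attempt
        if attempt < max_attempts:
            sleep(delays[attempt - 1])
    return "failed", max_attempts


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0]

# Simulated endpoint that fails twice, then accepts the callback.
outcomes = iter([False, False, True])
waited: list[float] = []
status, attempts = deliver_with_retry(lambda: next(outcomes), sleep=waited.append)
print(status, attempts, waited)  # sent 3 [1.0, 2.0]
```

The returned pair maps directly onto the callback_status and callback_attempts fields in the jobs data model.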

11.3 Logging

All components use StructuredLogger
Log levels: DEBUG, INFO, WARN, ERROR, FATAL
Categories: api, scraper, webhook, system
Logs include correlation IDs for tracing
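As a rough picture of what a correlated log record could contain, the sketch below emits JSON lines that share one correlation ID across categories. It is not the StructuredLogger API, which this spec does not describe; the function name and record fields are hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone

LEVELS = ("DEBUG", "INFO", "WARN", "ERROR", "FATAL")
CATEGORIES = ("api", "scraper", "webhook", "system")


def log_line(level: str, category: str, message: str,
             correlation_id: str, **fields) -> str:
    """Serialize one log record as a JSON line."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "category": category,
        "correlation_id": correlation_id,
        "message": message,
        **fields,
    }
    return json.dumps(record)


# One correlation ID follows a job from API acceptance to webhook delivery.
correlation_id = str(uuid.uuid4())
lines = [
    log_line("INFO", "api", "job accepted", correlation_id, job_id="job-1"),
    log_line("INFO", "scraper", "scrape finished", correlation_id, job_id="job-1", item_count=150),
    log_line("WARN", "webhook", "callback retry scheduled", correlation_id, attempt=2),
]
for line in lines:
    print(line)
```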

11.4 Configuration

Environment-based configuration via core/config.py
Sensitive values from environment variables
Per-scraper config in scraper_registry.config JSON

11.5 Monitoring

Health check endpoint: GET /health
Prometheus metrics endpoint: GET /metrics (future)
Dashboard provides operational visibility

11.6 Data Retention

Define retention policy for completed jobs
Archive or delete old job data after N days
Keep aggregate stats for historical reporting

12. Implementation Phases

Phase 0: Project Restructure

Reorganize files to new structure
Move scraper_clean.py → scrapers/google_reviews/v1_0_0.py
Update all imports
Verify everything still works

Phase 1: Data Model

Add new fields to jobs table
Create batches table
Create scraper_registry table
Create api_keys table
Migration preserves existing data

Phase 2: Requester & Batch Support

Update API to accept requester info
Implement batch submission endpoint
Store and display requester/batch info

Phase 3: Webhooks

Implement callback delivery service
Retry logic for failed callbacks
Track delivery status

Phase 4: Scraper Versioning

Implement scraper registry
Version routing logic
Admin endpoints for management

Phase 5: Main Dashboard

Build dashboard pages
Aggregate queries
Real-time updates

Phase 6: Traffic Management & A/B

A/B test traffic splitting
Promote/deprecate workflow
Performance comparison views

Phase 7: Authentication

API key management
Client authentication middleware
Rate limiting (optional)

13. Success Metrics

API response time < 200ms for job submission
Webhook delivery within 5 seconds of job completion
Dashboard loads in < 2 seconds
Support 100+ concurrent scraping jobs
99% webhook delivery success rate
Clear visibility into scraper version performance

14. Open Questions

Authentication: How do external clients authenticate? API keys per client? → Resolved: API keys
Rate Limits: Per-client rate limiting? (deferred to Phase 7)
Retention: How long to keep completed job data? (needs decision)
Billing: Track usage for billing purposes? (future consideration)
Project Rename: Rename folder from google-reviews-scraper-pro to reviewiq?

15. Glossary

Term	Definition
Job	A single scraping task for one URL
Batch	A collection of related jobs submitted together
Job Type	Category of scraping (google_reviews, yelp_reviews, etc.)
Requester	External client/system that requests jobs
Scraper Version	Specific implementation of a scraper (v1.0.0, v2.0.0)
Variant	Stability tier: stable, beta, canary
Callback/Webhook	HTTP POST to notify client of job completion

Document Version: 1.2
Last Updated: 2025-01-24
