ReviewIQ Scraping Platform - Specification
Purpose: Define WHAT the platform should do, not HOW. This document serves as the source of truth during implementation.
1. Vision
Transform the current Google Reviews scraper into a multi-tenant scraping-as-a-service platform that:
- Serves external clients via API (initially veritasreview.com)
- Supports multiple scraping job types (reviews, business info, etc.)
- Provides full observability into system performance and problems
- Enables safe scraper iteration through versioning and A/B testing
2. Core Concepts
2.1 Job Types
The platform executes different types of scraping jobs:
- google_reviews (current, primary)
- Future: yelp_reviews, tripadvisor_reviews, google_business_info, etc.
Each job type has its own:
- Input parameters
- Output schema
- Scraper implementation(s)
2.2 Requesters
External systems that request scraping jobs:
- Identified by client_id (e.g., "veritas_client_123")
- Originate from a source (e.g., "veritasreview.com")
- Have a purpose for scraping:
  - client_report - generating reports for their clients
  - prospect_screening - evaluating potential clients
  - market_research - competitive/market analysis
2.3 Batches
Jobs can be grouped into batches:
- A batch is a collection of related jobs (e.g., "Q1 Prospect List")
- Batches have their own completion callback
- Dashboard shows batch progress and aggregate stats
2.4 Scraper Versions
Each job type can have multiple scraper versions:
- Variants: stable, beta, canary
- Traffic routing: A/B testing via percentage allocation
- Version pinning: Clients can request specific versions
- Safe rollouts: Promote canary → beta → stable
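The percentage-based routing above could be sketched as a weighted random choice over the versions registered for a job type. This is a minimal sketch under stated assumptions, not the platform's implementation: `ScraperEntry`, `pick_scraper`, and the rule that a pinned variant bypasses the traffic split are hypothetical, and the field names simply mirror the scraper registry fields in section 4.3.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ScraperEntry:
    """Hypothetical in-memory view of one scraper registry row."""
    version: str        # e.g. "1.2.0"
    variant: str        # "stable", "beta", or "canary"
    traffic_pct: float  # share of unpinned traffic, 0-100


def pick_scraper(
    entries: Sequence[ScraperEntry],
    pinned_variant: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> ScraperEntry:
    """Choose a scraper version for one job.

    Assumption: a client-pinned variant bypasses the traffic split;
    otherwise the choice is weighted by traffic_pct.
    """
    if pinned_variant is not None:
        for entry in entries:
            if entry.variant == pinned_variant:
                return entry
        raise LookupError(f"no scraper registered for variant {pinned_variant!r}")

    candidates = [e for e in entries if e.traffic_pct > 0]
    if not candidates:
        raise LookupError("no scraper has traffic allocated")
    rng = rng or random.Random()
    weights = [e.traffic_pct for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Illustrative registry contents, not real data.
REGISTRY = [
    ScraperEntry("1.0.0", "stable", 90.0),
    ScraperEntry("1.1.0", "beta", 10.0),
    ScraperEntry("1.2.0", "canary", 0.0),
]
```

With this allocation, unpinned jobs never reach the canary, while a client that pins `canary` still gets version 1.2.0.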
2.5 Priority Levels
Jobs have priority that affects execution order:
- 0 = normal
- 1 = high
- 2 = urgent
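One way to turn these levels into an execution order is a heap keyed on priority with a submission counter as tie-breaker. The sketch below assumes that higher numbers run first and that jobs at the same level run in submission order; the spec states neither rule explicitly, and `JobQueue` is a hypothetical name.

```python
import heapq
import itertools
from typing import List, Tuple


class JobQueue:
    """Hypothetical in-memory queue: urgent (2) before high (1) before normal (0)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per level

    def push(self, job_id: str, priority: int = 0) -> None:
        if priority not in (0, 1, 2):
            raise ValueError("priority must be 0, 1, or 2")
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def pop(self) -> str:
        """Remove and return the next job ID to execute."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```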
3. Features
3.1 API - Job Submission
Single job submission:
- Submit a scraping job for a specific job type
- Include requester identification
- Optionally specify priority, callback URL, scraper variant
- Returns job ID immediately
Batch submission:
- Submit multiple URLs as a single batch
- Batch has a name and optional batch-level callback
- Individual jobs track their position in batch
- Batch callback fires when all jobs complete
3.2 API - Job Management
- Get job status and results
- Cancel pending/running jobs
- Retry failed jobs
- List jobs with filtering (by client, status, date, batch, job type)
3.3 API - Webhooks
When a job completes (success or failure):
- POST to the provided callback_url
- Include job ID, status, summary results, error info if failed
- Track callback delivery status (pending, sent, failed)
- Retry failed callbacks
When a batch completes:
- POST to batch-level callback
- Include batch summary (total, succeeded, failed)
3.4 Main Dashboard
System Overview:
- Total jobs (24h / 7d / 30d)
- Success rate trend
- Currently running jobs
- Recent failures / problems requiring attention
By Client/Source:
- Jobs per client
- Top consumers (volume)
- Error rates by client
- Purpose breakdown per client
By Job Type:
- Volume per job type
- Success rate per type
- Average duration per type
By Scraper Version:
- Performance comparison across versions
- Success rate by version
- Duration by version
- Ability to identify when beta outperforms stable
Problems & Alerts:
- Recent failures with error types
- Slow jobs (exceeding expected duration)
- Callback delivery failures
- Clients with elevated error rates
3.5 Job Detail View (existing, enhanced)
Current functionality preserved, plus:
- Show requester info (client, source, purpose)
- Show batch membership if applicable
- Show scraper version that executed
- Link to related jobs (same batch, same client)
3.6 Analytics View
Per-job analytics (existing) remains for Google Reviews:
- Rating distribution
- Sentiment analysis
- Review topics
- Timeline
Future: type-specific analytics for other job types.
4. Data Model
4.1 Jobs (enhanced)
Existing fields preserved.
New requester fields:
- requester_client_id - which client requested this
- requester_source - origin system (veritasreview.com)
- scrape_purpose - why (client_report, prospect_screening, market_research)
- requester_metadata - flexible JSON for additional context
New batch fields:
- batch_id - links to batch if part of one
- batch_index - position in batch (1, 2, 3...)
New execution fields:
- job_type - type of scraping job (google_reviews, etc.)
- scraper_version - exact version that executed (1.2.0)
- scraper_variant - variant used (stable, beta, canary)
- priority - execution priority (0, 1, 2)
New callback fields:
- callback_url - where to POST on completion
- callback_status - pending, sent, failed
- callback_sent_at - when callback was delivered
- callback_attempts - retry count
4.2 Batches (new)
- id - unique identifier
- name - human readable name
- requester_client_id - client who submitted
- requester_source - origin system
- scrape_purpose - purpose for all jobs in batch
- total_jobs - count of jobs in batch
- completed_jobs - count finished (success or fail)
- failed_jobs - count failed
- status - pending, running, completed
- callback_url - batch completion webhook
- callback_status - pending, sent, failed
- created_at - when batch was created
- completed_at - when last job finished
- metadata - flexible JSON
4.3 Scraper Registry (new)
- id - unique identifier
- job_type - which job type this scraper handles
- version - semantic version (1.2.0, 2.0.0-beta)
- variant - stable, beta, canary
- module_path - Python module path
- function_name - entry function
- is_default - use if no version specified
- traffic_pct - percentage of traffic for A/B testing
- min_priority - only use for jobs at or above this priority
- created_at - when registered
- deprecated_at - when marked deprecated (null if active)
- config - version-specific configuration JSON
4.4 Generic Result Summary
Jobs have a result_summary JSON field for cross-type dashboard:
{
"item_count": 150,
"primary_metric": 4.2,
"primary_metric_label": "rating",
"secondary_metrics": {
"reviews_with_text": 120,
"avg_review_length": 45
}
}
This enables the dashboard to show unified metrics across job types.
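As an illustration of how a Google Reviews scraper might populate this field, the sketch below folds a list of review objects (shaped like the Review Object in section 6.1) into the generic summary. The function name `build_result_summary` is hypothetical, and two choices are assumptions rather than spec: `primary_metric` is the mean of the scraped ratings, and `avg_review_length` counts words over reviews that have text.

```python
from typing import Any, Dict, List


def build_result_summary(reviews: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold review objects into the generic result_summary shape.

    Assumption: avg_review_length counts words and covers only reviews with text.
    """
    ratings = [r["rating"] for r in reviews if r.get("rating") is not None]
    texts = [r["text"] for r in reviews if r.get("text")]
    avg_rating = round(sum(ratings) / len(ratings), 1) if ratings else None
    avg_length = round(sum(len(t.split()) for t in texts) / len(texts)) if texts else 0
    return {
        "item_count": len(reviews),
        "primary_metric": avg_rating,
        "primary_metric_label": "rating",
        "secondary_metrics": {
            "reviews_with_text": len(texts),
            "avg_review_length": avg_length,
        },
    }
```

Other job types would keep the same top-level keys and swap in their own label and secondary metrics, which is what lets the dashboard treat all summaries uniformly.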
5. API Endpoints
5.1 Scraping Endpoints
POST /api/scrape/google-reviews
POST /api/scrape/yelp-reviews (future)
POST /api/scrape/tripadvisor-reviews (future)
Each accepts type-specific parameters plus common fields:
- requester object (client_id, source, purpose, metadata)
- priority (0, 1, 2)
- callback_url
- scraper_version or scraper_variant (optional)
5.2 Batch Endpoint
POST /api/scrape/google-reviews/batch
Accepts:
- name - batch name
- urls - array of URLs
- requester object
- priority
- callback_url - called when entire batch completes
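A request body for this endpoint, and a basic check of it, might look like the sketch below. Which fields are mandatory is an assumption here (name, urls, and the requester's client_id, source, and purpose are treated as required; priority defaults to 0), and `validate_batch_request` plus the example URLs are hypothetical.

```python
from typing import Any, Dict, List

VALID_PURPOSES = {"client_report", "prospect_screening", "market_research"}
VALID_PRIORITIES = {0, 1, 2}


def validate_batch_request(body: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the body looks valid."""
    errors: List[str] = []
    if not body.get("name"):
        errors.append("name is required")
    urls = body.get("urls")
    if not isinstance(urls, list) or not urls:
        errors.append("urls must be a non-empty array")
    requester = body.get("requester") or {}
    for field in ("client_id", "source", "purpose"):
        if not requester.get(field):
            errors.append(f"requester.{field} is required")
    purpose = requester.get("purpose")
    if purpose and purpose not in VALID_PURPOSES:
        errors.append(f"requester.purpose must be one of {sorted(VALID_PURPOSES)}")
    if body.get("priority", 0) not in VALID_PRIORITIES:
        errors.append("priority must be 0, 1, or 2")
    return errors


# Illustrative payload; URLs and callback host are placeholders.
EXAMPLE_BATCH = {
    "name": "Q1 Prospect List",
    "urls": ["https://example.com/maps/place-1", "https://example.com/maps/place-2"],
    "requester": {
        "client_id": "veritas_client_123",
        "source": "veritasreview.com",
        "purpose": "prospect_screening",
        "metadata": {},
    },
    "priority": 1,
    "callback_url": "https://example.com/hooks/batch-done",
}
```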
5.3 Management Endpoints
GET /api/jobs - list with filters
GET /api/jobs/{id} - job detail
DELETE /api/jobs/{id} - cancel job
POST /api/jobs/{id}/retry - retry failed job
GET /api/batches - list batches
GET /api/batches/{id} - batch detail with job list
DELETE /api/batches/{id} - cancel all pending jobs in batch
5.4 Dashboard Endpoints
GET /api/dashboard/overview - system stats
GET /api/dashboard/by-client - breakdown by client
GET /api/dashboard/by-job-type - breakdown by job type
GET /api/dashboard/by-version - scraper version comparison
GET /api/dashboard/problems - recent failures, alerts
5.5 Admin Endpoints
GET /api/admin/scrapers - list registered scrapers
POST /api/admin/scrapers - register new scraper version
PUT /api/admin/scrapers/{id}/traffic - update traffic percentage
POST /api/admin/scrapers/{id}/deprecate - mark deprecated
POST /api/admin/scrapers/{id}/promote - promote to stable
6. Output Schemas
Each job type has a defined output schema. External services (like veritasreview.com) consume this data to generate insights.
6.1 Google Reviews Output
Business Summary:
{
"business": {
"name": "Acme Restaurant",
"place_id": "ChIJ...",
"address": "123 Main St, City, State",
"category": "Restaurant",
"total_reviews": 1250,
"rating": 4.3,
"rating_distribution": {
"5": 720,
"4": 280,
"3": 120,
"2": 80,
"1": 50
},
"scraped_at": "2025-01-24T10:30:00Z"
}
}
Review Object:
{
"review_id": "abc123",
"author": {
"name": "John D.",
"profile_url": "https://...",
"is_local_guide": true,
"review_count": 42,
"photo_count": 15
},
"rating": 4,
"text": "Great food and service...",
"language": "en",
"published_at": "2025-01-15T14:30:00Z",
"photos": [
{ "url": "https://...", "caption": null }
],
"owner_response": {
"text": "Thank you for your feedback...",
"responded_at": "2025-01-16T09:00:00Z"
},
"metadata": {
"source": "dom",
"extracted_at": "2025-01-24T10:35:00Z"
}
}
Key fields for insights service:
- rating + text → Sentiment analysis, rating correlation
- published_at → Trend analysis, seasonality
- language → Multi-language support
- owner_response → Engagement metrics, response rate
- author.is_local_guide → Review credibility weighting
- rating_distribution → Rating spread analysis
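To make two of these mappings concrete, the sketch below derives an owner response rate and a per-month review count from review objects. It shows how a consumer could use the schema; it is not part of the platform, and both function names are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Any, Dict, List


def owner_response_rate(reviews: List[Dict[str, Any]]) -> float:
    """Fraction of reviews that carry an owner_response object (0.0 if no reviews)."""
    if not reviews:
        return 0.0
    responded = sum(1 for r in reviews if r.get("owner_response"))
    return responded / len(reviews)


def reviews_per_month(reviews: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count reviews by published_at month ("YYYY-MM") for trend analysis."""
    counts: Counter = Counter()
    for review in reviews:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        published = datetime.fromisoformat(review["published_at"].replace("Z", "+00:00"))
        counts[published.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))
```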
6.2 Future Job Types
Other scrapers (Yelp, TripAdvisor, etc.) will have their own schemas but follow similar patterns:
- Business summary with ratings
- Individual review objects
- Author metadata
- Timestamps for trend analysis
7. Webhook Payloads
7.1 Job Completion
{
"event": "job.completed",
"job_id": "uuid",
"job_type": "google_reviews",
"status": "completed",
"url": "https://google.com/maps/...",
"result_summary": {
"item_count": 150,
"primary_metric": 4.2
},
"scraper_version": "1.2.0",
"duration_seconds": 45.2,
"completed_at": "2024-01-15T10:30:00Z"
}
7.2 Job Failed
{
"event": "job.failed",
"job_id": "uuid",
"job_type": "google_reviews",
"status": "failed",
"url": "https://google.com/maps/...",
"error": {
"type": "rate_limited",
"message": "Google rate limit detected"
},
"scraper_version": "1.2.0",
"duration_seconds": 12.5,
"failed_at": "2024-01-15T10:30:00Z"
}
7.3 Batch Completion
{
"event": "batch.completed",
"batch_id": "uuid",
"name": "Q1 Prospects",
"total_jobs": 50,
"succeeded": 47,
"failed": 3,
"completed_at": "2024-01-15T10:30:00Z",
"failed_job_ids": ["uuid1", "uuid2", "uuid3"]
}
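The batch payload can be derived from the per-job statuses once every job has finished. The sketch below assumes that "completed" and "failed" are the only terminal statuses (the spec does not say how cancelled jobs count) and that timestamps are emitted in UTC; `build_batch_completed_payload` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Mapping, Optional

TERMINAL_STATUSES = {"completed", "failed"}  # assumption: no other terminal states


def build_batch_completed_payload(
    batch_id: str,
    name: str,
    job_statuses: Mapping[str, str],
    completed_at: Optional[datetime] = None,
) -> Optional[Dict[str, Any]]:
    """Return the batch.completed payload, or None while any job is unfinished."""
    if any(status not in TERMINAL_STATUSES for status in job_statuses.values()):
        return None
    failed_ids = [job_id for job_id, status in job_statuses.items() if status == "failed"]
    finished = completed_at or datetime.now(timezone.utc)
    return {
        "event": "batch.completed",
        "batch_id": batch_id,
        "name": name,
        "total_jobs": len(job_statuses),
        "succeeded": len(job_statuses) - len(failed_ids),
        "failed": len(failed_ids),
        "completed_at": finished.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "failed_job_ids": failed_ids,
    }
```

Returning None for an unfinished batch keeps the "fire when all jobs complete" rule in one place, so the caller can invoke the helper after every job update.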
8. UI Pages
8.1 Main Dashboard (/dashboard)
- System health at a glance
- Key metrics with trends
- Problem alerts
- Quick links to drill down
8.2 Clients View (/dashboard/clients)
- Table of clients with job counts, success rates
- Click to see client's jobs
8.3 Scrapers View (/dashboard/scrapers)
- Registered scraper versions
- Performance comparison
- Traffic allocation controls
- Promote/deprecate actions
8.4 Jobs View (/jobs) - enhanced
- Add filters: client, job type, batch, scraper version
- Show requester info in job cards
8.5 Batches View (/batches)
- List of batches with progress
- Click to see batch detail and jobs
9. Project Structure
9.1 Backend Structure
reviewiq/ # Root (renamed from google-reviews-scraper-pro)
│
├── api/
│ ├── __init__.py
│ ├── server.py # FastAPI app, startup, middleware
│ ├── routes/
│ │ ├── __init__.py
│ │ ├── scrape.py # /api/scrape/* endpoints
│ │ ├── jobs.py # /api/jobs/* endpoints
│ │ ├── batches.py # /api/batches/* endpoints
│ │ ├── dashboard.py # /api/dashboard/* endpoints
│ │ └── admin.py # /api/admin/* endpoints
│ └── middleware/
│ ├── __init__.py
│ └── auth.py # API key authentication
│
├── scrapers/
│ ├── __init__.py
│ ├── registry.py # ScraperRegistry - version routing
│ ├── base.py # BaseScraper interface
│ │
│ ├── google_reviews/
│ │ ├── __init__.py
│ │ ├── v1_0_0.py # Current stable (migrated from scraper_clean.py)
│ │ └── parsers.py # Review parsing logic
│ │
│ └── yelp_reviews/ # Future
│ ├── __init__.py
│ └── v1_0_0.py
│
├── core/
│ ├── __init__.py
│ ├── database.py # Database manager
│ ├── models.py # Pydantic models (Job, Batch, etc.)
│ ├── enums.py # JobStatus, JobType, Priority, etc.
│ └── config.py # Settings, environment variables
│
├── services/
│ ├── __init__.py
│ ├── job_service.py # Job creation, management
│ ├── batch_service.py # Batch operations
│ ├── webhook_service.py # Callback delivery
│ └── dashboard_service.py # Aggregate queries
│
├── workers/
│ ├── __init__.py
│ ├── chrome_pool.py # Browser pool management
│ ├── job_executor.py # Job execution orchestration
│ └── webhook_worker.py # Async webhook delivery
│
├── utils/
│ ├── __init__.py
│ ├── logger.py # StructuredLogger
│ ├── crash_analyzer.py # Crash detection
│ └── health_checks.py # System health
│
├── tests/
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures
│ ├── api/ # API route tests
│ ├── scrapers/ # Scraper tests (mirrors scrapers/)
│ │ └── google_reviews/
│ │ └── test_v1_0_0.py
│ ├── services/ # Service tests
│ └── integration/ # End-to-end tests
│
├── migrations/ # Database migrations
│ └── versions/
│
├── web/ # Next.js frontend (existing)
│ └── ...
│
├── docker-compose.yml
├── Dockerfile
├── pyproject.toml # Python dependencies
└── README.md
9.2 Key Conventions
Naming:
- Scraper versions use underscores: v1_0_0.py (valid Python module names)
- Version strings use dots: "1.0.0" (semantic versioning in data)
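The two naming forms can be converted mechanically. The helpers below are a sketch with hypothetical names; they handle plain MAJOR.MINOR.PATCH versions only, because the spec does not say how a pre-release string such as "2.0.0-beta" should map to a module name.

```python
import re


def version_to_module(version: str) -> str:
    """Map a version string such as "1.0.0" to a module name such as "v1_0_0"."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"unsupported version string: {version!r}")
    return "v" + version.replace(".", "_")


def module_to_version(module_name: str) -> str:
    """Map a module name such as "v1_0_0" back to "1.0.0"."""
    match = re.fullmatch(r"v(\d+)_(\d+)_(\d+)", module_name)
    if match is None:
        raise ValueError(f"unsupported module name: {module_name!r}")
    return ".".join(match.groups())
```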
Imports:
from scrapers.google_reviews.v1_0_0 import GoogleReviewsScraper
from scrapers.registry import ScraperRegistry
from core.models import Job, Batch
from services.job_service import JobService
Scraper Interface: Each scraper version implements:
class GoogleReviewsScraper(BaseScraper):
    VERSION = "1.0.0"
    JOB_TYPE = "google_reviews"

    async def scrape(self, url: str, options: dict) -> ScraperResult:
        ...

    def validate_url(self, url: str) -> bool:
        ...
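The spec names BaseScraper and ScraperResult without defining them. One possible shape, given purely as an assumption, is an abstract base class plus a small result dataclass; the fields on `ScraperResult` and the toy `EchoScraper` below are hypothetical and exist only to show the interface being exercised.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List
from urllib.parse import urlparse


@dataclass
class ScraperResult:
    """Hypothetical result container; the spec does not define its fields."""
    items: List[Dict[str, Any]] = field(default_factory=list)
    result_summary: Dict[str, Any] = field(default_factory=dict)


class BaseScraper(ABC):
    """Interface every scraper version implements."""
    VERSION: str = ""
    JOB_TYPE: str = ""

    @abstractmethod
    async def scrape(self, url: str, options: dict) -> ScraperResult: ...

    @abstractmethod
    def validate_url(self, url: str) -> bool: ...


class EchoScraper(BaseScraper):
    """Toy scraper used only to exercise the interface; it fetches nothing."""
    VERSION = "0.0.1"
    JOB_TYPE = "echo"

    def validate_url(self, url: str) -> bool:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    async def scrape(self, url: str, options: dict) -> ScraperResult:
        if not self.validate_url(url):
            raise ValueError(f"unsupported URL: {url}")
        return ScraperResult(items=[{"url": url}], result_summary={"item_count": 1})


if __name__ == "__main__":
    print(asyncio.run(EchoScraper().scrape("https://example.com/x", {})))
```

Declaring both methods abstract means a scraper version that forgets `validate_url` fails at instantiation instead of at the first job.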
9.3 Frontend Structure (existing, minor additions)
web/
├── app/
│ ├── dashboard/ # New main dashboard
│ │ ├── page.tsx # Overview
│ │ ├── clients/page.tsx
│ │ ├── scrapers/page.tsx
│ │ └── problems/page.tsx
│ ├── batches/ # New
│ │ ├── page.tsx
│ │ └── [id]/page.tsx
│ ├── jobs/ # Enhanced
│ └── analytics/ # Existing
├── components/
│ ├── dashboard/ # Dashboard-specific components
│ └── ...
└── ...
10. Backwards Compatibility
10.1 Existing API
POST /api/scrape continues to work as-is:
- Defaults to job_type: google_reviews
- No requester required (legacy mode)
- No callback required
- Routes to the same scraper logic
10.2 Existing Database
- All new fields have defaults
- Existing jobs have null requester fields
- job_type defaults to google_reviews
- Migration adds columns without breaking existing data
10.3 Scraper Migration
- Current scraper code moves to scrapers/google_reviews/v1_0_0.py
- Registered in scraper_registry as stable with 100% traffic
- Old file scraper_clean.py deleted after migration
- All imports updated to new paths
11. Additional Considerations
11.1 Authentication
- External API clients authenticate via API keys
- API keys stored in api_keys table with client_id reference
- Keys can be scoped (read-only, submit jobs, admin)
- Rate limits can be per-key
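Key scopes could be modeled as bit flags so that one key can carry several of them. The sketch below assumes the three scopes are independent flags and that admin implies every other scope; the spec specifies neither, and `Scope` and `is_allowed` are hypothetical names.

```python
from enum import Flag, auto


class Scope(Flag):
    """Hypothetical scope flags matching the three scopes named in the spec."""
    READ = auto()
    SUBMIT = auto()
    ADMIN = auto()


def is_allowed(granted: Scope, required: Scope) -> bool:
    """True if a key holding `granted` may perform an action needing `required`.

    Assumption: ADMIN implies every other scope.
    """
    return Scope.ADMIN in granted or required in granted
```

For example, a key with `Scope.READ | Scope.SUBMIT` may submit jobs but is rejected by endpoints that require `Scope.ADMIN`.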
11.2 Error Handling
- All API errors return consistent JSON structure:
{ "error": { "code": "VALIDATION_ERROR", "message": "URL is required", "details": { ... } } }
- Scraper errors captured with crash analysis
- Failed webhooks retry with exponential backoff (max 5 attempts)
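The retry rule for webhooks can be sketched as a delay schedule plus a status transition. The base delay, growth factor, and cap below are assumptions, as is reading "max 5 attempts" as five deliveries in total (one initial try and four retries); only the limit of 5 and the status names pending, sent, and failed come from the spec.

```python
from typing import List

MAX_ATTEMPTS = 5  # from the spec; interpreted here as total delivery attempts


def backoff_delays(
    base_seconds: float = 1.0,
    factor: float = 2.0,
    max_attempts: int = MAX_ATTEMPTS,
    cap_seconds: float = 60.0,
) -> List[float]:
    """Seconds to wait before each retry; the first attempt is immediate."""
    return [min(base_seconds * factor ** i, cap_seconds) for i in range(max_attempts - 1)]


def next_callback_status(attempts_made: int, delivered: bool) -> str:
    """Map the outcome of the latest attempt to a callback_status value."""
    if delivered:
        return "sent"
    return "failed" if attempts_made >= MAX_ATTEMPTS else "pending"
```

With the default arguments the waits are 1, 2, 4, and 8 seconds, so a callback that never succeeds is marked failed roughly 15 seconds after the first attempt, plus request time.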
11.3 Logging
- All components use StructuredLogger
- Log levels: DEBUG, INFO, WARN, ERROR, FATAL
- Categories: api, scraper, webhook, system
- Logs include correlation IDs for tracing
11.4 Configuration
- Environment-based configuration via core/config.py
- Sensitive values from environment variables
- Per-scraper config in scraper_registry.config JSON
11.5 Monitoring
- Health check endpoint: GET /health
- Prometheus metrics endpoint: GET /metrics (future)
- Dashboard provides operational visibility
11.6 Data Retention
- Define retention policy for completed jobs
- Archive or delete old job data after N days
- Keep aggregate stats for historical reporting
12. Implementation Phases
Phase 0: Project Restructure
- Reorganize files to new structure
- Move scraper_clean.py → scrapers/google_reviews/v1_0_0.py
- Update all imports
- Verify everything still works
Phase 1: Data Model
- Add new fields to jobs table
- Create batches table
- Create scraper_registry table
- Create api_keys table
- Migration preserves existing data
Phase 2: Requester & Batch Support
- Update API to accept requester info
- Implement batch submission endpoint
- Store and display requester/batch info
Phase 3: Webhooks
- Implement callback delivery service
- Retry logic for failed callbacks
- Track delivery status
Phase 4: Scraper Versioning
- Implement scraper registry
- Version routing logic
- Admin endpoints for management
Phase 5: Main Dashboard
- Build dashboard pages
- Aggregate queries
- Real-time updates
Phase 6: Traffic Management & A/B
- A/B test traffic splitting
- Promote/deprecate workflow
- Performance comparison views
Phase 7: Authentication
- API key management
- Client authentication middleware
- Rate limiting (optional)
13. Success Metrics
- API response time < 200ms for job submission
- Webhook delivery within 5 seconds of job completion
- Dashboard loads in < 2 seconds
- Support 100+ concurrent scraping jobs
- 99% webhook delivery success rate
- Clear visibility into scraper version performance
14. Open Questions
- Authentication: How do external clients authenticate? API keys per client? → Resolved: API keys
- Rate Limits: Per-client rate limiting? (deferred to Phase 7)
- Retention: How long to keep completed job data? (needs decision)
- Billing: Track usage for billing purposes? (future consideration)
- Project Rename: Rename folder from google-reviews-scraper-pro to reviewiq?
15. Glossary
| Term | Definition |
|---|---|
| Job | A single scraping task for one URL |
| Batch | A collection of related jobs submitted together |
| Job Type | Category of scraping (google_reviews, yelp_reviews, etc.) |
| Requester | External client/system that requests jobs |
| Scraper Version | Specific implementation of a scraper (v1.0.0, v2.0.0) |
| Variant | Stability tier: stable, beta, canary |
| Callback/Webhook | HTTP POST to notify client of job completion |
Document Version: 1.2
Last Updated: 2025-01-24