- Fix contrast issues in JobDevTools (level badges, text colors, timestamps) - Make log normalization more robust (handles old/new formats, edge cases) - Add ReviewIQ Platform Spec v1.2 defining: - Multi-tenant scraping-as-a-service architecture - Requester metadata, batches, webhooks, priority - Scraper versioning with A/B testing (stable/beta/canary) - API endpoints for job types, dashboard, admin - Output schemas for external service integration - Project structure reorganization plan Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
735 lines
20 KiB
Markdown
735 lines
20 KiB
Markdown
# ReviewIQ Scraping Platform - Specification
|
|
|
|
> **Purpose**: Define WHAT the platform should do, not HOW. This document serves as the source of truth during implementation.
|
|
|
|
---
|
|
|
|
## 1. Vision
|
|
|
|
Transform the current Google Reviews scraper into a **multi-tenant scraping-as-a-service platform** that:
|
|
|
|
- Serves external clients via API (initially veritasreview.com)
|
|
- Supports multiple scraping job types (reviews, business info, etc.)
|
|
- Provides full observability into system performance and problems
|
|
- Enables safe scraper iteration through versioning and A/B testing
|
|
|
|
---
|
|
|
|
## 2. Core Concepts
|
|
|
|
### 2.1 Job Types
|
|
The platform executes different types of scraping jobs:
|
|
- `google_reviews` (current, primary)
|
|
- Future: `yelp_reviews`, `tripadvisor_reviews`, `google_business_info`, etc.
|
|
|
|
Each job type has its own:
|
|
- Input parameters
|
|
- Output schema
|
|
- Scraper implementation(s)
|
|
|
|
### 2.2 Requesters
|
|
External systems that request scraping jobs:
|
|
- Identified by `client_id` (e.g., "veritas_client_123")
|
|
- Originate from a `source` (e.g., "veritasreview.com")
|
|
- Have a `purpose` for scraping:
|
|
- `client_report` - generating reports for their clients
|
|
- `prospect_screening` - evaluating potential clients
|
|
- `market_research` - competitive/market analysis
|
|
|
|
### 2.3 Batches
|
|
Jobs can be grouped into batches:
|
|
- A batch is a collection of related jobs (e.g., "Q1 Prospect List")
|
|
- Batches have their own completion callback
|
|
- Dashboard shows batch progress and aggregate stats
|
|
|
|
### 2.4 Scraper Versions
|
|
Each job type can have multiple scraper versions:
|
|
- **Variants**: `stable`, `beta`, `canary`
|
|
- **Traffic routing**: A/B testing via percentage allocation
|
|
- **Version pinning**: Clients can request specific versions
|
|
- **Safe rollouts**: Promote canary → beta → stable
|
|
|
|
### 2.5 Priority Levels
|
|
Jobs have priority that affects execution order:
|
|
- `0` = normal
|
|
- `1` = high
|
|
- `2` = urgent
|
|
|
|
---
|
|
|
|
## 3. Features
|
|
|
|
### 3.1 API - Job Submission
|
|
|
|
**Single job submission:**
|
|
- Submit a scraping job for a specific job type
|
|
- Include requester identification
|
|
- Optionally specify priority, callback URL, scraper variant
|
|
- Returns job ID immediately
|
|
|
|
**Batch submission:**
|
|
- Submit multiple URLs as a single batch
|
|
- Batch has a name and optional batch-level callback
|
|
- Individual jobs track their position in batch
|
|
- Batch callback fires when all jobs complete
|
|
|
|
### 3.2 API - Job Management
|
|
|
|
- Get job status and results
|
|
- Cancel pending/running jobs
|
|
- Retry failed jobs
|
|
- List jobs with filtering (by client, status, date, batch, job type)
|
|
|
|
### 3.3 API - Webhooks
|
|
|
|
When a job completes (success or failure):
|
|
- POST to the provided `callback_url`
|
|
- Include job ID, status, summary results, error info if failed
|
|
- Track callback delivery status (pending, sent, failed)
|
|
- Retry failed callbacks
|
|
|
|
When a batch completes:
|
|
- POST to batch-level callback
|
|
- Include batch summary (total, succeeded, failed)
|
|
|
|
### 3.4 Main Dashboard
|
|
|
|
**System Overview:**
|
|
- Total jobs (24h / 7d / 30d)
|
|
- Success rate trend
|
|
- Currently running jobs
|
|
- Recent failures / problems requiring attention
|
|
|
|
**By Client/Source:**
|
|
- Jobs per client
|
|
- Top consumers (volume)
|
|
- Error rates by client
|
|
- Purpose breakdown per client
|
|
|
|
**By Job Type:**
|
|
- Volume per job type
|
|
- Success rate per type
|
|
- Average duration per type
|
|
|
|
**By Scraper Version:**
|
|
- Performance comparison across versions
|
|
- Success rate by version
|
|
- Duration by version
|
|
- Ability to identify when beta outperforms stable
|
|
|
|
**Problems & Alerts:**
|
|
- Recent failures with error types
|
|
- Slow jobs (exceeding expected duration)
|
|
- Callback delivery failures
|
|
- Clients with elevated error rates
|
|
|
|
### 3.5 Job Detail View (existing, enhanced)
|
|
|
|
Current functionality preserved, plus:
|
|
- Show requester info (client, source, purpose)
|
|
- Show batch membership if applicable
|
|
- Show scraper version that executed
|
|
- Link to related jobs (same batch, same client)
|
|
|
|
### 3.6 Analytics View
|
|
|
|
Per-job analytics (existing) remains for Google Reviews:
|
|
- Rating distribution
|
|
- Sentiment analysis
|
|
- Review topics
|
|
- Timeline
|
|
|
|
Future: type-specific analytics for other job types.
|
|
|
|
---
|
|
|
|
## 4. Data Model
|
|
|
|
### 4.1 Jobs (enhanced)
|
|
|
|
**Existing fields preserved.**
|
|
|
|
**New requester fields:**
|
|
- `requester_client_id` - which client requested this
|
|
- `requester_source` - origin system (veritasreview.com)
|
|
- `scrape_purpose` - why (client_report, prospect_screening, market_research)
|
|
- `requester_metadata` - flexible JSON for additional context
|
|
|
|
**New batch fields:**
|
|
- `batch_id` - links to batch if part of one
|
|
- `batch_index` - position in batch (1, 2, 3...)
|
|
|
|
**New execution fields:**
|
|
- `job_type` - type of scraping job (google_reviews, etc.)
|
|
- `scraper_version` - exact version that executed (1.2.0)
|
|
- `scraper_variant` - variant used (stable, beta, canary)
|
|
- `priority` - execution priority (0, 1, 2)
|
|
|
|
**New callback fields:**
|
|
- `callback_url` - where to POST on completion
|
|
- `callback_status` - pending, sent, failed
|
|
- `callback_sent_at` - when callback was delivered
|
|
- `callback_attempts` - retry count
|
|
|
|
### 4.2 Batches (new)
|
|
|
|
- `id` - unique identifier
|
|
- `name` - human readable name
|
|
- `requester_client_id` - client who submitted
|
|
- `requester_source` - origin system
|
|
- `scrape_purpose` - purpose for all jobs in batch
|
|
- `total_jobs` - count of jobs in batch
|
|
- `completed_jobs` - count finished (success or fail)
|
|
- `failed_jobs` - count failed
|
|
- `status` - pending, running, completed
|
|
- `callback_url` - batch completion webhook
|
|
- `callback_status` - pending, sent, failed
|
|
- `created_at` - when batch was created
|
|
- `completed_at` - when last job finished
|
|
- `metadata` - flexible JSON
|
|
|
|
### 4.3 Scraper Registry (new)
|
|
|
|
- `id` - unique identifier
|
|
- `job_type` - which job type this scraper handles
|
|
- `version` - semantic version (1.2.0, 2.0.0-beta)
|
|
- `variant` - stable, beta, canary
|
|
- `module_path` - Python module path
|
|
- `function_name` - entry function
|
|
- `is_default` - use if no version specified
|
|
- `traffic_pct` - percentage of traffic for A/B testing
|
|
- `min_priority` - only use for jobs at or above this priority
|
|
- `created_at` - when registered
|
|
- `deprecated_at` - when marked deprecated (null if active)
|
|
- `config` - version-specific configuration JSON
|
|
|
|
### 4.4 Generic Result Summary
|
|
|
|
Jobs have a `result_summary` JSON field for cross-type dashboard:
|
|
```json
|
|
{
|
|
"item_count": 150,
|
|
"primary_metric": 4.2,
|
|
"primary_metric_label": "rating",
|
|
"secondary_metrics": {
|
|
"reviews_with_text": 120,
|
|
"avg_review_length": 45
|
|
}
|
|
}
|
|
```
|
|
|
|
This enables the dashboard to show unified metrics across job types.
|
|
|
|
---
|
|
|
|
## 5. API Endpoints
|
|
|
|
### 5.1 Scraping Endpoints
|
|
|
|
```
|
|
POST /api/scrape/google-reviews
|
|
POST /api/scrape/yelp-reviews (future)
|
|
POST /api/scrape/tripadvisor-reviews (future)
|
|
```
|
|
|
|
Each accepts type-specific parameters plus common fields:
|
|
- `requester` object (client_id, source, purpose, metadata)
|
|
- `priority` (0, 1, 2)
|
|
- `callback_url`
|
|
- `scraper_version` or `scraper_variant` (optional)
|
|
|
|
### 5.2 Batch Endpoint
|
|
|
|
```
|
|
POST /api/scrape/google-reviews/batch
|
|
```
|
|
|
|
Accepts:
|
|
- `name` - batch name
|
|
- `urls` - array of URLs
|
|
- `requester` object
|
|
- `priority`
|
|
- `callback_url` - called when entire batch completes
|
|
|
|
### 5.3 Management Endpoints
|
|
|
|
```
|
|
GET /api/jobs - list with filters
|
|
GET /api/jobs/{id} - job detail
|
|
DELETE /api/jobs/{id} - cancel job
|
|
POST /api/jobs/{id}/retry - retry failed job
|
|
|
|
GET /api/batches - list batches
|
|
GET /api/batches/{id} - batch detail with job list
|
|
DELETE /api/batches/{id} - cancel all pending jobs in batch
|
|
```
|
|
|
|
### 5.4 Dashboard Endpoints
|
|
|
|
```
|
|
GET /api/dashboard/overview - system stats
|
|
GET /api/dashboard/by-client - breakdown by client
|
|
GET /api/dashboard/by-job-type - breakdown by job type
|
|
GET /api/dashboard/by-version - scraper version comparison
|
|
GET /api/dashboard/problems - recent failures, alerts
|
|
```
|
|
|
|
### 5.5 Admin Endpoints
|
|
|
|
```
|
|
GET /api/admin/scrapers - list registered scrapers
|
|
POST /api/admin/scrapers - register new scraper version
|
|
PUT /api/admin/scrapers/{id}/traffic - update traffic percentage
|
|
POST /api/admin/scrapers/{id}/deprecate - mark deprecated
|
|
POST /api/admin/scrapers/{id}/promote - promote to stable
|
|
```
|
|
|
|
---
|
|
|
|
## 6. Output Schemas
|
|
|
|
Each job type has a defined output schema. External services (like veritasreview.com) consume this data to generate insights.
|
|
|
|
### 6.1 Google Reviews Output
|
|
|
|
**Business Summary:**
|
|
```json
|
|
{
|
|
"business": {
|
|
"name": "Acme Restaurant",
|
|
"place_id": "ChIJ...",
|
|
"address": "123 Main St, City, State",
|
|
"category": "Restaurant",
|
|
"total_reviews": 1250,
|
|
"rating": 4.3,
|
|
"rating_distribution": {
|
|
"5": 720,
|
|
"4": 280,
|
|
"3": 120,
|
|
"2": 80,
|
|
"1": 50
|
|
},
|
|
"scraped_at": "2025-01-24T10:30:00Z"
|
|
}
|
|
}
|
|
```
|
|
|
|
**Review Object:**
|
|
```json
|
|
{
|
|
"review_id": "abc123",
|
|
"author": {
|
|
"name": "John D.",
|
|
"profile_url": "https://...",
|
|
"is_local_guide": true,
|
|
"review_count": 42,
|
|
"photo_count": 15
|
|
},
|
|
"rating": 4,
|
|
"text": "Great food and service...",
|
|
"language": "en",
|
|
"published_at": "2025-01-15T14:30:00Z",
|
|
"photos": [
|
|
{ "url": "https://...", "caption": null }
|
|
],
|
|
"owner_response": {
|
|
"text": "Thank you for your feedback...",
|
|
"responded_at": "2025-01-16T09:00:00Z"
|
|
},
|
|
"metadata": {
|
|
"source": "dom",
|
|
"extracted_at": "2025-01-24T10:35:00Z"
|
|
}
|
|
}
|
|
```
|
|
|
|
**Key fields for insights service:**
|
|
- `rating` + `text` → Sentiment analysis, rating correlation
|
|
- `published_at` → Trend analysis, seasonality
|
|
- `language` → Multi-language support
|
|
- `owner_response` → Engagement metrics, response rate
|
|
- `author.is_local_guide` → Review credibility weighting
|
|
- `rating_distribution` → Rating spread analysis
|
|
|
|
### 6.2 Future Job Types
|
|
|
|
Other scrapers (Yelp, TripAdvisor, etc.) will have their own schemas but follow similar patterns:
|
|
- Business summary with ratings
|
|
- Individual review objects
|
|
- Author metadata
|
|
- Timestamps for trend analysis
|
|
|
|
---
|
|
|
|
## 7. Webhook Payloads
|
|
|
|
### 6.1 Job Completion
|
|
|
|
```json
|
|
{
|
|
"event": "job.completed",
|
|
"job_id": "uuid",
|
|
"job_type": "google_reviews",
|
|
"status": "completed",
|
|
"url": "https://google.com/maps/...",
|
|
"result_summary": {
|
|
"item_count": 150,
|
|
"primary_metric": 4.2
|
|
},
|
|
"scraper_version": "1.2.0",
|
|
"duration_seconds": 45.2,
|
|
"completed_at": "2024-01-15T10:30:00Z"
|
|
}
|
|
```
|
|
|
|
### 6.2 Job Failed
|
|
|
|
```json
|
|
{
|
|
"event": "job.failed",
|
|
"job_id": "uuid",
|
|
"job_type": "google_reviews",
|
|
"status": "failed",
|
|
"url": "https://google.com/maps/...",
|
|
"error": {
|
|
"type": "rate_limited",
|
|
"message": "Google rate limit detected"
|
|
},
|
|
"scraper_version": "1.2.0",
|
|
"duration_seconds": 12.5,
|
|
"failed_at": "2024-01-15T10:30:00Z"
|
|
}
|
|
```
|
|
|
|
### 6.3 Batch Completion
|
|
|
|
```json
|
|
{
|
|
"event": "batch.completed",
|
|
"batch_id": "uuid",
|
|
"name": "Q1 Prospects",
|
|
"total_jobs": 50,
|
|
"succeeded": 47,
|
|
"failed": 3,
|
|
"completed_at": "2024-01-15T10:30:00Z",
|
|
"failed_job_ids": ["uuid1", "uuid2", "uuid3"]
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 8. UI Pages
|
|
|
|
### 7.1 Main Dashboard (`/dashboard`)
|
|
- System health at a glance
|
|
- Key metrics with trends
|
|
- Problem alerts
|
|
- Quick links to drill down
|
|
|
|
### 7.2 Clients View (`/dashboard/clients`)
|
|
- Table of clients with job counts, success rates
|
|
- Click to see client's jobs
|
|
|
|
### 7.3 Scrapers View (`/dashboard/scrapers`)
|
|
- Registered scraper versions
|
|
- Performance comparison
|
|
- Traffic allocation controls
|
|
- Promote/deprecate actions
|
|
|
|
### 7.4 Jobs View (`/jobs`) - enhanced
|
|
- Add filters: client, job type, batch, scraper version
|
|
- Show requester info in job cards
|
|
|
|
### 7.5 Batches View (`/batches`)
|
|
- List of batches with progress
|
|
- Click to see batch detail and jobs
|
|
|
|
---
|
|
|
|
## 9. Project Structure
|
|
|
|
### 8.1 Backend Structure
|
|
|
|
```
|
|
reviewiq/ # Root (renamed from google-reviews-scraper-pro)
|
|
│
|
|
├── api/
|
|
│ ├── __init__.py
|
|
│ ├── server.py # FastAPI app, startup, middleware
|
|
│ ├── routes/
|
|
│ │ ├── __init__.py
|
|
│ │ ├── scrape.py # /api/scrape/* endpoints
|
|
│ │ ├── jobs.py # /api/jobs/* endpoints
|
|
│ │ ├── batches.py # /api/batches/* endpoints
|
|
│ │ ├── dashboard.py # /api/dashboard/* endpoints
|
|
│ │ └── admin.py # /api/admin/* endpoints
|
|
│ └── middleware/
|
|
│ ├── __init__.py
|
|
│ └── auth.py # API key authentication
|
|
│
|
|
├── scrapers/
|
|
│ ├── __init__.py
|
|
│ ├── registry.py # ScraperRegistry - version routing
|
|
│ ├── base.py # BaseScraper interface
|
|
│ │
|
|
│ ├── google_reviews/
|
|
│ │ ├── __init__.py
|
|
│ │ ├── v1_0_0.py # Current stable (migrated from scraper_clean.py)
|
|
│ │ └── parsers.py # Review parsing logic
|
|
│ │
|
|
│ └── yelp_reviews/ # Future
|
|
│ ├── __init__.py
|
|
│ └── v1_0_0.py
|
|
│
|
|
├── core/
|
|
│ ├── __init__.py
|
|
│ ├── database.py # Database manager
|
|
│ ├── models.py # Pydantic models (Job, Batch, etc.)
|
|
│ ├── enums.py # JobStatus, JobType, Priority, etc.
|
|
│ └── config.py # Settings, environment variables
|
|
│
|
|
├── services/
|
|
│ ├── __init__.py
|
|
│ ├── job_service.py # Job creation, management
|
|
│ ├── batch_service.py # Batch operations
|
|
│ ├── webhook_service.py # Callback delivery
|
|
│ └── dashboard_service.py # Aggregate queries
|
|
│
|
|
├── workers/
|
|
│ ├── __init__.py
|
|
│ ├── chrome_pool.py # Browser pool management
|
|
│ ├── job_executor.py # Job execution orchestration
|
|
│ └── webhook_worker.py # Async webhook delivery
|
|
│
|
|
├── utils/
|
|
│ ├── __init__.py
|
|
│ ├── logger.py # StructuredLogger
|
|
│ ├── crash_analyzer.py # Crash detection
|
|
│ └── health_checks.py # System health
|
|
│
|
|
├── tests/
|
|
│ ├── __init__.py
|
|
│ ├── conftest.py # Pytest fixtures
|
|
│ ├── api/ # API route tests
|
|
│ ├── scrapers/ # Scraper tests (mirrors scrapers/)
|
|
│ │ └── google_reviews/
|
|
│ │ └── test_v1_0_0.py
|
|
│ ├── services/ # Service tests
|
|
│ └── integration/ # End-to-end tests
|
|
│
|
|
├── migrations/ # Database migrations
|
|
│ └── versions/
|
|
│
|
|
├── web/ # Next.js frontend (existing)
|
|
│ └── ...
|
|
│
|
|
├── docker-compose.yml
|
|
├── Dockerfile
|
|
├── pyproject.toml # Python dependencies
|
|
└── README.md
|
|
```
|
|
|
|
### 8.2 Key Conventions
|
|
|
|
**Naming:**
|
|
- Scraper versions use underscores: `v1_0_0.py` (valid Python module names)
|
|
- Version strings use dots: `"1.0.0"` (semantic versioning in data)
|
|
|
|
**Imports:**
|
|
```python
|
|
from scrapers.google_reviews.v1_0_0 import GoogleReviewsScraper
|
|
from scrapers.registry import ScraperRegistry
|
|
from core.models import Job, Batch
|
|
from services.job_service import JobService
|
|
```
|
|
|
|
**Scraper Interface:**
|
|
Each scraper version implements:
|
|
```python
|
|
class GoogleReviewsScraper(BaseScraper):
|
|
VERSION = "1.0.0"
|
|
JOB_TYPE = "google_reviews"
|
|
|
|
async def scrape(self, url: str, options: dict) -> ScraperResult:
|
|
...
|
|
|
|
def validate_url(self, url: str) -> bool:
|
|
...
|
|
```
|
|
|
|
### 8.3 Frontend Structure (existing, minor additions)
|
|
|
|
```
|
|
web/
|
|
├── app/
|
|
│ ├── dashboard/ # New main dashboard
|
|
│ │ ├── page.tsx # Overview
|
|
│ │ ├── clients/page.tsx
|
|
│ │ ├── scrapers/page.tsx
|
|
│ │ └── problems/page.tsx
|
|
│ ├── batches/ # New
|
|
│ │ ├── page.tsx
|
|
│ │ └── [id]/page.tsx
|
|
│ ├── jobs/ # Enhanced
|
|
│ └── analytics/ # Existing
|
|
├── components/
|
|
│ ├── dashboard/ # Dashboard-specific components
|
|
│ └── ...
|
|
└── ...
|
|
```
|
|
|
|
---
|
|
|
|
## 10. Backwards Compatibility
|
|
|
|
### 9.1 Existing API
|
|
`POST /api/scrape` continues to work as-is:
|
|
- Defaults to `job_type: google_reviews`
|
|
- No requester required (legacy mode)
|
|
- No callback required
|
|
- Routes to the same scraper logic
|
|
|
|
### 9.2 Existing Database
|
|
- All new fields have defaults
|
|
- Existing jobs have null requester fields
|
|
- `job_type` defaults to `google_reviews`
|
|
- Migration adds columns without breaking existing data
|
|
|
|
### 9.3 Scraper Migration
|
|
- Current scraper code moves to `scrapers/google_reviews/v1_0_0.py`
|
|
- Registered in scraper_registry as `stable` with 100% traffic
|
|
- Old file `scraper_clean.py` deleted after migration
|
|
- All imports updated to new paths
|
|
|
|
---
|
|
|
|
## 11. Additional Considerations
|
|
|
|
### 10.1 Authentication
|
|
- External API clients authenticate via API keys
|
|
- API keys stored in `api_keys` table with `client_id` reference
|
|
- Keys can be scoped (read-only, submit jobs, admin)
|
|
- Rate limits can be per-key
|
|
|
|
### 10.2 Error Handling
|
|
- All API errors return consistent JSON structure:
|
|
```json
|
|
{
|
|
"error": {
|
|
"code": "VALIDATION_ERROR",
|
|
"message": "URL is required",
|
|
"details": { ... }
|
|
}
|
|
}
|
|
```
|
|
- Scraper errors captured with crash analysis
|
|
- Failed webhooks retry with exponential backoff (max 5 attempts)
|
|
|
|
### 10.3 Logging
|
|
- All components use StructuredLogger
|
|
- Log levels: DEBUG, INFO, WARN, ERROR, FATAL
|
|
- Categories: api, scraper, webhook, system
|
|
- Logs include correlation IDs for tracing
|
|
|
|
### 10.4 Configuration
|
|
- Environment-based configuration via `core/config.py`
|
|
- Sensitive values from environment variables
|
|
- Per-scraper config in scraper_registry.config JSON
|
|
|
|
### 10.5 Monitoring
|
|
- Health check endpoint: `GET /health`
|
|
- Prometheus metrics endpoint: `GET /metrics` (future)
|
|
- Dashboard provides operational visibility
|
|
|
|
### 10.6 Data Retention
|
|
- Define retention policy for completed jobs
|
|
- Archive or delete old job data after N days
|
|
- Keep aggregate stats for historical reporting
|
|
|
|
---
|
|
|
|
## 12. Implementation Phases
|
|
|
|
### Phase 0: Project Restructure
|
|
- Reorganize files to new structure
|
|
- Move `scraper_clean.py` → `scrapers/google_reviews/v1_0_0.py`
|
|
- Update all imports
|
|
- Verify everything still works
|
|
|
|
### Phase 1: Data Model
|
|
- Add new fields to jobs table
|
|
- Create batches table
|
|
- Create scraper_registry table
|
|
- Create api_keys table
|
|
- Migration preserves existing data
|
|
|
|
### Phase 2: Requester & Batch Support
|
|
- Update API to accept requester info
|
|
- Implement batch submission endpoint
|
|
- Store and display requester/batch info
|
|
|
|
### Phase 3: Webhooks
|
|
- Implement callback delivery service
|
|
- Retry logic for failed callbacks
|
|
- Track delivery status
|
|
|
|
### Phase 4: Scraper Versioning
|
|
- Implement scraper registry
|
|
- Version routing logic
|
|
- Admin endpoints for management
|
|
|
|
### Phase 5: Main Dashboard
|
|
- Build dashboard pages
|
|
- Aggregate queries
|
|
- Real-time updates
|
|
|
|
### Phase 6: Traffic Management & A/B
|
|
- A/B test traffic splitting
|
|
- Promote/deprecate workflow
|
|
- Performance comparison views
|
|
|
|
### Phase 7: Authentication
|
|
- API key management
|
|
- Client authentication middleware
|
|
- Rate limiting (optional)
|
|
|
|
---
|
|
|
|
## 13. Success Metrics
|
|
|
|
- API response time < 200ms for job submission
|
|
- Webhook delivery within 5 seconds of job completion
|
|
- Dashboard loads in < 2 seconds
|
|
- Support 100+ concurrent scraping jobs
|
|
- 99% webhook delivery success rate
|
|
- Clear visibility into scraper version performance
|
|
|
|
---
|
|
|
|
## 14. Open Questions
|
|
|
|
1. ~~**Authentication**: How do external clients authenticate? API keys per client?~~ → Resolved: API keys
|
|
2. **Rate Limits**: Per-client rate limiting? (deferred to Phase 7)
|
|
3. **Retention**: How long to keep completed job data? (needs decision)
|
|
4. **Billing**: Track usage for billing purposes? (future consideration)
|
|
5. **Project Rename**: Rename folder from `google-reviews-scraper-pro` to `reviewiq`?
|
|
|
|
---
|
|
|
|
## 15. Glossary
|
|
|
|
| Term | Definition |
|
|
|------|------------|
|
|
| Job | A single scraping task for one URL |
|
|
| Batch | A collection of related jobs submitted together |
|
|
| Job Type | Category of scraping (google_reviews, yelp_reviews, etc.) |
|
|
| Requester | External client/system that requests jobs |
|
|
| Scraper Version | Specific implementation of a scraper (v1.0.0, v2.0.0) |
|
|
| Variant | Stability tier: stable, beta, canary |
|
|
| Callback/Webhook | HTTP POST to notify client of job completion |
|
|
|
|
---
|
|
|
|
*Document Version: 1.2*
|
|
*Last Updated: 2025-01-24*
|