Alejandro Gutiérrez 788ef84756 Phases 2-4: Requester support, batches, webhooks, scraper registry

Phase 2 - Requester & Batch Support:
- core/database.py: Added create_job params (requester_*, batch_*, priority, callback_*)
- core/database.py: Added batch methods (create_batch, get_batch, update_batch_progress, get_batches)
- core/database.py: Added update_job_callback for tracking webhook delivery
- api/routes/batches.py: New endpoints:
  - POST /api/scrape/google-reviews/batch (submit batch)
  - GET /api/batches (list batches)
  - GET /api/batches/{id} (batch detail)
  - DELETE /api/batches/{id} (cancel batch)
- api_server_production.py: Updated /api/scrape with requester, priority, callback fields
- api_server_production.py: New primary endpoint POST /api/scrape/google-reviews

Phase 3 - Webhooks:
- services/job_callback_service.py: New service with:
  - JobCallbackService: send_job_callback, send_batch_callback, retry_failed_callbacks
  - JobCallbackDispatcher: Background worker for callback monitoring
  - Payload formats per spec (job.completed, job.failed, batch.completed)
  - Exponential backoff for retries
  - Error classification for failure payloads

Phase 4 - Scraper Registry:
- scrapers/registry.py: Database-backed version routing:
  - get_scraper(): Version/variant/A/B routing
  - _get_weighted_scraper(): Traffic-weighted random selection
  - 60-second TTL cache for performance
  - register_scraper, deprecate_scraper, update_traffic_allocation
  - LegacyScraperRegistry preserved for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 15:35:58 +00:00

ReviewIQ Platform - Implementation Context Keeper

Purpose: Restore context quickly after conversation compaction. Read this first when resuming work.

What Is This Project?

ReviewIQ is a multi-tenant scraping-as-a-service platform. It currently scrapes Google Reviews and will expand to other sources (Yelp, TripAdvisor, etc.).

Primary consumer: veritasreview.com (external service that generates insights from scraped data)

Current State (as of 2026-01-24)

Working Features

Google Reviews scraper (modules/scraper_clean.py) - fully functional
Job queue with PostgreSQL storage
Real-time SSE streaming of logs/progress
Web UI for job management and analytics
Chrome pool for browser management
Crash detection and analysis
JobDevTools observability panel

Repository

Location: /Users/agutierrez/Desktop/google-reviews-scraper-pro
Branch: master
Spec document: .artifacts/ReviewIQ-Platform-Spec.md (v1.2)

What We're Building (Spec Summary)

New Capabilities

Requester tracking - who requested each scrape (client_id, source, purpose)
Batch jobs - submit multiple URLs as a group
Webhooks - callback when jobs complete
Priority levels - normal, high, urgent
Scraper versioning - stable/beta/canary with A/B traffic routing
Main dashboard - system health, client breakdown, scraper performance
Multiple job types - architecture supports future scrapers
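
The A/B traffic routing mentioned above can be pictured as a weighted random pick across registered versions. The following is only a minimal sketch: the row layout `(version, variant, traffic_weight)`, the sample weights, and the function name `pick_scraper` are assumptions for illustration, not the actual `scraper_registry` schema or API.

```python
import random

# Hypothetical registry rows: (version, variant, traffic_weight).
# The real scraper_registry columns are defined in the spec, not here.
REGISTRY = [
    ("1.0.0", "stable", 90),
    ("1.1.0", "beta", 9),
    ("2.0.0", "canary", 1),
]


def pick_scraper(rows, rng=random):
    """Pick one (version, variant) with probability proportional to its weight."""
    # Rows with zero weight receive no traffic at all.
    active = [row for row in rows if row[2] > 0]
    if not active:
        raise ValueError("no scraper has a positive traffic weight")
    weights = [row[2] for row in active]
    version, variant, _ = rng.choices(active, weights=weights, k=1)[0]
    return version, variant


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same split
    counts = {}
    for _ in range(10_000):
        _, variant = pick_scraper(REGISTRY, rng)
        counts[variant] = counts.get(variant, 0) + 1
    print(counts)
```

Passing a seeded `random.Random` instance keeps the selection reproducible in tests, while production code could rely on the module-level default.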

API Design

Separate endpoints per job type: POST /api/scrape/google-reviews
Batch endpoint: POST /api/scrape/google-reviews/batch
Each scraper version is independent, registered in scraper_registry
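
As an illustration of what a client might send to the batch endpoint, here is a small sketch that assembles a request body. The body shape (`urls`, a nested `requester` object, `priority`, `callback_url`) and the helper name `build_batch_request` are assumptions; the authoritative payload format lives in the spec document.

```python
import json

# Priority levels as listed in this document.
ALLOWED_PRIORITIES = ("normal", "high", "urgent")


def build_batch_request(urls, client_id, source, purpose,
                        priority="normal", callback_url=None):
    """Build a JSON-serializable body for the batch endpoint (assumed shape)."""
    if not urls:
        raise ValueError("a batch needs at least one URL")
    if priority not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    body = {
        "urls": list(urls),
        "requester": {
            "client_id": client_id,
            "source": source,
            "purpose": purpose,
        },
        "priority": priority,
    }
    # Only include the webhook target when the caller asked for a callback.
    if callback_url is not None:
        body["callback_url"] = callback_url
    return body


if __name__ == "__main__":
    body = build_batch_request(
        urls=["https://example.com/place/1", "https://example.com/place/2"],
        client_id="veritasreview",
        source="nightly-sync",        # placeholder value
        purpose="insights-refresh",   # placeholder value
        priority="high",
        callback_url="https://example.com/hooks/reviewiq",
    )
    # This body would be sent via POST /api/scrape/google-reviews/batch.
    print(json.dumps(body, indent=2))
```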

Key Data Model Additions

Jobs table (new fields):
- requester_client_id, requester_source, scrape_purpose, requester_metadata
- batch_id, batch_index
- job_type, scraper_version, scraper_variant, priority
- callback_url, callback_status, callback_sent_at

New tables:
- batches (batch grouping)
- scraper_registry (version management)
- api_keys (authentication)
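
To make the new job fields easier to scan, the sketch below groups them in a dataclass. The field names come from the list above; the Python types, the defaults (including `job_type = "google-reviews"`), the class name `JobExtras`, and the `wants_callback` helper are assumptions, since the actual column definitions belong to the migrations and the spec.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Any, Dict, Optional


class Priority(str, Enum):
    """Priority levels named in this document."""
    NORMAL = "normal"
    HIGH = "high"
    URGENT = "urgent"


@dataclass
class JobExtras:
    """New job columns; types and defaults here are illustrative assumptions."""
    # Requester tracking
    requester_client_id: Optional[str] = None
    requester_source: Optional[str] = None
    scrape_purpose: Optional[str] = None
    requester_metadata: Dict[str, Any] = field(default_factory=dict)
    # Batch membership
    batch_id: Optional[str] = None
    batch_index: Optional[int] = None
    # Routing
    job_type: str = "google-reviews"  # assumed default
    scraper_version: Optional[str] = None
    scraper_variant: Optional[str] = None
    priority: Priority = Priority.NORMAL
    # Webhook delivery
    callback_url: Optional[str] = None
    callback_status: Optional[str] = None
    callback_sent_at: Optional[datetime] = None

    def wants_callback(self) -> bool:
        """True when a callback URL is set and no callback has been sent yet."""
        return self.callback_url is not None and self.callback_sent_at is None


if __name__ == "__main__":
    job = JobExtras(requester_client_id="veritasreview",
                    priority=Priority.HIGH,
                    callback_url="https://example.com/hooks/reviewiq")
    print(job.priority.value, job.wants_callback())
```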

Target Project Structure

reviewiq/                          # Will rename from google-reviews-scraper-pro
├── api/
│   ├── server.py
│   └── routes/                    # scrape.py, jobs.py, batches.py, dashboard.py, admin.py
├── scrapers/
│   ├── registry.py
│   ├── base.py
│   └── google_reviews/
│       └── v1_0_0.py              # Migrated from scraper_clean.py
├── core/
│   ├── database.py
│   ├── models.py
│   ├── enums.py
│   └── config.py
├── services/
│   ├── job_service.py
│   ├── batch_service.py
│   ├── webhook_service.py
│   └── dashboard_service.py
├── workers/
│   ├── chrome_pool.py
│   ├── job_executor.py
│   └── webhook_worker.py
├── utils/
│   ├── logger.py
│   ├── crash_analyzer.py
│   └── health_checks.py
├── tests/
├── web/                           # Next.js frontend (existing)
└── migrations/

Implementation Phases

Phase	Description	Status
0	Project restructure (move files to new locations)	✅ COMPLETE
1	Database migrations (new fields + tables)	✅ COMPLETE
2	Requester & batch support	Not started
3	Webhooks	Not started
4	Scraper versioning & registry	Not started
5	Main dashboard UI	Not started
6	A/B traffic management	Not started
7	Authentication (API keys)	Not started

Phase 0 must complete first; phases 1-5 can then proceed in parallel.

Key Decisions Made

Separate endpoints per job type - not a single /api/scrape with type parameter
Scraper versions in files - v1_0_0.py, v2_0_0.py (underscores keep the module names valid Python identifiers)
No legacy aliases - scraper_clean.py deleted after migration, not kept as alias
API backwards compatible - POST /api/scrape still works (routes to google-reviews)
Output schema defined - for external insights service integration (see spec section 6)
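
The file-naming decision above (v1_0_0.py rather than a dotted name) follows from Python's identifier rules: `1.0.0` cannot be imported as a module name, while `v1_0_0` can. A minimal sketch of the mapping follows; the helper names `version_to_module` and `module_to_version` are hypothetical and not taken from the codebase.

```python
import re


def version_to_module(version: str) -> str:
    """Map a semantic version like '1.0.0' to a module name like 'v1_0_0'."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return "v" + "_".join(match.groups())


def module_to_version(module: str) -> str:
    """Inverse mapping: 'v1_0_0' back to '1.0.0'."""
    match = re.fullmatch(r"v(\d+)_(\d+)_(\d+)", module)
    if match is None:
        raise ValueError(f"not a versioned module name: {module!r}")
    return ".".join(match.groups())


if __name__ == "__main__":
    print(version_to_module("1.0.0"))   # v1_0_0
    print(module_to_version("v2_0_0"))  # 2.0.0
    # Dotted versions are not valid identifiers; the underscore form is.
    print("1.0.0".isidentifier(), "v1_0_0".isidentifier())  # False True
```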

Important Constraints

Don't break current scraper - it works, migrate carefully
Backwards compatible API - existing integrations must keep working
Clean architecture - no legacy file names, proper structure from start
Database migrations - preserve existing job data
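
One way to picture the backwards-compatibility constraint is a small lookup that sends the legacy path to the same job type as the new per-type endpoint. This is a framework-free sketch under assumptions: the table names and `resolve_job_type` are hypothetical, and the real server wires this up through its FastAPI routes.

```python
# Canonical endpoint paths mapped to job types (only one job type exists today).
ROUTES = {
    "/api/scrape/google-reviews": "google-reviews",
}

# Legacy paths that must keep working, mapped to their canonical endpoint.
LEGACY_ALIASES = {
    "/api/scrape": "/api/scrape/google-reviews",
}


def resolve_job_type(path: str) -> str:
    """Return the job type for a request path, honoring legacy aliases."""
    canonical = LEGACY_ALIASES.get(path, path)
    try:
        return ROUTES[canonical]
    except KeyError:
        raise LookupError(f"no scraper endpoint registered for {path!r}") from None


if __name__ == "__main__":
    # Both the old and the new path resolve to the same job type.
    print(resolve_job_type("/api/scrape"))
    print(resolve_job_type("/api/scrape/google-reviews"))
```

Keeping aliases in a separate table makes it explicit which paths exist only for existing integrations, so new job types never have to touch the legacy entry.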

Files to Know

Current Location	Purpose
`modules/scraper_clean.py`	Main Google Reviews scraper (96KB)
`modules/database.py`	PostgreSQL database manager
`api_server_production.py`	FastAPI server (will be split into api/)
`web/app/jobs/[id]/page.tsx`	Job detail page with DevTools
`.artifacts/ReviewIQ-Platform-Spec.md`	Full specification document

Quick Commands

# Run backend
python api_server_production.py

# Run frontend
cd web && npm run dev

# Docker
docker-compose -f docker-compose.production.yml up

# Build frontend
cd web && npm run build

Resuming Work

When resuming after context compaction:

Read this file first
Check .artifacts/ReviewIQ-Platform-Spec.md for full details
Check git log for recent changes: git log --oneline -10
Check current phase status in this file
Continue implementation from where you left off

Last updated: 2026-01-24
