whyrating-engine-legacy/.artifacts/CONTEXT-KEEPER.md
Alejandro Gutiérrez 788ef84756 Phases 2-4: Requester support, batches, webhooks, scraper registry
Phase 2 - Requester & Batch Support:
- core/database.py: Added create_job params (requester_*, batch_*, priority, callback_*)
- core/database.py: Added batch methods (create_batch, get_batch, update_batch_progress, get_batches)
- core/database.py: Added update_job_callback for tracking webhook delivery
- api/routes/batches.py: New endpoints:
  - POST /api/scrape/google-reviews/batch (submit batch)
  - GET /api/batches (list batches)
  - GET /api/batches/{id} (batch detail)
  - DELETE /api/batches/{id} (cancel batch)
- api_server_production.py: Updated /api/scrape with requester, priority, callback fields
- api_server_production.py: New primary endpoint POST /api/scrape/google-reviews

Phase 3 - Webhooks:
- services/job_callback_service.py: New service with:
  - JobCallbackService: send_job_callback, send_batch_callback, retry_failed_callbacks
  - JobCallbackDispatcher: Background worker for callback monitoring
  - Payload formats per spec (job.completed, job.failed, batch.completed)
  - Exponential backoff for retries
  - Error classification for failure payloads

Phase 4 - Scraper Registry:
- scrapers/registry.py: Database-backed version routing:
  - get_scraper(): Version/variant/A/B routing
  - _get_weighted_scraper(): Traffic-weighted random selection
  - 60-second TTL cache for performance
  - register_scraper, deprecate_scraper, update_traffic_allocation
  - LegacyScraperRegistry preserved for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 15:35:58 +00:00


# ReviewIQ Platform - Implementation Context Keeper
> **Purpose**: Restore context quickly after conversation compaction. Read this first when resuming work.
---
## What Is This Project?
**ReviewIQ** is a multi-tenant scraping-as-a-service platform. It currently scrapes Google Reviews and will expand to other sources (Yelp, TripAdvisor, etc.).
**Primary consumer**: veritasreview.com (external service that generates insights from scraped data)
---
## Current State (as of 2026-01-24)
### Working Features
- Google Reviews scraper (`modules/scraper_clean.py`) - fully functional
- Job queue with PostgreSQL storage
- Real-time SSE streaming of logs/progress
- Web UI for job management and analytics
- Chrome pool for browser management
- Crash detection and analysis
- JobDevTools observability panel
### Repository
- **Location**: `/Users/agutierrez/Desktop/google-reviews-scraper-pro`
- **Branch**: `master`
- **Spec document**: `.artifacts/ReviewIQ-Platform-Spec.md` (v1.2)
---
## What We're Building (Spec Summary)
### New Capabilities
1. **Requester tracking** - who requested each scrape (client_id, source, purpose)
2. **Batch jobs** - submit multiple URLs as a group
3. **Webhooks** - callback when jobs complete
4. **Priority levels** - normal, high, urgent
5. **Scraper versioning** - stable/beta/canary with A/B traffic routing
6. **Main dashboard** - system health, client breakdown, scraper performance
7. **Multiple job types** - architecture supports future scrapers
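The A/B routing in item 5 can be pictured as a weighted random pick over registered scraper versions. The following is a minimal sketch, assuming each registry row carries a relative traffic weight; `ScraperEntry`, `traffic_pct`, and `pick_scraper` are hypothetical names for illustration and are not taken from `scrapers/registry.py`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScraperEntry:
    """Hypothetical registry row; field names are assumptions."""
    version: str
    variant: str        # e.g. "stable", "beta", "canary"
    traffic_pct: float  # relative weight; does not need to sum to 100


def pick_scraper(entries, rng=random):
    """Pick one entry with probability proportional to traffic_pct."""
    # Entries with zero allocation are never eligible.
    candidates = [e for e in entries if e.traffic_pct > 0]
    if not candidates:
        raise ValueError("no scraper has a positive traffic allocation")
    weights = [e.traffic_pct for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible in tests.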
### API Design
- Separate endpoints per job type: `POST /api/scrape/google-reviews`
- Batch endpoint: `POST /api/scrape/google-reviews/batch`
- Each scraper version is independent, registered in scraper_registry
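As an illustration, a batch submission might be assembled as below. The endpoint path and the three priority levels come from this document; every field name in the body (`urls`, `requester`, `priority`, `callback_url`) is an assumption, since the authoritative schema lives in the spec. `build_batch_request` is a hypothetical helper.

```python
import json

BATCH_PATH = "/api/scrape/google-reviews/batch"  # path taken from this document
PRIORITIES = ("normal", "high", "urgent")        # levels listed under New Capabilities


def build_batch_request(urls, client_id, priority="normal", callback_url=None):
    """Return (path, JSON body) for a batch submission.

    The body's field names are illustrative assumptions, not the spec's schema.
    """
    if not urls:
        raise ValueError("a batch needs at least one URL")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    body = {
        "urls": list(urls),
        "requester": {"client_id": client_id},
        "priority": priority,
    }
    # Only include the webhook target when the caller asked for a callback.
    if callback_url is not None:
        body["callback_url"] = callback_url
    return BATCH_PATH, json.dumps(body)
```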
### Key Data Model Additions
```
Jobs table (new fields):
- requester_client_id, requester_source, scrape_purpose, requester_metadata
- batch_id, batch_index
- job_type, scraper_version, scraper_variant, priority
- callback_url, callback_status, callback_sent_at
New tables:
- batches (batch grouping)
- scraper_registry (version management)
- api_keys (authentication)
```
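For orientation, the new job fields can be grouped as a dataclass. This is a sketch only: the field names are the ones listed above, while the types, the defaults, and the `JobExtras` class itself are assumptions that may not match `core/models.py`.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Optional


@dataclass
class JobExtras:
    """New Jobs-table fields; names from the list above, types assumed."""
    # Requester tracking
    requester_client_id: Optional[str] = None
    requester_source: Optional[str] = None
    scrape_purpose: Optional[str] = None
    requester_metadata: Dict[str, Any] = field(default_factory=dict)
    # Batch membership
    batch_id: Optional[str] = None
    batch_index: Optional[int] = None
    # Routing
    job_type: str = "google-reviews"
    scraper_version: Optional[str] = None
    scraper_variant: Optional[str] = None
    priority: str = "normal"
    # Webhook delivery
    callback_url: Optional[str] = None
    callback_status: Optional[str] = None
    callback_sent_at: Optional[datetime] = None

    def in_batch(self) -> bool:
        """True when the job was submitted as part of a batch."""
        return self.batch_id is not None
```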
---
## Target Project Structure
```
reviewiq/ # Will rename from google-reviews-scraper-pro
├── api/
│ ├── server.py
│ └── routes/ # scrape.py, jobs.py, batches.py, dashboard.py, admin.py
├── scrapers/
│ ├── registry.py
│ ├── base.py
│ └── google_reviews/
│ └── v1_0_0.py # Migrated from scraper_clean.py
├── core/
│ ├── database.py
│ ├── models.py
│ ├── enums.py
│ └── config.py
├── services/
│ ├── job_service.py
│ ├── batch_service.py
│ ├── webhook_service.py
│ └── dashboard_service.py
├── workers/
│ ├── chrome_pool.py
│ ├── job_executor.py
│ └── webhook_worker.py
├── utils/
│ ├── logger.py
│ ├── crash_analyzer.py
│ └── health_checks.py
├── tests/
├── web/ # Next.js frontend (existing)
└── migrations/
```
---
## Implementation Phases
| Phase | Description | Status |
|-------|-------------|--------|
| 0 | Project restructure (move files to new locations) | ✅ COMPLETE |
| 1 | Database migrations (new fields + tables) | ✅ COMPLETE |
| 2 | Requester & batch support | Not started |
| 3 | Webhooks | Not started |
| 4 | Scraper versioning & registry | Not started |
| 5 | Main dashboard UI | Not started |
| 6 | A/B traffic management | Not started |
| 7 | Authentication (API keys) | Not started |
**Phase 0 must complete first.** Phases 1-5 can then proceed in parallel.
---
## Key Decisions Made
1. **Separate endpoints per job type** - not a single `/api/scrape` with type parameter
2. **Scraper versions in files** - `v1_0_0.py`, `v2_0_0.py` (underscores for valid Python)
3. **No legacy aliases** - `scraper_clean.py` deleted after migration, not kept as alias
4. **API backwards compatible** - `POST /api/scrape` still works (routes to google-reviews)
5. **Output schema defined** - for external insights service integration (see spec section 6)
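Decision 2 implies a mapping from a version string to a module name. A minimal sketch of that mapping follows; `version_to_module` is a hypothetical helper, and the default package path is assumed from the target project structure.

```python
import re

_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def version_to_module(version, package="scrapers.google_reviews"):
    """Map a MAJOR.MINOR.PATCH string to a dotted module path."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    # Dots are not valid in Python module names, so join with underscores.
    return f"{package}.v{'_'.join(match.groups())}"
```

The resulting path could then be handed to `importlib.import_module` to load the matching scraper.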
---
## Important Constraints
- **Don't break current scraper** - it works, migrate carefully
- **Backwards compatible API** - existing integrations must keep working
- **Clean architecture** - no legacy file names, proper structure from start
- **Database migrations** - preserve existing job data
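The backwards-compatibility constraint can be met with a small alias table that resolves the legacy path to its per-job-type replacement. This is a sketch only; `LEGACY_ALIASES` and `canonical_path` are hypothetical names, and the real server may wire this differently (for example, by registering one handler under both routes).

```python
# Legacy path -> per-job-type path. Both paths appear in this document.
LEGACY_ALIASES = {
    "/api/scrape": "/api/scrape/google-reviews",
}


def canonical_path(path):
    """Return the path that should serve the request, resolving legacy aliases."""
    return LEGACY_ALIASES.get(path, path)
```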
---
## Files to Know
| Current Location | Purpose |
|------------------|---------|
| `modules/scraper_clean.py` | Main Google Reviews scraper (96KB) |
| `modules/database.py` | PostgreSQL database manager |
| `api_server_production.py` | FastAPI server (will be split into api/) |
| `web/app/jobs/[id]/page.tsx` | Job detail page with DevTools |
| `.artifacts/ReviewIQ-Platform-Spec.md` | Full specification document |
---
## Quick Commands
```bash
# Run backend
python api_server_production.py
# Run frontend
cd web && npm run dev
# Docker
docker-compose -f docker-compose.production.yml up
# Build frontend
cd web && npm run build
```
---
## Resuming Work
When resuming after context compaction:
1. Read this file first
2. Check `.artifacts/ReviewIQ-Platform-Spec.md` for full details
3. Check git log for recent changes: `git log --oneline -10`
4. Check current phase status in this file
5. Continue implementation from where it left off
---
*Last updated: 2026-01-24*