diff --git a/.artifacts/CONTEXT-KEEPER.md b/.artifacts/CONTEXT-KEEPER.md
new file mode 100644
index 0000000..d9a568c
--- /dev/null
+++ b/.artifacts/CONTEXT-KEEPER.md
@@ -0,0 +1,180 @@
+# ReviewIQ Platform - Implementation Context Keeper
+
+> **Purpose**: Restore context quickly after conversation compaction. Read this first when resuming work.
+
+---
+
+## What Is This Project?
+
+**ReviewIQ** is a multi-tenant scraping-as-a-service platform. Currently scrapes Google Reviews, will expand to other sources (Yelp, TripAdvisor, etc.).
+
+**Primary consumer**: veritasreview.com (external service that generates insights from scraped data)
+
+---
+
+## Current State (as of 2025-01-24)
+
+### Working Features
+- Google Reviews scraper (`modules/scraper_clean.py`) - fully functional
+- Job queue with PostgreSQL storage
+- Real-time SSE streaming of logs/progress
+- Web UI for job management and analytics
+- Chrome pool for browser management
+- Crash detection and analysis
+- JobDevTools observability panel
+
+### Repository
+- **Location**: `/Users/agutierrez/Desktop/google-reviews-scraper-pro`
+- **Branch**: `master`
+- **Spec document**: `.artifacts/ReviewIQ-Platform-Spec.md` (v1.2)
+
+---
+
+## What We're Building (Spec Summary)
+
+### New Capabilities
+1. **Requester tracking** - who requested each scrape (client_id, source, purpose)
+2. **Batch jobs** - submit multiple URLs as a group
+3. **Webhooks** - callback when jobs complete
+4. **Priority levels** - normal, high, urgent
+5. **Scraper versioning** - stable/beta/canary with A/B traffic routing
+6. **Main dashboard** - system health, client breakdown, scraper performance
+7. **Multiple job types** - architecture supports future scrapers
+
+### API Design
+- Separate endpoints per job type: `POST /api/scrape/google-reviews`
+- Batch endpoint: `POST /api/scrape/google-reviews/batch`
+- Each scraper version is independent, registered in scraper_registry
+
+### Key Data Model Additions
+```
+Jobs table (new fields):
+- requester_client_id, requester_source, scrape_purpose, requester_metadata
+- batch_id, batch_index
+- job_type, scraper_version, scraper_variant, priority
+- callback_url, callback_status, callback_sent_at
+
+New tables:
+- batches (batch grouping)
+- scraper_registry (version management)
+- api_keys (authentication)
+```
+
+---
+
+## Target Project Structure
+
+```
+reviewiq/                    # Will rename from google-reviews-scraper-pro
+├── api/
+│   ├── server.py
+│   └── routes/              # scrape.py, jobs.py, batches.py, dashboard.py, admin.py
+├── scrapers/
+│   ├── registry.py
+│   ├── base.py
+│   └── google_reviews/
+│       └── v1_0_0.py        # Migrated from scraper_clean.py
+├── core/
+│   ├── database.py
+│   ├── models.py
+│   ├── enums.py
+│   └── config.py
+├── services/
+│   ├── job_service.py
+│   ├── batch_service.py
+│   ├── webhook_service.py
+│   └── dashboard_service.py
+├── workers/
+│   ├── chrome_pool.py
+│   ├── job_executor.py
+│   └── webhook_worker.py
+├── utils/
+│   ├── logger.py
+│   ├── crash_analyzer.py
+│   └── health_checks.py
+├── tests/
+├── web/                     # Next.js frontend (existing)
+└── migrations/
+```
+
+---
+
+## Implementation Phases
+
+| Phase | Description | Status |
+|-------|-------------|--------|
+| 0 | Project restructure (move files to new locations) | Not started |
+| 1 | Database migrations (new fields + tables) | Not started |
+| 2 | Requester & batch support | Not started |
+| 3 | Webhooks | Not started |
+| 4 | Scraper versioning & registry | Not started |
+| 5 | Main dashboard UI | Not started |
+| 6 | A/B traffic management | Not started |
+| 7 | Authentication (API keys) | Not started |
+
+**Phase 0 must complete first.** Then phases 1-5 can parallelize.
+
+---
+
+## Key Decisions Made
+
+1. **Separate endpoints per job type** - not a single `/api/scrape` with type parameter
+2. **Scraper versions in files** - `v1_0_0.py`, `v2_0_0.py` (underscores for valid Python)
+3. **No legacy aliases** - `scraper_clean.py` deleted after migration, not kept as alias
+4. **API backwards compatible** - `POST /api/scrape` still works (routes to google-reviews)
+5. **Output schema defined** - for external insights service integration (see spec section 6)
+
+---
+
+## Important Constraints
+
+- **Don't break current scraper** - it works, migrate carefully
+- **Backwards compatible API** - existing integrations must keep working
+- **Clean architecture** - no legacy file names, proper structure from start
+- **Database migrations** - preserve existing job data
+
+---
+
+## Files to Know
+
+| Current Location | Purpose |
+|------------------|---------|
+| `modules/scraper_clean.py` | Main Google Reviews scraper (96KB) |
+| `modules/database.py` | PostgreSQL database manager |
+| `api_server_production.py` | FastAPI server (will be split into api/) |
+| `web/app/jobs/[id]/page.tsx` | Job detail page with DevTools |
+| `.artifacts/ReviewIQ-Platform-Spec.md` | Full specification document |
+
+---
+
+## Quick Commands
+
+```bash
+# Run backend
+python api_server_production.py
+
+# Run frontend
+cd web && npm run dev
+
+# Docker
+docker-compose -f docker-compose.production.yml up
+
+# Build frontend
+cd web && npm run build
+```
+
+---
+
+## Resuming Work
+
+When resuming after context compaction:
+
+1. Read this file first
+2. Check `.artifacts/ReviewIQ-Platform-Spec.md` for full details
+3. Check git log for recent changes: `git log --oneline -10`
+4. Check current phase status in this file
+5. Continue implementation from where left off
+
+---
+
+*Last updated: 2025-01-24*
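
Note on the "Scraper versions in files" decision: the file names `v1_0_0.py` and `v2_0_0.py` use underscores because a dotted name like `v1.0.0` is not a valid Python module name. A minimal sketch of how a version string could map to a module under the target `scrapers/` layout follows. The helper names (`version_to_module`, `module_to_version`, `scraper_module_path`) are hypothetical and do not come from the spec, and the job type is assumed to be the URL segment from the endpoint path (`google-reviews`).

```python
import re

# Hypothetical helpers: the spec only fixes the file naming (v1_0_0.py),
# not how the registry resolves a version to a module.
_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)")
_MODULE_RE = re.compile(r"v(\d+)_(\d+)_(\d+)")


def version_to_module(version: str) -> str:
    """Map a version such as "1.0.0" to a module name such as "v1_0_0"."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
    return "v" + "_".join(match.groups())


def module_to_version(module_name: str) -> str:
    """Map a module name such as "v2_0_0" back to "2.0.0"."""
    match = _MODULE_RE.fullmatch(module_name)
    if match is None:
        raise ValueError(f"expected vMAJOR_MINOR_PATCH, got {module_name!r}")
    return ".".join(match.groups())


def scraper_module_path(job_type: str, version: str) -> str:
    """Build a dotted import path, assuming the scrapers/<job_type>/ layout."""
    package = job_type.replace("-", "_")  # "google-reviews" -> "google_reviews"
    return f"scrapers.{package}.{version_to_module(version)}"


if __name__ == "__main__":
    print(scraper_module_path("google-reviews", "1.0.0"))
```

The resulting path could be handed to `importlib.import_module` once the Phase 0 restructure has moved the scraper into place; until then it is only a naming convention.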