- Replace non-working Google Maps embed iframe with animated location preview
- Add "Open in Google Maps" button to open location in new tab
- Add scraper type selection dropdown fetching from /api/admin/scrapers
- Show selected scraper info with formatted labels (e.g. "Google Reviews v1.0.0")
- Include scraper_version and scraper_variant in job submission
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dashboard page:
- Fetch top clients from /api/dashboard/by-client
- Show loading state while fetching
- Display empty state when no client data
- Show real client_id, job count, and success rate
Scrapers page:
- Fetch versions from /api/admin/scrapers
- Wire promote/deprecate buttons to real API calls
- Wire add version form to POST /api/admin/scrapers
- Wire traffic allocation to PUT /api/admin/scrapers/{id}/traffic
- Add loading and error states
Dockerfile:
- Add COPY commands for new directories (api/, core/, scrapers/, etc.)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 5 - Main Dashboard:
- Dashboard overview page with system health stats
- Jobs by status breakdown, success rates, top clients
- Dashboard API (/api/dashboard/overview, by-client, problems, by-version)
Phase 6 - Admin/Scraper Management:
- Scrapers management page with traffic allocation UI
- Admin API for scraper CRUD operations
- Traffic percentage updates for A/B testing
- Promote/deprecate scraper versions
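
One way the traffic percentages above could be turned into a version choice is a deterministic hash bucket. This is only a sketch: `pick_version`, the `job_key` argument, and the allocation dict shape are hypothetical and not taken from the Admin API.

```python
import hashlib


def pick_version(job_key: str, allocation: dict[str, int]) -> str:
    """Map a job key to a scraper version according to percentage weights.

    Assumption: `allocation` maps version label -> integer percent and the
    percentages sum to 100. The same job_key always lands in the same bucket.
    """
    if sum(allocation.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    # Stable bucket in [0, 100) derived from the key, not from random state.
    bucket = int(hashlib.sha256(job_key.encode("utf-8")).hexdigest(), 16) % 100
    cumulative = 0
    for version, percent in sorted(allocation.items()):
        cumulative += percent
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable when percentages sum to 100")
```

Hashing the key instead of drawing a random number keeps retries of the same job on the same version, which makes A/B comparisons easier to read.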
Phase 7 - Authentication:
- API key authentication middleware
- SHA-256 key hashing (keys never stored in plain text)
- Scope-based authorization (jobs:read, jobs:write, admin)
- Rate limiting per API key
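
A minimal sketch of the hashing and scope check listed under Phase 7, using only the standard library. The function names are hypothetical, and the rule that `admin` implies every other scope is an assumption, not something the middleware is known to do.

```python
import hashlib
import hmac
import secrets


def generate_api_key() -> str:
    """Create a random key; it is shown to the client once and never stored."""
    return secrets.token_urlsafe(32)


def hash_api_key(key: str) -> str:
    """Return the SHA-256 hex digest that would be stored instead of the key."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)


def has_scope(granted: set[str], required: str) -> bool:
    # Assumption: the "admin" scope grants everything.
    return required in granted or "admin" in granted
```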
Also:
- Updated api_server_production.py to include new routers
- Extended core/database.py with dashboard query methods
- Added dashboard link to sidebar navigation
- Updated CONTEXT-KEEPER.md to mark all phases complete
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Quick-reference document for resuming work after context compaction.
Contains: project overview, current state, spec summary, phases,
key decisions, file locations, and resumption instructions.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Wrap handleJobsChange in useCallback to prevent infinite re-renders
caused by onJobsChange dependency changing on every render.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Task #18: Complete integration of all JobDevTools components
- Updated job detail page (/jobs/[id]) with full JobDevTools UI
- Connected SSE stream for real-time structured logs + metrics
- Added crash-report and retry API routes for Next.js
- Added format conversion for old/new log formats
- Added DevTools links to JobsView modal and actions column
- Wired up CrashReport retry with auto-fix parameters
- Integrated SessionPanel for fingerprint display
- Integrated MetricsDashboard for real-time charts
Job DevTools implementation complete: 18/18 tasks
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add chk_dedup_scoped constraint enforcing tenant-scoped dedup format
- Filter location_type='owned' in populate_facts() for 'ALL' rollup
- Document competitor exclusion from 'ALL' sentinel rollups
- Add explicit comments in aggregation code for maintainability
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two micro-risk mitigations documented:
1. dedup_group_id: Format "{business_id}:{hash}" to prevent
   cross-tenant collision on similar reviews.
2. Sentinel conventions: 'ALL' (spatial) vs 'all' (semantic).
   Case matters; do not normalize.
Spec frozen as v3.1.2.
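
The two conventions above can be illustrated as follows. The hashing scheme (SHA-256 of whitespace-normalized, lowercased text, truncated to 16 hex characters) is an assumption made for the example; the spec only fixes the "{business_id}:{hash}" shape.

```python
import hashlib
import re

# Sentinels are distinct on purpose: never lowercase or uppercase them.
SPATIAL_ALL = "ALL"   # rollup across locations
SEMANTIC_ALL = "all"  # rollup across semantic categories


def normalize_text(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical reviews hash equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def make_dedup_group_id(business_id: str, review_text: str) -> str:
    """Build a tenant-scoped dedup key in the "{business_id}:{hash}" format."""
    digest = hashlib.sha256(normalize_text(review_text).encode("utf-8")).hexdigest()
    return f"{business_id}:{digest[:16]}"
```

Because the tenant prefix is part of the key, two customers receiving the same review text never share a dedup group.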
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Final fixes for production-ready spec:
1. locations.location_type: Added 'owned'|'competitor' flag.
   Competitors now inserted into locations (preserves FK integrity).
2. Competitor fact query: Added business_id filter to prevent
   cross-tenant contamination when same competitor tracked by
   multiple customers.
3. issue_events versioning: Added source + review_version columns
   for complete review reference in audit log.
4. Enrichment tenant-scoping: business_id now passed from ingest
   job (not looked up). Validates place_id exists under tenant.
5. Footer: Fixed version string v3.1.1 → v3.1.2.
Status: Ship-ready specification.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three final fixes applied:
1. issue_spans versioning: Added source + review_version columns
   with FK to reviews_enriched(source, review_id, review_version).
   Spans now correctly reference the exact review version.
2. Competitor business_id rule: Clarified that competitor reviews
   use customer's business_id + competitor's place_id (not NULL).
   Keeps facts and joins working without special-case logic.
3. Trust-weighted facts: Clarified trust_weighted_* columns are
   reserved but not populated in v3.1. Trust scoring applies to
   issue priority only. Aggregation deferred to v3.2.
Status: Production-grade architecture specification.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Complete pipeline architecture for Google Reviews intelligence:
- Versioned reviews_enriched with (source, review_id, version) PK
- Tenant-scoped locations with (business_id, place_id) PK
- Relational issue_spans replacing array aggregation
- Unified fact_timeseries spine with 'ALL' sentinel for rollups
- Clean competitor model (separate table, no fake business_ids)
- Trust scoring and dedup support
- KPI-ready join keys
Reviewed and fixed: PK for edited reviews, multi-tenant overlap,
param ordering bugs, fact population scope, entity field deferral.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- reviewiq-pipeline-v1-final.md: Earlier pipeline specification
- test_metadata_extraction.py: Test script for metadata extraction
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace client-side state switching with proper Next.js routes:
- /new - New scrape form
- /jobs - Jobs list with table view
- /jobs/[id] - Individual job details and logs
- /analytics - Analytics overview (completed jobs)
- /analytics/[id] - Analytics for specific job
Add JobsContext for shared state across routes. Update Sidebar
to use next/link with pathname matching. Root page redirects to /new.
Also adds partial job status styling to JobsView.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Transfer user's browser fingerprint (user-agent, viewport, timezone,
language, geolocation) to Chrome for more authentic scraping
- Display review topics from Google Maps in analytics dashboard
- Show business category badge in analytics header
- Fix date_text null handling in analytics (handle undefined/timestamp fields)
- Add review_topics and business_category to JobStatus interface
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Clear cookies and navigate to about:blank before loading URL
(ensures clean state when reusing pooled driver)
- Simplified regex patterns for rating/reviews extraction
- Uses partial word matching like scrape_reviews
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
All functionality now in scraper_clean.py:
- fast_scrape_reviews (main scraper)
- get_business_card_info (validation)
Updated health_checks.py to import from scraper_clean.
Removes 1,935 lines of duplicate/obsolete code.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replaces fast_scraper validation with efficient polling-based extraction
using the same navigation pattern as scrape_reviews:
- 10ms polling for consent handling (no fixed waits)
- 100ms polling for data extraction
- Exits early when data found
Supports multiple languages:
- Rating: stars/estrellas/étoiles/sterne/stelle
- Reviews: reviews/reseñas/avis/bewertungen/recensioni
- Handles comma decimals (4,8 -> 4.8)
Result: 6.3s to extract name, address, rating, total_reviews
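
A sketch of how the multilingual matching and comma-decimal handling could look. The regexes and function names are illustrative assumptions built from the word lists above, not the patterns used in the scraper.

```python
import re
from typing import Optional

# Partial word stems so singular/plural forms match in each language.
RATING_RE = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(?:star|estrella|étoile|stern|stell)", re.IGNORECASE
)
REVIEWS_RE = re.compile(
    r"(\d+(?:[.,]\d{3})*)\s*(?:review|reseña|avis|bewertung|recension)",
    re.IGNORECASE,
)


def parse_rating(label: str) -> Optional[float]:
    """Extract a rating such as '4,8 estrellas' -> 4.8."""
    match = RATING_RE.search(label)
    return float(match.group(1).replace(",", ".")) if match else None


def parse_review_count(label: str) -> Optional[int]:
    """Extract a count such as '1.234 reseñas' -> 1234."""
    match = REVIEWS_RE.search(label)
    return int(re.sub(r"\D", "", match.group(1))) if match else None
```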
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The continue statement skipped the card.style.display='none' and
card.innerHTML='' cleanup for cards already seen via API
interception, so the DOM grew without bound during long scrapes.
Now ALL processed cards are hidden regardless of data source.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Small (~79 reviews): R. Fleitas Peluqueros
- Medium (~589 reviews): ClickRent Gran Canaria
- Large (~2000+ reviews): Hospital Doctor Negrín
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Import fast_scrape_reviews from scraper_clean instead of fast_scraper
- Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper
- Production now uses clean scraper with hard refresh recovery
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add fast_scrape_reviews() wrapper to scraper_clean.py for API compatibility
- Set window size (1200x900) in wrapper to ensure proper Google Maps rendering
- Update job_manager.py to import from scraper_clean instead of fast_scraper
- Production now uses clean scraper with:
  - Hard refresh recovery when stuck after 8+ soft recovery attempts
  - API interception + DOM parsing for complete data collection
  - Automatic deduplication across refreshes
Tested: 589/589 reviews collected in 55s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When the scraper gets stuck (8+ failed soft recovery attempts), it now
performs a hard page refresh and re-runs the full page setup:
- Reloads the page
- Re-clicks reviews tab
- Re-sorts by newest
- Re-injects API interceptor
- Continues collecting with existing seen_ids for deduplication
Key changes:
- Extract page setup into reusable setup_reviews_page() function
- Add do_hard_refresh() that calls setup on refresh
- Trigger hard refresh after 8 failed soft recoveries
- Attempt a hard refresh before giving up on timeout
- Max 3 hard refreshes before truly giving up
- Reset recovery counter after successful hard refresh
This ensures the scraper can recover from browser issues, DOM detachment,
or other problems that soft recovery (scroll tricks) can't fix.
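
The escalation policy described above can be sketched as a small state object. The thresholds (8 soft attempts, 3 hard refreshes) come from this message; the class and its method names are hypothetical, and the sketch assumes every hard refresh succeeds.

```python
from dataclasses import dataclass

SOFT_LIMIT = 8  # failed soft recoveries before escalating
HARD_LIMIT = 3  # hard refreshes before giving up


@dataclass
class RecoveryState:
    soft_failures: int = 0
    hard_refreshes: int = 0

    def on_progress(self) -> None:
        """New reviews arrived, so the stuck counter starts over."""
        self.soft_failures = 0

    def next_action(self) -> str:
        """Decide what to do after a cycle that produced no new reviews."""
        if self.soft_failures < SOFT_LIMIT:
            self.soft_failures += 1
            return "soft_recovery"
        if self.hard_refreshes < HARD_LIMIT:
            self.hard_refreshes += 1
            # Assumption: the refresh succeeded, so the counter is reset.
            self.soft_failures = 0
            return "hard_refresh"
        return "give_up"
```

In the scraper, a "hard_refresh" result would correspond to calling do_hard_refresh(), with seen_ids kept outside this object so deduplication survives the reload.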
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
After sorting by newest, Google Maps may recreate DOM elements, which
makes the Python scroll_container reference stale. The scraper now
re-finds the container after sorting to ensure a valid reference.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Check total_reviews before recovery attempts
- Exit loop as soon as current_count >= total_reviews
- Reduces scrape time significantly (13s vs 56s for 247 reviews)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Track DOM order for all reviews (review_order dict)
- Sort output by DOM position (preserves "Newest" sort order)
- Combine API content (full text) with DOM order (display sequence)
- Remove click in recovery method 4 to avoid opening profile pages
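
A minimal sketch of the merge described above, assuming each review is a dict carrying a "review_id" key and that review_order maps review_id to its DOM position. The function name and the rule that reviews without a DOM position go last are assumptions.

```python
def merge_reviews(api_reviews: dict, dom_reviews: dict, review_order: dict) -> list:
    """Prefer API content, then sort by the position seen in the DOM."""
    # Later entries win, so API data overrides DOM data for the same id.
    merged = {**dom_reviews, **api_reviews}
    # Assumption: reviews never seen in the DOM are placed after all others.
    fallback = len(review_order)
    return sorted(
        merged.values(),
        key=lambda review: review_order.get(review["review_id"], fallback),
    )
```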
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix API parser to use correct Google Maps response structure
- Review ID at [0], Author at [1][4][5][0], Rating at [2][0][0]
- Text at [2][15][0][0], Timestamp at [1][6]
- Use review_id as key for both API and DOM to avoid duplicates
- Prefer API data (original language, full text)
- Expand "More" buttons before sorting and during scroll loop
- Results: 246/247 full text (99.6%), up from 36/247 before
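
The index paths listed above can be read with a defensive helper so that a missing level yields None rather than an exception. This sketch assumes the paths are relative to a single review entry; `dig` and `parse_api_review` are hypothetical names.

```python
from typing import Any, Optional


def dig(node: Any, *path: int) -> Optional[Any]:
    """Follow list indices, returning None as soon as a level is missing."""
    for index in path:
        if not isinstance(node, list) or index >= len(node):
            return None
        node = node[index]
    return node


def parse_api_review(entry: list) -> dict:
    """Pull the fields named in this commit out of one review entry."""
    return {
        "review_id": dig(entry, 0),
        "author": dig(entry, 1, 4, 5, 0),
        "rating": dig(entry, 2, 0, 0),
        "text": dig(entry, 2, 15, 0, 0),
        "timestamp": dig(entry, 1, 6),
    }
```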
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Poll for up to 5s waiting for span[role="img"][aria-label*="review"]
- Element may not be present immediately after consent handling
- Tested: Soho Club 247/247 reviews in 31.4s with correct total
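
A generic sketch of the bounded polling described above. `poll_until` and the `probe` callable are hypothetical; in the scraper the probe would be a lookup of the span[role="img"][aria-label*="review"] element.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    probe: Callable[[], Optional[T]],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> Optional[T]:
    """Call probe() until it returns a value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result is not None:
            return result  # exit early as soon as data is found
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```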
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Detect total BEFORE clicking reviews tab (element is on Overview)
- Use span[role="img"][aria-label*="review"] (robust, no class names)
- Extract count from aria-label (e.g., "260 reviews" → 260)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove separators (AyRUI, TFQHme) adjacent to already-hidden cards
- Separators removed on next cycle, not immediately (preserves scroll)
- DOM growth reduced by ~50% during long scrapes
- Tested: 2000 reviews in 103s (19.3/s) with all features
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use [data-review-id] + aria-label check for review cards
- Extract author from button[aria-label^="Photo of"]
- Use span[role="img"][aria-label*="star"] for rating
- Pattern matching for timestamp ("X time ago")
- Longest text span heuristic for review text
A/B tested: 100% match with old class-based selectors.
Survives Google's CSS class name changes.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Performance improvements:
- JS-based DOM parsing (single browser call vs Selenium round-trips)
- Batch flushing to disk every 500 reviews to free memory
- Hide parsed elements (display:none) to reduce DOM overhead
- Cycle timing instrumentation for debugging slowdowns
Results: 2826 reviews in 6.7min (7.1/sec) vs 2190 in 37min (1.0/sec)
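
The batch flushing idea can be sketched as an append-only buffer. The 500-review threshold comes from this message; the JSON Lines output format, the class name, and the demo at the bottom are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


class ReviewBuffer:
    """Hold reviews in memory and append them to disk in batches."""

    def __init__(self, path: Path, batch_size: int = 500) -> None:
        self.path = Path(path)
        self.batch_size = batch_size
        self.pending: list = []
        self.flushed = 0

    def add(self, review: dict) -> None:
        self.pending.append(review)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Append pending reviews as JSON Lines, then release the memory."""
        if not self.pending:
            return
        with self.path.open("a", encoding="utf-8") as handle:
            for review in self.pending:
                handle.write(json.dumps(review, ensure_ascii=False) + "\n")
        self.flushed += len(self.pending)
        self.pending.clear()


# Demo: 1200 reviews trigger two automatic flushes; the rest needs a final one.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "reviews.jsonl"
    buffer = ReviewBuffer(out_file, batch_size=500)
    for i in range(1200):
        buffer.add({"review_id": str(i)})
    pending_before_final_flush = len(buffer.pending)
    buffer.flush()
    lines_on_disk = len(out_file.read_text(encoding="utf-8").splitlines())
```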
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous detection matched the wrong elements and returned partial
counts. It now sums the "X stars, Y reviews" aria-labels for an accurate total.
Fallback methods:
1. Sum star rating counts (most accurate)
2. Reviews tab text like "Reviews (247)"
3. Span with "X reviews" text
Tested: Soho Club 247/247 correctly detected
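
A sketch of the fallback chain above, working on plain strings rather than live elements. The regexes and function names are illustrative assumptions; only the order of the three methods is taken from this message.

```python
import re
from typing import Optional

STAR_ROW_RE = re.compile(r"(\d+)\s*stars?,\s*([\d.,]+)\s*reviews?", re.IGNORECASE)


def _to_int(number_text: str) -> int:
    """Drop thousands separators: '1,234' -> 1234."""
    return int(re.sub(r"\D", "", number_text))


def total_from_star_rows(aria_labels: list) -> Optional[int]:
    """Method 1: sum the per-star counts from 'X stars, Y reviews' labels."""
    counts = {}
    for label in aria_labels:
        match = STAR_ROW_RE.search(label)
        if match:
            counts[int(match.group(1))] = _to_int(match.group(2))
    return sum(counts.values()) if counts else None


def detect_total(aria_labels: list, tab_text: Optional[str],
                 span_text: Optional[str]) -> Optional[int]:
    total = total_from_star_rows(aria_labels)
    if total is not None:
        return total
    # Method 2: tab text like "Reviews (247)"; method 3: span like "247 reviews".
    fallbacks = (
        (tab_text, r"\((\d[\d.,]*)\)"),
        (span_text, r"(\d[\d.,]*)\s*reviews?"),
    )
    for text, pattern in fallbacks:
        match = re.search(pattern, text or "", re.IGNORECASE)
        if match:
            return _to_int(match.group(1))
    return None
```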
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>