Commit Graph

7 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
2637d982e0 Wave 4: JobDevTools UI components and crash report API
- Task #5: Create JobDevTools container component
  (tabs: All/Scraper/Browser/Network/System, level filters, count badges)
- Task #11: Add crash report API endpoints
  (GET /jobs/{id}/crash-report, POST /jobs/{id}/retry?apply_fix=true, GET /crashes/stats)
- Task #14: Create SessionPanel component
  (fingerprint display, bot detection indicators, collapsible sections)
- Task #15: Create MetricsDashboard with recharts
  (extraction rate, cumulative reviews, memory usage, scroll progress)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:37:56 +00:00
Alejandro Gutiérrez
f4ca60349e Wave 3: SSE structured logs, crash analyzer, session fingerprint
- Task #3: Update SSE stream to emit structured log events
  (type: "log" for entries, type: "metrics" every 5s, ?format=legacy for backward compat)
- Task #10: Create crash pattern analyzer module
  (6 patterns: memory_exhaustion, dom_bloat, rate_limited, consent_loop, scroll_timeout, element_stale)
  (confidence scoring, auto-fix params, summarize_crash_patterns for recurring issues)
- Task #13: Capture session fingerprint in backend
  (user_agent, platform, timezone, webgl, canvas, bot_detection_tests)
  (saved on success and failure for debugging)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 12:34:17 +00:00
Alejandro Gutiérrez
a540ab97b1 Add browser fingerprint support and analytics metadata display
- Transfer user's browser fingerprint (user-agent, viewport, timezone,
  language, geolocation) to Chrome for more authentic scraping
- Display review topics from Google Maps in analytics dashboard
- Show business category badge in analytics header
- Fix date_text null handling in analytics (handle undefined/timestamp fields)
- Add review_topics and business_category to JobStatus interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 10:36:06 +00:00
Alejandro Gutiérrez
0682c0ec61 Add get_business_card_info to scraper_clean with multilingual support
Replaces fast_scraper validation with efficient polling-based extraction
using the same navigation pattern as scrape_reviews:
- 10ms polling for consent handling (no fixed waits)
- 100ms polling for data extraction
- Exits early when data found

Supports multiple languages:
- Rating: stars/estrellas/étoiles/sterne/stelle
- Reviews: reviews/reseñas/avis/bewertungen/recensioni
- Handles comma decimals (4,8 -> 4.8)

Result: 6.3s to extract name, address, rating, total_reviews

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:52:06 +00:00
Alejandro Gutiérrez
8ccf72a489 Remove old scraper files - consolidate to scraper_clean
Production (api_server_production.py) only uses:
- modules/scraper_clean.py - main scraping logic
- modules/fast_scraper.py - validation helpers
- modules/database.py, webhooks.py, health_checks.py, chrome_pool.py

Deleted 33 unused Python files including:
- Old API server (api_server.py)
- 14 start*.py experimental scrapers
- 7 *_scraper.py variants
- Old modules: scraper.py, api_interceptor.py, job_manager.py, cli.py
- Various debug/test/utility scripts

Saves ~11,000 lines of unmaintained code.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-23 17:25:00 +00:00
Alejandro Gutiérrez
8b36850838 Switch Docker production API to use scraper_clean
- Import fast_scrape_reviews from scraper_clean instead of fast_scraper
- Keeps helper functions (check_reviews_available, get_business_card_info) from fast_scraper
- Production now uses clean scraper with hard refresh recovery

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 14:19:40 +00:00
Alejandro Gutiérrez
faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00