# Job DevTools - Observability Suite Specification ## Executive Summary A comprehensive observability system for scraping jobs that provides real-time monitoring, crash analysis, session transparency, and debugging capabilities. Transforms opaque job execution into a fully inspectable process. --- ## Priority Matrix | Priority | Feature | Business Value | Complexity | |----------|---------|----------------|------------| | P0 | Structured Logging System | Foundation for all other features | Medium | | P0 | Crash Intelligence | Reduce job failures, debug tab crashes | High | | P0 | Copy & Export System | User productivity, support debugging | Low | | P1 | Session Fingerprint Panel | Trust, transparency, debugging | Medium | | P1 | Tabbed Log Viewer | Organized debugging experience | Medium | | P2 | Real-time Metrics Dashboard | Visual monitoring, performance insight | Medium | | P2 | Review Topics Inference | Enhanced analytics value | Low | | P3 | Network Inspector | Deep debugging for edge cases | High | | P3 | DOM Snapshots | Root cause analysis for crashes | High | | P3 | Job Comparison | Performance optimization insights | Medium | --- ## P0: Critical Features ### 1. Structured Logging System **Problem:** Current logs are flat text strings without categorization, making it impossible to filter or analyze specific aspects of job execution. **Solution:** JSON-structured log entries with metadata. **Log Entry Schema:** ```typescript interface LogEntry { timestamp: string; // ISO 8601 with milliseconds timestamp_ms: number; // Unix ms for sorting/graphing level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL'; category: 'scraper' | 'browser' | 'network' | 'system'; message: string; // Optional contextual data metrics?: { memory_mb?: number; reviews_count?: number; scroll_position?: number; dom_nodes?: number; }; // For crash correlation snapshot_id?: string; // For network events network?: { url?: string; method?: string; status?: number; size_bytes?: number; duration_ms?: number; }; } ``` **Category Definitions:** | Category | What it captures | |----------|------------------| | `scraper` | Review extraction, batch progress, data parsing, topic extraction | | `browser` | Page navigation, consent handling, tab clicks, scroll events, element waits | | `network` | API interceptions, request/response data, rate limiting, failures | | `system` | Memory pressure, Chrome process health, worker pool status, timeouts | **Backend Changes:** - Replace `LogCapture` class with `StructuredLogger` - All log calls include category and optional metrics - Store as JSONB array in database (already `scrape_logs jsonb`) - Stream via SSE with category field **Example Logs:** ```json {"timestamp": "2024-01-24T14:32:01.234Z", "timestamp_ms": 1706103121234, "level": "INFO", "category": "browser", "message": "Navigating to Google Maps URL", "metrics": {"memory_mb": 245}} {"timestamp": "2024-01-24T14:32:02.456Z", "timestamp_ms": 1706103122456, "level": "WARN", "category": "browser", "message": "Consent popup detected, handling...", "metrics": {"memory_mb": 248}} {"timestamp": "2024-01-24T14:32:03.789Z", "timestamp_ms": 1706103123789, "level": "INFO", "category": "scraper", "message": "Extracted batch of 50 reviews from API", "metrics": {"reviews_count": 50, "memory_mb": 267}} {"timestamp": "2024-01-24T14:32:45.123Z", "timestamp_ms": 1706103165123, "level": "ERROR", "category": "system", "message": "Chrome memory pressure critical", "metrics": {"memory_mb": 489, "dom_nodes": 12847}} ``` --- ### 2. Crash Intelligence System **Problem:** Tab crashes are frequent but opaque. No visibility into what caused the crash or how to prevent it. **Solution:** Comprehensive crash detection, analysis, and remediation suggestions. **Crash Report Schema:** ```typescript interface CrashReport { crash_id: string; job_id: string; timestamp: string; // Crash classification crash_type: 'tab_crash' | 'memory_exhaustion' | 'timeout' | 'network_failure' | 'element_not_found' | 'rate_limited'; error_message: string; error_code?: string; // State at crash state: { reviews_extracted: number; total_expected: number; scroll_count: number; scroll_position: number; elapsed_seconds: number; }; // Resource metrics (last 10 readings) metrics_history: Array<{ timestamp_ms: number; memory_mb: number; dom_nodes: number; cpu_percent?: number; }>; // Last N log entries before crash logs_before_crash: LogEntry[]; // Last 20 entries // Recovery info last_successful_review_id?: string; checkpoint_available: boolean; // Analysis analysis: { pattern: string; // e.g., "memory_exhaustion", "rate_limit_cascade" confidence: number; // 0-100 similar_crashes: number; // Count in last 7 days suggested_fix: string; auto_fixable: boolean; }; // Optional artifacts screenshot_url?: string; dom_snapshot_id?: string; } ``` **Crash Patterns to Detect:** | Pattern | Indicators | Suggested Fix | |---------|------------|---------------| | Memory Exhaustion | memory_mb > 450, rapid growth | Enable aggressive DOM cleanup | | DOM Bloat | dom_nodes > 10000, not decreasing | Increase card hiding frequency | | Rate Limited | Multiple 429 responses | Increase delays, rotate proxy | | Consent Loop | Repeated consent URL detection | Clear cookies, different fingerprint | | Element Timeout | Multiple "element not found" | Increase wait times, check selectors | | Network Stall | No network activity > 30s | Refresh page, check connectivity | **Backend Implementation:** - Wrap scraper execution in try/catch with crash capture - Periodic metrics sampling (every 5 seconds) stored in ring buffer - On crash: compile report, analyze pattern, store to database - New table: `crash_reports` with JSONB data **Frontend Display:** - Dedicated "Crash" section when job fails - Timeline visualization showing metrics leading to crash - Pattern explanation with confidence score - One-click "Apply Fix & Retry" when auto_fixable=true --- ### 3. Copy & Export System **Problem:** Users can't easily copy logs for debugging, sharing, or support requests. **Solution:** Multi-level copy functionality with various export formats. **Copy Levels:** | Action | What it copies | |--------|----------------| | Click log line | Single log entry as text | | Shift+Click range | Selected range of logs | | "Copy All" button | Entire log as formatted text | | "Export JSON" | Full structured data with metrics | | "Export TXT" | Human-readable plain text | | "Share" | Generate shareable link (optional) | **Frontend Implementation:** ```typescript interface CopySystem { // Single line copy (click handler on each log row) copyLine(entry: LogEntry): void; // Range selection (shift+click) copyRange(startIndex: number, endIndex: number): void; // Full export exportJSON(logs: LogEntry[], includeMetrics: boolean): string; exportTXT(logs: LogEntry[]): string; // Clipboard with feedback copyToClipboard(text: string): Promise; // Shows toast on success } ``` **UI Elements:** - Each log row has hover-visible copy icon - Selection highlight for range copy - Top toolbar: [Copy All] [Export JSON] [Export TXT] - Toast notification: "Copied to clipboard" - Keyboard shortcuts: Ctrl+C for selected, Ctrl+Shift+C for all --- ## P1: Important Features ### 4. Session Fingerprint Panel **Problem:** Users don't know what browser identity was used during scraping, making it hard to debug location-specific issues or understand detection risks. **Solution:** Display all fingerprint parameters used during the session. **Session Info Schema:** ```typescript interface SessionFingerprint { // Identity ip_address: string; // Server's outbound IP ip_location: string; // "Frankfurt, DE" user_agent: string; platform: string; // "MacIntel" language: string; // "es-ES" // Geolocation geolocation: { lat: number; lng: number; city: string; accuracy_meters: number; }; timezone: string; // "Atlantic/Canary" // Viewport viewport: { width: number; height: number; device_pixel_ratio: number; }; // Anti-detection status bot_detection: { webdriver_hidden: boolean; headless_hidden: boolean; plugins_spoofed: boolean; canvas_fingerprint: 'unique' | 'generic' | 'blocked'; webgl_fingerprint: 'unique' | 'generic' | 'blocked'; }; // Source fingerprint_source: 'user_browser' | 'randomized' | 'default'; } ``` **Backend Implementation:** - Capture fingerprint at job start - Store in job metadata - Include bot detection test results (run fingerprint tests on page load) - Return in job status response **Frontend Display:** - Collapsible panel in job details - Visual indicators for detection risk (green/yellow/red) - "What Google Saw" framing for user understanding --- ### 5. Tabbed Log Viewer **Problem:** All logs mixed together makes it hard to focus on specific aspects. **Solution:** Category-based tabs with filtering. **Tab Structure:** ``` ┌──────────┬──────────┬──────────┬──────────┬──────────┐ │ All │ Scraper │ Browser │ Network │ System │ │ (847) │ (423) │ (201) │ (156) │ (67) │ └──────────┴──────────┴──────────┴──────────┴──────────┘ ``` **Features per Tab:** - Log count badge - Level filter dropdown (DEBUG/INFO/WARN/ERROR) - Search within tab - Auto-scroll toggle - Timestamp format toggle (relative/absolute) **Frontend Implementation:** ```typescript interface LogViewerState { activeTab: 'all' | 'scraper' | 'browser' | 'network' | 'system'; levelFilter: Set; searchQuery: string; autoScroll: boolean; timestampFormat: 'relative' | 'absolute'; } ``` --- ## P2: Enhanced Features ### 6. Real-time Metrics Dashboard **Problem:** No visual insight into job performance during execution. **Solution:** Live-updating charts showing key metrics. **Metrics to Display:** | Metric | Chart Type | Update Frequency | |--------|------------|------------------| | Extraction Rate | Line chart (reviews/sec) | 5s | | Cumulative Reviews | Area chart | 5s | | Memory Usage | Line chart (MB) | 5s | | Network Transfer | Line chart (KB) | 5s | | Data Source Ratio | Pie chart (API vs DOM) | On change | **Data Structure:** ```typescript interface MetricsSnapshot { timestamp_ms: number; reviews_total: number; reviews_delta: number; // Since last snapshot memory_mb: number; network_bytes: number; api_reviews: number; dom_reviews: number; } ``` **Backend Implementation:** - Emit metrics via SSE every 5 seconds during job execution - Store final metrics summary in job record **Frontend Implementation:** - Recharts line/area charts - 60-second rolling window during execution - Full history available after completion --- ### 7. Review Topics Inference **Problem:** We extract topic filters from Google but don't know which topics apply to each review. **Solution:** Post-processing to infer topic matches from review text. **Algorithm:** ```python def infer_review_topics(review_text: str, topics: List[dict]) -> List[str]: """ Match review text against extracted topic keywords. Args: review_text: The review content topics: List of {"topic": "cutting", "count": 3} Returns: List of matched topic names """ matched = [] text_lower = review_text.lower() for topic in topics: keyword = topic['topic'].lower() # Direct match if keyword in text_lower: matched.append(topic['topic']) # Stemmed/variant match (e.g., "cut" matches "cutting") elif any(variant in text_lower for variant in get_variants(keyword)): matched.append(topic['topic']) return matched ``` **Storage:** - Add `topics: string[]` field to each review object - Process during scrape (after topics extracted, before reviews saved) **Frontend Display:** - Topic tags on each review card - Filter reviews by topic - Topic distribution in analytics --- ## P3: Advanced Features ### 8. Network Inspector **Problem:** No visibility into API requests/responses for debugging rate limits or data issues. **Solution:** Chrome DevTools Network-style inspector. **Implementation:** - Use Chrome DevTools Protocol (CDP) to intercept all network requests - Filter for relevant domains (google.com/maps) - Capture: URL, method, status, headers, timing, size - Store subset in logs (full data in separate table if needed) **Display:** - Sortable table of requests - Click to expand: headers, response preview, timing breakdown - Filter by: status (2xx/4xx/5xx), type (XHR/image/etc) --- ### 9. DOM Snapshots **Problem:** Can't see what the page looked like at crash time or key moments. **Solution:** Periodic DOM state captures. **Implementation:** - Capture serialized DOM at key events (page load, tab click, every N scrolls, before crash) - Store compressed in blob storage or base64 in database - Include screenshot (CDP Page.captureScreenshot) **Display:** - Snapshot timeline with thumbnails - Side-by-side comparison between snapshots - Diff view showing DOM changes --- ### 10. Job Comparison **Problem:** No way to know if a job performed better or worse than typical. **Solution:** Compare metrics against historical baselines. **Metrics to Compare:** - Total time - Extraction rate (reviews/second) - Memory peak - Network transfer - Success rate (extracted/expected) **Display:** - Bar chart comparing this job vs average - Percentile ranking ("faster than 73% of jobs") - Anomaly detection ("This job used 2x more memory than typical") --- ## Database Schema Changes ```sql -- Crash reports table CREATE TABLE crash_reports ( crash_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), job_id UUID REFERENCES jobs(job_id) ON DELETE CASCADE, created_at TIMESTAMP NOT NULL DEFAULT NOW(), crash_type VARCHAR(50) NOT NULL, error_message TEXT, state JSONB NOT NULL, metrics_history JSONB, logs_before_crash JSONB, analysis JSONB, screenshot_url TEXT, dom_snapshot_id UUID ); CREATE INDEX idx_crash_reports_job ON crash_reports(job_id); CREATE INDEX idx_crash_reports_type ON crash_reports(crash_type); CREATE INDEX idx_crash_reports_created ON crash_reports(created_at DESC); -- Add session fingerprint to jobs ALTER TABLE jobs ADD COLUMN session_fingerprint JSONB; -- Add metrics snapshots for completed jobs ALTER TABLE jobs ADD COLUMN metrics_history JSONB; ``` --- ## API Changes ### New Endpoints ``` GET /jobs/{job_id}/logs?category=scraper&level=ERROR GET /jobs/{job_id}/crash-report GET /jobs/{job_id}/session GET /jobs/{job_id}/metrics POST /jobs/{job_id}/retry?apply_fix=memory_cleanup ``` ### SSE Stream Changes Current: `{"type": "log", "message": "..."}` New: ```json { "type": "log", "data": { "timestamp": "2024-01-24T14:32:01.234Z", "timestamp_ms": 1706103121234, "level": "INFO", "category": "browser", "message": "Navigating to Google Maps URL", "metrics": {"memory_mb": 245} } } { "type": "metrics", "data": { "timestamp_ms": 1706103121234, "reviews_total": 150, "memory_mb": 312, "network_bytes": 1248576 } } { "type": "crash", "data": { /* CrashReport object */ } } ``` --- ## Frontend Component Structure ``` components/ JobDevTools/ index.tsx # Main container with tabs LogViewer.tsx # Tabbed log display LogEntry.tsx # Single log row with copy CopyToolbar.tsx # Export buttons MetricsDashboard.tsx # Charts container SessionPanel.tsx # Fingerprint display CrashReport.tsx # Crash analysis view NetworkInspector.tsx # Request table (P3) DOMSnapshots.tsx # Snapshot viewer (P3) ``` --- ## Implementation Dependencies ``` P0: Structured Logging ──┬──▶ P1: Tabbed Log Viewer │ ├──▶ P0: Crash Intelligence │ └──▶ P2: Metrics Dashboard P0: Copy System ─────────▶ (independent, can parallel) P1: Session Fingerprint ─▶ (independent, can parallel) P2: Topics Inference ────▶ (independent, can parallel) P3: Network Inspector ───▶ Requires: Structured Logging P3: DOM Snapshots ───────▶ Requires: Crash Intelligence P3: Job Comparison ──────▶ Requires: Metrics Dashboard ``` --- ## Success Metrics | Feature | Success Metric | |---------|----------------| | Structured Logging | 100% of logs have category + timestamp_ms | | Crash Intelligence | 80% of crashes have identified pattern | | Copy System | < 200ms copy operation, toast feedback | | Session Panel | All 15+ fingerprint fields populated | | Tabbed Viewer | < 50ms tab switch, correct counts | | Metrics Dashboard | < 100ms chart update, no memory leak | | Topics Inference | > 70% accuracy vs manual labeling | --- ## Parallel Implementation Tracks ### Track A: Backend Logging Infrastructure 1. StructuredLogger class 2. Database schema changes 3. SSE stream updates 4. Crash detection wrapper ### Track B: Frontend Log Viewer 1. JobDevTools container 2. LogViewer with tabs 3. LogEntry with copy 4. CopyToolbar ### Track C: Crash Analysis 1. CrashReport schema 2. Pattern detection algorithms 3. CrashReport frontend component 4. Retry with fix functionality ### Track D: Session & Metrics 1. Fingerprint capture 2. SessionPanel component 3. Metrics streaming 4. MetricsDashboard charts ### Track E: Review Topics 1. Topic inference algorithm 2. Add topics to review schema 3. Frontend topic tags 4. Topic filter in analytics