# Job DevTools - Observability Suite Specification

## Executive Summary

Job DevTools is an observability system for scraping jobs that provides real-time monitoring, crash analysis, session transparency, and debugging tools. It turns opaque job execution into an inspectable process.

---

## Priority Matrix

| Priority | Feature | Business Value | Complexity |
|----------|---------|----------------|------------|
| P0 | Structured Logging System | Foundation for all other features | Medium |
| P0 | Crash Intelligence | Reduce job failures, debug tab crashes | High |
| P0 | Copy & Export System | User productivity, support debugging | Low |
| P1 | Session Fingerprint Panel | Trust, transparency, debugging | Medium |
| P1 | Tabbed Log Viewer | Organized debugging experience | Medium |
| P2 | Real-time Metrics Dashboard | Visual monitoring, performance insight | Medium |
| P2 | Review Topics Inference | Enhanced analytics value | Low |
| P3 | Network Inspector | Deep debugging for edge cases | High |
| P3 | DOM Snapshots | Root cause analysis for crashes | High |
| P3 | Job Comparison | Performance optimization insights | Medium |

---

## P0: Critical Features

### 1. Structured Logging System

**Problem:** Current logs are flat text strings without categorization, making it impossible to filter or analyze specific aspects of job execution.

**Solution:** JSON-structured log entries with metadata.

**Log Entry Schema:**
```typescript
interface LogEntry {
  timestamp: string;        // ISO 8601 with milliseconds
  timestamp_ms: number;     // Unix ms for sorting/graphing
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';
  category: 'scraper' | 'browser' | 'network' | 'system';
  message: string;

  // Optional contextual data
  metrics?: {
    memory_mb?: number;
    reviews_count?: number;
    scroll_position?: number;
    dom_nodes?: number;
  };

  // For crash correlation
  snapshot_id?: string;

  // For network events
  network?: {
    url?: string;
    method?: string;
    status?: number;
    size_bytes?: number;
    duration_ms?: number;
  };
}
```

**Category Definitions:**

| Category | What it captures |
|----------|------------------|
| `scraper` | Review extraction, batch progress, data parsing, topic extraction |
| `browser` | Page navigation, consent handling, tab clicks, scroll events, element waits |
| `network` | API interceptions, request/response data, rate limiting, failures |
| `system` | Memory pressure, Chrome process health, worker pool status, timeouts |

**Backend Changes:**
- Replace `LogCapture` class with `StructuredLogger`
- All log calls include category and optional metrics
- Store as a JSONB array in the database (the existing `scrape_logs` column is already `jsonb`)
- Stream via SSE with category field

**Example Logs:**
```json
{"timestamp": "2024-01-24T14:32:01.234Z", "timestamp_ms": 1706106721234, "level": "INFO", "category": "browser", "message": "Navigating to Google Maps URL", "metrics": {"memory_mb": 245}}
{"timestamp": "2024-01-24T14:32:02.456Z", "timestamp_ms": 1706106722456, "level": "WARN", "category": "browser", "message": "Consent popup detected, handling...", "metrics": {"memory_mb": 248}}
{"timestamp": "2024-01-24T14:32:03.789Z", "timestamp_ms": 1706106723789, "level": "INFO", "category": "scraper", "message": "Extracted batch of 50 reviews from API", "metrics": {"reviews_count": 50, "memory_mb": 267}}
{"timestamp": "2024-01-24T14:32:45.123Z", "timestamp_ms": 1706106765123, "level": "ERROR", "category": "system", "message": "Chrome memory pressure critical", "metrics": {"memory_mb": 489, "dom_nodes": 12847}}
```
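
A minimal sketch of what a `StructuredLogger` could look like, assuming a Python backend and in-memory collection before the entries are written to `scrape_logs`. The method names (`log`, `to_json`) and the injectable `now` parameter are assumptions for illustration, not part of the existing codebase.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
CATEGORIES = {"scraper", "browser", "network", "system"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class StructuredLogger:
    """Collects LogEntry-shaped dicts in memory (hypothetical sketch)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def log(self, level: str, category: str, message: str,
            metrics: Optional[Dict[str, float]] = None,
            now: Optional[datetime] = None) -> Dict[str, Any]:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        # `now` must be timezone-aware; it is normalized to UTC here.
        now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
        entry: Dict[str, Any] = {
            # ISO 8601 with millisecond precision and a trailing "Z"
            "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Integer timedelta division avoids float rounding errors
            "timestamp_ms": (now - EPOCH) // timedelta(milliseconds=1),
            "level": level,
            "category": category,
            "message": message,
        }
        if metrics:
            entry["metrics"] = metrics
        self.entries.append(entry)
        return entry

    def to_json(self) -> str:
        """Serialize all entries as a JSON array, ready for a JSONB column."""
        return json.dumps(self.entries)
```

Deriving `timestamp` and `timestamp_ms` from the same `datetime` keeps the two fields from drifting apart, which matters once the values are used for sorting and graphing.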

---

### 2. Crash Intelligence System

**Problem:** Tab crashes are frequent but opaque. No visibility into what caused the crash or how to prevent it.

**Solution:** Comprehensive crash detection, analysis, and remediation suggestions.

**Crash Report Schema:**
```typescript
interface CrashReport {
  crash_id: string;
  job_id: string;
  timestamp: string;

  // Crash classification
  crash_type: 'tab_crash' | 'memory_exhaustion' | 'timeout' | 'network_failure' | 'element_not_found' | 'rate_limited';
  error_message: string;
  error_code?: string;

  // State at crash
  state: {
    reviews_extracted: number;
    total_expected: number;
    scroll_count: number;
    scroll_position: number;
    elapsed_seconds: number;
  };

  // Resource metrics (last 10 readings)
  metrics_history: Array<{
    timestamp_ms: number;
    memory_mb: number;
    dom_nodes: number;
    cpu_percent?: number;
  }>;

  // Last N log entries before crash
  logs_before_crash: LogEntry[];  // Last 20 entries

  // Recovery info
  last_successful_review_id?: string;
  checkpoint_available: boolean;

  // Analysis
  analysis: {
    pattern: string;           // e.g., "memory_exhaustion", "rate_limit_cascade"
    confidence: number;        // 0-100
    similar_crashes: number;   // Count in last 7 days
    suggested_fix: string;
    auto_fixable: boolean;
  };

  // Optional artifacts
  screenshot_url?: string;
  dom_snapshot_id?: string;
}
```

**Crash Patterns to Detect:**

| Pattern | Indicators | Suggested Fix |
|---------|------------|---------------|
| Memory Exhaustion | memory_mb > 450, rapid growth | Enable aggressive DOM cleanup |
| DOM Bloat | dom_nodes > 10000, not decreasing | Increase card hiding frequency |
| Rate Limited | Multiple 429 responses | Increase delays, rotate proxy |
| Consent Loop | Repeated consent URL detection | Clear cookies, different fingerprint |
| Element Timeout | Multiple "element not found" | Increase wait times, check selectors |
| Network Stall | No network activity > 30s | Refresh page, check connectivity |
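
The first three rows of the table can be expressed as simple rules over the sampled metrics and observed HTTP status codes. The sketch below is one possible reading, not a definitive detector: the function name `classify_crash`, the pattern identifiers, the "+50 MB across the window" definition of rapid growth, and the "at least two 429s" cutoff are all assumptions.

```python
from typing import Dict, List, Tuple

# Limits taken from the table above.
MEMORY_LIMIT_MB = 450
DOM_NODE_LIMIT = 10_000
# Assumptions: the spec does not quantify "rapid growth" or "multiple".
RAPID_GROWTH_MB = 50
MIN_429_RESPONSES = 2


def classify_crash(metrics_history: List[Dict[str, float]],
                   status_codes: List[int]) -> Tuple[str, str]:
    """Return (pattern, suggested_fix) for the first rule that matches."""
    if metrics_history:
        first, last = metrics_history[0], metrics_history[-1]
        growth = last["memory_mb"] - first["memory_mb"]
        if last["memory_mb"] > MEMORY_LIMIT_MB and growth >= RAPID_GROWTH_MB:
            return "memory_exhaustion", "Enable aggressive DOM cleanup"
        # "Not decreasing": the latest reading is at least the oldest one
        if last["dom_nodes"] > DOM_NODE_LIMIT and last["dom_nodes"] >= first["dom_nodes"]:
            return "dom_bloat", "Increase card hiding frequency"
    if sum(1 for code in status_codes if code == 429) >= MIN_429_RESPONSES:
        return "rate_limited", "Increase delays, rotate proxy"
    return "unknown", "No automatic suggestion"
```

Rules are checked in a fixed order, so a crash that matches several patterns is reported under the first one; a production detector would likely score every pattern to fill the `confidence` field.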

**Backend Implementation:**
- Wrap scraper execution in try/catch with crash capture
- Periodic metrics sampling (every 5 seconds) stored in ring buffer
- On crash: compile report, analyze pattern, store to database
- New table: `crash_reports` with JSONB data
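
The ring buffer and the try/catch wrapper might be combined as follows. This is a sketch under assumptions: `CrashRecorder` and its methods are hypothetical names, the scraper is modeled as a zero-argument callable, and the returned dict covers only a subset of the `CrashReport` fields.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Optional


class CrashRecorder:
    """Keeps the most recent metrics and logs so they survive a crash."""

    def __init__(self, max_metrics: int = 10, max_logs: int = 20) -> None:
        # deque(maxlen=...) drops the oldest item automatically (ring buffer)
        self.metrics: Deque[Dict[str, Any]] = deque(maxlen=max_metrics)
        self.logs: Deque[Dict[str, Any]] = deque(maxlen=max_logs)

    def record_metrics(self, sample: Dict[str, Any]) -> None:
        self.metrics.append(sample)

    def record_log(self, entry: Dict[str, Any]) -> None:
        self.logs.append(entry)

    def run(self, job_id: str, scrape: Callable[[], Any]) -> Optional[Dict[str, Any]]:
        """Run scrape(); return None on success or a partial crash report."""
        try:
            scrape()
            return None
        except Exception as exc:  # broad on purpose: any failure is a crash here
            return {
                "job_id": job_id,
                "error_message": str(exc),
                "metrics_history": list(self.metrics),
                "logs_before_crash": list(self.logs),
            }
```

The default buffer sizes mirror the schema comments (last 10 metric readings, last 20 log entries), so memory use stays bounded however long the job runs.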

**Frontend Display:**
- Dedicated "Crash" section when job fails
- Timeline visualization showing metrics leading to crash
- Pattern explanation with confidence score
- One-click "Apply Fix & Retry" when auto_fixable=true

---

### 3. Copy & Export System

**Problem:** Users can't easily copy logs for debugging, sharing, or support requests.

**Solution:** Multi-level copy functionality with various export formats.

**Copy Levels:**

| Action | What it copies |
|--------|----------------|
| Click log line | Single log entry as text |
| Shift+Click range | Selected range of logs |
| "Copy All" button | Entire log as formatted text |
| "Export JSON" | Full structured data with metrics |
| "Export TXT" | Human-readable plain text |
| "Share" | Generate shareable link (optional) |

**Frontend Implementation:**
```typescript
interface CopySystem {
  // Single line copy (click handler on each log row)
  copyLine(entry: LogEntry): void;

  // Range selection (shift+click)
  copyRange(startIndex: number, endIndex: number): void;

  // Full export
  exportJSON(logs: LogEntry[], includeMetrics: boolean): string;
  exportTXT(logs: LogEntry[]): string;

  // Clipboard with feedback
  copyToClipboard(text: string): Promise<void>;  // Shows toast on success
}
```

**UI Elements:**
- Each log row has hover-visible copy icon
- Selection highlight for range copy
- Top toolbar: [Copy All] [Export JSON] [Export TXT]
- Toast notification: "Copied to clipboard"
- Keyboard shortcuts: Ctrl+C for selected, Ctrl+Shift+C for all

---

## P1: Important Features

### 4. Session Fingerprint Panel

**Problem:** Users don't know what browser identity was used during scraping, making it hard to debug location-specific issues or understand detection risks.

**Solution:** Display all fingerprint parameters used during the session.

**Session Info Schema:**
```typescript
interface SessionFingerprint {
  // Identity
  ip_address: string;         // Server's outbound IP
  ip_location: string;        // "Frankfurt, DE"
  user_agent: string;
  platform: string;           // "MacIntel"
  language: string;           // "es-ES"

  // Geolocation
  geolocation: {
    lat: number;
    lng: number;
    city: string;
    accuracy_meters: number;
  };
  timezone: string;           // "Atlantic/Canary"

  // Viewport
  viewport: {
    width: number;
    height: number;
    device_pixel_ratio: number;
  };

  // Anti-detection status
  bot_detection: {
    webdriver_hidden: boolean;
    headless_hidden: boolean;
    plugins_spoofed: boolean;
    canvas_fingerprint: 'unique' | 'generic' | 'blocked';
    webgl_fingerprint: 'unique' | 'generic' | 'blocked';
  };

  // Source
  fingerprint_source: 'user_browser' | 'randomized' | 'default';
}
```

**Backend Implementation:**
- Capture fingerprint at job start
- Store in job metadata
- Include bot detection test results (run fingerprint tests on page load)
- Return in job status response

**Frontend Display:**
- Collapsible panel in job details
- Visual indicators for detection risk (green/yellow/red)
- "What Google Saw" framing for user understanding

---

### 5. Tabbed Log Viewer

**Problem:** Mixing all logs together makes it hard to focus on specific aspects.

**Solution:** Category-based tabs with filtering.

**Tab Structure:**
```
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│   All    │ Scraper  │ Browser  │ Network  │  System  │
│  (847)   │  (423)   │  (201)   │  (156)   │   (67)   │
└──────────┴──────────┴──────────┴──────────┴──────────┘
```

**Features per Tab:**
- Log count badge
- Level filter dropdown (DEBUG/INFO/WARN/ERROR/FATAL)
- Search within tab
- Auto-scroll toggle
- Timestamp format toggle (relative/absolute)

**Frontend Implementation:**
```typescript
type LogLevel = LogEntry['level'];  // 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL'

interface LogViewerState {
  activeTab: 'all' | 'scraper' | 'browser' | 'network' | 'system';
  levelFilter: Set<LogLevel>;
  searchQuery: string;
  autoScroll: boolean;
  timestampFormat: 'relative' | 'absolute';
}
```

---

## P2: Enhanced Features

### 6. Real-time Metrics Dashboard

**Problem:** No visual insight into job performance during execution.

**Solution:** Live-updating charts showing key metrics.

**Metrics to Display:**

| Metric | Chart Type | Update Frequency |
|--------|------------|------------------|
| Extraction Rate | Line chart (reviews/sec) | 5s |
| Cumulative Reviews | Area chart | 5s |
| Memory Usage | Line chart (MB) | 5s |
| Network Transfer | Line chart (KB) | 5s |
| Data Source Ratio | Pie chart (API vs DOM) | On change |

**Data Structure:**
```typescript
interface MetricsSnapshot {
  timestamp_ms: number;
  reviews_total: number;
  reviews_delta: number;      // Since last snapshot
  memory_mb: number;
  network_bytes: number;
  api_reviews: number;
  dom_reviews: number;
}
```

**Backend Implementation:**
- Emit metrics via SSE every 5 seconds during job execution
- Store final metrics summary in job record

**Frontend Implementation:**
- Recharts line/area charts
- 60-second rolling window during execution
- Full history available after completion
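
On the backend, the per-snapshot delta, the extraction rate, and the 60-second window could be derived as sketched below. The dataclass mirrors `MetricsSnapshot`; the helper names (`make_snapshot`, `extraction_rate`, `rolling_window`) are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetricsSnapshot:
    timestamp_ms: int
    reviews_total: int
    reviews_delta: int  # since last snapshot
    memory_mb: float
    network_bytes: int
    api_reviews: int
    dom_reviews: int


def make_snapshot(prev: Optional[MetricsSnapshot], timestamp_ms: int,
                  reviews_total: int, memory_mb: float, network_bytes: int,
                  api_reviews: int, dom_reviews: int) -> MetricsSnapshot:
    """Build a snapshot, deriving reviews_delta from the previous one."""
    baseline = prev.reviews_total if prev else 0
    return MetricsSnapshot(timestamp_ms, reviews_total, reviews_total - baseline,
                           memory_mb, network_bytes, api_reviews, dom_reviews)


def extraction_rate(prev: MetricsSnapshot, cur: MetricsSnapshot) -> float:
    """Reviews per second between two consecutive snapshots."""
    elapsed_s = (cur.timestamp_ms - prev.timestamp_ms) / 1000
    return cur.reviews_delta / elapsed_s if elapsed_s > 0 else 0.0


def rolling_window(history: List[MetricsSnapshot],
                   window_ms: int = 60_000) -> List[MetricsSnapshot]:
    """Snapshots no older than window_ms relative to the newest one."""
    if not history:
        return []
    cutoff = history[-1].timestamp_ms - window_ms
    return [s for s in history if s.timestamp_ms >= cutoff]
```

Anchoring the window to the newest snapshot rather than the wall clock keeps the chart stable if SSE delivery is briefly delayed.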

---

### 7. Review Topics Inference

**Problem:** We extract topic filters from Google but don't know which topics apply to each review.

**Solution:** Post-processing to infer topic matches from review text.

**Algorithm:**
```python
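import re
from typing import Dict, List


def get_variants(keyword: str) -> List[str]:
    """Placeholder stemmer (an assumption; the spec does not define it)."""
    variants = []
    for suffix in ("ing", "ed", "es", "s"):
        if keyword.endswith(suffix) and len(keyword) - len(suffix) >= 3:
            stem = keyword[: -len(suffix)]
            variants.append(stem)
            # Undo consonant doubling: "cutting" -> "cutt" -> "cut"
            if len(stem) >= 4 and stem[-1] == stem[-2]:
                variants.append(stem[:-1])
    return variants


def _mentions(text_lower: str, term: str) -> bool:
    # Match at the start of a word so "cut" hits "cuts" but not "execute"
    return re.search(r"\b" + re.escape(term), text_lower) is not None


def infer_review_topics(review_text: str, topics: List[Dict]) -> List[str]:
    """
    Match review text against extracted topic keywords.

    Args:
        review_text: The review content
        topics: List of {"topic": "cutting", "count": 3}

    Returns:
        List of matched topic names
    """
    matched = []
    text_lower = review_text.lower()

    for topic in topics:
        keyword = topic['topic'].lower()
        # Direct match first, then stemmed variants
        # (e.g., topic "cutting" matches "cut" in the text)
        candidates = [keyword] + get_variants(keyword)
        if any(_mentions(text_lower, term) for term in candidates):
            matched.append(topic['topic'])

    return matched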
```

**Storage:**
- Add `topics: string[]` field to each review object
- Run inference during the scrape, after topics are extracted and before reviews are saved

**Frontend Display:**
- Topic tags on each review card
- Filter reviews by topic
- Topic distribution in analytics

---

## P3: Advanced Features

### 8. Network Inspector

**Problem:** No visibility into API requests/responses for debugging rate limits or data issues.

**Solution:** Chrome DevTools Network-style inspector.

**Implementation:**
- Use Chrome DevTools Protocol (CDP) to intercept all network requests
- Filter for relevant URLs (e.g., `google.com/maps`)
- Capture: URL, method, status, headers, timing, size
- Store subset in logs (full data in separate table if needed)
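
As a sketch of the "store subset in logs" step, the function below maps the parameters of three CDP events for one request (`Network.requestWillBeSent`, `Network.responseReceived`, `Network.loadingFinished`) onto the `network` field of `LogEntry`. How the events are collected depends on the browser automation library, which this spec does not name; the function names and the exact URL filter are assumptions.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlparse


def is_relevant(url: str) -> bool:
    """Keep only Google Maps traffic (assumed filter: *.google.com under /maps)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    on_google = host == "google.com" or host.endswith(".google.com")
    return on_google and parts.path.startswith("/maps")


def to_network_field(sent: Dict[str, Any], received: Dict[str, Any],
                     finished: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build LogEntry.network from the params of the three CDP events."""
    url = sent["request"]["url"]
    if not is_relevant(url):
        return None
    return {
        "url": url,
        "method": sent["request"]["method"],
        "status": received["response"]["status"],
        "size_bytes": int(finished["encodedDataLength"]),
        # CDP timestamps are monotonic seconds, so the difference is a duration
        "duration_ms": round((finished["timestamp"] - sent["timestamp"]) * 1000),
    }
```

Headers and response bodies are left out on purpose; per the list above, they would go to a separate table if the full data is needed.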

**Display:**
- Sortable table of requests
- Click to expand: headers, response preview, timing breakdown
- Filter by: status (2xx/4xx/5xx), type (XHR/image/etc)

---

### 9. DOM Snapshots

**Problem:** Can't see what the page looked like at crash time or key moments.

**Solution:** Periodic DOM state captures.

**Implementation:**
- Capture serialized DOM at key events (page load, tab click, every N scrolls, before crash)
- Store compressed in blob storage or base64 in database
- Include screenshot (CDP Page.captureScreenshot)
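
For the "compressed ... or base64 in database" option, a snapshot can be packed with the standard library alone. This is a minimal sketch; `pack_snapshot` and `unpack_snapshot` are hypothetical helper names.

```python
import base64
import zlib


def pack_snapshot(html: str) -> str:
    """Compress serialized DOM and encode it as ASCII-safe base64 text."""
    compressed = zlib.compress(html.encode("utf-8"), 9)  # 9 = best compression
    return base64.b64encode(compressed).decode("ascii")


def unpack_snapshot(packed: str) -> str:
    """Reverse pack_snapshot()."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")
```

Base64 adds roughly a third to the compressed size, so blob storage is the better fit once snapshots are large or frequent.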

**Display:**
- Snapshot timeline with thumbnails
- Side-by-side comparison between snapshots
- Diff view showing DOM changes

---

### 10. Job Comparison

**Problem:** No way to know if a job performed better or worse than typical.

**Solution:** Compare metrics against historical baselines.

**Metrics to Compare:**
- Total time
- Extraction rate (reviews/second)
- Memory peak
- Network transfer
- Success rate (extracted/expected)

**Display:**
- Bar chart comparing this job vs average
- Percentile ranking ("faster than 73% of jobs")
- Anomaly detection ("This job used 2x more memory than typical")
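
The percentile ranking and the "2x more memory than typical" message reduce to two small calculations. The sketch below assumes the historical values for one metric are already loaded into a list; the function names are hypothetical, and using the mean as "typical" is an assumption (a median would be more robust to outliers).

```python
from statistics import mean
from typing import Sequence


def percentile_rank(value: float, history: Sequence[float],
                    lower_is_better: bool = True) -> float:
    """Percentage of historical jobs that this job outperforms."""
    if not history:
        raise ValueError("history must not be empty")
    if lower_is_better:
        beaten = sum(1 for h in history if h > value)
    else:
        beaten = sum(1 for h in history if h < value)
    return 100.0 * beaten / len(history)


def anomaly_ratio(value: float, history: Sequence[float]) -> float:
    """How many times larger than the historical mean this value is."""
    return value / mean(history)
```

For example, a job that took 120 seconds against a history of `[100, 150, 200, 300]` is faster than 75% of jobs, and a 600 MB memory peak against a 300 MB mean gives a ratio of 2.0.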

---

## Database Schema Changes

```sql
-- Crash reports table
CREATE TABLE crash_reports (
    crash_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID NOT NULL REFERENCES jobs(job_id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    crash_type VARCHAR(50) NOT NULL,
    error_message TEXT,
    error_code TEXT,
    state JSONB NOT NULL,
    metrics_history JSONB,
    logs_before_crash JSONB,
    last_successful_review_id TEXT,
    checkpoint_available BOOLEAN NOT NULL DEFAULT FALSE,
    analysis JSONB,
    screenshot_url TEXT,
    dom_snapshot_id UUID
);

CREATE INDEX idx_crash_reports_job ON crash_reports(job_id);
CREATE INDEX idx_crash_reports_type ON crash_reports(crash_type);
CREATE INDEX idx_crash_reports_created ON crash_reports(created_at DESC);

-- Add session fingerprint to jobs
ALTER TABLE jobs ADD COLUMN session_fingerprint JSONB;

-- Add metrics snapshots for completed jobs
ALTER TABLE jobs ADD COLUMN metrics_history JSONB;
```

---

## API Changes

### New Endpoints

```
GET  /jobs/{job_id}/logs?category=scraper&level=ERROR
GET  /jobs/{job_id}/crash-report
GET  /jobs/{job_id}/session
GET  /jobs/{job_id}/metrics
POST /jobs/{job_id}/retry?apply_fix=memory_cleanup
```
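
The query parameters on the logs endpoint imply a server-side filter over the stored entries. A minimal sketch, assuming both parameters are optional and that `level` is an exact match rather than a minimum severity (the spec does not say which); `filter_logs` is a hypothetical name.

```python
from typing import Any, Dict, List, Optional


def filter_logs(logs: List[Dict[str, Any]],
                category: Optional[str] = None,
                level: Optional[str] = None) -> List[Dict[str, Any]]:
    """Return entries matching every filter that was supplied."""
    result = []
    for entry in logs:
        if category is not None and entry["category"] != category:
            continue
        if level is not None and entry["level"] != level:
            continue
        result.append(entry)
    return result
```

Omitting both parameters returns the full log, which matches the "All" tab in the viewer.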

### SSE Stream Changes

Current: `{"type": "log", "message": "..."}`

New:
```json
{
  "type": "log",
  "data": {
    "timestamp": "2024-01-24T14:32:01.234Z",
    "timestamp_ms": 1706106721234,
    "level": "INFO",
    "category": "browser",
    "message": "Navigating to Google Maps URL",
    "metrics": {"memory_mb": 245}
  }
}

{
  "type": "metrics",
  "data": {
    "timestamp_ms": 1706106721234,
    "reviews_total": 150,
    "memory_mb": 312,
    "network_bytes": 1248576
  }
}

{
  "type": "crash",
  "data": { /* CrashReport object */ }
}
```
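
On the wire, each of these objects would travel as one Server-Sent Events frame: a `data:` line followed by a blank line. A sketch of the framing, assuming the event type stays inside the JSON payload as shown above instead of using the SSE `event:` field; `sse_event` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def sse_event(event_type: str, data: Dict[str, Any]) -> str:
    """Wrap a payload in the {"type", "data"} envelope as one SSE frame."""
    # json.dumps escapes newlines, so the payload always fits on one data line
    payload = json.dumps({"type": event_type, "data": data}, separators=(",", ":"))
    return f"data: {payload}\n\n"
```

Clients can then switch on `type` after a single `JSON.parse` of each message.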

---

## Frontend Component Structure

```
components/
  JobDevTools/
    index.tsx              # Main container with tabs
    LogViewer.tsx          # Tabbed log display
    LogEntry.tsx           # Single log row with copy
    CopyToolbar.tsx        # Export buttons
    MetricsDashboard.tsx   # Charts container
    SessionPanel.tsx       # Fingerprint display
    CrashReport.tsx        # Crash analysis view
    NetworkInspector.tsx   # Request table (P3)
    DOMSnapshots.tsx       # Snapshot viewer (P3)
```

---

## Implementation Dependencies

```
P0: Structured Logging ──┬──▶ P1: Tabbed Log Viewer
                         │
                         ├──▶ P0: Crash Intelligence
                         │
                         └──▶ P2: Metrics Dashboard

P0: Copy System ─────────▶ (independent, can run in parallel)

P1: Session Fingerprint ─▶ (independent, can run in parallel)

P2: Topics Inference ────▶ (independent, can run in parallel)

P3: Network Inspector ───▶ Requires: Structured Logging
P3: DOM Snapshots ───────▶ Requires: Crash Intelligence
P3: Job Comparison ──────▶ Requires: Metrics Dashboard
```

---

## Success Metrics

| Feature | Success Metric |
|---------|----------------|
| Structured Logging | 100% of logs have category + timestamp_ms |
| Crash Intelligence | 80% of crashes have identified pattern |
| Copy System | < 200ms copy operation, toast feedback |
| Session Panel | All `SessionFingerprint` fields populated (19 leaf fields) |
| Tabbed Viewer | < 50ms tab switch, correct counts |
| Metrics Dashboard | < 100ms chart update, no memory leak |
| Topics Inference | > 70% accuracy vs manual labeling |

---

## Parallel Implementation Tracks

### Track A: Backend Logging Infrastructure
1. StructuredLogger class
2. Database schema changes
3. SSE stream updates
4. Crash detection wrapper

### Track B: Frontend Log Viewer
1. JobDevTools container
2. LogViewer with tabs
3. LogEntry with copy
4. CopyToolbar

### Track C: Crash Analysis
1. CrashReport schema
2. Pattern detection algorithms
3. CrashReport frontend component
4. Retry with fix functionality

### Track D: Session & Metrics
1. Fingerprint capture
2. SessionPanel component
3. Metrics streaming
4. MetricsDashboard charts

### Track E: Review Topics
1. Topic inference algorithm
2. Add topics to review schema
3. Frontend topic tags
4. Topic filter in analytics