whyrating-engine-legacy/.artifacts/job-devtools-spec.md
Alejandro Gutiérrez 65fcaf43e8 Add Job DevTools specification document
Comprehensive spec for observability suite including:
- Structured logging system with categories
- Crash intelligence and pattern analysis
- Copy/export functionality
- Session fingerprint panel
- Real-time metrics dashboard
- Review topics inference

Organized by priority (P0-P3) with parallel implementation tracks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 11:10:34 +00:00

Job DevTools - Observability Suite Specification

Executive Summary

An observability system for scraping jobs that provides real-time monitoring, crash analysis, session transparency, and debugging tools. It turns opaque job execution into a process that can be inspected at every step.


Priority Matrix

Priority | Feature | Business Value | Complexity
P0 | Structured Logging System | Foundation for all other features | Medium
P0 | Crash Intelligence | Reduce job failures, debug tab crashes | High
P0 | Copy & Export System | User productivity, support debugging | Low
P1 | Session Fingerprint Panel | Trust, transparency, debugging | Medium
P1 | Tabbed Log Viewer | Organized debugging experience | Medium
P2 | Real-time Metrics Dashboard | Visual monitoring, performance insight | Medium
P2 | Review Topics Inference | Enhanced analytics value | Low
P3 | Network Inspector | Deep debugging for edge cases | High
P3 | DOM Snapshots | Root cause analysis for crashes | High
P3 | Job Comparison | Performance optimization insights | Medium

P0: Critical Features

1. Structured Logging System

Problem: Current logs are flat text strings without categorization, making it impossible to filter or analyze specific aspects of job execution.

Solution: JSON-structured log entries with metadata.

Log Entry Schema:

interface LogEntry {
  timestamp: string;        // ISO 8601 with milliseconds
  timestamp_ms: number;     // Unix ms for sorting/graphing
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';
  category: 'scraper' | 'browser' | 'network' | 'system';
  message: string;

  // Optional contextual data
  metrics?: {
    memory_mb?: number;
    reviews_count?: number;
    scroll_position?: number;
    dom_nodes?: number;
  };

  // For crash correlation
  snapshot_id?: string;

  // For network events
  network?: {
    url?: string;
    method?: string;
    status?: number;
    size_bytes?: number;
    duration_ms?: number;
  };
}

Category Definitions:

Category | What it captures
scraper | Review extraction, batch progress, data parsing, topic extraction
browser | Page navigation, consent handling, tab clicks, scroll events, element waits
network | API interceptions, request/response data, rate limiting, failures
system | Memory pressure, Chrome process health, worker pool status, timeouts

Backend Changes:

  • Replace LogCapture class with StructuredLogger
  • All log calls include category and optional metrics
  • Store as a JSONB array in the database (the existing scrape_logs column is already jsonb)
  • Stream via SSE with category field
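
The backend changes above could take roughly the following shape. This is a minimal sketch, assuming an in-memory list as the sink; the method names `log` and `to_jsonb` and the injectable `clock` parameter are illustrative assumptions, not the actual backend API.

```python
import json
import time
from datetime import datetime, timezone
from typing import Callable, List, Optional

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
CATEGORIES = {"scraper", "browser", "network", "system"}


class StructuredLogger:
    """Collects entries shaped like the LogEntry schema (hypothetical API)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # returns Unix seconds; injectable for tests
        self.entries: List[dict] = []

    def log(self, level: str, category: str, message: str,
            metrics: Optional[dict] = None) -> dict:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        ts_ms = int(round(self._clock() * 1000))
        # Build the ISO string from integer parts so both fields always agree.
        moment = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc)
        moment = moment.replace(microsecond=(ts_ms % 1000) * 1000)
        entry = {
            "timestamp": moment.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "timestamp_ms": ts_ms,
            "level": level,
            "category": category,
            "message": message,
        }
        if metrics:
            entry["metrics"] = metrics
        self.entries.append(entry)
        return entry

    def to_jsonb(self) -> str:
        """Serialize all entries as one JSON array, e.g. for a JSONB column."""
        return json.dumps(self.entries)
```

Deriving `timestamp` and `timestamp_ms` from the same clock reading keeps the two fields from drifting apart, which matters once entries are sorted or graphed by `timestamp_ms`.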

Example Logs:

{"timestamp": "2024-01-24T14:32:01.234Z", "timestamp_ms": 1706106721234, "level": "INFO", "category": "browser", "message": "Navigating to Google Maps URL", "metrics": {"memory_mb": 245}}
{"timestamp": "2024-01-24T14:32:02.456Z", "timestamp_ms": 1706106722456, "level": "WARN", "category": "browser", "message": "Consent popup detected, handling...", "metrics": {"memory_mb": 248}}
{"timestamp": "2024-01-24T14:32:03.789Z", "timestamp_ms": 1706106723789, "level": "INFO", "category": "scraper", "message": "Extracted batch of 50 reviews from API", "metrics": {"reviews_count": 50, "memory_mb": 267}}
{"timestamp": "2024-01-24T14:32:45.123Z", "timestamp_ms": 1706106765123, "level": "ERROR", "category": "system", "message": "Chrome memory pressure critical", "metrics": {"memory_mb": 489, "dom_nodes": 12847}}

2. Crash Intelligence System

Problem: Tab crashes are frequent but opaque. No visibility into what caused the crash or how to prevent it.

Solution: Comprehensive crash detection, analysis, and remediation suggestions.

Crash Report Schema:

interface CrashReport {
  crash_id: string;
  job_id: string;
  timestamp: string;

  // Crash classification
  crash_type: 'tab_crash' | 'memory_exhaustion' | 'timeout' | 'network_failure' | 'element_not_found' | 'rate_limited';
  error_message: string;
  error_code?: string;

  // State at crash
  state: {
    reviews_extracted: number;
    total_expected: number;
    scroll_count: number;
    scroll_position: number;
    elapsed_seconds: number;
  };

  // Resource metrics (last 10 readings)
  metrics_history: Array<{
    timestamp_ms: number;
    memory_mb: number;
    dom_nodes: number;
    cpu_percent?: number;
  }>;

  // Last N log entries before crash
  logs_before_crash: LogEntry[];  // Last 20 entries

  // Recovery info
  last_successful_review_id?: string;
  checkpoint_available: boolean;

  // Analysis
  analysis: {
    pattern: string;           // e.g., "memory_exhaustion", "rate_limit_cascade"
    confidence: number;        // 0-100
    similar_crashes: number;   // Count in last 7 days
    suggested_fix: string;
    auto_fixable: boolean;
  };

  // Optional artifacts
  screenshot_url?: string;
  dom_snapshot_id?: string;
}

Crash Patterns to Detect:

Pattern | Indicators | Suggested Fix
Memory Exhaustion | memory_mb > 450, rapid growth | Enable aggressive DOM cleanup
DOM Bloat | dom_nodes > 10000, not decreasing | Increase card hiding frequency
Rate Limited | Multiple 429 responses | Increase delays, rotate proxy
Consent Loop | Repeated consent URL detection | Clear cookies, different fingerprint
Element Timeout | Multiple "element not found" | Increase wait times, check selectors
Network Stall | No network activity > 30s | Refresh page, check connectivity
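
Three of these rules can be sketched as a simple classifier. The thresholds (450 MB, 10000 nodes) come from the table; the function name, the pattern identifiers, the reading of "multiple" as two or more, and the omission of the "rapid growth" check are assumptions made for illustration.

```python
from typing import List, Optional

# Suggested fixes copied from the pattern table, keyed by hypothetical identifiers.
SUGGESTED_FIX = {
    "memory_exhaustion": "Enable aggressive DOM cleanup",
    "dom_bloat": "Increase card hiding frequency",
    "rate_limited": "Increase delays, rotate proxy",
}


def detect_pattern(metrics_history: List[dict],
                   status_codes: List[int]) -> Optional[str]:
    """Return the first matching pattern name, or None.

    metrics_history: oldest-first samples with memory_mb and dom_nodes.
    status_codes: HTTP statuses of recent responses.
    """
    if metrics_history:
        latest = metrics_history[-1]
        if latest["memory_mb"] > 450:
            return "memory_exhaustion"
        nodes = [sample["dom_nodes"] for sample in metrics_history]
        # "Not decreasing": every sample is >= the one before it.
        non_decreasing = all(b >= a for a, b in zip(nodes, nodes[1:]))
        if latest["dom_nodes"] > 10000 and non_decreasing:
            return "dom_bloat"
    # "Multiple 429 responses" is read here as two or more (assumption).
    if sum(1 for status in status_codes if status == 429) >= 2:
        return "rate_limited"
    return None
```

Rules are checked in a fixed order, so a crash that satisfies several indicators is reported under the first match; a production version would more likely score every pattern and report a confidence for each.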

Backend Implementation:

  • Wrap scraper execution in try/catch with crash capture
  • Periodic metrics sampling (every 5 seconds) stored in ring buffer
  • On crash: compile report, analyze pattern, store to database
  • New table: crash_reports with JSONB data
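
The ring buffer for periodic samples can be sketched with a bounded deque. The capacity of 10 follows the "last 10 readings" comment in the CrashReport schema; the class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, List


class MetricsRingBuffer:
    """Keeps only the most recent `capacity` metric samples."""

    def __init__(self, capacity: int = 10) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._samples: Deque[dict] = deque(maxlen=capacity)

    def record(self, timestamp_ms: int, memory_mb: float, dom_nodes: int) -> None:
        self._samples.append({
            "timestamp_ms": timestamp_ms,
            "memory_mb": memory_mb,
            "dom_nodes": dom_nodes,
        })

    def snapshot(self) -> List[dict]:
        """Oldest-first copy, suitable for CrashReport.metrics_history."""
        return list(self._samples)
```

With one sample every 5 seconds and a capacity of 10, the buffer holds roughly the last 50 seconds of readings, and memory use stays constant no matter how long the job runs.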

Frontend Display:

  • Dedicated "Crash" section when job fails
  • Timeline visualization showing metrics leading to crash
  • Pattern explanation with confidence score
  • One-click "Apply Fix & Retry" when auto_fixable=true

3. Copy & Export System

Problem: Users can't easily copy logs for debugging, sharing, or support requests.

Solution: Multi-level copy functionality with various export formats.

Copy Levels:

Action | What it copies
Click log line | Single log entry as text
Shift+Click range | Selected range of logs
"Copy All" button | Entire log as formatted text
"Export JSON" | Full structured data with metrics
"Export TXT" | Human-readable plain text
"Share" | Generate shareable link (optional)

Frontend Implementation:

interface CopySystem {
  // Single line copy (click handler on each log row)
  copyLine(entry: LogEntry): void;

  // Range selection (shift+click)
  copyRange(startIndex: number, endIndex: number): void;

  // Full export
  exportJSON(logs: LogEntry[], includeMetrics: boolean): string;
  exportTXT(logs: LogEntry[]): string;

  // Clipboard with feedback
  copyToClipboard(text: string): Promise<void>;  // Shows toast on success
}

UI Elements:

  • Each log row has hover-visible copy icon
  • Selection highlight for range copy
  • Top toolbar: [Copy All] [Export JSON] [Export TXT]
  • Toast notification: "Copied to clipboard"
  • Keyboard shortcuts: Ctrl+C for selected, Ctrl+Shift+C for all
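
The export and range-selection logic is small enough to sketch. The real implementation would live in the frontend; Python is used here only to illustrate the logic, and the plain-text line layout (`[timestamp] LEVEL category: message`) is an assumption, since the spec does not fix one.

```python
import json
from typing import List


def export_txt(logs: List[dict]) -> str:
    """One human-readable line per entry (layout is an assumed format)."""
    return "\n".join(
        f'[{e["timestamp"]}] {e["level"]:<5} {e["category"]}: {e["message"]}'
        for e in logs
    )


def export_json(logs: List[dict], include_metrics: bool) -> str:
    """Full structured export; optionally strips the metrics field."""
    if include_metrics:
        rows = logs
    else:
        rows = [{k: v for k, v in e.items() if k != "metrics"} for e in logs]
    return json.dumps(rows, indent=2)


def select_range(logs: List[dict], start_index: int, end_index: int) -> List[dict]:
    """Inclusive range; tolerates shift+click in either direction."""
    low, high = sorted((start_index, end_index))
    return logs[low:high + 1]
```

Sorting the two indices first means a shift+click above the anchor row selects the same entries as one below it.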

P1: Important Features

4. Session Fingerprint Panel

Problem: Users don't know what browser identity was used during scraping, making it hard to debug location-specific issues or understand detection risks.

Solution: Display all fingerprint parameters used during the session.

Session Info Schema:

interface SessionFingerprint {
  // Identity
  ip_address: string;         // Server's outbound IP
  ip_location: string;        // "Frankfurt, DE"
  user_agent: string;
  platform: string;           // "MacIntel"
  language: string;           // "es-ES"

  // Geolocation
  geolocation: {
    lat: number;
    lng: number;
    city: string;
    accuracy_meters: number;
  };
  timezone: string;           // "Atlantic/Canary"

  // Viewport
  viewport: {
    width: number;
    height: number;
    device_pixel_ratio: number;
  };

  // Anti-detection status
  bot_detection: {
    webdriver_hidden: boolean;
    headless_hidden: boolean;
    plugins_spoofed: boolean;
    canvas_fingerprint: 'unique' | 'generic' | 'blocked';
    webgl_fingerprint: 'unique' | 'generic' | 'blocked';
  };

  // Source
  fingerprint_source: 'user_browser' | 'randomized' | 'default';
}

Backend Implementation:

  • Capture fingerprint at job start
  • Store in job metadata
  • Include bot detection test results (run fingerprint tests on page load)
  • Return in job status response

Frontend Display:

  • Collapsible panel in job details
  • Visual indicators for detection risk (green/yellow/red)
  • "What Google Saw" framing for user understanding

5. Tabbed Log Viewer

Problem: Mixing all logs into a single stream makes it hard to focus on one aspect of a job.

Solution: Category-based tabs with filtering.

Tab Structure:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│   All    │ Scraper  │ Browser  │ Network  │  System  │
│  (847)   │  (423)   │  (201)   │  (156)   │   (67)   │
└──────────┴──────────┴──────────┴──────────┴──────────┘

Features per Tab:

  • Log count badge
  • Level filter dropdown (DEBUG/INFO/WARN/ERROR/FATAL)
  • Search within tab
  • Auto-scroll toggle
  • Timestamp format toggle (relative/absolute)

Frontend Implementation:

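// Reuses the level union from LogEntry: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL'
type LogLevel = LogEntry['level'];

interface LogViewerState {
  activeTab: 'all' | 'scraper' | 'browser' | 'network' | 'system';
  levelFilter: Set<LogLevel>;
  searchQuery: string;
  autoScroll: boolean;
  timestampFormat: 'relative' | 'absolute';
}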
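
The badge counts and the combined tab, level, and search filter reduce to a few lines. This sketch uses Python to illustrate the logic only; the function names are hypothetical, and treating an empty level filter as "show all levels" is an assumption.

```python
from collections import Counter
from typing import Dict, List, Set

TABS = ("scraper", "browser", "network", "system")


def tab_counts(logs: List[dict]) -> Dict[str, int]:
    """Badge numbers for each tab, including the 'all' tab."""
    per_category = Counter(entry["category"] for entry in logs)
    counts = {"all": len(logs)}
    counts.update({tab: per_category.get(tab, 0) for tab in TABS})
    return counts


def visible_logs(logs: List[dict], active_tab: str,
                 level_filter: Set[str], search_query: str) -> List[dict]:
    """Apply tab, level, and case-insensitive search filters together."""
    query = search_query.lower()
    return [
        entry for entry in logs
        if (active_tab == "all" or entry["category"] == active_tab)
        and (not level_filter or entry["level"] in level_filter)  # empty set: no level filter
        and query in entry["message"].lower()
    ]
```

Computing the counts from the unfiltered list keeps the badges stable while the user types in the search box; whether badges should instead reflect the active filters is a product decision the spec leaves open.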

P2: Enhanced Features

6. Real-time Metrics Dashboard

Problem: No visual insight into job performance during execution.

Solution: Live-updating charts showing key metrics.

Metrics to Display:

Metric | Chart Type | Update Frequency
Extraction Rate | Line chart (reviews/sec) | 5s
Cumulative Reviews | Area chart | 5s
Memory Usage | Line chart (MB) | 5s
Network Transfer | Line chart (KB) | 5s
Data Source Ratio | Pie chart (API vs DOM) | On change

Data Structure:

interface MetricsSnapshot {
  timestamp_ms: number;
  reviews_total: number;
  reviews_delta: number;      // Since last snapshot
  memory_mb: number;
  network_bytes: number;
  api_reviews: number;
  dom_reviews: number;
}

Backend Implementation:

  • Emit metrics via SSE every 5 seconds during job execution
  • Store final metrics summary in job record

Frontend Implementation:

  • Recharts line/area charts
  • 60-second rolling window during execution
  • Full history available after completion
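
The 60-second rolling window and the extraction-rate figure can be sketched as follows. The class name and the choice to average the rate over the whole window (rather than over the latest 5-second delta) are assumptions.

```python
from collections import deque
from typing import Deque


class RollingMetrics:
    """Keeps MetricsSnapshot-like dicts from the last `window_ms` milliseconds."""

    def __init__(self, window_ms: int = 60_000) -> None:
        self.window_ms = window_ms
        self.snapshots: Deque[dict] = deque()

    def add(self, snapshot: dict) -> None:
        self.snapshots.append(snapshot)
        cutoff = snapshot["timestamp_ms"] - self.window_ms
        # Drop snapshots that have fallen out of the window.
        while self.snapshots and self.snapshots[0]["timestamp_ms"] < cutoff:
            self.snapshots.popleft()

    def extraction_rate(self) -> float:
        """Average reviews per second across the current window."""
        if len(self.snapshots) < 2:
            return 0.0
        first, last = self.snapshots[0], self.snapshots[-1]
        elapsed_s = (last["timestamp_ms"] - first["timestamp_ms"]) / 1000
        if elapsed_s <= 0:
            return 0.0
        return (last["reviews_total"] - first["reviews_total"]) / elapsed_s
```

Averaging over the window smooths out bursts from batched API responses; plotting `reviews_delta / 5` per snapshot instead would show the raw, spikier rate.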

7. Review Topics Inference

Problem: We extract topic filters from Google but don't know which topics apply to each review.

Solution: Post-processing to infer topic matches from review text.

Algorithm:

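from typing import List


def get_variants(keyword: str) -> List[str]:
    """Naive suffix stripping; a placeholder, since the spec leaves this helper undefined."""
    variants = []
    for suffix in ("ing", "ed", "es", "s"):
        if keyword.endswith(suffix) and len(keyword) - len(suffix) >= 3:
            stem = keyword[: -len(suffix)]
            variants.append(stem)
            # Undo consonant doubling, e.g. "cutt" -> "cut"
            if stem[-1] == stem[-2]:
                variants.append(stem[:-1])
    return variants


def infer_review_topics(review_text: str, topics: List[dict]) -> List[str]:
    """
    Match review text against extracted topic keywords.

    Args:
        review_text: The review content
        topics: List of {"topic": "cutting", "count": 3}

    Returns:
        List of matched topic names
    """
    matched = []
    text_lower = review_text.lower()

    for topic in topics:
        keyword = topic['topic'].lower()
        # Direct match
        if keyword in text_lower:
            matched.append(topic['topic'])
        # Stemmed/variant match (e.g., "cut" matches "cutting")
        elif any(variant in text_lower for variant in get_variants(keyword)):
            matched.append(topic['topic'])

    return matched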
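
One caveat with the substring check `keyword in text_lower` is that short keywords match inside unrelated words ("cut" inside "cute"). A possible refinement, sketched here as an alternative rather than as the spec's method, is to match on word boundaries; the function name is hypothetical.

```python
import re
from typing import List


def match_topics_by_word(review_text: str, topics: List[dict]) -> List[str]:
    """Match topic keywords as whole words, case-insensitively."""
    matched = []
    for topic in topics:
        # re.escape guards against topics containing regex metacharacters.
        pattern = r"\b" + re.escape(topic["topic"]) + r"\b"
        if re.search(pattern, review_text, flags=re.IGNORECASE):
            matched.append(topic["topic"])
    return matched
```

Whole-word matching trades recall for precision: it no longer catches inflected forms on its own, so it would still need to be combined with a variant or stemming step.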

Storage:

  • Add topics: string[] field to each review object
  • Process during scrape (after topics extracted, before reviews saved)

Frontend Display:

  • Topic tags on each review card
  • Filter reviews by topic
  • Topic distribution in analytics

P3: Advanced Features

8. Network Inspector

Problem: No visibility into API requests/responses for debugging rate limits or data issues.

Solution: Chrome DevTools Network-style inspector.

Implementation:

  • Use Chrome DevTools Protocol (CDP) to intercept all network requests
  • Filter for relevant domains (google.com/maps)
  • Capture: URL, method, status, headers, timing, size
  • Store subset in logs (full data in separate table if needed)

Display:

  • Sortable table of requests
  • Click to expand: headers, response preview, timing breakdown
  • Filter by: status (2xx/4xx/5xx), type (XHR/image/etc)
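
The domain and status filters can be sketched independently of how requests are captured. The request records are assumed to be dicts with `url` and `status` keys; that shape and the function names are illustrative, and nothing here calls CDP itself.

```python
from typing import List, Optional, Set
from urllib.parse import urlparse


def is_relevant(url: str) -> bool:
    """True for google.com (or a subdomain) URLs whose path starts with /maps."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_google = host == "google.com" or host.endswith(".google.com")
    return on_google and parsed.path.startswith("/maps")


def status_class(status: int) -> str:
    """Map 200 -> '2xx', 429 -> '4xx', and so on."""
    return f"{status // 100}xx"


def filter_requests(requests: List[dict],
                    classes: Optional[Set[str]] = None) -> List[dict]:
    """Keep relevant requests, optionally restricted to given status classes."""
    return [
        request for request in requests
        if is_relevant(request["url"])
        and (classes is None or status_class(request["status"]) in classes)
    ]
```

Comparing the parsed hostname rather than searching the raw URL string avoids false positives from look-alike domains or from the target string appearing in a query parameter.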

9. DOM Snapshots

Problem: Can't see what the page looked like at crash time or key moments.

Solution: Periodic DOM state captures.

Implementation:

  • Capture serialized DOM at key events (page load, tab click, every N scrolls, before crash)
  • Store compressed in blob storage or base64 in database
  • Include screenshot (CDP Page.captureScreenshot)
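
For the "base64 in database" option, compression and encoding could look like the sketch below. It uses only the standard library (`zlib`, `base64`); the function names are hypothetical, and the choice of zlib over other codecs is an assumption.

```python
import base64
import zlib


def pack_snapshot(html: str) -> str:
    """Compress serialized DOM and encode it as ASCII-safe base64 text."""
    compressed = zlib.compress(html.encode("utf-8"), 9)  # 9 = best compression
    return base64.b64encode(compressed).decode("ascii")


def unpack_snapshot(packed: str) -> str:
    """Inverse of pack_snapshot."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")
```

Base64 inflates the compressed bytes by about a third, so blob storage is likely the better fit for large pages; the base64 route mainly helps when the snapshot has to travel inside JSON.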

Display:

  • Snapshot timeline with thumbnails
  • Side-by-side comparison between snapshots
  • Diff view showing DOM changes

10. Job Comparison

Problem: No way to know if a job performed better or worse than typical.

Solution: Compare metrics against historical baselines.

Metrics to Compare:

  • Total time
  • Extraction rate (reviews/second)
  • Memory peak
  • Network transfer
  • Success rate (extracted/expected)

Display:

  • Bar chart comparing this job vs average
  • Percentile ranking ("faster than 73% of jobs")
  • Anomaly detection ("This job used 2x more memory than typical")
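
The percentile ranking and the "2x more memory than typical" comparison can be sketched as below. Using the median as the "typical" value is an assumption, as are the function names.

```python
from statistics import median
from typing import List


def percent_faster_than(job_seconds: float, history_seconds: List[float]) -> float:
    """Share of historical jobs (0-100) that took longer than this job."""
    if not history_seconds:
        raise ValueError("no historical jobs to compare against")
    slower = sum(1 for seconds in history_seconds if seconds > job_seconds)
    return 100.0 * slower / len(history_seconds)


def memory_ratio(job_peak_mb: float, history_peak_mb: List[float]) -> float:
    """How many times the typical (median) peak memory this job used."""
    if not history_peak_mb:
        raise ValueError("no historical jobs to compare against")
    return job_peak_mb / median(history_peak_mb)
```

For example, a job that finished in 30 seconds against a history of 10, 20, 40, and 50 seconds is faster than 50% of jobs, and a 600 MB peak against a 300 MB median gives a ratio of 2.0.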

Database Schema Changes

-- Crash reports table
CREATE TABLE crash_reports (
    crash_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID REFERENCES jobs(job_id) ON DELETE CASCADE,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    crash_type VARCHAR(50) NOT NULL,
    error_message TEXT,
    state JSONB NOT NULL,
    metrics_history JSONB,
    logs_before_crash JSONB,
    analysis JSONB,
    screenshot_url TEXT,
    dom_snapshot_id UUID
);

CREATE INDEX idx_crash_reports_job ON crash_reports(job_id);
CREATE INDEX idx_crash_reports_type ON crash_reports(crash_type);
CREATE INDEX idx_crash_reports_created ON crash_reports(created_at DESC);

-- Add session fingerprint to jobs
ALTER TABLE jobs ADD COLUMN session_fingerprint JSONB;

-- Add metrics snapshots for completed jobs
ALTER TABLE jobs ADD COLUMN metrics_history JSONB;

API Changes

New Endpoints

GET  /jobs/{job_id}/logs?category=scraper&level=ERROR
GET  /jobs/{job_id}/crash-report
GET  /jobs/{job_id}/session
GET  /jobs/{job_id}/metrics
POST /jobs/{job_id}/retry?apply_fix=memory_cleanup

SSE Stream Changes

Current: {"type": "log", "message": "..."}

New:

{
  "type": "log",
  "data": {
    "timestamp": "2024-01-24T14:32:01.234Z",
    "timestamp_ms": 1706106721234,
    "level": "INFO",
    "category": "browser",
    "message": "Navigating to Google Maps URL",
    "metrics": {"memory_mb": 245}
  }
}

{
  "type": "metrics",
  "data": {
    "timestamp_ms": 1706106721234,
    "reviews_total": 150,
    "memory_mb": 312,
    "network_bytes": 1248576
  }
}

{
  "type": "crash",
  "data": { /* CrashReport object */ }
}
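
On the wire, each of these payloads travels as one Server-Sent Events frame: a `data:` line followed by a blank line. A minimal sketch of the framing, assuming the envelope shape shown above and a hypothetical helper name:

```python
import json


def sse_frame(event_type: str, data: dict) -> str:
    """Wrap a payload in the {"type", "data"} envelope and frame it for SSE."""
    # json.dumps without indent emits no newlines, so the payload fits on
    # a single "data:" line; the blank line terminates the event.
    payload = json.dumps({"type": event_type, "data": data}, separators=(",", ":"))
    return f"data: {payload}\n\n"
```

Keeping the event type inside the JSON envelope, as the spec does, lets the client use a single `message` handler and branch on `type`; the SSE `event:` field would be an alternative way to route log, metrics, and crash events.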

Frontend Component Structure

components/
  JobDevTools/
    index.tsx              # Main container with tabs
    LogViewer.tsx          # Tabbed log display
    LogEntry.tsx           # Single log row with copy
    CopyToolbar.tsx        # Export buttons
    MetricsDashboard.tsx   # Charts container
    SessionPanel.tsx       # Fingerprint display
    CrashReport.tsx        # Crash analysis view
    NetworkInspector.tsx   # Request table (P3)
    DOMSnapshots.tsx       # Snapshot viewer (P3)

Implementation Dependencies

P0: Structured Logging ──┬──▶ P1: Tabbed Log Viewer
                         │
                         ├──▶ P0: Crash Intelligence
                         │
                         └──▶ P2: Metrics Dashboard

P0: Copy System ─────────▶ (independent, can run in parallel)

P1: Session Fingerprint ─▶ (independent, can run in parallel)

P2: Topics Inference ────▶ (independent, can run in parallel)

P3: Network Inspector ───▶ Requires: Structured Logging
P3: DOM Snapshots ───────▶ Requires: Crash Intelligence
P3: Job Comparison ──────▶ Requires: Metrics Dashboard
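
The dependency graph above can be checked mechanically for a valid build order. This sketch encodes the arrows as a mapping from each feature to the features it requires and uses the standard-library `graphlib` module (Python 3.9+); the shortened feature names are the only liberty taken.

```python
from graphlib import TopologicalSorter

# feature -> set of features it requires, transcribed from the diagram
REQUIRES = {
    "Tabbed Log Viewer": {"Structured Logging"},
    "Crash Intelligence": {"Structured Logging"},
    "Metrics Dashboard": {"Structured Logging"},
    "Network Inspector": {"Structured Logging"},
    "DOM Snapshots": {"Crash Intelligence"},
    "Job Comparison": {"Metrics Dashboard"},
    "Copy System": set(),
    "Session Fingerprint": set(),
    "Topics Inference": set(),
}

# static_order() yields every feature after all of its prerequisites.
build_order = list(TopologicalSorter(REQUIRES).static_order())
```

Any order produced this way starts Structured Logging before everything that depends on it; `TopologicalSorter` also raises `CycleError` if a later edit to the graph introduces a circular dependency.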

Success Metrics

Feature | Success Metric
Structured Logging | 100% of logs have category + timestamp_ms
Crash Intelligence | 80% of crashes have identified pattern
Copy System | < 200ms copy operation, toast feedback
Session Panel | All 15+ fingerprint fields populated
Tabbed Viewer | < 50ms tab switch, correct counts
Metrics Dashboard | < 100ms chart update, no memory leak
Topics Inference | > 70% accuracy vs manual labeling

Parallel Implementation Tracks

Track A: Backend Logging Infrastructure

  1. StructuredLogger class
  2. Database schema changes
  3. SSE stream updates
  4. Crash detection wrapper

Track B: Frontend Log Viewer

  1. JobDevTools container
  2. LogViewer with tabs
  3. LogEntry with copy
  4. CopyToolbar

Track C: Crash Analysis

  1. CrashReport schema
  2. Pattern detection algorithms
  3. CrashReport frontend component
  4. Retry with fix functionality

Track D: Session & Metrics

  1. Fingerprint capture
  2. SessionPanel component
  3. Metrics streaming
  4. MetricsDashboard charts

Track E: Review Topics

  1. Topic inference algorithm
  2. Add topics to review schema
  3. Frontend topic tags
  4. Topic filter in analytics