Alejandro Gutiérrez 65fcaf43e8 Add Job DevTools specification document

Comprehensive spec for observability suite including:
- Structured logging system with categories
- Crash intelligence and pattern analysis
- Copy/export functionality
- Session fingerprint panel
- Real-time metrics dashboard
- Review topics inference

Organized by priority (P0-P3) with parallel implementation tracks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-24 11:10:34 +00:00

Job DevTools - Observability Suite Specification

Executive Summary

Job DevTools is an observability system for scraping jobs that provides real-time monitoring, crash analysis, session transparency, and debugging tools. It turns opaque job execution into a process that can be inspected through its logs, metrics, browser identity, and crash state.

Priority Matrix

Priority	Feature	Business Value	Complexity
P0	Structured Logging System	Foundation for all other features	Medium
P0	Crash Intelligence	Reduce job failures, debug tab crashes	High
P0	Copy & Export System	User productivity, support debugging	Low
P1	Session Fingerprint Panel	Trust, transparency, debugging	Medium
P1	Tabbed Log Viewer	Organized debugging experience	Medium
P2	Real-time Metrics Dashboard	Visual monitoring, performance insight	Medium
P2	Review Topics Inference	Enhanced analytics value	Low
P3	Network Inspector	Deep debugging for edge cases	High
P3	DOM Snapshots	Root cause analysis for crashes	High
P3	Job Comparison	Performance optimization insights	Medium

P0: Critical Features

1. Structured Logging System

Problem: Current logs are flat text strings without categorization, making it impossible to filter or analyze specific aspects of job execution.

Solution: JSON-structured log entries with metadata.

Log Entry Schema:

interface LogEntry {
  timestamp: string;        // ISO 8601 with milliseconds
  timestamp_ms: number;     // Unix ms for sorting/graphing
  level: 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';
  category: 'scraper' | 'browser' | 'network' | 'system';
  message: string;

  // Optional contextual data
  metrics?: {
    memory_mb?: number;
    reviews_count?: number;
    scroll_position?: number;
    dom_nodes?: number;
  };

  // For crash correlation
  snapshot_id?: string;

  // For network events
  network?: {
    url?: string;
    method?: string;
    status?: number;
    size_bytes?: number;
    duration_ms?: number;
  };
}

Category Definitions:

Category	What it captures
`scraper`	Review extraction, batch progress, data parsing, topic extraction
`browser`	Page navigation, consent handling, tab clicks, scroll events, element waits
`network`	API interceptions, request/response data, rate limiting, failures
`system`	Memory pressure, Chrome process health, worker pool status, timeouts

Backend Changes:

Replace LogCapture class with StructuredLogger
All log calls include category and optional metrics
Store as a JSONB array in the database (the existing scrape_logs column is already jsonb)
Stream via SSE with category field
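
As a rough illustration of the backend changes above, a StructuredLogger could look like the following minimal sketch. Only the class name and the entry fields come from this spec; the method names, the `now` parameter (used to make timestamps testable), and the in-memory `entries` list are assumptions.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
CATEGORIES = {"scraper", "browser", "network", "system"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class StructuredLogger:
    """Collects entries shaped like the LogEntry schema (illustrative only)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def log(self, level: str, category: str, message: str,
            metrics: Optional[Dict[str, float]] = None,
            now: Optional[datetime] = None) -> Dict[str, Any]:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        now = now or datetime.now(timezone.utc)
        entry: Dict[str, Any] = {
            # ISO 8601 with milliseconds and a trailing Z
            "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            # Integer division avoids float rounding in the millisecond value
            "timestamp_ms": (now - EPOCH) // timedelta(milliseconds=1),
            "level": level,
            "category": category,
            "message": message,
        }
        if metrics:
            entry["metrics"] = metrics
        self.entries.append(entry)
        return entry

    def to_json(self) -> str:
        """Serialize all entries as a JSON array, e.g. for a JSONB column."""
        return json.dumps(self.entries)
```

Rejecting unknown levels and categories at the call site is one way to guarantee that every stored entry can be filtered by category later.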

Example Logs:

{"timestamp": "2024-01-24T14:32:01.234Z", "timestamp_ms": 1706106721234, "level": "INFO", "category": "browser", "message": "Navigating to Google Maps URL", "metrics": {"memory_mb": 245}}
{"timestamp": "2024-01-24T14:32:02.456Z", "timestamp_ms": 1706106722456, "level": "WARN", "category": "browser", "message": "Consent popup detected, handling...", "metrics": {"memory_mb": 248}}
{"timestamp": "2024-01-24T14:32:03.789Z", "timestamp_ms": 1706106723789, "level": "INFO", "category": "scraper", "message": "Extracted batch of 50 reviews from API", "metrics": {"reviews_count": 50, "memory_mb": 267}}
{"timestamp": "2024-01-24T14:32:45.123Z", "timestamp_ms": 1706106765123, "level": "ERROR", "category": "system", "message": "Chrome memory pressure critical", "metrics": {"memory_mb": 489, "dom_nodes": 12847}}

2. Crash Intelligence System

Problem: Tab crashes are frequent but opaque. No visibility into what caused the crash or how to prevent it.

Solution: Comprehensive crash detection, analysis, and remediation suggestions.

Crash Report Schema:

interface CrashReport {
  crash_id: string;
  job_id: string;
  timestamp: string;

  // Crash classification
  crash_type: 'tab_crash' | 'memory_exhaustion' | 'timeout' | 'network_failure' | 'element_not_found' | 'rate_limited';
  error_message: string;
  error_code?: string;

  // State at crash
  state: {
    reviews_extracted: number;
    total_expected: number;
    scroll_count: number;
    scroll_position: number;
    elapsed_seconds: number;
  };

  // Resource metrics (last 10 readings)
  metrics_history: Array<{
    timestamp_ms: number;
    memory_mb: number;
    dom_nodes: number;
    cpu_percent?: number;
  }>;

  // Last N log entries before crash
  logs_before_crash: LogEntry[];  // Last 20 entries

  // Recovery info
  last_successful_review_id?: string;
  checkpoint_available: boolean;

  // Analysis
  analysis: {
    pattern: string;           // e.g., "memory_exhaustion", "rate_limit_cascade"
    confidence: number;        // 0-100
    similar_crashes: number;   // Count in last 7 days
    suggested_fix: string;
    auto_fixable: boolean;
  };

  // Optional artifacts
  screenshot_url?: string;
  dom_snapshot_id?: string;
}

Crash Patterns to Detect:

Pattern	Indicators	Suggested Fix
Memory Exhaustion	memory_mb > 450, rapid growth	Enable aggressive DOM cleanup
DOM Bloat	dom_nodes > 10000, not decreasing	Increase card hiding frequency
Rate Limited	Multiple 429 responses	Increase delays, rotate proxy
Consent Loop	Repeated consent URL detection	Clear cookies, different fingerprint
Element Timeout	Multiple "element not found"	Increase wait times, check selectors
Network Stall	No network activity > 30s	Refresh page, check connectivity
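
A few of the rules in the table above can be expressed as simple checks. The sketch below covers three patterns only and makes several assumptions: the function name, the reading of "multiple" as two or more, and the omission of the "rapid growth" condition are all illustrative choices, not part of this spec.

```python
from typing import Any, Dict, List, Optional, Tuple

# Suggested fixes copied from the pattern table.
FIXES = {
    "memory_exhaustion": "Enable aggressive DOM cleanup",
    "dom_bloat": "Increase card hiding frequency",
    "rate_limited": "Increase delays, rotate proxy",
}


def detect_pattern(metrics_history: List[Dict[str, float]],
                   logs: List[Dict[str, Any]]) -> Optional[Tuple[str, str]]:
    """Return (pattern, suggested_fix) for the first matching rule, else None."""
    memory = [m["memory_mb"] for m in metrics_history]
    dom = [m["dom_nodes"] for m in metrics_history]

    # Memory Exhaustion: latest reading above 450 MB.
    if memory and memory[-1] > 450:
        return "memory_exhaustion", FIXES["memory_exhaustion"]

    # DOM Bloat: above 10000 nodes and never decreasing across the window.
    if dom and dom[-1] > 10000 and all(b >= a for a, b in zip(dom, dom[1:])):
        return "dom_bloat", FIXES["dom_bloat"]

    # Rate Limited: "multiple" 429 responses, read here as two or more.
    statuses = [log.get("network", {}).get("status") for log in logs]
    if statuses.count(429) >= 2:
        return "rate_limited", FIXES["rate_limited"]

    return None
```

Rules are checked in a fixed order, so a crash that matches several patterns is reported under the first one; a real analyzer would likely score each pattern to fill the confidence field.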

Backend Implementation:

Wrap scraper execution in try/catch with crash capture
Periodic metrics sampling (every 5 seconds) stored in ring buffer
On crash: compile report, analyze pattern, store to database
New table: crash_reports with JSONB data
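
The ring buffer and try/catch wrapper described above might be sketched as follows. The class and method names are hypothetical, and the report contains only a subset of the CrashReport fields; the buffer sizes follow the schema comments (last 10 readings, last 20 log entries).

```python
import uuid
from collections import deque
from typing import Any, Callable, Deque, Dict, Optional


class CrashCapture:
    """Keeps the most recent metrics and logs, and builds a report on failure."""

    def __init__(self, job_id: str, max_metrics: int = 10, max_logs: int = 20) -> None:
        self.job_id = job_id
        # deque(maxlen=N) drops the oldest item automatically: a ring buffer.
        self.metrics: Deque[Dict[str, float]] = deque(maxlen=max_metrics)
        self.logs: Deque[Dict[str, Any]] = deque(maxlen=max_logs)

    def run(self, scrape: Callable[[], Any]) -> Optional[Dict[str, Any]]:
        """Run the scraper; return None on success or a partial crash report."""
        try:
            scrape()
            return None
        except Exception as exc:
            return {
                "crash_id": str(uuid.uuid4()),
                "job_id": self.job_id,
                "error_message": str(exc),
                "metrics_history": list(self.metrics),
                "logs_before_crash": list(self.logs),
            }
```

A sampler running every 5 seconds would append to `metrics`, so the ten retained readings cover roughly the last 50 seconds before a crash.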

Frontend Display:

Dedicated "Crash" section when job fails
Timeline visualization showing metrics leading to crash
Pattern explanation with confidence score
One-click "Apply Fix & Retry" when auto_fixable=true

3. Copy & Export System

Problem: Users can't easily copy logs for debugging, sharing, or support requests.

Solution: Multi-level copy functionality with various export formats.

Copy Levels:

Action	What it copies
Click log line	Single log entry as text
Shift+Click range	Selected range of logs
"Copy All" button	Entire log as formatted text
"Export JSON"	Full structured data with metrics
"Export TXT"	Human-readable plain text
"Share"	Generate shareable link (optional)

Frontend Implementation:

interface CopySystem {
  // Single line copy (click handler on each log row)
  copyLine(entry: LogEntry): void;

  // Range selection (shift+click)
  copyRange(startIndex: number, endIndex: number): void;

  // Full export
  exportJSON(logs: LogEntry[], includeMetrics: boolean): string;
  exportTXT(logs: LogEntry[]): string;

  // Clipboard with feedback
  copyToClipboard(text: string): Promise<void>;  // Shows toast on success
}

UI Elements:

Each log row has hover-visible copy icon
Selection highlight for range copy
Top toolbar: [Copy All] [Export JSON] [Export TXT]
Toast notification: "Copied to clipboard"
Keyboard shortcuts: Ctrl+C for selected, Ctrl+Shift+C for all
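
The copy system lives in the frontend, but the export formatting itself is language-agnostic. The sketch below shows it in Python, the language of the backend snippet in this spec. The line layout used by `export_txt` is an assumption; the spec only asks for human-readable plain text.

```python
import json
from typing import Any, Dict, List


def export_txt(logs: List[Dict[str, Any]]) -> str:
    """Render each entry as one line: [timestamp] LEVEL category: message."""
    lines = []
    for e in logs:
        lines.append(f'[{e["timestamp"]}] {e["level"]:<5} {e["category"]}: {e["message"]}')
    return "\n".join(lines)


def export_json(logs: List[Dict[str, Any]], include_metrics: bool) -> str:
    """Serialize entries, optionally dropping the metrics field."""
    if not include_metrics:
        logs = [{k: v for k, v in e.items() if k != "metrics"} for e in logs]
    return json.dumps(logs, indent=2)
```

Padding the level to five characters keeps the category column aligned when a user pastes the text into a support ticket.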

P1: Important Features

4. Session Fingerprint Panel

Problem: Users don't know what browser identity was used during scraping, making it hard to debug location-specific issues or understand detection risks.

Solution: Display all fingerprint parameters used during the session.

Session Info Schema:

interface SessionFingerprint {
  // Identity
  ip_address: string;         // Server's outbound IP
  ip_location: string;        // "Frankfurt, DE"
  user_agent: string;
  platform: string;           // "MacIntel"
  language: string;           // "es-ES"

  // Geolocation
  geolocation: {
    lat: number;
    lng: number;
    city: string;
    accuracy_meters: number;
  };
  timezone: string;           // "Atlantic/Canary"

  // Viewport
  viewport: {
    width: number;
    height: number;
    device_pixel_ratio: number;
  };

  // Anti-detection status
  bot_detection: {
    webdriver_hidden: boolean;
    headless_hidden: boolean;
    plugins_spoofed: boolean;
    canvas_fingerprint: 'unique' | 'generic' | 'blocked';
    webgl_fingerprint: 'unique' | 'generic' | 'blocked';
  };

  // Source
  fingerprint_source: 'user_browser' | 'randomized' | 'default';
}

Backend Implementation:

Capture fingerprint at job start
Store in job metadata
Include bot detection test results (run fingerprint tests on page load)
Return in job status response
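
One possible way to reduce the bot detection results to a single green/yellow/red indicator is sketched below. The thresholds (zero failed checks is green, one is yellow, two or more is red) and the function name are assumptions, and the canvas and WebGL fields are left out because the spec does not say which of their values is riskiest.

```python
from typing import Dict

BOOLEAN_CHECKS = ("webdriver_hidden", "headless_hidden", "plugins_spoofed")


def detection_risk(bot_detection: Dict[str, object]) -> str:
    """Map the boolean bot_detection flags to 'green', 'yellow', or 'red'."""
    # A missing flag counts as a failed check, which is the cautious reading.
    failed = sum(1 for check in BOOLEAN_CHECKS if bot_detection.get(check) is not True)
    if failed == 0:
        return "green"
    if failed == 1:
        return "yellow"
    return "red"
```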

Frontend Display:

Collapsible panel in job details
Visual indicators for detection risk (green/yellow/red)
"What Google Saw" framing for user understanding

5. Tabbed Log Viewer

Problem: Mixing all logs together makes it hard to focus on specific aspects.

Solution: Category-based tabs with filtering.

Tab Structure:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│   All    │ Scraper  │ Browser  │ Network  │  System  │
│  (847)   │  (423)   │  (201)   │  (156)   │   (67)   │
└──────────┴──────────┴──────────┴──────────┴──────────┘

Features per Tab:

Log count badge
Level filter dropdown (DEBUG/INFO/WARN/ERROR/FATAL)
Search within tab
Auto-scroll toggle
Timestamp format toggle (relative/absolute)

Frontend Implementation:

// LogLevel reuses the level union defined on LogEntry
type LogLevel = LogEntry['level'];

interface LogViewerState {
  activeTab: 'all' | 'scraper' | 'browser' | 'network' | 'system';
  levelFilter: Set<LogLevel>;
  searchQuery: string;
  autoScroll: boolean;
  timestampFormat: 'relative' | 'absolute';
}
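
The filtering implied by this state can be sketched independently of the UI framework. The example below is in Python to match the backend snippet in this spec; the function names are hypothetical, and case-insensitive substring search is an assumption about how "Search within tab" should behave.

```python
from collections import Counter
from typing import Any, Dict, List, Set

CATEGORIES = ("scraper", "browser", "network", "system")


def tab_counts(logs: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count entries per tab, including the 'all' tab, for the badges."""
    counts = Counter(e["category"] for e in logs)
    result = {"all": len(logs)}
    for category in CATEGORIES:
        result[category] = counts.get(category, 0)
    return result


def visible_logs(logs: List[Dict[str, Any]], active_tab: str,
                 level_filter: Set[str], search_query: str) -> List[Dict[str, Any]]:
    """Apply the tab, level filter, and search query in one pass."""
    query = search_query.lower()
    return [
        e for e in logs
        if (active_tab == "all" or e["category"] == active_tab)
        and e["level"] in level_filter
        and query in e["message"].lower()
    ]
```

An empty search query matches every message, so the same function serves both the filtered and unfiltered views.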

P2: Enhanced Features

6. Real-time Metrics Dashboard

Problem: No visual insight into job performance during execution.

Solution: Live-updating charts showing key metrics.

Metrics to Display:

Metric	Chart Type	Update Frequency
Extraction Rate	Line chart (reviews/sec)	5s
Cumulative Reviews	Area chart	5s
Memory Usage	Line chart (MB)	5s
Network Transfer	Line chart (KB)	5s
Data Source Ratio	Pie chart (API vs DOM)	On change

Data Structure:

interface MetricsSnapshot {
  timestamp_ms: number;
  reviews_total: number;
  reviews_delta: number;      // Since last snapshot
  memory_mb: number;
  network_bytes: number;
  api_reviews: number;
  dom_reviews: number;
}

Backend Implementation:

Emit metrics via SSE every 5 seconds during job execution
Store final metrics summary in job record

Frontend Implementation:

Recharts line/area charts
60-second rolling window during execution
Full history available after completion
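
The extraction rate and the 60-second rolling window can both be derived from the stream of snapshots. This is a minimal sketch with hypothetical function names; it uses only the timestamp_ms and reviews_total fields from MetricsSnapshot.

```python
from typing import Dict, List


def extraction_rate(prev: Dict[str, float], curr: Dict[str, float]) -> float:
    """Reviews per second between two consecutive snapshots."""
    elapsed_s = (curr["timestamp_ms"] - prev["timestamp_ms"]) / 1000
    if elapsed_s <= 0:
        return 0.0  # guard against duplicate or out-of-order snapshots
    return (curr["reviews_total"] - prev["reviews_total"]) / elapsed_s


def rolling_window(snapshots: List[Dict[str, float]], now_ms: int,
                   window_s: int = 60) -> List[Dict[str, float]]:
    """Keep only the snapshots from the last window_s seconds."""
    cutoff = now_ms - window_s * 1000
    return [s for s in snapshots if s["timestamp_ms"] >= cutoff]
```

With snapshots arriving every 5 seconds, a 60-second window holds about a dozen points per chart, which keeps redraws cheap during execution.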

7. Review Topics Inference

Problem: We extract topic filters from Google but don't know which topics apply to each review.

Solution: Post-processing to infer topic matches from review text.

Algorithm:

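from typing import List


def get_variants(keyword: str) -> List[str]:
    """
    Placeholder variant generator (an assumption, not a real stemmer):
    strips a few common English suffixes, e.g. "cutting" -> "cutt", "cut".
    """
    variants = []
    for suffix in ("ing", "ed", "es", "s"):
        stem = keyword[: -len(suffix)]
        if keyword.endswith(suffix) and len(stem) >= 3:
            variants.append(stem)
            # Undo consonant doubling: "cutt" -> "cut"
            if stem[-1] == stem[-2]:
                variants.append(stem[:-1])
    return variants


def infer_review_topics(review_text: str, topics: List[dict]) -> List[str]:
    """
    Match review text against extracted topic keywords.

    Args:
        review_text: The review content
        topics: List of {"topic": "cutting", "count": 3}

    Returns:
        List of matched topic names
    """
    matched = []
    text_lower = review_text.lower()

    for topic in topics:
        keyword = topic['topic'].lower()
        # Direct substring match
        if keyword in text_lower:
            matched.append(topic['topic'])
        # Variant match (e.g., topic "cutting" matches text containing "cut")
        elif any(variant in text_lower for variant in get_variants(keyword)):
            matched.append(topic['topic'])

    return matched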

Storage:

Add topics: string[] field to each review object
Process during scrape (after topics extracted, before reviews saved)

Frontend Display:

Topic tags on each review card
Filter reviews by topic
Topic distribution in analytics

P3: Advanced Features

8. Network Inspector

Problem: No visibility into API requests/responses for debugging rate limits or data issues.

Solution: Chrome DevTools Network-style inspector.

Implementation:

Use Chrome DevTools Protocol (CDP) to intercept all network requests
Filter for relevant domains (google.com/maps)
Capture: URL, method, status, headers, timing, size
Store subset in logs (full data in separate table if needed)

Display:

Sortable table of requests
Click to expand: headers, response preview, timing breakdown
Filter by: status (2xx/4xx/5xx), type (XHR/image/etc)
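
Independent of how requests are captured, the domain and status filters can be sketched as plain functions over request records. The record shape (a dict with `url` and `status`) and the function names are assumptions, as is the reading of "google.com/maps" as any google.com host with a path starting with /maps.

```python
from typing import Any, Dict, List, Optional
from urllib.parse import urlparse


def is_relevant(url: str) -> bool:
    """True for google.com hosts (including subdomains) under the /maps path."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_google = host == "google.com" or host.endswith(".google.com")
    return on_google and parsed.path.startswith("/maps")


def status_class(status: int) -> str:
    """Map an HTTP status code to its class label, e.g. 429 to '4xx'."""
    return f"{status // 100}xx"


def filter_requests(requests: List[Dict[str, Any]],
                    status_filter: Optional[str] = None) -> List[Dict[str, Any]]:
    """Keep relevant requests, optionally restricted to one status class."""
    kept = [r for r in requests if is_relevant(r["url"])]
    if status_filter is not None:
        kept = [r for r in kept if status_class(r["status"]) == status_filter]
    return kept
```

Matching on the parsed hostname rather than a substring of the URL avoids keeping lookalike domains that merely contain "google.com".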

9. DOM Snapshots

Problem: Can't see what the page looked like at crash time or key moments.

Solution: Periodic DOM state captures.

Implementation:

Capture serialized DOM at key events (page load, tab click, every N scrolls, before crash)
Store compressed in blob storage or base64 in database
Include screenshot (CDP Page.captureScreenshot)
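
For the "compressed or base64" storage option and the diff view, the standard library is enough for a first pass. The sketch below assumes the DOM has already been serialized to an HTML string; the function names are hypothetical, and a line-based diff is a simplification of a true DOM diff.

```python
import base64
import difflib
import zlib
from typing import List


def pack_snapshot(html: str) -> str:
    """Compress serialized DOM and encode it as base64 text for storage."""
    return base64.b64encode(zlib.compress(html.encode("utf-8"))).decode("ascii")


def unpack_snapshot(packed: str) -> str:
    """Reverse pack_snapshot."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")


def diff_snapshots(before: str, after: str) -> List[str]:
    """Return only the removed ('- ') and added ('+ ') lines between snapshots."""
    delta = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in delta if line.startswith(("- ", "+ "))]
```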

Display:

Snapshot timeline with thumbnails
Side-by-side comparison between snapshots
Diff view showing DOM changes

10. Job Comparison

Problem: No way to know if a job performed better or worse than typical.

Solution: Compare metrics against historical baselines.

Metrics to Compare:

Total time
Extraction rate (reviews/second)
Memory peak
Network transfer
Success rate (extracted/expected)

Display:

Bar chart comparing this job vs average
Percentile ranking ("faster than 73% of jobs")
Anomaly detection ("This job used 2x more memory than typical")
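
The percentile ranking and anomaly message could be computed as in the sketch below. The function names, the use of the mean as the "typical" baseline, and the 2x threshold are assumptions chosen to match the example messages above.

```python
from statistics import mean
from typing import List, Optional


def percentile_faster(duration: float, history: List[float]) -> float:
    """Percent of historical jobs that took strictly longer than this one."""
    if not history:
        return 0.0
    return 100.0 * sum(1 for h in history if h > duration) / len(history)


def anomaly_note(value: float, history: List[float], label: str,
                 factor: float = 2.0) -> Optional[str]:
    """Describe the job if it used at least factor times the historical mean."""
    if not history:
        return None
    baseline = mean(history)
    if baseline > 0 and value >= factor * baseline:
        return f"This job used {value / baseline:.1f}x more {label} than typical"
    return None
```

For example, a job that ran 100 seconds against a history of 50, 120, 150, and 200 seconds would be reported as faster than 75% of jobs.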

Database Schema Changes

-- Crash reports table
CREATE TABLE crash_reports (
    crash_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id UUID REFERENCES jobs(job_id) ON DELETE CASCADE,
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    crash_type VARCHAR(50) NOT NULL,
    error_message TEXT,
    state JSONB NOT NULL,
    metrics_history JSONB,
    logs_before_crash JSONB,
    analysis JSONB,
    screenshot_url TEXT,
    dom_snapshot_id UUID
);

CREATE INDEX idx_crash_reports_job ON crash_reports(job_id);
CREATE INDEX idx_crash_reports_type ON crash_reports(crash_type);
CREATE INDEX idx_crash_reports_created ON crash_reports(created_at DESC);

-- Add session fingerprint to jobs
ALTER TABLE jobs ADD COLUMN session_fingerprint JSONB;

-- Add metrics snapshots for completed jobs
ALTER TABLE jobs ADD COLUMN metrics_history JSONB;

API Changes

New Endpoints

GET  /jobs/{job_id}/logs?category=scraper&level=ERROR
GET  /jobs/{job_id}/crash-report
GET  /jobs/{job_id}/session
GET  /jobs/{job_id}/metrics
POST /jobs/{job_id}/retry?apply_fix=memory_cleanup

SSE Stream Changes

Current: {"type": "log", "message": "..."}

New:

{
  "type": "log",
  "data": {
    "timestamp": "2024-01-24T14:32:01.234Z",
    "timestamp_ms": 1706106721234,
    "level": "INFO",
    "category": "browser",
    "message": "Navigating to Google Maps URL",
    "metrics": {"memory_mb": 245}
  }
}

{
  "type": "metrics",
  "data": {
    "timestamp_ms": 1706106721234,
    "reviews_total": 150,
    "memory_mb": 312,
    "network_bytes": 1248576
  }
}

{
  "type": "crash",
  "data": { /* CrashReport object */ }
}
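
A consumer of the new stream needs to route each event by its type field. The following sketch assumes each event arrives on a single `data:` line, as is common for SSE; the function name and handler signature are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable

Handler = Callable[[Dict[str, Any]], None]


def dispatch_events(lines: Iterable[str], handlers: Dict[str, Handler]) -> int:
    """Parse 'data: {...}' lines and pass each payload to its type's handler."""
    handled = 0
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        event = json.loads(line[len("data:"):].strip())
        handler = handlers.get(event.get("type"))
        if handler is not None:
            handler(event["data"])
            handled += 1
    return handled
```

Unknown event types are ignored rather than raising, so older clients keep working if more types are added to the stream later.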

Frontend Component Structure

components/
  JobDevTools/
    index.tsx              # Main container with tabs
    LogViewer.tsx          # Tabbed log display
    LogEntry.tsx           # Single log row with copy
    CopyToolbar.tsx        # Export buttons
    MetricsDashboard.tsx   # Charts container
    SessionPanel.tsx       # Fingerprint display
    CrashReport.tsx        # Crash analysis view
    NetworkInspector.tsx   # Request table (P3)
    DOMSnapshots.tsx       # Snapshot viewer (P3)

Implementation Dependencies

P0: Structured Logging ──┬──▶ P1: Tabbed Log Viewer
                         │
                         ├──▶ P0: Crash Intelligence
                         │
                         └──▶ P2: Metrics Dashboard

P0: Copy System ─────────▶ (independent, can run in parallel)

P1: Session Fingerprint ─▶ (independent, can run in parallel)

P2: Topics Inference ────▶ (independent, can run in parallel)

P3: Network Inspector ───▶ Requires: Structured Logging
P3: DOM Snapshots ───────▶ Requires: Crash Intelligence
P3: Job Comparison ──────▶ Requires: Metrics Dashboard
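
The dependency graph above can be turned into build waves with a topological sort. This sketch uses `graphlib` from the Python standard library (3.9+); the dictionary is transcribed from the diagram, and the `build_waves` helper is a hypothetical name.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# feature -> features it requires, transcribed from the diagram
DEPENDS_ON: Dict[str, Set[str]] = {
    "Structured Logging": set(),
    "Copy System": set(),
    "Session Fingerprint": set(),
    "Topics Inference": set(),
    "Tabbed Log Viewer": {"Structured Logging"},
    "Crash Intelligence": {"Structured Logging"},
    "Metrics Dashboard": {"Structured Logging"},
    "Network Inspector": {"Structured Logging"},
    "DOM Snapshots": {"Crash Intelligence"},
    "Job Comparison": {"Metrics Dashboard"},
}


def build_waves(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group features into waves; items within a wave can be built in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

This ordering reflects technical dependencies only; the P0 to P3 priorities still decide which items in a wave are staffed first.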

Success Metrics

Feature	Success Metric
Structured Logging	100% of logs have category + timestamp_ms
Crash Intelligence	80% of crashes have identified pattern
Copy System	< 200ms copy operation, toast feedback
Session Panel	All 15+ fingerprint fields populated
Tabbed Viewer	< 50ms tab switch, correct counts
Metrics Dashboard	< 100ms chart update, no memory leak
Topics Inference	> 70% accuracy vs manual labeling

Parallel Implementation Tracks

Track A: Backend Logging Infrastructure

StructuredLogger class
Database schema changes
SSE stream updates
Crash detection wrapper

Track B: Frontend Log Viewer

JobDevTools container
LogViewer with tabs
LogEntry with copy
CopyToolbar

Track C: Crash Analysis

CrashReport schema
Pattern detection algorithms
CrashReport frontend component
Retry with fix functionality

Track D: Session & Metrics

Fingerprint capture
SessionPanel component
Metrics streaming
MetricsDashboard charts

Track E: Review Topics

Topic inference algorithm
Add topics to review schema
Frontend topic tags
Topic filter in analytics
