whyrating-engine-legacy/DATA_STRUCTURE_ANALYSIS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
2026-01-18 19:49:24 +00:00

Review Data Structure Analysis

Current Data Types (All Correct)

Based on analysis of scraped reviews from the API:

interface Review {
  author: string;          // ✓ string
  rating: number;          // ✓ number (not string!)
  text: string | null;     // ✓ string or null
  date_text: string;       // ✓ string (relative dates)
  avatar_url: string | null;  // ✓ string or null
  profile_url: string | null; // ✓ string or null
  review_id: string;       // ✓ string
}

All API data types match the TypeScript interface - no conversion needed!
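
A runtime check can back up this claim when new API responses arrive. The following is a minimal sketch, assuming the interface above is the full contract; `isReview` and `isStringOrNull` are hypothetical helper names, not functions from the codebase.

```typescript
interface Review {
  author: string;
  rating: number;
  text: string | null;
  date_text: string;
  avatar_url: string | null;
  profile_url: string | null;
  review_id: string;
}

// Hypothetical helper: true for the nullable string fields of Review.
function isStringOrNull(value: unknown): value is string | null {
  return value === null || typeof value === "string";
}

// Hypothetical type guard: checks every field against the Review interface.
function isReview(value: unknown): value is Review {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r["author"] === "string" &&
    typeof r["rating"] === "number" &&
    isStringOrNull(r["text"]) &&
    typeof r["date_text"] === "string" &&
    isStringOrNull(r["avatar_url"]) &&
    isStringOrNull(r["profile_url"]) &&
    typeof r["review_id"] === "string"
  );
}
```

A guard like this would reject a payload where `rating` arrives as the string `"5"` instead of the number `5`.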

🐛 Bug Found & Fixed

Issue: Date Parsing

Problem: The parseDateText() function called parseInt(text) on the whole string. parseInt returns NaN when the string does not start with a number (as in "Hace 2 semanas"), and the || 1 fallback then turned every NaN into 1. This caused:

  • "Hace 2 semanas" (2 weeks ago) → parsed as 1 week ago
  • "Hace 6 años" (6 years ago) → parsed as 1 year ago
  • "Hace un año" (1 year ago) → parsed as 1 year ago ✓ (correct by accident)

Root cause: parseInt("Hace 2 semanas") = NaN, and NaN || 1 = 1
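
The root cause can be reproduced in isolation. This short sketch uses only the standard `parseInt`; the variable names are illustrative.

```typescript
// parseInt() only reads digits at the start of the string (after optional
// whitespace and a sign), so a leading word makes the result NaN.
const parsed = parseInt("Hace 2 semanas", 10);

// NaN is falsy, so the fallback silently replaces it with 1.
const buggy = parsed || 1;

// A string that starts with digits parses as expected, which makes the bug easy to miss.
const leadingDigits = parseInt("2 semanas", 10);
```

Here `parsed` is NaN, `buggy` is 1, and `leadingDigits` is 2.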

Fix: Added extractNumber() function that uses regex to extract the number:

function extractNumber(text: string): number {
  // Prefer an explicit digit sequence, e.g. "Hace 2 semanas" -> 2
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0], 10);
  // No digits: Spanish "un/una" (one), e.g. "Hace un año".
  // Any other digit-free string also falls back to 1.
  return 1;
}

Verified Results (computed relative to 2026-01-18)

Date: "Hace 2 semanas"  → 2026-01-04 ✓
Date: "Hace 2 meses"    → 2025-11-18 ✓
Date: "Hace un año"     → 2025-01-18 ✓
Date: "Hace 6 años"     → 2020-01-18 ✓
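
The results above can be reproduced with a sketch like the following. The body of `parseDateText()` is not shown in this document, so this is a reconstruction under assumptions: units are matched by substring, the reference date is passed in as a `now` parameter (an addition for testability), and the edited-review prefix is stripped before parsing.

```typescript
function extractNumber(text: string): number {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1; // "un"/"una" fall back to 1
}

// Hypothetical reconstruction of parseDateText(); the real function may differ.
function parseDateText(dateText: string, now: Date = new Date()): Date {
  // Strip the edited-review prefix so both formats share one code path.
  const text = dateText.replace(/^Fecha de edición:\s*/i, "").toLowerCase();
  const n = extractNumber(text);
  const result = new Date(now.getTime());

  if (text.includes("semana")) {
    result.setDate(result.getDate() - 7 * n);
  } else if (text.includes("mes")) {
    result.setMonth(result.getMonth() - n);
  } else if (text.includes("año")) {
    result.setFullYear(result.getFullYear() - n);
  }
  // Unrecognized units fall through and return a copy of `now`.
  return result;
}

const referenceDate = new Date(2026, 0, 18); // 18 January 2026, local time
const twoWeeksAgo = parseDateText("Hace 2 semanas", referenceDate); // 4 January 2026
const sixYearsAgo = parseDateText("Hace 6 años", referenceDate);    // 18 January 2020
```

Note that `setMonth` rolls over when the target month is shorter than the current day of month (for example, 31 March minus one month lands in early March), which adds a few days of error on top of the imprecision discussed below.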

📅 Date Format Patterns Found

Standard Formats

  • "Hace X semanas" - X weeks ago
  • "Hace X meses" - X months ago
  • "Hace X años" - X years ago
  • "Hace un año" - 1 year ago (special case: "un" instead of "1")

Edited Review Format

  • "Fecha de edición: Hace X meses" - Edited X months ago

Date Range Distribution (from 244 reviews)

  • Last week: ~2 reviews
  • Last month: ~5-7 reviews
  • Last year: ~30-40 reviews
  • 1-2 years: ~20-30 reviews
  • 2+ years: ~150+ reviews

⚠️ Imprecision Considerations

Current Approach

Relative dates like "Hace 2 meses" are converted to exact dates (e.g., exactly 2 months ago from today).

Limitation

  • "Hace 2 meses" could mean anywhere from 2.0 to 2.99 months ago
  • Because the lower bound is used, a review can be dated up to ~30 days more recent than it really is; the error is one-sided
  • The same applies to "Hace un año" (1.0 to 1.99 years), where the error can approach a full year

Potential Improvements

Option 1: Conservative Filtering (Current Implementation)

  • Treat "Hace 2 meses" as exactly 2 months ago
  • Simple and fast, but slightly overestimates recency (a review is never dated older than its lower bound)
  • Status: ✓ Implemented

Option 2: Range-Based Filtering

// Consider "Hace 2 meses" as a range: [2 months, 3 months)
// Include in "last month" filter if lower bound < 1 month
  • More accurate for boundary cases
  • More complex implementation
  • May include slightly older reviews
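
A minimal sketch of the range idea, assuming fixed unit lengths of 7, 30, and 365 days. All names here (`AgeRange`, `toAgeRange`, `mayFallWithin`, `definitelyWithin`) are hypothetical and not part of the current implementation.

```typescript
type Unit = "week" | "month" | "year";

// Approximate unit lengths in days; these constants are assumptions for the sketch.
const DAYS_PER_UNIT: Record<Unit, number> = { week: 7, month: 30, year: 365 };

interface AgeRange {
  minDays: number; // inclusive lower bound
  maxDays: number; // exclusive upper bound
}

// "Hace 2 meses" -> [60, 90) days: n units is read as "at least n, less than n + 1".
function toAgeRange(n: number, unit: Unit): AgeRange {
  const d = DAYS_PER_UNIT[unit];
  return { minDays: n * d, maxDays: (n + 1) * d };
}

// Mirrors the rule in the comments above: include when the lower bound is below the cutoff.
function mayFallWithin(range: AgeRange, cutoffDays: number): boolean {
  return range.minDays < cutoffDays;
}

// Stricter variant: include only when the whole range lies inside the cutoff.
function definitelyWithin(range: AgeRange, cutoffDays: number): boolean {
  return range.maxDays <= cutoffDays;
}
```

With a 30-day cutoff, "Hace 4 semanas" maps to [28, 35) days, so it passes the lenient check but fails the strict one. That is the boundary case this option is meant to expose.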

Option 3: Add Buffer Zones

// Add a ~10% buffer to the cutoff date (30 days + 3 days).
// setMonth() truncates fractional arguments, so do the math in days.
const monthAgo = new Date();
monthAgo.setDate(monthAgo.getDate() - 33); // Include slight overlap
  • Catches boundary cases
  • Simple to implement
  • May include some false positives
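
A buffered cutoff can also be computed in milliseconds, which avoids fractional month arithmetic (`Date.prototype.setMonth` converts its argument to an integer, so a fractional offset is not applied proportionally). This is a sketch only; `cutoffWithBuffer`, `windowDays`, and `bufferRatio` are illustrative names, not existing settings.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the cutoff instant for a window of `windowDays`, widened by `bufferRatio`
// (0.1 means 10% extra). `now` is passed in so the result is reproducible.
function cutoffWithBuffer(windowDays: number, bufferRatio: number, now: Date): Date {
  const bufferedDays = windowDays * (1 + bufferRatio);
  return new Date(now.getTime() - bufferedDays * MS_PER_DAY);
}
```

For a 30-day window with a 10% buffer, the cutoff lands about 33 days before `now`.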

Recommendation

Keep current implementation (Option 1) because:

  1. Date strings are already approximate ("Hace 2 meses" vs exact date)
  2. Users expect "Last Month" to mean roughly 30 days, not an exact boundary
  3. Performance is better with simple date math
  4. The error margin is acceptable for review analytics

🎯 Filter Accuracy

With the fixed parsing, date filters now work correctly:

| Filter | Cutoff Date | Expected Coverage |
|------------|--------------|-------------------|
| Last Week | 7 days ago | ~0-3 reviews |
| Last Month | 30 days ago | ~5-10 reviews |
| Last Year | 365 days ago | ~30-50 reviews |
| All Time | No limit | All 244 reviews |
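
The cutoffs listed above could be applied with a filter along these lines. This is a sketch under assumptions: `parsedDate` is a hypothetical field holding the already-parsed review date, and `filterByRange` is an illustrative name rather than the actual frontend function.

```typescript
type DateRange = "week" | "month" | "year" | "all";

// Cutoffs in days; "all" has no cutoff.
const CUTOFF_DAYS: Record<Exclude<DateRange, "all">, number> = {
  week: 7,
  month: 30,
  year: 365,
};

const MS_IN_DAY = 24 * 60 * 60 * 1000;

// Hypothetical shape: a review whose date_text has already been parsed.
interface DatedReview {
  review_id: string;
  parsedDate: Date;
}

function filterByRange<T extends DatedReview>(reviews: T[], range: DateRange, now: Date): T[] {
  if (range === "all") return reviews;
  const cutoff = now.getTime() - CUTOFF_DAYS[range] * MS_IN_DAY;
  // Keep reviews dated on or after the cutoff instant.
  return reviews.filter((r) => r.parsedDate.getTime() >= cutoff);
}
```

Passing `now` explicitly keeps the filter deterministic, which makes the expected coverage figures easier to check against a fixed dataset.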

🔍 Additional Data Quality Notes

  1. Rating is numeric: Already a number (1-5), no parsing needed
  2. Duplicate review_ids: Some reviews share the same review_id, so the key was changed to ${index}-${review_id} to keep it unique
  3. Null text: Some reviews have text: null - handled with || 'No review text'
  4. Avatar URLs: Most reviews have avatar images (~90%+)
  5. Spanish language: All dates in Spanish, handled by parsing logic

📊 Type Safety Checklist

  • Review interface matches API response
  • Rating is number type (not string)
  • Date parsing extracts numbers correctly
  • Null values handled for text, avatar_url, profile_url
  • Timeline data points typed correctly
  • Date range type defined ('week' | 'month' | 'year' | 'all')

Status: FIXED

The date filtering now works correctly with proper number extraction from Spanish date strings. All data types are validated and match the API schema.