Review Data Structure Analysis
✅ Current Data Types (All Correct)
Based on analysis of scraped reviews from the API:
interface Review {
  author: string;             // ✓ string
  rating: number;             // ✓ number (not string!)
  text: string | null;        // ✓ string or null
  date_text: string;          // ✓ string (relative dates)
  avatar_url: string | null;  // ✓ string or null
  profile_url: string | null; // ✓ string or null
  review_id: string;          // ✓ string
}
All API data types match the TypeScript interface - no conversion needed!
🐛 Bug Found & Fixed
Issue: Date Parsing
Problem: The parseDateText() function called parseInt(text), which returns NaN for strings like "Hace 2 semanas" because they do not begin with a digit. The || 1 fallback then silently replaced NaN with 1. This caused:
- "Hace 2 semanas" (2 weeks ago) → parsed as 1 week ago ❌
- "Hace 6 años" (6 years ago) → parsed as 1 year ago ❌
- "Hace un año" (1 year ago) → parsed as 1 year ago ✓ (correct by accident)
Root cause: parseInt("Hace 2 semanas") = NaN, and NaN || 1 = 1
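The failure is easy to reproduce in isolation. The snippet below is a minimal sketch: buggyAmount is a hypothetical stand-in for the original expression inside parseDateText(), whose full body is not shown in this document.

```typescript
// Hypothetical stand-in for the original logic in parseDateText().
function buggyAmount(text: string): number {
  // parseInt() only reads digits at the start of the string (after optional
  // whitespace and sign), so "Hace ..." yields NaN and the || 1 fallback wins.
  return parseInt(text, 10) || 1;
}

const fromSpanish = buggyAmount("Hace 2 semanas"); // NaN || 1, so the 2 is lost
const fromDigitsFirst = buggyAmount("2 semanas");  // leading digits parse fine
```

The bug only shows up when the number is not the first token, which is why a digit-first test string would not have caught it.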
Fix: Added extractNumber() function that uses regex to extract the number:
function extractNumber(text: string): number {
  // Use the first run of digits, e.g. "Hace 2 semanas" -> 2
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0], 10);
  // No digits: covers Spanish "un"/"una" (one), e.g. "Hace un año".
  // Any other digit-free string also defaults to 1.
  return 1;
}
Verified Results
Date: "Hace 2 semanas" → 2026-01-04 ✓
Date: "Hace 2 meses" → 2025-11-18 ✓
Date: "Hace un año" → 2025-01-18 ✓
Date: "Hace 6 años" → 2020-01-18 ✓
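The results above are consistent with a reference date of 2026-01-18. As a rough sketch of how the conversion could work end to end, the function below turns a relative Spanish date into an absolute one. The name toAbsoluteDate, the injectable now parameter, and the UTC arithmetic are assumptions made for illustration, not the project's actual parseDateText() implementation.

```typescript
// Hypothetical sketch: convert a Spanish relative date into an absolute Date.
// `now` is passed in so results are reproducible in tests.
function toAbsoluteDate(text: string, now: Date): Date | null {
  const lower = text.normalize("NFC").toLowerCase();
  const digits = lower.match(/\d+/);
  const amount = digits ? parseInt(digits[0], 10) : 1; // "un"/"una" -> 1
  const result = new Date(now.getTime());

  if (lower.includes("semana")) {
    result.setUTCDate(result.getUTCDate() - 7 * amount);
  } else if (lower.includes("mes")) {
    result.setUTCMonth(result.getUTCMonth() - amount);
  } else if (lower.includes("año")) {
    result.setUTCFullYear(result.getUTCFullYear() - amount);
  } else {
    return null; // unit not recognised by this sketch
  }
  return result;
}

// Format as YYYY-MM-DD for comparison with the table above.
const isoDay = (d: Date | null): string | null =>
  d ? d.toISOString().slice(0, 10) : null;

const reference = new Date(Date.UTC(2026, 0, 18)); // assumed reference date
const twoWeeks = isoDay(toAbsoluteDate("Hace 2 semanas", reference));
const sixYears = isoDay(toAbsoluteDate("Hace 6 años", reference));
```

One caveat of month arithmetic with setUTCMonth(): when the target month is shorter than the current day of month (for example, one month back from March 31), the date rolls over into the following month.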
📅 Date Format Patterns Found
Standard Formats
- "Hace X semanas" - X weeks ago
- "Hace X meses" - X months ago
- "Hace X años" - X years ago
- "Hace un año" - 1 year ago (special case: "un" instead of "1")
Edited Review Format
- "Fecha de edición: Hace X meses" - Edited X months ago
Date Range Distribution (from 244 reviews)
- Last week: ~2 reviews
- Last month: ~5-7 reviews
- Last year: ~30-40 reviews
- 1-2 years: ~20-30 reviews
- 2+ years: ~150+ reviews
⚠️ Imprecision Considerations
Current Approach
Relative dates like "Hace 2 meses" are converted to exact dates (e.g., exactly 2 months ago from today).
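Limitation
- "Hace 2 meses" could mean anywhere from 2.0 to 2.99 months ago
- Treating it as exactly 2 months ago can therefore date a review up to ~30 days too recently; the error is one-sided, never in the "too old" direction
- The same applies to "Hace un año" (1.0 to 1.99 years), where the error can approach 12 months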
Potential Improvements
Option 1: Conservative Filtering (Current Implementation)
- Treat "Hace 2 meses" as exactly 2 months ago
- Simple and fast, but slightly overestimates recency (each review is dated at the newest end of its possible range)
- Status: ✓ Implemented
Option 2: Range-Based Filtering
// Consider "Hace 2 meses" as a range: [2 months, 3 months)
// Include in "last month" filter if lower bound < 1 month
- More accurate for boundary cases
- More complex implementation
- May include slightly older reviews
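A minimal sketch of Option 2 follows. The names ageRange and inWindow, and the approximation of a month as 30 days and a year as 365 days, are assumptions made for illustration.

```typescript
type Unit = "week" | "month" | "year";

// Approximate unit lengths in days (an assumption of this sketch).
const DAYS_PER_UNIT: Record<Unit, number> = { week: 7, month: 30, year: 365 };

interface AgeRange {
  minDays: number; // inclusive lower bound on the review's age
  maxDays: number; // exclusive upper bound on the review's age
}

// "Hace 2 meses" -> [60, 90): at least 2 months old, less than 3.
function ageRange(amount: number, unit: Unit): AgeRange {
  const days = DAYS_PER_UNIT[unit];
  return { minDays: amount * days, maxDays: (amount + 1) * days };
}

// Option 2 rule: keep the review if its lower bound falls inside the window.
function inWindow(range: AgeRange, windowDays: number): boolean {
  return range.minDays < windowDays;
}

const fourWeeks = ageRange(4, "week");          // [28, 35)
const keptInLastMonth = inWindow(fourWeeks, 30); // lower bound 28 < 30
```

Here "Hace 4 semanas" passes a 30-day window even though the review may be up to 35 days old, which is the "slightly older reviews" trade-off. Because the upper bound is kept, a stricter rule such as maxDays <= windowDays could be swapped in without changing the parsing.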
Option 3: Add Buffer Zones
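// Add a ~10% buffer to the cutoff date: 30 days + 3 days.
// (setMonth() drops the fractional part, so subtracting 1.1 months would not add a buffer.)
const monthAgo = new Date();
monthAgo.setDate(monthAgo.getDate() - 33); // Include slight overlap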
- Catches boundary cases
- Simple to implement
- May include some false positives
Recommendation
Keep current implementation (Option 1) because:
- Date strings are already approximate ("Hace 2 meses" vs exact date)
- Users expect "Last Month" to mean roughly 30 days, not exactly 30 days
- Performance is better with simple date math
- The error margin is acceptable for review analytics
🎯 Filter Accuracy
With the fixed parsing, date filters now work correctly:
| Filter | Cutoff Date | Expected Coverage |
|---|---|---|
| Last Week | 7 days ago | ~0-3 reviews |
| Last Month | 30 days ago | ~5-10 reviews |
| Last Year | 365 days ago | ~30-50 reviews |
| All Time | No limit | All 244 reviews |
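The cutoffs in the table map directly onto a small lookup. The sketch below assumes each review already carries a parsed date; filterByRange, CUTOFF_DAYS, and the date field are hypothetical names, not taken from the project's code.

```typescript
type DateRange = "week" | "month" | "year" | "all";

// Cutoffs from the table, in days; null means no limit.
const CUTOFF_DAYS: Record<DateRange, number | null> = {
  week: 7,
  month: 30,
  year: 365,
  all: null,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep reviews whose parsed date is on or after the cutoff for the range.
function filterByRange<T extends { date: Date }>(
  reviews: T[],
  range: DateRange,
  now: Date,
): T[] {
  const days = CUTOFF_DAYS[range];
  if (days === null) return reviews.slice(); // "all": copy without filtering
  const cutoff = now.getTime() - days * MS_PER_DAY;
  return reviews.filter((review) => review.date.getTime() >= cutoff);
}

// Usage with a fixed reference date so the outcome is reproducible.
const now = new Date(Date.UTC(2026, 0, 18));
const sample = [
  { review_id: "a", date: new Date(Date.UTC(2026, 0, 15)) },
  { review_id: "b", date: new Date(Date.UTC(2026, 0, 4)) },
  { review_id: "c", date: new Date(Date.UTC(2025, 10, 18)) },
  { review_id: "d", date: new Date(Date.UTC(2020, 0, 18)) },
];
const lastMonth = filterByRange(sample, "month", now);
```

Passing now explicitly keeps the filter deterministic, which makes boundary cases straightforward to test.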
🔍 Additional Data Quality Notes
- Rating is numeric: Already a number (1-5), no parsing needed
- Duplicate review_ids: Some reviews share the same review_id, hence the key change to ${index}-${review_id}
- Null text: Some reviews have text: null - handled with || 'No review text'
- Avatar URLs: Most reviews have avatar images (~90%+)
- Spanish language: All dates in Spanish, handled by parsing logic
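The duplicate-key and null-text handling noted above can be sketched as two small helpers. The names reviewKey and displayText, and the trimmed-down ReviewLike type, are hypothetical and used only for illustration.

```typescript
// Only the fields these helpers need (a subset of the Review interface).
interface ReviewLike {
  review_id: string;
  text: string | null;
}

// Prefix the list index so reviews sharing a review_id still get unique keys.
function reviewKey(review: ReviewLike, index: number): string {
  return `${index}-${review.review_id}`;
}

// Fall back to a placeholder when the review has no text.
// Note: || also replaces an empty string, not just null.
function displayText(review: ReviewLike): string {
  return review.text || "No review text";
}

const duplicates: ReviewLike[] = [
  { review_id: "abc", text: "Muy bueno" },
  { review_id: "abc", text: null },
];
const keys = duplicates.map(reviewKey);
const texts = duplicates.map(displayText);
```

Index-based keys are only stable while the list order is stable; if reviews are re-sorted or filtered, the same review can receive a different key.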
📊 Type Safety Checklist
- Review interface matches API response
- Rating is number type (not string)
- Date parsing extracts numbers correctly
- Null values handled for text, avatar_url, profile_url
- Timeline data points typed correctly
- Date range type defined ('week' | 'month' | 'year' | 'all')
✨ Status: FIXED
The date filtering now works correctly with proper number extraction from Spanish date strings. All data types are validated and match the API schema.