whyrating-engine-legacy/DATA_STRUCTURE_ANALYSIS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00


# Review Data Structure Analysis
## ✅ Current Data Types (All Correct)
Based on analysis of scraped reviews from the API:
```typescript
interface Review {
  author: string;             // ✓ string
  rating: number;             // ✓ number (not string!)
  text: string | null;        // ✓ string or null
  date_text: string;          // ✓ string (relative dates)
  avatar_url: string | null;  // ✓ string or null
  profile_url: string | null; // ✓ string or null
  review_id: string;          // ✓ string
}
```
**All API data types match the TypeScript interface - no conversion needed!**
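TypeScript interfaces are erased at compile time, so the match above holds only for the responses that were sampled. A minimal sketch of a runtime check follows, assuming the API returns plain JSON objects. `isReview` is a hypothetical helper for illustration and is not part of the existing codebase.

```typescript
interface Review {
  author: string;
  rating: number;
  text: string | null;
  date_text: string;
  avatar_url: string | null;
  profile_url: string | null;
  review_id: string;
}

// Hypothetical helper: checks an unknown value against the Review shape at runtime.
function isReview(value: unknown): value is Review {
  if (typeof value !== 'object' || value === null) return false;
  const r = value as Record<string, unknown>;
  const isStringOrNull = (v: unknown): boolean => v === null || typeof v === 'string';
  return (
    typeof r['author'] === 'string' &&
    typeof r['rating'] === 'number' &&
    isStringOrNull(r['text']) &&
    typeof r['date_text'] === 'string' &&
    isStringOrNull(r['avatar_url']) &&
    isStringOrNull(r['profile_url']) &&
    typeof r['review_id'] === 'string'
  );
}
```

A guard like this would reject a response in which `rating` arrives as the string `"5"`, which is the mismatch the interface comment warns about.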
## 🐛 Bug Found & Fixed
### Issue: Date Parsing
**Problem:** The `parseDateText()` function used `parseInt(text)` which returns `NaN` for strings like "Hace 2 semanas", then defaulted to `1` via `|| 1`. This caused:
- "Hace 2 semanas" (2 weeks ago) → parsed as **1 week ago**
- "Hace 6 años" (6 years ago) → parsed as **1 year ago**
- "Hace un año" (1 year ago) → parsed as **1 year ago** ✓ (correct by accident)
**Root cause:** `parseInt("Hace 2 semanas")` = `NaN`, and `NaN || 1` = `1`
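
The failure is easy to reproduce in isolation. A minimal sketch (the variable names are illustrative only):

```typescript
// parseInt() reads digits only from the start of the string
// (after optional whitespace and a sign), so a leading word yields NaN.
const fromSpanishLabel = parseInt('Hace 2 semanas', 10); // NaN
const buggyAmount = fromSpanishLabel || 1;               // NaN is falsy, so this is 1

// For contrast, the same call succeeds when the digits come first.
const fromLeadingDigits = parseInt('2 semanas', 10);     // 2
```
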
**Fix:** Added an `extractNumber()` function that uses a regex to pull the first run of digits out of the string, falling back to `1` when none is present:
```typescript
function extractNumber(text: string): number {
  // Use the first run of digits anywhere in the string, e.g. "Hace 2 semanas" → 2
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0], 10);
  // No digits: Spanish "un"/"una" (one), e.g. "Hace un año".
  // Any other digit-free string also falls back to 1.
  return 1;
}
```
### Verified Results
```
Date: "Hace 2 semanas" → 2026-01-04 ✓
Date: "Hace 2 meses" → 2025-11-18 ✓
Date: "Hace un año" → 2025-01-18 ✓
Date: "Hace 6 años" → 2020-01-18 ✓
```
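
The results above are consistent with a reference date of 2026-01-18. The body of `parseDateText()` is not shown in this document, so the following is only a sketch of how such a conversion could work. It assumes UTC calendar arithmetic and unit detection by substring; `parseDateTextSketch`, `extractAmount`, and `referenceDate` are hypothetical names.

```typescript
// Sketch only: not the project's actual parseDateText().
function extractAmount(text: string): number {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1; // "un"/"una" carry no digits, so default to 1
}

function parseDateTextSketch(dateText: string, referenceDate: Date): Date {
  const amount = extractAmount(dateText);
  const result = new Date(referenceDate.getTime()); // copy; do not mutate the caller's date
  const lower = dateText.toLowerCase();
  if (lower.includes('semana')) {
    result.setUTCDate(result.getUTCDate() - amount * 7);
  } else if (lower.includes('mes')) {
    result.setUTCMonth(result.getUTCMonth() - amount);
  } else if (lower.includes('año')) {
    result.setUTCFullYear(result.getUTCFullYear() - amount);
  }
  return result;
}

const reference = new Date(Date.UTC(2026, 0, 18));
const isoDay = (d: Date): string => d.toISOString().slice(0, 10);
const example = isoDay(parseDateTextSketch('Hace 2 semanas', reference)); // "2026-01-04"
```

One caveat with calendar arithmetic: `setUTCMonth()` rolls over when the target month is shorter than the current day of month (for example, going back one month from March 31), so results near month ends can land a few days later than expected.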
## 📅 Date Format Patterns Found
### Standard Formats
- `"Hace X semanas"` - X weeks ago
- `"Hace X meses"` - X months ago
- `"Hace X años"` - X years ago
- `"Hace un año"` - 1 year ago (special case: "un" instead of "1")
### Edited Review Format
- `"Fecha de edición: Hace X meses"` - Edited X months ago
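
The edited prefix contains no digits, so digit extraction is unaffected by it. If the edited state ever needs to be surfaced separately, the prefix could be split off first. The sketch below is an assumption about how that might look; `splitEditedPrefix` and `EDITED_PREFIX` are hypothetical names, and this document does not say whether the existing code tracks an edited flag.

```typescript
const EDITED_PREFIX = 'Fecha de edición:';

// Hypothetical helper: separates the "edited" marker from the relative date.
function splitEditedPrefix(dateText: string): { edited: boolean; relative: string } {
  const trimmed = dateText.trim();
  if (trimmed.startsWith(EDITED_PREFIX)) {
    return { edited: true, relative: trimmed.slice(EDITED_PREFIX.length).trim() };
  }
  return { edited: false, relative: trimmed };
}
```
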
### Date Range Distribution (from 244 reviews)
- **Last week:** ~2 reviews
- **Last month:** ~5-7 reviews
- **Last year:** ~30-40 reviews
- **1-2 years:** ~20-30 reviews
- **2+ years:** ~150+ reviews
## ⚠️ Imprecision Considerations
### Current Approach
Relative dates like "Hace 2 meses" are converted to **exact dates** (e.g., exactly 2 months ago from today).
### Limitation
- "Hace 2 meses" could mean anywhere from 2.0 to 2.99 months ago
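- Treating it as exactly 2 months ago can therefore date a review up to ~30 days too recently. The error is one-sided; it would be ~±15 days only if the midpoint (2.5 months) were used instead
- The same applies to "Hace un año" (1.0 to 1.99 years), where the error can approach 12 months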
### Potential Improvements
#### Option 1: Conservative Filtering (Current Implementation)
- Treat "Hace 2 meses" as exactly 2 months ago
- Simple and fast, but dates each review at the most recent point its label allows, so recency is somewhat overestimated
- **Status: ✓ Implemented**
#### Option 2: Range-Based Filtering
```typescript
// Consider "Hace 2 meses" as a range: [2 months, 3 months)
// Include in "last month" filter if lower bound < 1 month
```
- More accurate for boundary cases
- More complex implementation
- May include slightly older reviews
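
A minimal sketch of what a range-based check could look like, assuming the label is floored (so "Hace 2 meses" covers 2 months up to, but not including, 3 months ago). All names here (`Unit`, `PossibleRange`, `shiftBack`, `possibleRange`, `possiblyWithin`, `certainlyWithin`) are hypothetical.

```typescript
type Unit = 'week' | 'month' | 'year';
interface PossibleRange { oldest: Date; newest: Date }

// Go back n units from `now` using UTC calendar arithmetic.
function shiftBack(now: Date, n: number, unit: Unit): Date {
  const d = new Date(now.getTime());
  if (unit === 'week') d.setUTCDate(d.getUTCDate() - n * 7);
  else if (unit === 'month') d.setUTCMonth(d.getUTCMonth() - n);
  else d.setUTCFullYear(d.getUTCFullYear() - n);
  return d;
}

// Assumption: "Hace <amount> <unit>" spans [amount, amount + 1) units ago.
function possibleRange(amount: number, unit: Unit, now: Date): PossibleRange {
  return { oldest: shiftBack(now, amount + 1, unit), newest: shiftBack(now, amount, unit) };
}

// True if at least part of the range falls on or after the cutoff.
function possiblyWithin(range: PossibleRange, cutoff: Date): boolean {
  return range.newest.getTime() >= cutoff.getTime();
}

// True only if the whole range falls on or after the cutoff.
function certainlyWithin(range: PossibleRange, cutoff: Date): boolean {
  return range.oldest.getTime() >= cutoff.getTime();
}
```

Under the flooring assumption, `possiblyWithin` uses the same bound as the point estimate in Option 1, so what the range adds in practice is the stricter `certainlyWithin` test and the ability to tell the two cases apart.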
#### Option 3: Add Buffer Zones
```typescript
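// Add a 10% buffer to the cutoff. Work in days: setMonth() truncates
// fractional arguments, so subtracting 1.1 months would behave like 1.
const monthAgo = new Date();
monthAgo.setDate(monthAgo.getDate() - Math.round(30 * 1.1)); // 33 days back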
```
- Catches boundary cases
- Simple to implement
- May include some false positives
### Recommendation
**Keep current implementation** (Option 1) because:
1. Date strings are already approximate ("Hace 2 meses" vs exact date)
2. Users expect "Last Month" to mean roughly 30 days, not exactly
3. Performance is better with simple date math
4. The error margin is acceptable for review analytics
## 🎯 Filter Accuracy
With the fixed parsing, date filters now work correctly:
| Filter | Cutoff Date | Expected Coverage |
|--------|------------|------------------|
| Last Week | 7 days ago | ~0-3 reviews |
| Last Month | 30 days ago | ~5-10 reviews |
| Last Year | 365 days ago | ~30-50 reviews |
| All Time | No limit | All 244 reviews |
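
The cutoffs in the table can be sketched as a small filter. The `DateRange` union matches the type listed in the checklist in this document; everything else (`RANGE_DAYS`, `cutoffFor`, `filterByRange`, and the `date` field holding an already parsed date) is a hypothetical shape chosen for illustration.

```typescript
type DateRange = 'week' | 'month' | 'year' | 'all';

const RANGE_DAYS: Record<Exclude<DateRange, 'all'>, number> = { week: 7, month: 30, year: 365 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the oldest date a filter accepts, or null when there is no limit.
function cutoffFor(range: DateRange, now: Date): Date | null {
  if (range === 'all') return null;
  return new Date(now.getTime() - RANGE_DAYS[range] * MS_PER_DAY);
}

// Keeps items whose parsed date is on or after the cutoff.
function filterByRange<T extends { date: Date }>(items: T[], range: DateRange, now: Date): T[] {
  const cutoff = cutoffFor(range, now);
  if (cutoff === null) return items;
  return items.filter((item) => item.date.getTime() >= cutoff.getTime());
}
```

If relative dates are converted with calendar-month arithmetic, "Hace un mes" can land 28 to 31 days back depending on the month, so a fixed 30-day cutoff may exclude it on some days. Whether that matters depends on how the real parser computes months.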
## 🔍 Additional Data Quality Notes
1. **Rating is numeric:** Already a number (1-5), no parsing needed
2. **Duplicate review_ids:** Some reviews share the same `review_id`, so list keys were changed to `${index}-${review_id}` to keep them unique
3. **Null text:** Some reviews have `text: null` - handled with `|| 'No review text'`
4. **Avatar URLs:** Most reviews have avatar images (~90%+)
5. **Spanish language:** All dates in Spanish, handled by parsing logic
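
Notes 2 and 3 can be sketched as two small helpers. `reviewKey` and `displayText` are hypothetical names; only the key template and the `|| 'No review text'` fallback come from the notes above.

```typescript
// Combines list position with review_id so duplicate ids still yield unique keys.
function reviewKey(index: number, reviewId: string): string {
  return `${index}-${reviewId}`;
}

// Falls back to a placeholder when the review has no text.
// Note: || also replaces an empty string, whereas ?? would replace only null/undefined.
function displayText(text: string | null): string {
  return text || 'No review text';
}
```

Index-based keys are only stable while the list order does not change, which is worth keeping in mind if reviews are later sorted or filtered on the client.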
## 📊 Type Safety Checklist
- [x] Review interface matches API response
- [x] Rating is number type (not string)
- [x] Date parsing extracts numbers correctly
- [x] Null values handled for text, avatar_url, profile_url
- [x] Timeline data points typed correctly
- [x] Date range type defined ('week' | 'month' | 'year' | 'all')
## ✨ Status: FIXED
The date filtering now works correctly with proper number extraction from Spanish date strings. All data types are validated and match the API schema.