# Review Data Structure Analysis

## ✅ Current Data Types (All Correct)

Based on analysis of scraped reviews from the API:

```typescript
interface Review {
  author: string;              // ✓ string
  rating: number;              // ✓ number (not string!)
  text: string | null;         // ✓ string or null
  date_text: string;           // ✓ string (relative dates)
  avatar_url: string | null;   // ✓ string or null
  profile_url: string | null;  // ✓ string or null
  review_id: string;           // ✓ string
}
```

**All API data types match the TypeScript interface, so no conversion is needed!**

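
To check this match at runtime rather than by inspection, a type guard along the following lines could be used. This is a minimal sketch: `isReview` and `isStringOrNull` are hypothetical helpers that do not appear in the existing code, and the guard only checks field types, not value ranges.

```typescript
interface Review {
  author: string;
  rating: number;
  text: string | null;
  date_text: string;
  avatar_url: string | null;
  profile_url: string | null;
  review_id: string;
}

// Hypothetical helper: true for a string or an explicit null (not undefined).
function isStringOrNull(value: unknown): value is string | null {
  return value === null || typeof value === "string";
}

// Hypothetical runtime guard that mirrors the Review interface field by field.
function isReview(value: unknown): value is Review {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.author === "string" &&
    typeof r.rating === "number" &&
    isFinite(r.rating) && // rejects NaN and Infinity
    isStringOrNull(r.text) &&
    typeof r.date_text === "string" &&
    isStringOrNull(r.avatar_url) &&
    isStringOrNull(r.profile_url) &&
    typeof r.review_id === "string"
  );
}
```

Running such a guard over a scraped batch would flag any record where, for example, `rating` arrives as the string `"5"` instead of the number `5`.
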
## 🐛 Bug Found & Fixed

### Issue: Date Parsing

**Problem:** The `parseDateText()` function called `parseInt(text)` on the whole string. `parseInt` returns `NaN` for strings such as "Hace 2 semanas" because they do not start with a digit, and the `|| 1` fallback then turned every `NaN` into `1`. This caused:

- "Hace 2 semanas" (2 weeks ago) → parsed as **1 week ago** ❌
- "Hace 6 años" (6 years ago) → parsed as **1 year ago** ❌
- "Hace un año" (1 year ago) → parsed as **1 year ago** ✓ (correct by accident)

**Root cause:** `parseInt("Hace 2 semanas")` = `NaN`, and `NaN || 1` = `1`

**Fix:** Added an `extractNumber()` function that uses a regex to extract the number:

```typescript
function extractNumber(text: string): number {
  // Use the first run of digits anywhere in the string, e.g. "Hace 2 semanas" → 2
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0], 10);
  // No digits: Spanish "un/una" (one), e.g. "Hace un año".
  // Any other digit-free string also falls back to 1.
  return 1;
}
```

### Verified Results

All four results are consistent with a reference date of 2026-01-18:

```
Date: "Hace 2 semanas" → 2026-01-04 ✓
Date: "Hace 2 meses"   → 2025-11-18 ✓
Date: "Hace un año"    → 2025-01-18 ✓
Date: "Hace 6 años"    → 2020-01-18 ✓
```

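
For context, the conversion from a relative string to a date could look roughly like the sketch below. The real `parseDateText()` is not shown in this document, so the keyword matching on `semana`, `mes`, and `año`, the injectable `now` parameter, and the `formatYmd` helper are all assumptions made for illustration.

```typescript
// Same idea as extractNumber(): first run of digits, else 1 ("un"/"una").
function extractNumber(text: string): number {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1;
}

// Hypothetical sketch of the conversion. `now` is a parameter so that
// results are reproducible in tests.
function parseDateText(text: string, now: Date = new Date()): Date | null {
  const lower = text.toLowerCase();
  const n = extractNumber(lower);
  const date = new Date(now.getTime());

  if (/semana/.test(lower)) {
    date.setDate(date.getDate() - 7 * n); // weeks
  } else if (/mes/.test(lower)) {
    date.setMonth(date.getMonth() - n); // months
  } else if (/año/.test(lower)) {
    date.setFullYear(date.getFullYear() - n); // years
  } else {
    return null; // unit not covered by this sketch
  }
  return date;
}

// Format a Date as YYYY-MM-DD using local-time fields.
function formatYmd(d: Date): string {
  const pad = (v: number): string => (v < 10 ? "0" : "") + v;
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

const ref = new Date(2026, 0, 18); // 18 January 2026, local time
["Hace 2 semanas", "Hace 2 meses", "Hace un año", "Hace 6 años"].forEach((s) => {
  const parsed = parseDateText(s, ref);
  console.log(s, "→", parsed ? formatYmd(parsed) : "unparsed");
});
```

With `ref` fixed to 18 January 2026, the four strings map to the same dates listed in the verified results. One caveat of this approach: `setMonth()` rolls over when the target month is shorter than the current day of the month (31 March minus one month lands in early March), which adds a few days of error on top of the inherent imprecision of relative dates.
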
## 📅 Date Format Patterns Found

### Standard Formats

- `"Hace X semanas"` - X weeks ago
- `"Hace X meses"` - X months ago
- `"Hace X años"` - X years ago
- `"Hace un año"` - 1 year ago (special case: "un" instead of "1")

### Edited Review Format

- `"Fecha de edición: Hace X meses"` - Edited X months ago

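
If the edited marker needs to be surfaced separately from the relative date, the prefix can be split off before parsing. The sketch below is illustrative only: `splitEditedPrefix` and the `ParsedDateText` shape are hypothetical names, and the regex assumes the prefix always appears exactly as in the pattern above.

```typescript
// Hypothetical result shape: whether the review was edited, plus the
// remaining relative-date text.
interface ParsedDateText {
  edited: boolean;
  relative: string;
}

// Matches the "Fecha de edición:" prefix, ignoring case and surrounding spaces.
const EDITED_PREFIX = /^\s*Fecha de edición:\s*/i;

function splitEditedPrefix(dateText: string): ParsedDateText {
  const edited = EDITED_PREFIX.test(dateText);
  return { edited, relative: dateText.replace(EDITED_PREFIX, "").trim() };
}

console.log(splitEditedPrefix("Fecha de edición: Hace 3 meses"));
console.log(splitEditedPrefix("Hace 2 semanas"));
```

Because `extractNumber()` searches for digits anywhere in the string, number extraction already works on edited reviews without this step; the split only matters if the UI wants to show an "edited" badge.
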
### Date Range Distribution (from 244 reviews)

- **Last week:** ~2 reviews
- **Last month:** ~5-7 reviews
- **Last year:** ~30-40 reviews
- **1-2 years:** ~20-30 reviews
- **2+ years:** ~150+ reviews

## ⚠️ Imprecision Considerations

### Current Approach

Relative dates like "Hace 2 meses" are converted to **exact dates** (e.g., exactly 2 months before the current date).

### Limitation

- "Hace 2 meses" could mean anywhere from 2.0 to 2.99 months ago
- Treating it as exactly 2 months can therefore place a review up to about 30 days too recent, which matters near filter cutoffs
- "Hace un año" has the same problem at a larger scale (could be 1.0 to 1.99 years)

### Potential Improvements

#### Option 1: Conservative Filtering (Current Implementation)

- Treat "Hace 2 meses" as exactly 2 months ago
- Simple and fast, but slightly overestimates recency (the parsed date is the most recent date the string allows)
- **Status: ✓ Implemented**

#### Option 2: Range-Based Filtering

```typescript
// Consider "Hace 2 meses" as a range: [2 months, 3 months)
// Include in "last month" filter if lower bound < 1 month
```

- More accurate for boundary cases
- More complex implementation
- May include slightly older reviews

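
A rough sketch of what range-based filtering could look like is shown below. It is not part of the current implementation: the names `ageRange` and `classify`, the three-way result, and the day counts per unit (7, 30, 365) are assumptions chosen for illustration. Note that under the floor assumption described in the Limitation section, the lower bound of the range coincides with the Option 1 point estimate, so the main gain is being able to tell definite matches from borderline ones.

```typescript
type Unit = "week" | "month" | "year";

// Approximate unit lengths in days (assumption for this sketch).
const DAYS_PER_UNIT: Record<Unit, number> = { week: 7, month: 30, year: 365 };

// Age of a review in days, as a half-open interval [minDays, maxDays).
interface AgeRange {
  minDays: number;
  maxDays: number;
}

// "Hace 2 meses" → count = 2, unit = "month" → [60, 90)
function ageRange(count: number, unit: Unit): AgeRange {
  const d = DAYS_PER_UNIT[unit];
  return { minDays: count * d, maxDays: (count + 1) * d };
}

type Match = "inside" | "maybe" | "outside";

// Compare an age range against a filter window of `cutoffDays` days.
function classify(range: AgeRange, cutoffDays: number): Match {
  if (range.maxDays <= cutoffDays) return "inside"; // whole range within the window
  if (range.minDays < cutoffDays) return "maybe"; // range straddles the cutoff
  return "outside"; // whole range is older than the window
}

console.log(classify(ageRange(2, "week"), 30)); // range [14, 21)
console.log(classify(ageRange(4, "week"), 30)); // range [28, 35)
console.log(classify(ageRange(2, "month"), 30)); // range [60, 90)
```

A filter could then include `"inside"` results unconditionally and decide per use case whether `"maybe"` results count.
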
#### Option 3: Add Buffer Zones

```typescript
// Add a 10% buffer to the cutoff: 30 days becomes 33 days.
// Work in days because setMonth() truncates fractional months.
const monthAgo = new Date();
monthAgo.setDate(monthAgo.getDate() - Math.round(30 * 1.1)); // Include slight overlap
```

- Catches boundary cases
- Simple to implement
- May include some false positives

### Recommendation

**Keep current implementation** (Option 1) because:

1. Date strings are already approximate ("Hace 2 meses" vs exact date)
2. Users expect "Last Month" to mean roughly 30 days, not exactly
3. Performance is better with simple date math
4. The error margin is acceptable for review analytics

## 🎯 Filter Accuracy

With the fixed parsing, date filters now work correctly:

| Filter     | Cutoff Date  | Expected Coverage |
|------------|--------------|-------------------|
| Last Week  | 7 days ago   | ~0-3 reviews      |
| Last Month | 30 days ago  | ~5-10 reviews     |
| Last Year  | 365 days ago | ~30-50 reviews    |
| All Time   | No limit     | All 244 reviews   |

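
The cutoffs in the table can be expressed as a small filter function. The sketch below is one possible shape, not the existing implementation: `cutoffDate`, `filterByRange`, and the `parsedDate` field are hypothetical names, and treating a review that falls exactly on the cutoff as included is an assumption.

```typescript
type DateRange = "week" | "month" | "year" | "all";

// Cutoffs in days, taken from the table above.
const CUTOFF_DAYS: Record<Exclude<DateRange, "all">, number> = {
  week: 7,
  month: 30,
  year: 365,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the oldest date still inside the window, or null for "all".
function cutoffDate(range: DateRange, now: Date = new Date()): Date | null {
  if (range === "all") return null;
  return new Date(now.getTime() - CUTOFF_DAYS[range] * MS_PER_DAY);
}

// Keeps the reviews whose parsed date is on or after the cutoff.
// `parsedDate` is a hypothetical field holding the result of date parsing.
function filterByRange<T extends { parsedDate: Date }>(
  reviews: T[],
  range: DateRange,
  now: Date = new Date()
): T[] {
  const cutoff = cutoffDate(range, now);
  if (cutoff === null) return reviews.slice();
  const cutoffMs = cutoff.getTime();
  return reviews.filter((r) => r.parsedDate.getTime() >= cutoffMs);
}
```

Passing `now` explicitly keeps the function deterministic, which makes it straightforward to test against fixed dates such as the ones in the verified results.
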
## 🔍 Additional Data Quality Notes

1. **Rating is numeric:** Already a number (1-5), no parsing needed
2. **Duplicate review_ids:** Some reviews share the same `review_id`, so the key was changed to `${index}-${review_id}` to keep keys unique
3. **Null text:** Some reviews have `text: null`, handled with `|| 'No review text'`
4. **Avatar URLs:** Most reviews have avatar images (roughly 90% or more)
5. **Spanish language:** All dates are in Spanish and are handled by the parsing logic

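
Notes 2 and 3 can be illustrated with two small helpers. The function names `reviewKey` and `displayText` are hypothetical and chosen for this sketch; only the key template and the `||` fallback come from the notes above.

```typescript
// Builds a key that stays unique even when two reviews share a review_id.
function reviewKey(review: { review_id: string }, index: number): string {
  return `${index}-${review.review_id}`;
}

// Falls back to a placeholder when the review has no text.
// Note: `||` also replaces an empty string, not only null.
function displayText(review: { text: string | null }): string {
  return review.text || "No review text";
}

const reviews = [
  { review_id: "abc", text: "Muy bueno" },
  { review_id: "abc", text: null }, // duplicate id, no text
];

reviews.forEach((r, i) => {
  console.log(reviewKey(r, i), displayText(r));
});
```

If empty strings should be shown as-is, `review.text ?? "No review text"` would replace only `null` and `undefined`. Index-based keys also change when the list is reordered or filtered, which is worth keeping in mind if the rendering layer relies on stable keys.
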
## 📊 Type Safety Checklist

- [x] Review interface matches API response
- [x] Rating is number type (not string)
- [x] Date parsing extracts numbers correctly
- [x] Null values handled for text, avatar_url, profile_url
- [x] Timeline data points typed correctly
- [x] Date range type defined ('week' | 'month' | 'year' | 'all')

## ✨ Status: FIXED

The date filtering now works correctly with proper number extraction from Spanish date strings. All data types are validated and match the API schema.