whyrating-engine-legacy/GOOGLE_DATE_FORMAT_SPECIFICATION.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
2026-01-18 19:49:24 +00:00

# Google Maps Date Format Specification
## Reverse-Engineered from 244 Reviews (English Locale)
**Date:** 2026-01-18
**Source:** Google Maps Reviews (hl=en)
**Library:** Google Internal (not moment.js, date-fns, or dayjs)
---
## 📋 Complete Pattern Catalog
### Discovered Patterns (31 unique formats)
```
Standard Formats:
- a month ago
- a year ago
- 2 weeks ago, 3 weeks ago
- 2-11 months ago
- 2-11 years ago
Edited Variants:
- Edited 2 weeks ago
- Edited 3 months ago
- Edited a year ago
- Edited 2-11 years ago
```
---
## 🔬 Google's Algorithm (Reverse-Engineered)
### Pattern Structure
```
Singular: "a {unit} ago"
Plural: "{number} {unit}s ago"
Edited: "Edited {pattern}"
```
**Key Rules:**
1. Google NEVER shows "1 month ago" - always "a month ago"
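2. Weeks: Only "2 weeks" and "3 weeks" were observed (never "1 week" or "4 weeks"; by rule 1, a 7-13 day old review would presumably read "a week ago")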
3. Months: 2-11 months (no "1 month" or "12 months")
4. Years: "a year" then 2-11 years
---
## ⏱️ Time Range Boundaries
### Unit Thresholds (Estimated)
| From | To | Unit Displayed | Example |
|------|-----|----------------|---------|
| 0s | 59s | seconds | "30 seconds ago" |
| 1min | 59min | minutes | "45 minutes ago" |
| 1h | 23h | hours | "12 hours ago" |
| 1d | 6d | days | "5 days ago" |
| 7d | 27d | weeks | "2 weeks ago", "3 weeks ago" |
| 28d | 59d | month (singular) | "a month ago" |
| 60d | 364d | months (plural) | "2 months ago" ... "11 months ago" |
| 365d | 729d | year (singular) | "a year ago" |
| 730d | ∞ | years (plural) | "2 years ago" ... "11 years ago" |
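
The estimated thresholds above can be read as a forward formatter. The following is a minimal sketch of that reading, not Google's code: the function name `formatRelativeAge` is hypothetical, sub-day units are omitted, the "a day ago" and "a week ago" forms are assumed from the singular rule rather than observed, and clamping to 11 months is an assumption made so that "12 months ago" is never produced.

```javascript
// Sketch of the estimated threshold table (days and larger units only).
// Assumption: units are floored, 30 days per month, 365 days per year.
function formatRelativeAge(ageDays) {
  // Singular uses "a {unit} ago", plural uses "{number} {unit}s ago".
  const label = (n, unit) => (n === 1 ? `a ${unit} ago` : `${n} ${unit}s ago`);
  const days = Math.floor(ageDays);

  if (days < 1) return null; // seconds/minutes/hours are not modeled here
  if (days < 7) return label(days, "day");
  if (days < 28) return label(Math.floor(days / 7), "week");
  if (days < 60) return "a month ago";
  if (days < 365) {
    // Clamp so that 360-364 days still reads "11 months ago" (assumption).
    return label(Math.min(Math.floor(days / 30), 11), "month");
  }
  return label(Math.floor(days / 365), "year");
}

console.log(formatRelativeAge(34));  // "a month ago"
console.log(formatRelativeAge(75));  // "2 months ago"
console.log(formatRelativeAge(730)); // "2 years ago"
```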
### Observed Ranges from 244 Reviews
| Unit | Values Found | Range |
|------|--------------|-------|
| Weeks | [2, 3] | 2-3 weeks |
| Months | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] | 2-11 months |
| Years | [2, 3, 4, 5, 6, 7, 8, 9, 10, 11] | 2-11 years |
**Note:** No reviews with seconds/minutes/hours/days in this dataset (all reviews were older than 2 weeks)
---
## 📊 Uncertainty Analysis
### Why Dates Are Imprecise
Google Maps shows relative dates that are **rounded down to the largest unit**:
```
Review posted: December 15, 2025
Viewed on: January 18, 2026
Actual age: 34 days
Google shows: "a month ago"
Actual range: 30-59 days (±15 days uncertainty)
```
### Uncertainty by Unit
| Pattern | Actual Range | Uncertainty | Example |
|---------|--------------|-------------|---------|
| "a month ago" | 30-59 days | ±15 days | Could be 30 or 59 days old |
| "2 months ago" | 60-89 days | ±15 days | Could be 60 or 89 days old |
| "3 months ago" | 90-119 days | ±15 days | Could be 90 or 119 days old |
| "a year ago" | 365-729 days | ±182 days (6 months!) | Could be 1 or 2 years old |
| "2 years ago" | 730-1094 days | ±182 days | Could be 2 or 3 years old |
### Maximum Uncertainty
- **Months:** ±15 days (~50% of a month)
- **Years:** ±6 months (~25% of 2 years)
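
The ranges in the table follow from simple arithmetic on the unit length. The sketch below shows one way to compute them, assuming 7, 30, and 365 days per week, month, and year; `ageRangeDays` and `DAYS_PER_UNIT` are hypothetical names, and the simplification gives "11 months ago" a range of 330-359 days rather than stretching it to the 364-day threshold.

```javascript
// Assumed unit lengths in days (the document's estimates, not Google's values).
const DAYS_PER_UNIT = { day: 1, week: 7, month: 30, year: 365 };

// Return the age range in days covered by "{number} {unit}(s) ago".
// Because the label is rounded down, it covers one full unit of time.
function ageRangeDays(number, unit) {
  if (!Object.hasOwn(DAYS_PER_UNIT, unit)) {
    throw new RangeError(`unsupported unit: ${unit}`);
  }
  const len = DAYS_PER_UNIT[unit];
  const min = number * len;
  const max = (number + 1) * len - 1;
  // halfWidth is the "± uncertainty" around the midpoint.
  return { min, max, halfWidth: (max - min) / 2 };
}

console.log(ageRangeDays(1, "month")); // { min: 30, max: 59, halfWidth: 14.5 }
console.log(ageRangeDays(2, "year"));  // { min: 730, max: 1094, halfWidth: 182 }
```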
---
## 🎯 Recommended Parsing Strategy
### Option 1: Midpoint (Current Implementation)
**Treat as exact midpoint**
```javascript
const midpointAgeDays = {
  "a month ago": 45,   // midpoint of 30-59 days
  "2 months ago": 75,  // midpoint of 60-89 days
  "a year ago": 547    // midpoint of 365-729 days
};
```
✅ Simple to implement
✅ Statistically balanced
❌ Can be off by ±15 days (months) or ±6 months (years)
### Option 2: Conservative Lower Bound
**Assume oldest possible date**
```javascript
const oldestAgeDays = {
  "a month ago": 59,
  "2 months ago": 89,
  "a year ago": 729
};
```
✅ Ensures reviews are AT MOST this old
✅ Good for strict filters such as "show me reviews from last month" (only reviews that are certainly inside the window match)
❌ May exclude recent reviews
### Option 3: Optimistic Upper Bound
**Assume newest possible date**
```javascript
const newestAgeDays = {
  "a month ago": 30,
  "2 months ago": 60,
  "a year ago": 365
};
```
✅ Good for inclusive filters such as "show me reviews from last year" (any review that might fall inside the window matches)
❌ May include older reviews than expected
### Option 4: Range Filtering
**Store both bounds and filter inclusively**
```javascript
const review = { text: "a month ago", min_age: 30, max_age: 59 }; // ages in days
// Filter "Last Month" (30 days): include if the review could fall inside the window
const inLastMonth = review.min_age <= 30;
```
✅ Most accurate for filtering
✅ Accounts for all uncertainty
❌ More complex implementation
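
As a rough illustration of range filtering, the sketch below stores both bounds on each review and lets the caller choose between an inclusive match (the review might be inside the window) and a strict match (the review is certainly inside it). The function name `filterByWindow`, the `mode` parameter, and the field names `minAgeDays` and `maxAgeDays` are assumptions made for this example.

```javascript
// Each review carries both age bounds in days (hypothetical field names).
const reviews = [
  { text: "3 weeks ago", minAgeDays: 21, maxAgeDays: 27 },
  { text: "a month ago", minAgeDays: 30, maxAgeDays: 59 },
  { text: "2 months ago", minAgeDays: 60, maxAgeDays: 89 },
];

// "inclusive": keep reviews that MIGHT be inside the window (min bound fits).
// "strict":    keep reviews that are CERTAINLY inside it (max bound fits).
function filterByWindow(items, windowDays, mode = "inclusive") {
  return items.filter((r) =>
    mode === "inclusive" ? r.minAgeDays <= windowDays : r.maxAgeDays <= windowDays
  );
}

console.log(filterByWindow(reviews, 30).map((r) => r.text));
// [ '3 weeks ago', 'a month ago' ]
console.log(filterByWindow(reviews, 30, "strict").map((r) => r.text));
// [ '3 weeks ago' ]
```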
---
## 💡 Recommendation for Analytics Dashboard
### Use **Option 1 (Midpoint) + Grace Period**
```javascript
// calculateMidpoint() and calculateUncertainty() are helpers that are
// not defined in this section.
function parseDateWithGracePeriod(dateText, graceFactor = 0.2) {
  const midpoint = calculateMidpoint(dateText);
  const grace = calculateUncertainty(dateText) * graceFactor;
  return {
    date: midpoint,
    minDate: midpoint - grace,
    maxDate: midpoint + grace
  };
}

// Filter example:
// "Last Month" filter includes reviews where:
//   review.date >= (30 days ago - grace)
**Grace Period Values:**
- Weeks: ±0.7 days (20% of 3.5 days)
- Months: ±3 days (20% of 15 days)
- Years: ±36 days (20% of 182 days)
This provides a **buffer zone** to catch edge cases while maintaining statistical accuracy.
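
To make the effect of the buffer concrete, here is a small sketch of a window check expressed in ages (days) rather than dates. It assumes the midpoint and the ± uncertainty are already known for the label; `withinWindow` is a hypothetical helper, not part of the current implementation.

```javascript
// Keep a review when its midpoint age is inside the window,
// after widening the window by graceFactor * uncertainty.
function withinWindow(midpointDays, uncertaintyDays, windowDays, graceFactor = 0.2) {
  const grace = uncertaintyDays * graceFactor;
  return midpointDays <= windowDays + grace;
}

// "a year ago": midpoint 547 days, uncertainty ±182 days, grace ≈ 36 days.
console.log(withinWindow(547, 182, 520));    // true  (520 + 36.4 >= 547)
console.log(withinWindow(547, 182, 520, 0)); // false (no grace: 547 > 520)

// "a month ago": midpoint 45 days, uncertainty ±15 days, grace = 3 days.
console.log(withinWindow(45, 15, 30));       // false (45 > 33)
```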
---
## 🔧 Implementation Reference
### Complete Pattern Regex (English)
```javascript
const GOOGLE_DATE_PATTERNS = {
  // Singular; "an?" also accepts "an hour ago" (assumed English form, not seen in the dataset)
  singular: /^an? (second|minute|hour|day|week|month|year) ago$/,
  // Plural
  plural: /^(\d+) (seconds|minutes|hours|days|weeks|months|years) ago$/,
  // Edited variants
  edited_singular: /^Edited an? (second|minute|hour|day|week|month|year) ago$/,
  edited_plural: /^Edited (\d+) (seconds|minutes|hours|days|weeks|months|years) ago$/
};
```
### Extraction Function
```javascript
function extractNumberAndUnit(dateText) {
  // Remove "Edited " prefix
  const cleaned = dateText.replace(/^Edited\s+/i, '');
  // Check singular pattern ("an?" also covers "an hour ago")
  const singularMatch = cleaned.match(/^an? (\w+) ago$/);
  if (singularMatch) {
    return { number: 1, unit: singularMatch[1] };
  }
  // Check plural pattern
  const pluralMatch = cleaned.match(/^(\d+) (\w+) ago$/);
  if (pluralMatch) {
    const unit = pluralMatch[2].replace(/s$/, ''); // Remove plural 's'
    return { number: parseInt(pluralMatch[1], 10), unit };
  }
  return null; // Unrecognized format
}
```
### Midpoint Calculation with Uncertainty
```javascript
const UNIT_RANGES = {
  second: { min: 1, max: 59, days: 0 },
  minute: { min: 1, max: 59, days: 0 },
  hour: { min: 1, max: 23, days: 0 },
  day: { min: 1, max: 6, days: 1 },
  week: { min: 1, max: 3.9, days: 7 },
  month: { min: 1, max: 11.9, days: 30 },
  year: { min: 1, max: Infinity, days: 365 }
};

function calculateMidpointDays(number, unit) {
  const range = UNIT_RANGES[unit];
  if (!range) {
    throw new RangeError(`Unknown unit: ${unit}`);
  }
  const daysPerUnit = range.days;
  // Special case for singular "a month ago" = 30-59 days
  if (number === 1 && unit === 'month') {
    return 45; // Midpoint of 30-59
  }
  // Special case for singular "a year ago" = 365-729 days
  if (number === 1 && unit === 'year') {
    return 547; // Midpoint of 365-729
  }
  // Standard calculation: the label covers [number, number + 1) units
  const minDays = number * daysPerUnit;
  const maxDays = (number + 0.999) * daysPerUnit;
  return (minDays + maxDays) / 2;
}
```
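
The midpoint is an age in days, so turning it into a calendar date also requires the time at which the page was scraped. A minimal sketch of that conversion follows; `estimateReviewDate` is a hypothetical helper, and it works in UTC milliseconds to stay independent of the local time zone.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Subtract the estimated age (in days) from the scrape time.
// scrapedAt is a Date; the result is a new Date and scrapedAt is not modified.
function estimateReviewDate(scrapedAt, ageDays) {
  return new Date(scrapedAt.getTime() - ageDays * MS_PER_DAY);
}

// "a month ago" (midpoint 45 days) seen on 2026-01-18:
const scrapedAt = new Date("2026-01-18T00:00:00Z");
console.log(estimateReviewDate(scrapedAt, 45).toISOString().slice(0, 10));
// 2025-12-04
```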
---
## 📈 Statistical Analysis from Dataset
### Distribution of Review Ages (244 reviews)
| Time Range | Count | Percentage |
|------------|-------|------------|
| 2-3 weeks | ~2 | <1% |
| 1-12 months | ~15 | 6% |
| 1-2 years | ~30 | 12% |
| 2-5 years | ~60 | 25% |
| 5+ years | ~137 | 56% |
**Median Age:** ~5 years
**Oldest Review:** 11 years ago
---
## ✅ Validation
### Test Cases
```javascript
const testCases = [
  { input: "a month ago", expected_days: 45, range: [30, 59] },
  { input: "2 months ago", expected_days: 75, range: [60, 89] },
  { input: "3 weeks ago", expected_days: 24, range: [21, 27] },
  { input: "a year ago", expected_days: 547, range: [365, 729] },
  { input: "Edited 2 years ago", expected_days: 912, range: [730, 1094] }
];
```
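
Test cases in this shape can be sanity-checked mechanically. The sketch below uses a compact, self-contained parser (a hypothetical `parseLabel`, not the extraction function above) and verifies two invariants for each case: the stated range matches the range derived from the label, and the expected value lies inside that range. The unit lengths are the same assumed 7/30/365 days.

```javascript
const DAYS_PER_UNIT = { day: 1, week: 7, month: 30, year: 365 };

// Parse "a month ago" / "Edited 2 years ago" into { number, unit }, or null.
function parseLabel(text) {
  const m = text.replace(/^Edited\s+/i, "").match(/^(an?|\d+) (\w+?)s? ago$/);
  if (!m) return null;
  const number = /^\d+$/.test(m[1]) ? parseInt(m[1], 10) : 1;
  return { number, unit: m[2] };
}

// True when the case's range matches the label and expected_days is inside it.
function checkCase({ input, expected_days, range }) {
  const parsed = parseLabel(input);
  if (!parsed || !Object.hasOwn(DAYS_PER_UNIT, parsed.unit)) return false;
  const len = DAYS_PER_UNIT[parsed.unit];
  const min = parsed.number * len;
  const max = (parsed.number + 1) * len - 1;
  return (
    min === range[0] &&
    max === range[1] &&
    expected_days >= min &&
    expected_days <= max
  );
}

const sampleCases = [
  { input: "a month ago", expected_days: 45, range: [30, 59] },
  { input: "2 months ago", expected_days: 75, range: [60, 89] },
  { input: "a year ago", expected_days: 547, range: [365, 729] },
];
console.log(sampleCases.every(checkCase)); // true
```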
---
## 🎓 Conclusion
**Google's Date Formatter:**
- Custom internal implementation (not a public library)
- Simple, user-friendly patterns
- Intentionally imprecise (UX over accuracy)
- Maximum uncertainty: ±6 months for "a year ago"
**For Analytics:**
- Use midpoint calculation for balanced accuracy
- Add 10-20% grace period for filters
- Accept that ±15 days is unavoidable for month-level precision
- Consider showing date ranges in UI: "1-2 months ago" instead of "45 days ago"
**Bottom Line:** Our regex-based parser extracting from English text is the only approach available for these relative date strings, and it is as accurate as Google's intentional imprecision allows.