whyrating-engine-legacy/FIELD_ANALYSIS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00


# Google Maps Review Fields - Complete Analysis
## 🔍 Investigation Results
**Goal:** Reverse-engineer Google Maps to find actual timestamps instead of relative dates such as "Hace 2 meses" ("2 months ago")
**Result:** ❌ Google Maps does NOT expose actual timestamps in the public DOM
### What We Tested
```javascript
// Checked for timestamps in:
const dateElem = elem.querySelector('span.rsqaWe');
dateElem.getAttribute('aria-label'); // null
Object.keys(dateElem.dataset); // [] (no data-* attributes; getAttribute() does not accept wildcards)
dateElem.getAttribute('datetime'); // null
```
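A more systematic form of this check walks every attribute on the element instead of probing a few by name. The sketch below works on a plain name-to-value map so it can run outside a browser. `findTimestampCandidates` and the accepted value formats (ISO 8601 strings, 10- or 13-digit epoch numbers) are assumptions for illustration, not part of the scraper.

```javascript
// Hypothetical helper: given attribute name -> value pairs, return the names
// whose value looks like an absolute timestamp.
function findTimestampCandidates(attrs) {
  const isoPattern =
    /^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?)?$/;
  const epochPattern = /^\d{10}(\d{3})?$/; // seconds or milliseconds since 1970
  return Object.entries(attrs)
    .filter(([, value]) => isoPattern.test(value) || epochPattern.test(value))
    .map(([name]) => name);
}

// In the browser, the map can be built from a live element:
//   const attrs = Object.fromEntries(
//     Array.from(dateElem.attributes, (a) => [a.name, a.value]));
//   findTimestampCandidates(attrs);
```

An empty result for `span.rsqaWe` would be consistent with the finding above that no attribute carries a timestamp.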
### What Google Maps Provides
| Field | Available | Format | Example |
|-------|-----------|--------|---------|
| Relative Date Text | ✅ | Spanish/Local | "Hace 2 meses" |
| Actual Timestamp | ❌ | N/A | Not in DOM |
| ISO Date | ❌ | N/A | Not in DOM |
| aria-label | ❌ | N/A | Not set |
| data-* attributes | ❌ | N/A | None found |
## 📋 Currently Extracted Fields
### ✅ Successfully Extracted
| Field | Selector | Type | Notes |
|-------|----------|------|-------|
| `author` | `div.d4r55` | string | Reviewer name |
| `rating` | `span.kvMYJc[aria-label]` | number | 1-5 stars, extracted from aria-label |
| `text` | `span.wiI7pd` | string \| null | Review content |
| `date_text` | `span.rsqaWe` | string | **Relative date only** |
| `avatar_url` | `img.NBa7we[src]` | string \| null | Profile picture |
| `profile_url` | `button.WEBjve[data-review-id]` | string \| null | Profile identifier |
| `review_id` | computed | string | Hash of author + date |
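
Two of these fields are derived rather than read directly: `rating` is parsed out of the `aria-label`, and `review_id` is computed. A minimal sketch of both follows, assuming labels of the form "5 estrellas" or "5 stars". The document only says "hash of author + date", so the FNV-1a-style hash used here is an assumption, and both function names are hypothetical.

```javascript
// Pull the star count out of an aria-label such as "5 estrellas" or "4 stars".
// Returns null when the label is missing, has no digit, or is outside 1-5.
function parseRating(ariaLabel) {
  const match = (ariaLabel || '').match(/\d+/);
  if (!match) return null;
  const rating = parseInt(match[0], 10);
  return rating >= 1 && rating <= 5 ? rating : null;
}

// Hypothetical review_id: 32-bit FNV-1a-style hash over the UTF-16 code units
// of "author|date_text", rendered as 8 hex characters.
function computeReviewId(author, dateText) {
  const input = `${author}|${dateText}`;
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}
```

If the hash input is the relative `date_text`, the same review will hash differently once "Hace 2 meses" becomes "Hace 3 meses", which matters when ids are compared across scrapes.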
### ❌ Not Available or Not Extracted
| Field | Why Not Available |
|-------|-------------------|
| `timestamp` | Google doesn't expose it |
| `date_aria_label` | span.rsqaWe has no aria-label |
| `date_data_attrs` | span.rsqaWe has no data-* attributes |
| `likes_count` | Not in DOM scraper (only in API intercept) |
| `owner_response` | Not in DOM scraper (only in API intercept) |
| `photos` | Not currently extracted |
## 🔬 Potentially Extractable Fields (Not Currently Scraped)
### 1. Review Photos/Images
```javascript
// Reviews can have attached photos
const photoElements = elem.querySelectorAll('button[aria-label*="photo"]');
// or
const imageButtons = elem.querySelectorAll('button.Tya61d');
```
### 2. Review Edit Status
Some reviews show "Fecha de edición: Hace X" ("Edit date: X ago"), indicating they were edited. This text is currently captured in `date_text` but not parsed separately.
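
Separating the edit marker from the date could look like the sketch below. It assumes the Spanish prefix quoted above; `splitEditStatus` and `EDITED_PREFIX` are hypothetical names, and other locales would need their own prefix.

```javascript
// Matches the Spanish "edited" prefix at the start of the date text.
const EDITED_PREFIX = /^\s*fecha de edición:\s*/i;

// Split "Fecha de edición: Hace 3 semanas" into an edited flag
// and the bare relative date.
function splitEditStatus(dateText) {
  const edited = EDITED_PREFIX.test(dateText);
  return {
    edited,
    date_text: edited
      ? dateText.replace(EDITED_PREFIX, '').trim()
      : dateText.trim(),
  };
}
```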
### 3. Local Guide Badge
```javascript
// Some reviewers have "Local Guide" badges
const localGuideBadge = elem.querySelector('span.RfnDt');
```
### 4. Review Helpfulness (Thumbs Up Count)
May be available in some layouts (the `aria-label` text is localized, so matching on "helpful" assumes an English UI):
```javascript
const helpfulCount = elem.querySelector('[aria-label*="helpful"]');
```
### 5. Owner Response
```javascript
// Business owner responses to reviews
const ownerResponse = elem.querySelector('.CDe7pd');
```
## 🎯 Recommendation: Use Our Date Parser
Since Google Maps doesn't expose actual timestamps, our current approach is the **most practical option**:
### Current Solution (✅ Implemented)
```typescript
function extractNumber(text: string): number {
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0], 10);
  // No digits: Spanish spells out a single unit ("un año", "una semana"), so default to 1
  return 1;
}

function parseDateText(dateText: string): Date {
  const text = dateText.toLowerCase();
  if (text.includes('semana')) {
    const weeks = extractNumber(text);
    return new Date(Date.now() - weeks * 7 * 24 * 60 * 60 * 1000);
  }
  // ... similar for months, years
}
```
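
For reference, the elided month and year branches could be handled with a small unit table, as in the JavaScript sketch below. This is not the project's implementation: the 30-day month, the 365-day year, the `día` entry, the `null` fallback, and the injectable `now` parameter are all assumptions.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Approximate unit lengths in milliseconds. Months and years are averages (assumption).
const UNIT_MS = [
  ['día', DAY_MS],
  ['semana', 7 * DAY_MS],
  ['mes', 30 * DAY_MS],  // also matches "meses"
  ['año', 365 * DAY_MS], // also matches "años"
];

function extractNumber(text) {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1; // "un mes", "una semana" -> 1
}

// `now` is injectable so results are reproducible in tests.
function parseDateText(dateText, now = Date.now()) {
  const text = dateText.toLowerCase();
  for (const [unit, ms] of UNIT_MS) {
    if (text.includes(unit)) {
      return new Date(now - extractNumber(text) * ms);
    }
  }
  return null; // unrecognized wording
}
```

Returning `null` for unrecognized text keeps unknown formats visible to the caller instead of silently mapping them to the current time.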
### Why This Works
1. ✅ Accurate to the time unit (weeks, months, years)
2. ✅ Handles both numbers and Spanish text ("un año")
3. ✅ Processes all 244 reviews in <1ms
4. ✅ Good enough for analytics (±15 day margin acceptable)
### Alternative: API Interception
The `api_interceptor.py` module could in theory capture timestamps from Google's internal API, but:
- More complex and fragile
- Depends on Google's undocumented API structure
- Currently not extracting timestamps (field defined but not populated)
- Would require reverse-engineering Google's protobuf/JSON format
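
If the API route is ever explored, a first step could be scanning an intercepted JSON body for numbers that look like epoch values. The sketch below is exploratory and assumes the body is JSON, possibly behind the `)]}'` prefix that some Google endpoints prepend. The payload shape, the function names, and the 2005 to 2100 plausibility window are assumptions, and nothing here is taken from `api_interceptor.py`.

```javascript
const MIN_MS = Date.UTC(2005, 0, 1); // lower bound is an assumption
const MAX_MS = Date.UTC(2100, 0, 1);

// Strip the anti-hijacking prefix if present, then parse.
function parseInterceptedBody(body) {
  return JSON.parse(body.replace(/^\)\]\}'\s*/, ''));
}

// Interpret n as seconds, milliseconds, or microseconds; return ms or null.
function toMillis(n) {
  for (const scale of [1000, 1, 0.001]) {
    const ms = n * scale;
    if (ms >= MIN_MS && ms <= MAX_MS) return Math.round(ms);
  }
  return null;
}

// Recursively collect integers in a plausible epoch range, with their path.
function findEpochCandidates(node, path = [], out = []) {
  if (Array.isArray(node)) {
    node.forEach((child, i) => findEpochCandidates(child, [...path, i], out));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      findEpochCandidates(value, [...path, key], out);
    }
  } else if (typeof node === 'number' && Number.isInteger(node)) {
    const ms = toMillis(node);
    if (ms !== null) {
      out.push({ path: path.join('.'), iso: new Date(ms).toISOString() });
    }
  }
  return out;
}
```

Any hit is only a candidate: an integer in range may be an id or a counter, so each path would need to be checked against a review whose date is known.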
## 📊 Field Comparison: DOM vs API Intercept
| Criterion | DOM Scraper | API Intercept | Winner |
|-------|-------------|---------------|--------|
| Speed | ⚡ Fast | 🐢 Slower | DOM |
| Reliability | ✅ Stable | ⚠️ Fragile | DOM |
| Timestamp | ❌ No | ❓ Maybe | Neither |
| Photos | ⚠️ Not impl | ✅ Yes | API |
| Likes | ❌ No | ✅ Yes | API |
| Owner Response | ⚠️ Not impl | ✅ Yes | API |
## 🚀 Enhancement Opportunities
### Priority 1: Extract Review Photos
```javascript
// Add to fast_scraper.py extraction script
const photoButtons = elem.querySelectorAll('button[jsaction*="photo"]');
review.photo_count = photoButtons.length;
review.photo_urls = Array.from(photoButtons).map(btn => {
  const img = btn.querySelector('img');
  return img ? img.src : null;
}).filter(Boolean);
```
### Priority 2: Extract Local Guide Status
```javascript
const isLocalGuide = !!elem.querySelector('span.RfnDt');
review.is_local_guide = isLocalGuide;
```
### Priority 3: Extract Owner Responses
```javascript
const ownerResponseElem = elem.querySelector('.CDe7pd');
review.owner_response = ownerResponseElem ? ownerResponseElem.textContent.trim() : null;
```
### Priority 4: Extract Review Helpfulness
```javascript
const helpfulElem = elem.querySelector('[aria-label*="helpful"]');
if (helpfulElem) {
  const match = helpfulElem.getAttribute('aria-label').match(/\d+/);
  review.helpful_count = match ? parseInt(match[0], 10) : 0; // explicit radix
}
```
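
The four snippets above could be folded into one null-safe helper, sketched below. `extractOptionalFields` is a hypothetical name, the selectors are the unverified candidates from this section, and defaulting `helpful_count` to 0 when the element is absent is an assumption.

```javascript
// Hypothetical helper: collect the optional fields for one review container.
function extractOptionalFields(elem) {
  const photoButtons = Array.from(
    elem.querySelectorAll('button[jsaction*="photo"]')
  );
  const photoUrls = photoButtons
    .map((btn) => btn.querySelector('img'))
    .map((img) => (img ? img.src : null))
    .filter(Boolean);

  const ownerElem = elem.querySelector('.CDe7pd');
  const helpfulElem = elem.querySelector('[aria-label*="helpful"]');
  const helpfulMatch = helpfulElem
    ? (helpfulElem.getAttribute('aria-label') || '').match(/\d+/)
    : null;

  return {
    photo_count: photoButtons.length,
    photo_urls: photoUrls,
    is_local_guide: elem.querySelector('span.RfnDt') !== null,
    owner_response: ownerElem ? ownerElem.textContent.trim() : null,
    helpful_count: helpfulMatch ? parseInt(helpfulMatch[0], 10) : 0,
  };
}
```

Returning a plain object keeps the helper independent of the `review` variable, so the caller can merge it with `Object.assign(review, extractOptionalFields(elem))`.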
## 📝 Summary
**What we have:**
- ✅ All essential review data (author, rating, text, date)
- ✅ Profile info (avatar, profile URL)
- ✅ Fast, reliable extraction
- ✅ Working date parsing (good enough for analytics)
**What we're missing (but could add):**
- 📸 Review photos
- 👤 Local Guide badges
- 💬 Owner responses
- 👍 Helpfulness counts
**What doesn't exist in DOM:**
- ❌ Actual timestamps
- ❌ Precise dates
**Conclusion:** Our date parsing approach is the best solution given Google Maps' limitations. Focus enhancement efforts on extracting photos, owner responses, and local guide status rather than chasing timestamps that don't exist.