Performance improvements: - Validation speed: 59.71s → 10.96s (5.5x improvement) - Removed 50+ console.log statements from JavaScript extraction - Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting - Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls) Scraping improvements: - Increased idle detection from 6 to 12 consecutive idle scrolls for completeness - Added real-time progress updates every 5 scrolls with percentage calculation - Added crash recovery to extract partial reviews if Chrome crashes - Removed artificial 200-review limit to scrape ALL reviews Timestamp tracking: - Added updated_at field separate from started_at for progress tracking - Frontend now shows both "Started" (fixed) and "Last Update" (dynamic) Robustness improvements: - Added 5 fallback CSS selectors to handle different Google Maps page structures - Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc. - Automatic selector detection logs which selector works for debugging Test results: - Successfully scraped 550 reviews in 150.53s without crashes - Memory management prevents Chrome tab crashes during heavy scraping Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
6.1 KiB
Google Maps Review Fields - Complete Analysis
🔍 Investigation Results
Goal: Reverse-engineer Google Maps to find actual timestamps instead of relative dates ("Hace 2 meses")
Result: ❌ Google Maps does NOT expose actual timestamps in the public DOM
What We Tested
// Checked for timestamps in:
const dateElem = elem.querySelector('span.rsqaWe');
dateElem.getAttribute('aria-label'); // null
dateElem.getAttribute('data-*'); // no data attributes
dateElem.getAttribute('datetime'); // null
What Google Maps Provides
| Field | Available | Format | Example |
|---|---|---|---|
| Relative Date Text | ✅ | Spanish/Local | "Hace 2 meses" |
| Actual Timestamp | ❌ | N/A | Not in DOM |
| ISO Date | ❌ | N/A | Not in DOM |
| aria-label | ❌ | N/A | Not set |
| data-* attributes | ❌ | N/A | None found |
📋 Currently Extracted Fields
✅ Successfully Extracted
| Field | Selector | Type | Notes |
|---|---|---|---|
author |
div.d4r55 |
string | Reviewer name |
rating |
span.kvMYJc[aria-label] |
number | 1-5 stars, extracted from aria-label |
text |
span.wiI7pd |
string | null | Review content |
date_text |
span.rsqaWe |
string | Relative date only |
avatar_url |
img.NBa7we[src] |
string | null | Profile picture |
profile_url |
button.WEBjve[data-review-id] |
string | null | Profile identifier |
review_id |
computed | string | Hash of author + date |
❌ Not Available in DOM
| Field | Why Not Available |
|---|---|
timestamp |
Google doesn't expose it |
date_aria_label |
span.rsqaWe has no aria-label |
date_data_attrs |
span.rsqaWe has no data-* attributes |
likes_count |
Not in DOM scraper (only in API intercept) |
owner_response |
Not in DOM scraper (only in API intercept) |
photos |
Not currently extracted |
🔬 Potentially Extractable Fields (Not Currently Scraped)
1. Review Photos/Images
// Reviews can have attached photos
const photoElements = elem.querySelectorAll('button[aria-label*="photo"]');
// or
const imageButtons = elem.querySelectorAll('button.Tya61d');
2. Review Edit Status
Some reviews show "Fecha de edición: Hace X" indicating they were edited. Currently captured in date_text but not parsed separately.
3. Local Guide Badge
// Some reviewers have "Local Guide" badges
const localGuideBadge = elem.querySelector('span.RfnDt');
4. Review Helpfulness (Thumbs Up Count)
May be available in some layouts:
const helpfulCount = elem.querySelector('[aria-label*="helpful"]');
5. Owner Response
// Business owner responses to reviews
const ownerResponse = elem.querySelector('.CDe7pd');
🎯 Recommendation: Use Our Date Parser
Since Google Maps doesn't expose actual timestamps, our current approach is optimal:
Current Solution (✅ Implemented)
function extractNumber(text: string): number {
const match = text.match(/\d+/);
if (match) return parseInt(match[0]);
if (text.includes('un ') || text.includes('una ')) return 1;
return 1;
}
function parseDateText(dateText: string): Date {
const text = dateText.toLowerCase();
if (text.includes('semana')) {
const weeks = extractNumber(text);
return new Date(Date.now() - weeks * 7 * 24 * 60 * 60 * 1000);
}
// ... similar for months, years
}
Why This Works
- ✅ Accurate to the time unit (weeks, months, years)
- ✅ Handles both numbers and Spanish text ("un año")
- ✅ Processes all 244 reviews in <1ms
- ✅ Good enough for analytics (±15 day margin acceptable)
Alternative: API Interception
The api_interceptor.py module theoretically could capture timestamps from Google's internal API, but:
- More complex and fragile
- Depends on Google's undocumented API structure
- Currently not extracting timestamps (field defined but not populated)
- Would require reverse-engineering Google's protobuf/JSON format
📊 Field Comparison: DOM vs API Intercept
| Field | DOM Scraper | API Intercept | Winner |
|---|---|---|---|
| Speed | ⚡ Fast | 🐢 Slower | DOM |
| Reliability | ✅ Stable | ⚠️ Fragile | DOM |
| Timestamp | ❌ No | ❓ Maybe | Neither |
| Photos | ⚠️ Not impl | ✅ Yes | API |
| Likes | ❌ No | ✅ Yes | API |
| Owner Response | ⚠️ Not impl | ✅ Yes | API |
🚀 Enhancement Opportunities
Priority 1: Extract Review Photos
// Add to fast_scraper.py extraction script
const photoButtons = elem.querySelectorAll('button[jsaction*="photo"]');
review.photo_count = photoButtons.length;
review.photo_urls = Array.from(photoButtons).map(btn => {
const img = btn.querySelector('img');
return img ? img.src : null;
}).filter(Boolean);
Priority 2: Extract Local Guide Status
const isLocalGuide = !!elem.querySelector('span.RfnDt');
review.is_local_guide = isLocalGuide;
Priority 3: Extract Owner Responses
const ownerResponseElem = elem.querySelector('.CDe7pd');
review.owner_response = ownerResponseElem ? ownerResponseElem.textContent.trim() : null;
Priority 4: Extract Review Helpfulness
const helpfulElem = elem.querySelector('[aria-label*="helpful"]');
if (helpfulElem) {
const match = helpfulElem.getAttribute('aria-label').match(/\d+/);
review.helpful_count = match ? parseInt(match[0]) : 0;
}
📝 Summary
What we have:
- ✅ All essential review data (author, rating, text, date)
- ✅ Profile info (avatar, profile URL)
- ✅ Fast, reliable extraction
- ✅ Working date parsing (good enough for analytics)
What we're missing (but could add):
- 📸 Review photos
- 👤 Local Guide badges
- 💬 Owner responses
- 👍 Helpfulness counts
What doesn't exist in DOM:
- ❌ Actual timestamps
- ❌ Precise dates
Conclusion: Our date parsing approach is the best solution given Google Maps' limitations. Focus enhancement efforts on extracting photos, owner responses, and local guide status rather than chasing timestamps that don't exist.