# Google Maps Review Fields - Complete Analysis

## 🔍 Investigation Results

**Goal:** Reverse-engineer the Google Maps DOM to find actual timestamps instead of relative dates such as "Hace 2 meses" ("2 months ago").

**Result:** ❌ Google Maps does NOT expose actual timestamps in the public DOM.

### What We Tested

```javascript
// Checked the date element for timestamp-bearing attributes:
const dateElem = elem.querySelector('span.rsqaWe');
dateElem.getAttribute('aria-label');  // null
Object.keys(dateElem.dataset);        // [] (no data-* attributes)
dateElem.getAttribute('datetime');    // null
```

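The same check can be widened from three attributes to every attribute on the element. The following is a minimal sketch, assuming the attributes have first been copied into plain `[name, value]` pairs (for example with `Array.from(dateElem.attributes, a => [a.name, a.value])`). `findTimestampCandidates` is a hypothetical helper, not part of the scraper.

```javascript
// Hypothetical helper: flag attribute values that look like machine-readable dates.
// Input is a list of [name, value] pairs so the function can run outside a browser.
function findTimestampCandidates(attributePairs) {
  const isoDate = /\d{4}-\d{2}-\d{2}/;      // e.g. 2024-03-15 or 2024-03-15T10:00:00Z
  const epoch = /^\d{10}(\d{3}|\d{6})?$/;  // seconds, milliseconds, or microseconds
  return attributePairs
    .filter(([, value]) => isoDate.test(value) || epoch.test(value.trim()))
    .map(([name, value]) => ({ name, value }));
}

// Consistent with the findings above: a class attribute alone yields no candidates.
console.log(findTimestampCandidates([['class', 'rsqaWe']])); // []

// A made-up attribute, only to show what a hit would look like.
console.log(findTimestampCandidates([['datetime', '2024-03-15T10:00:00Z']]));
```

An empty result for every review element would support the conclusion that the DOM carries no hidden timestamp.
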
### What Google Maps Provides

| Field | Available | Format | Example |
|-------|-----------|--------|---------|
| Relative Date Text | ✅ | Spanish/Local | "Hace 2 meses" |
| Actual Timestamp | ❌ | N/A | Not in DOM |
| ISO Date | ❌ | N/A | Not in DOM |
| aria-label | ❌ | N/A | Not set |
| data-* attributes | ❌ | N/A | None found |

## 📋 Currently Extracted Fields

### ✅ Successfully Extracted

| Field | Selector | Type | Notes |
|-------|----------|------|-------|
| `author` | `div.d4r55` | string | Reviewer name |
| `rating` | `span.kvMYJc[aria-label]` | number | 1-5 stars, extracted from aria-label |
| `text` | `span.wiI7pd` | string \| null | Review content |
| `date_text` | `span.rsqaWe` | string | **Relative date only** |
| `avatar_url` | `img.NBa7we[src]` | string \| null | Profile picture |
| `profile_url` | `button.WEBjve[data-review-id]` | string \| null | Profile identifier |
| `review_id` | computed | string | Hash of author + date |

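Two of the rows above involve a small amount of logic: `rating` is parsed out of an aria-label, and `review_id` is computed. The sketch below illustrates both under stated assumptions. The aria-label wording (`"5 estrellas"`) and the hash (32-bit FNV-1a over `author|date_text`) are stand-ins; the table only says "hash of author + date", so the real algorithm may differ.

```javascript
// Pull the star count out of an aria-label such as "5 estrellas" (assumed wording).
// Returns null when the label has no number between 1 and 5.
function parseRating(ariaLabel) {
  const match = (ariaLabel || '').match(/\d+/);
  if (!match) return null;
  const stars = parseInt(match[0], 10);
  return stars >= 1 && stars <= 5 ? stars : null;
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

// Assumption: the id is a hash of "author|date_text".
function computeReviewId(author, dateText) {
  return fnv1a(`${author}|${dateText}`);
}

console.log(parseRating('5 estrellas')); // 5
console.log(parseRating('sin datos'));   // null
```

One consequence worth keeping in mind: if the id really is derived from the relative date text, the same review will hash differently once "Hace 2 meses" becomes "Hace 3 meses", so ids are only stable within a single scrape.
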
### ❌ Not Available or Not Currently Extracted

| Field | Reason |
|-------|--------|
| `timestamp` | Google doesn't expose it |
| `date_aria_label` | `span.rsqaWe` has no aria-label |
| `date_data_attrs` | `span.rsqaWe` has no data-* attributes |
| `likes_count` | Not in DOM scraper (only in API intercept) |
| `owner_response` | Not in DOM scraper (only in API intercept) |
| `photos` | Not currently extracted |

## 🔬 Potentially Extractable Fields (Not Currently Scraped)

### 1. Review Photos/Images
```javascript
// Reviews can have attached photos.
// Caveat: aria-label text is localized, so "photo" may not match in a Spanish UI.
const photoElements = elem.querySelectorAll('button[aria-label*="photo"]');
// or
const imageButtons = elem.querySelectorAll('button.Tya61d');
```

### 2. Review Edit Status
Some reviews show "Fecha de edición: Hace X" ("Edited: X ago"), indicating that they were edited. This text is currently captured in `date_text` but is not parsed separately.

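Separating the edit marker from the relative date could look like the sketch below. It assumes the marker always appears as the prefix quoted above; `splitEditStatus` and `EDIT_PREFIXES` are hypothetical names, and other locales would need their own prefixes.

```javascript
// Known "edited" prefixes, lowercased. Only the Spanish form seen in the scrape is listed.
const EDIT_PREFIXES = ['fecha de edición:'];

// Split date_text into an edited flag and the remaining relative date.
function splitEditStatus(dateText) {
  const trimmed = dateText.trim();
  const lower = trimmed.toLowerCase();
  for (const prefix of EDIT_PREFIXES) {
    if (lower.startsWith(prefix)) {
      return { edited: true, dateText: trimmed.slice(prefix.length).trim() };
    }
  }
  return { edited: false, dateText: trimmed };
}

console.log(splitEditStatus('Fecha de edición: Hace 3 días'));
// { edited: true, dateText: 'Hace 3 días' }
console.log(splitEditStatus('Hace 2 meses'));
// { edited: false, dateText: 'Hace 2 meses' }
```

Stripping the prefix before date parsing also keeps the marker text out of the number and unit matching.
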
### 3. Local Guide Badge
```javascript
// Some reviewers have "Local Guide" badges
const localGuideBadge = elem.querySelector('span.RfnDt');
```

### 4. Review Helpfulness (Thumbs Up Count)
May be available in some layouts:
```javascript
const helpfulCount = elem.querySelector('[aria-label*="helpful"]');
```

### 5. Owner Response
```javascript
// Business owner responses to reviews
const ownerResponse = elem.querySelector('.CDe7pd');
```

## 🎯 Recommendation: Use Our Date Parser

Since Google Maps doesn't expose actual timestamps in the DOM, our current approach of parsing the relative date text is the **most practical option**.

### Current Solution (✅ Implemented)
```typescript
// Extract the quantity from a relative date such as "hace 2 meses".
// Phrases without digits ("hace un año", "hace una semana") mean 1.
function extractNumber(text: string): number {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1;
}

function parseDateText(dateText: string): Date {
  const text = dateText.toLowerCase();
  if (text.includes('semana')) {
    const weeks = extractNumber(text);
    return new Date(Date.now() - weeks * 7 * 24 * 60 * 60 * 1000);
  }
  // ... similar for months, years

  // Assumption: fail loudly on unrecognized text so every path returns a Date or throws.
  throw new Error(`Unrecognized relative date: ${dateText}`);
}
```

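For illustration, here is how the elided month and year branches might be filled in. This is a plain JavaScript sketch of the same idea, not the implemented parser: the unit stems are assumptions about the strings Google Maps emits, calendar arithmetic replaces the fixed millisecond offsets, `now` is injected so results are reproducible, and unrecognized text returns `null`.

```javascript
// Units are matched by Spanish stem so singular and plural both hit ("mes"/"meses").
// Assumption: these stems cover the strings in the scrape; the list is illustrative.
const UNITS = [
  { stem: 'minuto', apply: (d, n) => d.setUTCMinutes(d.getUTCMinutes() - n) },
  { stem: 'hora',   apply: (d, n) => d.setUTCHours(d.getUTCHours() - n) },
  { stem: 'día',    apply: (d, n) => d.setUTCDate(d.getUTCDate() - n) },
  { stem: 'semana', apply: (d, n) => d.setUTCDate(d.getUTCDate() - 7 * n) },
  { stem: 'mes',    apply: (d, n) => d.setUTCMonth(d.getUTCMonth() - n) },
  { stem: 'año',    apply: (d, n) => d.setUTCFullYear(d.getUTCFullYear() - n) },
];

function extractNumber(text) {
  const match = text.match(/\d+/);
  return match ? parseInt(match[0], 10) : 1; // "un mes", "una semana" -> 1
}

// Returns a Date, or null when no known unit is found.
function parseDateText(dateText, now = new Date()) {
  const text = dateText.toLowerCase();
  const unit = UNITS.find((u) => text.includes(u.stem));
  if (!unit) return null;
  const result = new Date(now.getTime());
  unit.apply(result, extractNumber(text));
  return result;
}

const now = new Date('2024-06-15T12:00:00Z');
console.log(parseDateText('Hace 2 meses', now).toISOString()); // 2024-04-15T12:00:00.000Z
console.log(parseDateText('Hace un año', now).toISOString());  // 2023-06-15T12:00:00.000Z
```

Calendar arithmetic can roll over at month ends (subtracting one month from March 31 lands in early March), which is harmless at this precision but worth knowing when comparing dates.
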
### Why This Works
1. ✅ Accurate to the time unit (weeks, months, years)
2. ✅ Handles both digits and Spanish number words ("un año", meaning "one year")
3. ✅ Processes all 244 reviews in <1ms
4. ✅ Good enough for analytics (a margin of about ±15 days is acceptable)

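The ±15 day figure in point 4 can be checked with a little arithmetic. The sketch below assumes that Google rounds the label down, so "hace N meses" covers anything from N up to (but not including) N + 1 months ago. That rounding rule is an assumption, and the days-per-unit values are rough averages.

```javascript
// Rough unit lengths in days, used only for error estimates.
const DAYS_PER_UNIT = { 'semana': 7, 'mes': 30, 'año': 365 };

// Assumption: the label is rounded down, so "hace N <unit>" spans [N, N + 1) units.
function uncertaintyDays(amount, unit) {
  const span = DAYS_PER_UNIT[unit];
  return {
    newestAgeDays: amount * span,           // the review could be this recent
    oldestAgeDays: (amount + 1) * span,     // or almost this old
    midpointAgeDays: (amount + 0.5) * span, // centre of the interval
    halfWidthDays: span / 2,                // error bound when using the midpoint
  };
}

console.log(uncertaintyDays(2, 'mes').halfWidthDays); // 15
console.log(uncertaintyDays(1, 'año').halfWidthDays); // 182.5
```

Under this assumption the ±15 day margin holds for month-granularity labels when the midpoint is used; year-granularity labels are far coarser, at roughly ±6 months.
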
### Alternative: API Interception
The `api_interceptor.py` module could in theory capture timestamps from Google's internal API, but this approach:
- Is more complex and fragile
- Depends on Google's undocumented API structure
- Does not currently extract timestamps (the field is defined but not populated)
- Would require reverse-engineering Google's protobuf/JSON format

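If the interception route is ever explored, a first step could be to scan a captured payload for numbers that look like epoch timestamps. The sketch below is a hypothetical exploration aid, not part of `api_interceptor.py`. It assumes the response has already been parsed into JSON, and the sample payload is made up.

```javascript
// Convert n to milliseconds if it is a plausible epoch value in seconds,
// milliseconds, or microseconds between 2005 and 2100; otherwise return null.
function toMillis(n) {
  const MIN = Date.UTC(2005, 0, 1);
  const MAX = Date.UTC(2100, 0, 1);
  for (const ms of [n * 1000, n, Math.round(n / 1000)]) {
    if (ms >= MIN && ms <= MAX) return ms;
  }
  return null;
}

// Walk nested arrays and objects, recording the path to every candidate value.
function findEpochCandidates(node, path = '$', found = []) {
  if (typeof node === 'number' && Number.isInteger(node)) {
    const ms = toMillis(node);
    if (ms !== null) found.push({ path, value: node, iso: new Date(ms).toISOString() });
  } else if (Array.isArray(node)) {
    node.forEach((child, i) => findEpochCandidates(child, `${path}[${i}]`, found));
  } else if (node !== null && typeof node === 'object') {
    for (const [key, child] of Object.entries(node)) {
      findEpochCandidates(child, `${path}.${key}`, found);
    }
  }
  return found;
}

// Made-up payload shaped like a nested array; not a real Google response.
const sample = [null, [['Ana', 5, 1710496800000000]], { count: 3 }];
console.log(findEpochCandidates(sample));
```

Any paths that consistently hold plausible dates across many reviews would be candidates for the unpopulated timestamp field, though they would still depend on an undocumented format.
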
## 📊 Field Comparison: DOM vs API Intercept

| Aspect | DOM Scraper | API Intercept | Winner |
|--------|-------------|---------------|--------|
| Speed | ⚡ Fast | 🐢 Slower | DOM |
| Reliability | ✅ Stable | ⚠️ Fragile | DOM |
| Timestamp | ❌ No | ❓ Maybe | Neither |
| Photos | ⚠️ Not implemented | ✅ Yes | API |
| Likes | ❌ No | ✅ Yes | API |
| Owner Response | ⚠️ Not implemented | ✅ Yes | API |

## 🚀 Enhancement Opportunities

### Priority 1: Extract Review Photos
```javascript
// Add to fast_scraper.py extraction script
const photoButtons = elem.querySelectorAll('button[jsaction*="photo"]');
review.photo_count = photoButtons.length;
review.photo_urls = Array.from(photoButtons).map(btn => {
  const img = btn.querySelector('img');
  return img ? img.src : null;
}).filter(Boolean);
```

### Priority 2: Extract Local Guide Status
```javascript
const isLocalGuide = !!elem.querySelector('span.RfnDt');
review.is_local_guide = isLocalGuide;
```

### Priority 3: Extract Owner Responses
```javascript
const ownerResponseElem = elem.querySelector('.CDe7pd');
review.owner_response = ownerResponseElem ? ownerResponseElem.textContent.trim() : null;
```

### Priority 4: Extract Review Helpfulness
```javascript
// Caveat: "helpful" is English; a localized UI may use a different aria-label.
const helpfulElem = elem.querySelector('[aria-label*="helpful"]');
if (helpfulElem) {
  const match = helpfulElem.getAttribute('aria-label').match(/\d+/);
  review.helpful_count = match ? parseInt(match[0], 10) : 0;
}
```

## 📝 Summary

**What we have:**
- ✅ All essential review data (author, rating, text, date)
- ✅ Profile info (avatar, profile URL)
- ✅ Fast, reliable extraction
- ✅ Working date parsing (good enough for analytics)

**What we're missing (but could add):**
- 📸 Review photos
- 👤 Local Guide badges
- 💬 Owner responses
- 👍 Helpfulness counts

**What doesn't exist in DOM:**
- ❌ Actual timestamps
- ❌ Precise dates

**Conclusion:** Our date parsing approach is the best solution given Google Maps' limitations. Focus enhancement efforts on extracting photos, owner responses, and local guide status rather than chasing timestamps that don't exist.