Files
whyrating-engine-legacy/FIELD_ANALYSIS.md
Alejandro Gutiérrez faa0704737 Optimize scraper performance and add fallback selectors for robustness
Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-01-18 19:49:24 +00:00

6.1 KiB

Google Maps Review Fields - Complete Analysis

🔍 Investigation Results

Goal: Reverse-engineer Google Maps to find actual timestamps instead of relative dates ("Hace 2 meses")

Result: Google Maps does NOT expose actual timestamps in the public DOM

What We Tested

// Checked for timestamps in:
const dateElem = elem.querySelector('span.rsqaWe');
dateElem.getAttribute('aria-label');  // null
dateElem.getAttribute('data-*');      // no data attributes
dateElem.getAttribute('datetime');    // null

What Google Maps Provides

Field Available Format Example
Relative Date Text Spanish/Local "Hace 2 meses"
Actual Timestamp N/A Not in DOM
ISO Date N/A Not in DOM
aria-label N/A Not set
data-* attributes N/A None found

📋 Currently Extracted Fields

Successfully Extracted

Field Selector Type Notes
author div.d4r55 string Reviewer name
rating span.kvMYJc[aria-label] number 1-5 stars, extracted from aria-label
text span.wiI7pd string | null Review content
date_text span.rsqaWe string Relative date only
avatar_url img.NBa7we[src] string | null Profile picture
profile_url button.WEBjve[data-review-id] string | null Profile identifier
review_id computed string Hash of author + date

Not Available in DOM

Field Why Not Available
timestamp Google doesn't expose it
date_aria_label span.rsqaWe has no aria-label
date_data_attrs span.rsqaWe has no data-* attributes
likes_count Not in DOM scraper (only in API intercept)
owner_response Not in DOM scraper (only in API intercept)
photos Not currently extracted

🔬 Potentially Extractable Fields (Not Currently Scraped)

1. Review Photos/Images

// Reviews can have attached photos
const photoElements = elem.querySelectorAll('button[aria-label*="photo"]');
// or
const imageButtons = elem.querySelectorAll('button.Tya61d');

2. Review Edit Status

Some reviews show "Fecha de edición: Hace X" indicating they were edited. Currently captured in date_text but not parsed separately.

3. Local Guide Badge

// Some reviewers have "Local Guide" badges
const localGuideBadge = elem.querySelector('span.RfnDt');

4. Review Helpfulness (Thumbs Up Count)

May be available in some layouts:

const helpfulCount = elem.querySelector('[aria-label*="helpful"]');

5. Owner Response

// Business owner responses to reviews
const ownerResponse = elem.querySelector('.CDe7pd');

🎯 Recommendation: Use Our Date Parser

Since Google Maps doesn't expose actual timestamps, our current approach is optimal:

Current Solution ( Implemented)

function extractNumber(text: string): number {
  const match = text.match(/\d+/);
  if (match) return parseInt(match[0]);
  if (text.includes('un ') || text.includes('una ')) return 1;
  return 1;
}

function parseDateText(dateText: string): Date {
  const text = dateText.toLowerCase();
  if (text.includes('semana')) {
    const weeks = extractNumber(text);
    return new Date(Date.now() - weeks * 7 * 24 * 60 * 60 * 1000);
  }
  // ... similar for months, years
}

Why This Works

  1. Accurate to the time unit (weeks, months, years)
  2. Handles both numbers and Spanish text ("un año")
  3. Processes all 244 reviews in <1ms
  4. Good enough for analytics (±15 day margin acceptable)

Alternative: API Interception

The api_interceptor.py module theoretically could capture timestamps from Google's internal API, but:

  • More complex and fragile
  • Depends on Google's undocumented API structure
  • Currently not extracting timestamps (field defined but not populated)
  • Would require reverse-engineering Google's protobuf/JSON format

📊 Field Comparison: DOM vs API Intercept

Field DOM Scraper API Intercept Winner
Speed Fast 🐢 Slower DOM
Reliability Stable ⚠️ Fragile DOM
Timestamp No Maybe Neither
Photos ⚠️ Not impl Yes API
Likes No Yes API
Owner Response ⚠️ Not impl Yes API

🚀 Enhancement Opportunities

Priority 1: Extract Review Photos

// Add to fast_scraper.py extraction script
const photoButtons = elem.querySelectorAll('button[jsaction*="photo"]');
review.photo_count = photoButtons.length;
review.photo_urls = Array.from(photoButtons).map(btn => {
  const img = btn.querySelector('img');
  return img ? img.src : null;
}).filter(Boolean);

Priority 2: Extract Local Guide Status

const isLocalGuide = !!elem.querySelector('span.RfnDt');
review.is_local_guide = isLocalGuide;

Priority 3: Extract Owner Responses

const ownerResponseElem = elem.querySelector('.CDe7pd');
review.owner_response = ownerResponseElem ? ownerResponseElem.textContent.trim() : null;

Priority 4: Extract Review Helpfulness

const helpfulElem = elem.querySelector('[aria-label*="helpful"]');
if (helpfulElem) {
  const match = helpfulElem.getAttribute('aria-label').match(/\d+/);
  review.helpful_count = match ? parseInt(match[0]) : 0;
}

📝 Summary

What we have:

  • All essential review data (author, rating, text, date)
  • Profile info (avatar, profile URL)
  • Fast, reliable extraction
  • Working date parsing (good enough for analytics)

What we're missing (but could add):

  • 📸 Review photos
  • 👤 Local Guide badges
  • 💬 Owner responses
  • 👍 Helpfulness counts

What doesn't exist in DOM:

  • Actual timestamps
  • Precise dates

Conclusion: Our date parsing approach is the best solution given Google Maps' limitations. Focus enhancement efforts on extracting photos, owner responses, and local guide status rather than chasing timestamps that don't exist.