Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Alejandro Gutiérrez
2026-01-18 19:49:24 +00:00
parent bdffb5eaac
commit faa0704737
108 changed files with 23632 additions and 54 deletions

87
Dockerfile Normal file
View File

@@ -0,0 +1,87 @@
FROM python:3.11-slim
# Install system dependencies for Chrome, Selenium, and Xvfb (virtual display)
RUN apt-get update && apt-get install -y \
# Basic utilities
wget \
gnupg \
unzip \
curl \
# Xvfb for virtual display (allows non-headless Chrome in container)
xvfb \
# Chrome dependencies
fonts-liberation \
libasound2 \
libatk-bridge2.0-0 \
libatk1.0-0 \
libatspi2.0-0 \
libcups2 \
libdbus-1-3 \
libdrm2 \
libgbm1 \
libgtk-3-0 \
libnspr4 \
libnss3 \
libwayland-client0 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxkbcommon0 \
libxrandr2 \
xdg-utils \
# Additional dependencies
libu2f-udev \
libvulkan1 \
&& rm -rf /var/lib/apt/lists/*
# Install Chromium (works on all architectures)
RUN apt-get update \
&& apt-get install -y chromium chromium-driver \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy requirements and install Python dependencies
COPY requirements-production.txt .
RUN pip install --no-cache-dir -r requirements-production.txt
# Copy application code
COPY modules/ ./modules/
COPY api_server_production.py .
COPY config.yaml .
# Create startup script for Xvfb + API server
RUN echo '#!/bin/bash\n\
# Start Xvfb (virtual display) in background\n\
Xvfb :99 -screen 0 1920x1080x24 -ac +extension GLX +render -noreset &\n\
export DISPLAY=:99\n\
\n\
# Wait for Xvfb to start\n\
sleep 2\n\
\n\
# Start API server\n\
exec python api_server_production.py\n\
' > /app/start.sh && chmod +x /app/start.sh
# Create non-root user and give SeleniumBase write permissions
RUN useradd -m -u 1000 scraper && \
chown -R scraper:scraper /app && \
chown -R scraper:scraper /usr/local/lib/python3.11/site-packages/seleniumbase
USER scraper
# Expose port
EXPOSE 8000
# Environment variables for Chromium in container
ENV DISPLAY=:99
ENV CHROME_BIN=/usr/bin/chromium
ENV CHROME_PATH=/usr/bin/chromium
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
CMD curl -f http://localhost:8000/health/live || exit 1
# Run startup script (starts Xvfb + API server)
CMD ["/app/start.sh"]