Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
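The fallback-selector and idle-detection changes described in the commit message could look roughly like the sketch below. It is not the project's actual code: the function names, the callback parameters (`find_all`, `scroll_once`, `count_reviews`, `on_progress`) and the progress callback's arguments are assumptions made for illustration. Only the three selectors named in the message are listed; the remaining two are not shown there.

```python
from typing import Callable, Optional, Sequence

# Selectors named in the commit message (the full list of five is not shown there).
REVIEW_SELECTORS = (
    "div.jftiEf.fontBodyMedium",
    "div.jftiEf",
    "div[data-review-id]",
)


def first_matching_selector(
    find_all: Callable[[str], Sequence[object]],
    selectors: Sequence[str] = REVIEW_SELECTORS,
) -> Optional[str]:
    """Return the first selector that matches at least one element, else None."""
    for selector in selectors:
        if find_all(selector):
            return selector
    return None


def scroll_until_idle(
    scroll_once: Callable[[], None],
    count_reviews: Callable[[], int],
    max_idle: int = 12,
    progress_every: int = 5,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> int:
    """Scroll until `max_idle` consecutive scrolls load no new reviews.

    Returns the number of reviews seen when the loop stops.
    """
    seen = count_reviews()
    idle = 0      # consecutive scrolls that loaded nothing new
    scrolls = 0   # total scrolls performed
    while idle < max_idle:
        scroll_once()
        scrolls += 1
        current = count_reviews()
        if current > seen:
            seen = current
            idle = 0  # new reviews appeared, so reset the idle counter
        else:
            idle += 1
        # Report progress every `progress_every` scrolls.
        if on_progress is not None and scrolls % progress_every == 0:
            on_progress(scrolls, seen)
    return seen
```

With Selenium, `find_all` could be `lambda s: driver.find_elements(By.CSS_SELECTOR, s)`. Passing the browser interaction in as callables keeps both loops testable without a running Chrome instance.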
88 lines
2.1 KiB
Docker
FROM python:3.11-slim

# Install system dependencies for Chrome, Selenium, and Xvfb (virtual display)
RUN apt-get update && apt-get install -y \
    # Basic utilities
    wget \
    gnupg \
    unzip \
    curl \
    # Xvfb for virtual display (allows non-headless Chrome in container)
    xvfb \
    # Chrome dependencies
    fonts-liberation \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libatspi2.0-0 \
    libcups2 \
    libdbus-1-3 \
    libdrm2 \
    libgbm1 \
    libgtk-3-0 \
    libnspr4 \
    libnss3 \
    libwayland-client0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxkbcommon0 \
    libxrandr2 \
    xdg-utils \
    # Additional dependencies
    libu2f-udev \
    libvulkan1 \
    && rm -rf /var/lib/apt/lists/*

# Install Chromium (works on all architectures)
RUN apt-get update \
    && apt-get install -y chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements-production.txt .
RUN pip install --no-cache-dir -r requirements-production.txt

# Copy application code
COPY modules/ ./modules/
COPY api_server_production.py .
COPY config.yaml .

# Create startup script for Xvfb + API server
RUN echo '#!/bin/bash\n\
# Start Xvfb (virtual display) in background\n\
Xvfb :99 -screen 0 1920x1080x24 -ac +extension GLX +render -noreset &\n\
export DISPLAY=:99\n\
\n\
# Wait for Xvfb to start\n\
sleep 2\n\
\n\
# Start API server\n\
exec python api_server_production.py\n\
' > /app/start.sh && chmod +x /app/start.sh

# Create non-root user and give SeleniumBase write permissions
RUN useradd -m -u 1000 scraper && \
    chown -R scraper:scraper /app && \
    chown -R scraper:scraper /usr/local/lib/python3.11/site-packages/seleniumbase

USER scraper

# Expose port
EXPOSE 8000

# Environment variables for Chromium in container
ENV DISPLAY=:99
ENV CHROME_BIN=/usr/bin/chromium
ENV CHROME_PATH=/usr/bin/chromium

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health/live || exit 1

# Run startup script (starts Xvfb + API server)
CMD ["/app/start.sh"]
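
The startup script waits a fixed two seconds for Xvfb before launching the API server. A possible alternative, in the spirit of the commit's move away from hardcoded sleeps, is to poll for the display socket. The sketch below is an illustration only and not part of the repository: `wait_for_path` is a hypothetical helper, and it assumes Xvfb on display `:99` creates the Unix socket `/tmp/.X11-unix/X99`, which is the usual X server convention.

```python
import os
import time


def wait_for_path(path: str, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage before starting the browser (display :99 as in start.sh):
#   if not wait_for_path("/tmp/.X11-unix/X99"):
#       raise RuntimeError("Xvfb did not start in time")
```

Polling returns as soon as the socket appears and fails loudly if it never does, whereas a fixed sleep either wastes time or is too short on a slow host.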