Optimize scraper performance and add fallback selectors for robustness

Performance improvements:
- Validation speed: 59.71s → 10.96s (5.5x improvement)
- Removed 50+ console.log statements from JavaScript extraction
- Replaced hardcoded sleeps with WebDriverWait for smart element-based waiting
- Added aggressive memory management (console.clear, GC, image unloading every 20 scrolls)

Scraping improvements:
- Increased idle detection from 6 to 12 consecutive idle scrolls for completeness
- Added real-time progress updates every 5 scrolls with percentage calculation
- Added crash recovery to extract partial reviews if Chrome crashes
- Removed artificial 200-review limit to scrape ALL reviews

Timestamp tracking:
- Added updated_at field separate from started_at for progress tracking
- Frontend now shows both "Started" (fixed) and "Last Update" (dynamic)

Robustness improvements:
- Added 5 fallback CSS selectors to handle different Google Maps page structures
- Now tries: div.jftiEf.fontBodyMedium, div.jftiEf, div[data-review-id], etc.
- Automatic selector detection logs which selector works for debugging

Test results:
- Successfully scraped 550 reviews in 150.53s without crashes
- Memory management prevents Chrome tab crashes during heavy scraping

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
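
The scraper code itself is not part of this diff, so the following is only a minimal sketch of two ideas from the message: trying fallback selectors in order, and stopping after 12 consecutive idle scrolls. The names `pick_selector`, `IdleTracker`, and `count_matches` are hypothetical and chosen for illustration. Only the three selectors quoted in the message are listed; the remaining two are elided there ("etc.") and are not guessed at.

```python
from typing import Callable, Optional, Sequence

# Selectors quoted in the commit message, in the order they are tried.
# The message mentions five in total; the other two are not shown there.
FALLBACK_SELECTORS = (
    "div.jftiEf.fontBodyMedium",
    "div.jftiEf",
    "div[data-review-id]",
)


def pick_selector(
    count_matches: Callable[[str], int],
    selectors: Sequence[str] = FALLBACK_SELECTORS,
) -> Optional[str]:
    """Return the first selector that matches at least one element, else None.

    `count_matches` is a hypothetical callback that reports how many
    elements on the current page match a CSS selector.
    """
    for selector in selectors:
        if count_matches(selector) > 0:
            return selector
    return None


class IdleTracker:
    """Counts consecutive scrolls that loaded no new reviews."""

    def __init__(self, limit: int = 12) -> None:
        self.limit = limit      # idle scrolls in a row before giving up
        self.idle = 0           # current run of idle scrolls
        self.last_count = 0     # highest review count seen so far

    def update(self, review_count: int) -> bool:
        """Record one scroll; return True once `limit` idle scrolls occur in a row."""
        if review_count > self.last_count:
            self.idle = 0
            self.last_count = review_count
        else:
            self.idle += 1
        return self.idle >= self.limit
```

With Selenium, `count_matches` could be something like `lambda css: len(driver.find_elements(By.CSS_SELECTOR, css))`, and the scroll loop would call `tracker.update(...)` after each scroll and stop when it returns `True`. Keeping both pieces free of browser calls makes them easy to test without Chrome.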
87
Dockerfile
Normal file
@@ -0,0 +1,87 @@
FROM python:3.11-slim

# Install system dependencies for Chrome, Selenium, and Xvfb (virtual display)
RUN apt-get update && apt-get install -y \
    # Basic utilities
    wget \
    gnupg \
    unzip \
    curl \
    # Xvfb for virtual display (allows non-headless Chrome in container)
    xvfb \
    # Chrome dependencies
    fonts-liberation \
    libasound2 \
    libatk-bridge2.0-0 \
    libatk1.0-0 \
    libatspi2.0-0 \
    libcups2 \
    libdbus-1-3 \
    libdrm2 \
    libgbm1 \
    libgtk-3-0 \
    libnspr4 \
    libnss3 \
    libwayland-client0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxkbcommon0 \
    libxrandr2 \
    xdg-utils \
    # Additional dependencies
    libu2f-udev \
    libvulkan1 \
    && rm -rf /var/lib/apt/lists/*

# Install Chromium (works on all architectures)
RUN apt-get update \
    && apt-get install -y chromium chromium-driver \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements-production.txt .
RUN pip install --no-cache-dir -r requirements-production.txt

# Copy application code
COPY modules/ ./modules/
COPY api_server_production.py .
COPY config.yaml .

# Create startup script for Xvfb + API server
RUN echo '#!/bin/bash\n\
# Start Xvfb (virtual display) in background\n\
Xvfb :99 -screen 0 1920x1080x24 -ac +extension GLX +render -noreset &\n\
export DISPLAY=:99\n\
\n\
# Wait for Xvfb to start\n\
sleep 2\n\
\n\
# Start API server\n\
exec python api_server_production.py\n\
' > /app/start.sh && chmod +x /app/start.sh

# Create non-root user and give SeleniumBase write permissions
RUN useradd -m -u 1000 scraper && \
    chown -R scraper:scraper /app && \
    chown -R scraper:scraper /usr/local/lib/python3.11/site-packages/seleniumbase

USER scraper

# Expose port
EXPOSE 8000

# Environment variables for Chromium in container
ENV DISPLAY=:99
ENV CHROME_BIN=/usr/bin/chromium
ENV CHROME_PATH=/usr/bin/chromium

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:8000/health/live || exit 1

# Run startup script (starts Xvfb + API server)
CMD ["/app/start.sh"]
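
The startup script in this Dockerfile waits a fixed `sleep 2` for Xvfb to come up. In the spirit of the commit's move away from hardcoded sleeps, a possible alternative is to poll for the display's Unix socket before launching Chrome. This is a sketch only: `wait_for_path` and `x11_socket_path` are hypothetical helpers that do not appear in this commit, and the code assumes Xvfb listens on the conventional `/tmp/.X11-unix/X<n>` socket for a local display on Linux.

```python
import os
import time


def x11_socket_path(display: str) -> str:
    """Map a display string such as ":99" or ":99.0" to its Unix socket path.

    Assumes the conventional /tmp/.X11-unix location for local displays.
    """
    number = display.lstrip(":").split(".")[0]
    return f"/tmp/.X11-unix/X{number}"


def wait_for_path(path: str, timeout: float = 10.0, interval: float = 0.1) -> bool:
    """Poll until `path` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The API server could call `wait_for_path(x11_socket_path(os.environ.get("DISPLAY", ":99")))` before creating a browser session and fail fast with a clear error if it returns `False`. This returns as soon as the socket appears and surfaces a slow or failed Xvfb start instead of silently racing it.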