whyrating-engine-legacy/docs/TROUBLESHOOTING.md
George Khananaev 262f0c0be7 migrate to SeleniumBase UC Mode for automatic version management
- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev
2025-12-07 19:40:13 +07:00


Troubleshooting Guide

This guide covers common issues and their solutions when running Google Reviews Scraper Pro.


Table of Contents

  1. Chrome & ChromeDriver Issues
  2. MongoDB Issues
  3. AWS S3 Issues
  4. Scraping Issues
  5. API Server Issues
  6. Image Download Issues
  7. Configuration Issues
  8. Performance Issues
  9. Python & Dependencies Issues

Chrome & ChromeDriver Issues

Issue: ChromeDriver Version Mismatch

Error Message:

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 143
Current browser version is 142.0.7444.176

Cause: The installed ChromeDriver does not match the installed Chrome version.

Solution:

SeleniumBase UC Mode normally resolves version mismatches automatically. If the error still appears:

  1. Update Chrome to the latest version:

    • macOS: Open Chrome → Menu → Help → About Google Chrome
    • Or run: open -a "Google Chrome" "chrome://settings/help"
  2. Upgrade SeleniumBase (if needed):

    pip install --upgrade seleniumbase
    
  3. Run the scraper again. SeleniumBase automatically downloads the matching ChromeDriver.
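
When the mismatch message does appear, it helps to know which side is out of date. The sketch below parses the two version numbers out of the error text shown above. The function names are hypothetical and are not part of the scraper; the regular expressions assume the message wording quoted in this section.

```python
import re


def parse_version_mismatch(message):
    """Return (supported_major, browser_version), or None if the text does not match."""
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is ([\d.]+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), current.group(1)


def describe_mismatch(message):
    """Turn a mismatch message into a short hint about what to update."""
    parsed = parse_version_mismatch(message)
    if parsed is None:
        return "not a ChromeDriver version mismatch"
    supported_major, browser_version = parsed
    browser_major = int(browser_version.split(".")[0])
    if browser_major < supported_major:
        return f"Chrome {browser_major} is older than the driver ({supported_major}): update Chrome"
    if browser_major > supported_major:
        return f"the driver ({supported_major}) is older than Chrome {browser_major}: refresh the driver"
    return "major versions match"
```

Passing the exception text to `describe_mismatch` indicates whether step 1 (update Chrome) or step 2 (upgrade SeleniumBase) is the more likely fix.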


Issue: ChromeOptions Reuse Error

Error Message:

RuntimeError: you cannot reuse the ChromeOptions object

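Cause: The same ChromeOptions object was reused when Chrome initialization was retried.

Solution: Restart the scraper so that Chrome starts with a fresh options object. If the error persists, upgrade SeleniumBase as described under ChromeDriver Version Mismatch above.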


Issue: Chrome Binary Not Found

Error Message:

WebDriverException: Message: unknown error: cannot find Chrome binary

Cause: Chrome is not installed or not in the expected location.

Solution:

  1. Install Chrome from https://www.google.com/chrome/ or through your system package manager.

  2. For custom Chrome location, set environment variable:

    export CHROME_BIN=/path/to/chrome
    
  3. Docker users: Ensure Chrome is installed in Dockerfile:

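    # google-chrome-stable is not in the default Debian/Ubuntu repositories; this installs Google's amd64 .deb
    RUN apt-get update && apt-get install -y wget && wget -q https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb \
        && apt-get install -y ./google-chrome-stable_current_amd64.deb && rm google-chrome-stable_current_amd64.deb
    ENV CHROME_BIN=/usr/bin/google-chrome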
    

Issue: Chrome Crashes in Headless Mode

Error Message:

WebDriverException: Message: chrome not reachable

Solution:

  1. Add required flags (already included in scraper, but verify):

    --no-sandbox
    --disable-dev-shm-usage
    --disable-gpu
    
  2. Increase shared memory (Docker):

    docker run --shm-size=2g your-image
    
  3. Try non-headless mode to debug:

    python start.py --headless false
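
Step 2 matters because Docker containers get a 64 MB /dev/shm by default, which Chrome can exhaust. A minimal sketch for checking the size from inside the container follows. The function names and the 512 MiB threshold are assumptions chosen for illustration, not values taken from the scraper.

```python
import shutil


def shared_memory_mib(path="/dev/shm"):
    """Return (total, free) size of the filesystem at `path`, in MiB."""
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total // mib, usage.free // mib


def shm_looks_too_small(total_mib, minimum_mib=512):
    """Flag shared memory that is likely too small for Chrome (threshold is a guess)."""
    return total_mib < minimum_mib
```

If `shm_looks_too_small(shared_memory_mib()[0])` returns True inside the container, raising `--shm-size` is the first thing to try.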
    

MongoDB Issues

Issue: Connection Timeout

Error Message:

ServerSelectionTimeoutError: connection timed out

Cause: MongoDB server unreachable or network issues.

Solution:

  1. Verify MongoDB is running:

    # Local MongoDB
    mongosh --eval "db.adminCommand('ping')"
    
    # Check service status
    sudo systemctl status mongod
    
  2. Check connection URI:

    # config.yaml
    mongodb:
      uri: "mongodb://username:password@host:27017/"
    
  3. For MongoDB Atlas:

    • Whitelist your IP address in Atlas dashboard
    • Verify cluster is active
    • Check network connectivity
  4. Test connection manually:

    python -c "from pymongo import MongoClient; c = MongoClient('your-uri', serverSelectionTimeoutMS=5000); print(c.server_info())"
    

Issue: Authentication Failed

Error Message:

OperationFailure: Authentication failed

Solution:

  1. Verify credentials in connection URI
  2. Check that authSource names the database where the user was created (often admin)
  3. Use correct URI format:
    mongodb://username:password@host:27017/database?authSource=admin
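
A frequent cause of this error is a password containing characters such as `@`, `:` or `/`, which must be percent-encoded inside a MongoDB URI. The sketch below builds the URI format shown above with encoded credentials. The function name and the example values are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database, auth_source="admin", port=27017):
    """Build a MongoDB URI with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb://{user}:{pwd}@{host}:{port}/{database}?authSource={auth_source}"
```

For example, `build_mongo_uri("scraper", "p@ss:word/1", "localhost", "reviews")` encodes the password as `p%40ss%3Aword%2F1`, so the URI parser no longer mistakes the `@` in the password for the host separator.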
    

Issue: SSL Certificate Error

Error Message:

SSL: CERTIFICATE_VERIFY_FAILED

Solution:

  1. For macOS, run:

    /Applications/Python\ 3.x/Install\ Certificates.command
    
  2. Or install certifi:

    pip install --upgrade certifi
    
  3. The scraper auto-handles this, but if issues persist:

    import certifi
    import os
    os.environ['SSL_CERT_FILE'] = certifi.where()
    

AWS S3 Issues

Issue: Access Denied

Error Message:

ClientError: An error occurred (AccessDenied) when calling the PutObject operation

Solution:

  1. Verify AWS credentials:

    # config.yaml
    s3:
      aws_access_key_id: "YOUR_ACCESS_KEY"
      aws_secret_access_key: "YOUR_SECRET_KEY"
    
  2. Check IAM permissions - required policy:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket",
            "s3:PutObjectAcl"
          ],
          "Resource": [
            "arn:aws:s3:::your-bucket-name",
            "arn:aws:s3:::your-bucket-name/*"
          ]
        }
      ]
    }
    
  3. Check bucket policy allows public-read if using public URLs
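
To compare an existing IAM policy with the required actions listed in step 2, a rough check along these lines may help. It is a sketch under simplifying assumptions: it only looks at `Allow` statements, ignores `Resource` and `Condition`, and understands only the full wildcards `*` and `s3:*`. The function name is hypothetical.

```python
REQUIRED_ACTIONS = ("s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:PutObjectAcl")


def missing_actions(policy, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement in `policy` grants."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # "Action" may be one string or a list
            actions = [actions]
        allowed.update(actions)
    if "*" in allowed or "s3:*" in allowed:
        return []
    return sorted(action for action in required if action not in allowed)
```

Load the policy with `json.load` and pass the resulting dict; an empty list means all four actions are granted somewhere in the policy.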


Issue: Bucket Not Found

Error Message:

ClientError: An error occurred (NoSuchBucket)

Solution:

  1. Verify bucket name in config.yaml

  2. Check region matches bucket location:

    s3:
      region_name: "us-east-1"  # Must match bucket region
      bucket_name: "your-bucket"
    
  3. Create bucket if it doesn't exist via AWS Console or CLI


Issue: Invalid Credentials

Error Message:

NoCredentialsError: Unable to locate credentials

Solution:

  1. Set credentials in config.yaml or environment variables:

    export AWS_ACCESS_KEY_ID=your_key
    export AWS_SECRET_ACCESS_KEY=your_secret
    
  2. Or use the AWS credentials file (~/.aws/credentials):

    # ~/.aws/credentials
    [default]
    aws_access_key_id = YOUR_KEY
    aws_secret_access_key = YOUR_SECRET
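
To see which of the two sources above is actually visible to the process, a small check like the following can be run in the same shell and virtual environment as the scraper. It is a sketch that covers only environment variables and the credentials file; it does not inspect config.yaml or any other lookup the AWS SDK performs. The function name is hypothetical.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None, profile="default"):
    """Return "environment", "credentials_file", or None if neither source has keys."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file is silently skipped
    if parser.has_section(profile):
        section = parser[profile]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return "credentials_file"
    return None
```

A result of None from `find_aws_credentials()` is consistent with the NoCredentialsError above, unless the keys are supplied through config.yaml.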
    

Scraping Issues

Issue: Reviews Tab Not Found

Error Message:

TimeoutException: Reviews tab not found or could not be clicked

Cause: Google Maps UI changed or page didn't load properly.

Solution:

  1. Try non-headless mode to see what's happening:

    python start.py --headless false
    
  2. Check the URL is a valid Google Maps place URL

  3. Increase timeout - network may be slow

  4. Clear cookies/cache - Google may be showing consent dialogs

  5. Try different sort order:

    python start.py --sort relevance
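
For step 2, a quick heuristic can catch obviously wrong URLs (search results, plain google.com links) before a scraping run. The rules below are assumptions about what a place URL usually looks like, not the scraper's own validation, and whether the scraper accepts short share links is not confirmed here.

```python
from urllib.parse import urlparse


def looks_like_maps_place_url(url):
    """Heuristic check that `url` points at a Google Maps place page."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Short share links redirect to a full place URL.
    if host == "maps.app.goo.gl" or (host == "goo.gl" and parsed.path.startswith("/maps")):
        return True
    is_google = (
        host == "google.com"
        or host.startswith("www.google.")
        or host.startswith("maps.google.")
    )
    return is_google and parsed.path.startswith("/maps/place/")
```

A False result does not prove the URL is unusable, but it is a good reason to open the link in a browser and copy the address of the place page again.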
    

Issue: No Reviews Found

Error Message:

WARNING: No review cards found in this iteration

Cause: Page structure changed or place has no reviews.

Solution:

  1. Verify the place has reviews by opening URL in browser
  2. Check if page requires login for reviews
  3. Wait longer for page to load - add delay in config
  4. Check for CAPTCHA - may need to solve manually first

Issue: Stale Element Reference

Error Message:

StaleElementReferenceException: stale element reference: element is not attached to the page document

Cause: Page updated while scraping.

Solution: The scraper handles this automatically. If the error persists:

  1. Reduce scroll speed - increase sleep time
  2. Run in non-headless mode to observe behavior
  3. Restart scraper - temporary DOM issue
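
The usual way to handle a stale element is to look the element up again and retry, with a short pause in between. The helper below is a generic sketch of that pattern and is not the scraper's actual implementation. In a Selenium context the `exceptions` argument would be `selenium.common.exceptions.StaleElementReferenceException`, and `action` would re-locate the element before reading it.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted.

    The pause grows linearly (delay, 2 * delay, ...) between attempts.
    The last caught exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            last_error = error
            if attempt < attempts:
                sleep(delay * attempt)
    raise last_error
```

Increasing `delay` corresponds to step 1 above: it gives the page more time to settle between attempts.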

Issue: Cookie Consent Dialog Not Dismissed

Cause: Google's cookie consent dialog is not dismissed and blocks the page.

Solution:

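  1. Clear browser data (this is the macOS data directory left by the earlier undetected-chromedriver setup; skip this step if the directory does not exist):

    rm -rf ~/Library/Application\ Support/undetected_chromedriver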
    
  2. The scraper handles this automatically, but you can:

    • Open the URL manually first and accept cookies
    • Use a different Google account region

API Server Issues

Issue: Port Already in Use

Error Message:

OSError: [Errno 48] Address already in use

On Linux the same error is reported as [Errno 98].

Solution:

  1. Find and kill the process:

    # Find process using port 8000
    lsof -i :8000
    
    # Kill the process
    kill -9 <PID>
    
  2. Use different port:

    uvicorn api_server:app --port 8080
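
Before choosing another port, it can be checked from Python whether that port is free. The sketch below tries to bind to the port and reports the result. The function names are hypothetical, and ports below 1024 are reported as unavailable when the process lacks permission to bind them.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(candidates, host="127.0.0.1"):
    """Return the first free port from `candidates`, or None if all are taken."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None
```

`first_free_port([8000, 8080, 8888])` returns a port that can then be passed to `uvicorn` with `--port`.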
    

Issue: Max Concurrent Jobs Reached

Error Message:

HTTP 429: Maximum concurrent jobs (3) reached

Solution:

  1. Wait for existing jobs to complete
  2. Cancel pending jobs:
    curl -X POST "http://localhost:8000/jobs/{job_id}/cancel"
    
  3. Increase limit in api_server.py (not recommended for stability)

Issue: CORS Errors (Browser)

Error Message:

Access-Control-Allow-Origin header missing

Solution: CORS is enabled by default. If issues persist:

  1. Check allowed origins in api_server.py
  2. For development, ensure middleware is configured:
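    from fastapi.middleware.cors import CORSMiddleware  # assumes api_server.py is a FastAPI app

    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],  # development only; list explicit origins in production
        allow_methods=["*"],
        allow_headers=["*"],
    )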
    

Image Download Issues

Issue: Images Not Downloading

Cause: Network issues or Google blocking requests.

Solution:

  1. Check network connectivity

  2. Verify image URLs are accessible

  3. Reduce parallel downloads:

    download_threads: 2  # Reduce from default 4
    
  4. Check disk space for image storage


Issue: Images Corrupted or Wrong Size

Cause: Partial downloads or URL issues.

Solution:

  1. Clear image directory and re-run:

    rm -rf review_images/
    
  2. Check max dimensions in config:

    max_width: 1200
    max_height: 1200
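
Instead of deleting every image, partial downloads can often be spotted by checking the file's start and end markers. The sketch below does this for JPEG and PNG files only. It is a heuristic: other formats are reported as "unknown", and a file can pass the check and still be damaged in the middle. The function names are hypothetical.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_TRAILER = b"IEND\xaeB`\x82"  # IEND chunk type followed by its fixed CRC


def image_status(data):
    """Classify image bytes as "complete", "truncated", or "unknown"."""
    if data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
        return "complete" if data.endswith(b"\xff\xd9") else "truncated"
    if data.startswith(PNG_SIGNATURE):
        return "complete" if data.endswith(PNG_TRAILER) else "truncated"
    return "unknown"


def find_truncated(directory):
    """Return the names of files in `directory` that look like partial downloads."""
    truncated = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and image_status(path.read_bytes()) == "truncated":
            truncated.append(path.name)
    return truncated
```

Running `find_truncated("review_images")` lists the files worth deleting and downloading again.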
    

Issue: Permission Denied Writing Images

Error Message:

PermissionError: [Errno 13] Permission denied

Solution:

  1. Check directory permissions:

    chmod 755 review_images/
    
  2. Use different directory:

    image_dir: "/path/with/write/access"
    

Configuration Issues

Issue: Config File Not Found

Error Message:

FileNotFoundError: config.yaml not found

Solution:

  1. Create config.yaml from example:

    cp examples/config-example.txt config.yaml
    
  2. Specify custom path:

    python start.py --config /path/to/config.yaml
    

Issue: Invalid YAML Syntax

Error Message:

yaml.scanner.ScannerError: mapping values are not allowed here

Solution:

  1. Validate the YAML syntax with an online validator
  2. Check indentation - use spaces, not tabs
  3. Escape special characters in strings:
    url: "https://example.com?param=value"  # Use quotes
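
YAML does not allow tab characters for indentation, and tabs are hard to see in most editors. This small sketch reports the line numbers whose indentation contains a tab; the function name is hypothetical.

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines
```

Read config.yaml as text and pass its contents to the function; every reported line should be re-indented with spaces.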
    

Issue: Invalid Configuration Values

Error Message:

ValueError: Invalid sort_by value

Solution:

  1. Check allowed values:

    • sort_by: newest, highest, lowest, relevance
    • headless: true, false
  2. Verify types:

    download_threads: 4      # Integer, not string
    headless: true           # Boolean, not string "true"
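
The checks above can be expressed as a short validation pass over the loaded configuration. This is a sketch covering only the three keys named in this section; it is not the scraper's own validation code, and the function name is hypothetical.

```python
ALLOWED_SORT = ("newest", "highest", "lowest", "relevance")


def validate_config(config):
    """Return a list of human-readable problems found in `config` (a dict)."""
    errors = []
    sort_by = config.get("sort_by")
    if sort_by is not None and sort_by not in ALLOWED_SORT:
        errors.append(f"sort_by must be one of {', '.join(ALLOWED_SORT)}; got {sort_by!r}")
    headless = config.get("headless")
    if headless is not None and not isinstance(headless, bool):
        errors.append(f"headless must be true or false; got {headless!r}")
    threads = config.get("download_threads")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if threads is not None and (
        isinstance(threads, bool) or not isinstance(threads, int) or threads < 1
    ):
        errors.append(f"download_threads must be a positive integer; got {threads!r}")
    return errors
```

An empty list means the three values have the expected types; each message otherwise names the offending key and value.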
    

Performance Issues

Issue: Scraping Too Slow

Solution:

  1. Use headless mode:

    python start.py --headless
    
  2. Reduce image download threads if network is slow:

    download_threads: 2
    
  3. Disable image downloading for faster scraping:

    download_images: false
    
  4. Use SSD for faster JSON/image writes


Issue: High Memory Usage

Solution:

  1. Process in batches - use stop_on_match for incremental scraping
  2. Disable image downloading temporarily
  3. Close other applications
  4. Increase system swap if needed

Issue: Chrome Using Too Much CPU

Solution:

  1. Use headless mode - reduces rendering overhead
  2. Add GPU flags:
    --disable-gpu
    --disable-software-rasterizer
    
  3. Limit concurrent jobs in API mode

Python & Dependencies Issues

Issue: Module Not Found

Error Message:

ModuleNotFoundError: No module named 'seleniumbase'

Solution:

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Verify virtual environment is activated:

    source venv/bin/activate  # Linux/macOS
    venv\Scripts\activate     # Windows
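
To confirm which packages the active interpreter can actually see, a check like the one below can be run inside the virtual environment. It is a minimal sketch; the function name is hypothetical, and the module names passed to it should come from your own requirements.txt.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["seleniumbase", "pymongo"])` returns the names that still need to be installed. An empty list means both are importable from the interpreter that ran the check.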
    

Issue: Incompatible Package Versions

Error Message:

ImportError: cannot import name 'X' from 'Y'

Solution:

  1. Reinstall all dependencies:

    pip uninstall -r requirements.txt -y
    pip install -r requirements.txt
    
  2. Create fresh virtual environment:

    python -m venv fresh_venv
    source fresh_venv/bin/activate
    pip install -r requirements.txt
    

Issue: Python Version Incompatibility

Error Message:

SyntaxError: invalid syntax

Solution:

  1. Check Python version (requires 3.9+):

    python --version
    
  2. Install correct Python version:

    # macOS with pyenv
    pyenv install 3.13.1
    pyenv local 3.13.1
    
    # Or use system package manager
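
The version requirement can also be checked from inside Python, which is useful when several interpreters are installed and `python` may not be the one the scraper runs under. A minimal sketch, with a hypothetical function name:

```python
import sys

MINIMUM_VERSION = (3, 9)


def python_is_supported(version=None, minimum=MINIMUM_VERSION):
    """Return True if `version` (default: the running interpreter) meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)
```

Calling `python_is_supported()` with no arguments checks the interpreter that executes the call.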
    

Getting Help

If your issue isn't listed here:

  1. Enable debug logging:

    LOG_LEVEL=DEBUG python start.py
    
  2. Check logs for detailed error messages

  3. Search existing issues on GitHub

  4. Create a new issue with:

    • Error message (full traceback)
    • Python version (python --version)
    • OS and version
    • Chrome version
    • Steps to reproduce
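
Part of the information requested above can be collected automatically. The sketch below gathers the Python and OS details using the standard library. The Chrome version is not detected and has to be added by hand (see chrome://version), and the function names are hypothetical.

```python
import platform
import sys


def collect_environment():
    """Gather interpreter and OS details for a bug report."""
    return {
        "python": platform.python_version(),
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "executable": sys.executable,
    }


def format_report(info):
    """Render the collected details as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in info.items())
```

Printing `format_report(collect_environment())` produces a block that can be pasted into the issue next to the traceback.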