George Khananaev 262f0c0be7 migrate to SeleniumBase UC Mode for automatic version management

- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev

2025-12-07 19:40:13 +07:00

Troubleshooting Guide

This guide covers common issues and their solutions when running Google Reviews Scraper Pro.

Chrome & ChromeDriver Issues
MongoDB Issues
AWS S3 Issues
Scraping Issues
API Server Issues
Image Download Issues
Configuration Issues
Performance Issues
Python & Dependencies Issues

Chrome & ChromeDriver Issues

Issue: ChromeDriver Version Mismatch

Error Message:

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 143
Current browser version is 142.0.7444.176

Cause: The installed ChromeDriver does not support the installed Chrome version. SeleniumBase UC Mode normally resolves this automatically by downloading a matching ChromeDriver.

Solution:

If the error still appears, update both sides and retry:

Update Chrome to latest version:
- macOS: Open Chrome → Menu → Help → About Google Chrome
- Or run: open -a "Google Chrome" "chrome://settings/help"
Upgrade SeleniumBase (if needed):
```
pip install --upgrade seleniumbase
```
Run scraper again - SeleniumBase automatically downloads the matching ChromeDriver.
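To decide which side is out of date, the two version numbers can be read straight from the error text. The following is a minimal sketch; `parse_version_mismatch` is a hypothetical helper for illustration and is not part of the scraper or of SeleniumBase.

```python
import re

# Hypothetical helper (not part of the scraper): extract both major versions
# from the text of a SessionNotCreatedException.
_SUPPORTED = re.compile(r"only supports Chrome version (\d+)")
_CURRENT = re.compile(r"Current browser version is (\d+)")


def parse_version_mismatch(message):
    """Return (driver_major, browser_major), or None if the text does not match."""
    supported = _SUPPORTED.search(message)
    current = _CURRENT.search(message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


message = (
    "session not created: This version of ChromeDriver only supports "
    "Chrome version 143\nCurrent browser version is 142.0.7444.176"
)
driver_major, browser_major = parse_version_mismatch(message)
if browser_major < driver_major:
    print("Chrome is older than the driver: update Chrome.")
elif browser_major > driver_major:
    print("The driver is older than Chrome: upgrade SeleniumBase and re-run.")
```

In the example message the browser (142) is behind the driver (143), so updating Chrome is the first step to try.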

Issue: ChromeOptions Reuse Error

Error Message:

RuntimeError: you cannot reuse the ChromeOptions object

Cause: Internal error when retrying Chrome initialization.

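Solution: Restart the scraper so that Chrome is initialized with a fresh options object. If the error persists, follow the steps under "ChromeDriver Version Mismatch" above, then run the scraper again.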

Issue: Chrome Binary Not Found

Error Message:

WebDriverException: Message: unknown error: cannot find Chrome binary

Cause: Chrome is not installed or not in the expected location.

Solution:

Install Chrome:
- Download from: https://www.google.com/chrome/
For custom Chrome location, set environment variable:
```
export CHROME_BIN=/path/to/chrome
```

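Docker users: Ensure Chrome is installed in the Dockerfile (the google-chrome-stable package is only available once Google's apt repository has been added to the image):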

RUN apt-get update && apt-get install -y google-chrome-stable
ENV CHROME_BIN=/usr/bin/google-chrome

Issue: Chrome Crashes in Headless Mode

Error Message:

WebDriverException: Message: chrome not reachable

Solution:

Add required flags (already included in scraper, but verify):
```
--no-sandbox
--disable-dev-shm-usage
--disable-gpu
```
Increase shared memory (Docker):
```
docker run --shm-size=2g your-image
```
Try non-headless mode to debug:
```
python start.py --headless false
```

MongoDB Issues

Issue: Connection Timeout

Error Message:

ServerSelectionTimeoutError: connection timed out

Cause: MongoDB server unreachable or network issues.

Solution:

Verify MongoDB is running:

# Local MongoDB
mongosh --eval "db.adminCommand('ping')"

# Check service status
sudo systemctl status mongod

Check connection URI:

# config.yaml
mongodb:
  uri: "mongodb://username:password@host:27017/"

For MongoDB Atlas:
- Whitelist your IP address in Atlas dashboard
- Verify cluster is active
- Check network connectivity

Test connection manually:

python -c "from pymongo import MongoClient; c = MongoClient('your-uri', serverSelectionTimeoutMS=5000); print(c.server_info())"

Issue: Authentication Failed

Error Message:

OperationFailure: Authentication failed

Solution:

Verify credentials in connection URI
Check that authSource names the database in which the user was created (often admin)

Use correct URI format:

mongodb://username:password@host:27017/database?authSource=admin
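A frequent cause of authentication failures is a password containing characters such as `@`, `:` or `/`, which must be percent-encoded inside a MongoDB URI. The sketch below builds a URI in the format shown above; `build_mongo_uri` and its parameter names are assumptions for illustration, not part of the scraper.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database,
                    auth_source="admin", port=27017):
    """Build a MongoDB URI with percent-encoded credentials."""
    user = quote_plus(username)
    secret = quote_plus(password)
    return (
        f"mongodb://{user}:{secret}@{host}:{port}/"
        f"{database}?authSource={auth_source}"
    )


# The password below contains '@', ':' and '/', all of which get encoded.
print(build_mongo_uri("scraper", "p@ss:word/1", "localhost", "reviews"))
```

The resulting string can be pasted into the `uri` field of config.yaml.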

Issue: SSL Certificate Error

Error Message:

SSL: CERTIFICATE_VERIFY_FAILED

Solution:

For macOS, run:

/Applications/Python\ 3.x/Install\ Certificates.command

Or install certifi:
```
pip install --upgrade certifi
```

The scraper handles this automatically, but if issues persist, point Python at the certifi bundle before connecting:

import certifi
import os
os.environ['SSL_CERT_FILE'] = certifi.where()

AWS S3 Issues

Issue: Access Denied

Error Message:

ClientError: An error occurred (AccessDenied) when calling the PutObject operation

Solution:

Verify AWS credentials:

# config.yaml
s3:
  aws_access_key_id: "YOUR_ACCESS_KEY"
  aws_secret_access_key: "YOUR_SECRET_KEY"

Check IAM permissions - required policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}

Check bucket policy allows public-read if using public URLs

Issue: Bucket Not Found

Error Message:

ClientError: An error occurred (NoSuchBucket)

Solution:

Verify bucket name in config.yaml

Check region matches bucket location:

s3:
  region_name: "us-east-1"  # Must match bucket region
  bucket_name: "your-bucket"

Create bucket if it doesn't exist via AWS Console or CLI

Issue: Invalid Credentials

Error Message:

NoCredentialsError: Unable to locate credentials

Solution:

Set credentials in config.yaml or environment variables:

export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

Or use AWS credentials file:

# ~/.aws/credentials
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
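To see which of the two sources above is actually populated, a quick check like the following may help. It is a sketch only: `find_aws_credentials` is a hypothetical helper, it does not read config.yaml, and the AWS SDK itself may consult further sources (for example IAM roles) that are not covered here.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_file=None, profile="default"):
    """Return 'environment', 'credentials file', or None if neither is set."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    if credentials_file is None:
        credentials_file = Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(credentials_file)  # a missing file is silently ignored
    if parser.has_section(profile):
        section = parser[profile]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return "credentials file"
    return None


print("Credentials found in:", find_aws_credentials())
```

A result of `None` matches the situation in which `NoCredentialsError` is raised, unless the keys are set in config.yaml.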

Scraping Issues

Issue: Reviews Tab Not Found

Error Message:

TimeoutException: Reviews tab not found or could not be clicked

Cause: Google Maps UI changed or page didn't load properly.

Solution:

Try non-headless mode to see what's happening:
```
python start.py --headless false
```
Check the URL is a valid Google Maps place URL
Increase timeout - network may be slow
Clear cookies/cache - Google may be showing consent dialogs
Try different sort order:
```
python start.py --sort relevance
```

Issue: No Reviews Found

Error Message:

WARNING: No review cards found in this iteration

Cause: Page structure changed or place has no reviews.

Solution:

Verify the place has reviews by opening URL in browser
Check if page requires login for reviews
Wait longer for page to load - add delay in config
Check for CAPTCHA - may need to solve manually first

Issue: Stale Element Reference

Error Message:

StaleElementReferenceException: stale element reference: element is not attached to the page document

Cause: Page updated while scraping.

Solution: The scraper handles this automatically. If the error persists:

Reduce scroll speed - increase sleep time
Run in non-headless mode to observe behavior
Restart scraper - temporary DOM issue
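The usual remedy for a stale element is to look the element up again and retry. The decorator below is a generic sketch of that pattern; it is not the scraper's own implementation, and `retry_on` and `FakeStaleError` are hypothetical names. With Selenium installed, the exception to pass would be `selenium.common.exceptions.StaleElementReferenceException`.

```python
import functools
import time


def retry_on(exception_types, attempts=3, delay=0.5):
    """Re-run the wrapped function when one of exception_types is raised."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exception_types:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    time.sleep(delay)  # give the page time to settle
        return wrapper
    return decorator


class FakeStaleError(Exception):
    """Stand-in for StaleElementReferenceException in this demo."""


calls = {"count": 0}


@retry_on(FakeStaleError, attempts=3, delay=0)
def read_review_text():
    # Fails twice, then succeeds, imitating a DOM that re-renders.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeStaleError("element is not attached to the page document")
    return "Great coffee"


print(read_review_text(), "after", calls["count"], "attempts")
```

For the retry to help, the wrapped function has to locate the element again on each call rather than reuse a stored reference.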

Issue: Cookie Consent Blocking

Cause: The cookie consent dialog is not being dismissed.

Solution:

Clear browser data (macOS path left over from the earlier undetected-chromedriver setup):

rm -rf ~/Library/Application\ Support/undetected_chromedriver

The scraper handles this automatically, but you can:
- Open the URL manually first and accept cookies
- Use a different Google account region

API Server Issues

Issue: Port Already in Use

Error Message:

OSError: [Errno 48] Address already in use

Solution:

Find and kill the process:

# Find process using port 8000
lsof -i :8000

# Kill the process
kill -9 <PID>

Use different port:
```
uvicorn api_server:app --port 8080
```
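Errno 48 is the macOS code for "address already in use"; Linux reports Errno 98 for the same condition. A port can also be probed from Python before starting the server. This is a small sketch, and `port_in_use` is a hypothetical helper rather than part of the API server.

```python
import errno
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # a different problem, e.g. a privileged port
    return False


for candidate in (8000, 8080):
    state = "in use" if port_in_use(candidate) else "free"
    print(f"Port {candidate} is {state}")
```

`errno.EADDRINUSE` resolves to the correct number on each platform, so the check does not depend on the 48 or 98 value.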

Issue: Max Concurrent Jobs Reached

Error Message:

HTTP 429: Maximum concurrent jobs (3) reached

Solution:

Wait for existing jobs to complete

Cancel pending jobs:

curl -X POST "http://localhost:8000/jobs/{job_id}/cancel"

Increase limit in api_server.py (not recommended for stability)

Issue: CORS Errors (Browser)

Error Message:

Access-Control-Allow-Origin header missing

Solution: CORS is enabled by default. If issues persist:

Check allowed origins in api_server.py

For development, ensure middleware is configured:

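from fastapi.middleware.cors import CORSMiddleware  # assumes api_server.py is a FastAPI app

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # development only; list explicit origins in production
    allow_methods=["*"],
    allow_headers=["*"],
)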

Image Download Issues

Issue: Images Not Downloading

Cause: Network issues or Google blocking requests.

Solution:

Check network connectivity
Verify image URLs are accessible

Reduce parallel downloads:

download_threads: 2  # Reduce from default 4

Check disk space for image storage

Issue: Images Corrupted or Wrong Size

Cause: Partial downloads or URL issues.

Solution:

Clear image directory and re-run:
```
rm -rf review_images/
```
Check max dimensions in config:
```
max_width: 1200
max_height: 1200
```
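Before deleting the whole directory, it can be useful to see which files look truncated. The sketch below uses a simple heuristic based on the JPEG and PNG file markers; it is not the scraper's own validation, `find_suspect_images` is a hypothetical helper, and valid files in other formats will also be reported.

```python
from pathlib import Path

JPEG_START, JPEG_END = b"\xff\xd8", b"\xff\xd9"
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_complete(path):
    """Heuristic: does the file start and end with the expected markers?"""
    data = Path(path).read_bytes()
    if data.startswith(JPEG_START):
        return data.endswith(JPEG_END)
    if data.startswith(PNG_SIGNATURE):
        # A PNG ends with the IEND chunk: 4-byte length, "IEND", 4-byte CRC.
        return data[-8:-4] == b"IEND"
    return False  # unknown format or empty file


def find_suspect_images(directory):
    """Return names of files in directory that fail the marker check."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir()
                  if p.is_file() and not looks_complete(p))


print(find_suspect_images("review_images"))
```

Files on the returned list are candidates for deletion and re-download; a partial download usually lacks the end marker.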

Issue: Permission Denied Writing Images

Error Message:

PermissionError: [Errno 13] Permission denied

Solution:

Check directory permissions:
```
chmod 755 review_images/
```
Use different directory:
```
image_dir: "/path/with/write/access"
```
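Whether a candidate directory is writable can be confirmed before pointing `image_dir` at it. The following sketch tries to create a temporary file there; `can_write_to` is a hypothetical helper, not part of the scraper.

```python
import tempfile
from pathlib import Path


def can_write_to(directory):
    """Return True if a temporary file can be created inside directory."""
    path = Path(directory)
    if not path.is_dir():
        return False
    try:
        # The file is removed automatically when the block exits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:  # includes PermissionError
        return False


print("review_images writable:", can_write_to("review_images"))
```

Creating a real file is a more direct test than inspecting permission bits, because it also catches read-only mounts.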

Configuration Issues

Issue: Config File Not Found

Error Message:

FileNotFoundError: config.yaml not found

Solution:

Create config.yaml from example:

cp examples/config-example.txt config.yaml

Specify custom path:

python start.py --config /path/to/config.yaml

Issue: Invalid YAML Syntax

Error Message:

yaml.scanner.ScannerError: mapping values are not allowed here

Solution:

Validate YAML syntax using online validator
Check indentation - use spaces, not tabs

Quote strings that contain special characters such as : or ?:

url: "https://example.com?param=value"  # Use quotes

Issue: Invalid Configuration Values

Error Message:

ValueError: Invalid sort_by value

Solution:

Check allowed values:
- sort_by: newest, highest, lowest, relevance
- headless: true, false

Verify types:

download_threads: 4      # Integer, not string
headless: true           # Boolean, not string "true"
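The checks above can be applied to a configuration that has already been loaded into a dictionary. This is a sketch under assumptions: `validate_config` is a hypothetical helper, and it assumes `sort_by`, `headless` and `download_threads` are top-level keys, which may not match the real layout of config.yaml.

```python
ALLOWED_SORT = {"newest", "highest", "lowest", "relevance"}


def validate_config(config):
    """Return a list of human-readable problems; empty if none were found."""
    errors = []

    sort_by = config.get("sort_by")
    if sort_by is not None and sort_by not in ALLOWED_SORT:
        errors.append(f"sort_by must be one of {sorted(ALLOWED_SORT)}, got {sort_by!r}")

    headless = config.get("headless")
    if headless is not None and not isinstance(headless, bool):
        errors.append(f"headless must be true or false, got {headless!r}")

    threads = config.get("download_threads")
    # bool is a subclass of int in Python, so reject it explicitly.
    if threads is not None and (isinstance(threads, bool) or not isinstance(threads, int)):
        errors.append(f"download_threads must be an integer, got {threads!r}")

    return errors


# Quoted values in YAML load as strings, which is what this example imitates.
for problem in validate_config({"sort_by": "oldest", "headless": "true", "download_threads": "4"}):
    print(problem)
```

Each message names the offending key, which is easier to act on than a bare `ValueError`.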

Performance Issues

Issue: Scraping Too Slow

Solution:

Use headless mode:
```
python start.py --headless
```
Reduce image download threads if network is slow:
```
download_threads: 2
```
Disable image downloading for faster scraping:
```
download_images: false
```
Use SSD for faster JSON/image writes

Issue: High Memory Usage

Solution:

Process in batches - use stop_on_match for incremental scraping
Disable image downloading temporarily
Close other applications
Increase system swap if needed

Issue: Chrome Using Too Much CPU

Solution:

Use headless mode - reduces rendering overhead

Add flags that disable GPU rendering:

--disable-gpu
--disable-software-rasterizer

Limit concurrent jobs in API mode

Python & Dependencies Issues

Issue: Module Not Found

Error Message:

ModuleNotFoundError: No module named 'seleniumbase'

Solution:

Install dependencies:
```
pip install -r requirements.txt
```

Verify virtual environment is activated:

source venv/bin/activate  # Linux/macOS
venv\Scripts\activate     # Windows
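If the error persists after installing, it helps to confirm which interpreter is running and which modules it can see. The sketch below does that with the standard library only; `missing_modules` is a hypothetical helper, and the module list is an assumption based on the libraries mentioned in this guide (the authoritative list is requirements.txt).

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names, not package names: PyYAML is imported as "yaml".
ASSUMED_MODULES = ["seleniumbase", "pymongo", "boto3", "yaml"]

print("Interpreter:", sys.executable)
print("Missing:", missing_modules(ASSUMED_MODULES))
```

If the interpreter path does not point inside the virtual environment, the environment is not activated in the current shell.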

Issue: Incompatible Package Versions

Error Message:

ImportError: cannot import name 'X' from 'Y'

Solution:

Reinstall all dependencies:

pip uninstall -r requirements.txt -y
pip install -r requirements.txt

Create fresh virtual environment:

python -m venv fresh_venv
source fresh_venv/bin/activate
pip install -r requirements.txt

Issue: Python Version Incompatibility

Error Message:

SyntaxError: invalid syntax

Solution:

Check Python version (requires 3.9+):
```
python --version
```

Install correct Python version:

# macOS with pyenv
pyenv install 3.13.1
pyenv local 3.13.1

# Or use system package manager
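A `SyntaxError` at import time often means the code uses syntax newer than the running interpreter supports. The version requirement stated above can be checked explicitly, as in this minimal sketch; `is_supported` is a hypothetical helper and not part of the scraper.

```python
import sys

MINIMUM = (3, 9)  # requirement stated in this guide


def is_supported(version_info=None, minimum=MINIMUM):
    """Return True if the (major, minor) version meets the minimum."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum


if is_supported():
    print("Python version is supported.")
else:
    print(f"Python {MINIMUM[0]}.{MINIMUM[1]}+ is required, "
          f"found {sys.version_info.major}.{sys.version_info.minor}")
```

Comparing tuples avoids the pitfalls of comparing version strings, where "3.10" would sort before "3.9".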

Getting Help

If your issue isn't listed here:

Enable debug logging:
```
LOG_LEVEL=DEBUG python start.py
```
Check logs for detailed error messages
Search existing issues on GitHub
Create a new issue with:
- Error message (full traceback)
- Python version (python --version)
- OS and version
- Chrome version
- Steps to reproduce
