migrate to SeleniumBase UC Mode for automatic version management

- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev
2025-12-07 19:40:13 +07:00
parent 6b60b02eec
commit 262f0c0be7
7 changed files with 3802 additions and 106 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
--- a/docs/TROUBLESHOOTING.md
+++ b/docs/TROUBLESHOOTING.md
@@ -0,0 +1,708 @@
+# Troubleshooting Guide
+
+This guide covers common issues and their solutions when running Google Reviews Scraper Pro.
+
+---
+
+## Table of Contents
+
+1. [Chrome & ChromeDriver Issues](#chrome--chromedriver-issues)
+2. [MongoDB Issues](#mongodb-issues)
+3. [AWS S3 Issues](#aws-s3-issues)
+4. [Scraping Issues](#scraping-issues)
+5. [API Server Issues](#api-server-issues)
+6. [Image Download Issues](#image-download-issues)
+7. [Configuration Issues](#configuration-issues)
+8. [Performance Issues](#performance-issues)
+9. [Python & Dependencies Issues](#python--dependencies-issues)
+
+---
+
+## Chrome & ChromeDriver Issues
+
+### Issue: ChromeDriver Version Mismatch
+
+**Error Message:**
+```
+SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 143
+Current browser version is 142.0.7444.176
+```
+
+**Cause:** The installed ChromeDriver was built for a different Chrome major version than the installed browser (in the example above, driver 143 versus browser 142).
+
+**Solution:**
+
+With SeleniumBase UC Mode, version mismatches are normally resolved automatically. If the error still appears, work through these steps:
+
+1. **Update Chrome to latest version:**
+   - macOS: Open Chrome → Menu → Help → About Google Chrome
+   - Or run: `open -a "Google Chrome" "chrome://settings/help"`
+
+2. **Upgrade SeleniumBase (if needed):**
+   ```bash
+   pip install --upgrade seleniumbase
+   ```
+
+3. **Run scraper again** - SeleniumBase automatically downloads the matching ChromeDriver.
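
When the error still appears, it can help to confirm which side is out of date. The following is a minimal sketch, not part of the scraper: the helper name `parse_version_mismatch` is hypothetical, and the regular expressions assume the message wording shown in the example error above.

```python
import re


def parse_version_mismatch(message):
    """Return (driver_major, browser_major) parsed from a mismatch error, or None."""
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is (\d+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


# Error text copied from the example in this section.
error_text = (
    "session not created: This version of ChromeDriver only supports Chrome version 143\n"
    "Current browser version is 142.0.7444.176"
)

result = parse_version_mismatch(error_text)
if result:
    driver_major, browser_major = result
    if browser_major < driver_major:
        print("Chrome is older than the driver: update Chrome.")
    else:
        print("The driver is older than Chrome: upgrade SeleniumBase and re-run.")
```

If the browser's major version is lower than the driver's, updating Chrome (step 1) is the relevant fix; otherwise upgrading SeleniumBase (step 2) is the more likely one.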
+
+---
+
+### Issue: ChromeOptions Reuse Error
+
+**Error Message:**
+```
+RuntimeError: you cannot reuse the ChromeOptions object
+```
+
+**Cause:** Internal error when retrying Chrome initialization.
+
+**Solution:** Restart the scraper so that a fresh `ChromeOptions` object is created. If the error persists, clear the ChromeDriver cache and restart again.
+
+---
+
+### Issue: Chrome Binary Not Found
+
+**Error Message:**
+```
+WebDriverException: Message: unknown error: cannot find Chrome binary
+```
+
+**Cause:** Chrome is not installed or not in the expected location.
+
+**Solution:**
+
+1. **Install Chrome:**
+   - Download from: https://www.google.com/chrome/
+
+2. **For custom Chrome location, set environment variable:**
+   ```bash
+   export CHROME_BIN=/path/to/chrome
+   ```
+
+3. **Docker users:** Ensure Chrome is installed in Dockerfile:
+   ```dockerfile
+   RUN apt-get update && apt-get install -y google-chrome-stable
+   ENV CHROME_BIN=/usr/bin/google-chrome
+   ```
+
+---
+
+### Issue: Chrome Crashes in Headless Mode
+
+**Error Message:**
+```
+WebDriverException: Message: chrome not reachable
+```
+
+**Solution:**
+
+1. **Add required flags** (already included in scraper, but verify):
+   ```
+   --no-sandbox
+   --disable-dev-shm-usage
+   --disable-gpu
+   ```
+
+2. **Increase shared memory** (Docker):
+   ```bash
+   docker run --shm-size=2g your-image
+   ```
+
+3. **Try non-headless mode** to debug:
+   ```bash
+   python start.py --headless false
+   ```
+
+---
+
+## MongoDB Issues
+
+### Issue: Connection Timeout
+
+**Error Message:**
+```
+ServerSelectionTimeoutError: connection timed out
+```
+
+**Cause:** MongoDB server unreachable or network issues.
+
+**Solution:**
+
+1. **Verify MongoDB is running:**
+   ```bash
+   # Local MongoDB
+   mongosh --eval "db.adminCommand('ping')"
+
+   # Check service status
+   sudo systemctl status mongod
+   ```
+
+2. **Check connection URI:**
+   ```yaml
+   # config.yaml
+   mongodb:
+     uri: "mongodb://username:password@host:27017/"
+   ```
+
+3. **For MongoDB Atlas:**
+   - Whitelist your IP address in Atlas dashboard
+   - Verify cluster is active
+   - Check network connectivity
+
+4. **Test connection manually:**
+   ```bash
+   python -c "from pymongo import MongoClient; c = MongoClient('your-uri', serverSelectionTimeoutMS=5000); print(c.server_info())"
+   ```
+
+---
+
+### Issue: Authentication Failed
+
+**Error Message:**
+```
+OperationFailure: Authentication failed
+```
+
+**Solution:**
+
+1. **Verify credentials** in connection URI
+2. **Check database name** matches the authentication database
+3. **Use correct URI format:**
+   ```
+   mongodb://username:password@host:27017/database?authSource=admin
+   ```
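
One further cause worth ruling out: MongoDB connection strings require the username and password to be percent-encoded when they contain characters such as `@`, `:` or `/`. The sketch below builds a URI in the format shown above using only the standard library. The function name `build_mongo_uri` and the sample credentials are illustrative assumptions, not part of the scraper.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, database, auth_source="admin", port=27017):
    """Build a MongoDB URI with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb://{user}:{pwd}@{host}:{port}/{database}?authSource={auth_source}"


# Hypothetical credentials; the password contains '@', ':' and '/'.
print(build_mongo_uri("scraper", "p@ss:word/1", "localhost", "reviews"))
```

The resulting string can be pasted into the `uri` field of `config.yaml`.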
+
+---
+
+### Issue: SSL Certificate Error
+
+**Error Message:**
+```
+SSL: CERTIFICATE_VERIFY_FAILED
+```
+
+**Solution:**
+
+1. **For macOS**, run:
+   ```bash
+   /Applications/Python\ 3.x/Install\ Certificates.command
+   ```
+
+2. **Or install certifi:**
+   ```bash
+   pip install --upgrade certifi
+   ```
+
+3. **The scraper auto-handles this**, but if issues persist:
+   ```python
+   import certifi
+   import os
+   os.environ['SSL_CERT_FILE'] = certifi.where()
+   ```
+
+---
+
+## AWS S3 Issues
+
+### Issue: Access Denied
+
+**Error Message:**
+```
+ClientError: An error occurred (AccessDenied) when calling the PutObject operation
+```
+
+**Solution:**
+
+1. **Verify AWS credentials:**
+   ```yaml
+   # config.yaml
+   s3:
+     aws_access_key_id: "YOUR_ACCESS_KEY"
+     aws_secret_access_key: "YOUR_SECRET_KEY"
+   ```
+
+2. **Check IAM permissions** - required policy:
+   ```json
+   {
+     "Version": "2012-10-17",
+     "Statement": [
+       {
+         "Effect": "Allow",
+         "Action": [
+           "s3:PutObject",
+           "s3:GetObject",
+           "s3:ListBucket",
+           "s3:PutObjectAcl"
+         ],
+         "Resource": [
+           "arn:aws:s3:::your-bucket-name",
+           "arn:aws:s3:::your-bucket-name/*"
+         ]
+       }
+     ]
+   }
+   ```
+
+3. **Check bucket policy** allows public-read if using public URLs
+
+---
+
+### Issue: Bucket Not Found
+
+**Error Message:**
+```
+ClientError: An error occurred (NoSuchBucket)
+```
+
+**Solution:**
+
+1. **Verify bucket name** in config.yaml
+2. **Check region** matches bucket location:
+   ```yaml
+   s3:
+     region_name: "us-east-1"  # Must match bucket region
+     bucket_name: "your-bucket"
+   ```
+
+3. **Create bucket** if it doesn't exist via AWS Console or CLI
+
+---
+
+### Issue: Invalid Credentials
+
+**Error Message:**
+```
+NoCredentialsError: Unable to locate credentials
+```
+
+**Solution:**
+
+1. **Set credentials in config.yaml** or environment variables:
+   ```bash
+   export AWS_ACCESS_KEY_ID=your_key
+   export AWS_SECRET_ACCESS_KEY=your_secret
+   ```
+
+2. **Or use AWS credentials file:**
+   ```
+   # ~/.aws/credentials
+   [default]
+   aws_access_key_id = YOUR_KEY
+   aws_secret_access_key = YOUR_SECRET
+   ```
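
To see which of the two locations above actually provides credentials, a rough check along these lines may help. It is a sketch under stated assumptions: `find_aws_credentials` is a hypothetical helper, it inspects only the environment variables and the shared credentials file, and it does not reproduce the full lookup chain used by the AWS SDK or the values in `config.yaml`.

```python
import configparser
import os
from pathlib import Path


def find_aws_credentials(env=None, credentials_path=None, profile="default"):
    """Return 'environment', 'credentials file', or None if neither source is set."""
    env = os.environ if env is None else env
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"

    path = Path(credentials_path) if credentials_path else Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently ignored
    if parser.has_section(profile):
        section = parser[profile]
        if section.get("aws_access_key_id") and section.get("aws_secret_access_key"):
            return "credentials file"
    return None


print(find_aws_credentials() or "no credentials found")
```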
+
+---
+
+## Scraping Issues
+
+### Issue: Reviews Tab Not Found
+
+**Error Message:**
+```
+TimeoutException: Reviews tab not found or could not be clicked
+```
+
+**Cause:** Google Maps UI changed or page didn't load properly.
+
+**Solution:**
+
+1. **Try non-headless mode** to see what's happening:
+   ```bash
+   python start.py --headless false
+   ```
+
+2. **Check the URL** is a valid Google Maps place URL
+
+3. **Increase timeout** - network may be slow
+
+4. **Clear cookies/cache** - Google may be showing consent dialogs
+
+5. **Try different sort order:**
+   ```bash
+   python start.py --sort relevance
+   ```
+
+---
+
+### Issue: No Reviews Found
+
+**Error Message:**
+```
+WARNING: No review cards found in this iteration
+```
+
+**Cause:** Page structure changed or place has no reviews.
+
+**Solution:**
+
+1. **Verify the place has reviews** by opening URL in browser
+2. **Check if page requires login** for reviews
+3. **Wait longer** for page to load - add delay in config
+4. **Check for CAPTCHA** - may need to solve manually first
+
+---
+
+### Issue: Stale Element Reference
+
+**Error Message:**
+```
+StaleElementReferenceException: stale element reference: element is not attached to the page document
+```
+
+**Cause:** Page updated while scraping.
+
+**Solution:** The scraper handles this automatically. If the error persists:
+
+1. **Reduce scroll speed** - increase sleep time
+2. **Run in non-headless mode** to observe behavior
+3. **Restart scraper** - temporary DOM issue
+
+---
+
+### Issue: Cookie Consent Blocking
+
+**Cause:** Cookie dialog not being dismissed.
+
+**Solution:**
+
+1. **Clear browser data:**
+   ```bash
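+   # macOS data directory used by undetected-chromedriver (installs that predate the SeleniumBase migration)
+   rm -rf ~/Library/Application\ Support/undetected_chromedriver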
+   ```
+
+2. **The scraper handles this automatically**, but you can:
+   - Open the URL manually first and accept cookies
+   - Use a different Google account region
+
+---
+
+## API Server Issues
+
+### Issue: Port Already in Use
+
+**Error Message:**
+```
+OSError: [Errno 48] Address already in use
+```
+
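+**Cause:** Another process is already listening on the port. Errno 48 is the macOS code; Linux reports the same condition as Errno 98.
+
+**Solution:**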
+
+1. **Find and kill the process:**
+   ```bash
+   # Find process using port 8000
+   lsof -i :8000
+
+   # Kill the process
+   kill -9 <PID>
+   ```
+
+2. **Use different port:**
+   ```bash
+   uvicorn api_server:app --port 8080
+   ```
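
Before picking an alternative port, it can be checked from Python whether it is free. This is a minimal sketch using only the standard library. The helper names and the candidate port list are illustrative, and the check assumes the server binds to `127.0.0.1`.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:  # typically "Address already in use"
            return False
        return True


def first_free_port(candidates, host="127.0.0.1"):
    """Return the first free port from candidates, or None if all are taken."""
    for port in candidates:
        if is_port_free(port, host):
            return port
    return None


print(first_free_port([8000, 8080, 8888]))
```

The port returned can then be passed to `uvicorn` with `--port`. Another process may still claim it between the check and the server start, so treat the result as a hint.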
+
+---
+
+### Issue: Max Concurrent Jobs Reached
+
+**Error Message:**
+```
+HTTP 429: Maximum concurrent jobs (3) reached
+```
+
+**Solution:**
+
+1. **Wait for existing jobs** to complete
+2. **Cancel pending jobs:**
+   ```bash
+   curl -X POST "http://localhost:8000/jobs/{job_id}/cancel"
+   ```
+3. **Increase limit** in `api_server.py` (not recommended for stability)
+
+---
+
+### Issue: CORS Errors (Browser)
+
+**Error Message:**
+```
+Access-Control-Allow-Origin header missing
+```
+
+**Solution:** CORS is enabled by default. If issues persist:
+
+1. **Check allowed origins** in `api_server.py`
+2. **For development**, ensure middleware is configured:
+   ```python
+   app.add_middleware(
+       CORSMiddleware,
+       allow_origins=["*"],
+       allow_methods=["*"],
+       allow_headers=["*"],
+   )
+   ```
+
+---
+
+## Image Download Issues
+
+### Issue: Images Not Downloading
+
+**Cause:** Network issues or Google blocking requests.
+
+**Solution:**
+
+1. **Check network connectivity**
+2. **Verify image URLs** are accessible
+3. **Reduce parallel downloads:**
+   ```yaml
+   download_threads: 2  # Reduce from default 4
+   ```
+
+4. **Check disk space** for image storage
+
+---
+
+### Issue: Images Corrupted or Wrong Size
+
+**Cause:** Partial downloads or URL issues.
+
+**Solution:**
+
+1. **Clear image directory** and re-run:
+   ```bash
+   rm -rf review_images/
+   ```
+
+2. **Check max dimensions** in config:
+   ```yaml
+   max_width: 1200
+   max_height: 1200
+   ```
+
+---
+
+### Issue: Permission Denied Writing Images
+
+**Error Message:**
+```
+PermissionError: [Errno 13] Permission denied
+```
+
+**Solution:**
+
+1. **Check directory permissions:**
+   ```bash
+   chmod 755 review_images/
+   ```
+
+2. **Use different directory:**
+   ```yaml
+   image_dir: "/path/with/write/access"
+   ```
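
A candidate `image_dir` can be checked for write access before starting a long scrape. The sketch below is an illustration only: `can_write_to` is a hypothetical helper, and the test it performs is simply creating and removing a temporary file in the directory.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if a temporary file can be created inside directory."""
    if not os.path.isdir(directory):
        return False
    try:
        # The file is deleted automatically when the context exits.
        with tempfile.TemporaryFile(dir=directory):
            pass
    except OSError:
        return False
    return True


print(can_write_to("review_images"))
```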
+
+---
+
+## Configuration Issues
+
+### Issue: Config File Not Found
+
+**Error Message:**
+```
+FileNotFoundError: config.yaml not found
+```
+
+**Solution:**
+
+1. **Create config.yaml** from example:
+   ```bash
+   cp examples/config-example.txt config.yaml
+   ```
+
+2. **Specify custom path:**
+   ```bash
+   python start.py --config /path/to/config.yaml
+   ```
+
+---
+
+### Issue: Invalid YAML Syntax
+
+**Error Message:**
+```
+yaml.scanner.ScannerError: mapping values are not allowed here
+```
+
+**Solution:**
+
+1. **Validate YAML syntax** using online validator
+2. **Check indentation** - use spaces, not tabs
+3. **Escape special characters** in strings:
+   ```yaml
+   url: "https://example.com?param=value"  # Use quotes
+   ```
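
Tab indentation is hard to spot by eye. A small standard-library sketch such as the following can list the offending lines. The function name and the sample text are illustrative, and it checks only for tabs in leading whitespace. It is not a full YAML validator.

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad_lines = []
    for number, line in enumerate(text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad_lines.append(number)
    return bad_lines


# Hypothetical config text: line 2 is indented with a tab.
sample = 'mongodb:\n\turi: "mongodb://localhost:27017/"\ns3:\n  bucket_name: "your-bucket"\n'
print(find_tab_indented_lines(sample))
```

To check a real file, pass the result of reading `config.yaml` as text instead of `sample`.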
+
+---
+
+### Issue: Invalid Configuration Values
+
+**Error Message:**
+```
+ValueError: Invalid sort_by value
+```
+
+**Solution:**
+
+1. **Check allowed values:**
+   - `sort_by`: newest, highest, lowest, relevance
+   - `headless`: true, false
+
+2. **Verify types:**
+   ```yaml
+   download_threads: 4      # Integer, not string
+   headless: true           # Boolean, not string "true"
+   ```
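
The rules above can be expressed as a small pre-flight check. This is a sketch under assumptions: the scraper's own validation is not shown in this guide, so `validate_config` is a hypothetical helper that covers only the three keys listed here and expects the already-parsed configuration as a dictionary.

```python
ALLOWED_SORT = {"newest", "highest", "lowest", "relevance"}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []

    sort_by = config.get("sort_by")
    if sort_by is not None and sort_by not in ALLOWED_SORT:
        problems.append(f"sort_by must be one of {sorted(ALLOWED_SORT)}, got {sort_by!r}")

    headless = config.get("headless")
    if headless is not None and not isinstance(headless, bool):
        problems.append(f"headless must be a boolean, got {type(headless).__name__}")

    threads = config.get("download_threads")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if threads is not None and (isinstance(threads, bool) or not isinstance(threads, int)):
        problems.append(f"download_threads must be an integer, got {type(threads).__name__}")

    return problems


for problem in validate_config({"sort_by": "oldest", "headless": "true", "download_threads": "4"}):
    print(problem)
```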
+
+---
+
+## Performance Issues
+
+### Issue: Scraping Too Slow
+
+**Solution:**
+
+1. **Use headless mode:**
+   ```bash
+   python start.py --headless
+   ```
+
+2. **Reduce image download threads** if network is slow:
+   ```yaml
+   download_threads: 2
+   ```
+
+3. **Disable image downloading** for faster scraping:
+   ```yaml
+   download_images: false
+   ```
+
+4. **Use SSD** for faster JSON/image writes
+
+---
+
+### Issue: High Memory Usage
+
+**Solution:**
+
+1. **Process in batches** - use `stop_on_match` for incremental scraping
+2. **Disable image downloading** temporarily
+3. **Close other applications**
+4. **Increase system swap** if needed
+
+---
+
+### Issue: Chrome Using Too Much CPU
+
+**Solution:**
+
+1. **Use headless mode** - reduces rendering overhead
+2. **Add GPU flags:**
+   ```
+   --disable-gpu
+   --disable-software-rasterizer
+   ```
+3. **Limit concurrent jobs** in API mode
+
+---
+
+## Python & Dependencies Issues
+
+### Issue: Module Not Found
+
+**Error Message:**
+```
+ModuleNotFoundError: No module named 'seleniumbase'
+```
+
+**Solution:**
+
+1. **Install dependencies:**
+   ```bash
+   pip install -r requirements.txt
+   ```
+
+2. **Verify virtual environment is activated:**
+   ```bash
+   source venv/bin/activate  # Linux/macOS
+   venv\Scripts\activate     # Windows
+   ```
+
+---
+
+### Issue: Incompatible Package Versions
+
+**Error Message:**
+```
+ImportError: cannot import name 'X' from 'Y'
+```
+
+**Solution:**
+
+1. **Reinstall all dependencies:**
+   ```bash
+   pip uninstall -r requirements.txt -y
+   pip install -r requirements.txt
+   ```
+
+2. **Create fresh virtual environment:**
+   ```bash
+   python -m venv fresh_venv
+   source fresh_venv/bin/activate
+   pip install -r requirements.txt
+   ```
+
+---
+
+### Issue: Python Version Incompatibility
+
+**Error Message:**
+```
+SyntaxError: invalid syntax
+```
+
+**Solution:**
+
+1. **Check Python version** (requires 3.9+):
+   ```bash
+   python --version
+   ```
+
+2. **Install correct Python version:**
+   ```bash
+   # macOS with pyenv
+   pyenv install 3.13.1
+   pyenv local 3.13.1
+
+   # Or use system package manager
+   ```
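
The version requirement can also be checked from inside Python, which is useful when several interpreters are installed. A minimal sketch follows, with `meets_minimum` as a hypothetical helper. A `SyntaxError` is raised while a file is being compiled, so a check like this only helps if it lives in a file that itself avoids newer syntax.

```python
import sys

MINIMUM_VERSION = (3, 9)  # minimum stated in this guide


def meets_minimum(version=None, minimum=MINIMUM_VERSION):
    """Compare the (major, minor) part of a version tuple against the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= tuple(minimum)


if not meets_minimum():
    print("Python {}.{} or newer is required, found {}.{}".format(
        MINIMUM_VERSION[0], MINIMUM_VERSION[1],
        sys.version_info[0], sys.version_info[1]))
```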
+
+---
+
+## Getting Help
+
+If your issue isn't listed here:
+
+1. **Enable debug logging:**
+   ```bash
+   LOG_LEVEL=DEBUG python start.py
+   ```
+
+2. **Check logs** for detailed error messages
+
+3. **Search existing issues** on GitHub
+
+4. **Create a new issue** with:
+   - Error message (full traceback)
+   - Python version (`python --version`)
+   - OS and version
+   - Chrome version
+   - Steps to reproduce