migrate to SeleniumBase UC Mode for automatic version management

- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev
George Khananaev
2025-12-07 19:40:13 +07:00
parent 6b60b02eec
commit 262f0c0be7
7 changed files with 3802 additions and 106 deletions

8
.gitignore vendored

@@ -11,6 +11,7 @@ Desktop.ini
 # -----------------------------------------------------------
 .idea/
 .vscode/
+.claude/
 *.swp
 *.swo
 *~
@@ -48,6 +49,7 @@ logs.db
 *.sqlite
 *.sqlite3
 *.db
+docs/AGENTS_LOG
 # -----------------------------------------------------------
 # Config Files
@@ -68,6 +70,12 @@ review_images/
 images/
 downloaded_images/
+# -----------------------------------------------------------
+# SeleniumBase Files
+# -----------------------------------------------------------
+downloaded_files/
+*.lock
 # -----------------------------------------------------------
 # Temporary and Output Files
 # -----------------------------------------------------------


@@ -1,16 +1,16 @@
 # 🔥 Google Reviews Scraper Pro (2025) 🔥
-![Google Reviews Scraper Pro](https://img.shields.io/badge/Version-1.0.0-brightgreen)
+![Google Reviews Scraper Pro](https://img.shields.io/badge/Version-1.0.1-brightgreen)
 ![Python](https://img.shields.io/badge/Python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue)
 ![License](https://img.shields.io/badge/License-MIT-yellow)
-![Last Update](https://img.shields.io/badge/Last%20Updated-April%202025-red)
+![Last Update](https://img.shields.io/badge/Last%20Updated-December%202025-red)
 **FINALLY! A scraper that ACTUALLY WORKS in 2025!** While others break with every Google update, this bad boy keeps on trucking. Say goodbye to the frustration of constantly broken scrapers and hello to a beast that rips through Google's defenses like a hot knife through butter. This battle-tested, rock-solid solution will extract every juicy detail from Google reviews while laughing in the face of rate limiting.
 ## 🌟 Feature Artillery
 - **Bulletproof in 2025**: While the competition falls apart, we've cracked Google's latest tricks
-- **Ninja-Mode Selenium**: Our undetected-chromedriver flies under the radar where others get insta-blocked
+- **Enhanced SeleniumBase UC Mode**: Superior anti-detection with automatic Chrome/ChromeDriver version matching - no more version headaches!
 - **Polyglot Powerhouse**: Devours reviews in a smorgasbord of languages - English, Hebrew, Thai, German, you name it!
 - **MongoDB Mastery**: Dumps pristine data structures straight into your MongoDB instance
 - **Paranoid Backups**: Mirrors everything to local JSON files because losing data sucks
@@ -350,9 +350,10 @@ print(f"Reviews with images: {len(reviews_with_images)}")
 ### DEFCON Scenarios & Quick Fixes
 1. **Chrome/Driver Having a Lovers' Quarrel**
-- Update your damn Chrome browser already! It's 2025, people
-- Nuke and reinstall the driver: `pip uninstall undetected-chromedriver` then `pip install undetected-chromedriver==3.5.4`
-- If you're on Ubuntu, sometimes a simple `apt update && apt upgrade` fixes weird Chrome issues
+- **Good news!** SeleniumBase handles Chrome/ChromeDriver version matching automatically
+- Update Chrome browser: Go to chrome://settings/help
+- SeleniumBase will automatically download the matching ChromeDriver - no manual intervention needed!
+- If issues persist: `pip install --upgrade seleniumbase`
 2. **MongoDB Throwing a Tantrum**
 - Double-check your connection string - typos are the #1 culprit

2760
docs/ARCHITECTURE.md Normal file

File diff suppressed because it is too large

708
docs/TROUBLESHOOTING.md Normal file

@@ -0,0 +1,708 @@
# Troubleshooting Guide
This guide covers common issues and their solutions when running Google Reviews Scraper Pro.
---
## Table of Contents
1. [Chrome & ChromeDriver Issues](#chrome--chromedriver-issues)
2. [MongoDB Issues](#mongodb-issues)
3. [AWS S3 Issues](#aws-s3-issues)
4. [Scraping Issues](#scraping-issues)
5. [API Server Issues](#api-server-issues)
6. [Image Download Issues](#image-download-issues)
7. [Configuration Issues](#configuration-issues)
8. [Performance Issues](#performance-issues)
9. [Python & Dependencies Issues](#python--dependencies-issues)
---
## Chrome & ChromeDriver Issues
### Issue: ChromeDriver Version Mismatch
**Error Message:**
```
SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 143
Current browser version is 142.0.7444.176
```
**Cause:** Chrome/ChromeDriver version mismatch (this issue is now automatically handled by SeleniumBase).
**Solution:**
**Good News:** With SeleniumBase UC Mode, version mismatches are automatically resolved!
1. **Update Chrome to latest version:**
- macOS: Open Chrome → Menu → Help → About Google Chrome
- Or run: `open -a "Google Chrome" "chrome://settings/help"`
2. **Upgrade SeleniumBase (if needed):**
```bash
pip install --upgrade seleniumbase
```
3. **Run scraper again** - SeleniumBase automatically downloads the matching ChromeDriver.
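If the mismatch still shows up, it can help to confirm which two versions are in conflict before upgrading anything. The following is a minimal sketch, assuming the error text has the same shape as the message shown above; `parse_version_mismatch` is a hypothetical helper and not part of the scraper.

```python
import re


def parse_version_mismatch(message: str):
    """Return (driver_major, browser_major) parsed from the error text, or None."""
    driver = re.search(r"only supports Chrome version (\d+)", message)
    browser = re.search(r"Current browser version is (\d+)", message)
    if not (driver and browser):
        return None
    return int(driver.group(1)), int(browser.group(1))


# Sample text copied from the error message in this section
message = (
    "session not created: This version of ChromeDriver only supports "
    "Chrome version 143\nCurrent browser version is 142.0.7444.176"
)

versions = parse_version_mismatch(message)
if versions and versions[0] != versions[1]:
    print(f"ChromeDriver expects Chrome {versions[0]}, but Chrome {versions[1]} is installed")
```

A driver major version higher than the browser's usually means Chrome itself needs updating, which matches step 1 above.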
---
### Issue: ChromeOptions Reuse Error
**Error Message:**
```
RuntimeError: you cannot reuse the ChromeOptions object
```
**Cause:** Internal error when retrying Chrome initialization.
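**Solution:** Restart the scraper so that a fresh driver and options object are created. This error is raised by undetected-chromedriver, so it should not appear once the scraper runs on SeleniumBase.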
---
### Issue: Chrome Binary Not Found
**Error Message:**
```
WebDriverException: Message: unknown error: cannot find Chrome binary
```
**Cause:** Chrome is not installed or not in the expected location.
**Solution:**
1. **Install Chrome:**
- Download from: https://www.google.com/chrome/
2. **For custom Chrome location, set environment variable:**
```bash
export CHROME_BIN=/path/to/chrome
```
3. **Docker users:** Ensure Chrome is installed in Dockerfile:
```dockerfile
RUN apt-get update && apt-get install -y google-chrome-stable
ENV CHROME_BIN=/usr/bin/google-chrome
```
---
### Issue: Chrome Crashes in Headless Mode
**Error Message:**
```
WebDriverException: Message: chrome not reachable
```
**Solution:**
1. **Add required flags** (already included in scraper, but verify):
```
--no-sandbox
--disable-dev-shm-usage
--disable-gpu
```
2. **Increase shared memory** (Docker):
```bash
docker run --shm-size=2g your-image
```
3. **Try non-headless mode** to debug:
```bash
python start.py --headless false
```
---
## MongoDB Issues
### Issue: Connection Timeout
**Error Message:**
```
ServerSelectionTimeoutError: connection timed out
```
**Cause:** MongoDB server unreachable or network issues.
**Solution:**
1. **Verify MongoDB is running:**
```bash
# Local MongoDB
mongosh --eval "db.adminCommand('ping')"
# Check service status
sudo systemctl status mongod
```
2. **Check connection URI:**
```yaml
# config.yaml
mongodb:
uri: "mongodb://username:password@host:27017/"
```
3. **For MongoDB Atlas:**
- Whitelist your IP address in Atlas dashboard
- Verify cluster is active
- Check network connectivity
4. **Test connection manually:**
```bash
python -c "from pymongo import MongoClient; c = MongoClient('your-uri', serverSelectionTimeoutMS=5000); print(c.server_info())"
```
---
### Issue: Authentication Failed
**Error Message:**
```
OperationFailure: Authentication failed
```
**Solution:**
1. **Verify credentials** in connection URI
2. **Check database name** matches the authentication database
3. **Use correct URI format:**
```
mongodb://username:password@host:27017/database?authSource=admin
```
---
### Issue: SSL Certificate Error
**Error Message:**
```
SSL: CERTIFICATE_VERIFY_FAILED
```
**Solution:**
1. **For macOS**, run:
```bash
/Applications/Python\ 3.x/Install\ Certificates.command
```
2. **Or install certifi:**
```bash
pip install --upgrade certifi
```
3. **The scraper auto-handles this**, but if issues persist:
```python
import certifi
import os
os.environ['SSL_CERT_FILE'] = certifi.where()
```
---
## AWS S3 Issues
### Issue: Access Denied
**Error Message:**
```
ClientError: An error occurred (AccessDenied) when calling the PutObject operation
```
**Solution:**
1. **Verify AWS credentials:**
```yaml
# config.yaml
s3:
aws_access_key_id: "YOUR_ACCESS_KEY"
aws_secret_access_key: "YOUR_SECRET_KEY"
```
2. **Check IAM permissions** - required policy:
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObjectAcl"
],
"Resource": [
"arn:aws:s3:::your-bucket-name",
"arn:aws:s3:::your-bucket-name/*"
]
}
]
}
```
3. **Check bucket policy** allows public-read if using public URLs
---
### Issue: Bucket Not Found
**Error Message:**
```
ClientError: An error occurred (NoSuchBucket)
```
**Solution:**
1. **Verify bucket name** in config.yaml
2. **Check region** matches bucket location:
```yaml
s3:
region_name: "us-east-1" # Must match bucket region
bucket_name: "your-bucket"
```
3. **Create bucket** if it doesn't exist via AWS Console or CLI
---
### Issue: Invalid Credentials
**Error Message:**
```
NoCredentialsError: Unable to locate credentials
```
**Solution:**
1. **Set credentials in config.yaml** or environment variables:
```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
```
2. **Or use AWS credentials file:**
```
~/.aws/credentials
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
```
---
## Scraping Issues
### Issue: Reviews Tab Not Found
**Error Message:**
```
TimeoutException: Reviews tab not found or could not be clicked
```
**Cause:** Google Maps UI changed or page didn't load properly.
**Solution:**
1. **Try non-headless mode** to see what's happening:
```bash
python start.py --headless false
```
2. **Check the URL** is a valid Google Maps place URL
3. **Increase timeout** - network may be slow
4. **Clear cookies/cache** - Google may be showing consent dialogs
5. **Try different sort order:**
```bash
python start.py --sort relevance
```
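To check step 2 quickly, the URL can be tested for a `/place/` segment and rewritten to point directly at the reviews page. This is a sketch that mirrors the URL rewriting visible in the scraper's navigation fallback; `build_reviews_url` is a hypothetical name, and the sample URL is made up for illustration.

```python
from typing import Optional


def build_reviews_url(current_url: str, lang_code: Optional[str] = None) -> Optional[str]:
    """Return a direct reviews URL for a Google Maps place URL, or None if it is not one."""
    if "/place/" not in current_url:
        return None
    base, rest = current_url.split("/place/", 1)
    place = rest.split("/")[0]  # keep only the place name segment
    if not place:
        return None
    url = f"{base}/place/{place}/reviews"
    if lang_code:
        url += f"?hl={lang_code}"
    return url


# Hypothetical example URL
print(build_reviews_url("https://www.google.com/maps/place/Some+Cafe/@1.0,2.0,17z", "en"))
```

A return value of `None` suggests the URL is a search or directions link rather than a place page.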
---
### Issue: No Reviews Found
**Error Message:**
```
WARNING: No review cards found in this iteration
```
**Cause:** Page structure changed or place has no reviews.
**Solution:**
1. **Verify the place has reviews** by opening URL in browser
2. **Check if page requires login** for reviews
3. **Wait longer** for page to load - add delay in config
4. **Check for CAPTCHA** - may need to solve manually first
---
### Issue: Stale Element Reference
**Error Message:**
```
StaleElementReferenceException: stale element reference: element is not attached to the page document
```
**Cause:** Page updated while scraping.
**Solution:** This is handled automatically by the scraper. If persistent:
1. **Reduce scroll speed** - increase sleep time
2. **Run in non-headless mode** to observe behavior
3. **Restart scraper** - temporary DOM issue
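For custom code built around the scraper, a small retry wrapper is one common way to cope with stale elements: re-run the lookup a few times before giving up. The sketch below is generic and not the scraper's own implementation; `retry_on` and `FakeStale` are hypothetical names, and `FakeStale` stands in for Selenium's `StaleElementReferenceException` so the example runs without a browser.

```python
import time


def retry_on(exc_type, action, attempts=3, delay=0.5):
    """Call action() until it succeeds or attempts run out, re-raising the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise
            time.sleep(delay)  # give the page time to settle before retrying


class FakeStale(Exception):
    """Stand-in for selenium's StaleElementReferenceException."""


calls = {"n": 0}


def read_text():
    # Fails twice, then succeeds, to imitate a DOM that re-renders
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeStale()
    return "review text"


print(retry_on(FakeStale, read_text, delay=0))
```

In real use, `action` should look the element up again on each call rather than reuse a stored reference.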
---
### Issue: Cookie Consent Blocking
**Cause:** Cookie dialog not being dismissed.
**Solution:**
1. **Clear browser data:**
```bash
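# macOS path of the legacy undetected-chromedriver data directory;
# it only exists if a version older than 1.0.1 was run on this machine
rm -rf ~/Library/Application\ Support/undetected_chromedriver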
```
2. **The scraper handles this automatically**, but you can:
- Open the URL manually first and accept cookies
- Use a different Google account region
---
## API Server Issues
### Issue: Port Already in Use
**Error Message:**
```
OSError: [Errno 48] Address already in use
```
**Solution:**
1. **Find and kill the process:**
```bash
# Find process using port 8000
lsof -i :8000
# Kill the process
kill -9 <PID>
```
2. **Use different port:**
```bash
uvicorn api_server:app --port 8080
```
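The port check can also be done from Python, which is handy on systems without `lsof`. This is a minimal sketch using only the standard library; `is_port_in_use` is a hypothetical helper, and port 8000 is taken from the example above.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return s.connect_ex((host, port)) == 0


if is_port_in_use(8000):
    print("Port 8000 is busy; stop the other process or pick another port")
else:
    print("Port 8000 is free")
```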
---
### Issue: Max Concurrent Jobs Reached
**Error Message:**
```
HTTP 429: Maximum concurrent jobs (3) reached
```
**Solution:**
1. **Wait for existing jobs** to complete
2. **Cancel pending jobs:**
```bash
curl -X POST "http://localhost:8000/jobs/{job_id}/cancel"
```
3. **Increase limit** in `api_server.py` (not recommended for stability)
---
### Issue: CORS Errors (Browser)
**Error Message:**
```
Access-Control-Allow-Origin header missing
```
**Solution:** CORS is enabled by default. If issues persist:
1. **Check allowed origins** in `api_server.py`
2. **For development**, ensure middleware is configured:
```python
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_methods=["*"],
allow_headers=["*"],
)
```
---
## Image Download Issues
### Issue: Images Not Downloading
**Cause:** Network issues or Google blocking requests.
**Solution:**
1. **Check network connectivity**
2. **Verify image URLs** are accessible
3. **Reduce parallel downloads:**
```yaml
download_threads: 2 # Reduce from default 4
```
4. **Check disk space** for image storage
---
### Issue: Images Corrupted or Wrong Size
**Cause:** Partial downloads or URL issues.
**Solution:**
1. **Clear image directory** and re-run:
```bash
rm -rf review_images/
```
2. **Check max dimensions** in config:
```yaml
max_width: 1200
max_height: 1200
```
---
### Issue: Permission Denied Writing Images
**Error Message:**
```
PermissionError: [Errno 13] Permission denied
```
**Solution:**
1. **Check directory permissions:**
```bash
chmod 755 review_images/
```
2. **Use different directory:**
```yaml
image_dir: "/path/with/write/access"
```
---
## Configuration Issues
### Issue: Config File Not Found
**Error Message:**
```
FileNotFoundError: config.yaml not found
```
**Solution:**
1. **Create config.yaml** from example:
```bash
cp examples/config-example.txt config.yaml
```
2. **Specify custom path:**
```bash
python start.py --config /path/to/config.yaml
```
---
### Issue: Invalid YAML Syntax
**Error Message:**
```
yaml.scanner.ScannerError: mapping values are not allowed here
```
**Solution:**
1. **Validate YAML syntax** using online validator
2. **Check indentation** - use spaces, not tabs
3. **Escape special characters** in strings:
```yaml
url: "https://example.com?param=value" # Use quotes
```
---
### Issue: Invalid Configuration Values
**Error Message:**
```
ValueError: Invalid sort_by value
```
**Solution:**
1. **Check allowed values:**
- `sort_by`: newest, highest, lowest, relevance
- `headless`: true, false
2. **Verify types:**
```yaml
download_threads: 4 # Integer, not string
headless: true # Boolean, not string "true"
```
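The checks above can be run before starting a long scrape. The sketch below validates only the three keys mentioned in this section; `validate_config` is a hypothetical helper, and the defaults for `sort_by` and `headless` are assumptions (the default of 4 for `download_threads` comes from this guide).

```python
ALLOWED_SORT = {"newest", "highest", "lowest", "relevance"}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the values look fine."""
    errors = []

    sort_by = config.get("sort_by", "newest")  # assumed default
    if sort_by not in ALLOWED_SORT:
        errors.append(f"sort_by must be one of {sorted(ALLOWED_SORT)}, got {sort_by!r}")

    headless = config.get("headless", True)  # assumed default
    if not isinstance(headless, bool):
        errors.append(f"headless must be a boolean, got {headless!r}")

    threads = config.get("download_threads", 4)
    # bool is a subclass of int in Python, so it has to be excluded explicitly
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        errors.append(f"download_threads must be a positive integer, got {threads!r}")

    return errors


for problem in validate_config({"sort_by": "oldest", "headless": "true", "download_threads": "4"}):
    print(problem)
```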
---
## Performance Issues
### Issue: Scraping Too Slow
**Solution:**
1. **Use headless mode:**
```bash
python start.py --headless
```
2. **Reduce image download threads** if network is slow:
```yaml
download_threads: 2
```
3. **Disable image downloading** for faster scraping:
```yaml
download_images: false
```
4. **Use SSD** for faster JSON/image writes
---
### Issue: High Memory Usage
**Solution:**
1. **Process in batches** - use `stop_on_match` for incremental scraping
2. **Disable image downloading** temporarily
3. **Close other applications**
4. **Increase system swap** if needed
---
### Issue: Chrome Using Too Much CPU
**Solution:**
1. **Use headless mode** - reduces rendering overhead
2. **Add GPU flags:**
```
--disable-gpu
--disable-software-rasterizer
```
3. **Limit concurrent jobs** in API mode
---
## Python & Dependencies Issues
### Issue: Module Not Found
**Error Message:**
```
ModuleNotFoundError: No module named 'seleniumbase'
```
**Solution:**
1. **Install dependencies:**
```bash
pip install -r requirements.txt
```
2. **Verify virtual environment is activated:**
```bash
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
```
---
### Issue: Incompatible Package Versions
**Error Message:**
```
ImportError: cannot import name 'X' from 'Y'
```
**Solution:**
1. **Reinstall all dependencies:**
```bash
pip uninstall -r requirements.txt -y
pip install -r requirements.txt
```
2. **Create fresh virtual environment:**
```bash
python -m venv fresh_venv
source fresh_venv/bin/activate
pip install -r requirements.txt
```
---
### Issue: Python Version Incompatibility
**Error Message:**
```
SyntaxError: invalid syntax
```
**Solution:**
1. **Check Python version** (requires 3.10+, matching the versions listed in the README):
```bash
python --version
```
2. **Install correct Python version:**
```bash
# macOS with pyenv
pyenv install 3.13.1
pyenv local 3.13.1
# Or use system package manager
```
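A version guard at the top of a script turns a confusing `SyntaxError` into a clear message. This is a sketch, assuming the minimum supported version is 3.10 (the lowest version on the README badge); `python_is_supported` and `MIN_PYTHON` are hypothetical names.

```python
import sys

MIN_PYTHON = (3, 10)  # assumption: lowest version listed on the README badge


def python_is_supported(version_info=None) -> bool:
    """Return True if the given (or running) interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {sys.version.split()[0]}")
```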
---
## Getting Help
If your issue isn't listed here:
1. **Enable debug logging:**
```bash
LOG_LEVEL=DEBUG python start.py
```
2. **Check logs** for detailed error messages
3. **Search existing issues** on GitHub
4. **Create a new issue** with:
- Error message (full traceback)
- Python version (`python --version`)
- OS and version
- Chrome version
- Steps to reproduce
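For custom scripts that wrap the scraper, the same `LOG_LEVEL` variable can drive Python's standard logging. The sketch below shows one way to do that; it is not taken from the scraper's source, and `resolve_log_level` is a hypothetical helper that falls back to `INFO` for unknown names.

```python
import logging
import os


def resolve_log_level(default: str = "INFO") -> int:
    """Translate the LOG_LEVEL environment variable into a logging level number."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string such as "Level FOO" for unknown names
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level())
logging.getLogger("scraper-wrapper").debug("Debug logging is enabled")
```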


@@ -1,5 +1,6 @@
""" """
Selenium scraping logic for Google Maps Reviews. Selenium scraping logic for Google Maps Reviews.
Uses SeleniumBase UC Mode for enhanced anti-detection and better Chrome version management.
""" """
import logging import logging
@@ -10,7 +11,7 @@ import time
import traceback import traceback
from typing import Dict, Any, List from typing import Dict, Any, List
import undetected_chromedriver as uc from seleniumbase import Driver
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from selenium.webdriver import Chrome from selenium.webdriver import Chrome
from selenium.webdriver.common.action_chains import ActionChains from selenium.webdriver.common.action_chains import ActionChains
@@ -169,72 +170,87 @@ class GoogleReviewsScraper:
self.backup_to_json = config.get("backup_to_json", True) self.backup_to_json = config.get("backup_to_json", True)
self.overwrite_existing = config.get("overwrite_existing", False) self.overwrite_existing = config.get("overwrite_existing", False)
def setup_driver(self, headless: bool) -> Chrome: def setup_driver(self, headless: bool):
""" """
Set up and configure Chrome driver with flexibility for different environments. Set up and configure Chrome driver using SeleniumBase UC Mode.
SeleniumBase provides enhanced anti-detection and automatic Chrome/ChromeDriver version management.
Works in both Docker containers and on regular OS installations (Windows, Mac, Linux). Works in both Docker containers and on regular OS installations (Windows, Mac, Linux).
""" """
# Determine if we're running in a container
in_container = os.environ.get('CHROME_BIN') is not None
# Create Chrome options
opts = uc.ChromeOptions()
opts.add_argument("--window-size=1400,900")
opts.add_argument("--ignore-certificate-errors")
opts.add_argument("--disable-gpu") # Improves performance
opts.add_argument("--disable-dev-shm-usage") # Helps with stability
opts.add_argument("--no-sandbox") # More stable in some environments
# Use headless mode if requested
if headless:
opts.add_argument("--headless=new")
# Log platform information for debugging # Log platform information for debugging
log.info(f"Platform: {platform.platform()}") log.info(f"Platform: {platform.platform()}")
log.info(f"Python version: {platform.python_version()}") log.info(f"Python version: {platform.python_version()}")
log.info("Using SeleniumBase UC Mode for enhanced anti-detection")
# Determine if we're running in a container
in_container = os.environ.get('CHROME_BIN') is not None
# If in container, use environment-provided binaries
if in_container: if in_container:
chrome_binary = os.environ.get('CHROME_BIN') chrome_binary = os.environ.get('CHROME_BIN')
chromedriver_path = os.environ.get('CHROMEDRIVER_PATH')
log.info(f"Container environment detected") log.info(f"Container environment detected")
log.info(f"Chrome binary: {chrome_binary}") log.info(f"Chrome binary: {chrome_binary}")
log.info(f"ChromeDriver path: {chromedriver_path}")
# Create driver with custom binary location for containers
if chrome_binary and os.path.exists(chrome_binary): if chrome_binary and os.path.exists(chrome_binary):
log.info(f"Using Chrome binary from environment: {chrome_binary}")
opts.binary_location = chrome_binary
try: try:
# Try creating Chrome driver with undetected_chromedriver driver = Driver(
log.info("Attempting to create undetected_chromedriver instance") uc=True,
driver = uc.Chrome(options=opts) headless=headless,
log.info("Successfully created undetected_chromedriver instance") binary_location=chrome_binary,
page_load_strategy="normal"
)
log.info("Successfully created SeleniumBase UC driver with custom binary")
except Exception as e: except Exception as e:
# Fall back to regular Selenium if undetected_chromedriver fails log.warning(f"Failed to create driver with custom binary: {e}")
log.warning(f"Failed to create undetected_chromedriver instance: {e}") # Fall back to default
log.info("Falling back to regular Selenium Chrome") driver = Driver(
uc=True,
# Import Selenium webdriver here to avoid potential import issues headless=headless,
from selenium import webdriver page_load_strategy="normal"
from selenium.webdriver.chrome.service import Service )
log.info("Successfully created SeleniumBase UC driver with defaults")
if chromedriver_path and os.path.exists(chromedriver_path):
log.info(f"Using ChromeDriver from path: {chromedriver_path}")
service = Service(executable_path=chromedriver_path)
driver = webdriver.Chrome(service=service, options=opts)
else: else:
log.info("Using default ChromeDriver") driver = Driver(
driver = webdriver.Chrome(options=opts) uc=True,
headless=headless,
page_load_strategy="normal"
)
log.info("Successfully created SeleniumBase UC driver")
else: else:
# On regular OS, use default undetected_chromedriver # Regular OS environment - SeleniumBase handles version matching automatically
log.info("Using standard undetected_chromedriver setup") log.info("Creating SeleniumBase UC Mode driver")
driver = uc.Chrome(options=opts) try:
driver = Driver(
uc=True,
headless=headless,
page_load_strategy="normal",
incognito=True # Use incognito mode for better stealth
)
log.info("Successfully created SeleniumBase UC driver")
except Exception as e:
log.error(f"Failed to create SeleniumBase driver: {e}")
raise
# Set page load timeout to avoid hanging # Set page load timeout to avoid hanging
driver.set_page_load_timeout(30) driver.set_page_load_timeout(30)
log.info("Chrome driver setup completed successfully")
# Set window size
driver.set_window_size(1400, 900)
# Add additional stealth settings
try:
# Disable automation flags
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
'source': '''
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
'''
})
log.info("Additional stealth settings applied")
except Exception as e:
log.debug(f"Could not apply additional stealth settings: {e}")
log.info("SeleniumBase UC driver setup completed successfully")
return driver return driver
def dismiss_cookies(self, driver: Chrome): def dismiss_cookies(self, driver: Chrome):
@@ -471,9 +487,11 @@ class GoogleReviewsScraper:
parts = current_url.split('/place/') parts = current_url.split('/place/')
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews?hl={lang_code}" new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews?hl={lang_code}"
driver.get(new_url) driver.get(new_url)
time.sleep(2) time.sleep(3) # Increased wait time for page load
if "review" in driver.current_url.lower(): if "review" in driver.current_url.lower():
log.info("Navigated directly to reviews page via URL") log.info("Navigated directly to reviews page via URL")
# Extra wait for reviews to render after URL navigation
time.sleep(2)
return True return True
# Try to identify reviews link in URL # Try to identify reviews link in URL
@@ -481,9 +499,11 @@ class GoogleReviewsScraper:
parts = current_url.split('/place/') parts = current_url.split('/place/')
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews" new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews"
driver.get(new_url) driver.get(new_url)
time.sleep(2) time.sleep(3) # Increased wait time for page load
if "review" in driver.current_url.lower(): if "review" in driver.current_url.lower():
log.info("Navigated directly to reviews page via URL") log.info("Navigated directly to reviews page via URL")
# Extra wait for reviews to render after URL navigation
time.sleep(2)
return True return True
except Exception as url_error: except Exception as url_error:
log.warning(f"Failed to navigate to reviews via URL: {url_error}") log.warning(f"Failed to navigate to reviews via URL: {url_error}")
@@ -831,23 +851,10 @@ class GoogleReviewsScraper:
target_item = None target_item = None
matched_text = None matched_text = None
# 1. First try direct text matching # Log all available menu items for debugging
wanted_labels = SORT_OPTIONS.get(method, []) log.info(f"Available menu items: {[text for _, text in visible_items]}")
for item, text in visible_items: # Use position-based selection (most reliable for Google Maps)
for label in wanted_labels:
if (label in text or text in label or
(len(text) > 0 and len(label) > 0 and
text.lower().startswith(label.lower()[:3]))):
target_item = item
matched_text = text
log.info(f"Found matching menu item: '{text}' for '{label}'")
break
if target_item:
break
# 2. If no match found, try position-based selection
if not target_item and visible_items:
position_map = { position_map = {
"relevance": 0, # Usually the first option "relevance": 0, # Usually the first option
"newest": 1, # Usually the second option "newest": 1, # Usually the second option
@@ -858,7 +865,23 @@ class GoogleReviewsScraper:
pos = position_map.get(method, -1) pos = position_map.get(method, -1)
if pos >= 0 and pos < len(visible_items): if pos >= 0 and pos < len(visible_items):
target_item, matched_text = visible_items[pos] target_item, matched_text = visible_items[pos]
log.info(f"Using position-based selection (position {pos}) for '{method}'") log.info(f"Selected menu item at position {pos + 1}: '{matched_text}' for sort method '{method}'")
# Validate the selection makes sense
wanted_labels = SORT_OPTIONS.get(method, [])
text_clean = matched_text.lower()
# Check if selected text contains any of the expected keywords
valid_selection = False
for label in wanted_labels:
if label.lower() in text_clean or text_clean in label.lower():
valid_selection = True
break
if not valid_selection:
log.warning(f"WARNING: Selected '{matched_text}' doesn't match expected '{method}' - might be wrong sort!")
else:
log.warning(f"Position {pos} not available in menu (only {len(visible_items)} items)")
# 3. If target found, click it # 3. If target found, click it
if target_item: if target_item:
@@ -1108,16 +1131,55 @@ class GoogleReviewsScraper:
self.dismiss_cookies(driver) self.dismiss_cookies(driver)
self.click_reviews_tab(driver) self.click_reviews_tab(driver)
self.set_sort(driver, sort_by)
# Add a wait after setting sort to allow results to load # Extra wait after clicking reviews tab to ensure page loads
time.sleep(1) log.info("Waiting for reviews page to fully load...")
time.sleep(3)
# Wait for page to be fully interactive
try:
wait.until(lambda d: d.execute_script("return document.readyState") == "complete")
log.info("Page DOM is ready")
except Exception:
log.debug("Could not verify page ready state")
# Verify we're on a reviews page before proceeding
if "review" not in driver.current_url.lower():
log.warning("URL doesn't contain 'review' - might not be on reviews page")
# Try to set sort - but don't fail if it doesn't work
try:
self.set_sort(driver, sort_by)
except Exception as sort_error:
log.warning(f"Sort failed but continuing: {sort_error}")
# Add a longer wait after setting sort to allow results to load
log.info("Waiting for reviews to render...")
time.sleep(3)
# Use try-except to handle cases where the pane is not found # Use try-except to handle cases where the pane is not found
# Try multiple selectors for the reviews pane
pane = None
pane_selectors = [
PANE_SEL, # Primary selector
'div[role="main"] div.m6QErb', # Simplified version
'div.m6QErb.DxyBCb', # Even more simplified
'div[role="main"]' # Most generic
]
for selector in pane_selectors:
try: try:
pane = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, PANE_SEL))) log.info(f"Trying to find reviews pane with selector: {selector}")
pane = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
if pane:
log.info(f"Found reviews pane with selector: {selector}")
break
except TimeoutException: except TimeoutException:
log.warning("Could not find reviews pane. Page structure might have changed.") log.debug(f"Pane not found with selector: {selector}")
continue
if not pane:
log.warning("Could not find reviews pane with any selector. Page structure might have changed.")
return False return False
pbar = tqdm(desc="Scraped", ncols=80, initial=len(seen)) pbar = tqdm(desc="Scraped", ncols=80, initial=len(seen))
@@ -1132,8 +1194,12 @@ class GoogleReviewsScraper:
log.warning(f"Error setting up scroll script: {e}") log.warning(f"Error setting up scroll script: {e}")
scroll_script = "window.scrollBy(0, 300);" # Fallback to simple scrolling scroll_script = "window.scrollBy(0, 300);" # Fallback to simple scrolling
max_attempts = 10 # Limit the number of attempts to find reviews max_attempts = 50 # Increased from 10 to 50 for very patient scrolling
attempts = 0 attempts = 0
max_idle = 15 # Increased from 3 to 15 - much more patience for lazy-loaded reviews
consecutive_no_cards = 0 # Track how many times we find zero cards
last_scroll_position = 0
scroll_stuck_count = 0
while attempts < max_attempts: while attempts < max_attempts:
try: try:
@@ -1142,12 +1208,23 @@ class GoogleReviewsScraper:
# Check for valid cards # Check for valid cards
if len(cards) == 0: if len(cards) == 0:
log.debug("No review cards found in this iteration") consecutive_no_cards += 1
log.info(f"No review cards found in this iteration (consecutive: {consecutive_no_cards})")
# If we keep finding no cards, might have hit the end
if consecutive_no_cards > 5:
log.warning("No cards found for 5+ iterations - might be at end of reviews")
break
attempts += 1 attempts += 1
# Try scrolling anyway # Try aggressive scrolling
driver.execute_script(scroll_script) driver.execute_script(scroll_script)
time.sleep(1) time.sleep(1)
driver.execute_script("window.scrollBy(0, 1000);") # Extra scroll
time.sleep(1.5)
continue continue
else:
consecutive_no_cards = 0 # Reset counter when we find cards
for c in cards: for c in cards:
try: try:
@@ -1186,12 +1263,48 @@ class GoogleReviewsScraper:
idle = 0 idle = 0
attempts = 0 # Reset attempts counter when we successfully process a review attempts = 0 # Reset attempts counter when we successfully process a review
if idle >= 3: if idle >= max_idle:
log.info(f"Stopping: No new reviews found after {max_idle} scroll attempts")
break break
if not fresh_cards: if not fresh_cards:
idle += 1 idle += 1
attempts += 1 attempts += 1
log.info(f"No new reviews in this iteration (idle: {idle}/{max_idle}, attempts: {attempts}/{max_attempts}, total seen: {len(seen)})")
# When no new reviews, scroll more aggressively
try:
# Try multiple scroll methods
driver.execute_script(scroll_script)
time.sleep(0.5)
driver.execute_script("window.scrollBy(0, 500);") # Extra scroll
time.sleep(0.5)
except Exception as e:
log.warning(f"Error scrolling: {e}")
else:
log.info(f"Found {len(fresh_cards)} new reviews in this iteration")
# Check if we're actually scrolling or stuck
try:
current_scroll = driver.execute_script("return arguments[0].scrollTop;", pane)
if current_scroll == last_scroll_position and len(fresh_cards) == 0:
scroll_stuck_count += 1
log.warning(f"Scroll position hasn't changed (stuck at {current_scroll}px, stuck count: {scroll_stuck_count})")
if scroll_stuck_count > 5:
log.warning("Scroll is stuck - trying alternative scroll method")
# Scroll the pane's last child into view to force more reviews to load
try:
driver.execute_script("arguments[0].lastElementChild.scrollIntoView();", pane)
time.sleep(2)
except Exception:
pass
scroll_stuck_count = 0
else:
scroll_stuck_count = 0
last_scroll_position = current_scroll
except Exception:
pass
# Use JavaScript for smoother scrolling # Use JavaScript for smoother scrolling
try: try:
@@ -1201,8 +1314,13 @@ class GoogleReviewsScraper:
# Try a simpler scroll method # Try a simpler scroll method
driver.execute_script("window.scrollBy(0, 300);") driver.execute_script("window.scrollBy(0, 300);")
# Dynamic sleep: sleep less when processing many reviews # Dynamic sleep: sleep less when processing many reviews, more when finding none
sleep_time = 0.7 if len(fresh_cards) > 5 else 1.0 if len(fresh_cards) > 5:
sleep_time = 0.7
elif len(fresh_cards) == 0:
sleep_time = 2.0 # Wait longer when finding nothing (let page load)
else:
sleep_time = 1.0
time.sleep(sleep_time) time.sleep(sleep_time)
except StaleElementReferenceException: except StaleElementReferenceException:
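Review note: the new stop conditions (`max_idle`, `max_attempts`, `consecutive_no_cards`) interact in ways that are easier to see without a browser. The following is a simplified model of the counters as they appear in the hunks above, assuming each scroll pass can be summarised as a `(cards_visible, fresh_cards)` pair. `run_scroll_loop` and the returned reason strings are hypothetical names for this sketch, and the stuck-scroll handling is left out.

```python
from typing import Iterable, Tuple


def run_scroll_loop(
    passes: Iterable[Tuple[int, int]],
    max_idle: int = 15,
    max_attempts: int = 50,
    max_empty: int = 5,
) -> str:
    """Replay scroll passes and return the reason the loop would stop.

    Each pass is (cards_visible, fresh_cards). This mirrors the counter
    updates in the diff but performs no scrolling or sleeping.
    """
    idle = 0
    attempts = 0
    consecutive_no_cards = 0

    for cards, fresh in passes:
        if attempts >= max_attempts:
            return "max_attempts"

        if cards == 0:
            consecutive_no_cards += 1
            if consecutive_no_cards > max_empty:
                return "no_cards"
            attempts += 1
            continue
        consecutive_no_cards = 0  # reset once cards are visible again

        if fresh > 0:
            # Processing a new review resets both patience counters.
            idle = 0
            attempts = 0

        if idle >= max_idle:
            return "idle"

        if fresh == 0:
            idle += 1
            attempts += 1

    return "input_exhausted"


print(run_scroll_loop([(0, 0)] * 10))    # only empty passes
print(run_scroll_loop([(10, 0)] * 100))  # cards visible, none of them new
```

In this model, `idle` and `attempts` rise together on passes with no fresh cards, so with the defaults the idle limit is reached first. `max_attempts` only matters when empty passes are mixed in or `max_idle` is raised above it.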


@@ -1,17 +1,8 @@
requests==2.32.3 seleniumbase>=4.34.9
beautifulsoup4==4.12.3
aiohttp==3.11.11
googletrans==4.0.2 googletrans==4.0.2
selenium==4.15.2 tqdm>=4.66.3
undetected-chromedriver==3.5.4
tqdm==4.66.3
pymongo==4.12.0 pymongo==4.12.0
pyyaml==6.0.1
certifi==2024.7.4
webdriver-manager==4.0.2
setuptools==79.0.1
boto3==1.35.1 boto3==1.35.1
pytest==7.4.3
fastapi==0.104.1 fastapi==0.104.1
uvicorn==0.24.0 uvicorn==0.24.0
botocore~=1.35.99 botocore~=1.35.99


@@ -0,0 +1,110 @@
"""
Tests for SeleniumBase UC Mode integration.
Verifies that the driver setup works correctly with the new library.
"""

import pytest

from modules.scraper import GoogleReviewsScraper


def test_seleniumbase_driver_creation():
    """Test that SeleniumBase driver can be created successfully"""
    config = {
        "url": "https://maps.app.goo.gl/test",
        "headless": True,
        "use_mongodb": False,
        "backup_to_json": False
    }

    scraper = GoogleReviewsScraper(config)

    # Test driver creation
    driver = None
    try:
        driver = scraper.setup_driver(headless=True)
        assert driver is not None
        assert driver.name == "chrome"

        # Verify driver can navigate
        driver.get("https://www.google.com")
        assert "google" in driver.current_url.lower()
    finally:
        if driver:
            driver.quit()


def test_seleniumbase_driver_headless_mode():
    """Test that headless mode works correctly"""
    config = {
        "url": "https://maps.app.goo.gl/test",
        "headless": True,
        "use_mongodb": False,
        "backup_to_json": False
    }

    scraper = GoogleReviewsScraper(config)

    driver = None
    try:
        driver = scraper.setup_driver(headless=True)
        assert driver is not None

        # In headless mode, window size should still be set
        size = driver.get_window_size()
        assert size['width'] == 1400
        assert size['height'] == 900
    finally:
        if driver:
            driver.quit()


def test_seleniumbase_driver_nonheadless_mode():
    """Test that non-headless mode works correctly"""
    config = {
        "url": "https://maps.app.goo.gl/test",
        "headless": False,
        "use_mongodb": False,
        "backup_to_json": False
    }

    scraper = GoogleReviewsScraper(config)

    driver = None
    try:
        driver = scraper.setup_driver(headless=False)
        assert driver is not None
        assert driver.name == "chrome"
    finally:
        if driver:
            driver.quit()


@pytest.mark.skip(reason="Integration test - requires network access")
def test_seleniumbase_google_maps_access():
    """Test that driver can access Google Maps (integration test)"""
    config = {
        "url": "https://maps.app.goo.gl/6tkNMDjcj3SS6LJe9",
        "headless": True,
        "use_mongodb": False,
        "backup_to_json": False
    }

    scraper = GoogleReviewsScraper(config)

    driver = None
    try:
        driver = scraper.setup_driver(headless=True)
        driver.get(config["url"])

        # Wait for redirect to Google Maps
        import time
        time.sleep(3)

        assert "google.com/maps" in driver.current_url
    finally:
        if driver:
            driver.quit()
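Review note: the integration test waits for the redirect with a fixed `time.sleep(3)`, which is too short on a slow connection and wasted time on a fast one. One alternative is to poll until the condition holds or a timeout expires. The helper below is a minimal sketch of that idea; `wait_until` is a hypothetical name and is not part of this commit or of SeleniumBase.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.25,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a URL that only "redirects" on the third check.
checks = {"count": 0}


def redirected() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(redirected, timeout=2.0, interval=0.01))
```

In the test, the assertion could then become `assert wait_until(lambda: "google.com/maps" in driver.current_url)`. Selenium's own `WebDriverWait` combined with `expected_conditions.url_contains` does the same job and matches the `wait.until(EC...)` style the scraper already uses.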