migrate to SeleniumBase UC Mode for automatic version management
- Replace undetected-chromedriver with seleniumbase for better Chrome/ChromeDriver compatibility
- Automatic version matching eliminates manual cache clearing and version conflicts
- Enhanced anti-detection with UC Mode and CDP stealth settings
- Simplified requirements.txt (SeleniumBase manages common dependencies)
- Fix sort selection bug (was selecting wrong menu items)
- Improve scrolling patience (max_idle: 3→15, max_attempts: 10→50)
- Add scroll position tracking to detect when stuck
- Add fallback pane selectors for better reliability
- Update documentation (README, ARCHITECTURE, TROUBLESHOOTING)
- Add comprehensive test suite for SeleniumBase integration
- Version bump to 1.0.1

Developed by George Khananaev
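
The scrolling changes listed above (a larger idle budget plus scroll position tracking) can be sketched roughly as follows. This is an illustration only, not the code in this commit: `scroll_until_stuck`, `scroll_once` and `count_items` are hypothetical names, and only the `max_idle` and `max_attempts` defaults (15 and 50) come from the commit message.

```python
from typing import Callable


def scroll_until_stuck(
    scroll_once: Callable[[], int],
    count_items: Callable[[], int],
    max_idle: int = 15,
    max_attempts: int = 50,
) -> int:
    """Scroll until nothing changes for `max_idle` rounds or `max_attempts` is reached.

    `scroll_once` (hypothetical) scrolls the pane and returns its new scroll position.
    `count_items` (hypothetical) returns how many review cards are currently loaded.
    Returns the last observed item count.
    """
    idle = 0
    last_position = -1
    last_count = count_items()
    for _ in range(max_attempts):
        position = scroll_once()
        count = count_items()
        if position == last_position and count == last_count:
            # Neither the scroll position nor the item count moved: likely stuck.
            idle += 1
            if idle >= max_idle:
                break
        else:
            idle = 0
        last_position, last_count = position, count
    return last_count
```

Tracking the scroll position alongside the item count lets the loop tell "still loading" apart from "the pane has stopped moving", which is presumably why the idle budget could be raised without making stuck runs much longer.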
8
.gitignore
vendored
@@ -11,6 +11,7 @@ Desktop.ini
 # -----------------------------------------------------------
 .idea/
 .vscode/
+.claude/
 *.swp
 *.swo
 *~
@@ -48,6 +49,7 @@ logs.db
 *.sqlite
 *.sqlite3
 *.db
+docs/AGENTS_LOG

 # -----------------------------------------------------------
 # Config Files
@@ -68,6 +70,12 @@ review_images/
 images/
 downloaded_images/

+# -----------------------------------------------------------
+# SeleniumBase Files
+# -----------------------------------------------------------
+downloaded_files/
+*.lock
+
 # -----------------------------------------------------------
 # Temporary and Output Files
 # -----------------------------------------------------------
13
README.md
@@ -1,16 +1,16 @@
 # 🔥 Google Reviews Scraper Pro (2025) 🔥







 **FINALLY! A scraper that ACTUALLY WORKS in 2025!** While others break with every Google update, this bad boy keeps on trucking. Say goodbye to the frustration of constantly broken scrapers and hello to a beast that rips through Google's defenses like a hot knife through butter. This battle-tested, rock-solid solution will extract every juicy detail from Google reviews while laughing in the face of rate limiting.

 ## 🌟 Feature Artillery

 - **Bulletproof in 2025**: While the competition falls apart, we've cracked Google's latest tricks
-- **Ninja-Mode Selenium**: Our undetected-chromedriver flies under the radar where others get insta-blocked
+- **Enhanced SeleniumBase UC Mode**: Superior anti-detection with automatic Chrome/ChromeDriver version matching - no more version headaches!
 - **Polyglot Powerhouse**: Devours reviews in a smorgasbord of languages - English, Hebrew, Thai, German, you name it!
 - **MongoDB Mastery**: Dumps pristine data structures straight into your MongoDB instance
 - **Paranoid Backups**: Mirrors everything to local JSON files because losing data sucks
@@ -350,9 +350,10 @@ print(f"Reviews with images: {len(reviews_with_images)}")
 ### DEFCON Scenarios & Quick Fixes

 1. **Chrome/Driver Having a Lovers' Quarrel**
-   - Update your damn Chrome browser already! It's 2025, people
-   - Nuke and reinstall the driver: `pip uninstall undetected-chromedriver` then `pip install undetected-chromedriver==3.5.4`
-   - If you're on Ubuntu, sometimes a simple `apt update && apt upgrade` fixes weird Chrome issues
+   - **Good news!** SeleniumBase handles Chrome/ChromeDriver version matching automatically
+   - Update Chrome browser: Go to chrome://settings/help
+   - SeleniumBase will automatically download the matching ChromeDriver - no manual intervention needed!
+   - If issues persist: `pip install --upgrade seleniumbase`

 2. **MongoDB Throwing a Tantrum**
    - Double-check your connection string - typos are the #1 culprit
2760
docs/ARCHITECTURE.md
Normal file
File diff suppressed because it is too large
708
docs/TROUBLESHOOTING.md
Normal file
@@ -0,0 +1,708 @@
# Troubleshooting Guide

This guide covers common issues and their solutions when running Google Reviews Scraper Pro.

---

## Table of Contents

1. [Chrome & ChromeDriver Issues](#chrome--chromedriver-issues)
2. [MongoDB Issues](#mongodb-issues)
3. [AWS S3 Issues](#aws-s3-issues)
4. [Scraping Issues](#scraping-issues)
5. [API Server Issues](#api-server-issues)
6. [Image Download Issues](#image-download-issues)
7. [Configuration Issues](#configuration-issues)
8. [Performance Issues](#performance-issues)
9. [Python & Dependencies Issues](#python--dependencies-issues)

---

## Chrome & ChromeDriver Issues

### Issue: ChromeDriver Version Mismatch

**Error Message:**
```
SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 143
Current browser version is 142.0.7444.176
```

**Cause:** Chrome/ChromeDriver version mismatch (this issue is now automatically handled by SeleniumBase).

**Solution:**

**Good News:** With SeleniumBase UC Mode, version mismatches are automatically resolved!

1. **Update Chrome to latest version:**
   - macOS: Open Chrome → Menu → Help → About Google Chrome
   - Or run: `open -a "Google Chrome" "chrome://settings/help"`

2. **Upgrade SeleniumBase (if needed):**
   ```bash
   pip install --upgrade seleniumbase
   ```

3. **Run scraper again** - SeleniumBase automatically downloads the matching ChromeDriver.

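When this error shows up in logs, it can help to pull out the two major versions before deciding whether to update Chrome or the driver. The helper below is a minimal sketch, assuming the message has the shape shown above; `parse_version_mismatch` is a hypothetical name and is not part of the scraper or of SeleniumBase.

```python
import re
from typing import Optional, Tuple


def parse_version_mismatch(message: str) -> Optional[Tuple[int, int]]:
    """Return (driver_supported_major, browser_major), or None for other messages."""
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is (\d+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))


message = (
    "SessionNotCreatedException: Message: session not created: "
    "This version of ChromeDriver only supports Chrome version 143\n"
    "Current browser version is 142.0.7444.176"
)
result = parse_version_mismatch(message)  # (143, 142): the driver is newer than the browser
```

If the first number is larger, updating Chrome is the likely fix; if it is smaller, the driver is the stale side.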

---

### Issue: ChromeOptions Reuse Error

**Error Message:**
```
RuntimeError: you cannot reuse the ChromeOptions object
```

**Cause:** Internal error when retrying Chrome initialization.

**Solution:** Clear the ChromeDriver cache and restart the scraper.

---

### Issue: Chrome Binary Not Found

**Error Message:**
```
WebDriverException: Message: unknown error: cannot find Chrome binary
```

**Cause:** Chrome is not installed or not in the expected location.

**Solution:**

1. **Install Chrome:**
   - Download from: https://www.google.com/chrome/

2. **For custom Chrome location, set environment variable:**
   ```bash
   export CHROME_BIN=/path/to/chrome
   ```

3. **Docker users:** Ensure Chrome is installed in Dockerfile:
   ```dockerfile
   RUN apt-get update && apt-get install -y google-chrome-stable
   ENV CHROME_BIN=/usr/bin/google-chrome
   ```

---

### Issue: Chrome Crashes in Headless Mode

**Error Message:**
```
WebDriverException: Message: chrome not reachable
```

**Solution:**

1. **Add required flags** (already included in scraper, but verify):
   ```
   --no-sandbox
   --disable-dev-shm-usage
   --disable-gpu
   ```

2. **Increase shared memory** (Docker):
   ```bash
   docker run --shm-size=2g your-image
   ```

3. **Try non-headless mode** to debug:
   ```bash
   python start.py --headless false
   ```

---

## MongoDB Issues

### Issue: Connection Timeout

**Error Message:**
```
ServerSelectionTimeoutError: connection timed out
```

**Cause:** MongoDB server unreachable or network issues.

**Solution:**

1. **Verify MongoDB is running:**
   ```bash
   # Local MongoDB
   mongosh --eval "db.adminCommand('ping')"

   # Check service status
   sudo systemctl status mongod
   ```

2. **Check connection URI:**
   ```yaml
   # config.yaml
   mongodb:
     uri: "mongodb://username:password@host:27017/"
   ```

3. **For MongoDB Atlas:**
   - Whitelist your IP address in Atlas dashboard
   - Verify cluster is active
   - Check network connectivity

4. **Test connection manually:**
   ```bash
   python -c "from pymongo import MongoClient; c = MongoClient('your-uri', serverSelectionTimeoutMS=5000); print(c.server_info())"
   ```

---

### Issue: Authentication Failed

**Error Message:**
```
OperationFailure: Authentication failed
```

**Solution:**

1. **Verify credentials** in connection URI
2. **Check database name** matches the authentication database
3. **Use correct URI format:**
   ```
   mongodb://username:password@host:27017/database?authSource=admin
   ```

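Before blaming the server, it is often worth checking what the URI actually says. The sketch below uses only the standard library to break a single-host URI into the parts discussed above; `describe_mongo_uri` is a hypothetical helper, not part of the scraper or of pymongo. It assumes one host (no replica-set host lists) and that any special characters in the password are already percent-encoded.

```python
from typing import Any, Dict
from urllib.parse import parse_qs, urlsplit


def describe_mongo_uri(uri: str) -> Dict[str, Any]:
    """Split a single-host MongoDB URI into the fields that matter for authentication."""
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    query = parse_qs(parts.query)
    database = parts.path.lstrip("/") or None
    # Without authSource, MongoDB authenticates against the URI's database,
    # or against "admin" when the URI names no database.
    default_source = database or "admin"
    return {
        "host": parts.hostname,
        "port": parts.port,
        "has_credentials": parts.username is not None and parts.password is not None,
        "database": database,
        "auth_source": query.get("authSource", [default_source])[0],
    }


info = describe_mongo_uri("mongodb://username:password@host:27017/database?authSource=admin")
```

A mismatch between `auth_source` and the database where the user was created is a common reason for `Authentication failed` even when the password is correct.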

---

### Issue: SSL Certificate Error

**Error Message:**
```
SSL: CERTIFICATE_VERIFY_FAILED
```

**Solution:**

1. **For macOS**, run:
   ```bash
   /Applications/Python\ 3.x/Install\ Certificates.command
   ```

2. **Or install certifi:**
   ```bash
   pip install --upgrade certifi
   ```

3. **The scraper auto-handles this**, but if issues persist:
   ```python
   import certifi
   import os
   os.environ['SSL_CERT_FILE'] = certifi.where()
   ```

---

## AWS S3 Issues

### Issue: Access Denied

**Error Message:**
```
ClientError: An error occurred (AccessDenied) when calling the PutObject operation
```

**Solution:**

1. **Verify AWS credentials:**
   ```yaml
   # config.yaml
   s3:
     aws_access_key_id: "YOUR_ACCESS_KEY"
     aws_secret_access_key: "YOUR_SECRET_KEY"
   ```

2. **Check IAM permissions** - required policy:
   ```json
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "s3:PutObject",
           "s3:GetObject",
           "s3:ListBucket",
           "s3:PutObjectAcl"
         ],
         "Resource": [
           "arn:aws:s3:::your-bucket-name",
           "arn:aws:s3:::your-bucket-name/*"
         ]
       }
     ]
   }
   ```

3. **Check bucket policy** allows public-read if using public URLs

---

### Issue: Bucket Not Found

**Error Message:**
```
ClientError: An error occurred (NoSuchBucket)
```

**Solution:**

1. **Verify bucket name** in config.yaml
2. **Check region** matches bucket location:
   ```yaml
   s3:
     region_name: "us-east-1"  # Must match bucket region
     bucket_name: "your-bucket"
   ```

3. **Create bucket** if it doesn't exist via AWS Console or CLI

---

### Issue: Invalid Credentials

**Error Message:**
```
NoCredentialsError: Unable to locate credentials
```

**Solution:**

1. **Set credentials in config.yaml** or environment variables:
   ```bash
   export AWS_ACCESS_KEY_ID=your_key
   export AWS_SECRET_ACCESS_KEY=your_secret
   ```

2. **Or use AWS credentials file:**
   ```
   ~/.aws/credentials
   [default]
   aws_access_key_id = YOUR_KEY
   aws_secret_access_key = YOUR_SECRET
   ```

---

## Scraping Issues

### Issue: Reviews Tab Not Found

**Error Message:**
```
TimeoutException: Reviews tab not found or could not be clicked
```

**Cause:** Google Maps UI changed or page didn't load properly.

**Solution:**

1. **Try non-headless mode** to see what's happening:
   ```bash
   python start.py --headless false
   ```

2. **Check the URL** is a valid Google Maps place URL

3. **Increase timeout** - network may be slow

4. **Clear cookies/cache** - Google may be showing consent dialogs

5. **Try different sort order:**
   ```bash
   python start.py --sort relevance
   ```

---

### Issue: No Reviews Found

**Error Message:**
```
WARNING: No review cards found in this iteration
```

**Cause:** Page structure changed or place has no reviews.

**Solution:**

1. **Verify the place has reviews** by opening URL in browser
2. **Check if page requires login** for reviews
3. **Wait longer** for page to load - add delay in config
4. **Check for CAPTCHA** - may need to solve manually first

---

### Issue: Stale Element Reference

**Error Message:**
```
StaleElementReferenceException: stale element reference: element is not attached to the page document
```

**Cause:** Page updated while scraping.

**Solution:** This is handled automatically by the scraper. If persistent:

1. **Reduce scroll speed** - increase sleep time
2. **Run in non-headless mode** to observe behavior
3. **Restart scraper** - temporary DOM issue

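For readers extending the scraper, a common way to handle stale references is to re-run the lookup after a short pause. The helper below is a generic sketch of that pattern, not the scraper's actual implementation; `retry_on` is a hypothetical name. It is written without a Selenium import so it runs on its own. With Selenium you would pass `(StaleElementReferenceException,)` and an `action` that locates the element again before reading it.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on(
    exceptions: Tuple[Type[BaseException], ...],
    action: Callable[[], T],
    attempts: int = 3,
    delay: float = 0.5,
) -> T:
    """Call `action` until it succeeds, retrying only on the given exception types."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay)  # give the page time to settle before retrying
    raise ValueError("attempts must be at least 1")
```

The important detail is that `action` must look the element up again on every call; retrying with a cached element reference would fail the same way each time.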

---

### Issue: Cookie Consent Blocking

**Cause:** Cookie dialog not being dismissed.

**Solution:**

1. **Clear browser data:**
   ```bash
   rm -rf ~/Library/Application\ Support/undetected_chromedriver
   ```

2. **The scraper handles this automatically**, but you can:
   - Open the URL manually first and accept cookies
   - Use a different Google account region

---

## API Server Issues

### Issue: Port Already in Use

**Error Message:**
```
OSError: [Errno 48] Address already in use
```

**Solution:**

1. **Find and kill the process:**
   ```bash
   # Find process using port 8000
   lsof -i :8000

   # Kill the process
   kill -9 <PID>
   ```

2. **Use different port:**
   ```bash
   uvicorn api_server:app --port 8080
   ```

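The same check can be made from Python before starting the server, which is handy in launch scripts. This is a minimal sketch using only the standard library; `port_is_free` is a hypothetical helper and not part of `api_server.py`.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use"; the errno differs between platforms.
            return False
        return True
```

The result is only a snapshot: another process can still take the port between the check and the server start, so the server's own error handling remains the source of truth.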

---

### Issue: Max Concurrent Jobs Reached

**Error Message:**
```
HTTP 429: Maximum concurrent jobs (3) reached
```

**Solution:**

1. **Wait for existing jobs** to complete
2. **Cancel pending jobs:**
   ```bash
   curl -X POST "http://localhost:8000/jobs/{job_id}/cancel"
   ```
3. **Increase limit** in `api_server.py` (not recommended for stability)

---

### Issue: CORS Errors (Browser)

**Error Message:**
```
Access-Control-Allow-Origin header missing
```

**Solution:** CORS is enabled by default. If issues persist:

1. **Check allowed origins** in `api_server.py`
2. **For development**, ensure middleware is configured:
   ```python
   app.add_middleware(
       CORSMiddleware,
       allow_origins=["*"],
       allow_methods=["*"],
       allow_headers=["*"],
   )
   ```

---

## Image Download Issues

### Issue: Images Not Downloading

**Cause:** Network issues or Google blocking requests.

**Solution:**

1. **Check network connectivity**
2. **Verify image URLs** are accessible
3. **Reduce parallel downloads:**
   ```yaml
   download_threads: 2  # Reduce from default 4
   ```

4. **Check disk space** for image storage

---

### Issue: Images Corrupted or Wrong Size

**Cause:** Partial downloads or URL issues.

**Solution:**

1. **Clear image directory** and re-run:
   ```bash
   rm -rf review_images/
   ```

2. **Check max dimensions** in config:
   ```yaml
   max_width: 1200
   max_height: 1200
   ```

---

### Issue: Permission Denied Writing Images

**Error Message:**
```
PermissionError: [Errno 13] Permission denied
```

**Solution:**

1. **Check directory permissions:**
   ```bash
   chmod 755 review_images/
   ```

2. **Use different directory:**
   ```yaml
   image_dir: "/path/with/write/access"
   ```

---

## Configuration Issues

### Issue: Config File Not Found

**Error Message:**
```
FileNotFoundError: config.yaml not found
```

**Solution:**

1. **Create config.yaml** from example:
   ```bash
   cp examples/config-example.txt config.yaml
   ```

2. **Specify custom path:**
   ```bash
   python start.py --config /path/to/config.yaml
   ```

---

### Issue: Invalid YAML Syntax

**Error Message:**
```
yaml.scanner.ScannerError: mapping values are not allowed here
```

**Solution:**

1. **Validate YAML syntax** using online validator
2. **Check indentation** - use spaces, not tabs
3. **Escape special characters** in strings:
   ```yaml
   url: "https://example.com?param=value"  # Use quotes
   ```

---

### Issue: Invalid Configuration Values

**Error Message:**
```
ValueError: Invalid sort_by value
```

**Solution:**

1. **Check allowed values:**
   - `sort_by`: newest, highest, lowest, relevance
   - `headless`: true, false

2. **Verify types:**
   ```yaml
   download_threads: 4  # Integer, not string
   headless: true       # Boolean, not string "true"
   ```

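The value and type rules above can be checked in a few lines before a run. This is a sketch under stated assumptions: `validate_config` is a hypothetical helper rather than the scraper's own validation, it takes an already-parsed config dictionary, and it only checks keys that are present, since this guide does not state the defaults for every key.

```python
from typing import Any, Dict, List

ALLOWED_SORT = {"newest", "highest", "lowest", "relevance"}


def validate_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []

    if "sort_by" in config and config["sort_by"] not in ALLOWED_SORT:
        problems.append(
            f"sort_by must be one of {sorted(ALLOWED_SORT)}, got {config['sort_by']!r}"
        )

    if "headless" in config and not isinstance(config["headless"], bool):
        problems.append(f"headless must be a boolean, got {config['headless']!r}")

    if "download_threads" in config:
        threads = config["download_threads"]
        # bool is a subclass of int in Python, so it has to be excluded explicitly.
        if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
            problems.append(f"download_threads must be a positive integer, got {threads!r}")

    return problems


problems = validate_config({"sort_by": "oldest", "headless": "true", "download_threads": "4"})
```

Collecting every problem instead of raising on the first one means a single run reports all the values that need fixing.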

---

## Performance Issues

### Issue: Scraping Too Slow

**Solution:**

1. **Use headless mode:**
   ```bash
   python start.py --headless
   ```

2. **Reduce image download threads** if network is slow:
   ```yaml
   download_threads: 2
   ```

3. **Disable image downloading** for faster scraping:
   ```yaml
   download_images: false
   ```

4. **Use SSD** for faster JSON/image writes

---

### Issue: High Memory Usage

**Solution:**

1. **Process in batches** - use `stop_on_match` for incremental scraping
2. **Disable image downloading** temporarily
3. **Close other applications**
4. **Increase system swap** if needed

---

### Issue: Chrome Using Too Much CPU

**Solution:**

1. **Use headless mode** - reduces rendering overhead
2. **Add GPU flags:**
   ```
   --disable-gpu
   --disable-software-rasterizer
   ```
3. **Limit concurrent jobs** in API mode

---

## Python & Dependencies Issues

### Issue: Module Not Found

**Error Message:**
```
ModuleNotFoundError: No module named 'seleniumbase'
```

**Solution:**

1. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

2. **Verify virtual environment is activated:**
   ```bash
   source venv/bin/activate  # Linux/macOS
   venv\Scripts\activate     # Windows
   ```

---

### Issue: Incompatible Package Versions

**Error Message:**
```
ImportError: cannot import name 'X' from 'Y'
```

**Solution:**

1. **Reinstall all dependencies:**
   ```bash
   pip uninstall -r requirements.txt -y
   pip install -r requirements.txt
   ```

2. **Create fresh virtual environment:**
   ```bash
   python -m venv fresh_venv
   source fresh_venv/bin/activate
   pip install -r requirements.txt
   ```

---

### Issue: Python Version Incompatibility

**Error Message:**
```
SyntaxError: invalid syntax
```

**Solution:**

1. **Check Python version** (requires 3.9+):
   ```bash
   python --version
   ```

2. **Install correct Python version:**
   ```bash
   # macOS with pyenv
   pyenv install 3.13.1
   pyenv local 3.13.1

   # Or use system package manager
   ```

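The version requirement can also be checked programmatically, for example at the top of a launcher script. This is a small sketch; `python_is_supported` and `MIN_PYTHON` are hypothetical names, and only the 3.9 minimum comes from this guide.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this guide


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the given interpreter version meets the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {sys.version.split()[0]}")
```

A check like this only helps if the file containing it parses on the old interpreter, so it belongs in a small entry-point script rather than in a module that already uses newer syntax.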

---

## Getting Help

If your issue isn't listed here:

1. **Enable debug logging:**
   ```bash
   LOG_LEVEL=DEBUG python start.py
   ```

2. **Check logs** for detailed error messages

3. **Search existing issues** on GitHub

4. **Create a new issue** with:
   - Error message (full traceback)
   - Python version (`python --version`)
   - OS and version
   - Chrome version
   - Steps to reproduce
@@ -1,5 +1,6 @@
|
|||||||
"""
|
"""
|
||||||
Selenium scraping logic for Google Maps Reviews.
|
Selenium scraping logic for Google Maps Reviews.
|
||||||
|
Uses SeleniumBase UC Mode for enhanced anti-detection and better Chrome version management.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import logging
|
import logging
|
||||||
@@ -10,7 +11,7 @@ import time
|
|||||||
import traceback
|
import traceback
|
||||||
from typing import Dict, Any, List
|
from typing import Dict, Any, List
|
||||||
|
|
||||||
import undetected_chromedriver as uc
|
from seleniumbase import Driver
|
||||||
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
|
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
|
||||||
from selenium.webdriver import Chrome
|
from selenium.webdriver import Chrome
|
||||||
from selenium.webdriver.common.action_chains import ActionChains
|
from selenium.webdriver.common.action_chains import ActionChains
|
||||||
@@ -169,72 +170,87 @@ class GoogleReviewsScraper:
|
|||||||
self.backup_to_json = config.get("backup_to_json", True)
|
self.backup_to_json = config.get("backup_to_json", True)
|
||||||
self.overwrite_existing = config.get("overwrite_existing", False)
|
self.overwrite_existing = config.get("overwrite_existing", False)
|
||||||
|
|
||||||
def setup_driver(self, headless: bool) -> Chrome:
|
def setup_driver(self, headless: bool):
|
||||||
"""
|
"""
|
||||||
Set up and configure Chrome driver with flexibility for different environments.
|
Set up and configure Chrome driver using SeleniumBase UC Mode.
|
||||||
|
SeleniumBase provides enhanced anti-detection and automatic Chrome/ChromeDriver version management.
|
||||||
Works in both Docker containers and on regular OS installations (Windows, Mac, Linux).
|
Works in both Docker containers and on regular OS installations (Windows, Mac, Linux).
|
||||||
"""
|
"""
|
||||||
# Determine if we're running in a container
|
|
||||||
in_container = os.environ.get('CHROME_BIN') is not None
|
|
||||||
|
|
||||||
# Create Chrome options
|
|
||||||
opts = uc.ChromeOptions()
|
|
||||||
opts.add_argument("--window-size=1400,900")
|
|
||||||
opts.add_argument("--ignore-certificate-errors")
|
|
||||||
opts.add_argument("--disable-gpu") # Improves performance
|
|
||||||
opts.add_argument("--disable-dev-shm-usage") # Helps with stability
|
|
||||||
opts.add_argument("--no-sandbox") # More stable in some environments
|
|
||||||
|
|
||||||
# Use headless mode if requested
|
|
||||||
if headless:
|
|
||||||
opts.add_argument("--headless=new")
|
|
||||||
|
|
||||||
# Log platform information for debugging
|
# Log platform information for debugging
|
||||||
log.info(f"Platform: {platform.platform()}")
|
log.info(f"Platform: {platform.platform()}")
|
||||||
log.info(f"Python version: {platform.python_version()}")
|
log.info(f"Python version: {platform.python_version()}")
|
||||||
|
log.info("Using SeleniumBase UC Mode for enhanced anti-detection")
|
||||||
|
|
||||||
|
# Determine if we're running in a container
|
||||||
|
in_container = os.environ.get('CHROME_BIN') is not None
|
||||||
|
|
||||||
# If in container, use environment-provided binaries
|
|
||||||
if in_container:
|
if in_container:
|
||||||
chrome_binary = os.environ.get('CHROME_BIN')
|
chrome_binary = os.environ.get('CHROME_BIN')
|
||||||
chromedriver_path = os.environ.get('CHROMEDRIVER_PATH')
|
|
||||||
|
|
||||||
log.info(f"Container environment detected")
|
log.info(f"Container environment detected")
|
||||||
log.info(f"Chrome binary: {chrome_binary}")
|
log.info(f"Chrome binary: {chrome_binary}")
|
||||||
log.info(f"ChromeDriver path: {chromedriver_path}")
|
|
||||||
|
|
||||||
|
# Create driver with custom binary location for containers
|
||||||
if chrome_binary and os.path.exists(chrome_binary):
|
if chrome_binary and os.path.exists(chrome_binary):
|
||||||
log.info(f"Using Chrome binary from environment: {chrome_binary}")
|
try:
|
||||||
opts.binary_location = chrome_binary
|
driver = Driver(
|
||||||
|
uc=True,
|
||||||
try:
|
headless=headless,
|
||||||
# Try creating Chrome driver with undetected_chromedriver
|
binary_location=chrome_binary,
|
||||||
log.info("Attempting to create undetected_chromedriver instance")
|
page_load_strategy="normal"
|
||||||
driver = uc.Chrome(options=opts)
|
)
|
||||||
log.info("Successfully created undetected_chromedriver instance")
|
log.info("Successfully created SeleniumBase UC driver with custom binary")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
# Fall back to regular Selenium if undetected_chromedriver fails
|
log.warning(f"Failed to create driver with custom binary: {e}")
|
||||||
log.warning(f"Failed to create undetected_chromedriver instance: {e}")
|
# Fall back to default
|
||||||
log.info("Falling back to regular Selenium Chrome")
|
driver = Driver(
|
||||||
|
uc=True,
|
||||||
# Import Selenium webdriver here to avoid potential import issues
|
headless=headless,
|
||||||
from selenium import webdriver
|
page_load_strategy="normal"
|
||||||
from selenium.webdriver.chrome.service import Service
|
)
|
||||||
|
log.info("Successfully created SeleniumBase UC driver with defaults")
|
||||||
if chromedriver_path and os.path.exists(chromedriver_path):
|
else:
|
||||||
log.info(f"Using ChromeDriver from path: {chromedriver_path}")
|
driver = Driver(
|
||||||
service = Service(executable_path=chromedriver_path)
|
uc=True,
|
||||||
driver = webdriver.Chrome(service=service, options=opts)
|
headless=headless,
|
||||||
else:
|
page_load_strategy="normal"
|
||||||
log.info("Using default ChromeDriver")
|
)
|
||||||
driver = webdriver.Chrome(options=opts)
|
log.info("Successfully created SeleniumBase UC driver")
|
||||||
else:
|
else:
|
||||||
# On regular OS, use default undetected_chromedriver
|
# Regular OS environment - SeleniumBase handles version matching automatically
|
||||||
log.info("Using standard undetected_chromedriver setup")
|
log.info("Creating SeleniumBase UC Mode driver")
|
||||||
driver = uc.Chrome(options=opts)
|
try:
|
||||||
|
driver = Driver(
|
||||||
|
uc=True,
|
||||||
|
headless=headless,
|
||||||
|
page_load_strategy="normal",
|
||||||
|
incognito=True # Use incognito mode for better stealth
|
||||||
|
)
|
||||||
|
log.info("Successfully created SeleniumBase UC driver")
|
||||||
|
except Exception as e:
|
||||||
|
log.error(f"Failed to create SeleniumBase driver: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
# Set page load timeout to avoid hanging
|
# Set page load timeout to avoid hanging
|
||||||
driver.set_page_load_timeout(30)
|
driver.set_page_load_timeout(30)
|
||||||
log.info("Chrome driver setup completed successfully")
|
|
||||||
|
# Set window size
|
||||||
|
driver.set_window_size(1400, 900)
|
||||||
|
|
||||||
|
# Add additional stealth settings
|
||||||
|
try:
|
||||||
|
# Disable automation flags
|
||||||
|
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
|
||||||
|
'source': '''
|
||||||
|
Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
|
||||||
|
Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3, 4, 5]});
|
||||||
|
Object.defineProperty(navigator, 'languages', {get: () => ['en-US', 'en']});
|
||||||
|
'''
|
||||||
|
})
|
||||||
|
log.info("Additional stealth settings applied")
|
||||||
|
except Exception as e:
|
||||||
|
log.debug(f"Could not apply additional stealth settings: {e}")
|
||||||
|
|
||||||
|
log.info("SeleniumBase UC driver setup completed successfully")
|
||||||
return driver
|
return driver
|
||||||
|
|
||||||
def dismiss_cookies(self, driver: Chrome):
|
def dismiss_cookies(self, driver: Chrome):
|
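Review note: the container branch above tries a custom Chrome binary first and falls back to the default setup when that fails, while the regular-OS branch logs and re-raises. A minimal sketch of that control flow follows, with the driver factories injected so it runs without a browser. `create_with_fallback` and the `(label, factory)` pairs are hypothetical names for illustration, not part of the scraper or of SeleniumBase.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple, TypeVar

log = logging.getLogger("driver_setup_sketch")
T = TypeVar("T")


def create_with_fallback(factories: Sequence[Tuple[str, Callable[[], T]]]) -> T:
    """Try each (label, factory) pair in order and return the first result.

    If every factory fails, the last error is re-raised, similar to the
    `raise` in the regular-OS branch of the hunk.
    """
    if not factories:
        raise ValueError("at least one factory is required")
    last_error: Optional[Exception] = None
    for label, factory in factories:
        try:
            driver = factory()
            log.info("Created driver via %s", label)
            return driver
        except Exception as exc:  # broad on purpose: start-up errors vary widely
            log.warning("Driver creation via %s failed: %s", label, exc)
            last_error = exc
    assert last_error is not None
    raise last_error
```

In the scraper, the two factories would presumably be lambdas wrapping the `Driver(uc=True, headless=headless, binary_location=chrome_binary, ...)` call and the plain `Driver(uc=True, headless=headless, ...)` call shown in the hunk.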
||||||
@@ -471,9 +487,11 @@ class GoogleReviewsScraper:
|
|||||||
parts = current_url.split('/place/')
|
parts = current_url.split('/place/')
|
||||||
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews?hl={lang_code}"
|
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews?hl={lang_code}"
|
||||||
driver.get(new_url)
|
driver.get(new_url)
|
||||||
time.sleep(2)
|
time.sleep(3) # Increased wait time for page load
|
||||||
if "review" in driver.current_url.lower():
|
if "review" in driver.current_url.lower():
|
||||||
log.info("Navigated directly to reviews page via URL")
|
log.info("Navigated directly to reviews page via URL")
|
||||||
|
# Extra wait for reviews to render after URL navigation
|
||||||
|
time.sleep(2)
|
||||||
return True
|
return True
|
||||||
|
|
||||||
# Try to identify reviews link in URL
|
# Try to identify reviews link in URL
|
||||||
@@ -481,9 +499,11 @@ class GoogleReviewsScraper:
|
|||||||
parts = current_url.split('/place/')
|
parts = current_url.split('/place/')
|
||||||
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews"
|
new_url = f"{parts[0]}/place/{parts[1].split('/')[0]}/reviews"
|
||||||
driver.get(new_url)
|
driver.get(new_url)
|
||||||
time.sleep(2)
|
time.sleep(3) # Increased wait time for page load
|
||||||
if "review" in driver.current_url.lower():
|
if "review" in driver.current_url.lower():
|
||||||
log.info("Navigated directly to reviews page via URL")
|
log.info("Navigated directly to reviews page via URL")
|
||||||
|
# Extra wait for reviews to render after URL navigation
|
||||||
|
time.sleep(2)
|
||||||
return True
|
return True
|
||||||
except Exception as url_error:
|
except Exception as url_error:
|
||||||
log.warning(f"Failed to navigate to reviews via URL: {url_error}")
|
log.warning(f"Failed to navigate to reviews via URL: {url_error}")
|
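Review note: both hunks above build the direct reviews URL the same way, once with `?hl={lang_code}` and once without. The sketch below factors that string handling into one function. `build_reviews_url` is a hypothetical helper, and the `None` return for non-place URLs is an added guard (the original relies on the surrounding `try`/`except` instead).

```python
from typing import Optional


def build_reviews_url(current_url: str, lang_code: Optional[str] = None) -> Optional[str]:
    """Return the direct /reviews URL for a Google Maps place URL, or None."""
    parts = current_url.split("/place/")
    if len(parts) < 2 or not parts[1]:
        return None  # not a place URL; the caller can fall back to clicking the tab
    place_slug = parts[1].split("/")[0]  # keep only the place name segment
    url = f"{parts[0]}/place/{place_slug}/reviews"
    if lang_code:
        url += f"?hl={lang_code}"
    return url
```

Calling it with a place URL and `lang_code="en"` produces the same string as the first hunk; omitting `lang_code` matches the second.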
||||||
@@ -831,34 +851,37 @@ class GoogleReviewsScraper:
|
|||||||
target_item = None
|
target_item = None
|
||||||
matched_text = None
|
matched_text = None
|
||||||
|
|
||||||
# 1. First try direct text matching
|
# Log all available menu items for debugging
|
||||||
wanted_labels = SORT_OPTIONS.get(method, [])
|
log.info(f"Available menu items: {[text for _, text in visible_items]}")
|
||||||
|
|
||||||
for item, text in visible_items:
|
# Use position-based selection (most reliable for Google Maps)
|
||||||
|
position_map = {
|
||||||
|
"relevance": 0, # Usually the first option
|
||||||
|
"newest": 1, # Usually the second option
|
||||||
|
"highest": 2, # Usually the third option
|
||||||
|
"lowest": 3 # Usually the fourth option
|
||||||
|
}
|
||||||
|
|
||||||
|
pos = position_map.get(method, -1)
|
||||||
|
if 0 <= pos < len(visible_items):
|
||||||
|
target_item, matched_text = visible_items[pos]
|
||||||
|
log.info(f"Selected menu item at position {pos + 1}: '{matched_text}' for sort method '{method}'")
|
||||||
|
|
||||||
|
# Validate the selection makes sense
|
||||||
|
wanted_labels = SORT_OPTIONS.get(method, [])
|
||||||
|
text_clean = matched_text.lower()
|
||||||
|
|
||||||
|
# Check if selected text contains any of the expected keywords
|
||||||
|
valid_selection = False
|
||||||
for label in wanted_labels:
|
for label in wanted_labels:
|
||||||
if (label in text or text in label or
|
if label.lower() in text_clean or text_clean in label.lower():
|
||||||
(len(text) > 0 and len(label) > 0 and
|
valid_selection = True
|
||||||
text.lower().startswith(label.lower()[:3]))):
|
|
||||||
target_item = item
|
|
||||||
matched_text = text
|
|
||||||
log.info(f"Found matching menu item: '{text}' for '{label}'")
|
|
||||||
break
|
break
|
||||||
if target_item:
|
|
||||||
break
|
|
||||||
|
|
||||||
# 2. If no match found, try position-based selection
|
if not valid_selection:
|
||||||
if not target_item and visible_items:
|
log.warning(f"Selected '{matched_text}' doesn't match expected '{method}' - might be wrong sort!")
|
||||||
position_map = {
|
else:
|
||||||
"relevance": 0, # Usually the first option
|
log.warning(f"No menu position available for sort method '{method}' (menu has {len(visible_items)} items)")
|
||||||
"newest": 1, # Usually the second option
|
|
||||||
"highest": 2, # Usually the third option
|
|
||||||
"lowest": 3 # Usually the fourth option
|
|
||||||
}
|
|
||||||
|
|
||||||
pos = position_map.get(method, -1)
|
|
||||||
if pos >= 0 and pos < len(visible_items):
|
|
||||||
target_item, matched_text = visible_items[pos]
|
|
||||||
log.info(f"Using position-based selection (position {pos}) for '{method}'")
|
|
||||||
|
|
||||||
# 3. If target found, click it
|
# 3. If target found, click it
|
||||||
if target_item:
|
if target_item:
|
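Review note: this hunk flips the order of the two strategies. Position-based selection now runs first, and the label match only validates the result. A minimal sketch of that logic as a pure function is below. `pick_sort_item` is a hypothetical name, the labels passed as `sort_options` stand in for the real `SORT_OPTIONS` (whose contents are not shown in this diff), and the empty-label guard is an addition, since an empty string would otherwise match every label.

```python
from typing import Dict, List, Optional, Sequence, Tuple, TypeVar

Item = TypeVar("Item")

# Same positions as the position_map in the hunk.
POSITION_MAP: Dict[str, int] = {"relevance": 0, "newest": 1, "highest": 2, "lowest": 3}


def pick_sort_item(
    visible_items: Sequence[Tuple[Item, str]],
    method: str,
    sort_options: Dict[str, List[str]],
) -> Tuple[Optional[Item], Optional[str], bool]:
    """Select a menu item by position, then check its label.

    Returns (item, text, valid). item is None when the position is missing.
    """
    pos = POSITION_MAP.get(method, -1)
    if not 0 <= pos < len(visible_items):
        return None, None, False
    item, text = visible_items[pos]
    text_clean = text.strip().lower()
    # Added guard: "" is a substring of every label, so reject empty text.
    valid = bool(text_clean) and any(
        label.lower() in text_clean or text_clean in label.lower()
        for label in sort_options.get(method, [])
    )
    return item, text, valid
```

A caller can still click the item when `valid` is false, as the hunk does, but the flag makes it easy to log or abort when the menu order differs from the assumed one.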
||||||
@@ -1108,16 +1131,55 @@ class GoogleReviewsScraper:
|
|||||||
|
|
||||||
self.dismiss_cookies(driver)
|
self.dismiss_cookies(driver)
|
||||||
self.click_reviews_tab(driver)
|
self.click_reviews_tab(driver)
|
||||||
self.set_sort(driver, sort_by)
|
|
||||||
|
|
||||||
# Add a wait after setting sort to allow results to load
|
# Extra wait after clicking reviews tab to ensure page loads
|
||||||
time.sleep(1)
|
log.info("Waiting for reviews page to fully load...")
|
||||||
|
time.sleep(3)
|
||||||
|
|
||||||
|
# Wait for page to be fully interactive
|
||||||
|
try:
|
||||||
|
wait.until(lambda d: d.execute_script("return document.readyState") == "complete")
|
||||||
|
log.info("Page DOM is ready")
|
||||||
|
except Exception:
|
||||||
|
log.debug("Could not verify page ready state")
|
||||||
|
|
||||||
|
# Verify we're on a reviews page before proceeding
|
||||||
|
if "review" not in driver.current_url.lower():
|
||||||
|
log.warning("URL doesn't contain 'review' - might not be on reviews page")
|
||||||
|
|
||||||
|
# Try to set sort - but don't fail if it doesn't work
|
||||||
|
try:
|
||||||
|
self.set_sort(driver, sort_by)
|
||||||
|
except Exception as sort_error:
|
||||||
|
log.warning(f"Sort failed but continuing: {sort_error}")
|
||||||
|
|
||||||
|
# Add a longer wait after setting sort to allow results to load
|
||||||
|
log.info("Waiting for reviews to render...")
|
||||||
|
time.sleep(3)
|
||||||
|
|
||||||
# Use try-except to handle cases where the pane is not found
|
# Use try-except to handle cases where the pane is not found
|
||||||
try:
|
# Try multiple selectors for the reviews pane
|
||||||
pane = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, PANE_SEL)))
|
pane = None
|
||||||
except TimeoutException:
|
pane_selectors = [
|
||||||
log.warning("Could not find reviews pane. Page structure might have changed.")
|
PANE_SEL, # Primary selector
|
||||||
|
'div[role="main"] div.m6QErb', # Simplified version
|
||||||
|
'div.m6QErb.DxyBCb', # Even more simplified
|
||||||
|
'div[role="main"]' # Most generic
|
||||||
|
]
|
||||||
|
|
||||||
|
for selector in pane_selectors:
|
||||||
|
try:
|
||||||
|
log.info(f"Trying to find reviews pane with selector: {selector}")
|
||||||
|
pane = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, selector)))
|
||||||
|
if pane:
|
||||||
|
log.info(f"Found reviews pane with selector: {selector}")
|
||||||
|
break
|
||||||
|
except TimeoutException:
|
||||||
|
log.debug(f"Pane not found with selector: {selector}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
if not pane:
|
||||||
|
log.warning("Could not find reviews pane with any selector. Page structure might have changed.")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
pbar = tqdm(desc="Scraped", ncols=80, initial=len(seen))
|
pbar = tqdm(desc="Scraped", ncols=80, initial=len(seen))
|
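Review note: the pane lookup above walks a list of selectors from most to least specific and keeps the first hit. The sketch below shows the same loop with the locator injected, returning the matched selector as well so the caller can tell when a very generic fallback such as `div[role="main"]` was used. `find_first_pane`, `locate`, and `LocateTimeout` are hypothetical names; `LocateTimeout` only stands in for Selenium's `TimeoutException` so the example runs without a browser.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

E = TypeVar("E")


class LocateTimeout(Exception):
    """Stand-in for selenium's TimeoutException in this sketch."""


def find_first_pane(
    locate: Callable[[str], E], selectors: Iterable[str]
) -> Tuple[Optional[str], Optional[E]]:
    """Return (selector, element) for the first selector that resolves.

    Returns (None, None) when every selector times out.
    """
    for selector in selectors:
        try:
            return selector, locate(selector)
        except LocateTimeout:
            continue  # try the next, more generic selector
    return None, None
```

Note that each failed selector costs a full wait timeout, so with four selectors the worst case is roughly four times the timeout configured on `wait`.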
||||||
@@ -1132,8 +1194,12 @@ class GoogleReviewsScraper:
|
|||||||
log.warning(f"Error setting up scroll script: {e}")
|
log.warning(f"Error setting up scroll script: {e}")
|
||||||
scroll_script = "window.scrollBy(0, 300);" # Fallback to simple scrolling
|
scroll_script = "window.scrollBy(0, 300);" # Fallback to simple scrolling
|
||||||
|
|
||||||
max_attempts = 10 # Limit the number of attempts to find reviews
|
max_attempts = 50 # Increased from 10 to 50 for very patient scrolling
|
||||||
attempts = 0
|
attempts = 0
|
||||||
|
max_idle = 15 # Increased from 3 to 15 - much more patience for lazy-loaded reviews
|
||||||
|
consecutive_no_cards = 0 # Track how many times we find zero cards
|
||||||
|
last_scroll_position = 0
|
||||||
|
scroll_stuck_count = 0
|
||||||
|
|
||||||
while attempts < max_attempts:
|
while attempts < max_attempts:
|
||||||
try:
|
try:
|
||||||
@@ -1142,12 +1208,23 @@ class GoogleReviewsScraper:
|
|||||||
|
|
||||||
# Check for valid cards
|
# Check for valid cards
|
||||||
if len(cards) == 0:
|
if len(cards) == 0:
|
||||||
log.debug("No review cards found in this iteration")
|
consecutive_no_cards += 1
|
||||||
|
log.info(f"No review cards found in this iteration (consecutive: {consecutive_no_cards})")
|
||||||
|
|
||||||
|
# If we keep finding no cards, might have hit the end
|
||||||
|
if consecutive_no_cards > 5:
|
||||||
|
log.warning("No cards found for more than 5 consecutive iterations - might be at end of reviews")
|
||||||
|
break
|
||||||
|
|
||||||
attempts += 1
|
attempts += 1
|
||||||
# Try scrolling anyway
|
# Try aggressive scrolling
|
||||||
driver.execute_script(scroll_script)
|
driver.execute_script(scroll_script)
|
||||||
time.sleep(1)
|
time.sleep(1)
|
||||||
|
driver.execute_script("window.scrollBy(0, 1000);") # Extra scroll
|
||||||
|
time.sleep(1.5)
|
||||||
continue
|
continue
|
||||||
|
else:
|
||||||
|
consecutive_no_cards = 0 # Reset counter when we find cards
|
||||||
|
|
||||||
for c in cards:
|
for c in cards:
|
||||||
try:
|
try:
|
||||||
@@ -1186,12 +1263,48 @@ class GoogleReviewsScraper:
|
|||||||
idle = 0
|
idle = 0
|
||||||
attempts = 0 # Reset attempts counter when we successfully process a review
|
attempts = 0 # Reset attempts counter when we successfully process a review
|
||||||
|
|
||||||
if idle >= 3:
|
if idle >= max_idle:
|
||||||
|
log.info(f"Stopping: No new reviews found after {max_idle} scroll attempts")
|
||||||
break
|
break
|
||||||
|
|
||||||
if not fresh_cards:
|
if not fresh_cards:
|
||||||
idle += 1
|
idle += 1
|
||||||
attempts += 1
|
attempts += 1
|
||||||
|
log.info(f"No new reviews in this iteration (idle: {idle}/{max_idle}, attempts: {attempts}/{max_attempts}, total seen: {len(seen)})")
|
||||||
|
|
||||||
|
# When no new reviews, scroll more aggressively
|
||||||
|
try:
|
||||||
|
# Try multiple scroll methods
|
||||||
|
driver.execute_script(scroll_script)
|
||||||
|
time.sleep(0.5)
|
||||||
|
driver.execute_script("window.scrollBy(0, 500);") # Extra scroll
|
||||||
|
time.sleep(0.5)
|
||||||
|
except Exception as e:
|
||||||
|
log.warning(f"Error scrolling: {e}")
|
||||||
|
else:
|
||||||
|
log.info(f"Found {len(fresh_cards)} new reviews in this iteration")
|
||||||
|
|
||||||
|
# Check if we're actually scrolling or stuck
|
||||||
|
try:
|
||||||
|
current_scroll = driver.execute_script("return arguments[0].scrollTop;", pane)
|
||||||
|
if current_scroll == last_scroll_position and len(fresh_cards) == 0:
|
||||||
|
scroll_stuck_count += 1
|
||||||
|
log.warning(f"Scroll position hasn't changed (stuck at {current_scroll}px, stuck count: {scroll_stuck_count})")
|
||||||
|
|
||||||
|
if scroll_stuck_count > 5:
|
||||||
|
log.warning("Scroll is stuck - trying alternative scroll method")
|
||||||
|
# Scroll the pane's last child into view to force loading
|
||||||
|
try:
|
||||||
|
driver.execute_script("arguments[0].lastElementChild.scrollIntoView();", pane)
|
||||||
|
time.sleep(2)
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
scroll_stuck_count = 0
|
||||||
|
else:
|
||||||
|
scroll_stuck_count = 0
|
||||||
|
last_scroll_position = current_scroll
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
# Use JavaScript for smoother scrolling
|
# Use JavaScript for smoother scrolling
|
||||||
try:
|
try:
|
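Review note: the stuck-scroll detection in this hunk is a small state machine. The counter goes up only when the scroll position is unchanged and no new cards arrived, and it resets either on progress or after the alternative scroll fires. A minimal sketch that isolates that bookkeeping from the Selenium calls is below. `ScrollStuckTracker` is a hypothetical name; the threshold of 5 is taken from the hunk.

```python
from dataclasses import dataclass


@dataclass
class ScrollStuckTracker:
    threshold: int = 5
    last_position: float = 0
    stuck_count: int = 0

    def update(self, position: float, fresh_count: int) -> bool:
        """Record one loop iteration.

        Returns True when the alternative scroll method should run.
        """
        if position == self.last_position and fresh_count == 0:
            self.stuck_count += 1
        else:
            self.stuck_count = 0  # any movement or new card counts as progress
        self.last_position = position
        if self.stuck_count > self.threshold:
            self.stuck_count = 0  # reset after triggering, as in the hunk
            return True
        return False
```

In the loop, `position` would be the `scrollTop` value read from the pane and `fresh_count` would be `len(fresh_cards)`; a `True` result corresponds to the `lastElementChild.scrollIntoView()` call.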
||||||
@@ -1201,8 +1314,13 @@ class GoogleReviewsScraper:
|
|||||||
# Try a simpler scroll method
|
# Try a simpler scroll method
|
||||||
driver.execute_script("window.scrollBy(0, 300);")
|
driver.execute_script("window.scrollBy(0, 300);")
|
||||||
|
|
||||||
# Dynamic sleep: sleep less when processing many reviews
|
# Dynamic sleep: sleep less when processing many reviews, more when finding none
|
||||||
sleep_time = 0.7 if len(fresh_cards) > 5 else 1.0
|
if len(fresh_cards) > 5:
|
||||||
|
sleep_time = 0.7
|
||||||
|
elif len(fresh_cards) == 0:
|
||||||
|
sleep_time = 2.0 # Wait longer when finding nothing (let page load)
|
||||||
|
else:
|
||||||
|
sleep_time = 1.0
|
||||||
time.sleep(sleep_time)
|
time.sleep(sleep_time)
|
||||||
|
|
||||||
except StaleElementReferenceException:
|
except StaleElementReferenceException:
|
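Review note: the dynamic sleep now has three tiers, and together with `max_idle = 15` it sets how long the scraper keeps going once reviews stop arriving. The sketch below restates the tiers as a function and works out a rough lower bound for that tail. `pick_sleep_time` and `min_idle_tail_seconds` are hypothetical names, and the estimate counts only the `time.sleep` calls visible in these hunks (0.5 s twice in the idle branch plus the 2.0 s dynamic sleep), ignoring script execution time and the occasional 2 s stuck-scroll wait.

```python
def pick_sleep_time(fresh_count: int) -> float:
    """Sleep tiers from the hunk: short when busy, long when nothing was found."""
    if fresh_count > 5:
        return 0.7
    if fresh_count == 0:
        return 2.0  # give lazy-loaded reviews time to appear
    return 1.0


def min_idle_tail_seconds(max_idle: int) -> float:
    """Lower bound on time spent in consecutive idle iterations before stopping."""
    per_iteration = 0.5 + 0.5 + pick_sleep_time(0)  # two extra-scroll sleeps + dynamic sleep
    return max_idle * per_iteration
```

Under these assumptions, raising `max_idle` from 3 to 15 moves the minimum tail from about 9 seconds to about 45 seconds per scrape, which is the cost of the extra patience.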
||||||
|
|||||||
@@ -1,17 +1,8 @@
|
|||||||
requests==2.32.3
|
seleniumbase>=4.34.9
|
||||||
beautifulsoup4==4.12.3
|
|
||||||
aiohttp==3.11.11
|
|
||||||
googletrans==4.0.2
|
googletrans==4.0.2
|
||||||
selenium==4.15.2
|
tqdm>=4.66.3
|
||||||
undetected-chromedriver==3.5.4
|
|
||||||
tqdm==4.66.3
|
|
||||||
pymongo==4.12.0
|
pymongo==4.12.0
|
||||||
pyyaml==6.0.1
|
|
||||||
certifi==2024.7.4
|
|
||||||
webdriver-manager==4.0.2
|
|
||||||
setuptools==79.0.1
|
|
||||||
boto3==1.35.1
|
boto3==1.35.1
|
||||||
pytest==7.4.3
|
|
||||||
fastapi==0.104.1
|
fastapi==0.104.1
|
||||||
uvicorn==0.24.0
|
uvicorn==0.24.0
|
||||||
botocore~=1.35.99
|
botocore~=1.35.99
|
||||||
|
|||||||
110
tests/test_seleniumbase_integration.py
Normal file
@@ -0,0 +1,110 @@
|
|||||||
|
"""
|
||||||
|
Tests for SeleniumBase UC Mode integration.
|
||||||
|
Verifies that the driver setup works correctly with the new library.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from modules.scraper import GoogleReviewsScraper
|
||||||
|
|
||||||
|
|
||||||
|
def test_seleniumbase_driver_creation():
|
||||||
|
"""Test that SeleniumBase driver can be created successfully"""
|
||||||
|
config = {
|
||||||
|
"url": "https://maps.app.goo.gl/test",
|
||||||
|
"headless": True,
|
||||||
|
"use_mongodb": False,
|
||||||
|
"backup_to_json": False
|
||||||
|
}
|
||||||
|
|
||||||
|
scraper = GoogleReviewsScraper(config)
|
||||||
|
|
||||||
|
# Test driver creation
|
||||||
|
driver = None
|
||||||
|
try:
|
||||||
|
driver = scraper.setup_driver(headless=True)
|
||||||
|
assert driver is not None
|
||||||
|
assert driver.name == "chrome"
|
||||||
|
|
||||||
|
# Verify driver can navigate
|
||||||
|
driver.get("https://www.google.com")
|
||||||
|
assert "google" in driver.current_url.lower()
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if driver:
|
||||||
|
driver.quit()
|
||||||
|
|
||||||
|
|
||||||
|
def test_seleniumbase_driver_headless_mode():
|
||||||
|
"""Test that headless mode works correctly"""
|
||||||
|
config = {
|
||||||
|
"url": "https://maps.app.goo.gl/test",
|
||||||
|
"headless": True,
|
||||||
|
"use_mongodb": False,
|
||||||
|
"backup_to_json": False
|
||||||
|
}
|
||||||
|
|
||||||
|
scraper = GoogleReviewsScraper(config)
|
||||||
|
driver = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
driver = scraper.setup_driver(headless=True)
|
||||||
|
assert driver is not None
|
||||||
|
|
||||||
|
# In headless mode, window size should still be set
|
||||||
|
size = driver.get_window_size()
|
||||||
|
assert size['width'] == 1400
|
||||||
|
assert size['height'] == 900
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if driver:
|
||||||
|
driver.quit()
|
||||||
|
|
||||||
|
|
||||||
|
def test_seleniumbase_driver_nonheadless_mode():
|
||||||
|
"""Test that non-headless mode works correctly"""
|
||||||
|
config = {
|
||||||
|
"url": "https://maps.app.goo.gl/test",
|
||||||
|
"headless": False,
|
||||||
|
"use_mongodb": False,
|
||||||
|
"backup_to_json": False
|
||||||
|
}
|
||||||
|
|
||||||
|
scraper = GoogleReviewsScraper(config)
|
||||||
|
driver = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
driver = scraper.setup_driver(headless=False)
|
||||||
|
assert driver is not None
|
||||||
|
assert driver.name == "chrome"
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if driver:
|
||||||
|
driver.quit()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip(reason="Integration test - requires network access")
|
||||||
|
def test_seleniumbase_google_maps_access():
|
||||||
|
"""Test that driver can access Google Maps (integration test)"""
|
||||||
|
config = {
|
||||||
|
"url": "https://maps.app.goo.gl/6tkNMDjcj3SS6LJe9",
|
||||||
|
"headless": True,
|
||||||
|
"use_mongodb": False,
|
||||||
|
"backup_to_json": False
|
||||||
|
}
|
||||||
|
|
||||||
|
scraper = GoogleReviewsScraper(config)
|
||||||
|
driver = None
|
||||||
|
|
||||||
|
try:
|
||||||
|
driver = scraper.setup_driver(headless=True)
|
||||||
|
driver.get(config["url"])
|
||||||
|
|
||||||
|
# Wait for redirect to Google Maps
|
||||||
|
import time
|
||||||
|
time.sleep(3)
|
||||||
|
|
||||||
|
assert "google.com/maps" in driver.current_url
|
||||||
|
|
||||||
|
finally:
|
||||||
|
if driver:
|
||||||
|
driver.quit()
|
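Review note: the skipped integration test waits for the short-link redirect with a fixed `time.sleep(3)`, which is either too long or too short depending on the network. A minimal polling sketch that could replace it is below. `wait_until` is a hypothetical helper written with the standard library only; Selenium's own `WebDriverWait` would be the more usual choice inside the scraper itself.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout: float = 10.0, interval: float = 0.25
) -> bool:
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the test, the assertion would then read `assert wait_until(lambda: "google.com/maps" in driver.current_url)`, which returns as soon as the redirect lands.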
||||||