
Alejandro Gutiérrez 8b925ba965 Implement continuous scrolling with smart gap-based timeout

Major refactoring to achieve 100% review collection:

CONTINUOUS SCROLLING:
- Background thread scrolls NON-STOP at 5ms intervals (no gaps!)
- Main thread checks every 2s while scrolling continues
- Stops immediately when all reviews collected
- Solves the core problem: gaps between bursts caused Google to stop loading

SMART TIMEOUT:
- Gap-based: 3x average gap between review loads
- Initial timeout: 3x time since first load (or 15s default)
- Adaptive: evolves from conservative early timeout to smart gap-based
- Detailed logging shows timeout calculations

RESULTS:
- 100% completion (271/271) vs previous 91% (247/271)
- 3.5x faster (~17s vs 60s)
- Clean thread management with proper shutdown

REMOVED:
- All burst scrolling code (~100 lines)
- Scroll stuck detection (no longer needed)
- Dynamic sleep logic (replaced with continuous scrolling)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-19 01:39:47 +00:00

api_response_samples

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

docs

migrate to SeleniumBase UC Mode for automatic version management

2025-12-07 19:40:13 +07:00

examples

Added config example and sample output

2025-04-24 23:19:36 +07:00

modules

Implement continuous scrolling with smart gap-based timeout

2026-01-19 01:39:47 +00:00

tests

migrate to SeleniumBase UC Mode for automatic version management

2025-12-07 19:40:13 +07:00

web

Add robust structural pattern matching and early no-reviews detection

2026-01-18 19:52:39 +00:00

.env.example

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

.gitignore

migrate to SeleniumBase UC Mode for automatic version management

2025-12-07 19:40:13 +07:00

API_DOCUMENTATION.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

API_INTERCEPTOR_DEBUG_SUMMARY.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

API_OPTIMIZATION_SUMMARY.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

API_QUICKSTART.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

api_server_production.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

api_server.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

API_TEST_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

brute_force_selector.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

check_page_structure.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

CHROME_WORKER_POOLS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

CONCURRENT_JOBS_TEST_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

CONTAINERIZED_SOLUTION_SUMMARY.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

cookie_based_scraper.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

DATA_STRUCTURE_ANALYSIS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_business_card.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_check.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_detail_page.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_search_results.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_soho.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_tabs.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

debug_wait_for_results.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

DEPLOYMENT_GUIDE.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

diagnose_reviews_panel.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

diagnose_selectors.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

direct_api_scraper.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

DOCKER_CHROME_SETUP.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

docker-compose.production.yml

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

Dockerfile

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

dump_api_response.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

dump_api_responses.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

fast_api_scraper.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

FIELD_ANALYSIS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

FINAL_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

find_actual_reviews.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

GOOGLE_DATE_FORMAT_SPECIFICATION.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

header_capture_scraper.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

HEALTH_CHECKS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

hybrid_api_scraper.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

inspect_pane_content.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

LICENSE

Release Google Reviews Scraper Pro v1.0.0 (2025)

2025-04-24 22:12:07 +07:00

manual_inspect.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

MICROSERVICE_ARCHITECTURE.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

OPTIMIZATION_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

pane_not_found.png

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

PARALLEL_OPTIMIZATION_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

PHASE1_COMPLETE.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

QUICK_START_API_MODE.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

QUICKSTART.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

README.md

migrate to SeleniumBase UC Mode for automatic version management

2025-12-07 19:40:13 +07:00

requirements-production.txt

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

requirements.txt

migrate to SeleniumBase UC Mode for automatic version management

2025-12-07 19:40:13 +07:00

RESULTS_SUMMARY.txt

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

reverse_engineer_date_formatter_v2.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

reverse_engineer_date_formatter.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

SPEED_OPTIMIZATION_SUMMARY.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_api_244.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_complete.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_dom_only_fast.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_fast.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_fastest_stable.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_hybrid_parallel.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_optimized_hybrid.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_parallel_hybrid.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_parallel_v2.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_parallel.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_ultra_fast_complete.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_ultra_fast_v2.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start_ultra_fast.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

start.py

Add API interception for hybrid scraping and update selectors

2026-01-17 21:51:10 +00:00

STORAGE_COMPARISON.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

terms-of-usage.md

Release Google Reviews Scraper Pro v1.0.0 (2025)

2025-04-24 22:12:07 +07:00

test_api_quick.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_concurrent_jobs.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_debug_extraction.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_docker_chrome.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_english_dates_simple.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_english_dates.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_extract_app_state.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_fast_api.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_lithuanian_hospital.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

test_phase1.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_soho_vilna.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_user_selector.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_validation_local.py

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

test_without_english.py

Fix: Add early no-reviews detection and hide analytics for empty jobs

2026-01-18 20:14:04 +00:00

TESTING_INTERFACE.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

ULTIMATE_RESULTS.md

Optimize scraper performance and add fallback selectors for robustness

2026-01-18 19:49:24 +00:00

README.md

🔥 Google Reviews Scraper Pro (2025) 🔥

FINALLY! A scraper that ACTUALLY WORKS in 2025! While others break with every Google update, this bad boy keeps on trucking. Say goodbye to the frustration of constantly broken scrapers and hello to a beast that rips through Google's defenses like a hot knife through butter. This battle-tested, rock-solid solution will extract every juicy detail from Google reviews while laughing in the face of rate limiting.

🌟 Feature Artillery

Bulletproof in 2025: While the competition falls apart, we've cracked Google's latest tricks
Enhanced SeleniumBase UC Mode: Superior anti-detection with automatic Chrome/ChromeDriver version matching - no more version headaches!
Polyglot Powerhouse: Devours reviews in a smorgasbord of languages - English, Hebrew, Thai, German, you name it!
MongoDB Mastery: Dumps pristine data structures straight into your MongoDB instance
Paranoid Backups: Mirrors everything to local JSON files because losing data sucks
Aggressive Image Capture:
- Snags EVERY damn photo from reviews and profiles
- Hoards local paths or swaps URLs to your domain like a boss
- Multi-threaded downloading that would make NASA jealous
- S3 Cloud Storage: Auto-upload images to AWS S3 with custom folder structure
REST API Server: Trigger scraping jobs via HTTP endpoints with background processing
Time-Bending Magic: Transforms Google's vague "2 weeks ago" garbage into precise ISO timestamps
Sort Any Damn Way: Newest, highest, lowest, relevance - we've got you covered
Metadata on Steroids: Inject custom parameters into every review record
Pick Up Where You Left Off: Resume scraping after crashes, because life happens
Ghost Mode: Run silently in headless mode, no browser window in sight
Battle-Hardened Resilience: Network hiccups? Google's tricks? HAH! We eat those for breakfast
Obsessive Logging: Every action documented in glorious detail for when things get weird
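
To make the "2 weeks ago" conversion concrete, here is a minimal sketch of the idea. It is not the project's actual parser: the function name `relative_to_iso` is hypothetical, it only handles English phrases of the form "N units ago" or "a/an unit ago", and it approximates a month as 30 days and a year as 365 days.

```python
import re
from datetime import datetime, timedelta, timezone

# Approximate length of each unit in seconds (months and years are rough by design).
_UNIT_SECONDS = {
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 7 * 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}

_PATTERN = re.compile(r"^(a|an|\d+)\s+(minute|hour|day|week|month|year)s?\s+ago$")


def relative_to_iso(text, now=None):
    """Convert an English relative date such as '2 weeks ago' to an ISO 8601 string.

    Returns None when the phrase does not match the assumed pattern.
    """
    now = now or datetime.now(timezone.utc)
    match = _PATTERN.match(text.strip().lower())
    if match is None:
        return None
    count = 1 if match.group(1) in ("a", "an") else int(match.group(1))
    delta = timedelta(seconds=count * _UNIT_SECONDS[match.group(2)])
    return (now - delta).isoformat()
```

Passing a fixed `now` keeps the result reproducible. Because the source phrase is vague, the timestamp is only as accurate as the phrase itself.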

📋 Battle Station Requirements

Python 3.10+ (don't even try with 3.9, seriously)
Chrome browser (the fresher the better)
MongoDB (optional, but c'mon, live a little)
AWS S3 Account (optional, for cloud image storage)
Coffee (mandatory for watching thousands of reviews roll in)

🚀 Deployment Instructions

Grab the source code:

git clone https://github.com/georgekhananaev/google-reviews-scraper-pro.git
cd google-reviews-scraper-pro

Arm your environment:

pip install -r requirements.txt
# Pro tip: Use a virtual env unless you enjoy dependency hell

Make sure this sucker works:

python start.py --help
# If this spits out options, you're golden. If not, check your Python path!

⚙️ Fine-Tuning Your Beast

Look, this isn't some one-size-fits-all garbage. You've got two ways to bend this tool to your will: the almighty config.yaml file or straight-up command-line arguments. When they clash, command-line is king (obviously).

Example `config.yaml`:

# Google Maps Reviews Scraper Configuration

# URL to scrape
url: "https://maps.app.goo.gl/6tkNMDjcj3SS6LJe9"

# Scraper settings
headless: true                # Run Chrome in headless mode
sort_by: "newest"             # Options: "newest", "highest", "lowest", "relevance"
stop_on_match: false          # Stop when first already-seen review is encountered
overwrite_existing: false     # Whether to overwrite existing reviews or append

# MongoDB settings
use_mongodb: true             # Whether to use MongoDB for storage
mongodb:
  uri: "mongodb://username:password@localhost:27017/"
  database: "reviews"
  collection: "google_reviews"

# JSON backup settings
backup_to_json: true          # Whether to backup data to JSON files
json_path: "google_reviews.json"
seen_ids_path: "google_reviews.ids"

# Data processing settings
convert_dates: true           # Convert string dates to MongoDB Date objects

# Image download settings
download_images: true         # Download images from reviews
image_dir: "review_images"    # Directory to store downloaded images
download_threads: 4           # Number of threads for downloading images
store_local_paths: true       # Whether to store local image paths in documents
max_width: 1200               # Maximum width for downloaded images (Google images)
max_height: 1200              # Maximum height for downloaded images (Google images)

# S3 settings (optional)
use_s3: false                 # Whether to upload images to S3
s3:
  aws_access_key_id: ""       # AWS Access Key ID
  aws_secret_access_key: ""   # AWS Secret Access Key
  region_name: "us-east-1"    # AWS region
  bucket_name: ""             # S3 bucket name
  prefix: "reviews/"          # Base prefix for uploaded files
  profiles_folder: "profiles/"    # Folder name for profile images within prefix
  reviews_folder: "reviews/"      # Folder name for review images within prefix
  delete_local_after_upload: false  # Delete local files after successful S3 upload
  s3_base_url: ""             # Custom S3 base URL for accessing files (if empty, uses AWS default)

# URL replacement settings
replace_urls: true           # Whether to replace original URLs with custom ones
custom_url_base: "https://yourdomain.com/images"  # Base URL for replacement
custom_url_profiles: "/profiles/"  # Path for profile images
custom_url_reviews: "/reviews/"    # Path for review images
preserve_original_urls: true  # Whether to preserve original URLs in original_* fields

# Custom parameters to add to each document
# These will be added statically to all documents
custom_params:
  company: "Your Business Name"
  source: "Google Maps"
  location: "Bangkok, Thailand"
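
The precedence rule (command-line beats config.yaml) can be pictured with a small sketch. This illustrates the rule under assumptions and is not the code in start.py: the function name `merge_settings` is hypothetical, and it assumes an option left unset on the command line arrives as `None`.

```python
def merge_settings(file_config, cli_args):
    """Return file_config overlaid with every CLI value that was actually set.

    Assumption: an unset CLI option is represented as None.
    """
    merged = dict(file_config)  # copy so the caller's dict is not mutated
    for key, value in cli_args.items():
        if value is not None:
            merged[key] = value
    return merged


file_config = {"headless": True, "sort_by": "newest", "use_mongodb": True}
cli_args = {"sort_by": "relevance", "use_mongodb": None}
settings = merge_settings(file_config, cli_args)
print(settings)  # {'headless': True, 'sort_by': 'relevance', 'use_mongodb': True}
```

Only `sort_by` changes here, because it is the only option given a value on the command line.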

🖥️ Unleashing Hell

Command Line Usage

python start.py --url "https://maps.app.goo.gl/YOUR_URL"
# Boom. That's it. Now go grab a coffee while the magic happens.

🚀 API Server Mode (NEW!)

Want to trigger scraping jobs via REST API? We've got you covered:

# Start the API server
python api_server.py
# Server runs on http://localhost:8000

API Endpoints:

Start a scraping job:

curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://maps.app.goo.gl/YOUR_URL",
    "headless": true,
    "sort_by": "newest",
    "download_images": true
  }'

Check job status:

curl "http://localhost:8000/jobs/{job_id}"

List all jobs:

curl "http://localhost:8000/jobs"

Get job statistics:

curl "http://localhost:8000/stats"

Interactive API docs available at: http://localhost:8000/docs
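
If you would rather drive the API from Python than curl, the sketch below uses only the standard library. The endpoint paths come from the examples above. Everything about the response body is an assumption: the `"status"` field and the `"completed"`/`"failed"` values are guesses, and the helper names are hypothetical, so check http://localhost:8000/docs for the real schema.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # default address from the section above


def build_scrape_request(maps_url, base_url=BASE_URL, **options):
    """Build (but do not send) the POST /scrape request shown in the curl example."""
    payload = {"url": maps_url, **options}
    return urllib.request.Request(
        f"{base_url}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_job(job_id, base_url=BASE_URL):
    """GET /jobs/{job_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as response:
        return json.load(response)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, max_checks=150,
                 done_states=("completed", "failed")):
    """Poll until the job reports a finished state or max_checks is exhausted.

    Assumption: the status payload has a "status" field with these values.
    """
    job = {}
    for _ in range(max_checks):
        job = fetch(job_id)
        if job.get("status") in done_states:
            break
        time.sleep(interval)
    return job
```

To send the request, pass the result of `build_scrape_request(...)` to `urllib.request.urlopen`. The field that carries the job ID is not documented here, so read it from the actual response.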

Battle-Tested Recipes

Stealth Mode + Fresh Stuff First:

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --headless --sort newest
# Perfect for a cron job. They'll never see you coming.

Incremental Grab (why waste CPU cycles?):

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --stop-on-match
# Once it hits a review it's seen before, it taps out. Efficiency, baby!
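
The idea behind --stop-on-match can be sketched in a few lines. This is a simplified illustration, not the scraper's own loop: the function name `collect_new_reviews` is hypothetical, and it assumes each review is a dict with the `review_id` field shown in the data payload section.

```python
def collect_new_reviews(reviews, seen_ids, stop_on_match=False):
    """Return reviews whose IDs are not in seen_ids, preserving order.

    With stop_on_match=True, iteration ends at the first already-seen ID,
    which only makes sense when reviews arrive newest first.
    """
    fresh = []
    for review in reviews:
        if review["review_id"] in seen_ids:
            if stop_on_match:
                break
            continue
        fresh.append(review)
    return fresh
```

This is why the flag pairs well with --sort newest: once a known review turns up, everything after it is older and has already been collected.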

JSON-Only Diet (MongoDB haters unite):

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --use-mongodb false
# For the "I just want a damn file" crowd.

Custom Tags Galore:

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --custom-params '{"company":"Hotel California","location":"Los Angeles"}'
# Brand these puppies however you want. Go nuts.

Image Hoarding Deluxe:

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --download-images true --replace-urls true --custom-url-base "https://yourdomain.com/images"
# Every. Single. Picture. With your domain stamped all over 'em.

S3 Cloud Storage Beast Mode:

python start.py --url "https://maps.app.goo.gl/YOUR_URL" --download-images true --use-s3 true
# Downloads locally AND uploads to S3. Best of both worlds, baby!

Command Line Arguments

usage: start.py [-h] [-q] [-s {newest,highest,lowest,relevance}] [--stop-on-match] [--url URL] [--overwrite] [--config CONFIG] [--use-mongodb USE_MONGODB]
                [--convert-dates CONVERT_DATES] [--download-images DOWNLOAD_IMAGES] [--image-dir IMAGE_DIR] [--download-threads DOWNLOAD_THREADS]
                [--store-local-paths STORE_LOCAL_PATHS] [--replace-urls REPLACE_URLS] [--custom-url-base CUSTOM_URL_BASE]
                [--custom-url-profiles CUSTOM_URL_PROFILES] [--custom-url-reviews CUSTOM_URL_REVIEWS] [--preserve-original-urls PRESERVE_ORIGINAL_URLS]
                [--custom-params CUSTOM_PARAMS]

Google‑Maps review scraper with MongoDB integration

options:
  -h, --help            show this help message and exit
  -q, --headless        run Chrome in the background
  -s {newest,highest,lowest,relevance}, --sort {newest,highest,lowest,relevance}
                        sorting order for reviews
  --stop-on-match       stop scrolling when first already‑seen id is met (useful with --sort newest)
  --url URL             custom Google Maps URL to scrape
  --overwrite           overwrite existing reviews instead of appending
  --config CONFIG       path to custom configuration file
  --use-mongodb USE_MONGODB
                        whether to use MongoDB for storage
  --convert-dates CONVERT_DATES
                        convert string dates to MongoDB Date objects
  --download-images DOWNLOAD_IMAGES
                        download images from reviews
  --image-dir IMAGE_DIR
                        directory to store downloaded images
  --download-threads DOWNLOAD_THREADS
                        number of threads for downloading images
  --store-local-paths STORE_LOCAL_PATHS
                        whether to store local image paths in documents
  --replace-urls REPLACE_URLS
                        whether to replace original URLs with custom ones
  --custom-url-base CUSTOM_URL_BASE
                        base URL for replacement
  --custom-url-profiles CUSTOM_URL_PROFILES
                        path for profile images
  --custom-url-reviews CUSTOM_URL_REVIEWS
                        path for review images
  --preserve-original-urls PRESERVE_ORIGINAL_URLS
                        whether to preserve original URLs in original_* fields
  --custom-params CUSTOM_PARAMS
                        JSON string with custom parameters to add to each document (e.g. '{"company":"Your Business"}')

📊 The Juicy Data Payload

Here's what you'll rip out of Google's clutches for each review (and yes, it's way more than their official API gives you):

{
  "review_id": "ChdDSUhNMG9nS0VJQ0FnSUNVck95dDlBRRAB",
  "author": "John Smith",
  "rating": 4.0,
  "description": {
    "en": "Great place, loved the service. Will definitely come back!",
    "th": "สถานที่ที่ยอดเยี่ยม บริการดีมาก จะกลับมาอีกแน่นอน!"
    // Multilingual gold mine - ALL languages preserved!
  },
  "likes": 3, // Yes, we even grab those useless "likes" numbers
  "user_images": [
    "https://lh5.googleusercontent.com/p/AF1QipOj-3H8...",
    "https://lh5.googleusercontent.com/p/AF1QipM2xG8..."
    // ALL review images - not just the first one like inferior scrapers
  ],
  "author_profile_url": "https://www.google.com/maps/contrib/112419862785748982094",
  "profile_picture": "https://lh3.googleusercontent.com/a-/ALV-UjXtxT...", // Stalk much?
  "owner_responses": {
    "en": {
      "text": "Thank you for your kind words! We look forward to seeing you again."
      // Yes, even those canned replies from the business owner
    }
  },
  "created_date": "2025-04-22T14:30:45.123456+00:00", // When we first grabbed it
  "last_modified_date": "2025-04-22T14:30:45.123456+00:00", // Last update
  "review_date": "2025-04-15T08:15:22+00:00", // When they posted
  "company": "Your Business Name", // Your custom metadata
  "source": "Google Maps",
  "location": "Bangkok, Thailand" 
  // Add whatever other fields you want - this baby is extensible
}

📁 Output Files

When running with default settings, the scraper creates:

google_reviews.json - Contains all extracted reviews
google_reviews.ids - A list of already processed review IDs
review_images/ - Directory containing downloaded images:
- review_images/profiles/ - Profile pictures
- review_images/reviews/ - Review images
S3 Bucket (when enabled) - Images uploaded to your configured S3 bucket with custom folder structure

🔄 Integration Examples

Import to MongoDB Compass

The JSON output is fully compatible with MongoDB Compass import:

Open MongoDB Compass
Navigate to your database and collection
Click "Add Data" → "Import File"
Select your google_reviews.json file
Select JSON format and import

Process Reviews with Python

import json

# Load reviews
with open('google_reviews.json', 'r', encoding='utf-8') as f:
    reviews = json.load(f)

# Calculate average rating (skip unrated reviews, guard against an empty file)
ratings = [review['rating'] for review in reviews if review.get('rating') is not None]
if ratings:
    avg_rating = sum(ratings) / len(ratings)
    print(f"Average rating: {avg_rating:.2f}")
else:
    print("No rated reviews found")

# Filter reviews by language (a missing description counts as no languages)
english_reviews = [r for r in reviews if 'en' in (r.get('description') or {})]
print(f"English reviews: {len(english_reviews)}")

# Find reviews with images
reviews_with_images = [r for r in reviews if r.get('user_images')]
print(f"Reviews with images: {len(reviews_with_images)}")

🛠️ When Shit Hits The Fan

DEFCON Scenarios & Quick Fixes

Chrome/Driver Having a Lovers' Quarrel
- Good news! SeleniumBase handles Chrome/ChromeDriver version matching automatically
- Update Chrome browser: Go to chrome://settings/help
- SeleniumBase will automatically download the matching ChromeDriver - no manual intervention needed!
- If issues persist: pip install --upgrade seleniumbase
MongoDB Throwing a Tantrum
- Double-check your connection string - typos are the #1 culprit
- Is your IP whitelisted? MongoDB Atlas loves to block new IPs
- Run nc -zv your-mongodb-host 27017 to check if the port's even reachable
- Did you forget to start Mongo? sudo systemctl start mongod (Linux) or brew services start mongodb-community (Mac)
"Where Are My Reviews?!" Crisis
- Make sure your URL isn't garbage - copy directly from the address bar in Google Maps
- Not all sort options work for all businesses. Try --sort relevance if all else fails
- Some locations have zero reviews. Yes, it happens. No, it's not the scraper's fault.
Image Download Apocalypse
- Check if Google is throttling you (likely if you've been hammering them)
- Getting permission errors? Make sure you can write to the image directory (image_dir). Running with sudo is a last resort, and Chrome refuses to run as root without --no-sandbox
- Some images vanish from Google's CDN faster than your ex. Nothing we can do about that.
S3 Upload Chaos
- Double-check your AWS credentials and bucket permissions
- Make sure your bucket exists and is in the specified region
- Check if your bucket policy allows public-read for uploaded objects
- AWS charges for every API call, so don't go crazy with test uploads
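
No nc on your machine? The MongoDB port check above can be done from Python with the standard library. This is a minimal sketch; the function name `port_is_open` is hypothetical, and a successful TCP connection only proves the port is reachable, not that your credentials work.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

For example, `port_is_open("your-mongodb-host", 27017)` mirrors the nc command in the list above.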

Operation Logs (AKA "What The Hell Is It Doing?")

We don't just log, we OBSESSIVELY document the scraper's every breath:

[2025-04-22 14:30:45] Starting scraper with settings: headless=True, sort_by=newest
[2025-04-22 14:30:45] URL: https://maps.app.goo.gl/6tkNMDjcj3SS6LJe9
[2025-04-22 14:30:47] Platform: Linux-5.15.0-58-generic-x86_64-with-glibc2.35
[2025-04-22 14:30:47] Python version: 3.13.1
[2025-04-22 14:30:47] Using standard undetected_chromedriver setup
[2025-04-22 14:30:52] Chrome driver setup completed successfully
[2025-04-22 14:30:55] Found reviews tab, attempting to click
[2025-04-22 14:30:57] Successfully clicked reviews tab using method 1 and selector '[data-tab-index="1"]'
[2025-04-22 14:30:58] Attempting to set sort order to 'newest'
[2025-04-22 14:30:59] Found sort button with selector: 'button[aria-label*="Sort" i]'
[2025-04-22 14:30:59] Sort menu opened with click method 1
[2025-04-22 14:31:00] Found 4 visible menu items
[2025-04-22 14:31:00] Found matching menu item: 'Newest' for 'Newest'
[2025-04-22 14:31:01] Successfully clicked menu item with method 1
[2025-04-22 14:31:01] Successfully set sort order to 'newest'

If you can't figure out what's happening from these logs, you probably shouldn't be using command-line tools at all. We tell you EVERYTHING.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

❓ FAQs From The Trenches

Q: Is scraping Google Maps reviews legal?
A: Look, I'm not your lawyer. Google doesn't want you to do it. It violates their ToS. It's your business whether that scares you or not. This tool exists for "research purposes" (wink wink). Use at your own risk, hotshot.

Q: Will this still work tomorrow/next week/when Google changes stuff?
A: Unlike 99% of the GitHub garbage that breaks when Google changes a CSS class, we're battle-hardened veterans of Google's interface wars. We update this beast CONSTANTLY. April 2025? Rock solid. May 2025? Probably still golden. 2026? Check back for updates.

Q: How do I avoid Google's ban hammer?
A: SeleniumBase UC Mode does the heavy lifting, but:

Don't be stupid greedy – set reasonable delays
Spread requests across IPs if you're going enterprise-level
Rotate user agents if you're truly paranoid
Consider a proxy rotation service (worth every penny)

Q: Can this handle enterprise-level scraping (10k+ reviews)?
A: Damn straight. We've pulled 50k+ reviews without breaking a sweat. The MongoDB integration isn't just for show – it's made for serious volume. Just make sure your machine has the RAM to handle it.

Q: I found a bug/have a killer feature idea!
A: Jump on GitHub and file an issue or PR. But do your homework first – if you're reporting something already in the README, we'll roast you publicly.

☁️ AWS S3 Setup Guide

Want to store your images in the cloud like a boss? Here's how to set up S3 integration:

1. Create an S3 Bucket

Log into AWS Console
Click "Create bucket"
Choose a unique bucket name (e.g., your-company-reviews)
Select your preferred region
Important: Under "Block public access settings" - UNCHECK "Block all public access" if you want images to be publicly accessible
Create the bucket

2. Set Bucket Permissions

For public image access, add this bucket policy (replace your-bucket-name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}

3. Create IAM User for API Access

Go to IAM Console
Create a new user with programmatic access
Attach this policy (replace your-bucket-name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::your-bucket-name"
    }
  ]
}

Save the Access Key ID and Secret Access Key

4. Configure Your Scraper

Update your config.yaml:

use_s3: true
s3:
  aws_access_key_id: "YOUR_ACCESS_KEY_ID"
  aws_secret_access_key: "YOUR_SECRET_ACCESS_KEY"
  region_name: "us-east-1"  # Match your bucket region
  bucket_name: "your-bucket-name"
  prefix: "google_reviews/"
  profiles_folder: "profiles/"
  reviews_folder: "reviews/"
  delete_local_after_upload: false  # Keep local copies
  s3_base_url: ""  # Leave empty for default AWS URLs

5. Test Your Setup

Run the included tests to verify everything works:

# Install dependencies
pip install -r requirements.txt

# Test S3 connection
pytest tests/test_s3_connection.py -v

6. Folder Structure

Your S3 bucket will organize images like this:

your-bucket/
├── google_reviews/
│   ├── profiles/
│   │   ├── user123.jpg
│   │   └── user456.jpg
│   └── reviews/
│       ├── review789.jpg
│       └── review101.jpg
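
The layout above follows from joining `prefix`, the folder setting and the file name. Here is a minimal sketch of that composition, assuming the scraper simply concatenates the three parts. The helper names are hypothetical and the real file naming may differ. The URL helper shows the standard AWS virtual-hosted-style address, which is presumably what an empty `s3_base_url` falls back to.

```python
def build_s3_key(prefix, folder, filename):
    """Join prefix, folder and filename into one object key without doubled slashes."""
    parts = [prefix.strip("/"), folder.strip("/"), filename.lstrip("/")]
    return "/".join(part for part in parts if part)


def default_object_url(bucket, region, key):
    """Virtual-hosted-style URL that AWS serves for a publicly readable object."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


key = build_s3_key("google_reviews/", "profiles/", "user123.jpg")
print(key)  # google_reviews/profiles/user123.jpg
print(default_object_url("your-bucket-name", "us-east-1", key))
# https://your-bucket-name.s3.us-east-1.amazonaws.com/google_reviews/profiles/user123.jpg
```

Keys containing spaces or other special characters would need URL-encoding before being placed in a link, which this sketch skips.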

Pro Tips:

Cost Optimization: Enable S3 Intelligent-Tiering for automatic cost savings
CDN: Add CloudFront distribution for faster global image delivery
Security: Use IAM roles instead of hardcoded keys in production
Monitoring: Enable S3 access logging to track usage

🌐 Links

🔎 SEO Keywords

Google Maps reviews scraper, Google reviews exporter, review analysis tool, business review tool, Python web scraper, MongoDB review database, multilingual review scraper, Google Maps data extraction, business intelligence tool, customer feedback analysis, review data mining, Google business reviews, local SEO analysis, review image downloader, Python Selenium scraper, automated review collection, Google Maps API alternative, review monitoring tool, scrape Google reviews, Google business ratings
