alezmad/whyrating-engine-legacy

Commit Graph

Author

SHA1

Message

Date

Alejandro Gutiérrez

218927bd9b

Filter out garbage API data (language codes, metadata)

- Reject authors with <= 3 chars (language codes like "es", "it", "no")
- Reject known non-review authors ("google", "maps", etc.)
- Reject timestamps that are URLs or very short strings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-21 20:47:08 +00:00
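
The three rejection rules in this commit could be sketched roughly as below. This is an illustration, not the repository's code: the function name `is_plausible_review`, the blocklist contents beyond "google" and "maps", the URL prefixes, and the minimum timestamp length of 4 are all assumptions.

```python
# Hypothetical blocklist: the commit names only "google" and "maps";
# whatever its "etc." covers is not filled in here.
NON_REVIEW_AUTHORS = {"google", "maps"}

# Assumed threshold for a "very short" timestamp string.
MIN_TIMESTAMP_LEN = 4


def looks_like_url(text):
    """Return True if the string starts like a URL (assumed prefixes)."""
    return text.lower().startswith(("http://", "https://", "//", "www."))


def is_plausible_review(author, timestamp):
    """Apply the three rejection rules from the commit message."""
    author = (author or "").strip()
    timestamp = (timestamp or "").strip()

    # Rule 1: authors with <= 3 chars are treated as language codes.
    if len(author) <= 3:
        return False
    # Rule 2: known non-review authors.
    if author.lower() in NON_REVIEW_AUTHORS:
        return False
    # Rule 3: timestamps that are URLs or very short strings.
    if len(timestamp) < MIN_TIMESTAMP_LEN or looks_like_url(timestamp):
        return False
    return True
```

Note that the length rule as stated would also reject genuine reviewers with very short display names, which appears to be an accepted trade-off in the commit.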

Alejandro Gutiérrez

0e8a711a9c

Fix clean scraper: specific selectors, consent reload, DOM parsing

- Use div.jftiEf[data-review-id] selector to exclude button elements
- Reload original URL after consent (prevents URL corruption)
- Parse full DOM data after scrolling stops
- Deduplicate API reviews by author match
- Remove slow "More" button clicking for speed

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-21 20:40:15 +00:00
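
A minimal sketch of the "deduplicate API reviews by author match" step, assuming reviews are plain dicts with an `author` key and that DOM-parsed records take precedence over intercepted API records. The function names, the dict shape, and the normalization (case and whitespace folding) are illustrative assumptions, not the repository's code.

```python
def normalize_author(name):
    """Fold case and collapse whitespace so near-identical names match."""
    return " ".join((name or "").lower().split())


def merge_reviews(dom_reviews, api_reviews):
    """Keep every DOM review; add an API review only if its author is new."""
    merged = list(dom_reviews)
    seen = {normalize_author(r.get("author")) for r in dom_reviews}

    for review in api_reviews:
        key = normalize_author(review.get("author"))
        if key in seen:
            continue  # same author already present: treat as duplicate
        seen.add(key)
        merged.append(review)
    return merged
```

Matching on author alone is coarse: two different reviewers who share a display name would collapse into one record. A key that combines author with a review ID, where one is available, would be stricter.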

Alejandro Gutiérrez

2c7ba2ae40

Add clean scraper with fixed smooth scrolling

Key improvements:
- Background thread scrolling at 10Hz (0.1s intervals) for smooth continuous scroll
- JavaScript-based review ID collection (doesn't affect scroll position)
- API interception via injected fetch/XHR interceptor
- Total review count extraction from page
- Auto-stop when all reviews collected or timeout reached

The scroll issue was caused by Selenium's find_elements() affecting scroll
position. Using pure JavaScript for data collection keeps scroll pinned to bottom.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-21 20:28:24 +00:00
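
The background scrolling loop and the auto-stop condition could look roughly like the sketch below. It is written against plain callables so it runs without a browser; `BackgroundScroller`, `collect_until_done`, `scroll_once`, and `count_collected` are hypothetical names, and the 0.05 s polling interval is an assumption. Only the 0.1 s scroll interval comes from the commit message.

```python
import threading
import time


class BackgroundScroller:
    """Call scroll_once() repeatedly on a background thread."""

    def __init__(self, scroll_once, interval=0.1):  # 0.1 s = 10 Hz
        self._scroll_once = scroll_once
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self._scroll_once()
            # wait() returns early once stop() is called
            self._stop.wait(self._interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def collect_until_done(scroller, count_collected, total, timeout):
    """Scroll until count_collected() reaches total or timeout expires.

    Returns True if all reviews were collected, False on timeout.
    """
    deadline = time.monotonic() + timeout
    scroller.start()
    try:
        while time.monotonic() < deadline:
            if count_collected() >= total:
                return True
            time.sleep(0.05)  # assumed polling interval
        return count_collected() >= total
    finally:
        scroller.stop()
```

With Selenium, `scroll_once` would presumably wrap a `driver.execute_script(...)` call that sets the review pane's `scrollTop`, and `count_collected` would wrap another `execute_script` call that counts review IDs in JavaScript, in line with the commit's point that `find_elements()` disturbed the scroll position.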

3 Commits