fix(synthesis): Select most common business_id to handle data leakage

Changed the business name query to GROUP BY business_id and ORDER BY
COUNT(*) DESC before LIMIT 1, instead of taking an arbitrary row via
SELECT DISTINCT ... LIMIT 1. This ensures the correct business is
identified even when trace amounts of another business's data leak
into a job.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Alejandro Gutiérrez
2026-01-30 15:28:02 +00:00
parent 69d617ca38
commit 2a292e0754

@@ -486,10 +486,13 @@ class Stage5Synthesizer:
             ORDER BY negative DESC
         """, job_id)
-        # Business name
+        # Business name - get the most common one (in case of data leakage)
         business = await self.pool.fetchval("""
-            SELECT DISTINCT business_id FROM pipeline.reviews_enriched
-            WHERE job_id = $1::uuid LIMIT 1
+            SELECT business_id FROM pipeline.reviews_enriched
+            WHERE job_id = $1::uuid
+            GROUP BY business_id
+            ORDER BY COUNT(*) DESC
+            LIMIT 1
         """, job_id)
         # MOMENTUM: Calculate from data (not LLM guess)
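
The effect of the new query can be checked in isolation. The following is a minimal sketch, not the project's code: it uses an in-memory SQLite table as a stand-in for pipeline.reviews_enriched, and the table layout, the helper name most_common_business, and the sample IDs are all assumptions made for illustration. It shows that grouping by business_id and ordering by COUNT(*) DESC picks the dominant business even when a stray row from another business is present in the same job.

```python
import sqlite3

# Stand-in for pipeline.reviews_enriched (hypothetical two-column layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews_enriched (job_id TEXT, business_id TEXT)")

# job-1 holds one leaked row followed by five rows for the real business.
rows = (
    [("job-1", "leaked-biz")]
    + [("job-1", "main-biz")] * 5
    + [("job-2", "other-biz")] * 9
)
conn.executemany("INSERT INTO reviews_enriched VALUES (?, ?)", rows)


def most_common_business(conn, job_id):
    """Return the business_id with the most rows for job_id, or None."""
    row = conn.execute(
        """
        SELECT business_id FROM reviews_enriched
        WHERE job_id = ?
        GROUP BY business_id
        ORDER BY COUNT(*) DESC
        LIMIT 1
        """,
        (job_id,),
    ).fetchone()
    return row[0] if row else None


print(most_common_business(conn, "job-1"))  # main-biz
```

If two businesses ever had exactly the same row count in a job, the winner would still be unspecified; adding a secondary sort key such as business_id to the ORDER BY would make the result deterministic, should that matter here.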