2025-05-10
Multi-Source Advertiser Data Deduplication & Quality Analysis
Data deduplication toolkit that cleans, normalizes, and matches advertiser records across all collections while flagging quality issues.
Data, Automation, LLM
Outcome
Automated matching, clustering, and fiscal ID validation with LLM prompting.
Stack
Python, Azure Data Factory, Databricks, LangChain, Azure OpenAI
Scope
Advertiser data flowed from CRM, payment gateways, and manual spreadsheets, creating duplicates and mismatched fiscal identifiers across sources.
Approach
- Built a Python pipeline that normalizes identifiers, cleans punctuation, and matches records through clustering and probabilistic joins.
- Generated quality reports as Excel/CSV exports, with LangChain agents validating fiscal IDs and filling missing addresses through Azure OpenAI.
- Stored the cleaned master table in Azure SQL and surfaced alerts through Databricks notebooks.
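The normalize, match, and cluster steps above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the field names (`source`, `name`, `fiscal_id`), the sample records, and the 0.9 threshold are all assumptions, and a plain string-similarity cutoff from the standard library stands in for the probabilistic join.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize_id(raw: str) -> str:
    """Uppercase a fiscal ID and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", (raw or "").upper())


def normalize_name(raw: str) -> str:
    """Strip accents, punctuation, and extra whitespace from a name."""
    text = unicodedata.normalize("NFKD", raw or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def is_match(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Match on identical non-empty fiscal IDs, else on name similarity."""
    id_a, id_b = normalize_id(a["fiscal_id"]), normalize_id(b["fiscal_id"])
    if id_a and id_a == id_b:
        return True
    name_a, name_b = normalize_name(a["name"]), normalize_name(b["name"])
    # Simplified stand-in for a probabilistic join (assumed threshold).
    return SequenceMatcher(None, name_a, name_b).ratio() >= threshold


def cluster(records: list[dict]) -> list[list[int]]:
    """Group record indices into clusters with a small union-find."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; fine for a sketch, not for large tables.
    for i, j in combinations(range(len(records)), 2):
        if is_match(records[i], records[j]):
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample records, one per source system.
records = [
    {"source": "crm", "name": "Acme Media, S.A.", "fiscal_id": "A-1234567"},
    {"source": "payments", "name": "ACME MEDIA SA", "fiscal_id": "a1234567"},
    {"source": "spreadsheet", "name": "Acme Média S.A.", "fiscal_id": ""},
    {"source": "crm", "name": "Borealis Outdoor", "fiscal_id": "B7654321"},
]
clusters = cluster(records)
```

Here the three Acme variants end up in one cluster and Borealis stays on its own. On real volumes, a blocking key (for example a fiscal ID prefix) would usually be needed before the pairwise step.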
Impact
The process delivered a single, auditable view of advertiser data, surfaced anomalies for finance teams, and automated downstream report generation.
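As a rough illustration of how anomalies might be surfaced for a finance team, the sketch below flags conflicting or missing values inside one cluster of matched records and writes them as CSV. The issue names, the checked fields (`fiscal_id`, `address`), and the sample data are assumptions made for the example, not the project's actual rules.

```python
import csv
import io


def quality_issues(cluster_id: int, members: list[dict]) -> list[dict]:
    """Return one issue row per problem found in a cluster of matched records."""
    issues = []
    ids = {m["fiscal_id"] for m in members if m.get("fiscal_id")}
    if len(ids) > 1:
        # Records matched as one advertiser but disagree on the fiscal ID.
        issues.append({"cluster": cluster_id, "issue": "conflicting_fiscal_id",
                       "detail": "|".join(sorted(ids))})
    for m in members:
        for field in ("fiscal_id", "address"):
            if not m.get(field):
                issues.append({"cluster": cluster_id, "issue": f"missing_{field}",
                               "detail": m["source"]})
    return issues


def to_csv(rows: list[dict]) -> str:
    """Serialize issue rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["cluster", "issue", "detail"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical cluster of two records believed to be the same advertiser.
members = [
    {"source": "crm", "fiscal_id": "A1234567", "address": "1 Main St"},
    {"source": "payments", "fiscal_id": "A1234568", "address": ""},
]
issues = quality_issues(0, members)
report = to_csv(issues)
```

Each row names the cluster, the kind of issue, and either the conflicting values or the source that is missing the field, which keeps the export easy to filter in a spreadsheet.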