2025-05-10

Multi-Source Advertiser Data Deduplication & Quality Analysis

Data deduplication toolkit that cleans, normalizes, and matches advertiser records across all collections while flagging quality issues.

Data, Automation, LLM

Outcome

Automated matching, clustering, and fiscal ID validation with LLM prompting.

Stack

Python, Azure Data Factory, Databricks, LangChain, Azure OpenAI

Related service

Automation Engineering

Production workflow automation, batch tooling, integrations, and operational systems that remove manual coordination and brittle handoffs.


Scope

Advertiser data flowed from CRM, payment gateways, and manual spreadsheets, creating duplicates and mismatched fiscal identifiers across sources.

Approach

Built a Python pipeline that normalizes identifiers, strips punctuation, and matches records through clustering and probabilistic joins.
Generated quality reports as Excel/CSV exports, while LangChain agents validated fiscal IDs and filled in missing addresses using Azure OpenAI.
Stored the cleaned master table in Azure SQL and surfaced alerts through Databricks notebooks.
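
The normalize-then-match step can be illustrated with a minimal standard-library sketch. It is not the project's actual code: the record fields (`source`, `name`, `fiscal_id`), the function names, the sample data, and the 0.9 threshold are all hypothetical, and `difflib` string similarity stands in for the probabilistic join. Clustering is done with a small union-find, so records linked by a shared fiscal ID or a near-identical name end up in the same group.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize_fiscal_id(raw):
    """Uppercase the ID and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", (raw or "").upper())


def normalize_name(raw):
    """Strip accents and punctuation, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKD", raw or "")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def cluster_records(records, name_threshold=0.9):
    """Group records that share a fiscal ID or have near-identical names."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    keys = [
        (normalize_fiscal_id(r.get("fiscal_id")), normalize_name(r.get("name")))
        for r in records
    ]
    # Pairwise comparison is O(n^2); a real pipeline would block candidates first.
    for i, j in combinations(range(len(records)), 2):
        (id_i, name_i), (id_j, name_j) = keys[i], keys[j]
        same_id = bool(id_i) and id_i == id_j
        similar = bool(name_i) and (
            SequenceMatcher(None, name_i, name_j).ratio() >= name_threshold
        )
        if same_id or similar:
            parent[find(j)] = find(i)

    clusters = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())


def flag_id_conflicts(records, clusters):
    """Return clusters whose members carry more than one distinct fiscal ID."""
    flagged = []
    for members in clusters:
        ids = {normalize_fiscal_id(records[i].get("fiscal_id")) for i in members}
        ids.discard("")
        if len(ids) > 1:
            flagged.append(members)
    return flagged


# Hypothetical sample rows from three sources.
records = [
    {"source": "crm", "name": "Acme Media, S.L.", "fiscal_id": "B-12345678"},
    {"source": "payments", "name": "ACME MEDIA SL", "fiscal_id": "b12345678"},
    {"source": "spreadsheet", "name": "Acme Media S.L", "fiscal_id": "B12345679"},
    {"source": "crm", "name": "Globex Outdoor", "fiscal_id": "A87654321"},
]
clusters = cluster_records(records)
conflicts = flag_id_conflicts(records, clusters)
print(clusters, conflicts)
```

In this sample the three Acme rows fall into one cluster, and that cluster is flagged because it carries two different fiscal IDs, which is the kind of anomaly a quality report would surface for review.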

Impact

The process delivered a single, auditable view of advertiser data, surfaced anomalies for finance teams, and automated downstream report generation.


Need this kind of system in your team?

I help teams ship document agents, RAG copilots, computer vision pipelines, and operational automations without the usual prototype-to-production gap.

Start a project conversation: ivanncaamano@gmail.com