Entity Resolution
What is Entity Resolution?
Entity Resolution is the data management process of identifying and linking records across different systems, databases, and data sources that refer to the same real-world entity—such as a person, company, or account—despite variations in how that entity is represented. Also known as record linkage, data matching, or deduplication, entity resolution solves the fundamental challenge that the same customer might appear as "John Smith, Acme Corp" in your CRM, "J. Smith, ACME Corporation" in marketing automation, and "johnsmith@acme.com, Acme Inc." in product analytics.
The complexity of entity resolution stems from the reality that data rarely arrives clean, consistent, or standardized across systems. Companies merge and rebrand, people change jobs and names, email addresses change with company transitions, and different teams enter data with varying conventions. Without effective entity resolution, organizations operate with fragmented customer views—sending duplicate marketing messages, missing cross-sell opportunities, understating customer lifetime value, and making poor decisions based on incomplete data. A single enterprise customer might exist as 15+ separate records across your technology stack, each containing different information and engagement history.
According to Gartner's research on data quality, poor data quality costs organizations an average of $12.9 million annually, with duplicate and unlinked records representing the most common data quality issue. Entity resolution addresses this challenge with matching algorithms that evaluate multiple data points (names, email addresses, phone numbers, company identifiers, IP addresses, device IDs, and behavioral patterns) to determine, with high confidence, when distinct records represent the same entity, even when no exact matches exist.
Key Takeaways
Identity Unification Challenge: The same person or company typically exists as multiple records across CRM, marketing automation, product analytics, support systems, and data warehouses due to data entry variations and system silos
Multi-Dimensional Matching: Effective entity resolution combines deterministic matching (exact email matches) with probabilistic algorithms evaluating multiple signals (name similarity, company, location, behavioral patterns) to identify matches even with incomplete data
Data Quality Foundation: Entity resolution is a prerequisite for accurate reporting, personalization, attribution, and customer 360 views; without it, all downstream analytics and operations suffer from fragmented identity
Privacy and Compliance Critical: Entity resolution must balance identity matching accuracy with privacy regulations (GDPR, CCPA) that govern how personal data is linked, stored, and used across systems
Continuous Process: Entity resolution is not a one-time cleanup but an ongoing process that maintains unified identity as new data arrives, existing records update, and entities change over time
How It Works
Entity Resolution operates through systematic data collection, standardization, matching algorithm application, and continuous identity graph maintenance:
Data Collection and Standardization: Raw data from multiple sources—CRM systems, marketing platforms, product analytics, support tickets, website interactions, third-party data providers—is ingested into a unified environment. Before matching can occur, data undergoes normalization: names converted to consistent formats (all lowercase or title case), addresses standardized to postal formats, phone numbers stripped of formatting variations, company names cleaned of legal suffixes (Inc., LLC, Ltd.), and emails validated. This preprocessing dramatically improves matching accuracy by eliminating false negatives caused by trivial formatting differences.
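The normalization step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the suffix list, and the assumption of 10-digit North American phone numbers are hypothetical choices for this example, not part of any particular product.

```python
import re

# Hypothetical list of legal suffixes to strip; extend it for your own data.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co"}

def normalize_email(email: str) -> str:
    """Lowercase and trim an email address."""
    return email.strip().lower()

def normalize_phone(phone: str) -> str:
    """Keep digits only and drop a leading US country code, if present."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

def normalize_company(name: str) -> str:
    """Lowercase, remove punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

With these rules, "ACME Corporation", "Acme Inc." and "Acme Corp" all reduce to the same key, which is the kind of trivial formatting difference that would otherwise produce a false negative.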
Deterministic Matching: The first matching pass applies exact match rules on unique identifiers. If two records share the same email address, phone number, customer ID, or other unique identifier, they're linked with high confidence. Deterministic matching provides the foundation of entity resolution with very low false positive rates. However, it only resolves entities where exact identifier matches exist—missing connections where people use different emails across systems or companies appear with name variations.
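A deterministic pass can be approximated by indexing records on each unique identifier and linking every pair that shares a value. The sketch below assumes records are plain dictionaries with an `id` field and already-normalized identifier values; the field names and sample records are hypothetical.

```python
from collections import defaultdict

def deterministic_links(records, keys=("email", "customer_id")):
    """Return pairs of record ids that share an exact value on any key."""
    links = set()
    for key in keys:
        index = defaultdict(list)  # identifier value -> record ids
        for rec in records:
            value = rec.get(key)
            if value:  # skip missing or empty identifiers
                index[value].append(rec["id"])
        for ids in index.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    links.add((ids[i], ids[j]))
    return links

# Hypothetical sample records from three systems.
records = [
    {"id": "crm-1", "email": "johnsmith@acme.com", "customer_id": "CUST-12345"},
    {"id": "map-7", "email": "johnsmith@acme.com", "customer_id": None},
    {"id": "dw-3", "email": "j.smith@acme.com", "customer_id": "CUST-12345"},
]
links = deterministic_links(records)
```

Here `crm-1` links to `map-7` through the shared email and to `dw-3` through the shared customer ID, while `map-7` and `dw-3` have no direct link of their own.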
Probabilistic Matching: For records without exact identifier matches, probabilistic algorithms evaluate similarity across multiple fields to calculate match likelihood. Techniques include: string similarity algorithms (Levenshtein distance, Jaro-Winkler) comparing names and addresses, phonetic matching (Soundex, Metaphone) handling misspellings, token-based comparison identifying shared words in different order, and machine learning models trained on historical match patterns. Each comparison generates a match score; when the composite score exceeds a confidence threshold (typically 85-95%), records are linked as probable matches.
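Two of the techniques named above, Levenshtein distance and token-based comparison, are small enough to sketch directly. The following is a plain-Python illustration rather than a production implementation; the function names and the choice to normalize edit distance by the longer string's length are assumptions of this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word tokens, ignoring order and case."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

For "John Smith" versus "Jon Smith" the edit distance is 1 over 10 characters, giving a similarity of 0.9. Token overlap treats "Smith John" and "john smith" as identical because word order is ignored.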
Clustering and Identity Graph Construction: Matched records are grouped into clusters representing single entities. These clusters form identity graphs—network structures mapping all known identifiers and attributes for each entity. The graph structure handles transitive relationships: if Record A matches Record B (same email), and Record B matches Record C (same phone number), then A, B, and C all represent the same entity even though A and C share no direct identifiers. Identity graphs grow and evolve as new data arrives and additional connections are discovered.
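The transitive grouping described here is commonly implemented with a union-find (disjoint-set) structure. The sketch below is one minimal way to do it, assuming pairwise links have already been produced by the matching passes; the function name and record labels are illustrative.

```python
def cluster_records(record_ids, links):
    """Group record ids into entities using union-find over pairwise links."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values())

clusters = cluster_records(
    ["A", "B", "C", "D"],
    [("A", "B"),   # same email
     ("B", "C")],  # same phone number
)
```

Records A, B, and C end up in one cluster even though A and C share no identifier, and D stays on its own. Because a single bad link merges two whole clusters, transitive closure is also where over-merging tends to show up.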
Survivorship and Golden Records: When multiple records resolve to the same entity but contain conflicting information (different job titles, company names, addresses), survivorship rules determine which data to trust. Rules might prioritize: most recent data, most complete data, data from authoritative sources (direct from CRM versus third-party append), or most frequently occurring values. The result is a "golden record"—the single best representation of the entity combining the most accurate and complete information from all source records.
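Survivorship rules can be expressed as small functions chosen per field. The sketch below illustrates two of the rules mentioned above, most recent value and source priority; the source ranking, field names, and sample values are hypothetical.

```python
from datetime import date

# Hypothetical source ranking: lower number = more authoritative.
SOURCE_PRIORITY = {"crm": 0, "marketing_automation": 1, "enrichment": 2}

def most_recent(values):
    """Pick the non-empty value with the latest update date."""
    candidates = [v for v in values if v["value"]]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["updated"])["value"]

def by_source_priority(values):
    """Pick the non-empty value from the most authoritative source."""
    candidates = [v for v in values if v["value"]]
    if not candidates:
        return None
    return min(candidates, key=lambda v: SOURCE_PRIORITY[v["source"]])["value"]

def build_golden_record(field_values, rules):
    """Apply one survivorship rule per field to build the golden record."""
    return {field: rules[field](values) for field, values in field_values.items()}

field_values = {
    "job_title": [
        {"value": "Marketing Manager", "source": "crm", "updated": date(2024, 1, 10)},
        {"value": "VP Marketing", "source": "enrichment", "updated": date(2025, 3, 2)},
    ],
    "company_name": [
        {"value": "ACME Corp", "source": "enrichment", "updated": date(2025, 3, 2)},
        {"value": "Acme Corporation", "source": "crm", "updated": date(2024, 1, 10)},
    ],
}
golden = build_golden_record(
    field_values, {"job_title": most_recent, "company_name": by_source_priority}
)
```

In this example the job title comes from the newer enrichment record while the company name comes from the CRM, showing how different fields of one golden record can survive from different sources.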
Continuous Resolution and Monitoring: Entity resolution isn't a one-time project but an ongoing system. As new leads enter marketing automation, website visitors convert to known contacts, customers update profiles, and people change companies, the resolution engine continuously evaluates new data against existing identity graphs. Data stewards monitor match quality metrics—false positive rates (incorrectly merged records), false negative rates (missed matches), cluster size distributions, and data completeness—adjusting matching rules and thresholds to maintain accuracy.
Key Features
Multi-source data ingestion supporting CRM, marketing automation, product analytics, data warehouses, and third-party data providers with automated ETL/ELT pipelines
Hybrid matching algorithms combining deterministic exact matches with probabilistic fuzzy matching for comprehensive entity identification
Identity graph architecture maintaining network relationships between all identifiers and attributes associated with unified entities
Configurable survivorship rules determining which data values to trust when multiple source records conflict
Real-time resolution capabilities processing new incoming data against existing identity graphs with sub-second matching for operational use cases
Use Cases
Customer 360 View Construction
A B2B SaaS company with 12,000 customers struggled with fragmented customer data across Salesforce (account and opportunity data), HubSpot (marketing engagement), Segment (product usage), Zendesk (support interactions), and Snowflake (data warehouse). The same customer might have 8-12 related records across systems with no unified view. Implementing entity resolution using customer data platform (CDP) technology with identity graphs unified these fragments. The system first matched records on email addresses (deterministic), then applied probabilistic matching on company names, domains, phone numbers, and IP addresses to connect anonymous website visitors to known contacts. This created unified customer profiles showing complete engagement history (marketing touches, product usage patterns, support issues, and account health), enabling customer success teams to operate from a single source of truth rather than switching between five systems. Customer health scoring accuracy improved 43% by incorporating previously siloed product usage and support data.
Marketing Attribution and Deduplication
A marketing team tracked campaigns across paid search, display advertising, LinkedIn ads, email marketing, webinars, and content downloads, but couldn't accurately attribute conversions because the same prospect appeared as different records in each channel. Implementing entity resolution with multi-touch attribution revealed that their highest-value customers averaged 12.3 touchpoints across 6.8 different channels over 89 days before converting—insights impossible to surface with fragmented data. The system also identified 2,847 duplicate records (23% of their database) receiving redundant email campaigns, wasting budget and damaging sender reputation. By resolving entities and deduplicating, they reduced email send volume 18% while improving engagement rates 27% through better targeting. According to Forrester's research on identity resolution, companies implementing entity resolution improve marketing ROI 15-25% through better targeting and attribution.
Compliance and Privacy Management
A financial services company needed to honor GDPR and CCPA data subject requests—right to access, right to deletion, right to opt-out—but struggled to identify all data associated with requesting individuals across 47 different systems and databases. Without entity resolution, a data subject access request for "Jane Smith" might return incomplete results missing records under maiden names, nickname variations, or old email addresses. Implementing enterprise entity resolution with privacy-focused identity graphs enabled comprehensive discovery of all records associated with data subjects across systems. When an opt-out request arrived, the system identified all 13 variations of that person's data across systems and propagated deletion/suppression to every location. This reduced compliance response time from 4-6 weeks (manual searching) to 48 hours (automated resolution) while ensuring completeness. Platforms like Saber's identity capabilities can help track when individuals change companies or roles, an important signal for maintaining accurate identity resolution over time.
Implementation Example
Here's a comprehensive entity resolution framework for a B2B marketing technology company unifying customer data across multiple systems:
Entity Resolution Architecture
Matching Algorithm Configuration
Deterministic Matching Rules (Exact Match)
| Match Type | Confidence | Action | Examples |
|---|---|---|---|
| Email Match | 99% | Auto-merge | johnsmith@acme.com matches across all systems |
| Customer ID | 99% | Auto-merge | CUST-12345 exists in CRM and data warehouse |
| Phone + Last Name | 95% | Auto-merge | +1-555-0123 + "Smith" appears in multiple records |
| Company Domain + Name | 90% | Auto-merge | acme.com domain + "John Smith" |
| Cookie/Device ID | 85% | Link (not merge) | Same device ID across sessions; links but maintains separate behavioral records |
Probabilistic Matching Rules (Fuzzy Match)
| Comparison Fields | Algorithm | Threshold | Weight | Example Match |
|---|---|---|---|---|
| First + Last Name | Levenshtein distance | >85% similar | 25% | "John Smith" vs "Jon Smith" |
| Email Prefix | String similarity | >90% similar | 20% | johnsmith vs john.smith at same domain |
| Company Name | Token-based | >80% overlap | 20% | "Acme Corporation" vs "ACME Corp" |
| Address | Fuzzy + Standardization | >85% similar | 15% | "123 Main St" vs "123 Main Street" |
| Phone Number | Digit matching | Exact match on 10 digits | 10% | (555) 123-4567 vs 555-123-4567 |
| Job Title | Semantic similarity | >70% similar | 10% | "VP Marketing" vs "Vice President, Marketing" |
Composite Match Score Calculation:
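One plausible reading of the composite score is a weighted average of the per-field similarities, using the weights from the probabilistic matching table. The sketch below is an assumption about how such a score could be computed, not a documented formula; the renormalization of missing fields and the `merge_at` and `review_at` cutoffs are hypothetical.

```python
# Weights taken from the probabilistic matching table; they sum to 1.0.
WEIGHTS = {
    "name": 0.25,
    "email_prefix": 0.20,
    "company": 0.20,
    "address": 0.15,
    "phone": 0.10,
    "job_title": 0.10,
}

def composite_score(similarities):
    """Weighted average of per-field similarities (each in [0, 1]).

    Fields that are missing or None are skipped and the remaining weights
    are renormalized, which is an assumption of this sketch.
    """
    present = {f: s for f, s in similarities.items()
               if f in WEIGHTS and s is not None}
    total_weight = sum(WEIGHTS[f] for f in present)
    if total_weight == 0:
        return 0.0
    return sum(WEIGHTS[f] * s for f, s in present.items()) / total_weight

def decide(score, merge_at=0.90, review_at=0.75):
    """Map a score to an action; both cutoffs are hypothetical."""
    if score >= merge_at:
        return "auto-merge"
    if score >= review_at:
        return "manual-review"
    return "no-match"

score = composite_score({
    "name": 0.90, "email_prefix": 0.95, "company": 0.85,
    "address": 1.00, "phone": 1.00, "job_title": 0.80,
})
action = decide(score)
```

With these inputs the weighted sum is 0.225 + 0.19 + 0.17 + 0.15 + 0.10 + 0.08 = 0.915, which clears the illustrative auto-merge cutoff. Scores in the middle band would go to the manual review queue instead.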
Survivorship Rules (Golden Record Creation)
When merging records with conflicting data, apply these survivorship rules:
| Data Field | Survivorship Rule | Rationale |
|---|---|---|
| Email Address | Most recent from authoritative source (CRM) | Prefer directly entered over appended |
| Phone Number | Most recent from authoritative source (CRM) | Recent data more likely current |
| Job Title | Most recent with recency < 90 days | Job titles change frequently |
| Company Name | CRM data > Marketing Automation > Enrichment | CRM vetted by sales team |
| Address | Most complete (fewest nulls) | Prefer full addresses over partial |
| First/Last Name | Most complete + proper capitalization | Prefer "John Smith" over "john smith" |
| Lead Source | First known source (oldest timestamp) | Attribution to original acquisition |
| Engagement History | Append all (no overwrite) | Preserve complete activity timeline |
| Custom Fields | Source system priority defined per field | Business rules vary by field importance |
Identity Graph Structure
Example Identity Graph for Single Customer:
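As a rough illustration of the structure, an identity graph for one customer can be modeled as an entity node connected to typed identifier nodes. The class below is a toy sketch; the class name, entity ID, and device ID are hypothetical, and the other identifier values reuse examples from the matching tables.

```python
from collections import defaultdict

class IdentityGraph:
    """Toy identity graph: entity ids linked to typed identifiers."""

    def __init__(self):
        self.entity_to_identifiers = defaultdict(set)
        self.identifier_to_entity = {}

    def link(self, entity_id, id_type, value):
        """Attach an identifier to an entity (edge: entity -> identifier)."""
        key = (id_type, value)
        self.entity_to_identifiers[entity_id].add(key)
        self.identifier_to_entity[key] = entity_id

    def resolve(self, id_type, value):
        """Return the entity that owns this identifier, or None."""
        return self.identifier_to_entity.get((id_type, value))

graph = IdentityGraph()
graph.link("entity-001", "email", "johnsmith@acme.com")
graph.link("entity-001", "customer_id", "CUST-12345")
graph.link("entity-001", "phone", "+1-555-0123")
graph.link("entity-001", "device_id", "device-abc")  # linked, not merged
```

Looking up any one of the four identifiers returns the same entity, which is what lets a new record carrying only a phone number or a device ID attach to the existing profile.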
Implementation in Modern Data Stack
Technology Components:
Data Sources: Salesforce, HubSpot, Segment, Zendesk, Snowflake
ETL/ELT: Fivetran or Airbyte for data ingestion into warehouse
Resolution Engine:
- CDP Solution: Segment Unify, mParticle, or Treasure Data
- Open Source: Apache Spark with Dedupe.io or Splink libraries
- Custom: Python + fuzzywuzzy + recordlinkage packages
Identity Graph Storage: Graph database (Neo4j) or relational (PostgreSQL with foreign keys)
Golden Records: Reverse ETL tools (Census, Hightouch) syncing resolved data back to operational systems
Configuration Steps:
Schema Mapping: Map equivalent fields across source systems (CRM "Contact Name" → Marketing "Full Name" → Product "User Name")
Normalization Rules: Define standardization for each field type (names, emails, phones, companies)
Match Rules Configuration: Set deterministic and probabilistic rules with confidence thresholds
Survivorship Configuration: Define rules for handling conflicting data per field
Validation: Test against known good matches and confirmed non-matches to calibrate thresholds
Monitoring: Dashboard tracking match rates, review queue size, false positive incidents, data completeness
Performance Metrics
Track entity resolution effectiveness:
| Metric | Target | Measurement Method |
|---|---|---|
| Match Rate | >90% | % of new records successfully matched to existing entities |
| False Positive Rate | <2% | Manual review of 100 random auto-merged records |
| False Negative Rate | <5% | Manual review identifying missed matches |
| Data Completeness | >85% | % of golden record fields populated (non-null) |
| Resolution Latency | <5 minutes | Time from data ingestion to resolved golden record |
| Review Queue Size | <100 records | Records in manual review requiring data steward decisions |
| Cluster Size Distribution | 95% have 2-8 source records | Flags over-merging (large clusters) or under-merging |
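Several of these metrics can be computed from a hand-labeled sample of record pairs. The sketch below assumes the false positive rate means the share of predicted matches that reviewers reject, and the false negative rate means the share of true matches the system missed; these definitions and the function names are assumptions for illustration.

```python
from collections import Counter

def pairwise_error_rates(predicted_pairs, true_pairs):
    """Compare predicted match pairs with a hand-labeled set of true pairs.

    Returns (false_positive_rate, false_negative_rate): the share of
    predicted pairs that are wrong, and the share of true pairs missed.
    """
    # frozenset makes ("a", "b") and ("b", "a") the same pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    truth = {frozenset(p) for p in true_pairs}
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    fp_rate = fp / len(predicted) if predicted else 0.0
    fn_rate = fn / len(truth) if truth else 0.0
    return fp_rate, fn_rate

def cluster_size_distribution(clusters):
    """Count how many clusters exist at each size."""
    return Counter(len(c) for c in clusters)

fp_rate, fn_rate = pairwise_error_rates(
    predicted_pairs=[("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")],
    true_pairs=[("b", "a"), ("c", "d"), ("e", "f"), ("i", "j")],
)
```

In this sample one of four predicted pairs is wrong and one of four true pairs is missed, so both rates are 0.25. The size distribution helps spot unusually large clusters, which often indicate over-merging.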
This entity resolution framework helped one B2B company reduce duplicate records from 28% to 3% of their database, improve customer lifetime value calculations by 34% through complete history visibility, and enable personalization that increased conversion rates 19% by leveraging unified cross-channel engagement data.
Related Terms
Identity Graph: The network structure mapping all identifiers and relationships for unified entities created through entity resolution
Identity Resolution: Closely related term often used interchangeably with entity resolution, particularly in marketing contexts
Customer Data Platform (CDP): Marketing technology that often includes entity resolution capabilities for unified customer profiles
Data Normalization: Data preprocessing step essential to effective entity resolution
Deterministic Matching: Exact match methodology used in entity resolution for high-confidence links
Data Quality Automation: Broader category including entity resolution as key component
Reverse ETL: Process for syncing resolved golden records back to operational systems
Account Identification: Company-level entity resolution for B2B use cases
Frequently Asked Questions
What is Entity Resolution?
Quick Answer: Entity Resolution is the data management process of identifying and linking records across different databases and systems that refer to the same real-world person, company, or object—despite variations in names, formats, and identifiers—to create unified, accurate data profiles.
Entity resolution solves the fundamental challenge that data rarely arrives standardized across systems. The same customer appears as multiple records with different email addresses, name spellings, or company name variations. By applying matching algorithms that evaluate multiple data points simultaneously, entity resolution determines when different records actually represent the same entity, enabling organizations to maintain accurate, deduplicated data that powers reliable analytics, personalization, and operations.
How does entity resolution differ from deduplication?
Quick Answer: Deduplication identifies and removes duplicate records within a single database or system, while entity resolution links related records across multiple systems and data sources to create unified entities—though deduplication is often a component of broader entity resolution efforts.
Deduplication typically operates within system boundaries: finding duplicate contacts in your CRM or duplicate leads in marketing automation. Entity resolution tackles the harder cross-system challenge: recognizing that the contact in your CRM, the user in your product analytics, the ticket submitter in your support system, and the anonymous website visitor are all the same person. Entity resolution creates identity graphs connecting these disparate records without necessarily deleting them from source systems, while providing unified "golden record" views. According to research from TDWI, cross-system entity resolution typically improves data accuracy 3-4x more than within-system deduplication alone.
What matching algorithms are used in entity resolution?
Quick Answer: Entity resolution combines deterministic matching using exact matches on unique identifiers (emails, phone numbers, IDs) with probabilistic matching using fuzzy algorithms (Levenshtein distance, Jaro-Winkler, phonetic matching) that calculate similarity scores across multiple fields to identify likely matches even with data inconsistencies.
Deterministic rules provide the foundation: if two records share the same email address, they're almost certainly the same person. Probabilistic algorithms handle messier scenarios: "John Smith at Acme Corp with phone 555-1234" and "Jon Smith at ACME Corporation with phone (555) 123-4567" might not match exactly on any single field, yet together they can show 92% overall similarity, which warrants a match. Advanced implementations use machine learning models trained on historical match patterns to predict match likelihood. The optimal approach layers multiple techniques: start with high-confidence deterministic matches, then apply progressively more sophisticated probabilistic methods to resolve remaining ambiguities.
What are the privacy implications of entity resolution?
Entity resolution must balance data utility with privacy compliance under GDPR, CCPA, and similar regulations. Key considerations include: (1) Lawful basis for processing—you need legitimate grounds to link personal data across systems, (2) Purpose limitation—entity resolution should serve the purposes disclosed to data subjects, not create surveillance profiles, (3) Data minimization—link only data necessary for business purposes, not all possible data points, (4) Right to object—individuals can object to profiling and linking that occurs through entity resolution, (5) Transparency—privacy notices should disclose that data is linked across systems for unified views. Best practices include pseudonymization of resolved entities, access controls limiting who can view unified profiles, audit logging of resolution activities, and implementing privacy-by-design principles ensuring resolution respects consent and preference management.
Can entity resolution be done in real-time?
Yes, modern entity resolution systems support real-time operation for operational use cases requiring immediate identity recognition, such as website personalization, next-best-action decisioning, or preventing duplicate account creation. Real-time resolution typically uses: (1) Pre-computed identity graphs so new data is matched against existing resolved entities rather than re-resolving the entire database, (2) Fast deterministic matching first (exact email/phone lookups completing in milliseconds), (3) Selective probabilistic matching only when deterministic matching fails, (4) Caching of recent lookups, (5) Asynchronous enrichment where immediate matching uses basic attributes while deeper probabilistic analysis happens in the background. Batch resolution remains valuable for comprehensive database cleanup, model retraining, and complex graph restructuring, but streaming architectures enable sub-second matching for customer-facing applications. Platforms like Saber provide real-time company and contact signals that can enhance entity resolution with external data sources.
Conclusion
Entity Resolution forms the foundational data infrastructure enabling accurate customer understanding, personalized experiences, and reliable analytics across modern B2B technology stacks. Without effective entity resolution, organizations operate with fragmented customer views—the same individual appearing as 5, 10, or 15+ disconnected records across CRM, marketing, product, and support systems, each containing partial information and incomplete engagement histories.
For data and analytics teams, entity resolution transforms unreliable fragmented data into trustworthy unified profiles that power accurate reporting, attribution modeling, and predictive analytics. Marketing teams leverage resolved identities to eliminate duplicate communications, enable sophisticated multi-touch attribution, and personalize experiences based on complete cross-channel engagement. Customer success teams access comprehensive customer 360 views combining product usage, support interactions, and account health signals previously siloed in separate systems.
As B2B organizations accumulate data across expanding technology stacks and face increasing pressure to demonstrate ROI from data investments, entity resolution excellence increasingly differentiates data-mature organizations from those struggling with "garbage in, garbage out" challenges. Companies that implement sophisticated entity resolution—combining deterministic and probabilistic matching, maintaining identity graphs, and establishing continuous resolution processes—create the data quality foundation necessary for AI-powered automation, accurate customer analytics, and seamless customer experiences across the entire lifecycle.
Last Updated: January 18, 2026
