Identifying Duplicate Customers (Part 3)

It ought to be pretty clear by now that identifying duplicates within and across Customer data sources is a very tricky business. This is due to:-

  • The absence of a single unifying key.
  • Differences in data structure between sources.
  • Inconsistency of data across sources.
  • Variations due to data quality – missing data, mis-spelt data, erroneous data, use of abbreviations and synonyms.

So when it comes to developing or choosing a matching engine to power a Single Customer View, how do you go about it?

There are two main approaches to Customer matching:-

  • Deterministic Matching
  • Probabilistic Matching

I will attempt to explain the differences as simply as possible.

Deterministic Matching

Deterministic Matching solutions essentially involve determining the rules by which you will identify matches / non-matches between pairs of customer records identified by pre-matching (the identification of candidate matches – often referred to as “blocking”).
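To make the idea of pre-matching a little more concrete, here is a minimal sketch of blocking in Python. The blocking key (first letter of surname plus year of birth) and the field names `id`, `surname` and `dob` are purely my assumptions for the example; a real solution would choose its keys after studying the data.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(rec):
    # Hypothetical key: first letter of the surname plus year of birth.
    return (rec["surname"][:1].upper(), rec["dob"][:4])

def candidate_pairs(records):
    """Group records by blocking key and pair up records within each group."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        # Only records sharing a block are ever compared with each other.
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)

records = [
    {"id": 1, "surname": "Smith", "dob": "1970-03-12"},
    {"id": 2, "surname": "Smyth", "dob": "1970-03-12"},
    {"id": 3, "surname": "Jones", "dob": "1970-03-12"},
    {"id": 4, "surname": "Smith", "dob": "1985-07-01"},
]
print(candidate_pairs(records))
```

Only records 1 and 2 share a block here, so they are the only pair that goes forward to the match rules.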

These rules do not have to specify that exact matches only are allowed. Deterministic match rules can still incorporate fuzzy logic.

So, given typical name elements for a pair of candidates we might say:-

  • Surnames must match, allowing for mis-spellings except on the first character
  • AND
  • Forenames must match allowing for mis-spellings and common synonyms
  • AND
  • Dates of Birth must match…

So it is really a case of establishing the logic that works best given the available data and the requirements of the solution.
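As a rough illustration, the example rules above might be sketched in Python along the following lines. This is only a sketch under assumptions of my own: the tiny synonym table, the 0.8 similarity threshold, the field names and the use of `difflib.SequenceMatcher` as the fuzzy comparison are all stand-ins, and I have read the surname rule as "mis-spellings are tolerated, but the first characters must agree".

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real one would be far larger.
FORENAME_SYNONYMS = {"bill": "william", "will": "william", "bob": "robert"}

def similar(a, b, threshold=0.8):
    # Fuzzy comparison: similarity ratio between 0.0 and 1.0.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def surnames_match(a, b):
    a, b = a.strip().lower(), b.strip().lower()
    # The first character must agree exactly; the rest may be mis-spelt.
    return bool(a) and bool(b) and a[0] == b[0] and similar(a, b)

def forenames_match(a, b):
    a, b = a.strip().lower(), b.strip().lower()
    # Map common synonyms onto one canonical form before comparing.
    a = FORENAME_SYNONYMS.get(a, a)
    b = FORENAME_SYNONYMS.get(b, b)
    return bool(a) and bool(b) and similar(a, b)

def is_match(rec_a, rec_b):
    # All three rules must hold (the ANDs in the list above).
    return (surnames_match(rec_a["surname"], rec_b["surname"])
            and forenames_match(rec_a["forename"], rec_b["forename"])
            and rec_a["dob"] == rec_b["dob"])

a = {"forename": "Bill", "surname": "Thompson", "dob": "1970-01-31"}
b = {"forename": "William", "surname": "Thomson", "dob": "1970-01-31"}
print(is_match(a, b))
```

Which fuzzy technique and which threshold to use is exactly the kind of decision that depends on the available data.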

These rules can become quite involved when they start to account for data quality issues such as missing values e.g. Forenames must match, but if one record has a full forename and the other only has an initial then, provided the initial is the same as the first character of the full forename, allow a match.
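The initial-versus-full-forename rule could be sketched as below. I am assuming the initial is compared with the first character of the other record's forename, and that a completely missing forename means the rule cannot confirm a match; both are choices made for the example rather than fixed practice.

```python
def forename_compatible(a, b):
    """Return True if two forename values are allowed to match."""
    a = a.strip().rstrip(".").lower()
    b = b.strip().rstrip(".").lower()
    if not a or not b:
        # Assumption: a missing forename cannot confirm a match.
        return False
    if len(a) == 1 or len(b) == 1:
        # One side is only an initial: compare first characters.
        return a[0] == b[0]
    # Both sides are full forenames: require exact agreement here.
    return a == b

print(forename_compatible("J.", "John"))
```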

And often it becomes easier to specify the reasons why two candidate records should not be allowed to match due to definite differences.

Nevertheless the match rules that determine whether two records should or should not match can normally be written explicitly in plain English that the business users can understand. This understandability is essential when something goes wrong and it is necessary to explain (to a Customer or an Auditor) what happened and what steps are being taken to resolve the issue and underlying cause.

So in summary, Deterministic matching involves establishing the rules / logic that determine whether customer records match or not.

It should be noted that such rules may also identify relationships between non-matching customer records e.g. Same Address different person, Same Family different person. Rules can also be set to establish the strength of the match e.g. Probably the same person, Possibly the same person. So although we are using the terms “rule” and “logic” the outcome does not have to be black or white / yes or no.
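A rule set that returns a graded outcome rather than yes / no might look something like this sketch. The particular conditions behind each outcome are invented for illustration (and deliberately use exact comparisons to keep it short); the point is only that rules can return more than two answers.

```python
from enum import Enum

class Outcome(Enum):
    PROBABLE_SAME_PERSON = "probably the same person"
    POSSIBLE_SAME_PERSON = "possibly the same person"
    SAME_FAMILY = "same family, different person"
    SAME_ADDRESS = "same address, different person"
    UNRELATED = "no relationship found"

def classify(a, b):
    # Illustrative conditions only; real rules would be agreed with the business.
    same_surname = a["surname"] == b["surname"]
    same_name = same_surname and a["forename"] == b["forename"]
    same_dob = a["dob"] == b["dob"]
    same_address = a["address"] == b["address"]
    if same_name and same_dob:
        return Outcome.PROBABLE_SAME_PERSON
    if same_name and same_address:
        return Outcome.POSSIBLE_SAME_PERSON
    if same_surname and same_address:
        return Outcome.SAME_FAMILY
    if same_address:
        return Outcome.SAME_ADDRESS
    return Outcome.UNRELATED

john = {"forename": "John", "surname": "Smith", "dob": "1970-01-01", "address": "1 High St"}
jane = {"forename": "Jane", "surname": "Smith", "dob": "1972-06-15", "address": "1 High St"}
print(classify(john, jane).value)
```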

Probabilistic Matching

Probabilistic matching is the use of statistical analysis to establish the likelihood of two customer records belonging to the same real world entity. This involves extensive analysis of the available data to establish the strengths of the different data elements and their actual content (common names such as John vs John will carry less weight than uncommon names such as Quentin vs Quentin) to set weighted scores which are accumulated to establish an overall match score. This will typically be within a range of zero to 100 with the higher score representing an increased probability that the two candidate records should match.
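The basic principle that agreement on a rare value counts for more than agreement on a common one can be sketched as follows. This is a simplified illustration and not how any particular product works: I am assuming an agreement weight of log2(m / u), where m is the chance that true matches agree on a field (set to 0.95 here as a guess) and u is approximated by how often the value occurs in the data. Penalties for disagreement and scaling to a zero to 100 range are left out.

```python
import math
from collections import Counter

def value_frequencies(values):
    """Relative frequency of each value in a column of data."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def agreement_weight(value, freqs, m=0.95):
    # Rare values have a small u and therefore a large weight.
    u = freqs[value]
    return math.log2(m / u)

def pair_score(a, b, freqs_by_field, m=0.95):
    """Accumulate agreement weights over the fields where a and b agree."""
    score = 0.0
    for field, freqs in freqs_by_field.items():
        if a[field] == b[field]:
            score += agreement_weight(a[field], freqs, m)
    return score

forenames = ["john"] * 50 + ["mary"] * 49 + ["quentin"]
surnames = ["smith"] * 100
freqs_by_field = {
    "forename": value_frequencies(forenames),
    "surname": value_frequencies(surnames),
}
john = {"forename": "john", "surname": "smith"}
quentin = {"forename": "quentin", "surname": "smith"}
print(pair_score(john, john, freqs_by_field))
print(pair_score(quentin, quentin, freqs_by_field))
```

In this made-up data set John vs John scores far lower than Quentin vs Quentin, because half the records are called John and only one is called Quentin.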

I am not going to pretend to fully understand the underlying statistical analysis – I am not a statistician. But it is not difficult to grasp the basic principle.

But given a probability score that may range between zero and 100, which score or scores represent definite matches, which represent probable matches, which represent possible matches and which represent non-matches? These values would typically be set by manual inspection of the results by trained business users.
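Once the thresholds have been agreed, applying them is the easy part. In the sketch below the cut-off values 90, 75 and 60 are placeholders that I have made up; in practice they would come out of the manual inspection just described.

```python
def band(score, definite=90, probable=75, possible=60):
    """Translate a 0 to 100 match score into a match category."""
    # The default thresholds are hypothetical placeholders.
    if score >= definite:
        return "definite match"
    if score >= probable:
        return "probable match"
    if score >= possible:
        return "possible match"
    return "non-match"

for score in (95, 80, 60, 10):
    print(score, band(score))
```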

Which is best?

Which approach is best – Deterministic or Probabilistic? This is a question anyone interested, involved in or looking to implement a Single Customer View solution would ask.

In my opinion, the somewhat glib answer would be – whichever one works best given your requirements, your data, your strategy and your budget.

  • Neither approach is going to be perfect.
  • They are both capable of making errors – false positives and false negatives.
  • They both rely on some form of pre-matching or blocking to establish candidates for matching. If the blocking is too tight they risk increasing the number of false negatives; if it is too loose they risk increasing the number of false positives.
  • They can both use the same fuzzy match techniques (including open source algorithms).
  • They are both impacted by poor data quality.
  • Both will get most match decisions correct.
  • Both will struggle in the grey area – when does variation become a difference?
  • They both have to deal with multiple and conflicting match scenarios e.g. A equals B, B equals C, A does not equal C.
  • They both have to deal with ambiguity e.g. A equals B and A equals C.
  • Both will have to deal with “timing” issues – Yesterday we decided that A was equal to B, Today C has appeared and now we’re not so sure.
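The conflicting scenario in the list above (A equals B, B equals C, A does not equal C) is easy to detect mechanically, even if deciding what to do about it is not. The sketch below groups records using the positive match decisions and then reports any pair that was explicitly judged a non-match yet ended up in the same group. The function names and the union-find approach are my own choices for the illustration.

```python
def find(parent, x):
    # Follow parent links to the representative of x's group.
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # shorten the path as we go
        x = parent[x]
    return x

def conflicting_pairs(matches, non_matches):
    """Return non-match pairs that the match decisions have chained together."""
    parent = {}
    for a, b in list(matches) + list(non_matches):
        parent.setdefault(a, a)
        parent.setdefault(b, b)
    for a, b in matches:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups
    return [(a, b) for a, b in non_matches
            if find(parent, a) == find(parent, b)]

matches = [("A", "B"), ("B", "C")]
non_matches = [("A", "C")]
print(conflicting_pairs(matches, non_matches))
```

Both Deterministic and Probabilistic engines need some policy for what happens next: split the group, hold it for manual review, or trust the stronger decisions.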

Deterministic matching has been around longer and, being rules / logic based, is probably easier to understand, so it may be preferred by business users. If the rules are written clearly it ought to be easy to say “record A matched to record B because rules 1, 2 and 3 were met”.
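One way to get that kind of explanation for free is to keep the rules as a named list and record which ones were met for each pair. A minimal sketch follows; the three rules and their wording are invented for the example.

```python
# Each rule is a (plain-English name, test) pair; the rules are hypothetical.
RULES = [
    ("Rule 1: surnames agree",
     lambda a, b: a["surname"].lower() == b["surname"].lower()),
    ("Rule 2: forenames agree",
     lambda a, b: a["forename"].lower() == b["forename"].lower()),
    ("Rule 3: dates of birth agree",
     lambda a, b: a["dob"] == b["dob"]),
]

def explain(a, b):
    """Return (matched, rules_met) so that every decision can be explained."""
    rules_met = [name for name, test in RULES if test(a, b)]
    matched = len(rules_met) == len(RULES)
    return matched, rules_met

rec_a = {"forename": "Ann", "surname": "Lee", "dob": "1980-05-01"}
rec_b = {"forename": "ann", "surname": "LEE", "dob": "1980-05-01"}
matched, rules_met = explain(rec_a, rec_b)
print(matched, rules_met)
```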

Probabilistic matching arguably has more science behind it, and is probably more likely to be the focus of further developments as machine learning / artificial intelligence plays an ever increasing role, so it may be preferred by the more technical community. But explaining why records did or did not match may be more difficult – what exactly is the difference between a score of 89 and a score of 90? The answer may be buried within the statistical analysis, which may not be particularly “consumable” to a non-technical audience.

So unfortunately I do not think that there is a clear answer as to which approach is best. A well established Deterministic match algorithm may outperform a less well established Probabilistic solution and vice versa.

So at risk of repeating myself, the first step in any Single Customer View project is to be clear and unambiguous about your requirements and expectations.

 
