Identifying Duplicate Customers (Part 1)

In this article I will begin to discuss some of the detailed issues that you must consider when implementing a Single Customer View solution. Whether you are a business considering an implementation, a business that already has a solution that is not providing the expected benefits, or a Single Customer View solution provider, I hope that this and future articles will be of value. As a minimum, these articles should provide food for thought and may well spark some debate.

Note that throughout this and future articles I will be using examples of person data. These are totally fictitious. Any resemblance to real people living or deceased is purely coincidental.

What are Duplicates?

Let’s just remind ourselves of what we mean by the term “duplicates”:

  • Duplicates are where we have multiple digital representations of the same real world entity.

For the purpose of this article, the real world entity that we are interested in is the customer, and my definition of a customer entity is an individual person. You. Me. Your sibling. An entity of which there is, and will always be, only one in the whole world.

Attributes of a person.

In my experience, the person attributes that you are most likely to have at your disposal for identifying an individual are:

  • Surname e.g. Smith
  • Forename 1 e.g. John
  • Date of birth e.g. 05/07/1989
  • 1st Initial e.g. J
  • 2nd Forename e.g. Peter
  • 2nd Initial e.g. P
  • Title e.g. Mr
  • Gender e.g. Male

I have ranked these by a combination of importance, likelihood of being captured and discriminating power, though it has to be said that this is more of a “vanilla” list, and different businesses may have different views on what is available and how important each attribute is. This is a subject that can spark much debate. I am deliberately ignoring biometrics / facial recognition for obvious reasons.
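
As a rough sketch of how such a record might be held in code, the Python class below declares the eight attributes in the ranked order given above and allows any of them to be missing. The class name, field names and the `populated` helper are illustrative assumptions, not part of any particular Single Customer View product.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PersonRecord:
    """One digital representation of a person; any attribute may be missing."""

    # Fields are declared in the ranked order used in the list above.
    surname: Optional[str] = None
    forename_1: Optional[str] = None
    date_of_birth: Optional[str] = None  # kept as text, e.g. "05/07/1989"
    initial_1: Optional[str] = None
    forename_2: Optional[str] = None
    initial_2: Optional[str] = None
    title: Optional[str] = None
    gender: Optional[str] = None

    def populated(self) -> list[str]:
        """Names of the attributes that actually hold a value, in ranked order."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


# A fictitious, sparsely populated record.
record = PersonRecord(surname="Smith", initial_1="J", title="Mr")
print(record.populated())  # ['surname', 'initial_1', 'title']
```

Treating every attribute as optional is a deliberate choice here: it makes incomplete records a normal case rather than an error.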

Nevertheless, each of the 8 elements above can be either present or absent, which yields 2^8 = 256 different combinations of elements (255 if you discount the combination in which nothing is populated). Though of course some are very (hopefully) unlikely to be encountered: why would you have a gender only and nothing else!? And some elements would effectively be redundant: if you have a full forename then you don’t really need the corresponding initial.
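
The arithmetic can be checked with a few lines of Python. This is only a sketch: the attribute names are my own labels for the eight elements, and "redundant" is interpreted narrowly as an initial being populated alongside its own full forename.

```python
from itertools import product

ATTRIBUTES = [
    "surname", "forename_1", "date_of_birth", "initial_1",
    "forename_2", "initial_2", "title", "gender",
]

# Each pattern is a tuple of 8 booleans: True means the attribute is populated.
patterns = list(product([False, True], repeat=len(ATTRIBUTES)))


def is_redundant(pattern):
    """True if an initial is present alongside its own full forename."""
    present = dict(zip(ATTRIBUTES, pattern))
    return (present["forename_1"] and present["initial_1"]) or (
        present["forename_2"] and present["initial_2"]
    )


non_empty = [p for p in patterns if any(p)]
non_redundant = [p for p in patterns if not is_redundant(p)]

print(len(patterns))       # 256
print(len(non_empty))      # 255
print(len(non_redundant))  # 144
```

The count of 256 includes the pattern in which nothing at all is populated, and the last figure shows how quickly the list shrinks once combinations carrying a redundant initial are set aside.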

But I feel it is important to go into this level of detail when planning a Single Customer View solution as it will focus minds on what is important.

Of course, even the best quality name data cannot guarantee uniqueness. How many “Mr John Peter Smith”s born on 5th July 1989 are there in the country / world? Likely more than one. So when searching for duplicates within a data set, it is normal to reduce the search to likely candidates. Candidates for matching might be those person records with the same postal address, the same mobile phone number or even the same email address.

Even the use of candidate criteria cannot absolutely guarantee uniqueness, particularly when incomplete data comes into the equation. But in reality, and under the guise of “best endeavours”, all Single Customer View solutions will use some form of “pre-matching” in order to reduce the number of candidate matches to a manageable number, and to provide the level of confidence that acceptable matches do in fact represent the same real world person.
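
To make the idea of pre-matching concrete, here is a minimal Python sketch that only pairs up records sharing a postal address or an email address (an approach often referred to as “blocking” in record linkage). The records are fictitious, and the field names and the `candidate_pairs` function are assumptions made for illustration; real solutions will standardise addresses far more carefully than a simple lower-casing.

```python
from collections import defaultdict
from itertools import combinations

# Entirely fictitious records; the field names are illustrative assumptions.
records = [
    {"id": "A", "name": "Mr John P Smith", "address": "1 High Street", "email": "jps@example.com"},
    {"id": "B", "name": "Mr J Smith", "address": "1 High Street", "email": None},
    {"id": "C", "name": "Mr John Smith", "address": "9 Low Road", "email": "jps@example.com"},
    {"id": "D", "name": "Ms Ann Jones", "address": "4 Mill Lane", "email": None},
]


def candidate_pairs(records, keys=("address", "email")):
    """Return the id pairs that share a non-empty value on any candidate key."""
    pairs = set()
    for key in keys:
        blocks = defaultdict(list)
        for rec in records:
            value = rec.get(key)
            if value:  # a missing value never places a record in a block
                blocks[value.strip().lower()].append(rec["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


print(sorted(candidate_pairs(records)))  # [('A', 'B'), ('A', 'C')]
```

Four records could be compared in six different pairs; the candidate criteria cut that to two. Note that records B and C never become candidates for each other, because B has no email address, which is one small illustration of how incomplete data affects the outcome.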

Quality of Match vs Quality of Data.

For the purpose of this discussion, I am going to assume that all examples used are suitable candidate matches in that the subjects reside at the same postal address.

Record  Title  Forename  Initial  Surname  DoB
A       Mr     John      P        Smith    05/07/1989
B       Mr     John      P        Smith    05/07/1989

In this example, there are no differences between records A and B. By anybody’s standards this is a good match. You would be confident in saying that record A and record B belong to the same real world entity.

Now consider this example.

Record  Title  Forename  Initial  Surname  DoB
A       Mr     J                  Smith
B       Mr     J                  Smith

In this case, the quality of the match is the same as in the previous example. All the elements present in record A are also present in record B. In matching terms it is a perfect match. The problem is that the quality of the data does not provide the same level of confidence that record A and record B belong to the same person.
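
One way to picture the distinction is to score the two things separately. The Python sketch below is my own simplification rather than how any particular matching product works: `match_quality` is the share of populated fields on which the two records agree, and `data_quality` is simply the share of fields populated in both records. The field names, the scoring rules and the placement of “J” in the forename field are all assumptions.

```python
FIELDS = ("title", "forename", "initial", "surname", "dob")


def compare(rec_a, rec_b, fields=FIELDS):
    """Return (match_quality, data_quality), each between 0.0 and 1.0."""
    both = [f for f in fields if rec_a.get(f) and rec_b.get(f)]
    either = [f for f in fields if rec_a.get(f) or rec_b.get(f)]
    agree = [
        f for f in both
        if rec_a[f].strip().lower() == rec_b[f].strip().lower()
    ]
    # Agreement is measured only over fields that at least one record holds.
    match_quality = len(agree) / len(either) if either else 0.0
    # Completeness is measured over every field we would like to have.
    data_quality = len(both) / len(fields)
    return match_quality, data_quality


full = {"title": "Mr", "forename": "John", "initial": "P",
        "surname": "Smith", "dob": "05/07/1989"}
sparse = {"title": "Mr", "forename": "J", "surname": "Smith"}

print(compare(full, dict(full)))      # (1.0, 1.0)
print(compare(sparse, dict(sparse)))  # (1.0, 0.6)
```

Both pairs score a perfect match, but only the first pair is backed by complete data. A fuller treatment would also recognise that a single letter in the forename field is worth less than a full forename; this sketch counts it as populated.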

If you are a business embarking on a Single Customer View implementation, it is vital that you establish with your solution provider how they deal with such data quality issues.

And as a business the single most important thing that you can do to maximise success is to tackle data quality.

As we will see in future articles, there can be many undesirable and unexpected outcomes related to how matching software / algorithms make their decisions, and how the combinations of data and even the sequencing of decision making can lead to less than optimal results.

And let’s face it, you may never achieve perfect data quality across all data sources. So the importance of understanding the complex relationship between matching algorithms, data / software architecture and data quality is not to be underestimated.

In the next article we will build upon what we have already covered and start to look at “fuzzy” matching, covering both its strengths and its weaknesses.
