In this post I am going to continue to look at the logic involved in identifying duplicate customers that I began in my previous post (http://wp.me/p2lbpw-au).
To recap, and for ease of explanation, all examples given will use personal data (names) that are possible candidates for matching by virtue of residing at the same postal address (other candidate logic is available).
Computers are Stupid
I want to start by saying that, when it comes to identifying matching customer records, computers are really, really stupid. Compared to the human eye and brain they are useless.
A normal test of equality (Does A = B) will only be true if every single character in A is identical to every single character in B.
- Does “John P O’Connel” equal “John P O’Connel” would be True.
- Does “Andrew Jones, 5/7/1986” equal “Andrew Jones, 5/7/1986” would be True.
- Does “Mr Peter J Smith, 2/11/78” equal “Mr Peter J Smith, 2/11/78” would be True.
- Does “John P O’Connel” equal “John P OConnel” would be False – missing apostrophe.
- Does “Andrew Jones, 5/7/1986” equal “Andrwe Jones, 5/7/1986” would be False – common typo.
- Does “Mr Peter J Smith, 2/11/78” equal “Mr Peter Smith, 2/11/78” would be False – missing initial.
- Does “Mr Peter J Smith, 2/11/78” equal “Mr Peter j Smith, 2/11/78” would be False – different casing.
- Does “Andrew Jones, 5/7/1986” equal “Andrew  Jones, 5/7/1986” would be False – extra space.
Like I say, computers are really stupid.
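To make this concrete, here is a minimal sketch in Python (chosen purely for illustration) that runs a plain equality test over a few of the pairs above. The variable names are hypothetical, and the position of the extra space in the last pair is an assumption.

```python
# Each pair is (A, B), mirroring examples from the list above.
pairs = [
    ("John P O’Connel", "John P O’Connel"),                      # identical
    ("John P O’Connel", "John P OConnel"),                       # missing apostrophe
    ("Andrew Jones, 5/7/1986", "Andrwe Jones, 5/7/1986"),        # common typo
    ("Mr Peter J Smith, 2/11/78", "Mr Peter j Smith, 2/11/78"),  # different casing
    ("Andrew Jones, 5/7/1986", "Andrew  Jones, 5/7/1986"),       # extra space (assumed position)
]

# A plain equality test compares the strings character by character.
results = [a == b for a, b in pairs]

for (a, b), same in zip(pairs, results):
    print(f"{a!r} == {b!r} -> {same}")
```

Only the first pair comes back as equal; every other pair fails on a single character.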
Overcoming Data Variations
Getting a computer to recognise that two entities with slight differences in their data are in fact the same entity requires the application of something which over the years has become generally known as “Fuzzy Logic”.
Fuzzy logic is a human construct. Computers do not do fuzzy.
So a human has to tell the computer how to recognise that two things that are by definition different are actually the same. And it is worth remembering that when we employ fuzzy logic we are choosing to match things that by definition are not the same. We will return to this later.
This is not the place to go into all the different forms of fuzzy logic that are available. Suffice to say that such programming techniques developed by human beings (e.g. soundex, hashing, numerical equivalencing, distance matching…) all work differently and have different applications. But all of them give us the ability to match data that is different.
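As one illustration of distance matching, the sketch below counts the single-character insertions, deletions and substitutions needed to turn one string into another (the Levenshtein edit distance) and treats two strings as a match when that count is small. The function names and the threshold of 2 are assumptions made for this example, not a recommendation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def is_fuzzy_match(a: str, b: str, max_distance: int = 2) -> bool:
    # max_distance is an arbitrary, hypothetical threshold for this sketch.
    return edit_distance(a, b) <= max_distance


print(edit_distance("John P O’Connel", "John P OConnel"))
print(edit_distance("Andrew Jones", "Andrwe Jones"))
print(is_fuzzy_match("Jan", "Jon"))
```

The last call shows the risk: “Jan” and “Jon” are within the threshold, so the computer will happily match two names that may well belong to different people.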
Eliminating Data Variations
If all the customer data (Name and Address Data) that we had were perfect – every field populated with no mistakes – then identifying duplicates would be easy. We would not have to make difficult choices. We would not have to use fuzzy logic. There would be no risk.
But in the real world perfection does not exist.
But that does not mean that we cannot strive for it.
As well as doing everything we can to ensure that data is captured accurately at source, we can apply data cleansing techniques to the data that we already have:-
- Parsing – breaking the data into its component parts, e.g. Title, Forename, Surname.
- Standardising casing – e.g. convert everything to upper case.
- Cleansing – remove spurious characters and excess spaces.
- Correcting / expanding – e.g. convert “Albert St.” to “Albert Street”. Convert “Bob” to “Robert”.
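The four steps above can be sketched in a few lines of Python. This is a simplified illustration only: the lookup tables, the function names and the assumption that the last token of a name is the surname are all hypothetical, and a real solution would need far larger, curated reference data.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration.
TITLES = {"MR", "MRS", "MS", "MISS", "DR"}
NICKNAMES = {"BOB": "ROBERT"}
STREET_ABBREVIATIONS = {"ST": "STREET"}


def cleanse(text: str) -> str:
    """Standardise casing, remove spurious characters, collapse spaces."""
    text = text.upper()
    # Keeps only A-Z, 0-9 and whitespace; accented letters would be lost too.
    text = re.sub(r"[^A-Z0-9\s]", "", text)
    return " ".join(text.split())


def standardise_name(raw: str) -> dict:
    """Parse a name into title, forenames and surname, expanding nicknames."""
    tokens = cleanse(raw).split()
    title = tokens.pop(0) if tokens and tokens[0] in TITLES else ""
    surname = tokens.pop() if tokens else ""  # assumes the surname comes last
    forenames = " ".join(NICKNAMES.get(t, t) for t in tokens)
    return {"title": title, "forenames": forenames, "surname": surname}


def standardise_street(raw: str) -> str:
    """Expand a trailing street abbreviation, e.g. ST -> STREET."""
    tokens = cleanse(raw).split()
    if tokens and tokens[-1] in STREET_ABBREVIATIONS:
        tokens[-1] = STREET_ABBREVIATIONS[tokens[-1]]
    return " ".join(tokens)


print(standardise_name("Mr  Bob j. Smith"))
print(standardise_street("Albert St."))
print(cleanse("John P O’Connel") == cleanse("John P OConnel"))
```

After cleansing, the two spellings of O’Connel compare as equal under an ordinary equality test, so no fuzzy logic is needed for that pair. Only the final token of a street is expanded here, since “St” at the start of an address could just as easily mean “Saint”.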
What we are trying to do is to remove the variation within our data. By doing so we are reducing our reliance on fuzzy logic (remember, fuzzy logic is matching things that are different). And we are reducing the likelihood of making an erroneous match decision.
The bane of any Single Customer View matching solution is variation in data. The question “When does a variation become a genuine difference?” has caused me many sleepless nights over the years.
In my next post we will continue to look at matching imperfect data and consider how, as a business, we can manage uncertainty.