Common Data Problems #1

I’d like to share some of my experiences relating to common (and maybe not so common) data issues. You see, I spent many years working for a service provider – a bureau operation – and processed literally thousands of data sets from hundreds of data providers across most business sectors. I shudder to think how many billions of customer records have passed through my hands over the years.

Anyway, without a doubt, the bane of my life was a problem which I called simply – “the phantom second initial”.

 Let me illustrate.

 What’s wrong with this – “Mr John J Smith”?

Answer – nothing on the face of it.

 What’s wrong with this – “Mr John J Smith, Mr Peter P Jones, Mr Alan A Prost, Mr Brian B Green”?

The problem here is that the initial in each name is probably just a repeat of the first character of the first name. Of course, one cannot be certain simply by looking at the individual names: each may be perfectly legitimate. To know whether a data source has a problem, you need a baseline: what proportion of individuals genuinely have more than one name starting with the same letter? If the true value is, say, 3% but the rate in your data source is significantly higher, then you may have a problem.
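To make the baseline comparison concrete, here is a minimal sketch in Python. It assumes names arrive as single strings in a "title, first name, one-letter initial, surname" layout, and it measures the rate among names that actually carry an initial. The function names, the title list, and the `tolerance` multiplier are my own illustrative assumptions, and the 3% baseline is the hypothetical figure from above, not a measured one.

```python
import re

# Assumed layout: optional title, first name, single-letter initial, surname.
# The title list is illustrative, not exhaustive.
TITLE = re.compile(r"^(?:Mr|Mrs|Ms|Miss|Dr)\.?\s+", re.IGNORECASE)
NAME = re.compile(
    r"^(?P<first>[A-Za-z'-]+)\s+(?P<initial>[A-Za-z])\.?\s+(?P<surname>[A-Za-z'-]+)$"
)


def repeated_initial_rate(names):
    """Share of parseable names whose initial repeats the first letter
    of the first name. Names without an initial are skipped."""
    parsed = 0
    repeated = 0
    for name in names:
        rest = TITLE.sub("", name.strip(), count=1)
        match = NAME.match(rest)
        if match is None:
            continue  # no initial present, or an unexpected layout
        parsed += 1
        if match.group("first")[0].upper() == match.group("initial").upper():
            repeated += 1
    return repeated / parsed if parsed else 0.0


def looks_suspicious(names, baseline_rate, tolerance=2.0):
    """Crude check: flag the source if its rate exceeds the baseline by
    more than `tolerance` times. A proper significance test would be
    better on small samples; this is only a first-pass screen."""
    return repeated_initial_rate(names) > baseline_rate * tolerance


sample = ["Mr John J Smith", "Mr Peter P Jones", "Mr Alan A Prost", "Mr Brian B Green"]
rate = repeated_initial_rate(sample)
flagged = looks_suspicious(sample, baseline_rate=0.03)
```

On the four example names every initial repeats the first letter, so the rate is far above any plausible baseline and the source is flagged. On real data you would want a much larger sample before drawing conclusions.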

But let me tell you – this is one of the most common and insidious data problems out there. Not only is the data wrong, it is also difficult to identify on an individual name basis. It causes difficulty in matching systems, leading to duplicates and ambiguities, with the associated undesirable reactions from customers – I am not impressed if a service provider gets my name wrong or sends me duplicate mailings.

The most likely source of this problem is poorly designed data capture screens, or inadequate training of operators using said screens. One of the most important principles you can apply when designing such screens (and this applies to survey questions also) is that if something can be misinterpreted, it will be misinterpreted. If you ask for a first name you will most likely get a first name. If you then ask for initials, do you mean the initial of the first name, the second name, or all names including the surname? In answer to that question I could reply “R”, “RA”, “A”, “AD” or “RAD” (in case you’re wondering, my first name is Robert but I don’t use it).

I once came across this phantom initial problem in data from a major financial services provider. Following an analysis of the data it became apparent that the source of the error could be traced to a single branch of the organisation. Whether this was caused by a problem with the data capture screen used by this particular branch or whether it was simply a training issue, I was never able to ascertain.
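This kind of analysis amounts to breaking the repeated-initial rate down by source. The sketch below assumes each record has already been parsed into a (branch, first name, initial) tuple. The branch codes and records are made up for illustration and are not the data from that engagement.

```python
from collections import defaultdict


def repeated_initial_rate_by_source(records):
    """records: iterable of (source, first_name, initial) tuples.
    Returns {source: share of records whose initial repeats the first
    letter of the first name}. Records missing either field are skipped."""
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for source, first_name, initial in records:
        first_name = (first_name or "").strip()
        initial = (initial or "").strip()
        if not first_name or not initial:
            continue
        totals[source] += 1
        if first_name[0].upper() == initial[0].upper():
            repeats[source] += 1
    return {source: repeats[source] / totals[source] for source in totals}


# Made-up records: branch "B017" shows the phantom initial, "B042" does not.
records = [
    ("B017", "John", "J"),
    ("B017", "Peter", "P"),
    ("B017", "Alan", "A"),
    ("B042", "Brian", "K"),
    ("B042", "Susan", ""),   # no initial captured, skipped
    ("B042", "Mary", "E"),
]
rates = repeated_initial_rate_by_source(records)
worst_first = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Sorting the sources by rate puts the likely culprit at the top. A single source with a rate far above the rest points to a local capture or training problem rather than something in downstream processing.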

 There are of course other sources of this problem including data processing and ETL errors.

 One thing is for certain – organisations need to take this problem seriously. They need to look at their data input processes and take all reasonable steps to ensure that they capture what they expect to capture. In addition, regular data profiling needs to be conducted to identify where erroneous data may be being drip-fed into the organisation.

If you have a pet data issue, why not share it? Comment on this blog or email me at

This entry was posted in Data Quality.
