There is an old adage – You cannot manage what you do not measure.
This is very true when it comes to Data Quality. How do you know if your data quality is good or bad? If you have any sort of Data Governance program, how do you know if it’s having a positive or negative effect on your Data Quality? The simple answer is, unless you know what you are measuring you do not. This is why the first step in any Data Quality initiative must be – measure, report and benchmark.
But measure what? And how?
Well in order to get started there is, what can be a painstaking task, to establish what good looks like for every piece of data that you are interested in. Normally by “piece of data” we mean “column in a database”.
So for instance if we have an Opt-In flag we might expect this to be either Y or N. So the test for bad would be something like – if Opt-IN flag is not Y and is not N then count at bad. If we test all rows in the database we will get a total count which we can turn into a simple % to tell us what the data quality of our Opt-In flag is e.g. 10% of customers do not have a valid Opt-in flag.
This is probably the simplest type of Data Quality test which can be applied to any piece of data where there are a limited and known set of values.
Another type of test involves checking that a data item conforms to an identifiable “pattern” i.e. only alphabetic characters allowed, only numeric, date format must be DDMMCCYY etc. Any data item which does not conform to the acceptable pattern or business rule can be counted as bad.
We can also consider bad combinations of data items i.e. data items which on their own might be considered OK but which in combination with the setting of another data item may form an illegal combination. For instance, a particular product may only be available within a specific territory. If a combination of product and territory is found to be incompatible then this can be counted as bad.
Other types of data quality issues can be more difficult to spot and in order to find them we have to look at multiple records. For instance, what is wrong with this:-
“Mr John J Smith”
On the face of it nothing. All the elements are valid alpha characters – perfectly OK.
However if we look at more data :-
“Mr John J Smith”,
“Mr Peter P Jones”,
“Mrs Mary M Brown”,
“Ms Jennifer J Small”,
“Mr Brian B Green”…
A pattern emerges – all the 2nd initials are the same as the first initial. Now each name in its own right may be OK. It is perfectly possible for people to have multiple forenames that begin with the same letter. But if every customer record in your database looked like this then it would indicate some sort of processing error. The question is, what is the expected % of people who genuinely do have multiple forenames with the same first letter? In order to answer this question you need to find a suitable baseline dataset whose quality is known. In this particular example you may be able to access the UK Voters roll to establish a baseline %. For example 7% of the UK voters have the same 1st and 2nd initial – any % above 7% could indicate a quality issue.
Data is many and varied. I have touched on a few examples to indicate the techniques that can be applied to measure Data Quality. But in essence it boils down to identifying the business rules that constitute valid data and then devising the specific tests which establish conformity against those rules.
In establishing the business rules you may need a little imagination and creativity.