I have seen quite a few posts recently from the Data Science community, bemoaning the amount of time spent on addressing data quality in preparation for getting into the science part of the role.
When I (and I’m sure many other people) reflect on my data quality experience, I automatically think about data elements such as names, addresses, postcodes, dates, product codes, product descriptions, account types etc. Data that, for the most part, originates from a human being: typing something in via a keyboard, selecting an item from a drop-down list, copying and pasting etc. These types of data elements have business rules (which may or may not be enforced), and quality issues arise because human beings do not follow those rules. There are many ways in which these quality issues can be addressed, and they will in some form involve applying those business rules (in some cases retrospectively).
But this sort of data is not what I would associate with “Big Data”. It is data about customers, products, assets etc. In business terms these are the business entities; in data warehousing (DWH) terms these are the dimensions. And whilst it is important that this data is accurate, I am not sure that this is the sort of data that the Data Science community is really interested in from an analytics perspective (though somebody please put me right).
The Big Data of interest to Data Scientists does of course originate in many cases from human activity: clicks, views, likes, re-tweets, items viewed, shopping baskets, generated invoices, delivery receipts etc. But it also comes from machines, such as sensors in smart meters, CCTV etc.
Where this type of data differs from the dimensional data mentioned above is that it is not (and cannot be) generated directly by human beings. When you click on a URL, you do not record the IP address, URL, timestamp etc.; it is done for you “behind the scenes”. When your smart meter sends its readings back to the data collectors, you are not entering the readings; it is done on your behalf.
So whilst we can say that dimensional data originates directly from human beings, Big Data in the form of Transactions or Facts originates from machines.
And it follows that Dimensional Data Quality issues are created by human beings, but Big Data Quality issues are generated by the machines (or the people that program them).
I can see that some data quality problems may be similar across these two data types, e.g. data being truncated due to poor database design, or items being missing due to human error. But whereas dimensional quality issues may impact a single data item in a single row of data, machine-generated issues are more likely to impact in bulk: an entire batch of data going missing due to connection or transmission problems, or a bug-ridden app being made live and producing a whole day’s worth of corrupt data.
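That bulk characteristic also suggests how machine-generated issues can be detected: rather than validating individual fields against business rules, you check whether the expected volume of records actually arrived. As a minimal sketch (the half-hourly smart-meter feed and the timestamps are hypothetical, purely for illustration), a gap check might look like this:

```python
from datetime import datetime, timedelta

def find_missing_intervals(readings, start, end, interval_minutes=30):
    """Return the expected reading timestamps absent from `readings`.

    readings: set of timestamps actually received.
    start / end: inclusive window in which readings were expected.
    """
    missing = []
    expected = start
    step = timedelta(minutes=interval_minutes)
    while expected <= end:
        if expected not in readings:
            missing.append(expected)
        expected += step
    return missing

# Hypothetical smart-meter feed: one reading every 30 minutes for a day,
# with a two-hour transmission gap starting at 10:00.
received = {datetime(2023, 1, 1, 0, 0) + timedelta(minutes=30 * i)
            for i in range(48)}
for i in range(4):  # drop 4 consecutive half-hourly readings
    received.discard(datetime(2023, 1, 1, 10, 0) + timedelta(minutes=30 * i))

gaps = find_missing_intervals(received,
                              datetime(2023, 1, 1, 0, 0),
                              datetime(2023, 1, 1, 23, 30))
print(len(gaps))  # 4 missing slots, all within the transmission gap
```

The point is that the "rule" being applied here is about the shape and completeness of the feed, not about the content of any individual value, which is exactly the inversion of the dimensional case.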
So whilst we tend to use the term Data Quality as a catch-all for any data that does not meet our requirements, I think it is useful to differentiate between the two general sources (human vs machine), since identifying and resolving them will require subtly different approaches.
It would be interesting to hear from the Data Science community about some real-life data quality issues and how they were resolved…