How healthy is your data?
Posted on February 21, 2017
One of the biggest misconceptions in business is the belief that your data is accurate and a sound basis for decisions.
In reality, you are looking at a set that has been collated from a range of inputs.
The 'data' is never right or wrong – it is the set that you have assembled (or that someone else has assembled for you) and it is only as good as the assembly process.
Set theory was first developed by Georg Cantor in 1874 but languished without a practical application. The first such application was provided by Ted Codd at the IBM Research lab in 1970, with his seminal paper describing the rules for how relational databases would work.
Some far-sighted educationalists started introducing set theory into the math curriculum as early as 1967, which was controversial at the time with most parents failing to understand how important and pervasive set theory would become.
There are many things that can go wrong when joining data components together. A healthy database ensures that data is fully connected, which in turn allows robust sets to be created. The content of a set is controlled by filtering, by the assumptions of the collator, and by the presence of conditional data: did the person constructing the set understand all the nuances hidden in the data?
Data can become unhealthy in many ways, for example where data entry forms allow ambiguous entries, or allow data that belongs in one field to be stored in another. Does your system allow records to be saved with incomplete data? How often do your business processes and incentives discourage front-line staff from entering all the data that you will need to construct robust sets for reporting and analysis?
Do your report writers and designers ever calculate what the final set should look like before constructing their sets and then compare actual vs predicted, or do they just build a report and hope that it's 'right'?
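As a rough illustration of comparing predicted against actual, the sketch below predicts how many rows a report should contain before building it, then measures the shortfall. The `orders` and `customers` records, the field names and the `inner_join` helper are all hypothetical, invented for this example rather than taken from any particular system.

```python
def inner_join(left, right, key):
    """Return one merged row per matching (left, right) pair on `key`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    # Rows in `left` with no match in `right` silently drop out here.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical source data: three orders, one saved without a customer.
orders = [
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C2"},
    {"order_id": 3, "customer_id": None},  # customer never captured
]
customers = [
    {"customer_id": "C1", "name": "Acme"},
    {"customer_id": "C2", "name": "Birch"},
]

# Prediction made before building the report: every order should appear.
predicted_rows = len(orders)

report = inner_join(orders, customers, "customer_id")
actual_rows = len(report)
shortfall = predicted_rows - actual_rows

print(f"predicted {predicted_rows}, actual {actual_rows}, missing {shortfall}")
```

A non-zero shortfall does not say which assumption was wrong, only that the set is not the one the report writer expected, which is the point at which to investigate rather than publish.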
Does your system store spatial data in a poor relational schema that encourages unhealthy set building? How valid are the datasets that you are relying on? When components are joined, data only makes it through into the final set if it is fully connected. How much of your data is simply hidden from you when you are making your decisions?
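To make the idea of hidden data concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `site` and `sale` tables and their contents are assumptions made up for the example. An ordinary join reports only the sales that connect to a known site, while a left join filtered on missing matches lists the sales that never reach the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site (site_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, site_id INTEGER, amount REAL);
    INSERT INTO site VALUES (1, 'North'), (2, 'South');
    -- Sale 12 has no site recorded; sale 13 points at a site that does not exist.
    INSERT INTO sale VALUES (10, 1, 100.0), (11, 2, 250.0),
                            (12, NULL, 75.0), (13, 9, 40.0);
""")

# What a typical report sees: only fully connected rows survive the join.
reported_total = conn.execute(
    "SELECT SUM(sale.amount) FROM sale "
    "JOIN site ON site.site_id = sale.site_id"
).fetchone()[0]

# What the report never shows: sales with no matching site.
hidden = conn.execute(
    "SELECT sale.sale_id, sale.amount FROM sale "
    "LEFT JOIN site ON site.site_id = sale.site_id "
    "WHERE site.site_id IS NULL ORDER BY sale.sale_id"
).fetchall()
hidden_total = sum(amount for _, amount in hidden)

print(f"reported: {reported_total}, hidden: {hidden_total} in {len(hidden)} sales")
```

In this invented data set, about a quarter of the sales value never appears in the regional report, and nothing in the report itself signals that anything is missing.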
There is plenty that can go wrong when compiling a set of data even when the data is in a healthy condition. Much more can and will go wrong when it is not. To make sound decisions for your business you need to have healthy data.
What processes do you have in place to understand the health of your data? Chances are, if you haven't got these processes in place already, then it's not going to happen without outside help. You need to be able to quantify the health of the data in your system. You need to be able to identify problem data areas so that preventative strategies can be put in place, which stops the problem growing. You then have the option of rectifying existing problems should you choose, prioritising data that affects critical decisions and processes.
Finally, you need to be able to measure whether the health of your data is improving or deteriorating over time. By linking unhealthy data to the actual costs, losses and poor KPI performance that the business is exposed to, you can demonstrate a clear business case for taking positive action.
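One simple way to start quantifying data health is a completeness score per field: the fraction of records in which that field actually holds a value. The sketch below is only one possible measure, and the records, field names and `completeness` function are hypothetical. Sorting the scores puts the weakest fields first, which gives a starting point for prioritising.

```python
def completeness(records, fields):
    """Return the fraction of records with a non-empty value for each field."""
    total = len(records)
    result = {}
    for field in fields:
        # Treat both a missing value (None) and an empty string as not filled in.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result

# Hypothetical customer records with gaps left at data entry.
records = [
    {"name": "Acme", "postcode": "3000", "phone": ""},
    {"name": "Birch", "postcode": None, "phone": "555-0101"},
    {"name": "Cedar", "postcode": "3121", "phone": None},
    {"name": "", "postcode": "3004", "phone": None},
]

scores = completeness(records, ["name", "postcode", "phone"])
worst_first = sorted(scores, key=scores.get)  # least complete field first

for field in worst_first:
    print(f"{field}: {scores[field]:.0%} complete")
```

Running the same calculation on a regular schedule and keeping the results would give a basic trend line for whether each field is getting better or worse.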