When 83% of companies believe their revenue is affected by data quality problems, we need to take a look at what is going wrong. Few endeavors have received as much media attention, promise, and even hype as the revolution organizations are undergoing today to become data-driven. But the value from that investment and work evaporates when the data is misinformation.
Is that really common?
According to one Harvard Business Review (HBR) survey, 47% of the data records reviewed had critical, work-impacting errors. It’s also worth noting that only 3% of those companies had an acceptable level of accuracy in their data*. The frightening conclusion of the survey is that an erroneous data record, once created, is 10 times as expensive to correct as maintaining a correct one.
There are some simple steps to make sure you are on the right track to solving this problem. The first is to understand what data quality looks like and what kinds of errors exist. Most of the time when erroneous data is mentioned, the problems that come to mind are mistakes made at the time of data entry. Selecting the wrong date for a customer’s birthday when collecting information for marketing, or misspelling their last name when entering them into a CRM, are obvious examples.
The real culprit in most cases, though, is a failure to manage vocabulary. Finding agreement on the definition of terms like “churn,” “average customer value,” or even basic terms like “account” or “lead” is a challenge. Each individual has their own perception of the meaning. If you go around the organization and ask each person to write down their definition, the confusion becomes clear. Most people who hold a distinct definition in their mind assume that everyone else must have learned the same one.
The complexity of getting an organization to think in lockstep about its information is easy to underestimate. The problem is exacerbated in growing, mid-size businesses, where previously this could be addressed with hallway conversations. The volume of data has made that approach quaint. The only way to address this common problem is good management. Simple steps can get a team on the right track. I tend to recommend beginning by organizing a group of stakeholders to review the key domain concepts and identify their sources of truth. That is how you create your vocabulary, and once everyone is working with the same definitions, the real work of managing errors begins.
There are three types of errors in data, and most people are only familiar with one. By understanding the different ways that errors can creep into databases, we enable more sophisticated and effective techniques for evaluating and preventing those problems. The first, and most obvious, is the logical error. Logical errors, as mentioned above, are what initially come to mind when data quality is considered. Missing data that was left un-entered, an accidental key press while typing an account name, or an entry button pressed twice that created a duplicate record are intuitive examples of how logical errors and inaccuracies slip unseen into data systems.
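To make the idea concrete, here is a minimal sketch in Python (using the pandas library) of the kind of automated checks that surface logical errors. The column names and sample records are invented purely for illustration and would need to match your own systems.

    # A minimal sketch of automated logical-error checks with pandas.
    # Column names (customer_id, last_name, birth_date) are hypothetical examples.
    import pandas as pd

    records = pd.DataFrame({
        "customer_id": [101, 102, 102, 103],
        "last_name":   ["Smith", "Jones", "Jones", None],
        "birth_date":  ["1985-04-12", "2091-01-01", "1991-01-01", "1978-07-30"],
    })

    # Missing data that was left un-entered
    missing = records[records["last_name"].isna()]

    # Duplicate records created by a double button press
    duplicates = records[records.duplicated(subset=["customer_id"], keep=False)]

    # Obviously impossible values, such as a birth date in the future
    dates = pd.to_datetime(records["birth_date"], errors="coerce")
    impossible = records[dates > pd.Timestamp.today()]

    print(missing, duplicates, impossible, sep="\n\n")

Checks like these are cheap to run on a schedule, and they catch the entry-time mistakes described above before those records propagate to other systems.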
The second type of error is sensor, or granularity, error. This error arises when a system depends on specific information but the source does not provide the needed level of detail. For example, if I have a thermometer that reads in 5-degree increments and I am trying to bring an environment to 123.2 degrees Fahrenheit, then I simply have the wrong tool for the job. (You can use that example to remember why it is called sensor error.) From that example it may sound like something that comes up only in system control and scientific applications, but that is not the case. Consider the GPA calculation at most universities. Many courses assign grades on an A, B, C, D, F scale, but graduates are then asked to report their results as a translation of each letter to a number (say 0.00-4.00), carried to the second decimal place. This is a classic example of the error, right in the midst of academia. Businesses are no exception; the error often appears when categories are translated to numeric values to present a mean that looks highly specific. The mode or the median would be a much better measure in those cases, reflecting the low granularity of the underlying data.
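As a quick illustration, the following Python sketch applies the conventional 4.0-scale translation to an invented list of letter grades. The mean reports a precision the underlying scale never had, while the median and mode stay within it.

    # A minimal sketch of granularity (sensor) error using the GPA example.
    # The letter-to-number mapping is the conventional 4.0 scale; the grades are invented.
    import statistics

    scale = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}
    letter_grades = ["A", "B", "B", "C", "A", "B"]
    points = [scale[g] for g in letter_grades]

    # The mean shows two decimal places of precision the source data never contained.
    print(f"mean:   {statistics.mean(points):.2f}")    # 3.17
    # The median and mode stay within the granularity of the original scale.
    print(f"median: {statistics.median(points):.2f}")  # 3.00
    print(f"mode:   {statistics.mode(points):.2f}")    # 3.00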
Finally, the last error will seem the most foreign, but it is surprisingly common. Imagine Nick and Sally are two different types of drivers. Each uses the company car, and they regularly drive through automatic speed checkers that mail any tickets to the company address. Nick is known for speeding regularly, but Sally is quite safe and only occasionally speeds. When the company receives a speeding ticket in the mail, who is the probable culprit? Of course, we would most likely say it was Nick, even though we know there is a small chance it was Sally. The truth is we cannot know for certain without more evidence. Making that mistake is referred to as Bayes error, named after Thomas Bayes, the 18th-century mathematician who made many contributions to the field of probability. In a business context, when we apply rules to classify records, such as marking a web visitor as a particular demographic simply because of the content they are interested in, we make our systems more intelligent. We also, however, introduce Bayes error. Records classified this way should always be tagged with the underlying prior population information so the error can be accounted for when the data is shared with other systems, business units, or even parties outside the organization.
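For the curious, here is a short Python sketch of the Nick-and-Sally example worked through Bayes’ rule. The speeding rates and driving shares are invented priors purely for illustration.

    # A minimal sketch of the Nick-and-Sally example using Bayes' rule.
    # The speeding rates and driving shares below are invented priors.
    p_speeds = {"Nick": 0.30, "Sally": 0.02}   # chance each is speeding on a given trip
    p_drives = {"Nick": 0.50, "Sally": 0.50}   # share of company-car trips each takes

    # P(ticket) = sum over drivers of P(driver) * P(speeding | driver)
    p_ticket = sum(p_drives[d] * p_speeds[d] for d in p_drives)

    # Posterior P(driver | ticket) via Bayes' rule
    for driver in p_drives:
        posterior = p_drives[driver] * p_speeds[driver] / p_ticket
        print(f"P({driver} | ticket) = {posterior:.2f}")

    # Nick is the probable culprit (about 0.94 with these priors), but the residual
    # chance that it was Sally (about 0.06) is the Bayes error we accept by
    # classifying on the prior alone.

Tagging records with the priors used, as suggested above, lets downstream consumers recompute or at least weigh that residual error for themselves.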
Several tools exist in the open source space that can help with data errors. Of course, there are also proprietary data governance and Master Data Management (MDM) tools designed specifically for enterprises, but they are often a substantial investment for a mid-size business. One popular open source tool is Talend (https://www.talend.com/products/talend-open-studio/), which offers straightforward data integration functionality as well as data governance and quality improvement features. Another surprisingly beneficial tool in this space, if you are looking to develop your own data quality checks, is KNIME (www.knime.com). KNIME is a data integration and machine learning workbench that can do pattern recognition, which is quite relevant to improving quality. Duplicate detection, missing information, outliers, and the calculation of the underlying properties of a distribution to address Bayes error are all tasks that can be accomplished with that platform. Finally, the Hadoop ecosystem has recently matured enough to bring data governance and quality improvement to the space of Big Data analytics. More literature is regularly being published on this topic, and major Hadoop providers are bringing their flavor of solutions to their distributions. Tools aren’t the only option, though: short-term assessments and guidance from outside experts can be valuable ways to start on the right track without an unnecessary upfront investment.
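As a simple illustration of the kind of check these platforms automate, the following Python sketch flags outliers in an invented list of order values using a basic standard-deviation rule; a workbench like KNIME wraps this sort of logic in reusable workflow nodes rather than hand-written scripts.

    # A minimal sketch of an outlier check; the order values are invented sample data.
    import statistics

    order_values = [120.0, 95.5, 101.2, 99.9, 110.4, 98.7, 2450.0, 105.3]

    mean = statistics.mean(order_values)
    stdev = statistics.stdev(order_values)

    # Flag anything more than two standard deviations from the mean for review.
    outliers = [v for v in order_values if abs(v - mean) > 2 * stdev]
    print(f"mean={mean:.2f}, stdev={stdev:.2f}, outliers={outliers}")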
Data quality, though, always starts with management. Once a common vocabulary is assembled and the team has consensus, the errors in each of the organization’s information sources can begin to be addressed. When organizations pay 10 times as much for work performed on bad data, the benefit of a more aggressive investment and better practices is evident.
*To achieve real data quality, we need 97% accuracy on the data records in the system. An evaluation can be conducted by sampling the data available to your team.