Forget Big Data and worry about Bad Data
Two consulting projects this year have involved lots and lots of data. One was the migration of a very complex customer database and transaction logging system to a cloud-based CRM platform from a homegrown system. The other involved performing serious analytics on a non-profit’s membership system that had data spanning decades.
Both projects required an incredible amount of manual intervention in the data processing. Data came from different original sources and had wildly varying schemas. Some data was relational, and some was flat-file. Some of the data was clearly contradictory. Timestamps were missing on records. Valuable data was stored in comment fields. Documentation didn’t exist. Keys were lost. Fields were abandoned. Live data was mixed with archival data.
Both systems were a mess—but that wasn’t the problem. The issues I’m describing are the everyday result of messy software, evolving databases and the real world. Solving those challenges takes some effort, but we all know the importance of factoring data cleansing into any type of migration or analytics project, both in terms of time and of finances.
The real challenge is that most of the data was totally wrong. No resemblance to reality. That person never lived at that address. The relationships in the SQL database were not correct. Conventions were nonexistent. When data is being collected by many systems—and stored in many systems—over years and decades, this is what happens.
Yes, both projects were successful. However, we had to throw away a lot of data that would have added value to the organizations and their customers or members. Worse, we learned that both organizations had been using bad data for years, resulting in missed opportunities, less-than-ideal customer service, and flawed business planning.
Garbage in, garbage out. After all, if you are thinking about offering a new product or service, and are basing your decisions on bad data, you aren’t making a good decision. You are guessing.
What went wrong? It wasn’t in the migration and analytics projects. We went in, cleaned up the data as best we could, and got out. It was a finite task and went as well as could be expected.
The root causes weren’t bad programming either, or poor database administration. In many IT shops, schemas change. Documents are lost. Corruption happens. Ideas are tried and abandoned. That’s simply what happens when data is kept past its sell-by date.
The failure is that nobody regularly (or ever) checked the data to make sure that it was still good. Nobody performed periodic data hygiene. Nobody tested addresses, or eyeballed records to see if they made sense, or validated the databases against other sources (or even against themselves).
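To make "periodic data hygiene" a little more concrete, here is a minimal sketch of the kind of check that could run on a schedule. Everything in it is hypothetical: the function name `audit_records`, the field names (`id`, `postal_code`, `updated`, `customer_id`), and the three rules are assumptions for illustration, not the schema or tooling from either project. A real audit would also compare records against outside sources, such as an address verification service.

```python
from datetime import date


def audit_records(customers, orders, today=None):
    """Return (record_id, problem) pairs for rows that look suspicious.

    Hypothetical layout: customers and orders are lists of dicts.
    """
    today = today or date.today()
    problems = []
    known_ids = {c["id"] for c in customers}

    for c in customers:
        # Completeness: an address without a postal code cannot be tested.
        if not c.get("postal_code"):
            problems.append((c["id"], "missing postal code"))

        # Timestamps: missing or impossible values suggest a broken import.
        updated = c.get("updated")
        if updated is None:
            problems.append((c["id"], "missing timestamp"))
        elif updated > today:
            problems.append((c["id"], "timestamp in the future"))

    # Self-consistency: every order should point at a customer that exists.
    for o in orders:
        if o["customer_id"] not in known_ids:
            problems.append((o["id"], "order references unknown customer"))

    return problems
```

The point is less the specific rules than the habit: run something like this every month, track how many problems it reports, and investigate when the number climbs.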
Data is a valuable corporate asset. In fact, when it comes to customer data and transaction records, data may be the single biggest asset of your company. Most companies work hard to ensure that their assets are solid. A manufacturer checks its raw materials and finished goods to ensure that they are as expected. Materials in warehouses are inventoried. Random samples are pulled from time to time, tested, and examined carefully.
When it comes to data, long-term quality is rarely a consideration. Data is stored and used. Is it checked? Rarely, if ever. We all know the benefits of Big Data for our business. What about the costs of Bad Data? Unknown, but real. I’ve seen this time and again. As Bad Data is used and reused, it will only get worse.