Larry Odebrecht first published this article on LinkedIn.
First, let me acknowledge that this is a perennial problem for most businesses. Upstream data quality issues have taken on the tone of a term a colleague of mine coined: learned helplessness. Time and time again I’ve seen wide-eyed, bright, and bushy youngsters head into Ye Olde Data Mine with a song in their head and a dream in their heart of fixing the issues. But they only come out as grizzled old veterans with a dead canary, spouting about how “things ‘round here ain’t never gonna change”.
So, what is the root cause of the problem? Well, even though there are a ton of good tools out there to solve the problems, we keep having the same old issues. This is due to process problems, the procedures we use to garner data, the pressure of our day jobs, related tasks, and the need to provide supportive mental health to the youngsters we are sending into Ye Olde Data Mine. So, while I am an optimist, there is still a data quality career path out there that will have a robust amount of work for the foreseeable future.
Business Intelligence is Only as Good as the Data!
Bigger and bigger data sets are exposing the issue of data quality as businesses start to leverage their data more and more. Many moons ago, I worked for a large big-box retailer in the Twin Cities that was on the cutting edge of using BI to identify and target population subsets using buying history and SKU information. We were able to create a great revenue stream for the company and made a big splash in the NYT (among others) for the capability. Challenge was… the data was not as good as we thought and upstream challenges would change key parts of the data, rendering our models somewhat less than useless.
The stark truth is that we’ll be living in an imperfect world full of challenging data for the foreseeable future. I worked with a client once who was prone to say “the world is full of C- students, Larry” …and so it is with data. We all live with the pain. I know I’m in good company.
So, let’s all get together and say the Data Serenity Prayer to whatever higher power you believe in: “God grant me the serenity to accept the data I should not change, courage to change the data I can, and the wisdom to know the difference (note that the word ‘should’ is doing a lot of work in this prayer).” The Good News is that there is a lot in the “Courage” bucket if you use these three techniques. The EVEN BETTER NEWS is that the downstream impacts of the changes can be even better.
1. Think About the Source Data – It’s Where the Fix Needs to Happen
Too often, the Business Intelligence Team is incented to fix the problems as they see them. That comes in a variety of ways, but often it’s just too hard or too time-consuming to work with the data team to fix the appropriate ETL (Extract, Transform, and Load) processes and/or calculate the downstream impact. So, we put on our super suit and start flying around the company doing little fixes to the data to make our timelines work.
The problem is… every time you win with that strategy – you make it a little harder to win again with that strategy.
Well, sort of like Tech Debt, you’ve created what I call Data Debt. Data Debt is that table that you must manually tweak in order to get the numbers to work. Further, that approach often ends in a larger missed opportunity: the chance to think more deeply about what the business is actually asking for. You may miss an opportunity for a longitudinal table that is reusable and answers more business questions. You may be so narrowly focused that you miss where the business is going.
The Business Intelligence team should get their data with the data quality changes already applied from the Data Lake. Way, way, way too often we look to BI to fix the data. Not BI’s job. Let me repeat that: NOT BI’S JOB.
But that’s exactly where we try to apply the fix. Your ETL processes should be chimp simple. Load the data – as close to the source system as humanly possible – and fix it there before releasing it into the portion of the Data Lake that BI uses. To the right is a typical view of a Cloudera Data Lake (photo credit Cloudera.com). The pattern is straightforward: load the data into the Raw Zone, refine it for the Refined Zone, and allow BI to access it in the Trusted Data Zone. Simple litmus test: with limited coaching, can a non-ETL developer read your ETL code? If yes, you’re on the right track.
Attempting to address a Data Quality issue with BI is very much akin to trying to address software bugs in production or trying to fix a sewage backflow issue with a bottle of Chanel No. 5.
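To make the Raw / Refined / Trusted pattern concrete, here is a minimal sketch in plain Python. It is not Cloudera code and not taken from any real pipeline: the function names, the `sku` and `qty` fields, and the cleanup rules are all hypothetical, chosen only to show the fixes living in one readable place before BI ever sees the data.

```python
def load_raw(source_rows):
    """Raw Zone: copy the source rows as-is, with no fixes applied."""
    return [dict(row) for row in source_rows]


def refine(raw_rows):
    """Refined Zone: apply the data quality fixes here, not in BI.

    Returns the cleaned rows plus the rows that could not be fixed,
    so the rejects can be taken back to the source system.
    """
    refined, rejected = [], []
    for row in raw_rows:
        # Hypothetical rule: SKUs are trimmed and upper-cased.
        sku = (row.get("sku") or "").strip().upper()
        # Hypothetical rule: quantity must be a non-negative integer.
        try:
            qty = int(row.get("qty"))
        except (TypeError, ValueError):
            qty = None
        if sku and qty is not None and qty >= 0:
            refined.append({"sku": sku, "qty": qty})
        else:
            rejected.append(row)
    return refined, rejected


def publish_trusted(refined_rows):
    """Trusted Data Zone: the only rows BI is allowed to read."""
    return list(refined_rows)


# Made-up source rows for illustration.
source = [
    {"sku": " ab-100 ", "qty": "3"},
    {"sku": "AB-200", "qty": "n/a"},
    {"sku": None, "qty": "1"},
]

raw = load_raw(source)
refined, rejected = refine(raw)
trusted = publish_trusted(refined)

print("trusted:", trusted)
print("rejected:", rejected)
```

The rejected rows are the interesting part: they are the list you take back to the owners of the source system, rather than something BI patches by hand.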
Now we can get into a whole conversation about Continuous Improvement either through Lean or through ITIL, but the point is the same: the process of Data Quality identification should include something to allow you to look back at the data and fix it at its source on a continuous basis. Any decent Management Consultant (see the article I wrote here: https://www.linkedin.com/pulse/top-5-things-consider-when-hiring-customer-experience-larry-odebrecht/) should be able to help you set up that process and drive the improvement cycle.
2. Leverage Risk Management Techniques to Improve Data Quality
Let me introduce you to a concept that Henry Ford probably understood well: Rolled Throughput Yield. Conceptually, it’s easy to understand, but it’s often forgotten when it comes to Data Quality. I developed the picture below to talk about this with my students at Minnesota State.
Every upstream data loss has a multiplicative effect as it’s combined with other data. Sometimes, this creates a much, much bigger issue for the end-user. So, we live in a world where it is totally possible to have both relatively minor Data Quality issues at the source, and a massive Data Quality issue at the end-user. Looking at the example, the yield is .955 * .970 * .944, or about 87.4%! That is a Defects Per Million Opportunities (DPMO) figure of 125,526.
From a Management Consulting perspective, the solution can be straightforward. Certainly, we can and should work upstream to solve the Data Quality issues. But often the more impactful solution is to work where the pain is experienced. I normally work directly with the “smart people”, in other words those experiencing the pain most acutely, in a facilitated session to catalog the problems.
I score the problems against three categories: severity, the likelihood of occurrence, and detectability. I bake that score into a Risk Priority Number (RPN). After, I work with the team to develop a score for Organizational Capability (OC). We can plot the RPN and OC on a 2×2 matrix. Then we launch projects against the most impactful and sensible efforts. The outcome is actionable and has the buy-in from the team you worked with. After we complete the fixes, we can rescore to measure the bang-for-the-buck.
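The arithmetic is easy to check for yourself. Here is a small sketch using the three yields from the example; the function names are mine, and the DPMO figure simply treats the overall loss as defects per million records.

```python
from math import prod

# Per-step yields from the example above.
step_yields = [0.955, 0.970, 0.944]


def rolled_throughput_yield(yields):
    """Multiply the per-step yields to get the end-to-end yield."""
    return prod(yields)


def dpmo(overall_yield):
    """Defects per million opportunities implied by an overall yield."""
    return (1 - overall_yield) * 1_000_000


rty = rolled_throughput_yield(step_yields)
print(f"RTY:  {rty:.1%}")         # RTY:  87.4%
print(f"DPMO: {dpmo(rty):,.0f}")  # DPMO: 125,526
```

Note that no single step loses more than 5.6% of the data, yet the end-user is missing more than 12%.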
I’m describing a tool called “Failure Mode and Effects Analysis (FMEA)”. There is a ton of information available for it online.
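As a rough sketch of the scoring, the conventional FMEA formulation multiplies the three scores, each on a 1 to 10 scale, to get the RPN. Everything else here is an assumption for illustration: the issue names, the scores, the 1 to 10 scale for Organizational Capability, the cutoffs for the 2×2, and the quadrant labels. In a real session the team sets all of those.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    severity: int       # 1 (minor) to 10 (severe)
    occurrence: int     # 1 (rare) to 10 (constant)
    detectability: int  # 1 (always caught) to 10 (rarely caught)
    capability: int     # Organizational Capability: 1 (low) to 10 (high)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three risk scores."""
        return self.severity * self.occurrence * self.detectability


def quadrant(issue, rpn_cutoff=200, oc_cutoff=5):
    """Place an issue on the RPN vs. OC 2x2 (cutoffs are hypothetical)."""
    high_risk = issue.rpn >= rpn_cutoff
    capable = issue.capability >= oc_cutoff
    if high_risk and capable:
        return "do first"
    if high_risk:
        return "build capability"
    if capable:
        return "quick win"
    return "defer"


# Made-up issues and scores from an imaginary facilitated session.
issues = [
    Issue("duplicate customer ids", 8, 7, 6, capability=7),
    Issue("late SKU updates", 9, 5, 8, capability=3),
    Issue("free-text state codes", 4, 6, 3, capability=8),
]

ranked = sorted(issues, key=lambda i: i.rpn, reverse=True)
for issue in ranked:
    print(f"{issue.name:25} RPN={issue.rpn:4d} -> {quadrant(issue)}")
```

Rescoring after the fixes is just a matter of updating the scores and running it again; the drop in RPN is your bang-for-the-buck.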
3. Accountability is Key for Data Quality
If it is true (and I believe it is) that “that which cannot be measured cannot be managed” then it is also true that “that which should be managed but has no accountability will NOT be managed well”. Stewardship teams are wonderful in this space, and certainly helpful – but having key accountable persons is not optional. I have a few recommendations here:
- This is tactical, but critical: build a report that combs for any Data Quality fix happening in the Refined Data Zone or past it, and have an empowered leadership team review it regularly.
- Your improvement process should include the responsible parties and teams.
- Basic look-back statistics for the Continuous Improvement portions are also crucial. How many data elements were fixed last month? How many impacts did you have at the point of pain (section 2 above)?
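The first and third recommendations can be sketched together, assuming you keep some kind of log of data fixes. The log format, zone names, owners, and dates below are all hypothetical; the point is only that the report flags fixes made in the Refined Data Zone or later and rolls them up by month and by accountable team.

```python
from collections import Counter
from datetime import date

# Assumed ordering of zones, from closest to the source to furthest away.
ZONE_ORDER = {"raw": 0, "refined": 1, "trusted": 2, "bi": 3}

# Hypothetical fix log: one entry per data quality fix applied.
fix_log = [
    {"when": date(2024, 5, 3), "zone": "raw", "element": "customer_id", "owner": "ingest team"},
    {"when": date(2024, 5, 9), "zone": "bi", "element": "sku", "owner": "bi team"},
    {"when": date(2024, 5, 20), "zone": "refined", "element": "state", "owner": "data team"},
    {"when": date(2024, 6, 2), "zone": "trusted", "element": "sku", "owner": "bi team"},
]


def late_fixes(log):
    """Fixes applied in the Refined Data Zone or anywhere past it."""
    return [f for f in log if ZONE_ORDER[f["zone"]] >= ZONE_ORDER["refined"]]


late = late_fixes(fix_log)

# Look-back statistics: how many late fixes per month, and who made them.
by_month = Counter(f["when"].strftime("%Y-%m") for f in late)
by_owner = Counter(f["owner"] for f in late)

print("late fixes by month:", dict(by_month))
print("late fixes by owner:", dict(by_owner))
```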
Lastly, Business Intelligence is a key player in the fix. It is best that senior leadership owns the dashboards and scorecards that demonstrate the quality of the data.
Having senior leadership engaged, reviewing those reports, and driving the accountability is not optional.
If you’re able to hit these three bases as you build out your Data Quality Program, your youngsters will remain wide-eyed, bright, and bushy data folks as they apply these rules.