While data quality has been the topic of much discussion in the market research industry for the past few years, little effort has been made to objectively define the concept. Data quality is a hygiene factor that is often overlooked when present, but becomes noticeably problematic when missing. However, by defining data quality solely according to the absence of outliers, we risk losing sight of what truly makes data beautiful. What if we defined data quality based on what it is, rather than what it is not?
Defining Data Quality Based on What It Is Not
Too often, data quality is defined by what it is not: we remove Satisficers, Speeders, and Straight-liners. Yet how we define these in-survey checks is inherently subjective, and whether the practice actually improves overall results is questionable.
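To see just how subjective these checks are, consider a minimal sketch of two common ones. The column names, the 0.4 speed ratio, and the straight-lining rule below are hypothetical conventions chosen for illustration, not industry standards:

```python
import pandas as pd

def flag_speeders(df: pd.DataFrame, time_col: str = "duration_sec",
                  cutoff_ratio: float = 0.4) -> pd.Series:
    """Flag respondents who finished faster than a fraction of the
    median completion time. The 0.4 ratio is an arbitrary convention."""
    cutoff = df[time_col].median() * cutoff_ratio
    return df[time_col] < cutoff

def flag_straightliners(df: pd.DataFrame, grid_cols: list[str]) -> pd.Series:
    """Flag respondents who gave the identical answer to every item
    in a rating grid."""
    return df[grid_cols].nunique(axis=1) == 1
```

Nudge the cutoff ratio from 0.4 to 0.5 and a different set of "Speeders" vanishes from the dataset. That sensitivity is precisely the subjectivity problem.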
Picture this: You have just completed a long and arduous research project, and you’re eager to present your findings to your client. However, as you begin to delve into the data, your client starts to notice something troubling: the story doesn’t make sense. You feel your stomach drop as your client raises this concern, asking you to explain what’s going on. You rack your brain for an answer and finally settle on “But…there are no Speeders in our data.” Even as you say it, you realize that this is a poor defense. The absence of Speeders does not make the quality of your data good.
Instead, we ought to focus on defining what qualifies as good data.
The Role of Cohesion in Achieving Data Quality
Let’s take a philosophical step back and consider what makes data beautiful.
At its core, beautiful data makes sense. When we view data quality through this lens, it becomes less subjective than we might think. Data makes sense when the story of each participant is cohesive.
If you’ve seen bad data, you know that participants who cheat in surveys usually answer randomly, and the results are incoherent: Gen Zs buying retirement properties, plumbers performing DNA sequencing, and retirees enrolling in kindergarten classes.
Cohesion doesn’t mean that the findings can’t be surprising; that’s why we do research! But if you were to look at each survey participant in your dataset row by row, you would find that good participants typically remain true to their persona throughout the survey. That’s cohesion.
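One way to operationalize that row-by-row review is a small set of explicit consistency rules. The rules and field names below are hypothetical illustrations; real checks would be tailored to the study design:

```python
# Each rule pairs a flag name with a condition that signals an
# incoherent persona. These example rules are illustrative only.
COHESION_RULES = [
    ("gen_z_retirement_buyer",
     lambda r: r["age"] < 27 and r["bought_retirement_property"]),
    ("retiree_in_kindergarten",
     lambda r: r["age"] > 65 and r["enrolled_in_kindergarten"]),
]

def cohesion_flags(respondent: dict) -> list[str]:
    """Return the names of every cohesion rule this respondent violates."""
    return [name for name, rule in COHESION_RULES if rule(respondent)]

# Example: this persona trips the first rule.
print(cohesion_flags({"age": 22, "bought_retirement_property": True,
                      "enrolled_in_kindergarten": False}))
```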
Another hallmark of good data quality is when open-ended responses are relevant to the question at hand. Open-end responses that are consistent with the rest of the data in terms of themes or patterns further reinforce the cohesiveness of the data. Some might argue that gauging responses this way is also subjective, but the ultimate test is straightforward: Are you comfortable sharing the open-end responses with your client?
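A first pass at this review can even be automated before anyone reads the verbatims. As a hedged sketch, the heuristics below (a minimum word count and exact-duplicate detection) catch only the crudest failures; judging relevance to the question still requires human review:

```python
from collections import Counter

def flag_suspect_open_ends(responses: list[str], min_words: int = 2,
                           max_duplicates: int = 3) -> list[bool]:
    """Flag open-ends that are too short or repeated verbatim across
    many respondents -- both common signs of low-effort answers."""
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    return [len(text.split()) < min_words or counts[text] > max_duplicates
            for text in normalized]
```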
Avoiding Confirmation Bias by Developing Tools to Assess Cohesion
Simply removing Satisficers, Straight-liners, and Speeders is not enough on its own. When we remove participants based on these rules, we shoehorn the metrics we have into telling us what we want to see instead of determining what we actually need to know.
To truly achieve good data quality, we need to develop tools that help us identify a lack of participant-level cohesion. For example, the Root Likelihood (RLH) fit score improves data quality by identifying participants who may have responded randomly to a choice task, such as a Conjoint exercise. These consistency checks are not only better indicators of good-quality data, but they are also less obvious to participants who have become skilled at avoiding the standard quality-assurance traps.
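As a brief illustration of the RLH idea: it is the geometric mean of the probabilities the fitted choice model assigns to the alternatives each participant actually chose, so a respondent answering at random drifts toward the chance level of 1/k for k alternatives per task. The probabilities and the flagging threshold below are hypothetical:

```python
import numpy as np

def root_likelihood(chosen_probs) -> float:
    """Geometric mean of the model's predicted probabilities for the
    alternatives the respondent actually chose."""
    p = np.asarray(chosen_probs)
    return float(np.exp(np.mean(np.log(p))))

# A respondent choosing among 4 alternatives per task: chance RLH = 0.25.
probs = [0.24, 0.27, 0.22, 0.30, 0.25]  # hypothetical fitted probabilities
rlh = root_likelihood(probs)
chance = 1 / 4
if rlh < 1.5 * chance:  # the 1.5 margin is a judgment call, not a standard
    print(f"RLH {rlh:.2f} is near chance ({chance:.2f}): possible random responder")
```

Because the check rides on the participant's own choices rather than a visible trap question, there is nothing for a practiced cheater to spot and route around.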