
The “Deadly Dozen” Data Science Mistakes:
Mistake 1 - Lack of Crucial Data

Dr. John Elder - Vice President and Technical Fellow for Data Science
In my introduction to this series, I shared a saying that resonated with me: “Good judgment comes from experience, and experience comes from bad judgment.” My goal is to help you succeed by revealing the key mistakes my colleagues and I have made over many years—the “deadly dozen” culprits that derail projects, starting with Mistake 1: Lack of Crucial Data.

 

The best analyses require labeled cases, i.e., an output or target variable. If you only have input variables, all you can do is look for subsets with similar characteristics (cluster) or find the dimensions which best capture the data variation (principal components). Those unsupervised techniques often provide useful insight but are much less powerful than a good (supervised) prediction or classification model.

 

Given a target variable, the most interesting class or type of observation is usually the rarest by orders of magnitude. For instance, roughly 1/10 of “risky” individuals given credit will default within two years, 1/100 people mailed a catalog will respond with a purchase, and 1/1,000 CT scans actually cause cancer.[1] The less probable the interesting events, the more overall data it takes to obtain enough to generalize a model to unseen cases.[2] Some projects probably should not proceed until enough critical data is gathered to make them worthwhile.
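The arithmetic behind this point can be sketched in a few lines. The base rates below are the ones quoted above; the goal of 500 rare cases is a purely hypothetical figure chosen for illustration, not a recommendation, and the function name is made up for this sketch.

```python
import math
from fractions import Fraction

def total_cases_needed(target_rare_cases: int, base_rate: Fraction) -> int:
    """Expected number of total cases to collect so that, on average,
    `target_rare_cases` of them are the rare, interesting class."""
    if not 0 < base_rate <= 1:
        raise ValueError("base_rate must be in (0, 1]")
    # Exact rational arithmetic avoids floating-point rounding in the ceiling.
    return math.ceil(target_rare_cases / base_rate)

# Base rates from the text; 500 is a hypothetical modeling goal.
base_rates = {
    "credit default": Fraction(1, 10),
    "catalog purchase": Fraction(1, 100),
    "CT-scan-caused cancer": Fraction(1, 1000),
}
for event, rate in base_rates.items():
    print(f"{event}: about {total_cases_needed(500, rate):,} total cases")
```

Each tenfold drop in the base rate multiplies the data requirement by ten, which is why the rarest targets can stall a project before modeling begins.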

 

For example, on a project to discover fraud in government contracting, known fraud cases were so rare that strenuous effort could only initially reduce the size of the haystack in which the needles were hiding.[3] The model predictions did enable auditors to focus their effort on the highest-scoring cases. But more known fraud cases—good for data scientists but bad for taxpayers—could have provided the modeling the traction needed to automatically flag suspicious new cases more quickly and accurately.

 

Abundant known cases made exactly that difference on another project of ours, focused on discovering tax fraud collusion.[4] Unfortunately for taxpayers, there were plenty of training examples, but that abundance did enable stronger analytic results. Excellent models were created, enthusiastically implemented by the IRS, and ultimately credited with saving taxpayers over $20 billion in just their first three years.

 

One can’t uncover insights without data, but not just any data will work. Many data science projects must make do with “found” data, not the results of an experiment designed to illuminate the question studied. It’s like making a salad out of what can be foraged in the yard.

 

One sophisticated credit-issuing company realized this when seeking to determine if there was a market for their products in the class of applicants once routinely dismissed as being too risky.[5] Perhaps a low-limit card would be profitable and even help a deserving subset of applicants pull themselves up in their credit rating? But the company had no data on such applicants by which to distinguish the truly risky from those worth a try; their traditional filters excluded such individuals from even initial consideration. So, they essentially gave (small amounts of) credit almost randomly to thousands of risky applicants and monitored their repayments for two years. Then, they built models to forecast defaulters (those significantly late on payments) trained only on initial application information. The models revealed a profitable subset of formerly ignored prospects. Their large investment in creating relevant data paid off in revealing a way to profitably expand their customer base.

 

So, make sure that the data you’re working with is relevant to the problem to be solved!

 

I’ll briefly summarize the other major ways that lacking data can be hazardous:

Four Other Ways to Lack Crucial Data

1. You Need Both Sides



One client brought us all their fraud cases … and only the fraud cases! We had to explain, respectfully, that we also needed plenty of non-fraud cases. Classification modeling depends on contrast.


2. Holes in the Matrix

Some values may be missing for given combinations of case and feature. Are they missing at random? Or is there a pattern? It’s easiest to delete the case (row) or feature (column), but that might lose useful information. Before trying to impute values from other cases (which can get quite complex), see if a decision tree algorithm that can handle missing data[6] likes any of the troublesome features.
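One quick way to ask "is there a pattern?" is to compare the target rate among cases where a feature is missing against the rate where it is present. The sketch below uses a tiny hypothetical dataset (the `income` and `default` values are invented for illustration), and the function name is an assumption of this example.

```python
from typing import Dict, Optional, Sequence

def target_rate_by_missingness(
    values: Sequence[Optional[float]], labels: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Positive-class rate for rows where the feature is missing vs. present."""
    groups = {"missing": [], "present": []}
    for value, label in zip(values, labels):
        groups["missing" if value is None else "present"].append(label)
    # A group with no rows has no defined rate, so report None for it.
    return {
        name: (sum(ys) / len(ys) if ys else None) for name, ys in groups.items()
    }

# Hypothetical toy data: None marks a missing income value.
income = [52.0, None, 48.0, None, 61.0, None, 39.0, 45.0]
default = [0, 1, 0, 1, 0, 0, 0, 1]

rates = target_rate_by_missingness(income, default)
print(rates)
```

A large gap between the two rates suggests the holes are informative rather than random, in which case deleting those rows or columns would discard signal. On real data the gap should be judged against sample size before drawing conclusions.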


3. Excluded Data

The dataset may be completely blind to important information. The most frequent cause is survivor bias, where some cases didn’t make it through the gauntlet to appear in the dataset.

For example, a survey on data scientist pay found that those who negotiated their offer received an average salary increase of 3%. The authors concluded that one should always negotiate—not realizing that only those who succeeded in being hired appeared in the survey.

If a candidate overdid it and had their offer rescinded, the dataset wouldn’t know about it. (This excluded data danger will be covered more thoroughly in Mistake 5: listen only to the data.)
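The negotiation example can be made concrete with a toy illustration. All candidates below are hypothetical; the point is only that a survey of people who were hired cannot see the negotiators whose offers were rescinded.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Candidate:
    negotiated: bool
    hired: bool        # False means the offer was rescinded
    raise_pct: float   # change vs. the initial offer; only meaningful if hired

# Hypothetical population, including one negotiator who lost the offer.
candidates = [
    Candidate(negotiated=True, hired=True, raise_pct=5.0),
    Candidate(negotiated=True, hired=True, raise_pct=3.0),
    Candidate(negotiated=True, hired=True, raise_pct=1.0),
    Candidate(negotiated=True, hired=False, raise_pct=0.0),
    Candidate(negotiated=False, hired=True, raise_pct=0.0),
    Candidate(negotiated=False, hired=True, raise_pct=0.0),
]

def survey_view(population):
    """A pay survey reaches only the people who ended up hired."""
    return [c for c in population if c.hired]

seen_negotiators = [c for c in survey_view(candidates) if c.negotiated]
all_negotiators = [c for c in candidates if c.negotiated]

avg_raise_seen = mean(c.raise_pct for c in seen_negotiators)
share_unseen = 1 - len(seen_negotiators) / len(all_negotiators)

print(f"Average raise visible to the survey: {avg_raise_seen:.1f}%")
print(f"Share of negotiators the survey never sees: {share_unseen:.0%}")
```

The survey reports a healthy average raise, yet a quarter of the negotiators in this invented population got nothing and are invisible to it. No amount of analysis on the surveyed rows alone can recover that missing quarter.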


4. Mislabeled Cases

If the target label depends on human judgment, some cases will be wrong. Radiologists, for example, are estimated (by their own professional society) to miss 30% of important findings.[7] In one study, radiologists were secretly shown the same chart twice on the same day and unknowingly changed their diagnosis 20% of the time.

In other rare cases, the mislabeling is even deliberate. In analyzing health assessments for Social Security disability benefits, we found huge inconsistencies and even one adjudicator who always decided the opposite of what their supervisor suggested—no matter the merits of the case!

 

Few degree programs prepare graduates for these last two problems, but they are very common in the real world. And few projects consider the value of (or budget for) following up on the problem cases. It is very helpful to carefully examine a case that the model had great trouble getting “right.” It’s likely mislabeled, but if not, it can reveal even more information.
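A simple way to build that follow-up list is to rank cases by how strongly the model's score disagrees with the recorded label. This is a minimal sketch with invented labels and scores; it assumes the scores are predicted probabilities of class 1 and, ideally, come from held-out or cross-validated predictions so the model has not simply memorized the labels.

```python
from typing import List, Sequence

def most_suspicious(
    labels: Sequence[int], scores: Sequence[float], top_k: int
) -> List[int]:
    """Indices of the `top_k` cases whose recorded label (0 or 1) is
    furthest from the model's predicted probability of class 1."""
    disagreement = [abs(y - p) for y, p in zip(labels, scores)]
    ranked = sorted(
        range(len(disagreement)), key=lambda i: disagreement[i], reverse=True
    )
    return ranked[:top_k]

# Hypothetical labels and model scores.
labels = [1, 0, 0, 1, 0, 1]
scores = [0.92, 0.08, 0.95, 0.10, 0.30, 0.70]

review_queue = most_suspicious(labels, scores, top_k=2)
print(review_queue)  # cases to hand back to a human for re-examination
```

Cases 2 and 3 head the queue here: the model is confident and the label says the opposite. Either the label is wrong or the case holds something the model has not yet learned, and both outcomes are worth the reviewer's time.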

 

I’ll close with a heartening story—at least for modelers. We worked hard on a challenging biometric classification task. Our client was very skeptical that analytics would help. They agreed to re-examine the top 200 cases our model got wrong (likely to prove their point and be freed of such nerds). Instead, they found that 75% of those initial labels were wrong, and they became enthusiastic backers of our further assistance.

 

In my next blog, we’ll explore the second deadly data science mistake: focusing on training. We’ll explore why a model can look great in the lab but completely fail in the wild and how you can ensure your models actually work when it matters most. Stay tuned!

Dr. John Elder serves as Vice President and Technical Fellow for Data Science within MANTECH’s Data and AI Practice, where he provides technical leadership, thought leadership, and strategic guidance on advanced analytics, machine learning, and AI solutions supporting mission-critical government and commercial programs.


Footnotes


1 https://cancer.ucsf.edu/news/2025/04/15/popular-ct-scans-could-account-for-5-of-all-cancer-cases-a-year

2 However, you do not need to balance the number of 0/1 cases as is commonly advised: instead, lower your decision threshold from the default of 0.5 to the cutoff that maximizes the expected return (or minimizes the loss). In some situations, it is useful to increase the influence of rare cases through duplication; we will talk about how to avoid those pitfalls in Mistake 10: sample casually.

3 Virtually all known cases were government workers who had, out of guilt, turned themselves in. Most claimed they originally intended to pay back what they had fraudulently obtained (but how?). One audacious fraudster was discovered, however, after coworkers realized that the clerk had been driving a different sports car to work every day of the week!

4 Tax fraud is primarily perpetrated by organized crime (foreign and domestic) rather than by individuals.

5 For decades the credit industry has mailed over a billion offers a year to American households; the high-risk market was then one of the few places not saturated. Credit profits are nonlinear with risk and remind me of the triage system established during the Napoleonic wars, when the levée en masse swelled the battlefields. Combined with devastating new technology, this completely overwhelmed medical resources. Wounds were classified into three levels: minor to be treated later (if at all), serious to receive immediate attention, and most serious likely not worth a physician’s time. (We can envision a combatant, aware of hovering between the latter two classes, insisting, like the Black Knight in the Monty Python movie, “What? Leg gone? It’s just a flesh wound!”) Likewise, credit companies make the most profit on individuals in the middle category of “woundedness”: those who can’t pay off their balance but keep trying. But banks lose 5-7 times as much on clients just a little worse off, who eventually give up trying altogether. So, for models to be profitable at this edge of the return cliff, they must forecast very fine distinctions. But sudden economic downturns tend to severely punish the stocks of companies that aggressively seek that customer niche unless they pay obsessive attention to model quality.

6 The simplest way a tree handles a case missing the value of the feature is to send it to the majority side of a split. Others impute the value to be the mean or median of the feature for the sake of moving on. The CART algorithm does it best; it determines “surrogate splits” for each tree node so there are several backup questions in case the original question can’t be answered.

7 https://www.rsna.org/news/2022/march/human-error-in-radiology
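Footnote 2's advice, choosing the decision cutoff that maximizes return rather than rebalancing the classes, can be sketched as follows. The labels, scores, and payoffs are hypothetical (a gain of 10 per true positive and a cost of 1 per false positive, loosely in the spirit of the catalog example), and the return is measured on the sample itself; in practice it would be estimated on held-out data.

```python
from typing import Sequence

def total_return(
    labels: Sequence[int],
    scores: Sequence[float],
    threshold: float,
    gain_tp: float,
    cost_fp: float,
) -> float:
    """Payoff from acting on every case scored at or above `threshold`."""
    total = 0.0
    for y, p in zip(labels, scores):
        if p >= threshold:
            total += gain_tp if y == 1 else -cost_fp
    return total

def best_threshold(labels, scores, gain_tp, cost_fp) -> float:
    """Try each observed score as a cutoff and keep the most profitable one."""
    return max(
        sorted(set(scores)),
        key=lambda t: total_return(labels, scores, t, gain_tp, cost_fp),
    )

# Hypothetical scored cases in which positives are the minority.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60]

cutoff = best_threshold(labels, scores, gain_tp=10, cost_fp=1)
print("default 0.5 cutoff:", total_return(labels, scores, 0.5, 10, 1))
print(f"best cutoff {cutoff}:", total_return(labels, scores, cutoff, 10, 1))
```

With these invented payoffs, the default cutoff of 0.5 catches only one positive, while the lower cutoff accepts a false positive to catch all three and nearly triples the return. The right cutoff depends entirely on the actual gains and costs, which is the footnote's point.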
