Learn More About Data and AI
Explore your next career challenge and learn more about the Data and AI team!
Learn More
Four Other Ways to Lack Crucial Data
One client brought us all their fraud cases … and only the fraud cases! We had to explain, respectfully, that we also needed plenty of non-fraud cases. Classification modeling depends on contrast.
Some values may be missing for particular combinations of case and feature. Are they missing at random, or is there a pattern? It’s easiest to delete the case (row) or feature (column), but that may discard useful information. Before trying to impute values from other cases (which can get quite complex), see whether a decision tree algorithm that can handle missing data[6] makes use of any of the troublesome features.
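One quick way to probe the "is there a pattern?" question is to compare the target rate among cases where a feature is missing with the rate among cases where it is present. The sketch below is only illustrative: the feature name `income`, the target name `fraud`, and the toy rows are all hypothetical, and a large gap between the two rates is a hint worth investigating, not a formal test.

```python
from collections import defaultdict

def target_rate_by_missingness(rows, feature, target):
    """Return {is_missing: mean target} for rows split by whether `feature` is None."""
    totals = defaultdict(lambda: [0, 0])  # is_missing -> [sum of target, row count]
    for row in rows:
        is_missing = row.get(feature) is None
        totals[is_missing][0] += row[target]
        totals[is_missing][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

# Hypothetical toy data: 'income' happens to be missing more often in fraud cases.
rows = [
    {"income": 52000, "fraud": 0},
    {"income": None,  "fraud": 1},
    {"income": 61000, "fraud": 0},
    {"income": None,  "fraud": 1},
    {"income": 48000, "fraud": 1},
    {"income": None,  "fraud": 0},
    {"income": 75000, "fraud": 0},
    {"income": 58000, "fraud": 0},
]

rates = target_rate_by_missingness(rows, "income", "fraud")
# rates[True]  is the fraud rate where income is missing (2 of 3 rows here)
# rates[False] is the fraud rate where income is present (1 of 5 rows here)
print(rates)
```

If the two rates differ sharply on real data, the missingness itself may carry signal, and deleting those rows or columns would throw it away.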
The dataset may be completely blind to important information. The most frequent cause is survivor bias, where some cases didn’t make it through the gauntlet to appear in the dataset.
For example, a survey on data scientist pay found that those who negotiated their offer received an average salary increase of 3%. The authors concluded that one should always negotiate—not realizing that only those who succeeded in being hired appeared in the survey.
If a candidate overdid it and had their offer rescinded, the dataset wouldn’t know about it. (This excluded-data danger will be covered more thoroughly in Mistake 5: listen only to the data.)
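To see how much the unseen cases could matter, here is a small worked calculation. It takes the survey's 3% average increase at face value and asks how often an offer would have to be rescinded before negotiating stops paying off on average. The base salary, the fallback values, and the function names are hypothetical; the survey reports none of them.

```python
def expected_salary_if_negotiating(base, raise_frac, p_rescind, fallback):
    """Expected outcome when negotiating: keep the raised offer, or lose it and take the fallback."""
    return (1 - p_rescind) * base * (1 + raise_frac) + p_rescind * fallback

def break_even_rescind_probability(base, raise_frac, fallback):
    """Rescind probability at which negotiating has the same expected value as accepting `base`."""
    gain_if_kept = base * raise_frac                  # upside of negotiating
    loss_if_rescinded = base * (1 + raise_frac) - fallback  # raised offer vs. fallback
    return gain_if_kept / loss_if_rescinded

base = 100_000      # hypothetical original offer
raise_frac = 0.03   # the 3% average increase reported among those hired

# Hypothetical worst case: a rescinded offer leaves the candidate with nothing.
p_star = break_even_rescind_probability(base, raise_frac, fallback=0.0)
print(f"{p_star:.1%}")  # 2.9%

# Hypothetical softer case: the candidate has a 90,000 offer elsewhere.
p_soft = break_even_rescind_probability(base, raise_frac, fallback=90_000)
print(f"{p_soft:.1%}")  # 23.1%
```

Under these assumptions, a rescind rate of only about 3% among negotiators with no fallback would erase the average gain, and the survey cannot reveal what that rate actually was.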
If the target label depends on human judgment, some cases will be wrong. Radiologists, for example, are estimated (by their own professional society) to miss 30% of important findings.[7] In one study, radiologists were secretly shown the same chart twice on the same day and unknowingly changed their diagnosis 20% of the time.
In other rare cases, the mislabeling is even deliberate. In analyzing health assessments for Social Security disability benefits, we found huge inconsistencies and even one adjudicator who always decided the opposite of what their supervisor suggested—no matter the merits of the case!
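A simple consistency audit can surface labelers like that adjudicator. The sketch below computes, for each adjudicator, how often the final decision matched the supervisor's suggestion, and flags anyone whose agreement rate is implausibly low. The record layout, the names, and the 10% cutoff are hypothetical choices for illustration, not the method used in the project described above.

```python
from collections import defaultdict

def agreement_rates(decisions):
    """decisions: iterable of (adjudicator, supervisor_suggestion, final_decision) tuples."""
    counts = defaultdict(lambda: [0, 0])  # adjudicator -> [agreements, total decisions]
    for adjudicator, suggested, decided in decisions:
        counts[adjudicator][0] += int(suggested == decided)
        counts[adjudicator][1] += 1
    return {name: agree / total for name, (agree, total) in counts.items()}

# Hypothetical toy records.
decisions = [
    ("A", "approve", "approve"),
    ("A", "deny",    "deny"),
    ("A", "approve", "deny"),
    ("A", "deny",    "deny"),
    ("B", "approve", "deny"),
    ("B", "deny",    "approve"),
    ("B", "approve", "deny"),
]

rates = agreement_rates(decisions)

# Hypothetical cutoff: flag anyone who agrees with the suggestion 10% of the time or less.
flagged = [name for name, rate in rates.items() if rate <= 0.10]
print(rates, flagged)
```

With real data, a flag would only be a prompt to review that labeler's cases, since some disagreement with a supervisor is legitimate.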
Vice President and Technical Fellow for Data Science
Dr. John Elder serves as Vice President and Technical Fellow for Data Science within MANTECH’s Data and AI Practice, where he provides technical leadership, thought leadership, and strategic guidance on advanced analytics, machine learning, and AI solutions supporting mission-critical government and commercial programs.
1 https://cancer.ucsf.edu/news/2025/04/15/popular-ct-scans-could-account-for-5-of-all-cancer-cases-a-year
2 However, you do not need to balance the number of 0/1 cases as is commonly advised: instead, lower your decision threshold from the default of 0.5 to the cutoff that maximizes the expected return (or minimizes the loss). In some situations, it is useful to increase the influence of rare cases through duplication; we will talk about how to avoid those pitfalls in Mistake 10: sample casually.
3 Virtually all known cases were government workers who had, out of guilt, turned themselves in. Most claimed they originally intended to pay back what they had fraudulently obtained (but how?). One audacious fraudster was discovered, however, after coworkers realized that the clerk had been driving a different sports car to work every day of the week!
4 Tax fraud is primarily perpetrated by organized crime (foreign and domestic) rather than by individuals.
5 For decades the credit industry has mailed over a billion offers a year to American households; the high-risk market was then one of the few places not saturated. Credit profits are nonlinear with risk and remind me of the triage system established during the Napoleonic wars, when the levée en masse swelled the battlefields. Combined with devastating new technology, this completely overwhelmed medical resources. Wounds were classified into three levels: minor to be treated later (if at all), serious to receive immediate attention, and most serious likely not worth a physician’s time. (We can envision a combatant, aware of hovering between the latter two classes, insisting, like the Black Knight in the Monty Python movie, “What? Leg gone? It’s just a flesh wound!”) Likewise, credit companies make the most profit on individuals in the middle category of “woundedness”: those who can’t pay off their balance but keep trying. But banks lose 5-7 times as much on clients just a little worse off, who eventually give up trying altogether. So, for models to be profitable at this edge of the return cliff, they must forecast very fine distinctions. And sudden economic downturns tend to severely punish the stocks of companies that aggressively seek that customer niche unless they pay obsessive attention to model quality.
6 The simplest way a tree handles a case missing the value of the feature is to send it to the majority side of a split. Others impute the value to be the mean or median of the feature for the sake of moving on. The CART algorithm does it best; it determines “surrogate splits” for each tree node so there are several backup questions in case the original question can’t be answered.
7 https://www.rsna.org/news/2022/march/human-error-in-radiology
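The advice in footnote 2, to move the decision threshold to the cutoff that maximizes expected return, can be sketched as a simple sweep over candidate cutoffs. Everything concrete here is hypothetical: the scores, the labels, and the payoff figures (1,000 gained per fraud caught, 50 spent per false alarm) are invented for illustration, and in practice the sweep should run on held-out data.

```python
def total_return(threshold, scores, labels, gain_tp, cost_fp, cost_fn=0.0):
    """Total payoff when every case scoring at or above `threshold` is flagged."""
    total = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            total += gain_tp      # true positive: rare case caught
        elif flagged and label == 0:
            total -= cost_fp      # false positive: wasted effort
        elif label == 1:
            total -= cost_fn      # false negative: rare case missed
    return total

def best_threshold(scores, labels, gain_tp, cost_fp, cost_fn=0.0):
    """Try each observed score (plus 'flag nothing') as a cutoff; return (cutoff, return)."""
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        value = total_return(t, scores, labels, gain_tp, cost_fp, cost_fn)
        if best is None or value > best[1]:
            best = (t, value)
    return best

# Hypothetical model scores and true labels (1 = rare positive case).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.80]
labels = [0,    0,    0,    1,    0,    1,    1]

default_return = total_return(0.5, scores, labels, gain_tp=1000, cost_fp=50)
cutoff, tuned_return = best_threshold(scores, labels, gain_tp=1000, cost_fp=50)
print(default_return, cutoff, tuned_return)  # 2000.0 0.3 2950.0
```

In this toy case the best cutoff sits well below 0.5 because a missed positive costs far more than a false alarm, which is the situation the footnote describes.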