The “Deadly Dozen” Data Science Mistakes
Inducing useful models from data can only succeed if the data represent all essential parts of the problem, and if the features, X, are in some way related to the outcomes, Y.
Accuracy on training data can be made almost arbitrarily high; only out-of-sample results matter.
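A minimal pure-Python sketch of why training accuracy alone is meaningless (the data and model here are illustrative toys): a 1-nearest-neighbor model memorizes pure noise perfectly, yet performs only at chance on held-out data.

```python
import random

random.seed(0)

# Synthetic data with NO real X-to-Y relationship: labels are coin flips.
X = [[random.random() for _ in range(3)] for _ in range(200)]
y = [random.randint(0, 1) for _ in range(200)]
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

def dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def predict(x):
    # 1-nearest-neighbor: memorizes the training set exactly.
    i = min(range(len(X_train)), key=lambda j: dist(x, X_train[j]))
    return y_train[i]

train_acc = sum(predict(x) == t for x, t in zip(X_train, y_train)) / len(y_train)
test_acc = sum(predict(x) == t for x, t in zip(X_test, y_test)) / len(y_test)

print(train_acc)  # 1.0: the model "knows" every training point
print(test_acc)   # typically near 0.5, since there was nothing to learn
```

The training score is perfect by construction; only the out-of-sample score reveals that the "model" learned nothing.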
Different modeling methods have unique strengths and weaknesses. Employ a varied toolkit and let each model teach you something even if you don’t use its predictions.
Carefully design your modeling project to address a key business problem. If possible, craft an error metric to reflect the real-world trade-offs.
The data may not be complete or balanced. Use domain knowledge to supplement and constrain a model blinded to anything not in its data.
Early results will be too good to be true if future information, not knowable at the time of the model's decision, is included in the training data.
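A toy illustration of such a leak (the churn scenario and field names are hypothetical): a field that is only knowable after the outcome makes a trivial rule look perfect in backtesting, while an honestly available feature shows how little signal is really there.

```python
import random

random.seed(1)

# Hypothetical churn example: "months_until_cancel" is only known AFTER a
# customer leaves, so it must never appear among the training features.
rows = []
for _ in range(100):
    churned = random.randint(0, 1)
    usage = random.random()                      # legitimately known at decision time
    months_until_cancel = 0 if churned else 99   # future information: a leak
    rows.append((usage, months_until_cancel, churned))

# A "model" that peeks at the leaky field looks perfect...
leaky_acc = sum((m == 0) == bool(c) for _, m, c in rows) / len(rows)

# ...while the honest feature carries no signal in this toy data.
honest_acc = sum((u > 0.5) == bool(c) for u, m, c in rows) / len(rows)

print(leaky_acc)   # 1.0
print(honest_acc)  # near 0.5
```

The leaky model's perfect score evaporates in deployment, where the future field does not yet exist.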
Outliers can ruin your model, or they can be valuable discoveries. Leverage points – outliers in X – wield far too much influence on parameters.
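A small numeric sketch of a leverage point's influence (the data are made up for illustration): five points lie exactly on the line y = x, and adding a single point far out in X does not merely weaken the fitted slope, it flips its sign.

```python
# Points on the line y = x, plus one leverage point far out in X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]

def slope(xs, ys):
    # Ordinary least-squares slope for y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

clean = slope(xs, ys)                                  # 1.0: the true trend
with_leverage = slope(xs + [100.0], ys + [0.0])        # one extreme-X point added

print(clean)          # 1.0
print(with_leverage)  # about -0.03: a single leverage point erased the trend
```

Because the squared-error criterion weights points by their distance from the mean of X, the one extreme point dominates the estimate.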
This mistake is “overloaded” with three meanings: you can learn too much from early experiences, trust ideas that look useful in low dimensions but break down in high dimensions, or believe too much in the hype of your favorite modeling method.
Models have a data boundary within which they are competent to answer questions. This is rarely understood, measured, or enforced.
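One simple, illustrative way to enforce such a boundary is a per-feature range check on incoming queries (a sketch with made-up numbers; real systems may use density or distance measures instead):

```python
# Minimal sketch of enforcing a model's "data boundary": decline to answer
# for inputs outside the per-feature range seen in training.
X_train = [[1.0, 10.0], [2.0, 12.0], [3.0, 11.0]]

lo = [min(col) for col in zip(*X_train)]
hi = [max(col) for col in zip(*X_train)]

def in_boundary(x):
    return all(l <= v <= h for v, l, h in zip(x, lo, hi))

print(in_boundary([2.5, 11.5]))  # True: interpolation, the model may answer
print(in_boundary([2.5, 50.0]))  # False: extrapolation, decline or flag it
```

Even this crude check, measured and logged in production, makes the model's competence region explicit rather than implicit.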
Sampling down common cases or duplicating rare cases is sometimes valuable, but discipline is needed to do it right, especially when employing cross-validation.
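A toy sketch of the discipline required (record IDs and counts are illustrative): duplicating rare cases before splitting lets copies of the same record land on both sides of a validation split, silently leaking answers; duplicating only inside the training portion avoids this.

```python
import random

random.seed(2)

# 100 records, the first 10 of which are "rare" positives to duplicate 5x.
records = list(range(100))
rare = records[:10]

# WRONG: oversample first, then split. Copies of the same rare record
# end up in both the validation fold and the training data.
oversampled = records + rare * 4
random.shuffle(oversampled)
fold, rest = oversampled[:28], oversampled[28:]
leaked = len(set(fold) & set(rest))
print(leaked)  # > 0: validation shares records with training

# RIGHT: split the ORIGINAL records first, oversample inside training only.
random.shuffle(records)
fold, rest = records[:20], records[20:]
train = rest + [r for r in rest if r in rare] * 4
print(len(set(fold) & set(train)))  # 0: no overlap
```

The same rule applies fold-by-fold in cross-validation: any resampling must happen after, and inside, each training split.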
The process of searching over a vast space of possible models introduces a hidden complexity to the final model chosen. To measure its true significance, find out how accurate your algorithm is when modeling “null hypothesis” data, where no relation exists between X and Y.
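A toy demonstration of why this calibration matters (feature counts and thresholds are illustrative): on labels that are pure noise, simply searching over enough candidate features and split points "finds" a rule whose accuracy looks far better than chance.

```python
import random

random.seed(3)

n = 50
# Null-hypothesis data: labels are pure noise, unrelated to every feature.
y = [random.randint(0, 1) for _ in range(n)]
features = [[random.random() for _ in range(n)] for _ in range(200)]

def best_threshold_acc(f, y):
    # Best single-split rule on this feature, searching both directions.
    best = 0.0
    for t in f:
        for direction in (True, False):
            acc = sum(((v > t) == direction) == bool(c)
                      for v, c in zip(f, y)) / len(y)
            best = max(best, acc)
    return best

# Searching 200 random features finds one that "predicts" the noise.
best = max(best_threshold_acc(f, y) for f in features)
print(best)  # well above 0.5, despite zero real signal
```

The accuracy achievable on such null data is the baseline against which a real model's apparent significance should be judged; a result is only interesting to the extent it beats what the same search finds in noise.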
The best model – perhaps an ensemble of many – may be impossible to interpret. Even if it is simple, too much can be read into the features used. It’s best to measure the influence of features over a suite of competing top models.
Vice President and Technical Fellow for Data Science
Dr. John Elder serves as Vice President and Technical Fellow for Data Science within MANTECH’s Data and AI Practice, where he provides technical leadership, thought leadership, and strategic guidance on advanced analytics, machine learning, and AI solutions supporting mission-critical government and commercial programs.