
The “Deadly Dozen” Data Science Mistakes: An Introduction

Dr. John Elder - Vice President and Technical Fellow for Data Science
In 2009, fourteen years after founding a data science consultancy, my first book – with co-authors Drs. Bob Nisbet and Gary Miner – came out to strong reviews. Though we were very different in personality, all of us had been making a living from our data science skills for decades and loved to teach, including in university classes. It took three years for us to write, argue about, and harmonize the book into a cohesive whole. Astonishingly, it sold out completely and won “Book of the Year” in Mathematics. Though I would go on to co-author two more books, on ensemble methods and text mining, this “Data Mining Handbook” had the biggest impact. It was geared toward practitioners, not academics, and made arcane but powerful concepts clear to many readers for the first time.


The most popular chapter in the book was #20: Top Ten Data Mining Mistakes. It resonated with readers and reviewers who recognized, from hard experience, some of the problems discussed and were intrigued by the rest. By press time, the list had grown to 11 after a colleague sagely suggested adding “Lack Data”. Years later, I had a breakthrough realization and added the 12th mistake – Cherry-Picking. I came to believe it’s the most serious mistake of all, and that its remedy can solve the ongoing “Crisis in Science”.[1]


If you’re to survive and thrive in the forest of Analytics, it’s essential to know the dangerous beasts that lurk therein!


Some of these “beasts” are technical, some are strategic, and some have more to do with human judgment than modeling technique. What they have in common is their ability to undermine results in costly ways.


In the weeks ahead, I’ll follow up with a blog post on each mistake in turn, illustrated with real-world adventures. But even at a high level, this list is a practical checklist for anyone building, evaluating, or overseeing data science work. I would also enjoy hearing stories from you that relate to, or extend, this list; write to me at ai@mantech.com.


The “Deadly Dozen” Data Science Mistakes


1. Lack Data

Inducing useful models from data can only succeed if the data represent all essential parts of the problem, and if the features, X, are in some way related to the outcomes, Y.


2. Focus on Training

Training accuracy can be made almost arbitrarily high; only out-of-sample results matter.
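To make this concrete, here is a minimal sketch (my illustration, not from the book) in which an unconstrained decision tree memorizes noisy synthetic data: the training score looks superb, while the held-out score tells the real story.

```python
# Sketch: training accuracy vs. out-of-sample accuracy on noisy synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                                   # 10 synthetic features
y = (X[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)  # weak signal, much noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can fit the training set nearly perfectly...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # ~1.00
print("test accuracy: ", model.score(X_test, y_test))    # much lower -- the honest number
```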


3. Rely on One Technique

Different modeling methods have unique strengths and weaknesses. Employ a varied toolkit, and let each model teach you something even if you don’t use its predictions.
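As one hedged illustration of a varied toolkit, the sketch below compares three model families on the same public dataset via cross-validation; the dataset and models are my choices for demonstration, not a prescription.

```python
# Sketch: compare several model families on the same data before committing to one.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic":  make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "forest":    RandomForestClassifier(random_state=0),
    "neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:10s} mean CV accuracy = {scores.mean():.3f}")
```

Where a simple model nearly matches a complex one, that gap, or its absence, is itself a lesson about the problem’s structure.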


4. Ask the Wrong Question

Carefully design your modeling project to address a key business problem.  If possible, craft an error metric to reflect the real-world trade-offs.
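For instance, suppose (hypothetically) that a missed positive costs ten times what a false alarm does; that trade-off can be written directly into the metric used to select models. A sketch with assumed costs:

```python
# Sketch: score models by assumed business cost rather than raw accuracy.
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

COST_FN, COST_FP = 10.0, 1.0   # assumed costs -- set these from the real problem

def dollar_cost(y_true, y_pred):
    # Total cost of errors: here a miss is assumed 10x worse than a false alarm.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return COST_FN * fn + COST_FP * fp

# greater_is_better=False lets model selection minimize cost instead of error rate.
cost_scorer = make_scorer(dollar_cost, greater_is_better=False)

print(dollar_cost(np.array([1, 1, 0, 0]), np.array([0, 1, 1, 0])))  # one miss + one false alarm = 11.0
```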


5. Listen (only) to the Data

The data may not be complete or balanced. Use domain knowledge to supplement and constrain a model blinded to anything not in its data.
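One concrete way, among many, to constrain a model with domain knowledge is a monotonic constraint. The sketch below assumes the domain tells us the outcome cannot fall as the first feature rises, and enforces that with scikit-learn’s HistGradientBoostingRegressor:

```python
# Sketch: enforce domain knowledge (a monotone effect) the data alone might not show.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=1.0, size=300)   # true effect of feature 0 is increasing

# monotonic_cst: +1 forces a non-decreasing effect, -1 non-increasing, 0 unconstrained
model = HistGradientBoostingRegressor(monotonic_cst=[1, 0], random_state=0)
model.fit(X, y)
```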


6. Accept Leaks from the Future

Early results will be too good to be true if future information, not knowable at the time the model would actually make its decision, is included in the training data.
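With time-stamped data, the simplest defense is to split chronologically so that no training row postdates any test row. A minimal sketch, assuming the rows are already in time order:

```python
# Sketch: chronological splits guarantee the model never trains on the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # rows assumed ordered by time
y = rng.integers(0, 2, size=100)

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # every training row strictly precedes every test row
    assert train_idx.max() < test_idx.min()
```

Just as important is auditing each feature for values that could only have been recorded after the outcome occurred.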


7. Discount Pesky Cases

Outliers can ruin your model. Or, they can be valuable discoveries. Leverage points – outliers in X – wield far too much influence on fitted parameters.
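A tiny synthetic demonstration of leverage: one point far out in X drags an ordinary least-squares slope well away from the truth.

```python
# Sketch: a single leverage point can overwhelm fifty well-behaved ones.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2 * x + rng.normal(scale=0.1, size=50)       # true slope = 2

x_lev = np.append(x, 10.0)                        # one point extreme in X...
y_lev = np.append(y, 0.0)                         # ...and off-trend in Y

print("slope, clean data:         %.2f" % np.polyfit(x, y, 1)[0])           # ~2.0
print("slope, one leverage point: %.2f" % np.polyfit(x_lev, y_lev, 1)[0])   # dragged toward 0
```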


8. Extrapolate

This term is “overloaded” with three meanings:  you can learn too much from early experiences, have ideas that look useful in low dimensions but break down in high dimensions, or believe too much in the hype of your favorite modeling method.
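The second meaning is easy to demonstrate numerically: in high dimensions, distances concentrate, so the nearest and farthest neighbors of a point become nearly indistinguishable, and low-dimensional geometric intuition quietly fails.

```python
# Sketch: in high dimension, "near" and "far" neighbors converge (distance concentration).
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.uniform(size=(500, d))
    dist = np.linalg.norm(pts[1:] - pts[0], axis=1)   # distances from one reference point
    print(f"d={d:5d}  farthest/nearest ratio = {dist.max() / dist.min():.1f}")
```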


9. Answer Every Inquiry

Models have a data boundary within which they are competent to answer questions.  This is rarely understood, measured, or enforced.
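As a crude sketch of enforcing such a boundary, the hypothetical wrapper below declines to answer any query that falls outside the per-feature range seen in training. Real applicability checks can be far more refined; this only shows the idea.

```python
# Sketch: refuse to predict outside the envelope of the training data.
import numpy as np

class BoundedModel:
    """Wraps any scikit-learn-style model and answers only inside the training ranges."""
    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.lo_, self.hi_ = X.min(axis=0), X.max(axis=0)   # the data boundary
        self.model.fit(X, y)
        return self

    def predict(self, X):
        inside = np.all((X >= self.lo_) & (X <= self.hi_), axis=1)
        preds = self.model.predict(X).astype(float)
        preds[~inside] = np.nan    # decline to answer out-of-boundary queries
        return preds
```

In practice you might log, warn, or route such queries to a human rather than return NaN.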


10. Sample Casually

Sampling down common cases or duplicating rare cases is sometimes valuable, but discipline is needed to do it right, especially when employing cross-validation.
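The key discipline: resample inside each training fold only, never before the split, or duplicated rare cases can land on both sides of the validation boundary and inflate scores. A sketch:

```python
# Sketch: oversample the rare class within each training fold; test folds stay untouched.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (rng.uniform(size=600) < 0.1).astype(int)    # ~10% rare class

scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    rare = np.flatnonzero(y_tr == 1)
    boost = resample(rare, n_samples=int((y_tr == 0).sum()), random_state=0)
    idx = np.concatenate([np.flatnonzero(y_tr == 0), boost])
    model = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    scores.append(model.score(X[te], y[te]))     # evaluated on the untouched test fold
print("mean held-out accuracy:", np.mean(scores))
```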


11. Cherry-Pick Results

The process of searching over a vast space of possible models introduces a hidden complexity to the final model chosen.  To measure its true significance, find out how accurate your algorithm is when modeling “null hypothesis” data, where no relation exists between X and Y.
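This is exactly what Target Shuffling, the remedy named in the footnote, measures. A bare-bones sketch: rerun the entire model search on copies of the data with y shuffled, and treat the best scores found there as the chance baseline; the toy “search” below stands in for your real one.

```python
# Sketch of Target Shuffling: how well does the whole search do on null data?
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def best_search_score(X, y):
    # Stand-in for your full search over models, features, and parameters.
    return max(
        cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0), X, y, cv=5).mean()
        for d in (1, 3, 5, None)
    )

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)          # substitute your real X and y here

actual = best_search_score(X, y)
null_scores = [best_search_score(X, rng.permutation(y)) for _ in range(20)]
p_value = np.mean([s >= actual for s in null_scores])
print(f"actual = {actual:.3f}   null mean = {np.mean(null_scores):.3f}   p ~ {p_value:.2f}")
```

If the real score does not clearly beat the null distribution, the “discovery” is likely an artifact of the search itself.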


12. Believe the Best Model

The best model – perhaps an ensemble of many – may be impossible to interpret.  Even if it is simple, too much can be read into the features used.  It’s best to measure the influence of features over a suite of competing top models.
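A hedged sketch of that remedy: compute permutation importance for several strong models on the same held-out data, and trust only the features that matter to most of them, rather than the single winner’s favorites. The dataset and models here are illustrative choices.

```python
# Sketch: compare feature influence across a suite of top models, not just one.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

suite = [
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
]
for model in suite:
    model.fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
    top3 = imp.importances_mean.argsort()[::-1][:3]
    print(type(model).__name__, "top features:", top3)
```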

Good judgement comes from experience. Experience comes from bad judgement.


Wise you are if you can learn from the mistakes of others!


Stay tuned as I take on each of these mistakes in the posts ahead and show how to avoid them.




[1] The Crisis is that most published work, even in reputable journals, doesn’t replicate when tested, which invalidates the findings. But most of the Crisis can be solved by using Target Shuffling to assess significance accurately.

Dr. John Elder

Vice President and Technical Fellow for Data Science

Dr. John Elder serves as Vice President and Technical Fellow for Data Science within MANTECH’s Data and AI Practice, where he provides technical leadership, thought leadership, and strategic guidance on advanced analytics, machine learning, and AI solutions supporting mission-critical government and commercial programs.
