Abbott Analytics: Data Mining Consulting
About Us

Mr. Abbott is a seasoned instructor who has taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including at DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses and explains concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott has also taught applied data mining courses for major software vendors on products including Clementine (SPSS), Affinium Model (Unica Corporation), and Model 1 (Group1 Software), as well as hands-on courses using S-Plus and Insightful Miner (Insightful Corporation) and CART (Salford Systems).

Abbott Insights™, Data Mining Advice

Insight 3: Create Three Sampled Data Sets, not Two

Topic: Modeling
Sub-Topic: Sampling
Date Posted: February 2005


It is common advice to split data into two sets for modeling: a training set and a testing set. The training set is used to build a model, and the testing set is used to assess it. If model accuracy is good on the training set but poor on the testing set, that is a good indication that the model has been overfit; in other words, the model has picked up on patterns that are specific to the training data and do not generalize. In this case, the best course of action is to adjust the parameters of the modeling algorithm so that a simpler model is created, whether that means fewer inputs (for neural networks, regression, nearest neighbor, etc.) or fewer nodes or splits (for neural networks or decision trees). Then retrain the model and retest it to see whether results have improved, particularly on the testing data.
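As a rough illustration of the train-versus-test comparison described above, the following sketch fits a toy nearest-neighbor classifier at several complexity settings and reports accuracy on both sets. Everything here is an assumption made for illustration: the synthetic one-dimensional data, the 20% label noise, the helper names (`make_data`, `knn_predict`, `accuracy`), and the use of the neighbor count `k` as the "simplicity" parameter (larger `k` gives a smoother, simpler model). It is not the author's procedure or any particular tool's API.

```python
import random

def make_data(n, rng, noise=0.2):
    """Hypothetical data: label is (x > 0.5), flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x (k should be odd)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > k

def accuracy(train, data, k):
    """Fraction of rows in `data` that the k-NN model built on `train` gets right."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

rng = random.Random(0)
train = make_data(300, rng)
test = make_data(300, rng)

# results[k] = (training accuracy, testing accuracy)
results = {}
for k in (1, 5, 25):
    results[k] = (accuracy(train, train, k), accuracy(train, test, k))
    print(f"k={k:2d}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With `k=1` the model memorizes the training rows, so its training accuracy is perfect by construction; a testing accuracy well below that is the overfitting signal. Comparing the rows for larger `k` shows whether a simpler model narrows the gap.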

However, if one does this several times, or even dozens of times (which is common), the testing data ceases to be an independent assessment of model performance, because it has been used to choose the inputs or algorithm parameters. It is therefore strongly recommended to hold out a third data set for a final validation. This validation step should occur only after training and testing have provided confidence that the model is good enough to deploy.
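A minimal sketch of a three-way split, assuming simple random sampling without stratification. The function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not values recommended by the article.

```python
import random

def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and partition them into training, testing, and validation sets.

    The validation set receives whatever remains after the training and
    testing fractions are taken.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and leave room for validation")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

# Hypothetical usage: 1,000 record IDs standing in for real rows.
train, test, validation = three_way_split(range(1000))
print(len(train), len(test), len(validation))
```

The training and testing sets can then be reused freely while inputs and parameters are tuned; the validation set is scored once, at the end, so that it remains an independent estimate.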

Diagram depicting sampled data sets
