Insight 4: Use Priors to Balance Class Counts
Topic: Modeling
Sub-Topic: Decision Trees
Date Posted: March 2005
One well-known difficulty in building classification models occurs when one class vastly outnumbers the others. For example, if the output (target) variable has 95% 0s and 5% 1s, a neural network could predict that every record is 0 and still be 95% accurate. Of course, such a model is meaningless. The problem arises when there are contradictions in the data, that is, when the same input pattern appears with output values of both 0 and 1. If more of those records have an output value of 0, the classifier will choose 0 as the more likely answer.
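To make the 95% figure concrete, here is a minimal sketch in Python. The 1,000-record target column is made up for illustration, and the "model" is simply a rule that always predicts 0.

```python
# Hypothetical target column: 95% zeros and 5% ones (1,000 records).
targets = [0] * 950 + [1] * 50

# A "model" that ignores its inputs and always predicts the majority class.
predictions = [0] * len(targets)

# Overall accuracy: share of records where prediction equals target.
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# Recall on the rare class: share of actual 1s that were predicted as 1.
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, targets))
recall_on_ones = true_positives / sum(targets)

print(f"accuracy = {accuracy:.2f}, recall on 1s = {recall_on_ones:.2f}")
# prints: accuracy = 0.95, recall on 1s = 0.00
```

The accuracy looks respectable, yet the rule never finds a single 1, which is usually the class of interest.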
The most common correction when building neural networks on data with a large imbalance in class counts is simply to balance the counts of 0s and 1s, either by removing excess records with 0s or by duplicating records with 1s. That approach will be covered in a future issue of Abbott Insights™.
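The two sampling corrections can be sketched in a few lines of Python. This is only an illustration under assumed conventions: records are dictionaries with a hypothetical `target` field, and the function names `undersample` and `oversample` are invented here, not taken from any particular tool.

```python
import random

def split_by_size(records, target_key="target"):
    """Return (minority, majority) lists of records for a 0/1 target."""
    zeros = [r for r in records if r[target_key] == 0]
    ones = [r for r in records if r[target_key] == 1]
    minority, majority = sorted([zeros, ones], key=len)
    return minority, majority

def undersample(records, target_key="target", seed=0):
    """Balance classes by randomly dropping excess majority-class records."""
    rng = random.Random(seed)
    minority, majority = split_by_size(records, target_key)
    return minority + rng.sample(majority, len(minority))

def oversample(records, target_key="target", seed=0):
    """Balance classes by randomly duplicating minority-class records."""
    rng = random.Random(seed)
    minority, majority = split_by_size(records, target_key)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Hypothetical data: 50 records with target 1 and 950 with target 0.
records = [{"id": i, "target": 1 if i < 50 else 0} for i in range(1000)]

under = undersample(records)
over = oversample(records)
print(len(under), sum(r["target"] for r in under))  # 100 records, 50 ones
print(len(over), sum(r["target"] for r in over))    # 1900 records, 950 ones
```

Removing records discards 900 of the 950 majority-class examples, while duplicating keeps everything but repeats each rare record many times.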
However, some algorithms can accomplish this balancing without sampling: the analyst specifies the expected prior probability of each class value (the priors). The CART decision tree algorithm is one algorithm with settings for this. The advantage is that no data is thrown away, yet the classifier won’t favor the overrepresented class value over the underrepresented one.
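One way to see how priors can rebalance a tree without touching the data is to reweight the class proportions inside a node. The sketch below uses a common textbook formulation in which each class's share of a node is scaled by its prior divided by its overall frequency. Whether a given software package computes it exactly this way is an assumption here, and the function name and counts are hypothetical.

```python
def prior_adjusted_probs(node_counts, total_counts, priors):
    """Class probabilities within a tree node, reweighted by specified priors.

    node_counts[c]  : records of class c that reach this node
    total_counts[c] : records of class c in the whole training set
    priors[c]       : prior probability assigned to class c
    """
    weighted = {
        c: priors[c] * node_counts[c] / total_counts[c] for c in node_counts
    }
    norm = sum(weighted.values())
    return {c: w / norm for c, w in weighted.items()}

# Hypothetical training set (95% zeros) and one node holding 70 zeros, 30 ones.
total = {0: 950, 1: 50}
node = {0: 70, 1: 30}

# Priors taken from the data reproduce the raw proportions in the node.
data_priors = {c: n / sum(total.values()) for c, n in total.items()}
as_observed = prior_adjusted_probs(node, total, data_priors)

# Equal priors treat both classes as if they were equally common.
equal_priors = {0: 0.5, 1: 0.5}
balanced = prior_adjusted_probs(node, total, equal_priors)

print(f"P(1) with data priors:  {as_observed[1]:.2f}")  # 0.30
print(f"P(1) with equal priors: {balanced[1]:.2f}")     # 0.89
```

With priors taken from the data, the node is labeled 0 because 0s outnumber 1s there. With equal priors the same node is labeled 1, since it contains 60% of all the 1s but only about 7% of all the 0s. No record is removed or duplicated.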