Abbott Analytics: Data Mining Consulting

About Us

Mr. Abbott is a seasoned instructor, having taught a wide range of data mining tutorials and seminars for a decade to audiences of up to 400, including DAMA, KDD, AAAI, and IEEE conferences. He is the instructor of well-regarded data mining courses, explaining concepts in language readily understood by a wide range of audiences, including analytics novices, data analysts, statisticians, and business professionals. Mr. Abbott also has taught applied data mining courses for major software vendors, including Clementine (SPSS), Affinium Model (Unica Corporation), Model 1 (Group1 Software), and hands-on courses using S-Plus and Insightful Miner (Insightful Corporation), and CART (Salford Systems).


Abbott Insights™, Data Mining Advice

Insight 1: Find Correlated Variables Prior to Modeling

Topic: Data Understanding and Data Preparation
Sub-Topic: Feature Selection
Date Posted: December 2004


Many data sets contain highly correlated variables that measure the same kind of information in different ways. The same problem often occurs when in-house data is appended with third-party data (census data, for example). Some algorithms will build unstable models if two or more highly correlated variables are included in the model, and others will just slow down. Either way, it is a good idea to remove highly (linearly) correlated variables. But how do you identify them and remove them?
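One way to see why correlated inputs can destabilize a model is the variance inflation factor from ordinary least-squares regression. The sketch below assumes the simplest case, a linear regression with exactly two predictors whose correlation is r, where the variance of each coefficient estimate is inflated by 1 / (1 - r^2). The function name `variance_inflation` is made up for this illustration, and other algorithms will react to correlated inputs differently.

```python
def variance_inflation(r: float) -> float:
    """Variance inflation factor for a linear regression with two predictors.

    r is the Pearson correlation between the two predictors. With only two
    predictors, the factor reduces to 1 / (1 - r^2).
    """
    if abs(r) >= 1.0:
        raise ValueError("perfectly correlated predictors have no unique solution")
    return 1.0 / (1.0 - r * r)


# Show how quickly the inflation grows as the correlation approaches 1.
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  variance inflation = {variance_inflation(r):6.2f}")
```

The inflation is modest at moderate correlations and grows steeply beyond about 0.9, which is consistent with treating very high correlations as the ones worth removing.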

Frequently, data mining software packages allow you to measure correlation between variables, but they don’t typically allow you to select a variable subset based on some correlation threshold. A trick for relatively small data sets that can fit into Excel is the following. Export a sample of the real-valued columns of data as tab- or comma-delimited text, and load it into Excel. Use the correlation data analysis option to create the correlation matrix. Then use the conditional formatting option in Excel to highlight the cells where high correlations occur in one color (green), medium correlations in a second color (orange), and low correlations in a third color (blue). Typically I use logic like “if the cell value is not between 0.9 and -0.9, color the cell green.”
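For readers working outside Excel, the same idea can be sketched in plain Python. This is an illustration, not the Excel procedure itself: the helper names (`pearson`, `correlation_matrix`, `color_for`), the sample columns, and the 0.5 cutoff between "medium" and "low" are all assumptions, since the text only specifies the 0.9 threshold.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def color_for(r, high=0.9, medium=0.5):
    """Mimic the conditional formatting: green if not between -high and high.

    The medium cutoff (0.5) is an assumed value, not one given in the text.
    """
    size = abs(r)
    if size > high:
        return "green"
    if size > medium:
        return "orange"
    return "blue"


# Hypothetical real-valued columns standing in for an exported data sample.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, -1, 0, -1, 1],
}

matrix = correlation_matrix(columns)
for row in matrix:
    print(row, [color_for(matrix[row][col]) for col in matrix])
```

Every diagonal cell comes out green because each variable correlates perfectly with itself, just as it would in the Excel sheet.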

Once the cells are color coded, one typically sees blocks of variables that are highly correlated with one another. The threshold depends on the application, but I typically use +/- 0.9. Only one variable from each block is needed to represent that idea in the model; remove the others from the list of candidate inputs. This process can remove half or more of the variables from consideration without losing the ability to build reliable models. Additionally, the visual correlation matrix provides insights into variable groupings not readily available without doing some kind of factor analysis or principal component analysis.
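The removal step can also be automated. The sketch below is one possible greedy rule, not necessarily how the blocks would be chosen by eye: it walks the variables in order and keeps each one unless it is correlated beyond the threshold with a variable already kept. The function name `select_inputs`, the variable names, and the correlation values are hypothetical.

```python
def select_inputs(corr, threshold=0.9):
    """Greedy filter: keep a variable unless it is highly correlated
    (absolute correlation above threshold) with one already kept.

    corr is a nested dict where corr[a][b] is the correlation of a and b.
    Returns (kept, dropped) as lists in the original variable order.
    """
    kept, dropped = [], []
    for name in corr:
        if any(abs(corr[name][other]) > threshold for other in kept):
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


# Hypothetical correlation matrix with two highly correlated pairs.
corr = {
    "age":        {"age": 1.00, "birth_year": -0.99, "income": 0.30, "home_value": 0.25},
    "birth_year": {"age": -0.99, "birth_year": 1.00, "income": -0.31, "home_value": -0.24},
    "income":     {"age": 0.30, "birth_year": -0.31, "income": 1.00, "home_value": 0.93},
    "home_value": {"age": 0.25, "birth_year": -0.24, "income": 0.93, "home_value": 1.00},
}

kept, dropped = select_inputs(corr)
print("keep:", kept)
print("drop:", dropped)
```

Because the rule is order dependent, listing the variables you most want to retain first (for example, the ones that are easiest to explain or cheapest to collect) determines which member of each block survives.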

A sample correlation matrix is shown below.

Chart depicting data set

Bands 3, 4, and 5 are correlated with each other above the 0.9 level
Bands 8, 9, and 10 are correlated with each other above the 0.9 level
Bands 11 and 12 are correlated with each other above the 0.9 level

Therefore, one may want to remove bands 4, 5, 9, 10, and 12 from the candidate input list.
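The removal list above follows mechanically from the three correlated groups. As a small sketch, assuming the rule is to keep the lowest-numbered band in each group (any single member would serve equally well), the bands to remove can be derived like this. The function name `bands_to_remove` is made up for the example.

```python
def bands_to_remove(groups):
    """For each group of mutually correlated bands, keep the lowest-numbered
    band and return all the others as candidates for removal."""
    removed = []
    for group in groups:
        _keep, *rest = sorted(group)
        removed.extend(rest)
    return removed


# The correlated groups stated in the text.
groups = [[3, 4, 5], [8, 9, 10], [11, 12]]

print(bands_to_remove(groups))
```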
