Date of Award
Quantitative Research Methods
Kathy Green, Ph.D.
Classification, Classification and regression trees, Group membership, Linear discriminant analysis, Logistic regression, Simulation
Logistic Regression (LR), Linear Discriminant Analysis (LDA), and Classification and Regression Trees (CART) are common classification techniques for predicting group membership. Because these methods serve similar purposes through different procedures, it is important to evaluate their performance under different controlled conditions. With this information in hand, researchers can choose the method best suited to a given set of conditions. Following previous research that reported the effects of conditions such as sample size, homogeneity of variance-covariance matrices, effect size, and predictor distributions, this research focused on the effects of correlation between predictor variables, number of predictor variables, number of groups in the outcome variable, and group size ratio on the performance of LDA, LR, and CART. Data were simulated with Monte Carlo procedures in R statistical software, and a factorial ANOVA with follow-ups was employed to evaluate the effect of these conditions on the performance of each technique, as measured by the proportion of correctly predicted observations for all groups and for the smallest group.
In most conditions, for both outcome measures, CART performed better than LDA and LR. However, in some conditions with a higher number of predictor variables and a higher number of groups, combined with low correlation between predictor variables, LR outperformed CART. Meaningful effects of method, correlation, number of predictor variables, number of groups, and group size ratio on the prediction accuracy of group membership were observed. The effects of correlation, group size ratio, number of groups, and number of predictor variables on prediction accuracy were larger for LDA and LR than for CART. For all three methods, lower correlation and a greater number of predictor variables yielded higher prediction accuracy. Balanced rather than imbalanced data, and a greater number of groups, led to lower prediction accuracy for all groups, but having more groups led to better predictions for the smallest group. In general, based on these results, researchers are encouraged to apply CART in most conditions, except when there are many predictor variables (around 10 or more) and more than two groups with low correlations between predictor variables, in which case LR might provide more accurate results.
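The two outcome measures named in the abstract, the proportion of correctly predicted observations for all groups and for the smallest group, can be illustrated with a minimal sketch. This is not the study's code, which was written in R. The function name `accuracy_measures`, the group labels, and the example data are hypothetical, and breaking ties between equally small groups by first appearance is an assumption made only for this illustration.

```python
from collections import Counter


def accuracy_measures(actual, predicted):
    """Return (overall accuracy, accuracy within the smallest actual group).

    Hypothetical helper for illustration; not taken from the dissertation.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")

    # One boolean per observation: was the predicted group the actual group?
    hits = [a == p for a, p in zip(actual, predicted)]
    overall = sum(hits) / len(hits)

    # The smallest group is the actual group with the fewest observations.
    # Assumption: ties are broken by first appearance in `actual`.
    counts = Counter(actual)
    smallest = min(counts, key=counts.get)
    hits_in_smallest = [h for a, h in zip(actual, hits) if a == smallest]

    return overall, sum(hits_in_smallest) / len(hits_in_smallest)


# Made-up example with a 6:4 group size ratio.
actual = ["a"] * 6 + ["b"] * 4
predicted = ["a", "a", "a", "a", "a", "b", "b", "b", "a", "a"]
overall, smallest_group = accuracy_measures(actual, predicted)
```

In this made-up example, 7 of the 10 observations are classified correctly, but only 2 of the 4 observations in the smaller group are, which shows why the two measures can diverge when groups are imbalanced.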
Polat, Cahit, "Performance Evaluation of Logistic Regression, Linear Discriminant Analysis, and Classification and Regression Trees under Controlled Conditions" (2018). Electronic Theses and Dissertations. 1503.
Received from ProQuest