Date of Award

12-2022

Document Type

Dissertation

Degree Name

Ph.D.

Organizational Unit

College of Natural Science and Mathematics, Mathematics

First Advisor

Nicholas Cutforth

Second Advisor

Frederique Chevillot

Third Advisor

Kathy Green

Fourth Advisor

Antonio Olmos

Keywords

Collinearity, LASSO, Logistic regression, RIDGE, Sample size, Weight

Abstract

Logistic Regression (LR), LASSO regression, and RIDGE regression are standard classification techniques for predicting a dichotomous outcome. Because these methods serve similar purposes but have different features, it is crucial to evaluate their performance under different controlled conditions. With this information, researchers can apply the optimal method for specific conditions.

Previous research has reported how conditions such as sample size and multicollinearity affect the performance of classification methods. Building on that work, this research examined how three controlled conditions (sample size, level of predictor collinearity, and predictor variable weight) affect the performance of LR, LASSO, and RIDGE regression. Data were simulated in R statistical software with 100 iterations, generating a total of n = 2,400 observations. A factorial ANOVA with follow-up analyses was employed to evaluate the effect of each condition on the performance of each technique, as measured by accuracy and F-measure.
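
The two performance measures named above can be illustrated with a minimal sketch. It is not taken from the dissertation: the abstract does not state which F-measure variant was used, so the code assumes the common F1 form (the harmonic mean of precision and recall) with the positive class coded as 1. The function name `accuracy_and_f_measure` and the example labels are hypothetical.

```python
def accuracy_and_f_measure(y_true, y_pred):
    """Return (accuracy, F-measure) for binary labels coded 0/1.

    Assumption: F-measure is F1, the harmonic mean of precision and
    recall, with 1 treated as the positive class.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return accuracy, f_measure


# Hypothetical observed and predicted classes for eight observations.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f_measure(y_true, y_pred)
print(f"accuracy={acc:.2f}, F-measure={f1:.2f}")
```

In a simulation design like the one described, a function of this kind would be applied to each fitted model's predictions within every combination of conditions, and the resulting values would serve as the dependent variables in the ANOVA.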

In most conditions, predictor variable weight had the largest effect on both performance measures (accuracy and F-measure). However, when the weight was low, all three regression methods performed better overall under high correlation and a large sample size. Moreover, high-weight conditions suppressed the effects of every other controlled condition on accuracy and F-measure. Therefore, when the data included a high-weighted variable, there were no marked differences between the methods, regardless of the level of correlation or the sample size.

Based on these results, researchers are encouraged to first consider the problem they are trying to solve. Understanding the nature of the data and its features can lead to more accurate and efficient implementation of methods, while making it easier to pivot to new analytic problems, adapt when model accuracy drifts, and save data scientists and business users considerable time and effort.

Publication Statement

Copyright is held by the author. User is responsible for all copyright compliance.

Rights Holder

Mahmoud M. AlJuhani

Provenance

Received from ProQuest

File Format

application/pdf

Language

en

File Size

315 pgs

Discipline

Statistics

Available for download on Friday, April 11, 2025


