# A Comparison of Logistic, RIDGE, and LASSO Regression with Heart Failure Risk Data: Effects of Sample Size, Predictor Correlation, and Predictor Weight on Outcome Accuracy

12-2022

Dissertation

Ph.D.

## Organizational Unit

College of Natural Science and Mathematics, Mathematics

Nicholas Cutforth

Frederique Chevillot

Kathy Green

Antonio Olmos

## Keywords

Collinearity, LASSO, Logistic regression, RIDGE, Sample size, Weight

## Abstract

Logistic regression (LR), LASSO regression, and RIDGE regression are standard classification techniques for predicting a dichotomous outcome. Because these methods serve similar purposes but have different features, it is crucial to evaluate their performance under different controlled conditions. With this information, researchers can apply the optimal method for specific conditions.
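For orientation, the three methods can be written as variations on one objective. The notation below is an assumption for illustration (the abstract does not give formulas): $y_i \in \{0, 1\}$ is the outcome, $x_i$ the predictor vector, $\beta_0$ and $\beta$ the intercept and coefficients, and $\lambda \ge 0$ a tuning parameter. Scaling conventions (for example, dividing the log-likelihood by $n$) vary between software packages.

$$
\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right) - \log\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
$$

$$
\text{LR:} \quad \min_{\beta_0, \beta} \; -\ell(\beta_0, \beta)
$$

$$
\text{RIDGE:} \quad \min_{\beta_0, \beta} \; -\ell(\beta_0, \beta) + \lambda \sum_{j=1}^{p} \beta_j^{2}
$$

$$
\text{LASSO:} \quad \min_{\beta_0, \beta} \; -\ell(\beta_0, \beta) + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

The squared penalty in RIDGE shrinks coefficients toward zero without removing them, whereas the absolute-value penalty in LASSO can set some coefficients exactly to zero.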

Previous research has reported the effects of conditions such as sample size and multicollinearity on the performance of classification methods. Following that work, this research focused on how sample size, level of predictor collinearity, and predictor variable weight, when controlled, affect the performance of LR, LASSO, and RIDGE regressions. Data were simulated in R statistical software with 100 iterations, generating a total of n = 2,400 observations. A factorial ANOVA with follow-up analyses was employed to evaluate the effect of conditions on the performance of each technique as measured by accuracy and F-measure.
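To make the three controlled conditions concrete, here is a minimal sketch of how one simulated dataset might be generated. It is written in Python rather than the R used in the study, and everything in it is an assumption for illustration: the function name `simulate_dataset`, the equicorrelated predictor structure, the number of predictors, and the fixed small weight of 0.1 on the remaining predictors are hypothetical and are not taken from the dissertation.

```python
import math
import random


def simulate_dataset(n, rho, weight, n_predictors=5, seed=0):
    """Simulate n rows of correlated predictors and a dichotomous outcome.

    n       -- sample size
    rho     -- pairwise correlation between predictors (0 <= rho < 1)
    weight  -- coefficient on the first predictor
    All settings are illustrative assumptions, not the study's values.
    """
    rng = random.Random(seed)
    rows, outcomes = [], []
    for _ in range(n):
        # A shared factor induces the same correlation between all predictors.
        shared = rng.gauss(0.0, 1.0)
        x = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_predictors)
        ]
        # The first predictor carries `weight`; the others get a small fixed weight.
        linear = weight * x[0] + 0.1 * sum(x[1:])
        p = 1.0 / (1.0 + math.exp(-linear))  # logistic link
        outcomes.append(1 if rng.random() < p else 0)
        rows.append(x)
    return rows, outcomes


# One cell of a hypothetical design: large sample, high correlation, high weight.
X, y = simulate_dataset(n=2000, rho=0.8, weight=2.0, seed=1)
```

Varying `n`, `rho`, and `weight` across a grid and repeating each cell many times would give the kind of factorial layout the abstract describes, although the actual levels used in the study are not stated here.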

In most conditions, for both performance measures (accuracy and F-measure), predictor variable weight had the largest effect on performance. However, when the weight was low, all three regression methods performed better overall under high correlation and a large sample size. Moreover, high-weight conditions suppressed the effects of every other controlled condition on accuracy and F-measure values. Therefore, when the data included a high-weighted variable, there were no marked differences between the methods, regardless of the level of correlation or sample size.
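The two performance measures named above can both be computed from confusion-matrix counts. The sketch below is in Python rather than the R used in the study, and it assumes the balanced F1 form of the F-measure (`beta=1.0`), since the abstract does not say which variant was used. The function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Proportion of observations classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(y_true, y_pred, beta=1.0):
    """F-beta score; beta=1.0 gives the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    # Return 0.0 when there are no true positives, false positives,
    # or false negatives, so the ratio is undefined.
    return (1.0 + b2) * tp / denom if denom else 0.0


# Small made-up example with an imbalanced outcome.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred))   # 0.75
print(f_measure(y_true, y_pred))  # 0.5
```

In this made-up example the two measures disagree because accuracy counts the many correct negatives while the F-measure ignores them, which is one reason to report both.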

Based on these results, researchers are encouraged to first consider the problem they are trying to solve. Understanding the nature of the data and its features can lead to more accurate and efficient implementation of methods, while making it easier to pivot to new analytic problems, adapt when model accuracy drifts, and save data scientists and business users considerable time and effort.

## Publication Statement

Copyright is held by the author. User is responsible for all copyright compliance.

## Rights Holder

Mahmoud M. AlJuhani

application/pdf

en

315 pgs

Statistics