
SL Screens

Investigating the use of screens in Super Learner ensembles

Drew King, Brian D Williamson, Ying Huang

Abstract

Clinical research trials often generate large datasets with many variables, which are analyzed to identify potential relationships with clinical outcomes. Machine learning techniques are increasingly employed for such analyses, yet little is known about how accurate these algorithms are in different settings. This research aims to establish guidelines for combining variable selection algorithms by assessing their accuracy under various conditions. Our findings indicate that pairing an ensemble of algorithms with the Lasso variable selection tool can produce inconsistent accuracy on certain types of datasets. A preprint of this work is available on arxiv.org. The methodology will be shared upon peer review.

Introduction

Clinical research often involves large-scale studies, such as those on vaccines or cancer early detection, but rare diseases can result in smaller sub-samples. This imbalance complicates variable selection because many existing tools are not optimized for datasets where some outcomes or categories are uncommon.

Despite these challenges, machine learning has become a mainstay in biomedical research because it can handle a large number of variables and detect complex relationships between variables and clinical outcomes. However, performance varies significantly across different algorithms, suggesting that no single approach is universally optimal. One proposed solution is to adopt an ensemble method—combining outputs from multiple algorithms—to leverage their individual strengths. At present, there are no widely accepted guidelines on how best to combine these procedures, creating a need for clear recommendations. Our goal is to develop such guidelines, potentially improving research in domains like vaccine efficacy and cancer detection.
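As a concrete illustration of the ensemble idea, the sketch below pairs a Lasso-based variable screen with a stacked ensemble, using Python and scikit-learn as a stand-in for the Super Learner (which is typically fit with the R SuperLearner package). The simulated data, candidate learners, and pipeline details are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch (not the authors' code): a Lasso screen feeding a stacked
# ensemble, as a scikit-learn analogue of "SL with screens".
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Simulated data: many candidate variables, few of them truly informative.
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=0)

# Lasso-based screen: keeps variables whose Lasso coefficients are
# effectively nonzero (SelectFromModel's default threshold for L1 models).
lasso_screen = SelectFromModel(LassoCV(cv=5))

# SL-like stacked ensemble: candidate learners combined by a meta-learner
# trained on their cross-validated predictions.
ensemble = StackingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Screen + ensemble, evaluated by cross-validated accuracy.
screened_sl = make_pipeline(lasso_screen, ensemble)
print(cross_val_score(screened_sl, X, y, cv=5).mean())
```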

Contributions

Brian Williamson, PhD

Drew King, BSc

Ying Huang, PhD

Results

Our preliminary analysis shows that pairing the Lasso screen with the Super Learner (SL) is not always a beneficial combination, a somewhat surprising result. In particular:

Figure: chart comparing the performance of SL with screens and the Lasso alone.
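For intuition about the comparison shown in the figure, the hypothetical sketch below contrasts the cross-validated performance of a Lasso-type model used on its own against a Lasso screen feeding an SL-like stacked ensemble. The data-generating settings (including a rare outcome, as in the motivating studies), learners, and metric are assumptions chosen for illustration, not the study's actual benchmark.

```python
# Hypothetical comparison sketch: Lasso alone vs Lasso screen + SL-like ensemble.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Simulated data with a rare outcome (~10% of samples in the minority class).
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           weights=[0.9, 0.1], random_state=1)

# (a) L1-penalised (Lasso-type) logistic regression used directly for prediction.
lasso_alone = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

# (b) Lasso screen followed by an SL-like stacked ensemble.
screened_sl = make_pipeline(
    SelectFromModel(LassoCV(cv=5)),
    StackingClassifier(
        estimators=[
            ("logit", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
)

# Head-to-head comparison by cross-validated AUC.
for name, model in [("Lasso alone", lasso_alone),
                    ("SL with Lasso screen", screened_sl)]:
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: CV AUC = {score:.3f}")
```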

Note: All code will be open-sourced under the MIT license on GitHub once the paper is published.

Next Steps

We plan to train researchers on applying these ensemble guidelines and to continue developing our methods.

Once published, our results will be fully reproducible with standard scientific computing tools, allowing others to validate and extend our findings.