
SL Screens
Investigating the use of screens in Super Learner ensembles
Drew King, Brian D Williamson, Ying Huang
Abstract
Clinical research trials often generate large datasets with many variables, which are analyzed to identify potential relationships with clinical outcomes. Machine learning techniques are increasingly employed for such analyses, yet little is known about how accurate these algorithms are in different settings. This research aims to establish guidelines for combining variable-selection algorithms by assessing their accuracy under various conditions. Our findings indicate that using an ensemble of algorithms with the Lasso variable-selection tool can produce inconsistent accuracy on certain types of datasets. A preprint of this work is available on arXiv; the full methodology will be shared once the paper has been peer reviewed.
Introduction
Clinical research often involves large-scale studies, such as vaccine trials or cancer early-detection studies, but rare diseases can leave only small sub-samples for analysis. This imbalance complicates variable selection, because many existing tools are not optimized for datasets in which some outcomes or categories are uncommon.
Despite these challenges, machine learning has become a mainstay in biomedical research because it can handle a large number of variables and detect complex relationships between variables and clinical outcomes. However, performance varies significantly across different algorithms, suggesting that no single approach is universally optimal. One proposed solution is to adopt an ensemble method—combining outputs from multiple algorithms—to leverage their individual strengths. At present, there are no widely accepted guidelines on how best to combine these procedures, creating a need for clear recommendations. Our goal is to develop such guidelines, potentially improving research in domains like vaccine efficacy and cancer detection.
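To make the ensemble idea concrete, the sketch below fits a small Super Learner with the SuperLearner R package, pairing candidate algorithms with screens (variable-selection pre-steps). The simulated data and the particular library shown here are illustrative assumptions, not the configuration used in our experiments.

```r
# A minimal Super Learner sketch using the SuperLearner R package.
# The simulated data and library below are illustrative, not our exact setup.
library(SuperLearner)  # also requires glmnet for SL.glmnet

set.seed(42)

# Simulate 100 observations of 20 covariates; only the first two
# are truly related to the binary outcome.
n <- 100
p <- 20
X <- as.data.frame(matrix(rnorm(n * p), nrow = n))
Y <- rbinom(n, size = 1, prob = plogis(0.8 * X[[1]] - 0.8 * X[[2]]))

# Candidate library: each entry pairs a prediction algorithm with a screen.
# "All" keeps every variable; screen.corP keeps covariates with a small
# univariate correlation p-value before the algorithm is fit.
sl_library <- list(
  c("SL.glm", "screen.corP"),  # logistic regression on screened variables
  c("SL.glmnet", "All"),       # lasso on all variables
  c("SL.mean", "All")          # intercept-only benchmark
)

fit <- SuperLearner(Y = Y, X = X, family = binomial(),
                    SL.library = sl_library)
print(fit)  # cross-validated risk and ensemble weight for each candidate
```

The ensemble assigns cross-validated weights to each algorithm/screen pair, so a screen that removes noise variables can directly improve the candidates built on top of it.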
Contributions
Brian Williamson, PhD
- Designed the experiment and wrote the simulation code in R.
- Mentored Drew in developing analysis code.
- Authored the research paper and abstract.
Drew King, BSc
- Collaborated on deploying the experiment to a Linux server cluster.
- Created R scripts (with dplyr and ggplot2) to aggregate and visualize results (see the sketch after this section).
- Designed a poster to present the research to MESA (an organization for minority students in STEM).
Ying Huang, PhD
- Provided guidance on experimental design and data analysis.
- Offered feedback and direction on the research paper.
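The aggregation and plotting step mentioned in Drew's contributions above might look something like the following sketch with dplyr and ggplot2; the "results/" directory, the column names (method, n, accuracy), and the output filename are hypothetical stand-ins, not the project's actual scripts.

```r
# Hypothetical aggregation-and-plotting sketch with dplyr and ggplot2.
# The "results/" directory, the column names (method, n, accuracy), and the
# output filename are invented stand-ins for the project's actual layout.
library(dplyr)
library(ggplot2)

# Suppose each simulation run wrote one CSV of per-replicate accuracy.
files <- list.files("results", pattern = "\\.csv$", full.names = TRUE)
results <- bind_rows(lapply(files, read.csv))

# Average variable-selection accuracy per method and sample size.
summary_df <- results %>%
  group_by(method, n) %>%
  summarise(mean_accuracy = mean(accuracy), .groups = "drop")

p <- ggplot(summary_df, aes(x = n, y = mean_accuracy, colour = method)) +
  geom_line() +
  geom_point() +
  labs(x = "Sample size", y = "Mean variable-selection accuracy")

ggsave("accuracy_by_method.png", plot = p, width = 6, height = 4)
```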
Results
Our preliminary analysis shows that Lasso + Super Learner (SL) is not always a beneficial combination, a surprising result. In particular:
- SL with appropriate screening mechanisms (or with Lasso and screens) tends to be robust across most datasets.
- Lasso alone can be unreliable in many instances (see the sketch after this list).
- Lasso + SL without any screening is especially inconsistent in certain cases.
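For concreteness, here is a minimal sketch of what "Lasso alone for variable selection" means in this context, using cv.glmnet on simulated data; the data-generating setup and accuracy check are illustrative, not our full simulation design.

```r
# Minimal "Lasso alone" variable-selection sketch with glmnet.
# The data-generating setup and accuracy check are illustrative only.
library(glmnet)

set.seed(1)
n <- 100
p <- 50
X <- matrix(rnorm(n * p), nrow = n)
beta <- c(1, -1, 0.5, rep(0, p - 3))  # only the first 3 variables matter
Y <- rbinom(n, size = 1, prob = plogis(drop(X %*% beta)))

# Cross-validated lasso; covariates with nonzero coefficients are "selected".
cv_fit <- cv.glmnet(X, Y, family = "binomial", alpha = 1)
coefs <- as.vector(coef(cv_fit, s = "lambda.min"))[-1]  # drop intercept
selected <- which(coefs != 0)

truth <- 1:3
cat("Selected:", selected, "\n")
cat("True positives:", length(intersect(selected, truth)),
    "| False positives:", length(setdiff(selected, truth)), "\n")
```

Rerunning this sketch with different seeds illustrates the instability behind the second bullet: the selected set can change noticeably from run to run.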
Note: All code will be open-sourced under the MIT license on GitHub once the paper is published.
Next Steps
We plan to train researchers on applying these ensemble guidelines and continue developing our methods. Future work includes:
- Investigating why Lasso alone often underperforms for variable selection.
- Testing other algorithms that are optimized for biomedical data.
- Rewriting the experiment in faster languages (e.g., C or Python) for improved performance.
- Designing interactive data visualizations using D3.js to facilitate more intuitive result exploration.
Once published, our results will be fully reproducible with standard scientific computing tools, allowing others to validate and extend our findings.