Mitigating Bias in AI-based Foundation Models in Healthcare For Minority Groups

NYU Center for Data Science
4 min read · Feb 29, 2024

Self-supervised pre-trained foundation models are increasingly used in disease screening, promising early detection and improved patient outcomes. However, the potential of such technologies is often hindered by biases in the data they learn from. In a new preprint, “Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling,” CDS PhD student Weicheng (Jack) Zhu tackles these biases head-on, improving the self-supervised learning stage so that models learn fairer data representations, which in turn leads to fairer and more accurate screening results. The work is a collaboration with CDS PhD alumnus Sheng Liu (now at Stanford), CDS Associate Professor of Mathematics and Data Science Carlos Fernandez-Granda, and Narges Razavian, Assistant Professor at the Predictive Analytics Unit of the Center for Healthcare Innovation and Delivery Sciences at NYU Langone Medical Center.

As we increasingly rely on general-purpose, self-supervised learning (SSL) pre-trained foundation models across various tasks, ensuring these models are fair becomes paramount. In a recent interview, Zhu emphasized the need to address fairness at the SSL pre-training stage, advocating for models that learn discriminative features equitably across all sub-cohorts. This is especially critical in healthcare, where bias can significantly affect patient care and outcomes.

The heart of the issue lies in how models trained on biased datasets tend to adopt these biases, favoring spurious correlations over meaningful insights. Zhu’s research sheds light on this predicament, specifically in the context of self-supervised learning — a method that relies on unlabeled data to learn representations. “When you collect real-world data, it’s never a perfect distribution of what’s actually out there,” said Zhu. So, in the context of healthcare, “sometimes the model works better in some races or particular subgroups of patients, and this is because, in some datasets, the majority of patients are white or older, which causes the model to fail on younger patients and minorities.” This shortage of available data on minority groups can cause models to focus on irrelevant attributes and create spurious, incorrect correlations, which in the context of healthcare can be dangerous.

To counteract this, Zhu and his team developed learning-speed aware self-supervised learning (LA-SSL), which adjusts the sampling probability of each example based on how quickly the model learns from it, intentionally placing more focus on minority groups within the data. “We propose some methods that can effectively adjust the sample weight, or the probability of each sample being sampled in our training process,” Zhu said, pointing to the core of LA-SSL’s strategy to recalibrate the learning focus and mitigate biases. The method builds on the observation that samples which align with spurious correlations (e.g., old and sick) are learned faster than those that conflict with them (e.g., old but healthy). By dynamically up-weighting the slower-learned, correlation-conflicting examples, LA-SSL reduces the influence of spurious features and yields more equitable representations, as sketched below.
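To make the idea concrete, here is a minimal sketch of learning-speed aware sampling in PyTorch. It is not the authors’ implementation: the function names, the simple first-minus-last loss drop used as a proxy for learning speed, and the temperature parameter are illustrative assumptions; the preprint defines learning speed in its own terms.

```python
# Illustrative sketch (not the authors' code): down-weight samples the model
# learns quickly, since fast-learned samples tend to align with spurious
# correlations, and draw slower-learned samples more often during SSL training.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def learning_speed(loss_history: torch.Tensor) -> torch.Tensor:
    # loss_history: (num_samples, num_checkpoints) per-sample SSL losses
    # recorded during pre-training. A larger drop from the first to the
    # last checkpoint means the sample was learned faster.
    return loss_history[:, 0] - loss_history[:, -1]

def sampling_weights(loss_history: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Faster-learned samples receive smaller sampling probabilities.
    speed = learning_speed(loss_history)
    return torch.softmax(-speed / temperature, dim=0)

# Hypothetical usage with a dataset and tracked per-sample losses:
# weights = sampling_weights(per_sample_loss_history)
# sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(dataset, batch_size=256, sampler=sampler)
```

In a full pipeline these weights would be refreshed periodically as pre-training progresses, so that examples conflicting with spurious correlations keep receiving extra attention.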

Using a chest X-ray dataset, Zhu’s team demonstrates the advantage of the debiased pretrained model on binary classification of medical findings, spanning a wide array of conditions from tumors to pneumonia and diseases affecting the heart or lungs. The research not only aims to improve the fairness of learned representations, reducing bias in downstream diagnoses, but also underscores the framework’s potential to enhance self-supervised learning across other data types.

The implications of Zhu’s team’s research, however, extend far beyond healthcare. The study offers a framework to potentially address biases in various domains where spurious correlations can undermine the integrity of machine learning models, from demographic to geolocational differences among datasets. The insights of Zhu and his colleagues into the distribution of training data spotlight a universal challenge in AI: ensuring that models trained on biased data do not marginalize the underrepresented.

Bias in AI, then, is not just a dataset issue; it’s a learning process challenge. Zhu’s team’s work paves the way for more inclusive AI, demonstrating that through thoughtful adjustments to the learning process, we can make strides towards models that represent all facets of global diversity equally. This research not only advances the field of self-supervised learning but also reinforces the critical need for methodologies that prioritize fairness and robustness in the face of inherent data biases.

The contribution of Zhu and his co-authors is a step towards realizing the full potential of AI, ensuring it can be a force for good in a world rich with data but riddled with disparities.

By Stephen Thomas
