You are probably doing Medical Imaging AI the wrong way.

The most common transfer learning recipe is suboptimal. There’s a better approach.

Giulio Piana
7 min read · Jun 16, 2023

The current practice of building AI applications in the medical imaging space often sticks to a suboptimal recipe: take a model pre-trained on a natural-image dataset such as ImageNet and fine-tune it on medical images.
However, the literature has repeatedly shown that this transfer learning approach has limits and is suboptimal for the medical domain.
Yet, due to commonly held beliefs, transfer learning from ImageNet is still broadly used in the wild, holding back a promising field.
We propose a different approach that leverages recent findings and simplifies the workflow, while opening multiple possibilities for further development.


ImageNet is generally very good for transfer learning.

Images from ImageNet: the top row is from the mammal subtree, and the bottom row is from the vehicle subtree.¹⁰

The ImageNet dataset, featuring natural images, contains 14,197,122 annotated images; its ILSVRC subset, organized into 1,000 classes, is commonly used as a benchmark for computer vision models⁸.

The common practice for developing deep learning models for image-related tasks leveraged the “transfer learning” approach with ImageNet. Practitioners first trained a Convolutional Neural Network (CNN) to perform image classification on ImageNet (i.e. pre-training). Then they used the resulting model as a base to train for a new target task or domain (i.e. fine-tuning).
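
To make the recipe concrete, here is a minimal PyTorch sketch of this standard approach, the very one this article argues against; the class count and learning rates are placeholders, not a prescription:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # placeholder: e.g. diabetic retinopathy grades

# Step 1: start from weights pre-trained on ImageNet classification.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Step 2: swap the 1000-class ImageNet head for a task-specific one.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Step 3: fine-tune, typically with a smaller learning rate for the
# pre-trained backbone than for the freshly initialized head.
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},        # gentle updates for pre-trained layers
    {"params": model.fc.parameters(), "lr": 1e-3},  # fresh head learns faster
])
```

Note that the only medical-specific ingredient here is the fine-tuning data; everything the backbone initially knows comes from natural images, which is exactly the assumption questioned below.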

This “transfer learning” approach was so successful that it became the de-facto standard for solving a broad range of computer vision problems. AI practitioners obtained impressive results for classification datasets², object detection tasks⁹, image captioning⁵, semantic segmentation¹, and many others.

It is not surprising that this approach was, therefore, also widely used for medical imaging AI.

Many commonly held beliefs have since been challenged in the literature.

First and foremost, many believed that the success of ImageNet was due to the massive size of the dataset leading to a generalized representation. Some instead argued that the secret sauce was a large number of distinct object classes (1000) or their fine granularity, nudging the network to learn a hierarchy of generalizable features.

Huh et al.⁴ systematically studied the possible causes of success listed above, with the following outcomes:

1. Pre-training with only half the ImageNet data (500 images per class instead of 1000) results in only a tiny drop in transfer learning performance
2. Pre-training with an order of magnitude fewer classes (127 classes instead of 1000) results in only a tiny drop in transfer learning performance

The authors concluded that “blindly adding more training data does not always lead to better performance and can sometimes hurt performance.”

So, size is not all.

He et al.³ at Facebook AI Research went further, questioning the pre-training paradigm.
The team showed that transfer does not necessarily lead to performance improvements, even when tasks are similar. They stated that:

1. Training from scratch on the target task (instead of pre-training) is possible without architectural changes and without hurting final performance.
2. Collecting annotations of target data (instead of pre-training data) can be more helpful for improving target-task performance.

Finally, Kornblith et al.⁶ at Google Brain argued that pre-trained features might be less general than previously thought.

Then, why did practitioners get good results in their early attempts with this approach?
My humble, possibly controversial opinion is that the success was primarily due to very effective model architectures, such as ResNet, sometimes despite the pre-training on ImageNet. My consistent experience with networks like ResNet (and U-Net) convinced me of their impressive performance potential and their ability to generalize to out-of-sample sets when starting from random weights, often without any pre-training.

Why ImageNet transfer learning is NOT so good for medical images.

Chest X-ray images from the CheXNet Radiologist-Level Pneumonia Detection challenge¹¹

Medical images are very different from the natural images in ImageNet.

Many medical imaging tasks identify pathologies by looking for tiny variations in local textures within a relatively large bodily area of interest. For example, in thoracic X-rays, local white patches in the lungs can indicate pneumonia. In rather large images of the retinal fundus (shown below), tiny red dots are signs of possible diabetic retinopathy. These tasks differ entirely from identifying an animal in ImageNet, where the subject typically appears as a prominent global object against a natural background.

Finally, while ImageNet features 1,000 classes, medical tasks often have significantly fewer: fewer than 15 for chest X-rays and five for diabetic retinopathy.
Raghu et al.⁷ investigated transfer learning for medical imaging, suggested alternative approaches, and concluded that:

Transfer learning offers limited performance gains. The ImageNet task is not necessarily a good indication of success on medical datasets.⁷

Retinal fundus image with diabetic retinopathy signs, from the Diabetic Retinopathy Detection challenge¹²

What you can do instead of ImageNet transfer learning

Creating high-quality annotated datasets is generally considered very expensive in the Medical space, given the need for highly specialized experts for clinical labelling.

Believing they had to match the sheer size of ImageNet, ML practitioners refrained from pre-training on the much smaller medical image datasets available, let alone from developing new ones. With cost constraints ruling out pre-training on medical images, they often resorted to the available (but suboptimal) ImageNet.

Given that pre-training with ImageNet or similar-size datasets is, as shown earlier, neither particularly helpful nor necessary, we can reconsider the whole paradigm.

First, the internal consistency and distinctive features of medical images often lead to excellent results with much smaller datasets, consistent with the results mentioned above. Annotating a few hundred images per class becomes a feasible option, especially if the clinical team is a partner in the project. Furthermore, in many clinical tasks, classes are relatively few (5–15) and reasonably identifiable with a good network architecture.

While practitioners tend to jump to the next shiny toy (read: model), there is a huge opportunity to develop medical datasets for new diseases and applications to better support physicians and patients.

We can broadly summarize our takeaways from both literature and our experience in developing AI medical applications for the real world:

  1. Forget about pre-training with ImageNet. Start from a well-known, proven architecture such as ResNet with random weights (see the sketch after this list).
  2. In some tasks, you can leverage public medical datasets to perform pre-training and generate your base weights for transfer learning.
  3. Physicians often spark the initial clinical idea, drive the AI medical application initiative and are willing to annotate images, sometimes with the help of medical students.
  4. Ensure all the privacy and legal requirements are met (anonymization, patient consent, etc.) and help the physician build a new dataset for a proof of concept.
  5. Design a data augmentation strategy to get the most out of your dataset during model training. Most frameworks, such as PyTorch, provide all kinds of image transformations (e.g. rotation, translation, zoom) applied on the fly at each epoch; the sketch after this list shows an example.
  6. In proofs of concept, we often rely on synthetic data that we generate ourselves for specific tasks, under the guidance of a physician or medical consultant. Besides addressing class imbalance, synthetic data also allows simulating rare clinical situations and helps identify challenges and solutions (a toy example follows below).
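
Takeaways 1 and 5 translate into very little code. Below is a minimal PyTorch sketch, assuming a hypothetical 5-class task; the transform choices and ranges are illustrative only and should be validated with the clinical team (for instance, horizontal flips may be invalid for laterality-sensitive anatomy):

```python
from torchvision import models, transforms

NUM_CLASSES = 5  # placeholder class count

# Takeaway 1: a proven architecture with random weights, no ImageNet checkpoint.
model = models.resnet50(weights=None, num_classes=NUM_CLASSES)

# Takeaway 5: augmentations applied on the fly, so each epoch sees
# slightly different versions of the same annotated images.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```

Note that `weights=None` is the only change needed to drop ImageNet entirely; the architecture itself stays untouched, consistent with He et al.³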
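For takeaway 6, here is a deliberately toy sketch of task-specific synthetic data: tiny red dots drawn on a fundus-like disc, loosely mimicking the microaneurysm-scale features discussed earlier. Every detail here (the `synthetic_fundus` helper, colors, dot sizes) is a made-up illustration, not a clinically validated generator:

```python
import numpy as np
from PIL import Image, ImageDraw

def synthetic_fundus(side=512, n_lesions=5, seed=None):
    """Hypothetical toy generator: a fundus-like disc dotted with tiny red lesions."""
    rng = np.random.default_rng(seed)
    img = Image.new("RGB", (side, side), "black")
    draw = ImageDraw.Draw(img)
    # Circular field of view, as in a real fundus photograph.
    draw.ellipse([10, 10, side - 10, side - 10], fill=(190, 80, 50))
    for _ in range(n_lesions):
        x, y = rng.integers(60, side - 60, size=2)
        r = int(rng.integers(2, 5))  # microaneurysm-scale dot radius, in pixels
        draw.ellipse([x - r, y - r, x + r, y + r], fill=(120, 20, 20))
    return img

# The label can simply be lesion presence or count; vary n_lesions per class.
synthetic_fundus(n_lesions=8, seed=0).save("synthetic_fundus_example.png")
```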

Conclusions

Each AI task and domain in the medical space deserves its own well-designed approach.
Since throwing millions of natural images from ImageNet at the problem did not yield meaningful results in the medical domain, avoid brute-force approaches that aim to solve “everything medical” with a single model. Smaller, specialized applications trained on high-quality data are more likely to be the right tool for specific clinical needs.

Developing dedicated AI models to test a physician’s ideas has become increasingly feasible and affordable thanks to tech enablers such as open-source platforms and advanced GPU availability. High-quality data and physician expert guidance remain critical success factors for medical AI applications.

Ultimately, the ability to allocate effort and limited resources in the right direction will drive better adoption of AI in the clinical space and better patient outcomes.

References

[1] Dai, Jifeng, Kaiming He, and Jian Sun. 2015. “Instance-Aware Semantic Segmentation via Multi-task Network Cascades.” December 14, 2015. http://arxiv.org/abs/1512.04412

[2] Donahue, Jeff, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2013. “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition.” October 5, 2013. http://arxiv.org/abs/1310.1531

[3] He, Kaiming, Ross Girshick, and Piotr Dollár. 2018. “Rethinking ImageNet Pre-training.” November 21, 2018. http://arxiv.org/abs/1811.08883

[4] Huh, Minyoung, Pulkit Agrawal, and Alexei A. Efros. 2016. “What Makes ImageNet Good for Transfer Learning?” December 10, 2016. http://arxiv.org/abs/1608.08614

[5] Karpathy, Andrej, and Li Fei-Fei. 2015. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” April 14, 2015. http://arxiv.org/abs/1412.2306

[6] Kornblith, Simon, Jonathon Shlens, and Quoc V. Le. 2019. “Do Better ImageNet Models Transfer Better?” June 17, 2019. http://arxiv.org/abs/1805.08974

[7] Raghu, Maithra, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. 2019. “Transfusion: Understanding Transfer Learning for Medical Imaging.” October 29, 2019. http://arxiv.org/abs/1902.07208

[8] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” January 29, 2015. http://arxiv.org/abs/1409.0575

[9] Sermanet, Pierre, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. 2014. “OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Networks.” February 23, 2014. http://arxiv.org/abs/1312.6229

[10] Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. n.d. “ImageNet: A Large-Scale Hierarchical Image Database.” https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf?ref=blog.roboflow.com

[11] CheXNet: Radiologist-Level Pneumonia Detection

[12] Diabetic Retinopathy Detection


Giulio Piana

Physics PhD. Founder at Deep AI Lab. Developing end-to-end custom AI/ML solutions: R&D as a Service. Web: www.giuliopiana.com