Skin Type Bias Mitigation in CNN Models for Medical AI

Transformative Technology:

AI & Data

Semester programme:

Enhanced AI Techniques

Client company:

Fontys Venlo

Project group members:

Rana Ilayda Özgül
Anass Akhnikh
Anh Huynh
Aadira Das
Hanh Truong

Project description

How can bias in dermatology AI models, particularly bias related to skin tone, be effectively detected, measured, and mitigated to ensure fair and consistent diagnostic performance across different demographic groups without reducing overall model accuracy?

This research investigates how dataset imbalance and the underrepresentation of darker skin tones affect model performance in skin lesion classification systems.
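To make "measured" concrete, one simple approach is to break an evaluation metric down by skin tone group and report the gap between the best- and worst-served groups. The following is a minimal sketch, assuming accuracy as the metric and a skin tone group label for every test image. The function names are hypothetical, and the project may have used other metrics, such as sensitivity on malignant cases.

```python
from collections import defaultdict


def per_group_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each skin tone group (hypothetical helper)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def max_accuracy_gap(group_acc):
    """Difference between the best- and worst-performing group."""
    return max(group_acc.values()) - min(group_acc.values())


if __name__ == "__main__":
    # Toy data only: 1 = malignant, 0 = benign; group names are illustrative.
    y_true = [1, 0, 1, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
    groups = ["light", "light", "light", "light",
              "medium", "medium", "dark", "dark"]
    acc = per_group_accuracy(y_true, y_pred, groups)
    print(acc)                    # {'light': 1.0, 'medium': 0.5, 'dark': 0.5}
    print(max_accuracy_gap(acc))  # 0.5
```

A model can score well overall while this gap stays large, which is exactly the pattern of good overall accuracy combined with uneven performance across skin tones.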

Context

Artificial intelligence is increasingly used in medical imaging to support diagnosis, including dermatology, where models analyze skin lesion images to help identify conditions such as melanoma and other skin diseases. These systems can speed up diagnosis, reduce human error, and give clinicians an additional decision support tool.

However, the effectiveness of these AI models depends heavily on the quality and representativeness of the training data. A common issue in dermatology datasets is demographic imbalance, particularly in skin tone representation: lighter skin tones are often overrepresented, while darker skin tones are underrepresented.

This imbalance can introduce bias into the trained models, leading to unequal performance across different patient groups.
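Imbalance of this kind can be detected before any training by counting images per (skin tone group, diagnosis) cell and listing the cells that are empty. The sketch below assumes each image already has a group label and a diagnosis label; the helper name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def audit_counts(groups, labels):
    """Count images per (group, label) cell and list cells with no examples."""
    counts = Counter(zip(groups, labels))
    all_cells = product(sorted(set(groups)), sorted(set(labels)))
    # Counter returns 0 for unseen keys without inserting them.
    missing = [cell for cell in all_cells if counts[cell] == 0]
    return counts, missing


if __name__ == "__main__":
    groups = ["light", "light", "light", "dark"]
    labels = ["benign", "malignant", "benign", "benign"]
    counts, missing = audit_counts(groups, labels)
    print(missing)  # [('dark', 'malignant')]
```

An empty cell such as a darker group with no malignant examples means no single-model training scheme can learn that combination from the data alone.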

Results

Early on, we found that the dataset we used (ISIC) is imbalanced: it contains mostly lighter skin tones and even lacks malignant examples for darker skin types. We also found that a continuous skin tone measure, the Individual Typology Angle (ITA), works better than fixed categories such as the Fitzpatrick scale. We tested several CNN models and achieved good overall accuracy, but performance was uneven across skin tones, with the models often struggling on medium tones. The key improvement came from training separate models for different skin tone groups instead of one general model.
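ITA is computed from the CIELAB lightness L* and yellow-blue component b* of skin pixels as ITA = arctan((L* - 50) / b*) × 180/π. The sketch below assumes L* and b* have already been obtained, for example by converting RGB images to CIELAB (scikit-image provides `skimage.color.rgb2lab`) and averaging over non-lesion skin pixels; the project's exact extraction procedure is not described here. The group boundaries are the commonly cited ITA categories and are an assumption, since the project's own grouping is not specified. The continuous value can be kept as is; binning is only needed when routing images to group-specific models.

```python
import math

# Commonly cited ITA category lower bounds in degrees (assumption: the
# project's own groups may differ). Checked from the lightest category down.
ITA_CATEGORIES = [
    (55.0, "very light"),
    (41.0, "light"),
    (28.0, "intermediate"),
    (10.0, "tan"),
    (-30.0, "brown"),
]


def ita_degrees(l_star: float, b_star: float) -> float:
    """Individual Typology Angle from CIELAB L* and b*.

    atan2 equals arctan((L* - 50) / b*) whenever b* > 0, which holds for
    typical skin, and it avoids a division by zero.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def ita_category(ita: float) -> str:
    """Map a continuous ITA value to a coarse skin tone group."""
    for lower_bound, name in ITA_CATEGORIES:
        if ita > lower_bound:
            return name
    return "dark"


if __name__ == "__main__":
    ita = ita_degrees(l_star=65.0, b_star=15.0)
    print(round(ita, 1), ita_category(ita))  # 45.0 light
```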

This improved performance substantially, by about 40%, showing that treating skin tones separately can reduce bias and improve accuracy.

Our final pipeline will combine an automatic ITA-based labeling system with multiple models, since separate models trained on individual skin tone groups performed better than a single model trained across all tones.
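A minimal sketch of how such a pipeline could route each image, assuming the labeling step returns a group name and each group has its own trained classifier. `SkinToneRouter`, `group_of`, `models` and `fallback` are hypothetical names; the fallback to a general model for groups without a dedicated model (for instance, where training data is missing) is an assumption, not something the project describes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

Image = Any  # placeholder for whatever image type the models accept


@dataclass
class SkinToneRouter:
    """Send each image to the classifier trained for its skin tone group.

    Hypothetical sketch: `group_of` stands in for the automatic ITA-based
    labeler, `models` maps group names to group-specific classifiers, and
    `fallback` (optional) handles groups that have no dedicated model.
    """
    group_of: Callable[[Image], str]
    models: Dict[str, Callable[[Image], str]]
    fallback: Optional[Callable[[Image], str]] = None

    def predict(self, image: Image) -> str:
        group = self.group_of(image)
        model = self.models.get(group, self.fallback)
        if model is None:
            raise KeyError(f"no model for skin tone group {group!r}")
        return model(image)


if __name__ == "__main__":
    # Toy stand-ins: each "image" is a dict carrying a precomputed group label,
    # and each "model" returns a fixed prediction.
    router = SkinToneRouter(
        group_of=lambda img: img["group"],
        models={"light": lambda img: "benign", "dark": lambda img: "malignant"},
        fallback=lambda img: "benign",
    )
    print(router.predict({"group": "dark"}))    # malignant
    print(router.predict({"group": "medium"}))  # benign (fallback)
```

Keeping the labeler and the classifiers as plain callables means the ITA step and each CNN can be trained, swapped or evaluated independently.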