CSE THESIS PRE-DEFENSE

An Explainable Token Stacking
Meta Learner with Multi-head Attention
for Multi-Class Skin Cancer
Classification

Presented By

Mohammed Moin Uddin · C221029

Mir Md. Ejajul Hoque Eju · C220136

Md. Minhajul Islam Rahat · C220138

Supervised By

Mr. Mohammad Mahadi Hassan

Associate Professor, CSE

International Islamic University Chittagong

01 · The Problem

Why does this matter?

The world still faces a difficult medical imaging problem: skin lesions often look similar, but the diagnosis behind them can be very different.

Dermoscopic images can hide important cues behind hair, shadows, and texture noise. At the same time, class imbalance makes some lesion types harder to recognize than others, which raises the risk of misclassification.

Core Challenge

Learning the right visual cues is hard when lesions overlap in appearance, data is unevenly distributed, and the model must remain reliable in a clinical setting.

01

Visual similarity

Many lesion types share overlapping color, shape, and texture patterns.

02

Uneven data

Rare lesions are easy to overlook when the training set is not balanced.

[Figure: Representative lesion samples (melanoma, actinic keratosis, squamous cell carcinoma)]

Even different lesion types can appear deceptively close, so the model has to learn subtle discriminative cues.

02 · Classes

The crucial cancer classes

These lesion categories are clinically important because many of them overlap in appearance, yet their risk profiles and treatment needs are very different.

Melanoma

Irregular borders, asymmetry, mixed colors, and fast change.

Basal Cell Carcinoma

Pearly or transparent bump with visible vessels and slow growth.

Squamous Cell Carcinoma

Scaly red patch, dome-like growth, or a non-healing sore.

Pigmented Benign Keratosis

Brown or black waxy lesion with a rough, stuck-on look.

Seborrheic Keratosis

Flat or raised pasted-on lesion with a lighter, scaly surface.

Dermatofibroma

Small firm nodule that often dimples when pinched.

Actinic Keratosis

Rough precancerous patch caused by long-term sun damage.

Vascular Lesions

Red or purple blood-vessel growth with a distinct color pattern.

Normal Nevus

Symmetrical mole with uniform color and stable appearance.

03 · Literature Review

What the literature tells us

We reviewed recent skin cancer classification studies across five thematic streams. Each stream reveals both progress and persistent gaps.

01

CNN & Transfer Learning

Xception, VGG16, DenseNet169 achieve 84–91% accuracy on ISIC via ImageNet pre-training. Deeper networks outperform older architectures for representation learning.

High accuracy · class-level gaps remain
02

Multiclass Classification

XceptionNet and ResNet50 report 78–82% on multiclass tasks — well below binary results. Harder decision boundaries and severe class imbalance are to blame.

Accuracy drops sharply beyond 2 classes
03

Ensemble Learning

Multi-model ensembles range from 78% (VGG16) to 97% (ResNet101 + Multi-head Attention). Feature diversity across architectures improves stability and reduces per-class variance.

Wide range · model selection is critical
04

Transformers & Hybrids

ViT + GNN hybrid reaches 95%; self-supervised transformers hit 96.48%. But both demand large datasets and heavy compute — not feasible for modest medical collections.

High ceiling · data-hungry
05

Explainable AI (XAI)

Grad-CAM baselines show 73–77% accuracy — interpretability tools alone don't lift performance. Yet clinical adoption requires both accuracy and visual trust.

Interpretable · but not always accurate
04 · Literature Review

What others have built

Approach | Reported Accuracy | Critical Limitation
Xception (binary) | 90.15% | Limited to 2-class problems
Xception (multi-class) | 90.61% | Drops sharply in 9-class setting
Multi-model ensemble | 78%–97% | Wide variance, unreliable
ViT + GNN hybrid | 95% | Requires huge datasets
CNN + PSO optimization | 98.5% / 86.1% | Poor cross-dataset generalization
Pure Transformer | 96.48% | Heavy compute, data-hungry
Grad-CAM XAI baseline | 73%–77% | Interpretable but inaccurate

Most works trade accuracy for interpretability — or vice versa. Few do both.

05 · Research Gap

The five gaps we identified

01

Single-model instability

One model cannot excel across all 9 lesion classes simultaneously.

02

Class imbalance bias

Minority classes (like melanoma) get misclassified — dangerous in medicine.

03

Stability + interpretability rarely coexist

Most works do one or the other. We do both.

04

Inconsistent Grad-CAM heatmaps

Visualizations are sometimes meaningless or off-target.

05

Generalization fails across datasets

Models tuned for ISIC don't transfer well to HAM10000.

06 · Questions & Objectives

What we set out to answer

Research Questions

  • Can ensemble learning beat individual CNNs on 9-class skin lesion classification?
  • Can multi-head attention over base model predictions ("tokens") improve over traditional stacking?
  • Can data augmentation effectively address class imbalance?
  • Does a CNN+ViT hybrid outperform CNN ensembles on moderate-sized datasets?
  • Can Grad-CAM provide trustworthy explanations for clinical use?

Research Objectives

  • Implement & compare 7 pre-trained CNNs via transfer learning
  • Address class imbalance with augmentation (2,500/class)
  • Build progressive ensembles: Soft Voting → Stacking → Token Stacking + Attention
  • Compare against CNN+ViT hybrid model
  • Integrate Grad-CAM for visual interpretability
  • Evaluate with 6 metrics including Cohen's Kappa
07 · Methodology

System Architecture

Skin Cancer ISIC Dataset
→ Dataset Splitting: Train 80% (20% of it held out for Validation) · Test 20%
→ Dataset Preprocessing: Height = 75 px · Width = 75 px · Channels = 3
→ Data Augmentation: rescale=1./255, zoom_range=0.2, rotation_range=20, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True, fill_mode='nearest'
→ Class Balancing
→ State-of-the-art CNN Models: MobileNetV2 · DenseNet201 · VGG16 · VGG19 · ResNet50 · Xception · InceptionV3
→ Ensemble/Hybrid model
→ Select the best model
→ Classify Skin Cancer
08 · Dataset

The ISIC dataset · before & after

Before Augmentation

2,357 images · severely imbalanced

Vascular lesion · 142
Actinic keratosis · 130
Nevus · 373
Pigmented benign keratosis · 478
Melanoma · 454
Squamous cell carcinoma · 184
Basal cell carcinoma · 392
Seborrheic keratosis · 93 ⚠
Dermatofibroma · 111

After Augmentation

22,500 images · perfectly balanced

All 9 classes · 2,500 each
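
The balancing arithmetic can be sketched roughly as follows. This assumes the 2,500 images per class include the originals, so augmentation only fills the shortfall; the slides do not state this, and the variable names are illustrative.

```python
# Original per-class counts from the "Before Augmentation" list.
counts = {
    "Vascular lesion": 142,
    "Actinic keratosis": 130,
    "Nevus": 373,
    "Pigmented benign keratosis": 478,
    "Melanoma": 454,
    "Squamous cell carcinoma": 184,
    "Basal cell carcinoma": 392,
    "Seborrheic keratosis": 93,
    "Dermatofibroma": 111,
}
TARGET_PER_CLASS = 2500  # balancing target stated in the objectives

# Assumption: originals are kept and augmentation only fills the shortfall.
shortfall = {name: TARGET_PER_CLASS - n for name, n in counts.items()}

total_before = sum(counts.values())
total_after = TARGET_PER_CLASS * len(counts)
imbalance_ratio = max(counts.values()) / min(counts.values())

print(total_before, total_after)
print(round(imbalance_ratio, 2))
for name, extra in sorted(shortfall.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{extra}")
```

Under that assumption, the rarest class (seborrheic keratosis) needs the most synthetic images, which is worth keeping in mind when reading its per-class results.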

Final split

Train · 14,400 (64%)
Val · 3,600 (16%)
Test · 4,500 (20%)
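
One way to arrive at these numbers is a two-stage split: hold out 20% for testing, then hold out 20% of the remainder for validation. The two-stage procedure is an assumption inferred from the percentages, not something the slides spell out.

```python
TOTAL = 22_500
TEST_FRACTION = 0.20  # fraction of the full dataset
VAL_FRACTION = 0.20   # fraction of the remaining train portion (assumed)

n_test = round(TOTAL * TEST_FRACTION)
n_trainval = TOTAL - n_test
n_val = round(n_trainval * VAL_FRACTION)
n_train = n_trainval - n_val

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} ({n / TOTAL:.0%})")
```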

Techniques: rotation · zoom · shift · shear · horizontal flip

09 · Training & Algorithm

How we trained it

Training Configuration

Epochs · 100
Batch size · 32
Optimizer · SGD
Learning rate · 0.001
Momentum · 0.9
Loss · Categorical Cross-Entropy
LR scheduler · ReduceLROnPlateau
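
To make the optimizer settings concrete, here is a minimal plain-Python sketch of an SGD-with-momentum update and a reduce-on-plateau rule. It is a simplified re-implementation for illustration, not the framework callback itself; `factor`, `patience`, and `min_lr` are hypothetical values, since the slides only give the initial learning rate and momentum.

```python
class PlateauScheduler:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        # factor, patience and min_lr are hypothetical; the slides do not give them.
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One update with classical momentum: v = m*v - lr*g, then w = w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


scheduler = PlateauScheduler(lr=0.001)
val_losses = [0.90, 0.70, 0.71, 0.72, 0.73, 0.60]  # made-up validation losses
lrs = [scheduler.step(loss) for loss in val_losses]
print(lrs)

w, v = sgd_momentum_step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0], lr=0.001)
print(w, v)
```

In this toy run the learning rate is halved after three epochs without improvement and stays at the lower value once the loss improves again.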

Inference Algorithm

// Token Stacking Meta Learner
INPUT:  image x, models M₁, M₂
OUTPUT: predicted class ĉ

1. p₁ ← M₁(x)              // (9,)
2. p₂ ← M₂(x)              // (9,)
3. T  ← Stack(p₁, p₂)      // (2, 9)
4. T' ← MHAttention(T,T,T)
5. T'' ← LayerNorm(T')
6. f  ← Flatten(T'')
7. h  ← Dense(32, ReLU)(f)
8. y  ← Softmax(Dense(9))(h)
9. ĉ  ← argmax(y)

RETURN ĉ
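
The nine steps above can be traced in plain Python as a shape-checking sketch. The weights are random placeholders rather than trained parameters, `D_HEAD` (the per-head projection size) is a hypothetical value, the two probability vectors stand in for the outputs of M₁ and M₂, and the layer normalization omits learned scale and shift.

```python
import math
import random

random.seed(0)
K, M, HEADS, D_HEAD, HIDDEN = 9, 2, 2, 4, 32  # D_HEAD is a hypothetical per-head size


def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def attention_head(tokens, wq, wk, wv):
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T, shape (M, M)
    weights = [softmax([s / math.sqrt(D_HEAD) for s in row]) for row in scores]
    return matmul(weights, v)  # shape (M, D_HEAD)


def layer_norm(row, eps=1e-6):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def dense(x, w, b, activation=None):
    out = [sum(xi * wi for xi, wi in zip(x, col)) + bj for col, bj in zip(zip(*w), b)]
    return activation(out) if activation else out


def relu(z):
    return [max(0.0, v) for v in z]


# Untrained placeholder weights.
head_weights = [tuple(rand_matrix(K, D_HEAD) for _ in range(3)) for _ in range(HEADS)]
w_out = rand_matrix(HEADS * D_HEAD, K)
w_hidden, b_hidden = rand_matrix(M * K, HIDDEN), [0.0] * HIDDEN
w_final, b_final = rand_matrix(HIDDEN, K), [0.0] * K


def predict(p1, p2):
    tokens = [p1, p2]                                             # step 3: (2, 9)
    heads = [attention_head(tokens, *w) for w in head_weights]
    concat = [sum((h[i] for h in heads), []) for i in range(M)]   # (2, HEADS * D_HEAD)
    attended = matmul(concat, w_out)                              # step 4: (2, 9)
    normed = [layer_norm(row) for row in attended]                # step 5
    flat = [x for row in normed for x in row]                     # step 6: (18,)
    hidden = dense(flat, w_hidden, b_hidden, relu)                # step 7: (32,)
    y = dense(hidden, w_final, b_final, softmax)                  # step 8: (9,)
    return max(range(K), key=y.__getitem__), y                    # step 9


p1 = softmax([random.random() for _ in range(K)])  # stand-in for M1(x)
p2 = softmax([random.random() for _ in range(K)])  # stand-in for M2(x)
c_hat, y = predict(p1, p2)
print(c_hat, round(sum(y), 6))
```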
10 · DenseNet201 Results

DenseNet201 — 91.91% Test Accuracy

[Figure: DenseNet201 training vs validation accuracy]

[Figure: DenseNet201 confusion matrix]

[Figure: DenseNet201 sample predictions]
11 · VGG16 Results

VGG16 — 90.84% Test Accuracy

[Figure: VGG16 training vs validation accuracy]

[Figure: VGG16 confusion matrix]

[Figure: VGG16 sample predictions]
12 · VGG19 Results

VGG19 — 90.44% Test Accuracy

[Figure: VGG19 training vs validation accuracy]

[Figure: VGG19 confusion matrix]

[Figure: VGG19 sample predictions]
13 · Results · Part 1

Individual CNN performance

Model | Accuracy | Precision | Recall | F1-Score | Kappa
DenseNet201 ⭐ | 91.91% | 91.90% | 91.87% | 91.84% | 0.9090
VGG16 ⭐ | 90.84% | 90.83% | 90.83% | 90.76% | 0.8970
VGG19 | 90.44% | 90.45% | 90.52% | 90.42% | 0.8925
Xception | 88.60% | 88.71% | 88.53% | 88.41% | 0.8717
InceptionV3 | 88.33% | 88.51% | 88.43% | 88.43% | 0.8687
ResNet50 | 87.87% | 87.77% | 87.82% | 87.67% | 0.8635
MobileNetV2 | 86.71% | 86.80% | 86.67% | 86.41% | 0.8505

⭐ DenseNet201 and VGG16 selected as base learners. Their architectural diversity — dense connectivity vs uniform convolutions — maximizes feature complementarity.
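
For readers less familiar with the Kappa column, the sketch below computes Cohen's Kappa from a confusion matrix. It also shows a shortcut that holds when every true class has the same number of samples, in which case chance agreement is exactly 1/9 for nine classes. The toy matrix is made up, and a perfectly balanced test set is an assumption.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows = true, cols = predicted)."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


def kappa_balanced(accuracy, num_classes=9):
    """Shortcut when every true class has the same number of samples."""
    chance = 1 / num_classes
    return (accuracy - chance) / (1 - chance)


toy = [[45, 5, 0], [4, 44, 2], [1, 3, 46]]  # made-up 3-class matrix
print(round(cohens_kappa(toy), 4))
print(round(kappa_balanced(0.9191), 4))  # DenseNet201 accuracy from the table
print(round(kappa_balanced(0.9084), 4))  # VGG16 accuracy from the table
```

Under the balanced-test-set assumption, the shortcut agrees with the table's Kappa values for DenseNet201 and VGG16 to within rounding.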

14 · Ensemble · Soft Voting

Soft Voting Ensemble

Base Learners: DenseNet201 — 91.91% + VGG16 — 90.84%

How It Works

Each base model outputs a probability vector over 9 classes. Soft voting averages these vectors to produce the final prediction — no learning required.

$p_{\text{final}} = (p_1 + p_2) / 2$
$\hat{c} = \arg\max_k \, p_{\text{final}}[k]$

Drawback: both models receive equal weight regardless of per-class confidence.
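
A minimal sketch of the averaging rule, using made-up probability vectors rather than real model outputs:

```python
def soft_vote(prob_vectors):
    """Average class-probability vectors; return (class index, averaged vector)."""
    n_models = len(prob_vectors)
    averaged = [sum(col) / n_models for col in zip(*prob_vectors)]
    return max(range(len(averaged)), key=averaged.__getitem__), averaged


# Made-up 9-class outputs for one image; not real model predictions.
p1 = [0.05, 0.05, 0.40, 0.05, 0.30, 0.05, 0.04, 0.03, 0.03]
p2 = [0.02, 0.03, 0.25, 0.05, 0.55, 0.04, 0.02, 0.02, 0.02]
c_hat, p_final = soft_vote([p1, p2])
print(c_hat, [round(p, 3) for p in p_final])
```

In this toy case the first model alone would pick class index 2, but the average favours index 4 because the second model is more confident there. That is the only sense in which confidence enters; the weights themselves stay fixed at one half each.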

93.73% Test Accuracy
Micro AUC · 0.996
Macro AUC · 0.994

Per-Class Recall

Vascular lesion · 1.00
Actinic keratosis · 0.94
Nevus · 0.81
Pigmented BK · 0.97
Melanoma · 0.81
Squamous CC · 0.99
Basal CC · 0.98
Seborrheic K. · 0.94
Dermatofibroma · 1.00
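
As a quick consistency check, the mean of the per-class recalls can be compared with the reported accuracy. On a test set with equal samples per class the two coincide, apart from the two-decimal rounding of the recalls. Equal class sizes in the test split are an assumption here.

```python
# Per-class recall of the soft voting ensemble, as listed above.
recall = {
    "Vascular lesion": 1.00,
    "Actinic keratosis": 0.94,
    "Nevus": 0.81,
    "Pigmented BK": 0.97,
    "Melanoma": 0.81,
    "Squamous CC": 0.99,
    "Basal CC": 0.98,
    "Seborrheic K.": 0.94,
    "Dermatofibroma": 1.00,
}

macro_recall = sum(recall.values()) / len(recall)
weakest = sorted(recall, key=recall.get)[:2]  # two lowest-recall classes

print(round(macro_recall, 4), weakest)
```

The mean lands close to the reported 93.73%, and the two weakest classes are nevus and melanoma, the pair most often confused clinically.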
15 · Ensemble · Meta Learner

Stacked Meta Learner

Base Learners: DenseNet201 — 91.91% + VGG16 — 90.84%

How It Works

Base model probability outputs are concatenated into meta-features. A small neural network (meta-learner) is then trained on these meta-features to learn the optimal combination rule.

$X_{\text{meta}} = [p_1 \,\|\, p_2] \in \mathbb{R}^{18}$
Meta-Learner: Dense(64) → Dense(9, softmax)

Improvement: learns a non-linear combination — but treats all class positions equally.
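
A forward pass through a meta-learner of this shape might look like the sketch below. The weights are untrained placeholders, the input vectors are made up, and the ReLU on the hidden layer is an assumption, since the slide does not name the activation of Dense(64).

```python
import math
import random

random.seed(1)
K, HIDDEN = 9, 64  # 9 classes; Dense(64) hidden layer as on the slide


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """Fully connected layer: weights has shape (len(x), units)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) + b
            for col, b in zip(zip(*weights), bias)]


# Untrained placeholder parameters, for shape checking only.
w1 = [[random.gauss(0, 0.1) for _ in range(HIDDEN)] for _ in range(2 * K)]
b1 = [0.0] * HIDDEN
w2 = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(HIDDEN)]
b2 = [0.0] * K


def meta_predict(p1, p2):
    x_meta = list(p1) + list(p2)                           # concatenation: 18 meta-features
    hidden = [max(0.0, v) for v in dense(x_meta, w1, b1)]  # ReLU is an assumption
    return softmax(dense(hidden, w2, b2))                  # 9 class probabilities


p1 = [0.05, 0.05, 0.40, 0.05, 0.30, 0.05, 0.04, 0.03, 0.03]  # made-up
p2 = [0.02, 0.03, 0.25, 0.05, 0.55, 0.04, 0.02, 0.02, 0.02]  # made-up
y = meta_predict(p1, p2)
print(len(y), round(sum(y), 6))
```

Because the two vectors are flattened into one 18-value input, the network has no built-in notion of which value came from which model. The token formulation is meant to address exactly that.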

93.91% Test Accuracy (+0.18 percentage points vs Soft Voting)
Micro AUC · 0.998
Macro AUC · 0.995

Per-Class Recall

Vascular lesion · 1.00
Actinic keratosis · 0.98
Nevus · 0.75
Pigmented BK · 0.98
Melanoma · 0.79
Squamous CC · 0.98
Basal CC · 0.99
Seborrheic K. · 0.99
Dermatofibroma · 1.00
16 · Proposed Model

Token Stacking Meta Learner — Architecture

Input Images

DenseNet201 branch: Input Image → Dense Block 1 → Dense Block 2 → Dense Block 3 → Dense Block 4 → Global Avg Pool → Prediction Vector (Softmax output) · Shape: (9)

VGG16 branch: Input Image → Conv 64 → Conv 128 → Conv 256 → Conv 512 → FC Layer → Prediction Vector (Softmax output) · Shape: (9)

Stacking Layer (Token Formation) · Shape: (2, 9)
→ Multi-head Attention Layer · Heads = 2 · Learns model importance weights
→ Layer Normalization
→ Flatten Layer
→ Dense Layer (32, ReLU)
→ Output Layer (Softmax, 9 classes)
17 · Ensemble · Token Stacking

Token Stacking + Multi-head Attention — Results

Base Learners: DenseNet201 (91.91%) + VGG16 (90.84%)
Best Model ★ 94.16%

[Figure: Token Stacking training vs validation accuracy]

[Figure: Token Stacking normalized confusion matrix]

[Figure: Token Stacking ROC curves (Micro AUC 0.997)]
18 · Results · Part 2

Ensemble showdown

Token Stacking + Multi-head Attention (Ours) · 94.16%
Stacked Meta Learner · 93.91%
Soft Voting Ensemble · 93.73%
DenseNet201 (best individual) · 91.91%
CNN + ViT Hybrid · 86.29%

Proposed Model Highlights

94.16% · Test Accuracy
94.64% · Precision
0.997 · Micro AUC
0.995 · Macro AUC
19 · Explainable AI

Can the model explain itself?

We applied Grad-CAM to verify that our trained model focuses on clinically relevant skin lesion features — not background noise or artefacts.

Grad-CAM

Gradient-weighted Class Activation Maps — heatmap overlays on test images

[Figure: Grad-CAM heatmaps on skin lesion images]

Warmer regions indicate higher gradient activation. The model consistently highlights the lesion boundary and pigmentation texture — not surrounding healthy skin.

Model attends to the lesion region, not surrounding skin
Activations align with clinical diagnostic cues (ABCDE criteria)
Builds clinician trust — decisions are traceable, not a black box
20 · Critical Analysis

Why the ViT hybrid underperformed

86.29%

CNN+ViT hybrid accuracy — worst among all models tested

01

Data Hunger

Vision Transformers typically need far larger datasets, often millions of images, to learn global feature relationships. Our 22,500 augmented images appear sufficient for CNNs but not for ViTs.

02

Lack of Inductive Bias

CNNs have built-in priors (locality, translation equivariance) that suit medical images. ViTs must learn these from data.

03

Increased Complexity

Hybrid models add parameters without adding generalization. Subtle inter-class differences (melanoma vs benign keratosis) require more, not less, regularization.

Takeaway: For moderate-sized medical datasets, smarter combinations of CNNs beat bigger transformer-based architectures.

21 · Mathematical Foundation

The core equations

Multi-head Attention (the heart)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$

Computes how much each token influences others.
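
The formula can be illustrated with a small plain-Python sketch. For brevity it uses Q = K = V with no learned projections and two made-up three-value tokens, so it shows the mechanism only, not the trained layer.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """q, k, v are lists of token vectors; returns (outputs, attention weights)."""
    d_k = len(k[0])
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        weights.append(softmax(scores))
    outputs = [[sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(len(v[0]))]
               for w_row in weights]
    return outputs, weights


tokens = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # made-up probability-like tokens
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and each output token is a weighted mix of the value tokens. Here every token attends most strongly to itself because its dot product with itself is largest.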

Multi-Head Attention
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

Multiple attention views in parallel.

Token Stacking
$T \in \mathbb{R}^{m \times K}$ · m = 2 models, K = 9 classes

Probability vectors stacked as separate tokens.

Softmax Output
$\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

Final probability over 9 classes.
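
In code, softmax is usually computed after subtracting the largest logit. This leaves the result unchanged mathematically but avoids overflow in the exponential. A minimal sketch with made-up logits:

```python
import math


def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([0.0, 1.0, 2.0])
large = softmax([1000.0, 1001.0, 1002.0])  # math.exp(1000.0) alone would overflow
print([round(p, 4) for p in small])
print([round(p, 4) for p in large])
```

Both calls give the same probabilities, since softmax depends only on the differences between logits.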

Categorical Cross-Entropy Loss
$L = -\sum_i y_i \log(\hat{y}_i)$

Optimization target during training.
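
With a one-hot label the sum collapses to the negative log of the probability assigned to the true class. A small sketch with made-up values; the `eps` clip is a common safeguard against log(0), added here as an assumption:

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i), with predictions clipped away from zero."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 1, 0]            # one-hot label, true class is index 1
confident = [0.1, 0.7, 0.2]   # made-up prediction, fairly confident and correct
hesitant = [0.4, 0.3, 0.3]    # made-up prediction, true class under-weighted

loss_confident = categorical_cross_entropy(y_true, confident)
loss_hesitant = categorical_cross_entropy(y_true, hesitant)
print(round(loss_confident, 4), round(loss_hesitant, 4))
```

The loss shrinks as the predicted probability of the true class approaches 1.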

Grad-CAM
$L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_k \alpha_k^{c} A^{k}\right)$

Heatmap weighted by gradient importance.
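
In the standard Grad-CAM formulation, each weight α is the spatial average of the gradient of the class score with respect to feature map k. The sketch below applies that to two tiny made-up feature maps. In practice the activations and gradients come from the last convolutional layer of the trained network, which is not shown here.

```python
def grad_cam(activations, gradients):
    """activations, gradients: lists of feature maps, each a 2-D list (H x W)."""
    height, width = len(activations[0]), len(activations[0][0])
    # alpha_k: global-average-pool the gradient of the class score w.r.t. map k.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * fmap[i][j]
    # ReLU keeps only regions that push the class score up.
    return [[max(0.0, v) for v in row] for row in heatmap]


# Made-up 2x2 feature maps and gradients for two channels.
acts = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

The second channel has a negative weight, so the locations it activates are suppressed by the ReLU and only the first channel's regions remain in the heatmap.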

22 · Contribution & Impact

What we contribute

🔬 Technical

  • Comprehensive evaluation of 7 CNNs + 3 ensembles + 1 hybrid
  • Novel Token Stacking with Multi-head Attention framework
  • State-of-the-art 94.16% on 9-class ISIC classification
  • Architecture-agnostic methodology — extends to any medical task

🏥 Clinical

  • Aids early melanoma detection — saves lives
  • Provides AI second-opinion to reduce diagnostic errors
  • Democratizes access in rural / resource-limited settings
  • Grad-CAM enables clinician verification → builds trust

📚 Academic

  • Demonstrates attention-based stacking > traditional stacking
  • Empirically validates CNN superiority on moderate datasets
  • Reusable framework for future medical imaging research
  • Honest documentation of limitations & failure modes
23 · Honesty

Limitations & where we go next

Acknowledged Limitations

  • Single dataset. Generalization to HAM10000 not yet tested.
  • Higher compute cost than individual CNNs.
  • Grad-CAM localization isn't perfect — sometimes off-target.
  • No clinical validation conducted yet.
  • Closed-set classification — can't detect new lesion types.

Future Directions

  • Cross-dataset validation on HAM10000
  • Add third base learner with different inductive biases
  • Test on EfficientNet, ConvNeXt, Swin Transformer
  • Lightweight ensemble for real-time mobile deployment
  • Clinical validation with partner hospitals
  • Out-of-distribution detection for open-set recognition
END · Q&A

Thank you.

"For moderate-sized medical datasets, the future isn't bigger models — it's smarter combinations of existing ones."

94.16% · Test Accuracy
9 · Classes
22,500 · Images
11 · Models Evaluated

Mohammed Moin Uddin · Mir Md. Ejajul Hoque Eju · Md. Minhajul Islam Rahat

Supervised by Mr. Mohammad Mahadi Hassan · CSE · IIUC

We welcome your questions.