CSE THESIS PRE-DEFENSE

An Explainable Token Stacking
Meta Learner with Multi-head Attention
for Multi-Class Skin Cancer
Classification

Presented By

Mohammed Moin Uddin · C221029

Mir Md. Ejajul Hoque Eju · C220136

Md. Minhajul Islam Rahat · C220138

Supervised By

Mr. Mohammad Mahadi Hassan

Associate Professor, CSE

International Islamic University Chittagong

01 · The Problem

Why does this matter?

The world still faces a difficult medical imaging problem: skin lesions often look similar, but the diagnosis behind them can be very different.

Dermoscopic images can hide important cues behind hair, shadows, and texture noise. At the same time, class imbalance makes some lesion types harder to recognize than others, which raises the risk of misclassification.

Core Challenge

Learning the right visual cues is hard when lesions overlap in appearance, data is unevenly distributed, and the model must remain reliable in a clinical setting.

01

Visual similarity

Many lesion types share overlapping color, shape, and texture patterns.

02

Uneven data

Rare lesions are easy to overlook when the training set is not balanced.

Representative lesion samples

Even different lesion types can appear deceptively close, so the model has to learn subtle discriminative cues.

02 · Classes

The crucial cancer classes

These lesion categories are clinically important because many of them overlap in appearance, yet their risk profiles and treatment needs are very different.

Melanoma

Irregular borders, asymmetry, mixed colors, and fast change.

Basal Cell Carcinoma

Pearly or transparent bump with visible vessels and slow growth.

Squamous Cell Carcinoma

Scaly red patch, dome-like growth, or a non-healing sore.

Pigmented Benign Keratosis

Brown or black waxy lesion with a rough, stuck-on look.

Seborrheic Keratosis

Flat or raised pasted-on lesion with a lighter, scaly surface.

Dermatofibroma

Small firm nodule that often dimples when pinched.

Actinic Keratosis

Rough precancerous patch caused by long-term sun damage.

Vascular Lesions

Red or purple blood-vessel growth with a distinct color pattern.

Normal Nevus

Symmetrical mole with uniform color and stable appearance.

03 · Literature Review

What the literature tells us

We reviewed recent skin cancer classification studies across five thematic streams. Each stream reveals both progress and persistent gaps.

01

CNN & Transfer Learning

Xception, VGG16, DenseNet169 achieve 84–91% accuracy on ISIC via ImageNet pre-training. Deeper networks outperform older architectures for representation learning.

High accuracy · class-level gaps remain

02

Multiclass Classification

XceptionNet and ResNet50 report 78–82% on multiclass tasks — well below binary results. Harder decision boundaries and severe class imbalance are to blame.

Accuracy drops sharply beyond 2 classes

03

Ensemble Learning

Multi-model ensembles range from 78% (VGG16) to 97% (ResNet101 + Multi-head Attention). Feature diversity across architectures improves stability and reduces per-class variance.

Wide range · model selection is critical

04

Transformers & Hybrids

ViT + GNN hybrid reaches 95%; self-supervised transformers hit 96.48%. But both demand large datasets and heavy compute — not feasible for modest medical collections.

High ceiling · data-hungry

05

Explainable AI (XAI)

Grad-CAM baselines show 73–77% accuracy — interpretability tools alone don't lift performance. Yet clinical adoption requires both accuracy and visual trust.

Interpretable · but not always accurate

04 · Literature Review

What others have built

Approach	Reported Accuracy	Critical Limitation
Xception (binary)	90.15%	Limited to 2-class problems
Xception (multi-class)	90.61%	Drops sharply in 9-class setting
Multi-model ensemble	78%–97%	Wide variance, unreliable
ViT + GNN hybrid	95%	Requires huge datasets
CNN + PSO optimization	98.5% / 86.1%	Poor cross-dataset generalization
Pure Transformer	96.48%	Heavy compute, data-hungry
Grad-CAM XAI baseline	73%–77%	Interpretable but inaccurate

Most works trade accuracy for interpretability — or vice versa. Few do both.

05 · Research Gap

The five gaps we identified

01

Single-model instability

One model cannot excel across all 9 lesion classes simultaneously.

02

Class imbalance bias

Minority classes (like melanoma) get misclassified — dangerous in medicine.

03

Stability + interpretability rarely coexist

Most works do one or the other. We do both.

04

Inconsistent Grad-CAM heatmaps

Visualizations are sometimes meaningless or off-target.

05

Generalization fails across datasets

Models tuned for ISIC don't transfer well to HAM10000.

06 · Questions & Objectives

What we set out to answer

Research Questions

Can ensemble learning beat individual CNNs on 9-class skin lesion classification?
Can multi-head attention over base model predictions ("tokens") improve over traditional stacking?
Can data augmentation effectively address class imbalance?
Does a CNN+ViT hybrid outperform CNN ensembles on moderate-sized datasets?
Can Grad-CAM provide trustworthy explanations for clinical use?

Research Objectives

Implement & compare 7 pre-trained CNNs via transfer learning
Address class imbalance with augmentation (2,500/class)
Build progressive ensembles: Soft Voting → Stacking → Token Stacking + Attention
Compare against CNN+ViT hybrid model
Integrate Grad-CAM for visual interpretability
Evaluate with 6 metrics including Cohen's Kappa

07 · Methodology

System Architecture

Skin Cancer ISIC Dataset

↓

Dataset Splitting

Train 80% (20% of this portion is held out for validation) · Test 20%

↓

Dataset Preprocessing

Height = 75 px · Width = 75 px · Channels = 3

Data Augmentation

rescale=1./255, zoom_range=0.2, rotation_range=20, width_shift_range=0.1, height_shift_range=0.1, horizontal_flip=True, fill_mode='nearest'

Class Balancing

↓

State-of-the-art CNN Models

MobileNetV2 · DenseNet201 · VGG16 · VGG19 · ResNet50 · Xception · InceptionV3

↓

Ensemble / Hybrid Model

↓

Select the Best Model

↓

Classify Skin Cancer

08 · Dataset

The ISIC dataset · before & after

Before Augmentation

2,357 images · severely imbalanced

Vascular lesion	142
Actinic keratosis	130
Nevus	373
Pigmented benign keratosis	478
Melanoma	454
Squamous cell carcinoma	184
Basal cell carcinoma	392
Seborrheic keratosis	93 ⚠
Dermatofibroma	111

⟶

After Augmentation

22,500 images · perfectly balanced

All 9 classes	2,500 each

Final split

Train · 14,400 (64%)

Val · 3,600 (16%)

Test · 4,500 (20%)

Techniques: rotation · zoom · shift · shear · horizontal flip
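The counts on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes every class is topped up to exactly 2,500 images and that the final split comes from two successive 80/20 splits, which matches the reported 64% / 16% / 20% figures but is not stated explicitly on the slides.

```python
# Per-class image counts copied from the "Before Augmentation" table.
original_counts = {
    "Vascular lesion": 142,
    "Actinic keratosis": 130,
    "Nevus": 373,
    "Pigmented benign keratosis": 478,
    "Melanoma": 454,
    "Squamous cell carcinoma": 184,
    "Basal cell carcinoma": 392,
    "Seborrheic keratosis": 93,
    "Dermatofibroma": 111,
}
TARGET_PER_CLASS = 2500  # target stated on the slide

# Number of extra (augmented) images each class needs to reach the target.
to_generate = {name: TARGET_PER_CLASS - n for name, n in original_counts.items()}

total_before = sum(original_counts.values())
total_after = TARGET_PER_CLASS * len(original_counts)

# Assumption: hold out 20% for testing, then 20% of the remainder for validation.
test_size = total_after * 20 // 100
train_val_size = total_after - test_size
val_size = train_val_size * 20 // 100
train_size = train_val_size - val_size

print("before:", total_before, "after:", total_after)
print("train / val / test:", train_size, val_size, test_size)
print("largest deficit:", max(to_generate, key=to_generate.get))
```

Under these assumptions the rarest class, seborrheic keratosis, needs the most synthetic images, so its augmented copies dominate that class far more than for the larger classes.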

09 · Training & Algorithm

How we trained it

Training Configuration

Epochs	100
Batch size	32
Optimizer	SGD
Learning rate	0.001
Momentum	0.9
Loss	Categorical Cross-Entropy
LR scheduler	ReduceLROnPlateau
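To make the scheduler row concrete, here is a minimal pure-Python sketch of the reduce-on-plateau idea: the learning rate is cut when validation loss stops improving. The `PlateauScheduler` class is hypothetical, and the `factor`, `patience`, and `min_lr` values are assumptions for illustration; the slides give only the initial learning rate of 0.001.

```python
class PlateauScheduler:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # assumed value, not from the slides
        self.patience = patience  # assumed value, not from the slides
        self.min_lr = min_lr      # assumed value, not from the slides
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


scheduler = PlateauScheduler(lr=0.001)  # initial rate from the configuration table
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.64]  # made-up loss curve
history = [scheduler.step(loss) for loss in val_losses]
print(history)
```

With this made-up loss curve the rate is halved after three epochs without improvement and then stays at the reduced value.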

Inference Algorithm

// Token Stacking Meta Learner
INPUT:  image x, models M₁, M₂
OUTPUT: predicted class ĉ

1. p₁ ← M₁(x)              // (9,)
2. p₂ ← M₂(x)              // (9,)
3. T  ← Stack(p₁, p₂)      // (2, 9)
4. T' ← MHAttention(T,T,T)
5. T'' ← LayerNorm(T')
6. f  ← Flatten(T'')
7. h  ← Dense(32, ReLU)(f)
8. y  ← Softmax(Dense(9))(h)
9. ĉ  ← argmax(y)

RETURN ĉ
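The steps above can be traced in plain Python. This is a sketch, not the trained model: the weights are random stand-ins, `HEAD_DIM` is an assumed per-head size, biases and the learned LayerNorm scale and shift are omitted, and the two probability vectors are synthetic. Only the shapes and the order of operations follow the algorithm.

```python
import math
import random

NUM_CLASSES, NUM_HEADS, HEAD_DIM, HIDDEN = 9, 2, 4, 32  # HEAD_DIM is an assumption


def rand_matrix(rows, cols, rng):
    """Random stand-in for a trained weight matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def attention_head(tokens, wq, wk, wv):
    """Scaled dot-product self-attention for one head."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scores = matmul(q, [list(col) for col in zip(*k)])  # Q K^T, shape (2, 2)
    weights = [softmax([s / math.sqrt(HEAD_DIM) for s in row]) for row in scores]
    return matmul(weights, v)


def layer_norm(row, eps=1e-6):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


def token_stacking_predict(p1, p2, params):
    tokens = [p1, p2]                                       # step 3: (2, 9)
    heads = [attention_head(tokens, *w) for w in params["heads"]]
    concat = [sum((h[i] for h in heads), []) for i in range(len(tokens))]
    attended = matmul(concat, params["wo"])                 # step 4: (2, 9)
    normed = [layer_norm(row) for row in attended]          # step 5
    flat = [x for row in normed for x in row]               # step 6: (18,)
    hidden = [max(0.0, x) for x in matmul([flat], params["w1"])[0]]  # step 7
    y = softmax(matmul([hidden], params["w2"])[0])          # step 8
    return y.index(max(y)), y                               # step 9


rng = random.Random(0)
params = {
    "heads": [tuple(rand_matrix(NUM_CLASSES, HEAD_DIM, rng) for _ in range(3))
              for _ in range(NUM_HEADS)],
    "wo": rand_matrix(NUM_HEADS * HEAD_DIM, NUM_CLASSES, rng),
    "w1": rand_matrix(2 * NUM_CLASSES, HIDDEN, rng),
    "w2": rand_matrix(HIDDEN, NUM_CLASSES, rng),
}
p1 = softmax([rng.uniform(-1, 1) for _ in range(NUM_CLASSES)])  # stand-in for M1(x)
p2 = softmax([rng.uniform(-1, 1) for _ in range(NUM_CLASSES)])  # stand-in for M2(x)
pred, probs = token_stacking_predict(p1, p2, params)
print(pred, round(sum(probs), 6))
```

Because each base model contributes one token, the attention weights form a 2 × 2 matrix per head, so the attention step only has to learn how strongly each model's prediction should inform the other.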

10 · DenseNet201 Results

DenseNet201 — 91.91% Test Accuracy

Training vs Validation Accuracy

Confusion Matrix

Sample Predictions

11 · VGG16 Results

VGG16 — 90.84% Test Accuracy

Training vs Validation Accuracy

Confusion Matrix

Sample Predictions

12 · VGG19 Results

VGG19 — 90.44% Test Accuracy

Training vs Validation Accuracy

Confusion Matrix

Sample Predictions

13 · Results · Part 1

Individual CNN performance

Model	Accuracy	Precision	Recall	F1-Score	Kappa
DenseNet201 ⭐	91.91%	91.90%	91.87%	91.84%	0.9090
VGG16 ⭐	90.84%	90.83%	90.83%	90.76%	0.8970
VGG19	90.44%	90.45%	90.52%	90.42%	0.8925
Xception	88.60%	88.71%	88.53%	88.41%	0.8717
InceptionV3	88.33%	88.51%	88.43%	88.43%	0.8687
ResNet50	87.87%	87.77%	87.82%	87.67%	0.8635
MobileNetV2	86.71%	86.80%	86.67%	86.41%	0.8505

⭐ DenseNet201 and VGG16 selected as base learners. Their architectural diversity — dense connectivity vs uniform convolutions — maximizes feature complementarity.
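As a sanity check on the Kappa column, Cohen's Kappa can be computed from a confusion matrix. The sketch below uses a made-up 2 × 2 matrix. It also shows a shortcut that assumes the test set has the same number of images in each of the 9 classes (the slides do not say whether the split was stratified). Under that assumption chance agreement is 1/9 and Kappa depends only on accuracy.

```python
def cohen_kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (rows = true, columns = predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1 - expected)


def kappa_balanced(accuracy, num_classes=9):
    """Shortcut that holds only when every true class has the same support."""
    chance = 1 / num_classes
    return (accuracy - chance) / (1 - chance)


toy = [[45, 5], [10, 40]]              # made-up matrix for illustration
print(round(cohen_kappa(toy), 4))
print(round(kappa_balanced(0.9191), 4))  # DenseNet201 accuracy from the table
```

Under the balanced-test-set assumption, the shortcut gives a value for DenseNet201 that agrees with the table's 0.9090 to four decimal places.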

14 · Ensemble · Soft Voting

Soft Voting Ensemble

Base Learners: DenseNet201 — 91.91% + VGG16 — 90.84%

How It Works

Each base model outputs a probability vector over 9 classes. Soft voting averages these vectors to produce the final prediction — no learning required.

p_final = (p₁ + p₂) / 2

ĉ = argmax_k p_final[k]

Drawback: both models receive equal weight regardless of per-class confidence.
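A small worked example of the averaging rule, using two made-up probability vectors (the class indices are arbitrary and do not reflect the real label order):

```python
def soft_vote(p1, p2):
    """Average two probability vectors and return (predicted index, averaged vector)."""
    p_final = [(a + b) / 2 for a, b in zip(p1, p2)]
    return p_final.index(max(p_final)), p_final


# Hypothetical outputs: model 1 prefers class 4, model 2 prefers class 2.
p1 = [0.02, 0.03, 0.30, 0.05, 0.50, 0.02, 0.03, 0.03, 0.02]
p2 = [0.02, 0.03, 0.60, 0.05, 0.20, 0.02, 0.03, 0.03, 0.02]

pred, p_final = soft_vote(p1, p2)
print(pred, round(p_final[pred], 3))
```

Here the more confident model decides the outcome. Its weight in the average is still fixed at one half, which is the limitation noted above.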

93.73% Test Accuracy

Micro AUC 0.996

Macro AUC 0.994

Per-Class Recall

Class	Recall
Vascular lesion	1.00
Actinic keratosis	0.94
Nevus	0.81
Pigmented BK	0.97
Melanoma	0.81
Squamous CC	0.99
Basal CC	0.98
Seborrheic K.	0.94
Dermatofibroma	1.00

15 · Ensemble · Meta Learner

Stacked Meta Learner

Base Learners: DenseNet201 — 91.91% + VGG16 — 90.84%

How It Works

Base model probability outputs are concatenated into meta-features. A small neural network (meta-learner) is then trained on these meta-features to learn the optimal combination rule.

X_meta = [p₁ ‖ p₂] ∈ ℝ¹⁸

Meta-Learner: Dense(64) → Dense(9, softmax)

Improvement: learns a non-linear combination — but treats all class positions equally.
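A forward pass through such a meta-learner might look like the following sketch. The weights are random placeholders rather than trained values, and the ReLU on the 64-unit layer is an assumption, since the slide does not name its activation.

```python
import math
import random


def dense(x, weights, biases):
    """Fully connected layer: x (n,) times weights (n, m) plus biases (m,)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weights), biases)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def meta_learner_predict(p1, p2, w1, b1, w2, b2):
    x_meta = list(p1) + list(p2)                             # [p1 ‖ p2], length 18
    hidden = [max(0.0, h) for h in dense(x_meta, w1, b1)]    # Dense(64), ReLU assumed
    return softmax(dense(hidden, w2, b2))                    # Dense(9, softmax)


rng = random.Random(42)
w1 = [[rng.gauss(0, 0.1) for _ in range(64)] for _ in range(18)]  # placeholder weights
b1 = [0.0] * 64
w2 = [[rng.gauss(0, 0.1) for _ in range(9)] for _ in range(64)]   # placeholder weights
b2 = [0.0] * 9

p1 = [1 / 9] * 9            # made-up base-model output: fully uncertain
p2 = [0.6] + [0.05] * 8     # made-up base-model output: confident in class 0
y = meta_learner_predict(p1, p2, w1, b1, w2, b2)
print(len(y), round(sum(y), 6))
```

Concatenation gives the meta-learner one flat 18-value vector, so which model produced which probability is encoded only by position in that vector.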

93.91% Test Accuracy · +0.18 percentage points vs Soft Voting

Micro AUC 0.998

Macro AUC 0.995

Per-Class Recall

Class	Recall
Vascular lesion	1.00
Actinic keratosis	0.98
Nevus	0.75
Pigmented BK	0.98
Melanoma	0.79
Squamous CC	0.98
Basal CC	0.99
Seborrheic K.	0.99
Dermatofibroma	1.00

16 · Proposed Model

Token Stacking Meta Learner — Architecture

Input Images

↓

DenseNet201: Input Image → Dense Block 1 → Dense Block 2 → Dense Block 3 → Dense Block 4 → Global Avg Pool → Prediction Vector (Softmax output) · Shape: (9,)

VGG16: Input Image → Conv 64 → Conv 128 → Conv 256 → Conv 512 → FC Layer → Prediction Vector (Softmax output) · Shape: (9,)

↓

Stacking Layer (Token Formation) · Shape: (2, 9)

↓

Multi-head Attention Layer · Heads = 2 · Learns model importance weights

↓

Layer Normalization

↓

Flatten Layer

↓

Dense Layer (32, ReLU)

↓

Output Layer (Softmax, 9 classes)

17 · Ensemble · Token Stacking

Token Stacking + Multi-head Attention — Results

Base Learners: DenseNet201 (91.91%) + VGG16 (90.84%) · Best Model ★ 94.16%

Training vs Validation Accuracy

Normalized Confusion Matrix

ROC Curves (Micro AUC 0.997)

18 · Results · Part 2

Ensemble showdown

Model	Test Accuracy
Token Stacking + Multi-head Attention (Ours)	94.16%
Stacked Meta Learner	93.91%
Soft Voting Ensemble	93.73%
DenseNet201 (best individual)	91.91%
CNN + ViT Hybrid	86.29%

Proposed Model Highlights

94.16% Test Accuracy

94.64% Precision

0.997 Micro AUC

0.995 Macro AUC

19 · Explainable AI

Can the model explain itself?

We applied Grad-CAM to verify that our trained model focuses on clinically relevant skin lesion features — not background noise or artefacts.

Grad-CAM

Gradient-weighted Class Activation Maps — heatmap overlays on test images

Warmer regions indicate higher gradient activation. The model consistently highlights the lesion boundary and pigmentation texture — not surrounding healthy skin.

✓ Model attends to the lesion region, not surrounding skin

✓ Activations align with clinical diagnostic cues (ABCDE criteria)

✓ Builds clinician trust — decisions are traceable, not a black box

20 · Critical Analysis

Why the ViT hybrid underperformed

86.29%

CNN+ViT hybrid accuracy — worst among all models tested

01

Data Hunger

Vision Transformers typically need very large training sets (often millions of images) to learn global feature relationships. Our 22,500 augmented images are sufficient for CNNs but insufficient for ViTs.

02

Lack of Inductive Bias

CNNs have built-in priors (translation equivariance, locality) that suit medical images. ViTs must learn these from data.

03

Increased Complexity

Hybrid models add parameters without adding generalization. Subtle inter-class differences (melanoma vs benign keratosis) require more, not less, regularization.

Takeaway: For moderate-sized medical datasets, smarter combinations of CNNs beat bigger transformer-based architectures.

21 · Mathematical Foundation

The core equations

Multi-head Attention (the heart)

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Computes how much each token influences others.

Multi-Head Attention

MultiHead = Concat(head₁, ..., head_h) · W^O

Multiple attention views in parallel.

Token Stacking

T ∈ ℝ^{(m × K)} · m=2 models, K=9 classes

Probability vectors stacked as separate tokens.

Softmax Output

Softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

Final probability over 9 classes.

Categorical Cross-Entropy Loss

L = −Σ_i y_i log(ŷ_i)

Optimization target during training.
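A short numeric illustration of the softmax and cross-entropy formulas, using three made-up logits instead of nine to keep it readable:

```python
import math


def softmax(z):
    """Numerically stable softmax: subtracting the max does not change the result."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum_i y_i * log(y_hat_i), with eps guarding against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


logits = [2.0, 1.0, 0.1]   # made-up scores
probs = softmax(logits)
loss = categorical_cross_entropy([1, 0, 0], probs)  # true class is index 0
print([round(p, 3) for p in probs], round(loss, 3))
```

With a one-hot target the sum collapses to the negative log of the probability assigned to the true class, so the loss shrinks toward zero as that probability approaches 1.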

Grad-CAM

L^c_GradCAM = ReLU(Σ_k α_k^c · A^k)

Heatmap weighted by gradient importance.
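In the standard Grad-CAM formulation, each weight α_k^c is the spatial average of the gradient of the class score with respect to feature map A^k. The toy sketch below applies the formula to two made-up 2 × 2 feature maps. In practice the activations and gradients come from the last convolutional layer of the network, which is not modelled here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients (each H x W)."""
    heatmap = None
    for a_k, g_k in zip(activations, gradients):
        cells = [g for row in g_k for g in row]
        alpha = sum(cells) / len(cells)           # global-average-pooled gradient
        weighted = [[alpha * v for v in row] for row in a_k]
        if heatmap is None:
            heatmap = weighted
        else:
            heatmap = [[h + w for h, w in zip(h_row, w_row)]
                       for h_row, w_row in zip(heatmap, weighted)]
    return [[max(0.0, v) for v in row] for row in heatmap]  # ReLU


# Made-up activations and gradients for two feature maps.
activations = [[[1.0, 0.0], [0.0, 2.0]],
               [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],      # map 1 supports the class
             [[-1.0, -1.0], [-1.0, -1.0]]]  # map 2 argues against it

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The ReLU keeps only regions that push the class score up, which is why locations dominated by the second map end up at zero in the heatmap.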

22 · Contribution & Impact

What we contribute

🔬 Technical

Comprehensive evaluation of 7 CNNs + 3 ensembles + 1 hybrid
Novel Token Stacking with Multi-head Attention framework
State-of-the-art 94.16% on 9-class ISIC classification
Architecture-agnostic methodology — extends to any medical task

🏥 Clinical

Aids early melanoma detection — saves lives
Provides AI second-opinion to reduce diagnostic errors
Democratizes access in rural / resource-limited settings
Grad-CAM enables clinician verification → builds trust

📚 Academic

Demonstrates attention-based stacking > traditional stacking
Empirically validates CNN superiority on moderate datasets
Reusable framework for future medical imaging research
Honest documentation of limitations & failure modes

23 · Honesty

Limitations & where we go next

Acknowledged Limitations

Single dataset. Generalization to HAM10000 not yet tested.
Higher compute cost than individual CNNs.
Grad-CAM localization isn't perfect — sometimes off-target.
No clinical validation conducted yet.
Closed-set classification — can't detect new lesion types.

Future Directions

Cross-dataset validation on HAM10000
Add third base learner with different inductive biases
Test on EfficientNet, ConvNeXt, Swin Transformer
Lightweight ensemble for real-time mobile deployment
Clinical validation with partner hospitals
Out-of-distribution detection for open-set recognition

END · Q&A

Thank you.

"For moderate-sized medical datasets, the future isn't bigger models — it's smarter combinations of existing ones."

94.16% Final Accuracy

9 Lesion Classes

22,500 Augmented Images

11 Models Compared

Mohammed Moin Uddin · Mir Md. Ejajul Hoque Eju · Md. Minhajul Islam Rahat

Supervised by Mr. Mohammad Mahadi Hassan · CSE · IIUC

We welcome your questions.
