Venue
International Journal of Computer Vision
We propose a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say, logits for 'dog' or even a caption) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multi-modal inputs (e.g. VQA) or reinforcement learning, without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) are robust to adversarial images, (c) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models can localize inputs. Finally, we design and conduct human studies to measure whether Grad-CAM explanations help users establish appropriate trust in predictions from deep networks, and show that Grad-CAM helps untrained users successfully discern a 'stronger' deep network from a 'weaker' one. Our code is available at https://github.com/ramprs/grad-cam/ and a demo is available on CloudCV [2]. A video of the demo can be found at youtu.be/COjUB9Izk6E.
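The following is a minimal sketch of the Grad-CAM computation described in the abstract (global-average-pool the gradients of the target score with respect to the final convolutional feature maps, take a ReLU of the weighted combination), assuming a PyTorch VGG-16 backbone. The hook-based helper and the layer index are illustrative choices, not the authors' released implementation linked above.

```python
import torch
import torch.nn.functional as F
from torchvision import models


def grad_cam(model, image, target_class, conv_layer):
    """Compute a Grad-CAM heatmap for `target_class`.

    image: (1, 3, H, W) preprocessed tensor; conv_layer: the last conv layer module.
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    activations, gradients = [], []

    # Capture forward activations and backward gradients at the chosen layer.
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    scores = model(image)                 # (1, num_classes) class scores
    model.zero_grad()
    scores[0, target_class].backward()    # gradient of the target score only

    fwd.remove()
    bwd.remove()

    A = activations[0]                    # (1, K, u, v) feature maps
    dYdA = gradients[0]                   # (1, K, u, v) gradients of the score w.r.t. the maps

    weights = dYdA.mean(dim=(2, 3), keepdim=True)         # alpha_k: global-average-pooled gradients
    cam = F.relu((weights * A).sum(dim=1, keepdim=True))  # ReLU of the weighted combination

    # Upsample to the input resolution and normalize for visualization.
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0].detach()


# Example usage (assumes a preprocessed 224x224 input tensor `x`):
# vgg = models.vgg16(weights="IMAGENET1K_V1")
# heatmap = grad_cam(vgg, x, target_class=243, conv_layer=vgg.features[28])  # 243: an ImageNet class index
```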
This paper presents Grad-CAM, a technique for generating visual explanations from Convolutional Neural Networks (CNNs) by using gradient information to create localization maps that highlight important regions in images. Grad-CAM is applicable to a wide range of CNN architectures without architectural modifications or retraining. The technique aims to improve the interpretability of AI systems, particularly in image classification, image captioning, and visual question answering (VQA). The authors evaluate Grad-CAM against existing methods, demonstrating improvements in understanding model predictions, identifying dataset biases, and providing explanations that are faithful to the underlying model. Empirical results show its effectiveness on weakly-supervised localization, its stronger class discriminativeness, and increased user trust, measured through human studies. Counterfactual explanations and applications in bias detection are also explored.
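The high-resolution class-discriminative visualization mentioned above (Guided Grad-CAM in the paper) combines a fine-grained gradient-based saliency map with the coarse Grad-CAM heatmap by pointwise multiplication. A hedged sketch, assuming the guided-backpropagation map is computed separately and the `grad_cam` helper above is available:

```python
import torch


def guided_grad_cam(guided_backprop, cam):
    """Fuse a fine-grained saliency map with a Grad-CAM heatmap.

    guided_backprop: (3, H, W) guided-backpropagation gradients w.r.t. the input,
    computed separately (e.g. by suppressing negative gradients at ReLUs).
    cam: (H, W) Grad-CAM heatmap in [0, 1], already upsampled to the input size.
    Returns a (3, H, W) high-resolution, class-discriminative visualization.
    """
    # Pointwise multiplication keeps the fine-grained detail only where the
    # coarse Grad-CAM map says the class evidence is.
    return guided_backprop * cam.unsqueeze(0)
```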
This paper employs the following methods:
- Grad-CAM (Gradient-weighted Class Activation Mapping)
- Guided Grad-CAM (Grad-CAM fused with fine-grained guided-backpropagation visualizations)
The following datasets were used in this research:
- ILSVRC-15
- PASCAL VOC 2007
- COCO
The following evaluation metrics were used (the localization metrics are computed from bounding boxes derived from the Grad-CAM heatmaps, as sketched below):
- Top-1 localization error
- Top-5 localization error
- mAP
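For the weakly-supervised localization metrics above, the paper describes binarizing the Grad-CAM heatmap at 15% of its maximum intensity and drawing a bounding box around the single largest connected segment. A sketch under that assumption; the helper name and the full-image fallback are illustrative:

```python
import numpy as np
from scipy import ndimage


def cam_to_bbox(cam, threshold=0.15):
    """Turn a normalized Grad-CAM heatmap into a single bounding box.

    cam: (H, W) array in [0, 1]. The map is binarized at `threshold` of its
    maximum, and a box is drawn around the largest connected segment, which
    is then compared against the localization ground truth.
    Returns (x_min, y_min, x_max, y_max).
    """
    mask = cam >= threshold * cam.max()
    labels, num = ndimage.label(mask)
    if num == 0:
        return (0, 0, cam.shape[1] - 1, cam.shape[0] - 1)  # fall back to the full image
    # Pick the connected component with the largest area.
    areas = ndimage.sum(mask, labels, index=range(1, num + 1))
    largest = 1 + int(np.argmax(areas))
    ys, xs = np.where(labels == largest)
    return (xs.min(), ys.min(), xs.max(), ys.max())
```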
The authors report the following findings:
- Grad-CAM outperforms c-MWP and the method of Simonyan et al. on weakly-supervised localization
- Grad-CAM helps identify dataset bias
- Grad-CAM visualizations help untrained users discern a more reliable model from a less reliable one
The authors identified the following limitations:
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- Grad-CAM
- visual explanations
- CNN interpretability
- model trust
- weakly-supervised localization