Venue
International Journal of Computer Vision
We propose a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent. Our approach, Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say, logits for 'dog' or even a caption) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multi-modal inputs (e.g. VQA) or reinforcement learning, without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) are robust to adversarial images, (c) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models can localize inputs. Finally, we design and conduct human studies to measure whether Grad-CAM explanations help users establish appropriate trust in predictions from deep networks, and show that Grad-CAM helps untrained users successfully discern a 'stronger' deep network from a 'weaker' one. Our code is available at https://github.com/ramprs/grad-cam/ and a demo is available on CloudCV [2]. A video of the demo can be found at youtu.be/COjUB9Izk6E.
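The following is a minimal sketch of the Grad-CAM computation described in the abstract (global-average-pool the gradients of the target score with respect to the final convolutional feature maps, take a ReLU of the weighted combination), assuming a PyTorch VGG-16 backbone. The hook-based helper and the layer index are illustrative choices, not the authors' released implementation linked above.

```python
import torch
import torch.nn.functional as F
from torchvision import models


def grad_cam(model, image, target_class, conv_layer):
    """Compute a Grad-CAM heatmap for `target_class`.

    image: (1, 3, H, W) preprocessed tensor; conv_layer: the last conv layer module.
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    activations, gradients = [], []

    # Capture forward activations and backward gradients at the chosen layer.
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    scores = model(image)                 # (1, num_classes) class scores
    model.zero_grad()
    scores[0, target_class].backward()    # gradient of the target score only

    fwd.remove()
    bwd.remove()

    A = activations[0]                    # (1, K, u, v) feature maps
    dYdA = gradients[0]                   # (1, K, u, v) gradients of the score w.r.t. the maps

    weights = dYdA.mean(dim=(2, 3), keepdim=True)         # alpha_k: global-average-pooled gradients
    cam = F.relu((weights * A).sum(dim=1, keepdim=True))  # ReLU of the weighted combination

    # Upsample to the input resolution and normalize for visualization.
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0].detach()


# Example usage (assumes a preprocessed 224x224 input tensor `x`):
# vgg = models.vgg16(weights="IMAGENET1K_V1")
# heatmap = grad_cam(vgg, x, target_class=243, conv_layer=vgg.features[28])  # 243: an ImageNet class index
```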
This paper presents Grad-CAM, a technique for generating visual explanations from Convolutional Neural Networks (CNNs) by using gradient information to create localization maps that highlight important regions in images. Grad-CAM is applicable to a wide range of CNN architectures without architectural modifications or retraining. The technique aims to improve the interpretability of AI systems, particularly in image classification, image captioning, and visual question answering (VQA). The authors evaluate Grad-CAM against existing methods, demonstrating improvements in understanding model predictions, identifying dataset biases, and providing explanations that are faithful to the underlying model. Empirical results show its effectiveness on weakly-supervised localization, its stronger class discriminativeness, and increased user trust, measured through human studies. Counterfactual explanations and applications in bias detection are also explored.
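The high-resolution class-discriminative visualization mentioned above (Guided Grad-CAM in the paper) combines a fine-grained gradient-based saliency map with the coarse Grad-CAM heatmap by pointwise multiplication. A hedged sketch, assuming the guided-backpropagation map is computed separately and the `grad_cam` helper above is available:

```python
import torch


def guided_grad_cam(guided_backprop, cam):
    """Fuse a fine-grained saliency map with a Grad-CAM heatmap.

    guided_backprop: (3, H, W) guided-backpropagation gradients w.r.t. the input,
    computed separately (e.g. by suppressing negative gradients at ReLUs).
    cam: (H, W) Grad-CAM heatmap in [0, 1], already upsampled to the input size.
    Returns a (3, H, W) high-resolution, class-discriminative visualization.
    """
    # Pointwise multiplication keeps the fine-grained detail only where the
    # coarse Grad-CAM map says the class evidence is.
    return guided_backprop * cam.unsqueeze(0)
```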
This paper employs the following methods:
- Grad-CAM (Gradient-weighted Class Activation Mapping)
- Guided Grad-CAM (Grad-CAM fused with fine-grained guided-backpropagation visualizations)
The following datasets were used in this research:
- ILSVRC-15
- PASCAL VOC 2007
- COCO
The following evaluation metrics were used (the localization metrics are computed from bounding boxes derived from the Grad-CAM heatmaps, as sketched below):
- Top-1 localization error
- Top-5 localization error
- mAP
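For the weakly-supervised localization metrics above, the paper describes binarizing the Grad-CAM heatmap at 15% of its maximum intensity and drawing a bounding box around the single largest connected segment. A sketch under that assumption; the helper name and the full-image fallback are illustrative:

```python
import numpy as np
from scipy import ndimage


def cam_to_bbox(cam, threshold=0.15):
    """Turn a normalized Grad-CAM heatmap into a single bounding box.

    cam: (H, W) array in [0, 1]. The map is binarized at `threshold` of its
    maximum, and a box is drawn around the largest connected segment, which
    is then compared against the localization ground truth.
    Returns (x_min, y_min, x_max, y_max).
    """
    mask = cam >= threshold * cam.max()
    labels, num = ndimage.label(mask)
    if num == 0:
        return (0, 0, cam.shape[1] - 1, cam.shape[0] - 1)  # fall back to the full image
    # Pick the connected component with the largest area.
    areas = ndimage.sum(mask, labels, index=range(1, num + 1))
    largest = 1 + int(np.argmax(areas))
    ys, xs = np.where(labels == largest)
    return (xs.min(), ys.min(), xs.max(), ys.max())
```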
The authors report the following findings:
- Grad-CAM outperforms c-MWP and the method of Simonyan et al. on weakly-supervised localization
- Grad-CAM helps identify dataset bias
- Grad-CAM visualizations help untrained users discern a more reliable model from a less reliable one
The authors identified the following limitations:
Compute resources:
- Number of GPUs: None specified
- GPU Type: None specified
Keywords:
- Grad-CAM
- visual explanations
- CNN interpretability
- model trust
- weakly-supervised localization