VLSaliency

CLIP-based visual-language saliency heatmaps.

The idea is rather simple: embed the text prompt once, embed many random crops of the image, and aggregate the cosine similarity between the text embedding and each crop embedding over the pixels that crop covers.
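The crop-and-aggregate idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the project's actual code: the function names, the crop-size range, the number of crops, and the choice of a per-pixel mean as the aggregation are all hypothetical. To keep the sketch runnable without model weights, `mean_color_embedding` is a toy stand-in for a CLIP image encoder, and the "text embedding" is a hand-written vector.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors (0.0 if either has zero norm)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def saliency_map(image, text_embedding, embed_image,
                 n_crops=300, min_frac=0.2, max_frac=0.6, seed=0):
    """Average text/crop similarity per pixel over random crops.

    All parameters here are illustrative assumptions; `embed_image` is any
    callable mapping an (h, w, 3) crop to an embedding vector.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    score_sum = np.zeros((h, w))  # sum of similarities per pixel
    counts = np.zeros((h, w))     # number of crops covering each pixel

    for _ in range(n_crops):
        # Sample a random crop size and position inside the image.
        ch = max(1, int(h * rng.uniform(min_frac, max_frac)))
        cw = max(1, int(w * rng.uniform(min_frac, max_frac)))
        top = int(rng.integers(0, h - ch + 1))
        left = int(rng.integers(0, w - cw + 1))

        crop = image[top:top + ch, left:left + cw]
        sim = cosine_similarity(text_embedding, embed_image(crop))

        # Spread the crop's score over every pixel it covers.
        score_sum[top:top + ch, left:left + cw] += sim
        counts[top:top + ch, left:left + cw] += 1

    # Mean similarity per pixel; pixels never covered stay at 0.
    return np.divide(score_sum, counts,
                     out=np.zeros_like(score_sum), where=counts > 0)


def mean_color_embedding(crop):
    """Toy stand-in for a CLIP image encoder: the crop's mean RGB color."""
    return crop.reshape(-1, 3).mean(axis=0)


# Toy example: a red square on a blue background, queried with "red".
image = np.zeros((32, 32, 3))
image[..., 2] = 1.0                      # blue background
image[4:12, 4:12] = [1.0, 0.0, 0.0]      # red square
red_text_embedding = np.array([1.0, 0.0, 0.0])  # hypothetical text vector

heat = saliency_map(image, red_text_embedding, mean_color_embedding)
```

With a real model, `embed_image` would wrap the CLIP image encoder and `text_embedding` would come from the CLIP text encoder for the prompt. Averaging by coverage count, rather than summing, keeps pixels near the image center from looking salient just because more crops happen to overlap them.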