Visualization of cost map #40

Open
Kimsure opened this issue Oct 15, 2024 · 5 comments


Kimsure commented Oct 15, 2024

For the aggregated cost volume, we show the output of our model, which has a higher resolution of 96x96. We simply apply bilinear upsampling to overlay it with the image.

I don't have the code at the moment, but the visualized figures are min-max normalized with some scaling for visual clarity, as the model output does not necessarily match the scale of the initial cost volume. This should be enough to reproduce the figure, but please let me know if you need more details.

Originally posted by @hsshin98 in #6 (comment)
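For reference, a minimal sketch of that recipe (bilinear upsampling plus min-max normalization before overlaying) might look like the following. The function name, the scaling factor, and the assumption that `cost` is a 96x96 torch tensor are all illustrative, not from the authors' code:

```python
import cv2
import numpy as np
import torch.nn.functional as F

def overlay_cost_map(cost, image_bgr, alpha=0.5):
    """Bilinearly upsample a (96, 96) cost map and overlay it on the image."""
    h, w = image_bgr.shape[:2]
    # Bilinear upsampling to the image resolution, as described above
    cost = F.interpolate(cost.view(1, 1, *cost.shape), size=(h, w),
                         mode="bilinear", align_corners=False)[0, 0]
    cost = cost.detach().cpu().numpy()
    # Min-max normalization, with some scaling for visual clarity
    cost = (cost - cost.min()) / (cost.max() - cost.min() + 1e-8)
    cost = np.clip(cost * 1.2, 0.0, 1.0)  # illustrative scaling factor
    heat = cv2.applyColorMap(np.uint8(255 * cost), cv2.COLORMAP_JET)
    return cv2.addWeighted(image_bgr, 1 - alpha, heat, alpha, 0)
```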

Do you have the code to visualize the cost map now? How can I reproduce it?


Anglejuebi commented Nov 13, 2024

Hello, I have just started learning computer vision. I tried to write cost-volume visualization code myself, but the results are not very good. Have you managed to visualize the cost volume, or do you have any other ideas? Thank you.


Kimsure commented Nov 16, 2024

Hi, I referred to the visualization method for image-text similarity in CLIP-Surgery. Since the cost_map is the result of the cosine similarity between image and text features, I think overlaying the cost_map onto the original image gives the visualization.
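As a rough sketch of that idea (not the actual paper code): assuming per-patch image embeddings are already available, e.g. extracted the way CLIP-Surgery does, the cost_map is just the cosine similarity between normalized patch and text features:

```python
import clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def patch_text_cost_map(patch_features, text_prompt):
    """Cosine similarity between per-patch image features and a text prompt.

    patch_features: (N, D) tensor of patch embeddings; obtaining these is the
    CLIP-Surgery part and is assumed here. Returns an (N,) cost map that can
    be reshaped to the patch grid and overlaid on the image.
    """
    text_features = model.encode_text(clip.tokenize([text_prompt]).to(device))
    patch_features = F.normalize(patch_features.float(), dim=-1)
    text_features = F.normalize(text_features.float(), dim=-1)
    return (patch_features @ text_features.T).squeeze(-1)
```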

BTW, would you consider releasing your code so we can take a look together and see how it's done?


Anglejuebi commented Nov 16, 2024

Hello sir, thank you very much for your reply. My method is very simple: first, I crop the image; then, after passing the crops and the text through CLIP's image and text encoders, I compute the cosine similarity between each local patch and the text prompt to generate the heat map. However, this is easily affected by color words in the prompt; for example, a woman's hair will also be "hotter" when "Black" appears in the text prompt. Thank you for the reminder; I am going to study CLIP_Surgery's implementation next! Thanks again for your reply, and have a nice day!

```python
import torch
import clip
import numpy as np
import cv2
import matplotlib.pyplot as plt
from PIL import Image
from tqdm import tqdm  # Progress bar library
import torch.nn.functional as F  # For cosine similarity
import os
import datetime

# Load the CLIP model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load and preprocess the image
def load_image(image_path, target_size=(336, 336)):
    image = Image.open(image_path).convert("RGB")
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the replacement
    image = image.resize(target_size, Image.LANCZOS)  # Resize to 336x336
    image = preprocess(image).unsqueeze(0).to(device)
    return image

# Generate heatmap
def generate_heatmap(image_path, text_prompt, grid_size=8):
    # Load and resize the image to 336x336
    image = Image.open(image_path).convert("RGB")
    target_size = (336, 336)
    image = image.resize(target_size, Image.LANCZOS)
    image_width, image_height = image.size

    # Tokenize the text prompt
    text = clip.tokenize([text_prompt]).to(device)

    # Encode the text once; no_grad avoids building autograd graphs
    with torch.no_grad():
        text_features = model.encode_text(text)

    # Initialize the heatmap matrix (one cell per grid block)
    heatmap = np.zeros((image_height // grid_size, image_width // grid_size))

    # Slide over the image block by block with a tqdm progress bar
    for i in tqdm(range(0, image_height, grid_size), desc="Processing Rows"):
        for j in range(0, image_width, grid_size):
            # Row and column indices of this block in the heatmap
            row_idx = i // grid_size
            col_idx = j // grid_size

            # Skip blocks that fall outside the heatmap
            if row_idx >= heatmap.shape[0] or col_idx >= heatmap.shape[1]:
                continue

            # Crop each small block of the image
            cropped_image = image.crop((j, i, min(j + grid_size, image_width), min(i + grid_size, image_height)))
            cropped_image_tensor = preprocess(cropped_image).unsqueeze(0).to(device)

            with torch.no_grad():
                # Encode the image block
                image_features = model.encode_image(cropped_image_tensor)
                # Cosine similarity between the block and the text prompt
                similarity = F.cosine_similarity(image_features, text_features)

            # Fill the heatmap matrix
            heatmap[row_idx, col_idx] = similarity.item()

    # Print min and max values
    print("Heatmap min:", heatmap.min())
    print("Heatmap max:", heatmap.max())

    # Shift to non-negative before the log transform: cosine similarity can be
    # negative, and log(1 + x) is undefined for x <= -1
    heatmap = np.log1p(heatmap - heatmap.min())

    # Normalize to [0, 1]; the epsilon guards against a constant heatmap
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)

    print("Normalized heatmap min:", heatmap.min())
    print("Normalized heatmap max:", heatmap.max())

    # Map the heatmap to the [0, 255] range and convert to uint8 type
    heatmap = np.uint8(255 * heatmap)

    return heatmap, image  # Return the heatmap and the resized original image

# Visualize the heatmap
def visualize_heatmap(image_path, heatmap, original_image):
    # Ensure the output folder exists
    output_folder = "output"
    os.makedirs(output_folder, exist_ok=True)

    # Timestamp for the output filenames
    current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save the heatmap image
    heatmap_path = os.path.join(output_folder, f"heatmap_{current_time}.png")
    plt.figure(figsize=(10, 10))
    plt.imshow(heatmap, cmap='jet', alpha=0.6)
    plt.axis('off')
    plt.title('Heatmap')
    plt.savefig(heatmap_path, bbox_inches='tight', pad_inches=0)
    plt.close()

    # Overlay the heatmap on the original image
    image = cv2.cvtColor(np.array(original_image), cv2.COLOR_RGB2BGR)  # RGB -> BGR for OpenCV
    heatmap_resized = cv2.resize(heatmap, (image.shape[1], image.shape[0]))  # Resize to image size
    heatmap_colored = cv2.applyColorMap(heatmap_resized, cv2.COLORMAP_JET)

    # Blend: 70% image, 30% heatmap
    superimposed_img = cv2.addWeighted(image, 0.7, heatmap_colored, 0.3, 0)

    # Save the superimposed image
    superimposed_path = os.path.join(output_folder, f"superimposed_{current_time}.png")
    cv2.imwrite(superimposed_path, superimposed_img)

# Main function
if __name__ == "__main__":
    image_path = "images/img.png"  # Replace with the actual image path
    text_prompt = "A photo of a Black Cat in the scene"  # Replace with your text prompt

    heatmap, original_image = generate_heatmap(image_path, text_prompt)
    visualize_heatmap(image_path, heatmap, original_image)
```

Here is the original image:

[image: img]

Here is the heat map:

[image: heatmap_20241116_145736]


Kimsure commented Nov 18, 2024

Hi, thanks for your released code. Regarding your questions:

  1. Several studies have observed that directly applying CLIP to dense prediction tasks (e.g., segmentation) often yields suboptimal object grounding results.
  2. I'm unsure about the size of your input image, but the blocky artifacts in the heat map could potentially be caused by the upsampling; a sketch of a smoother upsampling step follows below.
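For what it's worth, one way to test the upsampling hypothesis in the script above (parameter values are illustrative): resize the coarse heat map with an explicit bicubic interpolation and a light Gaussian blur before applying the color map, in place of the plain cv2.resize call in visualize_heatmap:

```python
import cv2

def smooth_upsample(heatmap_u8, size, sigma=3):
    """Upsample a coarse heat map and soften grid artifacts.

    size is (width, height); sigma controls the Gaussian blur strength.
    """
    up = cv2.resize(heatmap_u8, size, interpolation=cv2.INTER_CUBIC)
    return cv2.GaussianBlur(up, (0, 0), sigma)
```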

@Anglejuebi

Thank you for your reply and answer. I am still in the process of learning. Have a nice day🌹🌹
