The results are already included in the `crop_images` folder of this repo. If you want to run the program again, delete the `crop_images` folder first; otherwise, the program will not run properly.
By following this GitHub post provided by ashrielbrian, I managed to get the inference to work in batches. Modifications were needed because the original post only tests the latency of the model; they are made in the `batch_utlities.py` file.
Here is a copy of what he mentioned in the GitHub post:

I (he) managed to get a batch script to work in a somewhat hacky way:
Stack the images into a batch:
```python
images = torch.stack([load_image(img)[1] for img in img_paths])

boxes, logits, phrases = predict_batch(
    model=model,
    images=images,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)
```
You'll need to update the `load_image` function so it does not use `RandomResize`. Inside `datasets/transforms.py`, add this class:
```python
class Resize(object):
    def __init__(self, size):
        assert isinstance(size, (list, tuple))
        self.size = size

    def __call__(self, img, target=None):
        # `resize` is the module-level helper already defined in
        # datasets/transforms.py; here it is called with a fixed size
        # instead of a randomly sampled one.
        return resize(img, target, self.size)
```
Inside `load_image` in `inference.py`, I hardcoded the resize to ensure every image in the batch is the same size. This is hacky and probably (definitely) results in poorer performance.
```python
transform = T.Compose(
    [
        # T.RandomResize([800], max_size=1333),
        # Added T.Resize to fix the resized image size during batch inference
        T.Resize((800, 1200)),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)
```
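(A sketch to show where this lands: with that change, the whole `load_image` follows the structure of the stock implementation in `groundingdino/util/inference.py`, just with the fixed-size transform swapped in. This is my reading, not the post's exact code.)

```python
import numpy as np
from PIL import Image
import groundingdino.datasets.transforms as T


def load_image(image_path: str):
    # Same structure as the stock load_image, but with the fixed-size
    # T.Resize from above instead of T.RandomResize.
    transform = T.Compose(
        [
            T.Resize((800, 1200)),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image_source = Image.open(image_path).convert("RGB")
    image = np.asarray(image_source)
    image_transformed, _ = transform(image_source, None)
    # Index [1] in the stacking line above picks out this tensor.
    return image, image_transformed
```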
Adapting the existing `predict` function:
```python
# This sits alongside the existing `predict` in groundingdino/util/inference.py,
# so `preprocess_caption`, `get_phrases_from_posmap`, `torch`, `Tuple`, and
# `List` are already in scope there.
def predict_batch(
        model,
        images: torch.Tensor,
        caption: str,
        box_threshold: float,
        text_threshold: float,
        device: str = "cuda"
) -> Tuple[torch.Tensor, torch.Tensor, List[str]]:
    caption = preprocess_caption(caption=caption)

    model = model.to(device)
    image = images.to(device)
    print(f"Image shape: {image.shape}")  # Image shape: torch.Size([num_batch, 3, 800, 1200])

    with torch.no_grad():
        # The same caption is used for every image in this use-case.
        outputs = model(image, captions=[caption for _ in range(len(images))])

    print(f'{outputs["pred_logits"].shape}')  # torch.Size([num_batch, 900, 256])
    print(f'{outputs["pred_boxes"].shape}')   # torch.Size([num_batch, 900, 4])

    # NOTE: the [0] index means only the first image in the batch is
    # post-processed here -- the original post was only measuring latency.
    prediction_logits = outputs["pred_logits"].cpu().sigmoid()[0]  # prediction_logits.shape = (nq, 256)
    prediction_boxes = outputs["pred_boxes"].cpu()[0]              # prediction_boxes.shape = (nq, 4)

    mask = prediction_logits.max(dim=1)[0] > box_threshold
    logits = prediction_logits[mask]  # logits.shape = (n, 256)
    boxes = prediction_boxes[mask]    # boxes.shape = (n, 4)

    tokenizer = model.tokenizer
    tokenized = tokenizer(caption)

    phrases = [
        get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
        for logit in logits
    ]

    return boxes, logits.max(dim=1)[0], phrases
```
This gave me (him) a roughly 18% improvement in latency over single-image inference for a batch of 16 images.
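One caveat: as the comment in the snippet notes, the `[0]` index means only the first image in the batch is actually post-processed, which is fine for measuring latency but not for real batch inference. That is the part changed in `batch_utlities.py`. Below is a minimal sketch of the per-image loop (`postprocess_batch` is a hypothetical name, reusing the helpers from the snippet above; it is not the exact code in the file):

```python
from typing import List, Tuple

import torch
from groundingdino.util.utils import get_phrases_from_posmap


def postprocess_batch(
    outputs: dict,
    caption: str,
    tokenizer,
    box_threshold: float,
    text_threshold: float,
) -> Tuple[List[torch.Tensor], List[torch.Tensor], List[List[str]]]:
    # Apply the single-image logic from the snippet above to every
    # image in the batch, not just index 0.
    tokenized = tokenizer(caption)
    all_boxes, all_logits, all_phrases = [], [], []

    batch_logits = outputs["pred_logits"].cpu().sigmoid()  # (num_batch, nq, 256)
    batch_boxes = outputs["pred_boxes"].cpu()              # (num_batch, nq, 4)

    for prediction_logits, prediction_boxes in zip(batch_logits, batch_boxes):
        mask = prediction_logits.max(dim=1)[0] > box_threshold
        logits = prediction_logits[mask]  # (n, 256)
        boxes = prediction_boxes[mask]    # (n, 4)
        phrases = [
            get_phrases_from_posmap(logit > text_threshold, tokenized, tokenizer).replace('.', '')
            for logit in logits
        ]
        all_boxes.append(boxes)
        all_logits.append(logits.max(dim=1)[0])
        all_phrases.append(phrases)

    return all_boxes, all_logits, all_phrases
```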
To begin with, since this project is based on GroundingDINO, please follow the instructions in the original repo to set up the environment, download the pretrained model, and put the weight file in the `./weights` folder. The weight I use is `groundingdino_swint_ogc.pth`. Another thing to mention: the prompt I use is 'object' with a batch size of 2; you can modify these parameters as needed in the `inference_gdino.py` file. Moreover, I am running on the CPU; please modify the code in `inference_gdino.py` if you want to use a GPU.
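For example, a small switch like the following (illustrative; the variable name is up to you) picks the GPU when one is available and can be passed straight into `predict_batch` via its `device` argument:

```python
import torch

# Prefer the GPU when available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
```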
Put the images you want to infer on in the `./images` folder. The code only supports `.jpg` images for now.
Create a file path list by running `python create_img_path_list.py`. This creates a file path list for the images you want to infer on. The list should be in the `./images` folder and named `img_paths.txt`, with each line being the path to one image. For example:
```
10072701692931/38257c9f278852af.jpg
10072701692931/c6f407dd59d06a81.jpg
71230141340/2701e43b25f70394.jpg
71230141340/0f70e523ef92d29a.jpg
```
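For reference, `create_img_path_list.py` essentially boils down to a recursive glob over `./images`. A sketch, assuming paths are stored relative to `./images` as in the example above (the actual script may differ):

```python
from pathlib import Path

# Collect every .jpg under ./images and write each path, relative to
# ./images, into ./images/img_paths.txt -- one path per line.
image_root = Path("images")
paths = sorted(p.relative_to(image_root) for p in image_root.rglob("*.jpg"))

with (image_root / "img_paths.txt").open("w") as f:
    f.writelines(f"{p}\n" for p in paths)
```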
Run the inference:

```bash
python inference_gdino.py
```

The results will be saved in the `./crop_images` folder, where each subfolder is named after the SKU (aka product ID) of a product.
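For reference, turning GroundingDINO's normalized `(cx, cy, w, h)` boxes into saved crops looks roughly like this (a sketch, not the exact code in `inference_gdino.py`; the `crop_boxes` helper name is hypothetical):

```python
import os

import torch
from PIL import Image
from torchvision.ops import box_convert


def crop_boxes(image_path: str, boxes: torch.Tensor, out_dir: str) -> None:
    # GroundingDINO returns boxes as normalized (cx, cy, w, h).
    # Scale them to pixels and convert to (x1, y1, x2, y2) before cropping.
    os.makedirs(out_dir, exist_ok=True)
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), "cxcywh", "xyxy")
    for i, box in enumerate(xyxy.tolist()):
        image.crop(tuple(box)).save(os.path.join(out_dir, f"crop_{i}.jpg"))
```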