Zero-shot object detection with Grounding DINO

Tauseef Ahmad
6 min read · Aug 1, 2023

Background

Most existing object detection models are trained to identify a limited set of pre-determined classes. Adding new classes to the list of identifiable objects requires collecting and labeling new data and retraining the model from scratch, which is a time-consuming and expensive process. The objective of Grounding DINO (published in this paper in March 2023) is to detect arbitrary objects specified by human language inputs without retraining the model, a capability also known as zero-shot detection. The model can identify and detect any object simply from a text prompt.

What is Grounding DINO?

Grounding DINO is an open-set object detector that combines DINO (DETR with Improved deNoising anchOr boxes) with grounded pre-training in the style of GLIP. DINO, a transformer-based detection method, achieves state-of-the-art object detection with end-to-end optimization, eliminating the need for handcrafted modules like NMS (Non-Maximum Suppression). GLIP, on the other hand, focuses on phrase grounding: associating phrases or words from a given text with the corresponding visual elements in an image or video, effectively linking textual descriptions to their visual representations.

Model architecture

The core of the Grounding DINO architecture lies in its ability to effectively bridge the gap between language and vision. This is achieved with a two-stream architecture: multi-scale image features are extracted using an image backbone such as the Swin Transformer, and text features are extracted using a text backbone such as BERT.

Grounding DINO model architecture

The outputs of these two streams are fed into a feature enhancer that transforms the two sets of features into a single unified representation space. The feature enhancer stacks multiple feature enhancer layers: deformable self-attention is used to enhance the image features, regular self-attention is used for the text features, and image-to-text and text-to-image cross-attention layers fuse the two modalities.

Feature enhancer layer

Grounding DINO aims to detect the objects in an image that are specified by an input text. To effectively leverage the input text for object detection, a language-guided query selection module selects the image features most relevant to the text and uses them as decoder queries. These queries guide the decoder in localizing objects in the image and assigning them labels drawn from the text description.
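The paper describes this selection step with a short block of PyTorch-style pseudocode. The sketch below is a lightly adapted version of that idea; the tensor names and the helper function are illustrative, not part of the released code:

```python
import torch

def language_guided_query_selection(image_features, text_features, num_queries=900):
    """Pick the image tokens most similar to the text as initial decoder queries.

    image_features: (bs, num_image_tokens, d); text_features: (bs, num_text_tokens, d).
    Returns the indices of the selected image tokens, shape (bs, num_queries).
    """
    # Similarity of every image token with every text token.
    logits = torch.einsum("bid,btd->bit", image_features, text_features)
    # Score each image token by its best-matching text token ...
    scores_per_image_token = logits.max(dim=-1).values  # (bs, num_image_tokens)
    # ... and keep the top-k scoring image tokens as decoder query positions.
    return torch.topk(scores_per_image_token, num_queries, dim=1).indices
```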

Cross-modality decoder

A cross-modality decoder is then used to integrate the text and image features. It processes the fused features and the decoder queries through a series of attention layers and feed-forward networks, allowing the decoder to capture the relationships between visual and textual information, refine the object detections, and assign appropriate labels. After this step, the model proceeds with the final steps of object detection, including bounding box prediction, class-specific confidence filtering, and label assignment.
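To make the layer structure concrete, here is a schematic sketch of the data flow inside one decoder layer. It is my own simplification using vanilla PyTorch attention (the released model uses deformable attention for the image cross-attention step, and layer norms are omitted):

```python
import torch.nn as nn

class CrossModalityDecoderLayerSketch(nn.Module):
    """Schematic only: query self-attention, image cross-attention,
    text cross-attention, then a feed-forward network."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.image_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, image_features, text_features):
        # Queries exchange information with each other ...
        queries = queries + self.self_attn(queries, queries, queries)[0]
        # ... then gather visual evidence from the image features ...
        queries = queries + self.image_cross_attn(queries, image_features, image_features)[0]
        # ... then align with the text features before the feed-forward update.
        queries = queries + self.text_cross_attn(queries, text_features, text_features)[0]
        return queries + self.ffn(queries)
```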

How does it work?

Here is how Grounding DINO would work on this image:

Credit: IDEA-Research
  1. The model will first use its understanding of language to identify the objects that are mentioned in the text prompt. For example, in the description “two dogs with a stick,” the model would identify the words “dogs” and “stick” as objects
  2. The model will then generate a set of object proposals for each object that was identified in the natural language description. The object proposals are generated using a variety of features such as the color, shape, and texture of the objects
  3. Next, the score for each object proposal is returned by the model. The score is a measure of how likely it is that the object proposal contains an actual object
  4. The model would then select the top-scoring object proposals as the final detections. The final detections are the objects that the model is most confident are present in the image

In this case, the model would likely detect the two dogs and the stick in the image. The model would also likely score the two dogs higher than the stick, because the dogs are larger and more prominent in the image.

Implementation

In the following section, we will demonstrate open-set object detection. We will use a pre-trained Grounding DINO model to detect a ‘glass with lid’ (given as the text prompt) in a live camera feed.

Install Grounding DINO 🦕

First, the GitHub repository containing the PyTorch implementation and pre-trained models for Grounding DINO is cloned to your local directory. Create a file named main.py in the same directory where the repository has been cloned. This file will hold the main script that runs the Grounding DINO model on a camera feed. The relevant libraries and Grounding DINO modules are imported first by adding the commands below. The last two lines import the required inference utilities.

Import relevant modules
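As a minimal sketch, assuming the repository has been cloned and installed (for example with pip install -e .) so that its packages are importable, the top of main.py could look like this; the inference helpers named here come from the repository's groundingdino.util.inference module:

```python
# main.py
import cv2                    # camera capture and display
import torch                  # device selection
from PIL import Image         # NumPy frame -> PIL image conversion

import groundingdino.datasets.transforms as T   # image transforms used at inference
from groundingdino.util.inference import load_model, load_image, predict, annotate
```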

Set model config and weight file paths

Next, the Grounding DINO model configuration file and weight file paths are defined. Apart from that, we also define two hyperparameters, the box threshold and the text threshold, which control which object boxes are kept and which words from the prompt are used as labels. By default, the model outputs 900 candidate object boxes, ranked according to their similarity scores with the input text; this number corresponds to the number of queries set in the model configuration and can be changed there.

Define model paths and hyperparameters
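A sketch of these definitions is shown below. The file names match the ones shipped with the official repository, but the directory layout is an assumption; adjust the paths to wherever the config and the downloaded checkpoint live on your machine, and tune the threshold values to your use case:

```python
CONFIG_PATH = "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
WEIGHTS_PATH = "weights/groundingdino_swint_ogc.pth"

TEXT_PROMPT = "glass with lid"
BOX_THRESHOLD = 0.35   # minimum box-text similarity for a box to be kept
TEXT_THRESHOLD = 0.25  # minimum similarity for a word to appear in the predicted label

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = load_model(CONFIG_PATH, WEIGHTS_PATH, device=DEVICE)
```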

The box threshold specifies the minimum similarity score required for an object box to be considered a positive detection; it is applied to the highest similarity score between each box and the words in the input text. The text threshold, on the other hand, works at the word level: it specifies the minimum similarity score a word in the prompt must have with a detected box for that word to be included in the predicted label phrase.
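Conceptually, the two thresholds act on the raw model outputs roughly as in the simplified sketch below. This is a paraphrase of what the repository's predict helper does internally, assuming outputs is the raw model output dictionary; you do not need to write this yourself:

```python
logits = outputs["pred_logits"].sigmoid()[0]  # (num_queries, num_text_tokens)
boxes = outputs["pred_boxes"][0]              # (num_queries, 4), normalized cxcywh

keep = logits.max(dim=1).values > BOX_THRESHOLD   # box-level filter
kept_logits, kept_boxes = logits[keep], boxes[keep]

# For each surviving box, only the prompt tokens whose similarity exceeds
# TEXT_THRESHOLD are stitched together into the predicted phrase.
```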

Detection

Finally, we start the camera feed using the OpenCV module and continuously read frames. Before passing a frame to the model, we need to apply a few transformations. First, a transform object is created that chains three image transformations.

RandomResize([800], max_size=1333) - This transformation resizes the image so that its shorter side is 800 pixels, while capping the longer side at 1333 pixels, matching the input scales the model was trained with.

ToTensor() - This transformation converts the image to a PyTorch tensor.

Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) - This transformation normalizes the image by subtracting the mean and dividing by the standard deviation of the ImageNet dataset. This helps make the model more robust to changes in lighting and other factors.

Next, the frame (a NumPy array in BGR order, as returned by OpenCV) is converted to a PIL image in RGB color space and then turned into a normalized tensor by applying the three transformations above.
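Putting this together, a sketch of the preprocessing could look like the following; preprocess_frame is a hypothetical helper name, and the transform pipeline mirrors the one used by the repository's load_image function:

```python
transform = T.Compose(
    [
        T.RandomResize([800], max_size=1333),
        T.ToTensor(),
        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)

def preprocess_frame(frame):
    """Convert an OpenCV BGR frame into the normalized tensor the model expects."""
    image_pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    image_tensor, _ = transform(image_pil, None)  # these transforms take (image, target)
    return image_tensor
```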

If the input is just a JPEG/PNG file, the pre-defined load_image function can be used instead to load the image in the format required as model input.

Finally, we use the predict function, which takes the model, the image tensor, the text prompt, the box and text thresholds, and the device type, and returns a tuple of three outputs: a list of bounding boxes, their logits (confidence scores), and the matched text phrases. These outputs, along with the camera frame, are passed to the annotate function, which returns an annotated image (as shown below). The purple box is the bounding box for the glass and the green box is the one for the lid. The other glasses in the feed are not detected because they do not have a lid.

Annotated image: Glass with lid
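Putting the pieces together, the camera loop might look like the sketch below. It reuses the illustrative names defined in the earlier snippets (model, TEXT_PROMPT, the thresholds, and preprocess_frame); annotate in the repository expects an RGB frame and returns a BGR image that OpenCV can display directly:

```python
cap = cv2.VideoCapture(0)  # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break

    image_tensor = preprocess_frame(frame)
    boxes, logits, phrases = predict(
        model=model,
        image=image_tensor,
        caption=TEXT_PROMPT,
        box_threshold=BOX_THRESHOLD,
        text_threshold=TEXT_THRESHOLD,
        device=DEVICE,
    )

    annotated = annotate(
        image_source=cv2.cvtColor(frame, cv2.COLOR_BGR2RGB),
        boxes=boxes, logits=logits, phrases=phrases,
    )
    cv2.imshow("Grounding DINO", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```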

Tips for performance improvement:

  1. Use the GPU device type if an NVIDIA GPU is available. Since the pre-trained model targets PyTorch's CUDA backend, which is incompatible with my Mac's Metal GPU, CPU mode is used for this demonstration
  2. Provide precise text prompts to get accurate detections; avoid acronyms and short forms
  3. Fine-tune the hyperparameters: the box and text threshold values can be tweaked to reduce false positives

References:

  1. https://arxiv.org/pdf/2303.05499.pdf
  2. https://github.com/IDEA-Research/GroundingDINO
  3. https://blog.roboflow.com/grounding-dino-zero-shot-object-detection/
