
# -*- coding: utf-8 -*-
"""Copie de Open Set Object Detection with Grounding DINO - Practical Paper Reading.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1DblmypBrQI_e6wMkPDKzJOZcs55CsoX_

Copied from https://colab.research.google.com/drive/1LC4AeONfiewqugs_ZIcTwZg4ODs7QcyO

[![KISZ](https://hpi.de/fileadmin/_processed_/1/2/csm_AI_Maker_Session_3f8dd88134.jpg)](https://aimaker.community)

# Open-Set Object Detection with Grounding DINO


---

[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/IDEA-Research/GroundingDINO) [![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499)

In computer vision research, object detection aims at developing models that can find objects in a given image or video scene. These models predict both the exact location of each object in the image (typically as a bounding box) and its category.

![Object Detection](https://drive.google.com/uc?export=view&id=1l81hl_UrijRdhxKSbyVvxlHn7RvU7vvk)

Traditionally, object detection was performed on a closed set of objects: models were trained on carefully annotated images covering a finite set of categories, e.g., cats, dogs, and tables. With pre-trained vision-language models such as [CLIP](https://openai.com/index/clip/), however, we can perform zero-shot open-set object detection, i.e., query objects with free-form natural-language text without being limited to a predefined set of categories.

![Openset OD](https://drive.google.com/uc?export=view&id=1xfxyiKVYTFbAZsnXKZeYV7jLlxPSOP38)



## Grounding DINO

**The term "Grounding" in "Grounding DINO" refers to the model's ability to ground or connect language descriptions to specific regions in an image**. There are a few key reasons why this model is called "Grounding DINO":

* **Language-Vision Alignment**: The model aims to align or "ground" textual descriptions (like object categories or referring expressions) to visual elements in images. This grounding process enables the model to detect objects specified by language inputs.
* **Phrase Grounding Task:** As described in Section 3 of the paper, the authors reformulate object detection as a phrase grounding task: category names are concatenated into a single text prompt, and these textual phrases are grounded to image regions (see the prompt-construction sketch after this list).
* **Use of Grounding Data:** The model is trained on grounding datasets like GoldG and RefC (mentioned in the Appendix Section "Data Usage"), which contain pairs of images and textual descriptions grounded to specific image regions.
* **Cross-Modality Fusion:** The model incorporates several mechanisms (feature enhancer, language-guided query selection, cross-modality decoder) to fuse or "ground" information between the language and vision modalities.
* **Open-Set Detection:** The grounding capability allows the model to detect arbitrary objects specified by language inputs, extending beyond a fixed set of predefined categories.
* **Referring Expression Comprehension:** The model's ability to handle referring expressions (like "man in blue") demonstrates its capacity to ground complex language descriptions to image regions.
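
As a concrete illustration of the phrase-grounding reformulation above, the snippet below builds a single text prompt from a list of category names. This is a minimal sketch: the category list is made up, and the `" . "` separator follows the convention used in the GroundingDINO repository's examples.

```python
# Minimal sketch: turn a closed-set label list into one grounding prompt.
# The " . " separator between phrases follows the GroundingDINO repo convention;
# the category names below are only an illustrative example.
categories = ["cat", "dog", "table"]
text_prompt = " . ".join(categories) + " ."
print(text_prompt)  # -> cat . dog . table .
```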



## Pro Tip: Use GPU Acceleration

If you are running this notebook in Google Colab, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `GPU`, and then click `Save`. This will ensure your notebook uses a GPU, which will significantly speed up model execution times.
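
You can also confirm from Python that a GPU is visible to the runtime. A minimal check, assuming PyTorch is installed (it is in the default Colab runtime):

```python
import torch

# Report whether CUDA is available and, if so, which device will be used.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```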

## Steps in this Tutorial

In this tutorial, we are going to cover:

- Introduction to Grounding DINO 🦕
- Install Grounding DINO and download pretrained weights
- Download example data
- Load the Grounding DINO model
- Run a Grounding DINO demo

## How Does Grounding DINO Work?

Grounding DINO is a dual-encoder-single-decoder model consisting of a visual encoder (a pre-trained DINO detector with a Swin Transformer backbone), a text encoder (BERT-base), a vision-language feature enhancer, a language-guided query selection module, and a cross-modality decoder.

![GDINO1](https://drive.google.com/uc?export=view&id=1MgNGJ8aEWqWqX7nU1rehA_YdRv1hPncJ)
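
To make the data flow more concrete, here is a deliberately simplified, illustrative sketch of the forward pass. It is not the repository's actual implementation: the feature enhancer and decoder are reduced to placeholders and dummy tensors stand in for real backbone/BERT features, but the language-guided query selection step mirrors the idea described in the paper (pick the image tokens most similar to the text tokens as initial queries).

```python
import torch

def grounding_dino_forward_sketch(image_feats, text_feats, num_queries=900):
    # Illustrative only; placeholders stand in for the real modules.
    # 1. Feature enhancer: image and text features cross-attend to each other
    #    (identity stand-in here).
    enhanced_img, enhanced_txt = image_feats, text_feats
    # 2. Language-guided query selection: pick the image tokens most similar
    #    to any text token as the initial decoder queries.
    similarity = enhanced_img @ enhanced_txt.T               # (num_img_tokens, num_txt_tokens)
    k = min(num_queries, enhanced_img.shape[0])
    _, idx = similarity.max(dim=1).values.topk(k)
    queries = enhanced_img[idx]
    # 3. Cross-modality decoder refines the queries against both modalities
    #    (omitted here) and predicts boxes plus box-to-token similarity logits.
    boxes = torch.sigmoid(torch.randn(k, 4))                 # dummy normalized cxcywh boxes
    logits = queries @ enhanced_txt.T                        # similarity to each text token
    return boxes, logits

# Tiny dummy example: 10 image tokens and 4 text tokens, 256-dim features.
boxes, logits = grounding_dino_forward_sketch(torch.randn(10, 256), torch.randn(4, 256), num_queries=5)
print(boxes.shape, logits.shape)  # torch.Size([5, 4]) torch.Size([5, 4])
```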

## Step 0

Use the `nvidia-smi` command to make sure that your notebook is using GPU as the main processor. In case of any problems navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `GPU`, and then click `Save`.
"""

!nvidia-smi

import os
HOME = os.getcwd()  # get current working directory
print(HOME)

"""## Install Grounding DINO 🦕"""

# Commented out IPython magic to ensure Python compatibility.
# %cd {HOME}
!git clone https://github.com/IDEA-Research/GroundingDINO.git
# %cd {HOME}/GroundingDINO
!pip install -q -e .

import os

CONFIG_PATH = os.path.join(HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinB_cfg.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))

"""## Download Grounding DINO Weights

Check out the model versions released here: https://github.com/IDEA-Research/GroundingDINO/releases

In this tutorial, we will use the latest released checkpoint (as of 02.07.2024): `groundingdino_swinb_cogcoor.pth`, which uses the Swin-B backbone.

"""

# Commented out IPython magic to ensure Python compatibility.
# %cd {HOME}
!mkdir -p {HOME}/weights
# %cd {HOME}/weights

!wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth

import os

WEIGHTS_NAME = "groundingdino_swinb_cogcoor.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

"""## Download Example Data"""

# Commented out IPython magic to ensure Python compatibility.
# %cd {HOME}
!mkdir -p {HOME}/data
# %cd {HOME}/data

!wget -q https://artinwords.de/wp-content/uploads/Pieter-Bruegel-Turmbau-zu-Babel.jpg
!wget -q http://images.cocodataset.org/val2017/000000039769.jpg
!wget -q http://images.cocodataset.org/val2017/000000010707.jpg
!wget -q https://upload.wikimedia.org/wikipedia/commons/thumb/e/e3/Plains_Zebra_Equus_quagga.jpg/330px-Plains_Zebra_Equus_quagga.jpg

"""## Load Grounding DINO Model"""

# Commented out IPython magic to ensure Python compatibility.
# %cd {HOME}/GroundingDINO

from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(CONFIG_PATH, WEIGHTS_PATH)
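
"""If you are on a CPU-only runtime, you can load the model on CPU instead. This is a hedged sketch: it assumes the repository's `load_model` helper exposes a `device` argument (defaulting to `"cuda"`)."""

import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Selected device:", DEVICE)
# Uncomment to reload the model explicitly on the chosen device:
# model = load_model(CONFIG_PATH, WEIGHTS_PATH, device=DEVICE)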

"""## Grounding DINO Demo"""

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "Pieter-Bruegel-Turmbau-zu-Babel.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "boats"
BOX_TRESHOLD = 0.3
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))
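
"""The `predict` helper returns `boxes` as normalized `(cx, cy, w, h)` coordinates, `logits` as one confidence score per box, and `phrases` as the matched text spans; `BOX_TRESHOLD` filters boxes by their best box-to-text similarity, while `TEXT_TRESHOLD` controls which text tokens count as part of the matched phrase. Below is a small sketch (assuming `torchvision` is available, as it is in the default Colab runtime) that converts the normalized boxes to pixel `xyxy` coordinates and prints them, similar to what `annotate` does internally."""

import torch
from torchvision.ops import box_convert

h, w, _ = image_source.shape  # image_source is an H x W x 3 numpy array
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy").numpy()
for box, score, phrase in zip(boxes_xyxy, logits, phrases):
    print(f"{phrase}: score={score.item():.2f}, box={box.round(1)}")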

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "Pieter-Bruegel-Turmbau-zu-Babel.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "boats in the sea"
BOX_TRESHOLD = 0.3
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "Pieter-Bruegel-Turmbau-zu-Babel.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "boat most at the left, tower of Babel, people, house"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "000000039769.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "remote control"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "000000039769.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "cat on sofa"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "000000039769.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "cat's tail"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "000000010707.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "sofa"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
IMAGE_NAME = "000000010707.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "drinks on the table"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
IMAGE_NAME = "000000010707.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "laptop, people, game"
BOX_TRESHOLD = 0.15
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))
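
"""Lowering `BOX_TRESHOLD` (0.15 above) recalls more candidate boxes but also lets weaker matches through. A small, illustrative post-filtering sketch that keeps only the highest-scoring detections from the previous cell:"""

import torch

TOP_K = 5  # illustrative cutoff, not a tuned value
scores, order = torch.sort(logits, descending=True)
keep = order[:TOP_K]
top_boxes = boxes[keep]
top_phrases = [phrases[i] for i in keep.tolist()]
print(list(zip(top_phrases, scores[:TOP_K].tolist())))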

# Commented out IPython magic to ensure Python compatibility.
IMAGE_NAME = "000000010707.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "beer"
BOX_TRESHOLD = 0.2
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))

# Commented out IPython magic to ensure Python compatibility.
import os
import supervision as sv

IMAGE_NAME = "330px-Plains_Zebra_Equus_quagga.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "zebra"
BOX_TRESHOLD = 0.3
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

# %matplotlib inline
sv.plot_image(annotated_frame, (14, 14))