
# -*- coding: utf-8 -*-
"""Copie de florence-2.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1CrEb2kWHbA5LqvjADGH0AhicuR74WQfE

# How to use Florence-2

This notebook shows how to use **Florence-2**, the foundational vision model from Microsoft, and is associated with the article **[Florence-2: How it works and how to use it](https://www.assemblyai.com/blog/florence-2-how-it-works-how-to-use)**.

**It is recommended to run this notebook in Colab - if you are coming from the [associated repository](https://github.com/AssemblyAI-Community/florence-2), you can find the link to the colab [here](https://colab.research.google.com/drive/1gD56EBmQ7MZfBkyhGtcrvHtaD2KMSm6W?usp=sharing)**.

If you run this locally you will also need to install the packages in the requirements file.

## Setup

First, we need to install packages beyond those pre-installed in Colab in order to run Florence-2.
"""

# Commented out IPython magic to ensure Python compatibility.
# %%capture
# !pip install timm flash_attn einops;

"""Next, we clone the associated repository and move the files into the working directory."""

# Commented out IPython magic to ensure Python compatibility.
# %%bash
# git clone https://github.com/AssemblyAI-Community/florence-2
# mv florence-2/** .
# rm -rf ./florence-2/

"""Now we import the packages we'll need, including the `utils.py` module from the repository that we just cloned. This file provides misellaneous functionality to make it easier to work with Florence-2."""

# Commented out IPython magic to ensure Python compatibility.
import copy

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests

import utils

# %matplotlib inline

"""Next we load the Florence-2 model and processor"""

model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
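"""If GPU memory is limited, the model can also be loaded in half precision. This variant is not part of the original walkthrough - it is an optional sketch that uses the standard `torch_dtype` argument from `transformers` - so it is left commented out."""

# Optional (assumption: a CUDA GPU is available; roughly halves memory at a small precision cost)
# import torch
# model = AutoModelForCausalLM.from_pretrained(
#     model_id, torch_dtype=torch.float16, trust_remote_code=True
# ).eval().cuda()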

"""And then we set these models as constants for our `utils.py` module so that the functions can utilize them as global constants."""

utils.set_model_info(model, processor)

"""Finally, we'll specify the image that we will use for processing. We keep the raw image file for visuals and also convert the image to RGB for passing into Florence-2."""

path = "./car.png"
image = Image.open(path)
image_rgb = Image.open(path).convert("RGB")

image

"""## Florence-2 tasks

Florence-2 is capable of a wide variety of tasks, which are specified to the model as special tokens. Below we overview the tasks that Florence-2 can complete.

Note that "natural language" is an umbrella term here that covers descriptive language, "categorical" language (class labels), and the visual representation of language (text in the image). Additionally, a "region" is a general section of an image delimited by a bounding box or quadrilateral, whereas a "segment" is a precise section of an image delimited by a border polygon.

The tasks are categorized by the additional input needed beyond the image being analyzed - either no additional input, an additional bounding box specifying a region, or an additional textual input. After the tables, a short sketch shows what a single task invocation looks like under the hood.


## No additional input

| Task category | Output | Description | Associated Task Tokens |
|---|---|---|---|
| Captioning | Natural Language | Generates a natural language descriptive caption for an entire image | CAPTION<br>DETAILED_CAPTION<br>MORE_DETAILED_CAPTION |
| Character Recognition | Natural Language<br>(+ Regions) | Detects and returns text that is contained in an image, potentially also returning bounding boxes for the detected text | OCR<br>OCR_WITH_REGION |
| Object Detection | Regions<br>(+ Natural Language) | Proposes bounding boxes for salient objects, optionally with either categorical or descriptive labels | REGION_PROPOSAL<br>OD<br>DENSE_REGION_CAPTION |

## Region input

| Task category | Output | Description | Associated Task Tokens |
|---|---|---|---|
| Segmentation | Segments | Detects segments for provided regions (i.e. border detection) | REGION_TO_SEGMENTATION |
| Description | Natural Language | Provides either a categorical or descriptive annotation for the contents of a region | REGION_TO_CATEGORY<br>REGION_TO_DESCRIPTION |

## Language input

| Task category | Output | Description | Associated Task Tokens |
|---|---|---|---|
| Phrase Grounding | Regions | Detects regions for salient phrases in a description of an image (one-to-many) | PHRASE_GROUNDING |
| Vocabulary Detection | Regions | Detects regions for specific objects (mix of phrase grounding and region proposal), can also detect text | OPEN_VOCABULARY_DETECTION |
| Vocabulary Segmentation | Segments | Detects segments for objects referred to by a particular expression (effectively the segment version of vocabulary detection) | REFERRING_EXPRESSION_SEGMENTATION |
"""

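"""Before diving into the individual tasks, here is a rough sketch of what a single Florence-2 invocation looks like under the hood. The `utils.run_example` helper used throughout this notebook presumably wraps a loop like this one (based on the standard Florence-2 usage pattern); the exact helper implementation may differ, and `florence_sketch` below is purely illustrative."""

def florence_sketch(task_token, image, text_input=None):
    """Illustrative only: run one Florence-2 task token and parse its output."""
    # The prompt is the task token, optionally followed by the extra text input
    prompt = task_token if text_input is None else task_token + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens - region/segment coordinates are encoded as special tokens
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor converts the raw token string into a task-specific results dict
    return processor.post_process_generation(
        generated_text, task=task_token, image_size=(image.width, image.height)
    )

# e.g. florence_sketch("<CAPTION>", image_rgb) would return something like {'<CAPTION>': '...'}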
"""Below we'll see how to run Florence-2 for all of these tasks in the order they appear in this list.

### No additional input

The tasks in this section require no additional input beyond the image.

#### Captioning

Captioning generates a descriptive natural language annotation for the image as a whole. Florence-2 accommodates 3 levels of linguistic granularity for captioning.
"""

tasks = [utils.TaskType.CAPTION,
         utils.TaskType.DETAILED_CAPTION,
         utils.TaskType.MORE_DETAILED_CAPTION,]

for task in tasks:
    results = utils.run_example(task, image_rgb)
    print(f'{task.value}{results[task]}')

"""#### Character recognition

Character recognition (OCR) tasks detect the text present in images, and optionally return bounding boxes for the identified text.

See also [vocab detection](#scrollTo=ouid-bVL53hd&line=1&uniqifier=1) to actively search for the presence of a specific word or phrase.
"""

task = utils.TaskType.OCR
results = utils.run_example(task, image_rgb)
print('Detected Text: ', results[task])

task = utils.TaskType.OCR_WITH_REGION
results = utils.run_example(task, image_rgb)

# Boxes drawn directly to image, so copy to avoid adulterating image for later tasks
image_copy = copy.deepcopy(image)
utils.draw_ocr_bboxes(image_copy, results[task])

"""#### Object detection

Object detection automatically detects the salient objects in an image. Florence-2 supports 3 levels of semantic granularity:

1. None (bounding boxes only)
2. Categorical labels
3. Descriptive labels
"""

tasks = [utils.TaskType.REGION_PROPOSAL,
         utils.TaskType.OBJECT_DETECTION,
         utils.TaskType.DENSE_REGION_CAPTION,]

for task in tasks:
    results = utils.run_example(task, image_rgb)
    print(task.value)
    utils.plot_bbox(results[task], image)

"""### Region input

The tasks in this section require an additional bounding box input in addition to the image.

#### Segmentation

Segmentation converts a specified region into a segment (i.e. a bounding border).
"""

top_left = [702, 575]
bottom_right = [866, 772]

task_prompt = utils.TaskType.REG_TO_SEG

# converts coordinates to a location-token string (format `"<loc_x1><loc_y1><loc_x2><loc_y2>"`; see the sketch after this cell)
text_input = utils.convert_relative_to_loc(top_left + bottom_right)
results = utils.run_example(task_prompt, image_rgb, text_input=text_input)

bbox_coords = utils.convert_relative_to_bbox(top_left + bottom_right, image)
box = {'bboxes': [bbox_coords], 'labels': ['']}

# draw input bounding box and output segment
image_copy = copy.deepcopy(image)
image_copy = utils.draw_polygons(image_copy, results[task_prompt], fill_mask=True)
utils.plot_bbox(box, image_copy)
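"""The coordinates passed to region-input tasks are not pixels: Florence-2 expects box corners encoded as special location tokens. Following the Florence-2 convention, each coordinate is normalized to the image size and quantized into 1,000 bins, giving values in 0-999 that are written as `<loc_value>` tokens. The sketch below shows roughly what helpers like `convert_relative_to_loc` and `convert_relative_to_bbox` presumably do - it is illustrative, not the actual `utils.py` implementation."""

def to_loc_string_sketch(relative_coords):
    """Illustrative: format 0-999 'relative' coordinates as Florence-2 location tokens."""
    return ''.join(f'<loc_{int(c)}>' for c in relative_coords)

def to_pixel_bbox_sketch(relative_coords, image):
    """Illustrative: map 0-999 'relative' box coordinates back to pixel coordinates."""
    x1, y1, x2, y2 = relative_coords
    return [x1 / 999 * image.width, y1 / 999 * image.height,
            x2 / 999 * image.width, y2 / 999 * image.height]

# e.g. to_loc_string_sketch([702, 575, 866, 772]) -> '<loc_702><loc_575><loc_866><loc_772>'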
"""#### Region description

Region description describes a region with either a categorical or descriptive annotation.
"""

top_left = [52, 332]
bottom_right = [932, 774]

text_input = utils.convert_relative_to_loc(top_left + bottom_right)
bbox = utils.convert_relative_to_bbox(top_left + bottom_right, image)

for task_prompt in [utils.TaskType.REGION_TO_CATEGORY, utils.TaskType.REGION_TO_DESCRIPTION]:
    results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
    text_result = results[task_prompt].strip().split('<')[0]
    box = {'bboxes': [bbox], 'labels': [text_result]}
    utils.plot_bbox(box, image)

"""### Language input

The tasks in this section require an additional linguistic input in addition to the image.

#### Phrase grounding

Given a descriptive annotation of the image (usually of the image in its entirety, i.e. a caption), phrase grounding identifies salient phrases in the annotation and assigns a bounding box to each one.
"""

task_prompt = utils.TaskType.PHRASE_GROUNDING
results = utils.run_example(task_prompt, image_rgb, text_input="A green car parked in front of a yellow building.")
utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)

"""#### Vocab detection

Vocabulary detection is like the inverse of [region description](#scrollTo=9a1ffcf4-4404-4b6d-aee8-6c055da51301&line=3&uniqifier=1) - rather than supplying a region and getting a description, you supply a description and receive the associated region.
"""

task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
results = utils.run_example(task_prompt, image_rgb, text_input="a turquoise car")

bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
utils.plot_bbox(bbox_results, image)

"""Vocabulary detection also works with visual language (i.e. text that is in the image).

See also [character recognition](#scrollTo=1dee1c41-0e99-41e8-b647-373cf84aa6a7&line=5&uniqifier=1) to detect all text without searching for a specified word/phrase.
"""

task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
results = utils.run_example(task_prompt, image_rgb, text_input="stop")

bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
utils.plot_bbox(bbox_results, image)

"""#### Vocab segmentation

Vocabulary segmentation is like vocabulary detection, except it returns segments rather than regions and cannot identify visual text.
"""

task_prompt = utils.TaskType.RES
results = utils.run_example(task_prompt, image_rgb, text_input="a stop sign")

image_copy = copy.deepcopy(image)
utils.draw_polygons(image_copy, results[utils.TaskType.RES], fill_mask=True)

"""## Cascaded tasks

### Detailed Caption + Phrase Grounding + Segmentations

Note that we can cascade tasks together - for example, given only an image, we can use captioning to generate a description of the entire image and then phrase grounding to associate different objects in the image with the caption.
"""

# Get a caption
task_prompt = utils.TaskType.CAPTION
results = utils.run_example(task_prompt, image_rgb)

# Use the output as the input into the next task (phrase grounding)
text_input = results[task_prompt]
task_prompt = utils.TaskType.PHRASE_GROUNDING
results = utils.run_example(task_prompt, image_rgb, text_input)

results[utils.TaskType.DETAILED_CAPTION] = text_input

print(text_input)
utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)

"""We can go a step further and perform region segmentation on the regions identified by the phrase grounding task (whose input was, in turn, supplied by the output of the caption task)."""

polygons = []
task_prompt = utils.TaskType.REG_TO_SEG

# Run region to segmentation for each region identified by phrase grounding
for box in results[utils.TaskType.PHRASE_GROUNDING.value]['bboxes']:
    box = utils.convert_bbox_to_relative(box, image)
    text_input = utils.convert_relative_to_loc(box)
    run_results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
    polygons += run_results[task_prompt]['polygons']
# Commented out IPython magic to ensure Python compatibility.
# %%script echo Disabled
#
# # Construct the labels list required and create the necessary input dict
# labels = []
# for polygon in polygons:
#     l = []
#     for idx, polygon_ in enumerate(polygon):
#         l.append('')
#     labels.append(l)
#
# seg_results = dict(polygons=polygons, labels=labels)

# Construct the labels list required and create the necessary input dict
labels = []
for polygon in polygons:
    l = []
    for idx, polygon_ in enumerate(polygon):
        # Instead of appending a list of empty strings or a list of objects,
        # append a single descriptive label for the entire polygon.
        l.append(f"Object {len(labels) + 1}")
    labels.append(''.join(l))  # Join the list into a single string

seg_results = dict(polygons=polygons, labels=labels)

# draw the output
image_copy = copy.deepcopy(image)
utils.draw_polygons(image_copy, seg_results, fill_mask=True)

"""# That's it!

Once you're finished with this Colab, head back to the [Final words](https://www.assemblyai.com/blog/florence-2-how-it-works-how-to-use#final-words) of the associated article, or check out our YouTube channel for additional learning resources:

https://www.youtube.com/c/AssemblyAI
"""