# -*- coding: utf-8 -*-
"""Copie de florence-2.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1CrEb2kWHbA5LqvjADGH0AhicuR74WQfE
# How to use Florence-2
This notebook shows how to use **Florence-2**, the foundational vision model from Microsoft, and is associated with the article **[Florence-2: How it works and how to use it](https://www.assemblyai.com/blog/florence-2-how-it-works-how-to-use)**.
**It is recommended to run this notebook in Colab - if you are coming from the [associated repository](https://github.com/AssemblyAI-Community/florence-2), you can find the link to the colab [here](https://colab.research.google.com/drive/1gD56EBmQ7MZfBkyhGtcrvHtaD2KMSm6W?usp=sharing)**.
If you run this locally you will also need to install the packages in the requirements file.
## Setup
First, we need to install packages beyond those pre-installed in Colab in order to run Florence-2.
"""
# Commented out IPython magic to ensure Python compatibility.
# %%capture
# !pip install timm flash_attn einops;
"""Next, we clone the associated repository and move the files into the working directory."""
# Commented out IPython magic to ensure Python compatibility.
# %%bash
# git clone https://github.com/AssemblyAI-Community/florence-2
# mv florence-2/** .
# rm -rf ./florence-2/
"""Now we import the packages we'll need, including the `utils.py` module from the repository that we just cloned. This file provides misellaneous functionality to make it easier to work with Florence-2."""
# Commented out IPython magic to ensure Python compatibility.
import copy
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
import utils
# %matplotlib inline
"""Next we load the Florence-2 model and processor"""
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
"""And then we set these models as constants for our `utils.py` module so that the functions can utilize them as global constants."""
utils.set_model_info(model, processor)
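"""For reference, here is a minimal sketch of the standard Florence-2 generation recipe that `utils.run_example` presumably wraps, following the model card's sample code. The actual implementation in `utils.py` may differ, and `florence_generate` is a hypothetical name used only for illustration."""
def florence_generate(task_prompt, image, text_input=None):
    # task_prompt is the raw task-token string (presumably what
    # utils.TaskType.<NAME>.value holds), e.g. "<CAPTION>". Optional text such
    # as a phrase to ground or a region string is appended to the prompt.
    prompt = task_prompt if text_input is None else task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses the raw text into task-specific structures
    # (captions, bounding boxes, polygons, ...), keyed by the task token.
    return processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )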
"""Finally, we'll specify the image that we will use for processing. We keep the raw image file for visuals and also convert the image to RGB for passing into Florence-2."""
path = "./car.png"
image = Image.open(path)
image_rgb = Image.open(path).convert("RGB")
image
"""## Florence-2 tasks
Florence-2 is capable of a wide variety of tasks, which are specified to the model as special tokens. Below is an overview of the tasks that Florence-2 can complete.
Note that the term "Natural language" is an umbrella term that includes descriptive language, "categorical" language, and the visual representation of language. Additionally, "region" refers to a general section of an image (bounding quadrilateral), whereas "segment" refers to a particular section of an image as denoted by a border curve.
The tasks are categorized by the additional input needed beyond the image being analyzed: either no additional input, a bounding box specifying a region, or a textual input.
## No additional input
| Task category | Output | Description | Associated Task Tokens |
|-----------------------|------------------------------|---------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|
| Captioning | Natural Language | Generates a natural language descriptive caption for an entire image | CAPTION, DETAILED_CAPTION, MORE_DETAILED_CAPTION |
| Character Recognition | Natural Language (+ Regions) | Detects and returns text that is contained in an image, potentially also returning bounding boxes for the detected text | OCR, OCR_WITH_REGION |
| Object Detection | Regions (+ Natural Language) | Proposes bounding boxes for salient objects, optionally with either categorical or descriptive labels | REGION_PROPOSAL, OD, DENSE_REGION_CAPTION |
## Region Input
| Task category | Output | Description | Associated Task Tokens |
|------------------|------------------|----------------------------------------------------------------------------------------|--------------------------------------------|
| Segmentation | Segments | Detects segments for provided regions (i.e. border detection) | REGION_TO_SEGMENTATION |
| Description | Natural Language | Provides either a categorical or descriptive annotation for the contents of a region | REGION_TO_CATEGORY, REGION_TO_DESCRIPTION |
## Language input
| Task category | Output | Description | Associated Task Tokens |
|-------------------------|----------|---------------------------------------------------------------------------------------------------------------------------|-----------------------------------|
| Phrase Grounding | Regions | Detects regions for salient phrases in a description of an image (one-to-many) | PHRASE_GROUNDING |
| Vocabulary Detection | Regions | Detects regions for specific objects (mix of phrase grounding and region-proposal), can also detect text | OPEN_VOCABULARY_DETECTION |
| Vocabulary Segmentation | Segments | Detects segments for objects referred to by a particular expression (effectively segment version of vocabulary detection) | REFERRING_EXPRESSION_SEGMENTATION |
Below we'll see how to run Florence-2 for all of these tasks in the order they appear in this list.
### No additional input
The tasks in this section require no additional input beyond the image.
#### Captioning
Captioning generates a descriptive natural language annotation for the image as a whole. Florence-2 accommodates 3 levels of linguistic granularity for captioning.
"""
tasks = [utils.TaskType.CAPTION,
         utils.TaskType.DETAILED_CAPTION,
         utils.TaskType.MORE_DETAILED_CAPTION]

for task in tasks:
    results = utils.run_example(task, image_rgb)
    print(f'{task.value}{results[task]}')
"""#### Character recognition
Character recognition (OCR) tasks detect the text present in images, and optionally return bounding boxes for the identified text.
See also [vocab detection](#scrollTo=ouid-bVL53hd&line=1&uniqifier=1) to actively search for the presence of a specific word or phrase.
"""
task = utils.TaskType.OCR
results = utils.run_example(task, image_rgb)
print('Detected Text: ', results[task])
task = utils.TaskType.OCR_WITH_REGION
results = utils.run_example(task, image_rgb)
# Boxes drawn directly to image, so copy to avoid adulterating image for later tasks
image_copy = copy.deepcopy(image)
utils.draw_ocr_bboxes(image_copy, results[task])
"""#### Object detection
Object detection automatically detects the salient objects in an image. Florence-2 supports 3 levels of semantic granularity:
1. None (bounding boxes only)
2. Categorical labels
3. Descriptive labels
"""
tasks = [utils.TaskType.REGION_PROPOSAL,
         utils.TaskType.OBJECT_DETECTION,
         utils.TaskType.DENSE_REGION_CAPTION]

for task in tasks:
    results = utils.run_example(task, image_rgb)
    print(task.value)
    utils.plot_bbox(results[task], image)
"""### Region input
The tasks in this section require a bounding box input in addition to the image.
#### Segmentation
Segmentation converts a specified region into a segment (i.e. a bounding border).
"""
top_left = [702, 575]
bottom_right = [866, 772]

task_prompt = utils.TaskType.REG_TO_SEG

# converts coordinates to Florence-2's location-token format,
# e.g. "<loc_702><loc_575><loc_866><loc_772>"
text_input = utils.convert_relative_to_loc(top_left + bottom_right)
results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
bbox_coords = utils.convert_relative_to_bbox(top_left + bottom_right, image)
box = {'bboxes': [bbox_coords], 'labels': ['']}
# draw input bounding box and output segment
image_copy = copy.deepcopy(image)
image_copy = utils.draw_polygons(image_copy, results[task_prompt], fill_mask=True)
utils.plot_bbox(box, image_copy)
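"""For context, here is a rough sketch of what the two coordinate helpers above presumably do. It assumes Florence-2's convention of coordinates quantized to a 0-999 grid and expressed as `<loc_N>` tokens; the actual implementations in `utils.py` may differ, and the function names below are hypothetical."""
def relative_to_loc_tokens(coords):
    # [x1, y1, x2, y2] on the 0-999 grid -> "<loc_x1><loc_y1><loc_x2><loc_y2>"
    return ''.join(f'<loc_{int(c)}>' for c in coords)

def relative_to_pixel_bbox(coords, image):
    # Scale 0-999 relative coordinates back to pixel coordinates for plotting
    # (the exact scale factor may differ slightly in utils.py).
    w, h = image.size
    x1, y1, x2, y2 = coords
    return [x1 * w / 1000, y1 * h / 1000, x2 * w / 1000, y2 * h / 1000]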
"""#### Region description
Region description describes a region with either a categorical or a descriptive annotation.
"""
top_left = [52, 332]
bottom_right = [932, 774]
text_input = utils.convert_relative_to_loc(top_left + bottom_right)
bbox = utils.convert_relative_to_bbox(top_left + bottom_right, image)
for task_prompt in [utils.TaskType.REGION_TO_CATEGORY, utils.TaskType.REGION_TO_DESCRIPTION]:
    results = utils.run_example(task_prompt, image_rgb, text_input=text_input)

    text_result = results[task_prompt].strip().split('<')[0]
    box = {'bboxes': [bbox], 'labels': [text_result]}
    utils.plot_bbox(box, image)
"""### Language input
The tasks in this section require a textual input in addition to the image.
#### Phrase grounding
Given a descriptive annotation of the image (usually covering the image in its entirety, i.e. a caption), phrase grounding identifies salient phrases in the annotation and assigns a bounding box to each one.
"""
task_prompt = utils.TaskType.PHRASE_GROUNDING
results = utils.run_example(task_prompt, image_rgb, text_input="A green car parked in front of a yellow building.")
utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)
"""#### Vocab detection
Vocabulary detection is like the inverse of [region description](#scrollTo=9a1ffcf4-4404-4b6d-aee8-6c055da51301&line=3&uniqifier=1) - rather than supplying a region and getting a description, you supply a description and receive the associated region.
"""
task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
results = utils.run_example(task_prompt, image_rgb, text_input="a turqoise car")
bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
utils.plot_bbox(bbox_results, image)
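"""Here, `utils.convert_to_od_format` remaps the open-vocabulary output into the `{'bboxes': ..., 'labels': ...}` layout that `plot_bbox` expects. A minimal sketch, assuming the open-vocabulary results use the `bboxes`/`bboxes_labels` keys described on the Florence-2 model card (the helper name below is hypothetical):"""
def to_od_format(ovd_results):
    # Remap open-vocabulary detection output to the object-detection layout
    # ({'bboxes': [...], 'labels': [...]}) used by the plotting helper.
    return {
        'bboxes': ovd_results.get('bboxes', []),
        'labels': ovd_results.get('bboxes_labels', []),
    }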
"""Vocabulary detection also works with visual language (i.e. text that is in the image)
See also [character recognition](#scrollTo=1dee1c41-0e99-41e8-b647-373cf84aa6a7&line=5&uniqifier=1) to detect all text without searching for a specified word/phrase.
"""
task_prompt = utils.TaskType.OPEN_VOCAB_DETECTION
results = utils.run_example(task_prompt, image_rgb, text_input="stop")
bbox_results = utils.convert_to_od_format(results[utils.TaskType.OPEN_VOCAB_DETECTION])
utils.plot_bbox(bbox_results, image)
"""#### Vocab segmentation
Vocabulary segmentation is like vocabulary detection, except that it returns segments rather than regions and cannot identify visual text.
"""
task_prompt = utils.TaskType.RES
results = utils.run_example(task_prompt, image_rgb, text_input="a stop sign")
image_copy = copy.deepcopy(image)
utils.draw_polygons(image_copy, results[utils.TaskType.RES], fill_mask=True)
"""## Cascaded tasks
### Detailed Caption + Phrase Grounding + Segmentations
Note that we can cascade tasks together. For example, given only an image, we can use captioning to generate a description of the entire image, and then phrase grounding to associate the different objects in the image with the caption.
"""
# Get a caption
task_prompt = utils.TaskType.CAPTION
results = utils.run_example(task_prompt, image_rgb)
# Use the output as the input into the next task (phrase grounding)
text_input = results[task_prompt]
task_prompt = utils.TaskType.PHRASE_GROUNDING
results = utils.run_example(task_prompt, image_rgb, text_input)
results[utils.TaskType.DETAILED_CAPTION] = text_input
print(text_input)
utils.plot_bbox(results[utils.TaskType.PHRASE_GROUNDING], image)
"""We can go a step further and perform region segmentation on the regions identified by the phrase grounding task (whose input was, in turn, supplied by the output of the caption task)"""
polygons = []
task_prompt = utils.TaskType.REG_TO_SEG

# Run region-to-segmentation for each region identified by phrase grounding
for box in results[utils.TaskType.PHRASE_GROUNDING.value]['bboxes']:
    box = utils.convert_bbox_to_relative(box, image)
    text_input = utils.convert_relative_to_loc(box)
    run_results = utils.run_example(task_prompt, image_rgb, text_input=text_input)
    polygons += run_results[task_prompt]['polygons']
# Commented out IPython magic to ensure Python compatibility.
# %%script echo Disabled
#
# # Construct the required labels list and create the necessary input dict
# labels = []
# for polygon in polygons:
#     l = []
#     for idx, polygon_ in enumerate(polygon):
#         l.append('')
#     labels.append(l)
#
# seg_results = dict(polygons=polygons, labels=labels)
# Construct the required labels list and create the necessary input dict
labels = []
for polygon in polygons:
    # Append a single descriptive label for the entire polygon
    # (rather than a list of empty strings as in the disabled cell above).
    labels.append(f"Object {len(labels) + 1}")

seg_results = dict(polygons=polygons, labels=labels)
# draw the output
image_copy = copy.deepcopy(image)
utils.draw_polygons(image_copy, seg_results, fill_mask=True)
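"""The whole cascade can also be wrapped into a single helper. Below is a minimal sketch assuming the `utils` helpers behave as used above and that `run_example` keys its output by the task passed in; `caption_ground_segment` is a hypothetical name."""
def caption_ground_segment(image_rgb, image):
    # 1. Caption the image
    caption_task = utils.TaskType.CAPTION
    caption = utils.run_example(caption_task, image_rgb)[caption_task]

    # 2. Ground phrases from the caption to regions
    grounding_task = utils.TaskType.PHRASE_GROUNDING
    grounded = utils.run_example(grounding_task, image_rgb, text_input=caption)

    # 3. Segment each grounded region
    seg_task = utils.TaskType.REG_TO_SEG
    seg_polygons = []
    for bbox in grounded[grounding_task]['bboxes']:
        loc_tokens = utils.convert_relative_to_loc(utils.convert_bbox_to_relative(bbox, image))
        seg = utils.run_example(seg_task, image_rgb, text_input=loc_tokens)
        seg_polygons += seg[seg_task]['polygons']

    seg_labels = [f"Object {i + 1}" for i in range(len(seg_polygons))]
    return caption, dict(polygons=seg_polygons, labels=seg_labels)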
"""# That's it!
Once you're finished with this Colab, head back to the [Final words](https://www.assemblyai.com/blog/florence-2-how-it-works-how-to-use#final-words) of the associated article, or check out our YouTube channel for additional learning resources:
https://www.youtube.com/c/AssemblyAI
"""