Merge pull request #63 from hunterheiden/hsh/new_task/screenspot
New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens
Showing 9 changed files with 458 additions and 0 deletions.
@@ -0,0 +1,54 @@

# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).
## Groups

- `screenspot`: This group bundles both the original grounding task and the new instruction generation task.
## Tasks

- `screenspot_rec_test`: the original evaluation of `{img} {instruction} --> {bounding box}`, called grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation of `{img} {bounding box} --> {instruction}`, called instruction generation or Referring Expression Generation (REG).
### REC Metrics

REC/grounding requires the model to output a bounding box for the target element in the image. The evaluation metrics are:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: we use `IoU` to build `ACC@IoU` metrics at different IoU thresholds; an output whose IoU exceeds the threshold is counted as correct.
- `CENTER ACC`: the predicted bounding box is considered correct if its center lies inside the ground-truth bounding box. This is the metric reported in the paper.
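
For reference, here is a minimal, self-contained sketch of these three metrics for `[x1, y1, x2, y2]` boxes; it is an illustration only, not the `utils_rec` implementation shipped in this PR.

```python
def iou(pred, gt):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gt = max(0.0, gt[2] - gt[0]) * max(0.0, gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold=0.5):
    """ACC@IoU: 1.0 if the predicted box clears the IoU threshold, else 0.0."""
    return 1.0 if iou(pred, gt) >= threshold else 0.0


def center_acc(pred, gt):
    """CENTER ACC: 1.0 if the center of the predicted box falls inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    return 1.0 if gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3] else 0.0
```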
### REG Metrics

REG/generation requires the model to output the instruction that describes the target element in the image; currently, this element is highlighted in red in the image. The evaluation metric is:
- `CIDEr`: used to evaluate the quality of the generated instruction. Since the paper does not consider this task, we selected CIDEr as the standard for judging generated instructions; this matches what other works, such as ScreenAI, have done for instruction generation on RICO datasets.
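
As a quick illustration of the metric itself (not the task code, which appears later in this diff), the `Cider` scorer from `pycocoevalcap` can be called directly on reference and candidate instructions keyed by example id; the ids and strings below are made up for the example.

```python
from pycocoevalcap.eval import Cider

# Toy references (gts) and model outputs (res), keyed by example id.
gts = {"ex0": ["click the search button"], "ex1": ["open the settings menu"]}
res = {"ex0": ["click the search button"], "ex1": ["tap the gear icon"]}

corpus_score, per_example_scores = Cider().compute_score(gts, res)
print(corpus_score, per_example_scores)
```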
## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7b performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `ACC@0.1`: 0.195
- `ACC@0.3`: 0.042
- `ACC@0.5`: 0.006
- `ACC@0.7`: 0.000
- `ACC@0.9`: 0.000
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097
## References

- arXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
      title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
      author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
      year={2024},
      eprint={2401.10935},
      archivePrefix={arXiv},
      primaryClass={cs.HC}
}
```
@@ -0,0 +1,33 @@

```yaml
dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils_rec.screenspot_rec_doc_to_visual
doc_to_text: !function utils_rec.screenspot_rec_doc_to_text
doc_to_target: "bbox"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils_rec.screenspot_rec_process_result
metric_list:
  - metric: screenspot_IoU
    aggregation: !function utils_rec.screenspot_rec_iou
    higher_is_better: true
  - metric: screenspot_ACC@0.1
    aggregation: !function utils_rec.screenspot_rec_acc01
    higher_is_better: true
  - metric: screenspot_ACC@0.3
    aggregation: !function utils_rec.screenspot_rec_acc03
    higher_is_better: true
  - metric: screenspot_ACC@0.5
    aggregation: !function utils_rec.screenspot_rec_acc05
    higher_is_better: true
  - metric: screenspot_ACC@0.7
    aggregation: !function utils_rec.screenspot_rec_acc07
    higher_is_better: true
  - metric: screenspot_ACC@0.9
    aggregation: !function utils_rec.screenspot_rec_acc09
    higher_is_better: true
  - metric: screenspot_Center_ACC
    aggregation: !function utils_rec.screenspot_rec_center_acc
    higher_is_better: true
metadata:
  version: '0.0'
```
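
The `!function utils_rec.*` hooks above refer to a `utils_rec.py` that is part of this PR but not reproduced in this excerpt. As a purely hypothetical sketch of how the per-metric aggregators could look, assuming each processed result stores its IoU and a center-hit flag (the real helpers may differ):

```python
# Hypothetical sketch only -- the actual utils_rec.py in this PR may differ.
# Assumes the REC process_results hook stored {"iou": float, "center_hit": bool} per example.

def _mean(values):
    return sum(values) / len(values) if values else 0.0


def screenspot_rec_iou(results):
    return _mean([r["iou"] for r in results])


def _acc_at(results, threshold):
    return _mean([1.0 if r["iou"] >= threshold else 0.0 for r in results])


def screenspot_rec_acc05(results):
    # ACC@0.5; the other thresholds (0.1, 0.3, 0.7, 0.9) follow the same pattern.
    return _acc_at(results, 0.5)


def screenspot_rec_center_acc(results):
    return _mean([1.0 if r["center_hit"] else 0.0 for r in results])
```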
@@ -0,0 +1,15 @@

```yaml
dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils.screenspot_bbox_doc_to_visual
doc_to_text: !function utils.screenspot_doc_to_text
doc_to_target: "instruction"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils.screenspot_process_result
metric_list:
  - metric: screenspot_CIDEr
    aggregation: !function utils.screenspot_cider
    higher_is_better: true
metadata:
  version: '0.0'
```
@@ -0,0 +1,4 @@

```yaml
group: screenspot
task:
  - screenspot_reg_test
  - screenspot_rec_test
```
@@ -0,0 +1,4 @@

```yaml
group: screenspot_rec
task: screenspot_rec_test
include: _default_template_rec_yaml
test_split: test
```
@@ -0,0 +1,4 @@

```yaml
group: screenspot_reg
task: screenspot_reg_test
include: _default_template_reg_yaml
test_split: test
```
@@ -0,0 +1,126 @@

```python
import logging

from PIL import ImageDraw
from pycocoevalcap.eval import COCOEvalCap, Bleu, Meteor, Rouge, Cider, Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocotools.coco import COCO

# COCO_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1", "METEOR", "ROUGE_L", "CIDEr"]  # , "SPICE"]
COCO_METRICS = ["CIDEr"]

eval_logger = logging.getLogger("lmms-eval")


def screenspot_bbox_doc_to_visual(doc):
    bbox = doc["bbox"]
    image = doc["image"].convert("RGB")
    draw = ImageDraw.Draw(image)
    bbox_xy = [bbox[0], bbox[1], bbox[2], bbox[3]]
    draw.rectangle(bbox_xy, outline="red", width=3)
    return [image.convert("RGB")]


def screenspot_process_result(doc, result):
    """
    Args:
        doc: an instance of the eval dataset
        result: [pred]
    Returns:
        a dictionary with key: metric name (e.g. screenspot_CIDEr), value: metric value
    """
    pred = result[0] if len(result) > 0 else ""
    ann_id = doc["file_name"]
    data_dict = {"instruction": doc["instruction"], "pred": pred, "ann_id": ann_id, "data_type": doc["data_type"], "data_source": doc["data_source"]}
    return {f"screenspot_{metric}": data_dict for metric in COCO_METRICS}


def screenspot_doc_to_text(doc):
    return f"Direct a user to interact with the highlighted region [{doc['bbox'][0]:.2f}, {doc['bbox'][1]:.2f}, {doc['bbox'][2]:.2f}, {doc['bbox'][3]:.2f}]."


def screenspot_aggregation_result(results, metric):
    # scorers = [(Bleu(4), "Bleu_1"), (Bleu(4), "Bleu_2"), (Bleu(4), "Bleu_3"), (Bleu(4), "Bleu_4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"), (Cider(), "CIDEr"), (Spice(), "SPICE")]
    scorers = [(Cider(), "CIDEr")]
    scorers_dict = {s[1]: s for s in scorers}

    stored_results = []
    # For the COCO eval tools to build their index, the dataset needs two lists:
    # 'annotations', which reproduces the original annotations, and
    # 'images', which only needs the image ids.
    dataset = {"annotations": [], "images": []}
    idx = 0
    ann_id = 0
    for result in results:
        stored_results.append({"image_id": idx, "caption": result["pred"]})
        # for s in result["answer"]:
        dataset["annotations"].append({"image_id": idx, "caption": result["instruction"], "id": ann_id})
        ann_id += 1

        dataset["images"].append({"id": idx})
        idx += 1

    coco = COCO()
    # Manually create the index here
    coco.dataset = dataset
    coco.createIndex()

    coco_result = coco.loadRes(stored_results)
    coco_eval = COCOEvalCap(coco, coco_result)

    imgIds = coco_eval.params["image_id"]
    gts = {}
    res = {}
    for imgId in imgIds:
        gts[imgId] = coco_eval.coco.imgToAnns[imgId]
        res[imgId] = coco_eval.cocoRes.imgToAnns[imgId]

    eval_logger.info("tokenization...")
    tokenizer = PTBTokenizer()
    gts = tokenizer.tokenize(gts)
    res = tokenizer.tokenize(res)

    eval_logger.info(f"Computing {metric} scores...")

    score, scores = scorers_dict[metric][0].compute_score(gts, res)
    # coco_eval.setEval(score, metric)

    # When metric is one of the Bleu variants, score is a list of the Bleu_1..Bleu_n scores.
    if isinstance(score, list):
        n = int(metric.split("_")[-1])
        score = score[n - 1]

    return score


def screenspot_bleu4(results):
    return screenspot_aggregation_result(results, "Bleu_4")


def screenspot_bleu3(results):
    return screenspot_aggregation_result(results, "Bleu_3")


def screenspot_bleu2(results):
    return screenspot_aggregation_result(results, "Bleu_2")


def screenspot_bleu1(results):
    return screenspot_aggregation_result(results, "Bleu_1")


def screenspot_meteor(results):
    return screenspot_aggregation_result(results, "METEOR")


def screenspot_rougel(results):
    return screenspot_aggregation_result(results, "ROUGE_L")


def screenspot_cider(results):
    return screenspot_aggregation_result(results, "CIDEr")


def screenspot_spice(results):
    return screenspot_aggregation_result(results, "SPICE")
```
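
As a usage sketch (assuming `pycocoevalcap` and the Java runtime its PTB tokenizer needs are installed), the REG aggregation above can be exercised with hand-built result dicts in the shape that `screenspot_process_result` emits; the values below are made up:

```python
# Made-up per-example results, shaped like the dicts screenspot_process_result returns.
fake_results = [
    {"instruction": "click the search button", "pred": "click the search button",
     "ann_id": "a.png", "data_type": "icon", "data_source": "web"},
    {"instruction": "open the settings menu", "pred": "tap settings",
     "ann_id": "b.png", "data_type": "text", "data_source": "macos"},
]

print(screenspot_cider(fake_results))  # corpus-level CIDEr over the two examples
```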