Merge pull request #63 from hunterheiden/hsh/new_task/screenspot
New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens
Luodian authored Apr 26, 2024
2 parents d8a3a99 + 319afcc commit c4e9dd9
Showing 9 changed files with 458 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -242,6 +242,9 @@ We also provide the raw data exported from Weights & Biases for the detailed results
- ScienceQA (scienceqa_full)
- ScienceQA Full (scienceqa)
- ScienceQA IMG (scienceqa_img)
- ScreenSpot (screenspot)
- ScreenSpot REC / Grounding (screenspot_rec)
- ScreenSpot REG / Instruction Generation (screenspot_reg)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- ST-VQA (stvqa)
54 changes: 54 additions & 0 deletions lmms_eval/tasks/screenspot/README.md
@@ -0,0 +1,54 @@
# ScreenSpot

## GUI Grounding Benchmark: ScreenSpot

ScreenSpot is an evaluation benchmark for GUI grounding, comprising over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, along with annotated element types (Text or Icon/Widget).
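
For a quick look at the records the tasks consume, the dataset can be loaded from the path used by the task configs in this PR. Below is a minimal inspection sketch; the `rootsautomation/ScreenSpot` path and `test` split come from the YAML files, and the field names are the ones read by `utils.py`:

```python
from datasets import load_dataset

# Dataset path and split as configured in the task YAMLs of this PR.
ds = load_dataset("rootsautomation/ScreenSpot", split="test")

sample = ds[0]
# Fields the task code reads: image (PIL.Image), instruction (str), bbox,
# file_name, data_type (text vs. icon/widget), and data_source (platform).
print(sample["instruction"])
print(sample["bbox"], sample["data_type"], sample["data_source"])
```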


## Groups

- `screenspot`: This group bundles both the original grounding task and the new instruction generation task.

## Tasks
- `screenspot_rec_test`: the original evaluation, `{img} {instruction} --> {bounding box}`, known as grounding or Referring Expression Comprehension (REC);
- `screenspot_reg_test`: the new evaluation, `{img} {bounding box} --> {instruction}`, known as instruction generation or Referring Expression Generation (REG). Both directions are sketched below.
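
A minimal sketch of the two directions on a toy record, using the helpers this PR ships in `utils.py` (the record below is hypothetical; real ones come from the dataset, and the REC-side helpers live in `utils_rec.py`, which the REC config references but this diff does not include):

```python
from PIL import Image

from lmms_eval.tasks.screenspot.utils import (
    screenspot_bbox_doc_to_visual,  # draws the target bbox in red (REG input)
    screenspot_doc_to_text,         # builds the REG prompt from the bbox
)

# Hypothetical record mirroring the fields the task code reads.
doc = {
    "image": Image.new("RGB", (200, 100), "white"),
    "instruction": "open the settings menu",
    "bbox": [120, 20, 160, 50],
    "file_name": "example.png",
    "data_type": "icon",
    "data_source": "web",
}

# REG direction: {img} {bounding box} --> {instruction}.
visuals = screenspot_bbox_doc_to_visual(doc)  # [image with the box drawn in red]
prompt = screenspot_doc_to_text(doc)          # "Direct a user to interact with ..."
print(prompt)

# The REC direction ({img} {instruction} --> {bounding box}) goes through the
# analogous helpers in utils_rec.py.
```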

### REC Metrics

REC/Grounding requires the model to output a bounding box for the target element in the image. The evaluation metrics, sketched in code after this list, are:
- `IoU`: Intersection over Union (IoU) between the predicted bounding box and the ground-truth bounding box.
- `ACC@IoU`: accuracy at a given IoU threshold; an output whose IoU exceeds the threshold is counted as correct.
- `CENTER ACC`: the predicted bounding box is counted as correct if its center lies within the ground-truth bounding box. This is the metric reported in the paper.
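
The REC scoring itself lives in `utils_rec.py`, which the REC config references but this diff does not include; as a point of reference, the three metrics reduce to roughly the following for a single example, assuming corner-format `(x1, y1, x2, y2)` boxes:

```python
def iou(pred, gt):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (pred[2] - pred[0]) * (pred[3] - pred[1]) + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(pred, gt, threshold):
    """ACC@IoU: correct if the IoU clears the threshold."""
    return float(iou(pred, gt) >= threshold)


def center_acc(pred, gt):
    """CENTER ACC: correct if the predicted box's center lies inside the GT box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return float(gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3])


pred, gt = [10, 10, 50, 50], [20, 20, 60, 60]
print(iou(pred, gt))              # ~0.391
print(acc_at_iou(pred, gt, 0.3))  # 1.0
print(acc_at_iou(pred, gt, 0.5))  # 0.0
print(center_acc(pred, gt))       # 1.0
```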

### REG Metrics

REG/Generation requires the model to output the instruction that describes the target element in the image; currently, this element is highlighted in red in the image. The evaluation metric, sketched in code after this list, is:
- `CIDEr`: used to evaluate the quality of the generated instruction. Since the paper does not consider this task, we selected CIDEr as the standard for judging generated instructions; this matches what other work, such as ScreenAI, has done for instruction generation on the RICO datasets.
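
For reference, the CIDEr score comes from the `pycocoevalcap` implementation; here is a minimal sketch on toy, pre-tokenized captions (the task's real aggregation routes predictions through the COCO evaluation tooling and `PTBTokenizer`, as `utils.py` below shows):

```python
from pycocoevalcap.cider.cider import Cider

# Toy example: ids map to lists of reference / candidate captions.
# In the task itself, captions are first normalized with PTBTokenizer.
refs = {0: ["open the settings menu"], 1: ["click the search icon"]}
preds = {0: ["open the settings menu"], 1: ["press the magnifier button"]}

score, per_example = Cider().compute_score(refs, preds)
print(score)        # corpus-level CIDEr
print(per_example)  # per-example scores
```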

## Baseline Scores

As a baseline, here is how LLaVA-v1.5-7B performs on the ScreenSpot dataset:
- `IoU`: 0.051
- `ACC@0.1`: 0.195
- `ACC@0.3`: 0.042
- `ACC@0.5`: 0.006
- `ACC@0.7`: 0.000
- `ACC@0.9`: 0.000
- `CENTER ACC`: 0.097
- `CIDEr`: 0.097

## References

- arXiv: [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents](https://arxiv.org/abs/2401.10935)
- GitHub: [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)

```bibtex
@misc{cheng2024seeclick,
title={SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents},
author={Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu},
year={2024},
eprint={2401.10935},
archivePrefix={arXiv},
primaryClass={cs.HC}
}
```
33 changes: 33 additions & 0 deletions lmms_eval/tasks/screenspot/_default_template_rec_yaml
@@ -0,0 +1,33 @@
dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils_rec.screenspot_rec_doc_to_visual
doc_to_text: !function utils_rec.screenspot_rec_doc_to_text
doc_to_target: "bbox"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils_rec.screenspot_rec_process_result
metric_list:
  - metric: screenspot_IoU
    aggregation: !function utils_rec.screenspot_rec_iou
    higher_is_better: true
  - metric: screenspot_ACC@0.1
    aggregation: !function utils_rec.screenspot_rec_acc01
    higher_is_better: true
  - metric: screenspot_ACC@0.3
    aggregation: !function utils_rec.screenspot_rec_acc03
    higher_is_better: true
  - metric: screenspot_ACC@0.5
    aggregation: !function utils_rec.screenspot_rec_acc05
    higher_is_better: true
  - metric: screenspot_ACC@0.7
    aggregation: !function utils_rec.screenspot_rec_acc07
    higher_is_better: true
  - metric: screenspot_ACC@0.9
    aggregation: !function utils_rec.screenspot_rec_acc09
    higher_is_better: true
  - metric: screenspot_Center_ACC
    aggregation: !function utils_rec.screenspot_rec_center_acc
    higher_is_better: true
metadata:
  version: '0.0'
15 changes: 15 additions & 0 deletions lmms_eval/tasks/screenspot/_default_template_reg_yaml
@@ -0,0 +1,15 @@
dataset_path: rootsautomation/ScreenSpot
output_type: generate_until
doc_to_visual: !function utils.screenspot_bbox_doc_to_visual
doc_to_text: !function utils.screenspot_doc_to_text
doc_to_target: "instruction"
generation_kwargs:
  until:
    - "ASSISTANT:"
process_results: !function utils.screenspot_process_result
metric_list:
  - metric: screenspot_CIDEr
    aggregation: !function utils.screenspot_cider
    higher_is_better: true
metadata:
  version: '0.0'
4 changes: 4 additions & 0 deletions lmms_eval/tasks/screenspot/_screenspot.yaml
@@ -0,0 +1,4 @@
group: screenspot
task:
- screenspot_reg_test
- screenspot_rec_test
4 changes: 4 additions & 0 deletions lmms_eval/tasks/screenspot/screenspot_rec_test.yaml
@@ -0,0 +1,4 @@
group: screenspot_rec
task: screenspot_rec_test
include: _default_template_rec_yaml
test_split: test
4 changes: 4 additions & 0 deletions lmms_eval/tasks/screenspot/screenspot_reg_test.yaml
@@ -0,0 +1,4 @@
group: screenspot_reg
task: screenspot_reg_test
include: _default_template_reg_yaml
test_split: test
126 changes: 126 additions & 0 deletions lmms_eval/tasks/screenspot/utils.py
@@ -0,0 +1,126 @@
from PIL import ImageDraw
from pycocoevalcap.eval import COCOEvalCap, Bleu, Meteor, Rouge, Cider, Spice
from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
from pycocotools.coco import COCO

# COCO_METRICS = ["Bleu_4", "Bleu_3", "Bleu_2", "Bleu_1", "METEOR", "ROUGE_L", "CIDEr"] # , "SPICE"]
COCO_METRICS = ["CIDEr"]

import logging

eval_logger = logging.getLogger("lmms-eval")


def screenspot_bbox_doc_to_visual(doc):
    # Highlight the ground-truth element in red so the model can describe it (REG input).
    bbox = doc["bbox"]
    image = doc["image"].convert("RGB")
    draw = ImageDraw.Draw(image)
    bbox_xy = [bbox[0], bbox[1], bbox[2], bbox[3]]
    draw.rectangle(bbox_xy, outline="red", width=3)
    return [image.convert("RGB")]


def screenspot_process_result(doc, result):
    """
    Args:
        doc: an instance of the eval dataset
        result: [pred], the model's generated instruction
    Returns:
        a dictionary with key: metric name (e.g. screenspot_CIDEr), value: the per-example payload consumed at aggregation time
    """
    pred = result[0] if len(result) > 0 else ""
    ann_id = doc["file_name"]
    data_dict = {"instruction": doc["instruction"], "pred": pred, "ann_id": ann_id, "data_type": doc["data_type"], "data_source": doc["data_source"]}
    return {f"screenspot_{metric}": data_dict for metric in COCO_METRICS}


def screenspot_doc_to_text(doc):
    return f"Direct a user to interact with the highlighted region [{doc['bbox'][0]:.2f}, {doc['bbox'][1]:.2f}, {doc['bbox'][2]:.2f}, {doc['bbox'][3]:.2f}]."


def screenspot_aggregation_result(results, metric):
    # scorers = [(Bleu(4), "Bleu_1"), (Bleu(4), "Bleu_2"), (Bleu(4), "Bleu_3"), (Bleu(4), "Bleu_4"), (Meteor(), "METEOR"), (Rouge(), "ROUGE_L"), (Cider(), "CIDEr"), (Spice(), "SPICE")]
    scorers = [(Cider(), "CIDEr")]
    scorers_dict = {s[1]: s for s in scorers}

    stored_results = []
    # For the COCO eval tools to build an index, the dataset dict needs two keys:
    # 'annotations', which reproduces the original annotations, and
    # 'images', which only needs the image ids.
    dataset = {"annotations": [], "images": []}
    idx = 0
    ann_id = 0
    for result in results:
        stored_results.append({"image_id": idx, "caption": result["pred"]})
        dataset["annotations"].append({"image_id": idx, "caption": result["instruction"], "id": ann_id})
        ann_id += 1

        dataset["images"].append({"id": idx})
        idx += 1

    coco = COCO()
    # Manually create the index here
    coco.dataset = dataset
    coco.createIndex()

    coco_result = coco.loadRes(stored_results)
    coco_eval = COCOEvalCap(coco, coco_result)

    imgIds = coco_eval.params["image_id"]
    gts = {}
    res = {}
    for imgId in imgIds:
        gts[imgId] = coco_eval.coco.imgToAnns[imgId]
        res[imgId] = coco_eval.cocoRes.imgToAnns[imgId]

    eval_logger.info("tokenization...")
    tokenizer = PTBTokenizer()
    gts = tokenizer.tokenize(gts)
    res = tokenizer.tokenize(res)

    eval_logger.info(f"Computing {metric} scores...")

    score, scores = scorers_dict[metric][0].compute_score(gts, res)
    # coco_eval.setEval(score, metric)

    # When the metric is one of the BLEU variants, compute_score returns a list.
    if isinstance(score, list):
        n = int(metric.split("_")[-1])
        score = score[n - 1]

    return score


def screenspot_bleu4(results):
    return screenspot_aggregation_result(results, "Bleu_4")


def screenspot_bleu3(results):
    return screenspot_aggregation_result(results, "Bleu_3")


def screenspot_bleu2(results):
    return screenspot_aggregation_result(results, "Bleu_2")


def screenspot_bleu1(results):
    return screenspot_aggregation_result(results, "Bleu_1")


def screenspot_meteor(results):
    return screenspot_aggregation_result(results, "METEOR")


def screenspot_rougel(results):
    return screenspot_aggregation_result(results, "ROUGE_L")


def screenspot_cider(results):
    return screenspot_aggregation_result(results, "CIDEr")


def screenspot_spice(results):
    return screenspot_aggregation_result(results, "SPICE")