The exact behavior is controlled by prompts.

```python
from PIL import Image
from uform.gen_model import VLMForCausalLM, VLMProcessor

model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen")
processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen")

# [cap] Narrate the contents of the image with precision.
# [cap] Summarize the visual content of the image.
# [vqa] What is the main subject of the image?
prompt = "[cap] Summarize the visual content of the image."
image = Image.open("zebra.jpg")
```
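
To complete the snippet, here is a minimal generation sketch, assuming the standard `transformers`-style `generate` and `batch_decode` interface; the decoding parameters are illustrative defaults, not pinned values:

```python
import torch

inputs = processor(texts=[prompt], images=[image], return_tensors="pt")

with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,      # greedy decoding for stable captions
        max_new_tokens=128,   # illustrative cap on caption length
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
caption = processor.batch_decode(output[:, prompt_len:], skip_special_tokens=True)[0]
print(caption)
```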

Evaluating the `uform-vl-english` model, one can expect the following numbers for search quality.

| Dataset | Recall @ 1 | Recall @ 5 | Recall @ 10 |
| :-------- | ---------: | ---------: | ----------: |
| Flickr | 0.727 | 0.915 | 0.949 |
| MS-COCO¹  | 0.510 | 0.761 | 0.838 |


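As a rough reference for how these retrieval numbers are computed, here is a minimal Recall@K sketch over a text-to-image similarity matrix; it assumes the conventional protocol where query `i` matches image `i`, and is not the exact evaluation script used for the table above:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    # similarity[i, j]: score between text query i and image j;
    # the ground-truth image for query i is assumed to be image i.
    top_k = np.argsort(-similarity, axis=1)[:, :k]  # top-k images per query
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

# Usage: scores = text_embeddings @ image_embeddings.T
# recall_at_k(scores, k=1), recall_at_k(scores, k=5), recall_at_k(scores, k=10)
```
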
For multilingual benchmarks, we've created the [`unum-cloud/coco-sm`](https://github.com/unum-cloud/coco-sm) repository².
Evaluating the `unum-cloud/uform-vl-multilingual-v2` model, one can expect the following metrics for text-to-image search, compared against the `xlm-roberta-base-ViT-B-32` [OpenCLIP](https://github.com/mlfoundations/open_clip) model.

| Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :-------- | -----------: | ---------: | -----------: | ---------: | ------------: | ---------: | -------: |
| Meta NLLB | 24.9±6.7 | __32.4±3.5__ | 47.5±10.3 | __58.9±4.5__ | 58.2±11.2 | __70.2±4.3__ | - |

</details>

### Generative Models

For captioning evaluation, we measure CLIPScore and RefCLIPScore³ (a sketch of both metrics follows the table).

| Model | Size | Caption Length | CLIPScore | RefCLIPScore |
| :---------------------------------- | ---: | -------------: | --------: | -----------: |
| `llava-hf/llava-1.5-7b-hf` | 7B | Long | 0.878 | 0.529 |
| `llava-hf/llava-1.5-7b-hf` | 7B | Short | 0.886 | 0.531 |
| |
| `Salesforce/instructblip-vicuna-7b` | 7B | Long | 0.902 | 0.534 |
| `Salesforce/instructblip-vicuna-7b` | 7B | Short | 0.848 | 0.523 |
| |
| `unum-cloud/uform-gen` | 1.5B | Long | 0.847 | 0.523 |
| `unum-cloud/uform-gen` | 1.5B | Short | 0.842 | 0.522 |
| |
| `unum-cloud/uform-gen-chat` | 1.5B | Long | 0.860 | 0.525 |
| `unum-cloud/uform-gen-chat` | 1.5B | Short | 0.858 | 0.525 |

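For reference, here is a minimal sketch of both metrics as defined in the CLIPScore paper (Hessel et al., 2021), assuming embeddings from the CLIP model named in footnote ³; this is the metric definition, not our evaluation code:

```python
import numpy as np

def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    # CLIPScore: w * max(cos(image, caption), 0)
    cos = float(image_emb @ caption_emb /
                (np.linalg.norm(image_emb) * np.linalg.norm(caption_emb)))
    return w * max(cos, 0.0)

def ref_clip_score(image_emb: np.ndarray, caption_emb: np.ndarray,
                   reference_embs: list[np.ndarray]) -> float:
    # RefCLIPScore: harmonic mean of CLIPScore and the best
    # caption-to-reference cosine similarity, clipped at zero.
    cs = clip_score(image_emb, caption_emb)
    ref = max(max(0.0, float(caption_emb @ r /
                             (np.linalg.norm(caption_emb) * np.linalg.norm(r))))
              for r in reference_embs)
    return 2 * cs * ref / (cs + ref) if (cs + ref) > 0 else 0.0
```
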
Results for VQAv2 evaluation; a sketch of the accuracy metric follows the table.

| Model | Size | Accuracy |
| :---------------------------------- | ---: | -------: |
| `llava-hf/llava-1.5-7b-hf` | 7B | 78.5 |
| `Salesforce/instructblip-vicuna-7b` | 7B | - |
| `unum-cloud/uform-gen` | 1.5B | 66.5 |
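
VQAv2 compares a predicted answer against ten human annotations, counting a prediction as fully correct once at least three annotators agree with it. A simplified sketch of that rule follows; the official metric additionally normalizes answers and averages over annotator subsets, which is omitted here:

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    # VQAv2 rule: min(#annotators matching the prediction / 3, 1)
    matches = sum(answer == predicted for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Usage: vqa_accuracy("zebra", ["zebra"] * 4 + ["horse"] * 6) == 1.0
```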

<br/>

> ¹ Train split was in training data. <br/>
> ² Lacking a broad enough evaluation dataset, we translated the [COCO Karpathy test split](https://www.kaggle.com/datasets/shtvkumar/karpathy-splits) with multiple public and proprietary translation services, averaging the scores across all sets, and breaking them down in the bottom section. <br/>
> ³ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.

## Speed
