Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low recall when testing on flickr30k-cn dataset #6

Open
Qiulin-W opened this issue Sep 27, 2021 · 1 comment
Open

Low recall when testing on flickr30k-cn dataset #6

Qiulin-W opened this issue Sep 27, 2021 · 1 comment

Comments

@Qiulin-W
Copy link

Hi, thanks for the great work!

I tested the pretrained model for zero-shot img2text and text2img retrieval on flickr30k-cn validation set. The bboxes are obtained as indicated in https://github.com/chuhaojin/BriVL-BUA-applications. For each image, we only select the one caption with the highest fluency score. However, the recall@1 for the two task is only 15.93% and 13.74%, respectively. The same evaluation for ViLT reaches 73.2% and 55.0%. I'm wondering whether you test on this dataset? Any comments on my results?

p.s. An example json file of the dataset is as follows
{"sentences": [["0", "一个小男孩正在玩呼啦圈。"]], "bbox": [[78, 92, 183, 124], [179, 137, 363, 214], [68, 21, 170, 101], [73, 326, 206, 498], [338, 150, 379, 187], [0, 305, 363, 396], [105, 273, 179, 342], [30, 32, 261, 483], [89, 192, 130, 210], [12, 155, 389, 498], [173, 150, 192, 167], [17, 134, 237, 353], [10, 341, 389, 496], [90, 76, 170, 169], [29, 118, 282, 363], [17, 357, 339, 402], [129, 133, 152, 155], [6, 423, 78, 498], [97, 231, 138, 250], [74, 22, 174, 175], [165, 167, 197, 191], [34, 77, 242, 494], [316, 145, 341, 197], [33, 167, 164, 323], [294, 1, 382, 19], [199, 8, 382, 158], [15, 385, 389, 497], [1, 366, 379, 396], [179, 126, 371, 228], [204, 13, 379, 130], [57, 23, 189, 235], [59, 71, 230, 482], [55, 23, 203, 167], [44, 29, 213, 248], [61, 27, 210, 219], [32, 124, 264, 367], [44, 39, 236, 286], [18, 326, 338, 445], [198, 383, 389, 496], [61, 344, 209, 498], [95, 269, 186, 340], [46, 302, 331, 471], [19, 123, 344, 307], [11, 14, 374, 409], [31, 132, 234, 357], [20, 134, 271, 354], [16, 10, 358, 360], [32, 20, 297, 478], [39, 19, 206, 157], [2, 330, 62, 443], [29, 168, 175, 331], [153, 312, 389, 404], [2, 408, 272, 498], [0, 328, 347, 467], [317, 148, 349, 197], [35, 302, 227, 458], [38, 143, 229, 366], [11, 367, 385, 492], [191, 320, 380, 389], [323, 148, 347, 199], [61, 324, 244, 498], [79, 0, 385, 495], [47, 143, 222, 355], [6, 0, 389, 221], [0, 367, 377, 407], [0, 194, 389, 498], [103, 123, 356, 222], [14, 7, 222, 183], [20, 4, 389, 164], [0, 286, 389, 497], [14, 4, 191, 132], [21, 331, 308, 438], [59, 118, 352, 219], [70, 88, 181, 128], [0, 227, 389, 498], [4, 327, 389, 490], [0, 330, 363, 451], [15, 348, 302, 436], [126, 116, 156, 147], [48, 52, 269, 480], [17, 0, 224, 154], [34, 54, 245, 478], [8, 98, 389, 491], [24, 12, 167, 110], [17, 116, 316, 361], [32, 0, 305, 476], [4, 110, 37, 201], [48, 135, 223, 349], [14, 410, 370, 497], [38, 13, 265, 391], [51, 301, 219, 483], [54, 332, 244, 484], [22, 127, 256, 356], [47, 172, 216, 360], [81, 92, 178, 124], [75, 82, 174, 140], [27, 150, 230, 361], [53, 20, 192, 152], [0, 269, 356, 357], [18, 2, 195, 118]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2954461906.jpg"}
{"sentences": [["0", "妇女们正在喝酒和编织。"]], "bbox": [[74, 113, 383, 271], [451, 159, 499, 273], [6, 20, 75, 106], [5, 16, 114, 277], [0, 7, 481, 251], [434, 195, 454, 221], [353, 34, 478, 264], [217, 8, 320, 161], [287, 127, 317, 209], [376, 15, 439, 72], [28, 260, 84, 277], [163, 12, 245, 154], [333, 163, 465, 269], [115, 152, 196, 195], [147, 3, 179, 78], [440, 49, 499, 185], [293, 182, 321, 211], [198, 136, 237, 180], [241, 8, 291, 58], [325, 139, 344, 178], [394, 126, 411, 149], [2, 205, 320, 277], [1, 70, 93, 197], [210, 125, 228, 156], [123, 95, 141, 152], [146, 0, 499, 65], [162, 6, 324, 152], [167, 50, 237, 131], [16, 167, 90, 274], [51, 0, 149, 80], [0, 64, 100, 233], [111, 139, 184, 181], [385, 63, 452, 151], [230, 54, 302, 138], [378, 50, 490, 264], [18, 180, 88, 266], [54, 142, 80, 163], [65, 259, 85, 277], [6, 9, 80, 112], [162, 53, 396, 151], [177, 11, 486, 254], [397, 94, 494, 267], [121, 89, 141, 148], [5, 4, 111, 277], [165, 6, 244, 149], [423, 58, 499, 254], [336, 12, 477, 273], [338, 14, 465, 258], [83, 84, 144, 142], [119, 16, 440, 163], [293, 160, 319, 214], [9, 162, 90, 270], [9, 16, 120, 277], [441, 157, 499, 272], [111, 142, 188, 184], [164, 14, 491, 271], [15, 174, 137, 275], [7, 32, 139, 276], [5, 0, 114, 277], [347, 120, 494, 277], [4, 12, 126, 277], [213, 5, 309, 161], [429, 35, 494, 175], [88, 209, 319, 276], [140, 0, 499, 75], [222, 6, 305, 153], [6, 8, 106, 277], [340, 90, 492, 277], [108, 123, 401, 274], [95, 1, 488, 268], [434, 157, 499, 271], [347, 214, 452, 274], [114, 88, 147, 154], [157, 14, 251, 154], [48, 139, 257, 271], [194, 128, 238, 181], [80, 120, 384, 273], [169, 47, 233, 133], [170, 43, 235, 133], [346, 12, 470, 195], [54, 6, 451, 244], [12, 1, 161, 88], [67, 195, 350, 275], [345, 170, 469, 269], [379, 23, 484, 201], [350, 213, 475, 273], [6, 13, 67, 109], [60, 85, 328, 266], [7, 2, 338, 263], [293, 127, 314, 203], [11, 11, 84, 107], [211, 13, 463, 205], [342, 79, 496, 274], [71, 15, 483, 169], [198, 132, 233, 175], [54, 104, 384, 269], [161, 9, 246, 152], [367, 181, 478, 270], [93, 1, 499, 103], [16, 190, 366, 276]], "image_id": "/export/PTM_dataset/flickr30k-cn/flickr30k-images/2314492671.jpg"}

@YizhaoGao-AI
Copy link
Collaborator

Hi, there are many reasons: 1. Our model is pre-trained on weak semantic correlation data crawled from the web while ViLT is pre-trained on strong semantic correlation data. Flickr30K is also a strong correlation dataset. 2. The translation of Flickr30K inevitably brings negative effects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants