Training set: point_grounding and OCR data format #30

Open
zhoumenghan opened this issue Dec 11, 2024 · 3 comments

Comments

@zhoumenghan

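Hello, the paper says that besides box_grounding, training also used point_grounding and OCR data, as SeeClick does. For point_grounding, how should the conversation be organized? (The Qwen2-VL tokenizer does not include tokens like <point_start>.)

One more question: is the OCR data organized as follows?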
{
"conversations": [
{
"from": "human",
"value": "\nIn the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions (with bbox).\n<|box_start|>(951,0),(1000,18)<|box_end|>\n<|box_start|>(514,288),(555,361)<|box_end|>\n<|box_start|>(910,87),(926,115)<|box_end|>\n<|box_start|>(708,810),(713,827)<|box_end|>\n<|box_start|>(739,626),(773,660)<|box_end|>"
},
{
"from": "gpt",
"value": "<|object_ref_start|>System<|object_ref_end|><|box_start|>(951,0),(1000,18)<|box_end|>\n<|object_ref_start|>Save to Google Drive<|object_ref_end|><|box_start|>(514,288),(555,361)<|box_end|>\n<|object_ref_start|>More options menu<|object_ref_end|><|box_start|>(910,87),(926,115)<|box_end|>\n<|object_ref_start|>Learn more about results and reviews "Helix Fruit Jump Arcade Game"<|object_ref_end|><|box_start|>(708,810),(713,827)<|box_end|>\n<|object_ref_start|>See more personalized recommendations<|object_ref_end|><|box_start|>(739,626),(773,660)<|box_end|>"
}
],
"images": [
"/home/test2/LAM/OS-Atlas-data/desktop_domain/linux_desktop/output_20240912_152854_original_screenshot.png"
]
},

Looking forward to your reply.

@numbmelon
Collaborator

numbmelon commented Dec 11, 2024

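Special tokens like "<point>" are ones we added to InternVL ourselves; in general a model will not have such special tokens. When training the 7B Qwen2-VL, we did not wrap point data in special tokens at all. (The results show this works well too.)
For the OCR format, you can refer to the relevant material in the Qwen-VL repository; it should use <|quad_start|> and <|quad_end|>, which handle polygon cases. Since our focus is on grounding, we did not strictly normalize this part of the data when training Qwen either.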
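
As an illustration of the point above (point data written without surrounding special tokens), here is a minimal sketch of how such a sample might be built. It reuses the "conversations"/"images" layout from the question. The helper name `make_point_sample`, the `<image>` placeholder, the plain-text `(x,y)` answer, and the 0-1000 coordinate normalization are all assumptions for illustration, not the format the OS-Atlas authors confirmed.

```python
def make_point_sample(image_path, instruction, x, y, width, height):
    """Build one point-grounding sample with the point written as plain text.

    Hypothetical helper: the exact prompt and answer format used by the
    authors is not stated in this thread.
    """
    # Assumption: coordinates are normalized to a 0-1000 range, which matches
    # the value range seen in the box example in the question.
    px = round(x / width * 1000)
    py = round(y / height * 1000)
    return {
        "conversations": [
            # Assumption: "<image>" marks where the screenshot is inserted.
            {"from": "human", "value": f"<image>\n{instruction}"},
            # No <|box_start|>-style special tokens around the point.
            {"from": "gpt", "value": f"({px},{py})"},
        ],
        "images": [image_path],
    }


sample = make_point_sample(
    "screenshot.png", "Click the Save button.", 960, 540, 1920, 1080
)
```

Any consistent textual format should serve the same purpose, as long as training and evaluation parse it the same way.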

@zhoumenghan
Author

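Thanks for your reply. If I only care about the grounding task and train with box_grounding alone, without point_grounding (a point seems similar to a box, and can be derived from the output box) and without OCR SFT, do you think the impact would be large? Also, do you think gradient_accumulation_steps is critical? I trained on the 9 datasets you released with gradient_accumulation_steps set to 2, and after 1/3 of an epoch the performance on ScreenSpot v2 started to drop. My guess is that gradient_accumulation_steps was too small and hurt the accuracy of the gradient estimate, so I am now retraining with a larger value.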
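
The box-to-point conversion mentioned above can be sketched as taking the center of the predicted box. The function name `box_to_point` and the regular expression are illustrative assumptions; the pattern targets the `(x1,y1),(x2,y2)` layout shown in the sample earlier in this thread.

```python
import re

# Matches "(x1,y1),(x2,y2)" as it appears between <|box_start|> and <|box_end|>.
BOX_PATTERN = re.compile(r"\((\d+),(\d+)\),\((\d+),(\d+)\)")


def box_to_point(text):
    """Return the center (x, y) of the first box found in text, or None."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    # The click point is taken to be the geometric center of the box.
    return ((x1 + x2) / 2, (y1 + y2) / 2)


center = box_to_point("<|box_start|>(514,288),(555,361)<|box_end|>")
# center == (534.5, 324.5)
```

The result is in the same coordinate space as the box, so it still needs rescaling if the benchmark expects pixel coordinates.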

@numbmelon
Collaborator

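For Qwen-VL, if you do not need point-format output, training on box data alone is entirely feasible and still gives good performance. (In fact, the 7B version we have released used only a small amount of point and OCR data.)
As for gradient_accumulation_steps, I think you should pay attention to the total batch_size during training; we used a batch_size of 1024 when training on the full data.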
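
To make the relationship concrete: in typical data-parallel training, the total batch size is the per-device batch size times the number of devices times gradient_accumulation_steps. The sketch below uses hypothetical hardware numbers (8 devices, per-device batch of 4); only the 1024 target comes from this thread.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps):
    """Total samples contributing to one optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


def accum_steps_for(target_batch, per_device_batch, num_devices):
    """Accumulation steps needed to reach target_batch exactly."""
    per_step = per_device_batch * num_devices
    if target_batch % per_step != 0:
        raise ValueError("target_batch must be a multiple of per_device_batch * num_devices")
    return target_batch // per_step


# Hypothetical setup: 8 devices, per-device batch of 4.
current = effective_batch_size(4, 8, 2)   # 64 with gradient_accumulation_steps=2
needed = accum_steps_for(1024, 4, 8)      # 32 steps to reach a total batch of 1024
```

Under these assumed numbers, gradient_accumulation_steps=2 gives a total batch 16 times smaller than 1024, so the learning rate and schedule may need to be adjusted if the total batch size cannot be matched.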
