CogAgent #9
I implemented CogAgent and debugged it until inference was running, but it turns out CogAgent can't do inference on more than one image at a time because of its cross-attention module, which effectively means it can't ingest videos. One potential workaround is to go frame by frame: ask it to label the current frame while conditioning on the previous autoregressive labels (rough sketch below). The downside is that it then can't do any in-context learning, since it can't attend over past (screenshot, label) pairs in the context. We could switch to CogVLM, which should handle multiple images since it doesn't have the cross-attention module, but it hasn't been fine-tuned / trained on software data (only CogAgent was), so it would probably do very poorly, especially at 400x400 image resolution. Let's discuss next week, but I think we may need to rethink whether we want to keep this baseline.
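As a rough illustration of the frame-by-frame workaround (this is a sketch, not CogAgent's actual API; `load_frames` and `run_cogagent` are hypothetical placeholders for our video loader and single-image inference call):

```python
# Sketch: label each frame while conditioning only on the current frame
# plus previously generated labels. The model never sees more than one
# image, so past frames are summarized purely through their text labels
# and there is no (screenshot, label) in-context learning.
from typing import List
from PIL import Image


def load_frames(video_path: str) -> List[Image.Image]:
    raise NotImplementedError  # placeholder: decode the video into PIL frames


def run_cogagent(image: Image.Image, prompt: str) -> str:
    raise NotImplementedError  # placeholder: single-image CogAgent inference


def label_video(video_path: str, task_prompt: str) -> List[str]:
    labels: List[str] = []
    for i, frame in enumerate(load_frames(video_path)):
        history = "\n".join(f"Frame {j}: {l}" for j, l in enumerate(labels))
        prompt = f"{task_prompt}\nPrevious frame labels:\n{history}\nLabel frame {i}:"
        labels.append(run_cogagent(frame, prompt))
    return labels
```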
I updated the branch to use CogVLM2-Video. It seems to run, but the output is bad: it just copies the SOP from the prompt instead of attending to the screenshots. We probably need to try different inference parameters (temperature / top_p), and we may also need to change the prompt since the model isn't as capable. Another issue is that CogVLM2-Video's inference code expects a single video, i.e. one big array of shape [Channel, Timestep, Width, Height]. That breaks our ICL setup, which interleaves the video tokens with their labels, on the assumption that the model can use the positional encoding to associate each video frame with its label. In CogVLM2-Video's inference code, basically all the text tokens get grouped together and all the image tokens get grouped together (see the sketch below). Still, we should at least be able to use it for zero-shot evaluation if we tune the prompt and inference settings a bit.
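A small illustration of the mismatch (not the model's actual preprocessing; the resolution, frame count, and labels are placeholder values):

```python
# Why the interleaved ICL format doesn't fit CogVLM2-Video's expected input.
import torch

C, H, W = 3, 224, 224  # placeholder resolution
frames = [torch.rand(C, H, W) for _ in range(3)]
labels = ["click login", "type username", "press submit"]

# What CogVLM2-Video's inference code expects: ONE video tensor with all
# frames stacked along a timestep axis. All image tokens end up in a single
# block, separate from all the text tokens.
video = torch.stack(frames, dim=1)  # shape [C, T, H, W] = [3, 3, 224, 224]

# What our ICL prompt needs: frames interleaved with their labels, so the
# model can line up each (screenshot, label) pair positionally.
interleaved = []
for frame, label in zip(frames, labels):
    interleaved.append(("image", frame))
    interleaved.append(("text", f"Label: {label}"))
```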