We currently have `multimodal_chat_dataset`, which works well for conversations about an image, but many VQA datasets are structured more like instructions: a question column, an answer column, and an image column (see the VQA datasets on the HF Hub, sorted by most downloads). We should add a `multimodal_instruct_dataset` builder so these datasets can be used directly from the configs.
- This will require upgrading `InputOutputToMessages` with image support
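As a rough illustration of the transform logic, here is a minimal sketch of mapping an instruct-style VQA row (question/answer/image columns, with a `column_map` for datasets that name them differently) to a multimodal message list. The function name, the `column_map` convention, and the message/content dict schema are illustrative assumptions for this sketch, not the actual torchtune API.

```python
# Hypothetical sketch, not the torchtune API: convert a VQA-style row
# into a chat-message list with the image attached to the user turn.
from typing import Any, Dict, List, Optional


def vqa_row_to_messages(
    sample: Dict[str, Any],
    column_map: Optional[Dict[str, str]] = None,
) -> List[Dict[str, Any]]:
    """Map an instruct-style VQA sample to multimodal chat messages.

    column_map lets datasets with different column names (e.g. "query"
    instead of "question") be remapped, mirroring how the text-only
    input/output transform handles column renaming.
    """
    cols = {"question": "question", "answer": "answer", "image": "image"}
    if column_map:
        cols.update(column_map)

    # Image placeholder goes first in the user content, followed by the
    # question text; the answer column becomes the assistant turn.
    user_content = [
        {"type": "image", "content": sample[cols["image"]]},
        {"type": "text", "content": sample[cols["question"]]},
    ]
    return [
        {"role": "user", "content": user_content},
        {
            "role": "assistant",
            "content": [{"type": "text", "content": sample[cols["answer"]]}],
        },
    ]


# Example row shaped like a typical HF VQA dataset.
row = {"question": "What color is the car?", "answer": "Red", "image": "<image bytes>"}
messages = vqa_row_to_messages(row)
```

The key difference from the text-only case is that user content becomes a list of typed parts (image + text) rather than a single string, which is what the builder would need to thread through to the model transform.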