Evaluation benchmark

Refer to this repo for data download and extraction

Method

Click to see the detailed model structure

The SliME strategy emphasizes the importance of global context in relation to local image details. For videos, we apply average pooling to the image features of all 8 frames to extract the global feature. These 8 frames serve as local features, compressed by our local compression layer. Thanks to this compression layer, although the maximum model length is just 2048, we can process up to 10 frames, ensuring rich temporal information is retained.

The inference code can be found in ```llava/eval/video``. Just edit some lines of the code to adapt it to your needs.

num_patch_width, num_patch_height = 4, 2 ## line 240 in llava_arch, enables us to include 8 frames
process_images # line 247 in mm_utils, avoids doing image patching

Results

Model	LLM Params			Overall (%)		Short Video (%)		Medium Video (%)		Long Video (%)
Model	LLM Params	Frames	Date	w/o subs	w subs	w/o subs	w subs	w/o subs	w subs	w/o subs	w subs
ST-LLM	7B	64	2024-06-15	37.9	42.3	45.7	48.4	36.8	41.4	31.3	36.9
Video-LLava	7B	8	2024-06-15	39.9	41.6	45.3	46.1	38	40.7	36.2	38.1
QwenVL-Chat	7B	4	2024-06-15	41.1	41.9	46.9	47.3	38.7	40.4	37.8	37.9
ShareGPT4Video	8B	16	2024-06-17	39.9	43.6	48.3	53.6	36.3	39.3	35	37.9
ShareGemini	7B	64	2024-06-20	43.2	47.9	49.1	52.8	41.3	47.3	39.1	43.4
SliME	8B	8	2024-6-12	45.3	47.2	53.3	55.4	42.7	44.4	39.8	41.7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slime_video.md

Slime_video.md

Evaluation benchmark

Method

Results

Files

Slime_video.md

Latest commit

History

Slime_video.md

File metadata and controls

Evaluation benchmark

Method

Results