From 864b17ba1a7c0552278edb9e56ab9748d61b66e8 Mon Sep 17 00:00:00 2001
From: Hou9612 <2289327470@qq.com>
Date: Tue, 28 May 2024 10:50:04 +0800
Subject: [PATCH] update main experiment table

---
 index.html                                      |  20 ++++++--------------
 static/img/experiments/exp1.png                 | Bin 471299 -> 454691 bytes
 .../experiments/framework_pipeline_long.png     | Bin 1338713 -> 0 bytes
 3 files changed, 6 insertions(+), 14 deletions(-)
 delete mode 100644 static/img/experiments/framework_pipeline_long.png

diff --git a/index.html b/index.html
index 33b01fa..c6d3165 100644
--- a/index.html
+++ b/index.html
@@ -1397,25 +1397,17 @@
- We propose an event-centric framework that comprises three phases: snippet prediction, event extraction, and event interaction.
- First, we propose a pyramid multimodal transformer to capture events of different temporal lengths,
- where the audio and visual snippet features interact with each other within multi-scale temporal windows.
- Second, we model the video as structured event graphs according to the snippet predictions,
- based on which we refine the event-aware snippet-level features and aggregate them into event features.
- Finally, we study event relations by modeling the influence among the aggregated audio and visual events and further refining the event features.
- The three phases progressively achieve a comprehensive understanding of video content as well as event relations
- and are jointly optimized with video-level event labels in an end-to-end fashion.
- We want to highlight that the inherent relations among multiple events are essential for
- understanding the temporal structures and dynamic semantics of long-form audio-visual videos,
- which have not been sufficiently considered in previous event localization works.
- More details are in the paper (arXiv).
-
+ To validate the superiority of our proposed framework, we compare it with 16 related methods,
+ including weakly supervised temporal action localization methods: STPN, RSKP;
+ long sequence modeling methods: Longformer, Transformer-LS, ActionFormer, FLatten Transformer;
+ audio-visual learning methods: AVE, AVSlowFast, HAN, PSP, DHHN, CMPAE, CM-PIE;
+ video classification methods: SlowFast, MViT, and MeMViT.
diff --git a/static/img/experiments/exp1.png b/static/img/experiments/exp1.png
Binary files a/static/img/experiments/exp1.png and b/static/img/experiments/exp1.png differ
diff --git a/static/img/experiments/framework_pipeline_long.png b/static/img/experiments/framework_pipeline_long.png
deleted file mode 100644
Binary files a/static/img/experiments/framework_pipeline_long.png and /dev/null differ