Commit: renew
teowu committed Jul 31, 2023
1 parent cb38328 commit 7857960
Showing 16 changed files with 101 additions and 296 deletions.
65 changes: 23 additions & 42 deletions README.md
@@ -1,11 +1,10 @@
# DOVER

Official Codes, Demos, Models for the [Disentangled Objective Video Quality Evaluator (DOVER)](https://arxiv.org/abs/2211.04894v2).
Official Codes, Demos, Models for the [Disentangled Objective Video Quality Evaluator (DOVER)](https://arxiv.org/abs/2211.04894v3), state-of-the-art in UGC-VQA.

- 9 Feb, 2022: **DOVER-Mobile** is available! Evaluate on CPU with High Speed!
- 16 Jan, 2022: Full Training Code Available (include LVBS). See below.
- 19 Dec, 2022: Training Code for *Head-only Transfer Learning* is ready!! See [training](https://github.com/QualityAssessment/DOVER#training-adapt-dover-to-your-video-quality-dataset).
- 18 Dec, 2022: Thanks to 媒矿工厂 for providing a third-party Chinese explanation of this paper: [WeChat article](https://mp.weixin.qq.com/s/NZlyTwT7FAPkKhZUNc-30w).
- 17 Jul, 2023: DOVER has been accepted by ICCV 2023. We will release the DIVIDE-3k dataset to train DOVER++ via fully-supervised LVBS soon.
- 9 Feb, 2023: **DOVER-Mobile** is available! Evaluate on CPU at very high speed!
- 16 Jan, 2023: Full Training Code Available (including LVBS). See below.
- 10 Dec, 2022: Now the evaluation tool can directly predict a fused score for any video. See [here](https://github.com/QualityAssessment/DOVER#new-get-the-fused-quality-score-for-use).


@@ -31,18 +30,20 @@ Official Codes, Demos, Models for the [Disentangled Objective Video Quality Eval
Corresponding video results can be found [here](https://github.com/QualityAssessment/DOVER/tree/master/figs).

The first attempt to disentangle the VQA problem into aesthetic and technical quality evaluations.
Official code for ArXiv Preprint Paper *"Disentangling Aesthetic and Technical Effects for Video Quality Assessment of User Generated Content"*.
Official code for the [ICCV 2023] paper *"Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives"*.



## Introduction

### Problem Definition

*In-the-wild UGC-VQA entangles aesthetic and technical perspectives, which may lead to different opinions on the term **QUALITY**.*

![Fig](figs/problem_definition.png)

### The Proposed DOVER

*This inspires us to propose a simple and effective way to disentangle the two perspectives from **EXISTING** UGC-VQA datasets.*

![Fig](figs/approach.png)

@@ -219,53 +220,33 @@ Or, just take a look at our training curves that are made public:
and you are welcome to reproduce them!


## Results

### Score-level Fusion

Directly training on LSVQ and testing on other datasets:

| | PLCC@LSVQ_1080p | PLCC@LSVQ_test | PLCC@LIVE_VQC | PLCC@KoNViD | MACs | config | model |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| DOVER | 0.830 | 0.889 | 0.855 | 0.883 | 282G | [config](dover.yml) | [github](https://github.com/teowu/DOVER/releases/download/v0.1.0/DOVER.pth) |
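
How the two branch predictions are fused into the single reported score: below is a minimal sketch, assuming normalized branch outputs and the default branch weight of 0.5 that the dataset options use (`opt.get("weight", 0.5)`); the exact normalization and weights for the released checkpoint are defined in the repository's evaluation script, not here.

```python
# Illustrative score-level fusion (a sketch, not the released evaluator's exact recipe):
# combine the aesthetic-branch and technical-branch predictions with a convex weight.
def fuse_scores(aesthetic: float, technical: float, weight: float = 0.5) -> float:
    """Weighted fusion of the two branch scores into one overall quality score."""
    return weight * aesthetic + (1.0 - weight) * technical

print(fuse_scores(0.72, 0.64))  # 0.5 * 0.72 + 0.5 * 0.64 = 0.68
```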

### Representation-level Fusion

Transfer learning on smaller datasets (as reproduced by the current training code):

| | KoNViD-1k | CVD2014 | LIVE-VQC | YouTube-UGC |
| ---- | ---- | ---- | ---- | ---- |
| SROCC | 0.905 (0.906 in paper) | 0.894 | 0.855 (0.858 in paper) | 0.888 (0.880 in paper) |
| PLCC | 0.905 (0.909 in paper) | 0.908 | 0.875 (0.874 in paper) | 0.884 (0.874 in paper) |

LVBS is introduced in the representation-level fusion.
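
SROCC and PLCC in the tables above are the Spearman rank and Pearson linear correlations between predicted and ground-truth scores (papers often apply a logistic fit before computing PLCC). A minimal sketch of how they can be computed with SciPy on toy data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy predicted scores and ground-truth mean opinion scores (illustrative numbers only).
pred = np.array([3.1, 2.4, 4.0, 1.8, 3.6])
mos = np.array([3.0, 2.5, 4.2, 2.0, 3.3])

srocc, _ = spearmanr(pred, mos)  # rank (monotonic) correlation
plcc, _ = pearsonr(pred, mos)    # linear correlation
print(f"SROCC={srocc:.3f}, PLCC={plcc:.3f}")
```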



## Acknowledgement

Thanks to [Annan Wang](https://github.com/AnnanWangDaniel) for developing the interfaces for the subjective studies.
Thanks for every participant of the studies!
Thanks to every participant of the subjective studies!

## Citation

Should you find our works interesting and would like to cite them, please feel free to add these in your references!
Should you find our work interesting and would like to cite it, please feel free to add the following to your references!

```bibtex
@article{wu2022disentanglevqa,
title={Disentangling Aesthetic and Technical Effects for Video Quality Assessment of User Generated Content},
author={Wu, Haoning and Liao, Liang and Chen, Chaofeng and Hou, Jingwen and Wang, Annan and Sun, Wenxiu and Yan, Qiong and Lin, Weisi},
journal={arXiv preprint arXiv:2211.04894},
year={2022}
}

@article{wu2022fastquality,
```bibtex
%fastvqa
@inproceedings{wu2022fastvqa,
title={FAST-VQA: Efficient End-to-end Video Quality Assessment with Fragment Sampling},
author={Wu, Haoning and Chen, Chaofeng and Hou, Jingwen and Liao, Liang and Wang, Annan and Sun, Wenxiu and Yan, Qiong and Lin, Weisi},
journal={Proceedings of European Conference of Computer Vision (ECCV)},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022}
}
%dover
@inproceedings{wu2023dover,
title={Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives},
author={Wu, Haoning and Zhang, Erli and Liao, Liang and Chen, Chaofeng and Hou, Jingwen and Wang, Annan and Sun, Wenxiu and Yan, Qiong and Lin, Weisi},
year={2023},
booktitle={International Conference on Computer Vision (ICCV)},
}
@misc{end2endvideoqualitytool,
title = {Open Source Deep End-to-End Video Quality Assessment Toolbox},
author = {Wu, Haoning},
2 changes: 1 addition & 1 deletion dover/datasets/.ipynb_checkpoints/__init__-checkpoint.py
@@ -1,3 +1,3 @@
## API for DOVER and its variants
from .basic_datasets import *
from .fusion_datasets import *
from .dover_datasets import *
Another changed file (filename not shown):
@@ -30,8 +30,19 @@ def get_spatial_fragments(
random=False,
random_upsample=False,
fallback_type="upsample",
upsample=-1,
**kwargs,
):
if upsample > 0:
old_h, old_w = video.shape[-2], video.shape[-1]
if old_h >= old_w:
w = upsample
h = int(upsample * old_h / old_w)
else:
h = upsample
w = int(upsample * old_w / old_h)

video = get_resized_video(video, h, w)
size_h = fragments_h * fsize_h
size_w = fragments_w * fsize_w
## video: [C,T,H,W]
@@ -56,7 +67,7 @@ def get_spatial_fragments(
video / 255.0, scale_factor=randratio, mode="bilinear"
)
video = (video * 255.0).type_as(ovideo)

assert dur_t % aligned == 0, "Please provide match vclip and align index"
size = size_h, size_w

@@ -231,6 +242,7 @@ def spatial_temporal_view_decomposition(
video[stype] = torch.stack(imgs, 0).permute(3, 0, 1, 2)
del ovideo
else:
decord.bridge.set_bridge("torch")
vreader = VideoReader(video_path)
### Avoid duplicated video decoding!!! Important!!!!
all_frame_inds = []
@@ -319,6 +331,9 @@ def __init__(self, opt):

self.weight = opt.get("weight", 0.5)

self.fully_supervised = opt.get("fully_supervised", False)
print("Fully supervised:", self.fully_supervised)

self.video_infos = []
self.ann_file = opt["anno_file"]
self.data_prefix = opt["data_prefix"]
@@ -362,8 +377,11 @@ def __init__(self, opt):
with open(self.ann_file, "r") as fin:
for line in fin:
line_split = line.strip().split(",")
filename, _, _, label = line_split
label = float(label)
filename, a, t, label = line_split
if self.fully_supervised:
label = float(a), float(t), float(label)
else:
label = float(label)
filename = osp.join(self.data_prefix, filename)
self.video_infos.append(dict(filename=filename, label=label))
except:
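
The `upsample` branch added in this file resizes the clip so that its shorter side matches `upsample` pixels while preserving the aspect ratio, before fragments are cropped; the actual resizing is delegated to the repository's `get_resized_video` helper. A standalone sketch of just the size computation:

```python
# Sketch of the aspect-ratio-preserving target size used by the new `upsample` option.
# The shorter side of a [C, T, H, W] clip becomes `target`; the longer side scales accordingly.
def upsampled_size(old_h: int, old_w: int, target: int) -> tuple[int, int]:
    if old_h >= old_w:  # width is the shorter (or equal) side
        return int(target * old_h / old_w), target
    return target, int(target * old_w / old_h)  # height is the shorter side

print(upsampled_size(1080, 1920, 224))  # (224, 398): the short side is scaled to 224
```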
2 changes: 1 addition & 1 deletion dover/datasets/__init__.py
@@ -1,3 +1,3 @@
## API for DOVER and its variants
from .basic_datasets import *
from .fusion_datasets import *
from .dover_datasets import *
Binary file modified dover/datasets/__pycache__/fusion_datasets.cpython-38.pyc
Binary file not shown.
Another changed file (filename not shown):
@@ -30,8 +30,19 @@ def get_spatial_fragments(
random=False,
random_upsample=False,
fallback_type="upsample",
upsample=-1,
**kwargs,
):
if upsample > 0:
old_h, old_w = video.shape[-2], video.shape[-1]
if old_h >= old_w:
w = upsample
h = int(upsample * old_h / old_w)
else:
h = upsample
w = int(upsample * old_w / old_h)

video = get_resized_video(video, h, w)
size_h = fragments_h * fsize_h
size_w = fragments_w * fsize_w
## video: [C,T,H,W]
@@ -56,7 +67,7 @@ def get_spatial_fragments(
video / 255.0, scale_factor=randratio, mode="bilinear"
)
video = (video * 255.0).type_as(ovideo)

assert dur_t % aligned == 0, "Please provide match vclip and align index"
size = size_h, size_w

@@ -231,6 +242,7 @@ def spatial_temporal_view_decomposition(
video[stype] = torch.stack(imgs, 0).permute(3, 0, 1, 2)
del ovideo
else:
decord.bridge.set_bridge("torch")
vreader = VideoReader(video_path)
### Avoid duplicated video decoding!!! Important!!!!
all_frame_inds = []
@@ -319,6 +331,9 @@ def __init__(self, opt):

self.weight = opt.get("weight", 0.5)

self.fully_supervised = opt.get("fully_supervised", False)
print("Fully supervised:", self.fully_supervised)

self.video_infos = []
self.ann_file = opt["anno_file"]
self.data_prefix = opt["data_prefix"]
@@ -362,8 +377,11 @@ def __init__(self, opt):
with open(self.ann_file, "r") as fin:
for line in fin:
line_split = line.strip().split(",")
filename, _, _, label = line_split
label = float(label)
filename, a, t, label = line_split
if self.fully_supervised:
label = float(a), float(t), float(label)
else:
label = float(label)
filename = osp.join(self.data_prefix, filename)
self.video_infos.append(dict(filename=filename, label=label))
except:
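
With the `fully_supervised` option added in this file, each annotation line is expected to carry an aesthetic score, a technical score, and an overall score after the filename; without it, only the overall score is kept. A small sketch of that parsing, assuming a comma-separated annotation file of the form `filename,aesthetic,technical,overall` (the file name below is hypothetical):

```python
import os.path as osp

def parse_annotation_line(line: str, data_prefix: str, fully_supervised: bool):
    """Parse one `filename,aesthetic,technical,overall` annotation line."""
    filename, a, t, overall = line.strip().split(",")
    label = (float(a), float(t), float(overall)) if fully_supervised else float(overall)
    return osp.join(data_prefix, filename), label

# Hypothetical example line:
print(parse_annotation_line("clip_001.mp4,3.2,2.8,3.0", "videos", True))
# ('videos/clip_001.mp4', (3.2, 2.8, 3.0))
```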
1 change: 0 additions & 1 deletion dover/models/.ipynb_checkpoints/__init__-checkpoint.py
@@ -9,7 +9,6 @@
"VQABackbone",
"IQABackbone",
"VQAHead",
"MaxVQAHead",
"IQAHead",
"VARHead",
"BaseEvaluator",
4 changes: 4 additions & 0 deletions dover/models/.ipynb_checkpoints/conv_backbone-checkpoint.py
@@ -4,6 +4,9 @@
from timm.models.layers import trunc_normal_, DropPath
from timm.models.registry import register_model

from open_clip import CLIP3D
import open_clip

class GRN(nn.Module):
""" GRN (Global Response Normalization) layer
"""
@@ -635,6 +638,7 @@ def convnextv2_huge(**kwargs):
return model




if __name__ == "__main__":
