Second release (Action Segmentation)
kylemin committed Oct 1, 2023
1 parent 08d95be commit b08f718
Showing 19 changed files with 543 additions and 119 deletions.
40 changes: 28 additions & 12 deletions README.md
@@ -1,18 +1,31 @@
# GraVi-T
This repository contains an open-source codebase for Graph-based long-term Video undersTanding (GraVi-T). It is designed to serve as a spatial-temporal graph learning framework for multiple video understanding tasks. In the current version, it supports training and evaluating one of the state-of-the-art models, [SPELL](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136950367.pdf), for the tasks of active speaker detection and action localization.
This repository contains an open-source codebase for Graph-based long-term Video undersTanding (GraVi-T). It is designed to serve as a spatial-temporal graph learning framework for multiple video understanding tasks. In the current version, it supports training and evaluating one of the state-of-the-art models, [SPELL](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136950367.pdf), for the tasks of active speaker detection, action localization, and action segmentation.

In the near future, we will release more advanced graph-based approaches for other tasks, including action segmentation and audio-visual diarization. We also want to note that our method has recently won many challenges, including the Ego4D challenges [@ECCV22](https://ego4d-data.org/workshops/eccv22/), [@CVPR23](https://ego4d-data.org/workshops/cvpr23/) and ActivityNet [@CVPR22](https://research.google.com/ava/challenge.html).
In the near future, we will release more advanced graph-based approaches for other tasks, including video summarization and audio-visual diarization. We also want to note that our method has recently won many challenges, including the Ego4D challenges [@ECCV22](https://ego4d-data.org/workshops/eccv22/), [@CVPR23](https://ego4d-data.org/workshops/cvpr23/) and ActivityNet [@CVPR22](https://research.google.com/ava/challenge.html).

![](docs/images/gravit_teaser.jpg?raw=true)

## Use Cases and Performance
| Model | Dataset | Task | validation mAP (%) |
|:--------|:-----------------------:|:-------------------------:|:--------------------------:|
| SPELL | AVA-ActiveSpeaker v1.0 | Active Speaker Detection | **94.2** (up from 88.0) |
| SPELL+ | AVA-ActiveSpeaker v1.0 | Active Speaker Detection | **94.9** (up from 89.3) |
| SPELL | AVA-Actions v2.2 | Action Localization | **36.8** (up from 29.4) |
### Active Speaker Detection (Dataset: AVA-ActiveSpeaker v1.0)
| Model | Feature | validation mAP (%) |
|:--------|:------------------:|:--------------------------:|
| SPELL | RESNET18-TSM-AUG | **94.2** (up from 88.0) |
| SPELL | RESNET50-TSM-AUG | **94.9** (up from 89.3) |
> Numbers in parentheses indicate the mAP scores without using the suggested graph learning method.
### Action Localization (Dataset: AVA-Actions v2.2)
| Model | Feature | validation mAP (%) |
|:--------|:----------------------:|:--------------------------:|
| SPELL | SLOWFAST-64x2-R101 | **36.8** (up from 29.4) |
> Number in parentheses indicates the mAP score without using the suggested graph learning method.
### Action Segmentation (Dataset: 50Salads - split2)
| Model | Feature | F1@0.1 (%) | Acc (%) |
|:--------|:------------:|:-------------------------:|:-------------------------:|
| SPELL | MSTCN++ | **84.7** (up from 83.4) | **85.0** (up from 84.6) |
| SPELL | ASFORMER | **89.8** (up from 86.1) | **88.2** (up from 87.8) |
> Numbers in parentheses indicate the scores without using the suggested graph learning method.
## Requirements
Preliminary requirements:
- Python>=3.7
@@ -26,7 +39,7 @@ pip3 install -r requirements.txt
Alternatively, you can manually install PyYAML, pandas, and [PyG](https://www.pyg.org)>=2.0.3 with CUDA>=11.1

## Installation
After confirming the above requirements are met, run the following commands:
After confirming the above requirements, run the following commands:
```
git clone https://github.com/IntelLabs/GraVi-T.git
cd GraVi-T
@@ -35,7 +48,7 @@ pip3 install -e .

## Getting Started (Active Speaker Detection)
### Annotations
1) Download the annotations from the official site:
1) Download the annotations of AVA-ActiveSpeaker from the official site:
```
DATA_DIR="data/annotations"
@@ -53,7 +66,7 @@ Download `RESNET18-TSM-AUG.zip` from the Google Drive link from [SPELL](https://
> We use the features from the third-party repositories.
### Directory Structure
The data directories should look like as follows:
The data directories should look as follows:
```
|-- data
|-- annotations
@@ -70,7 +83,7 @@ We can perform the experiments on active speaker detection with the default conf
#### Step 1: Graph Generation
Run the following command to generate spatial-temporal graphs from the features:
```
python data/generate_graph.py --features RESNET18-TSM-AUG --ec_mode csi --time_span 90 --tau 0.9
python data/generate_spatial-temporal_graphs.py --features RESNET18-TSM-AUG --ec_mode csi --time_span 90 --tau 0.9
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video, which spans about 90 seconds (specified by `--time_span`).
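
For intuition on how `--tau` shapes the graph, the sketch below illustrates the basic rule of connecting temporally close nodes: every pair of nodes whose timestamps differ by at most `tau` seconds gets a directed edge whose sign encodes the temporal direction. This is a simplified illustration written for this page, not the logic of `generate_spatial-temporal_graphs.py` itself (the real script additionally uses `--ec_mode` to decide which node pairs are eligible for connection); the helper name and the example timestamps are made up.
```
# Illustrative sketch (not the project's generate_spatial-temporal_graphs.py):
# connect every pair of nodes whose timestamps are at most tau seconds apart,
# and record the temporal direction of each edge.
import numpy as np

def build_temporal_edges(timestamps, tau):
    """Return (source, target, sign) lists for all node pairs within tau seconds."""
    src, dst, sign = [], [], []
    for i, ti in enumerate(timestamps):
        for j, tj in enumerate(timestamps):
            if i != j and abs(ti - tj) <= tau:
                src.append(i)
                dst.append(j)
                sign.append(np.sign(ti - tj))  # positive: backward in time, negative: forward
    return src, dst, sign

# Hypothetical timestamps (seconds) of five face-crop nodes, with tau = 0.9
src, dst, sign = build_temporal_edges([0.0, 0.4, 0.8, 2.0, 2.5], tau=0.9)
print(list(zip(src, dst, sign)))
```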

@@ -91,6 +104,9 @@ This will print the evaluation score.
## Getting Started (Action Localization)
Please refer to the instructions in [GETTING_STARTED_AL.md](docs/GETTING_STARTED_AL.md).

## Getting Started (Action Segmentation)
Please refer to the instructions in [GETTING_STARTED_AS.md](docs/GETTING_STARTED_AS.md).

## Contributor
GraVi-T is written and maintained by [Kyle Min](https://sites.google.com/view/kylemin)

@@ -121,7 +137,7 @@ Technical report for Ego4D challenge 2022:

> This “research quality code” is for Non-Commercial purposes and provided by Intel “As Is” without any express or implied warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
> AVA-ActiveSpeaker, AVA-Actions: Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it.
> AVA-ActiveSpeaker, AVA-Actions, 50Salads: Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it.
## Datasets & Models Disclaimer

3 changes: 3 additions & 0 deletions configs/action-localization/ava_actions/SPELL_default.yaml
@@ -2,11 +2,14 @@ exp_name: SPELL_AL_default
model_name: SPELL
graph_name: SLOWFAST-64x2-R101_cdi_90.0_3.0
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 1
channel1: 1024
channel2: 512
proj_dim: 64
final_dim: 80
num_att_heads: 0
dropout: 0.2
lr: 0.0005
wd: 0.0001
19 changes: 19 additions & 0 deletions configs/action-segmentation/50salads/SPELL_default.yaml
@@ -0,0 +1,19 @@
exp_name: SPELL_AS_default
model_name: SPELL
graph_name: ASFORMER_10_10
loss_name: ce_ref
use_spf: False
use_ref: True
w_ref: 5
num_modality: 1
channel1: 64
channel2: 64
final_dim: 19
num_att_heads: 4
dropout: 0.2
lr: 0.0005
wd: 0
batch_size: 1
sch_param: 5
num_epoch: 50
sample_rate: 2
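
For reference, a configuration file like the one above can be loaded into a plain Python dictionary as sketched below; this is our illustration, and it assumes the training tools parse the YAML with PyYAML (which is among the listed requirements).
```
# Small sketch (ours): read the action-segmentation config into a dict.
import yaml

with open('configs/action-segmentation/50salads/SPELL_default.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['graph_name'], cfg['final_dim'])  # ASFORMER_10_10 19
```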
@@ -2,11 +2,14 @@ exp_name: SPELL_ASD_default
model_name: SPELL
graph_name: RESNET18-TSM-AUG_csi_90.0_0.9
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 2
channel1: 64
channel2: 16
proj_dim: 64
final_dim: 1
num_att_heads: 0
dropout: 0.2
lr: 0.0005
wd: 0
@@ -2,11 +2,14 @@ exp_name: SPELL_plus_ASD_default
model_name: SPELL
graph_name: RESNET50-TSM-AUG_csi_90.0_0.9
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 2
channel1: 64
channel2: 16
proj_dim: 64
final_dim: 1
num_att_heads: 0
dropout: 0.2
lr: 0.0003
wd: 0
@@ -35,7 +35,7 @@ def _get_time_windows(list_fts, time_span):
return twd_all


def generate_graph(data_file, path_graphs, sp):
def generate_graph(data_file, args, path_graphs, sp):
"""
Generate graphs of a single video
Time span of each graph is not greater than "time_span"
@@ -114,7 +114,7 @@ def generate_graph(data_file, path_graphs, sp):

if __name__ == "__main__":
"""
Generate graphs from the extracted features
Generate spatial-temporal graphs from the extracted features
"""

parser = argparse.ArgumentParser()
@@ -129,7 +129,6 @@ def generate_graph(data_file, path_graphs, sp):
parser.add_argument('--time_span', type=float, help='Maximum time span for each graph in seconds', required=True)
parser.add_argument('--tau', type=float, help='Maximum time difference between neighboring nodes in seconds', required=True)

global args
args = parser.parse_args()

# Iterate over train/val splits
@@ -141,6 +140,6 @@ def generate_graph(data_file, path_graphs, sp):
list_data_files = sorted(glob.glob(os.path.join(args.root_data, f'features/{args.features}/{sp}/*.pkl')))

with Pool(processes=20) as pool:
num_graph = pool.map(partial(generate_graph, path_graphs=path_graphs, sp=sp), list_data_files)
num_graph = pool.map(partial(generate_graph, args=args, path_graphs=path_graphs, sp=sp), list_data_files)

print (f'Graph generation for {sp} is finished (number of graphs: {sum(num_graph)})')
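
The change above passes `args` explicitly through `functools.partial` instead of relying on a module-level `global args`, so every worker process receives the parsed arguments directly. A minimal standalone illustration of that pattern (with a made-up function and values):
```
# Minimal illustration (ours): functools.partial binds the fixed arguments so that
# Pool.map only has to supply the one argument that varies per item.
from functools import partial
from multiprocessing import Pool

def process(item, scale):
    return item * scale

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        print(pool.map(partial(process, scale=10), [1, 2, 3]))  # [10, 20, 30]
```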
112 changes: 112 additions & 0 deletions data/generate_temporal_graphs.py
@@ -0,0 +1,112 @@
import os
import glob
import torch
import argparse
import numpy as np
from functools import partial
from multiprocessing import Pool
from torch_geometric.data import Data


def generate_temporal_graph(data_file, args, path_graphs, actions, train_ids, all_ids):
"""
Generate temporal graphs of a single video
"""

video_id = os.path.splitext(os.path.basename(data_file))[0]
feature = np.transpose(np.load(data_file))
num_frame = feature.shape[0]
skip = args.skip_factor

# Get a list of ground-truth action labels
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/groundTruth/{video_id}.txt')) as f:
label = [actions[line.strip()] for line in f]

# Get a list of the edge information: these are for edge_index and edge_attr
node_source = []
node_target = []
edge_attr = []
for i in range(num_frame):
for j in range(num_frame):
# Frame difference between the i-th and j-th nodes
frame_diff = i - j

# The edge ij connects the i-th node and j-th node
# Positive edge_attr indicates that the edge ij is backward (negative: forward)
if abs(frame_diff) <= args.tauf:
node_source.append(i)
node_target.append(j)
edge_attr.append(np.sign(frame_diff))

# Make additional connections between non-adjacent nodes
# This can help reduce over-segmentation of predictions in some cases
elif skip:
if (frame_diff % skip == 0) and (abs(frame_diff) <= skip*args.tauf):
node_source.append(i)
node_target.append(j)
edge_attr.append(np.sign(frame_diff))

# x: features
# g: global_id
# edge_index: information on how the graph nodes are connected
# edge_attr: information about whether the edge is spatial (0) or temporal (positive: backward, negative: forward)
# y: labels
graphs = Data(x = torch.tensor(np.array(feature, dtype=np.float32), dtype=torch.float32),
g = all_ids.index(video_id),
edge_index = torch.tensor(np.array([node_source, node_target], dtype=np.int64), dtype=torch.long),
edge_attr = torch.tensor(edge_attr, dtype=torch.float32),
y = torch.tensor(np.array(label, dtype=np.int64)[::args.sample_rate], dtype=torch.long))

if video_id in train_ids:
torch.save(graphs, os.path.join(path_graphs, 'train', f'{video_id}.pt'))
else:
torch.save(graphs, os.path.join(path_graphs, 'val', f'{video_id}.pt'))


if __name__ == "__main__":
"""
Generate temporal graphs from the extracted features
"""

parser = argparse.ArgumentParser()
# Default paths for the training process
parser.add_argument('--root_data', type=str, help='Root directory to the data', default='./data')
parser.add_argument('--dataset', type=str, help='Name of the dataset', default='50salads')
parser.add_argument('--features', type=str, help='Name of the features', required=True)

# Hyperparameters for the graph generation
parser.add_argument('--tauf', type=int, help='Maximum frame difference between neighboring nodes', required=True)
parser.add_argument('--skip_factor', type=int, help='Make additional connections between non-adjacent nodes', default=10)
parser.add_argument('--sample_rate', type=int, help='Downsampling rate for the input', default=2)

args = parser.parse_args()

# Build a mapping from action classes to action ids
actions = {}
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/mapping.txt')) as f:
for line in f:
aid, cls = line.strip().split(' ')
actions[cls] = int(aid)

# Get a list of all video ids
all_ids = sorted([os.path.splitext(v)[0] for v in os.listdir(os.path.join(args.root_data, f'annotations/{args.dataset}/groundTruth'))])

# Iterate over different splits
print ('This process might take a few minutes')

list_splits = sorted(os.listdir(os.path.join(args.root_data, f'features/{args.features}')))
for split in list_splits:
# Get a list of training video ids
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/splits/train.{split}.bundle')) as f:
train_ids = [os.path.splitext(line.strip())[0] for line in f]

path_graphs = os.path.join(args.root_data, f'graphs/{args.features}_{args.tauf}_{args.skip_factor}/{split}')
os.makedirs(os.path.join(path_graphs, 'train'), exist_ok=True)
os.makedirs(os.path.join(path_graphs, 'val'), exist_ok=True)

list_data_files = sorted(glob.glob(os.path.join(args.root_data, f'features/{args.features}/{split}/*.npy')))

with Pool(processes=20) as pool:
pool.map(partial(generate_temporal_graph, args=args, path_graphs=path_graphs, actions=actions, train_ids=train_ids, all_ids=all_ids), list_data_files)

print (f'Graph generation for {split} is finished')
6 changes: 3 additions & 3 deletions docs/GETTING_STARTED_AL.md
@@ -1,6 +1,6 @@
## Getting Started (Action Localization)
### Annotations
Download the annotations from the official site:
Download the annotations of AVA-Actions from the official site:
```
DATA_DIR="data/annotations"
@@ -16,7 +16,7 @@ Download `SLOWFAST-64x2-R101.zip` from the Google Drive link from [SPELL](https:
> We use the features from the third-party repositories. SLOWFAST-64x2-R101 is obtained by using the official code of [SlowFast](https://github.com/facebookresearch/SlowFast) with the pretrained checkpoint ([SLOWFAST_64x2_R101_50_50.pkl](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/SLOWFAST_64x2_R101_50_50.pkl)) in [SlowFast Model Zoo](https://github.com/facebookresearch/SlowFast/blob/main/MODEL_ZOO.md).
### Directory Structure
The data directories should look like as follows:
The data directories should look as follows:
```
|-- data
|-- annotations
@@ -34,7 +34,7 @@ We can perform the experiments on action localization with the default configura
#### Step 1: Graph Generation
Run the following command to generate spatial-temporal graphs from the features:
```
python data/generate_graph.py --features SLOWFAST-64x2-R101 --ec_mode cdi --time_span 90 --tau 3
python data/generate_spatial-temporal_graphs.py --features SLOWFAST-64x2-R101 --ec_mode cdi --time_span 90 --tau 3
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video, which spans about 90 seconds (specified by `--time_span`).

49 changes: 49 additions & 0 deletions docs/GETTING_STARTED_AS.md
@@ -0,0 +1,49 @@
## Getting Started (Action Segmentation)
### Annotations
We suggest using the same set of annotations used by [MS-TCN++](https://github.com/sj-li/MS-TCN2) and [ASFormer](https://github.com/ChinaYi/ASFormer). Download the 50Salads dataset from the links provided by either of the two repositories.

### Features
We suggest extracting the features using [ASFormer](https://github.com/ChinaYi/ASFormer). Use their repository and the pre-trained model checkpoints ([link](https://github.com/ChinaYi/ASFormer/tree/main#reproduce-our-results)) to extract the frame-wise features for each split of the dataset. Extract the features from each of the four refinement layers and concatenate them: specifically, you can concatenate the 64-dimensional features from this [line](https://github.com/ChinaYi/ASFormer/blob/main/model.py#L315), which gives 256-dimensional (frame-wise) features. Similarly, you can extract MS-TCN++ features from this [line](https://github.com/sj-li/MS-TCN2/blob/master/model.py#L23).
> We use the features from the third-party repositories.
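
The sketch below shows only the concatenate-and-save step described above. It assumes you have already collected one 64-dimensional, frame-wise feature array per refinement layer while running the ASFormer (or MS-TCN++) inference code; the array shapes, file name, and output directory are illustrative rather than taken from those repositories.
```
# Hypothetical sketch (ours) of the concatenation step: four (64, num_frames) arrays
# from the refinement layers are stacked into a single (256, num_frames) feature file.
import os
import numpy as np

num_frames = 1000  # stand-in length of one video
refinement_outputs = [np.random.randn(64, num_frames).astype(np.float32) for _ in range(4)]

features = np.concatenate(refinement_outputs, axis=0)  # shape: (256, num_frames)

out_dir = 'data/features/ASFORMER/split2'
os.makedirs(out_dir, exist_ok=True)
np.save(os.path.join(out_dir, 'rgb-01-1.npy'), features)  # file name is illustrative
```
Saving the features as `(feature_dim, num_frames)` arrays matches `generate_temporal_graphs.py`, which transposes them on load.
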
### Directory Structure
The data directories should look as follows:
```
|-- data
|-- annotations
|-- 50salads
|-- groundTruth
|-- splits
|-- mapping.txt
|-- features
|-- ASFORMER
|-- split1
|-- split2
|-- split3
|-- split4
|-- split5
```

### Experiments
We can perform the experiments on action segmentation with the default configuration by following the three steps below.

#### Step 1: Graph Generation
Run the following command to generate temporal graphs from the features:
```
python data/generate_temporal_graphs.py --features ASFORMER --tauf 10
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video.
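
As an optional sanity check (our suggestion, not part of the official instructions), you can load one of the generated graphs and inspect it. The output directory name combines the feature name, `--tauf`, and the skip factor (e.g. `ASFORMER_10_10`, which matches `graph_name` in the default config); the video file name below is illustrative, and recent PyTorch versions may additionally require `weights_only=False` in `torch.load`.
```
# Optional sanity check (ours): inspect one generated temporal graph.
import torch

g = torch.load('data/graphs/ASFORMER_10_10/split2/val/rgb-01-1.pt')
print(g.x.shape)           # node features
print(g.edge_index.shape)  # graph connectivity: (2, num_edges)
print(g.y.shape)           # frame-wise action labels
```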

#### Step 2: Training
Next, run the training script by passing the default configuration file. You also need to specify which split to perform the experiments on:
```
python tools/train_context_reasoning.py --cfg configs/action-segmentation/50salads/SPELL_default.yaml --split 2
```
The results and logs will be saved under `results`.

#### Step 3: Evaluation
Now, we can evaluate the trained model's performance:
```
python tools/evaluate.py --exp_name SPELL_AS_default --eval_type AS
```
This will print the evaluation scores.
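
For reference, the reported numbers follow the standard action-segmentation protocol: frame-wise accuracy (Acc) and segmental F1 at an IoU threshold of 0.1 (F1@0.1). The sketch below is a compact re-implementation of that protocol for illustration only, not the code inside `tools/evaluate.py`; background-class handling is omitted for brevity.
```
# Compact sketch (ours) of the standard action-segmentation metrics:
# frame-wise accuracy and segmental F1 at IoU threshold 0.1.
import numpy as np

def to_segments(labels):
    """Split a frame-wise label sequence into (class, start, end) runs."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_iou(pred, gt, thr=0.1):
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    matched = [False] * len(g_segs)
    tp = 0
    for c, ps, pe in p_segs:
        # IoU with every ground-truth segment of the same class
        ious = [(max(0, min(pe, ge) - max(ps, gs)) / (max(pe, ge) - min(ps, gs)), j)
                for j, (gc, gs, ge) in enumerate(g_segs) if gc == c]
        best_iou, best_j = max(ious, default=(0.0, -1))
        if best_j >= 0 and best_iou >= thr and not matched[best_j]:
            tp += 1
            matched[best_j] = True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

pred = [0, 0, 1, 1, 1, 2, 2, 2]
gt   = [0, 0, 0, 1, 1, 2, 2, 2]
print('Acc:', np.mean(np.array(pred) == np.array(gt)))
print('F1@0.1:', f1_at_iou(pred, gt))
```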
2 changes: 1 addition & 1 deletion gravit/__init__.py
@@ -3,4 +3,4 @@
try:
__version__ = get_distribution('gravit').version
except:
__version__ = '1.0.0'
__version__ = '1.1.0'