Second release (Action Segmentation)
kylemin committed Oct 1, 2023
1 parent 08d95be commit b08f718
Showing 19 changed files with 543 additions and 119 deletions.
40 changes: 28 additions & 12 deletions README.md
@@ -1,18 +1,31 @@
# GraVi-T
This repository contains an open-source codebase for Graph-based long-term Video undersTanding (GraVi-T). It is designed to serve as a spatial-temporal graph learning framework for multiple video understanding tasks. In the current version, it supports training and evaluating one of the state-of-the-art models, [SPELL](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136950367.pdf), for the tasks of active speaker detection and action localization.
This repository contains an open-source codebase for Graph-based long-term Video undersTanding (GraVi-T). It is designed to serve as a spatial-temporal graph learning framework for multiple video understanding tasks. In the current version, it supports training and evaluating one of the state-of-the-art models, [SPELL](https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136950367.pdf), for the tasks of active speaker detection, action localization, and action segmentation.

In the near future, we will release more advanced graph-based approaches for other tasks, including action segmentation and audio-visual diarization. We also want to note that our method has recently won many challenges, including the Ego4D challenges [@ECCV22](https://ego4d-data.org/workshops/eccv22/), [@CVPR23](https://ego4d-data.org/workshops/cvpr23/) and ActivityNet [@CVPR22](https://research.google.com/ava/challenge.html).
In the near future, we will release more advanced graph-based approaches for other tasks, including video summarization and audio-visual diarization. We also want to note that our method has recently won many challenges, including the Ego4D challenges [@ECCV22](https://ego4d-data.org/workshops/eccv22/), [@CVPR23](https://ego4d-data.org/workshops/cvpr23/) and ActivityNet [@CVPR22](https://research.google.com/ava/challenge.html).

![](docs/images/gravit_teaser.jpg?raw=true)

## Use Cases and Performance
| Model | Dataset | Task | validation mAP (%) |
|:--------|:-----------------------:|:-------------------------:|:--------------------------:|
| SPELL | AVA-ActiveSpeaker v1.0 | Active Speaker Detection | **94.2** (up from 88.0) |
| SPELL+ | AVA-ActiveSpeaker v1.0 | Active Speaker Detection | **94.9** (up from 89.3) |
| SPELL | AVA-Actions v2.2 | Action Localization | **36.8** (up from 29.4) |
### Active Speaker Detection (Dataset: AVA-ActiveSpeaker v1.0)
| Model | Feature | validation mAP (%) |
|:--------|:------------------:|:--------------------------:|
| SPELL | RESNET18-TSM-AUG | **94.2** (up from 88.0) |
| SPELL | RESNET50-TSM-AUG | **94.9** (up from 89.3) |
> Numbers in parentheses indicate the mAP scores without using the suggested graph learning method.
### Action Localization (Dataset: AVA-Actions v2.2)
| Model | Feature | validation mAP (%) |
|:--------|:----------------------:|:--------------------------:|
| SPELL | SLOWFAST-64x2-R101 | **36.8** (up from 29.4) |
> Number in parentheses indicates the mAP score without using the suggested graph learning method.
### Action Segmentation (Dataset: 50Salads - split2)
| Model | Feature | F1@0.1 (%) | Acc (%) |
|:--------|:------------:|:-------------------------:|:-------------------------:|
| SPELL | MSTCN++ | **84.7** (up from 83.4) | **85.0** (up from 84.6) |
| SPELL | ASFORMER | **89.8** (up from 86.1) | **88.2** (up from 87.8) |
> Numbers in parentheses indicate the scores without using the suggested graph learning method.
## Requirements
Preliminary requirements:
- Python>=3.7
@@ -26,7 +39,7 @@ pip3 install -r requirements.txt
Alternatively, you can manually install PyYAML, pandas, and [PyG](https://www.pyg.org)>=2.0.3 with CUDA>=11.1

## Installation
After confirming the above requirements are met, run the following commands:
After confirming the above requirements, run the following commands:
```
git clone https://github.com/IntelLabs/GraVi-T.git
cd GraVi-T
@@ -35,7 +48,7 @@ pip3 install -e .

## Getting Started (Active Speaker Detection)
### Annotations
1) Download the annotations from the official site:
1) Download the annotations of AVA-ActiveSpeaker from the official site:
```
DATA_DIR="data/annotations"
@@ -53,7 +66,7 @@ Download `RESNET18-TSM-AUG.zip` from the Google Drive link from [SPELL](https://
> We use the features from the third-party repositories.
### Directory Structure
The data directories should look like as follows:
The data directories should look as follows:
```
|-- data
|-- annotations
@@ -70,7 +83,7 @@ We can perform the experiments on active speaker detection with the default conf
#### Step 1: Graph Generation
Run the following command to generate spatial-temporal graphs from the features:
```
python data/generate_graph.py --features RESNET18-TSM-AUG --ec_mode csi --time_span 90 --tau 0.9
python data/generate_spatial-temporal_graphs.py --features RESNET18-TSM-AUG --ec_mode csi --time_span 90 --tau 0.9
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video, which spans about 90 seconds (specified by `--time_span`).
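
For intuition on how `--tau` shapes the graph, the sketch below illustrates the basic rule of connecting temporally close nodes: every pair of nodes whose timestamps differ by at most `tau` seconds gets a directed edge whose sign encodes the temporal direction. This is a simplified illustration written for this page, not the logic of `generate_spatial-temporal_graphs.py` itself (the real script additionally uses `--ec_mode` to decide which node pairs are eligible for connection); the helper name and the example timestamps are made up.
```
# Illustrative sketch (not the project's generate_spatial-temporal_graphs.py):
# connect every pair of nodes whose timestamps are at most tau seconds apart,
# and record the temporal direction of each edge.
import numpy as np

def build_temporal_edges(timestamps, tau):
    """Return (source, target, sign) lists for all node pairs within tau seconds."""
    src, dst, sign = [], [], []
    for i, ti in enumerate(timestamps):
        for j, tj in enumerate(timestamps):
            if i != j and abs(ti - tj) <= tau:
                src.append(i)
                dst.append(j)
                sign.append(np.sign(ti - tj))  # positive: backward in time, negative: forward
    return src, dst, sign

# Hypothetical timestamps (seconds) of five face-crop nodes, with tau = 0.9
src, dst, sign = build_temporal_edges([0.0, 0.4, 0.8, 2.0, 2.5], tau=0.9)
print(list(zip(src, dst, sign)))
```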

@@ -91,6 +104,9 @@ This will print the evaluation score.
## Getting Started (Action Localization)
Please refer to the instructions in [GETTING_STARTED_AL.md](docs/GETTING_STARTED_AL.md).

## Getting Started (Action Segmentation)
Please refer to the instructions in [GETTING_STARTED_AS.md](docs/GETTING_STARTED_AS.md).

## Contributor
GraVi-T is written and maintained by [Kyle Min](https://sites.google.com/view/kylemin)

@@ -121,7 +137,7 @@ Technical report for Ego4D challenge 2022:

> This “research quality code” is for Non-Commercial purposes and provided by Intel “As Is” without any express or implied warranty of any kind. Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it. Intel does not warrant or assume responsibility for the accuracy or completeness of any information, text, graphics, links or other items within the code. A thorough security review has not been performed on this code. Additionally, this repository may contain components that are out of date or contain known security vulnerabilities.
> AVA-ActiveSpeaker, AVA-Actions: Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it.
> AVA-ActiveSpeaker, AVA-Actions, 50Salads: Please see the dataset's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it.
## Datasets & Models Disclaimer

3 changes: 3 additions & 0 deletions configs/action-localization/ava_actions/SPELL_default.yaml
@@ -2,11 +2,14 @@ exp_name: SPELL_AL_default
model_name: SPELL
graph_name: SLOWFAST-64x2-R101_cdi_90.0_3.0
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 1
channel1: 1024
channel2: 512
proj_dim: 64
final_dim: 80
num_att_heads: 0
dropout: 0.2
lr: 0.0005
wd: 0.0001
19 changes: 19 additions & 0 deletions configs/action-segmentation/50salads/SPELL_default.yaml
@@ -0,0 +1,19 @@
exp_name: SPELL_AS_default
model_name: SPELL
graph_name: ASFORMER_10_10
loss_name: ce_ref
use_spf: False
use_ref: True
w_ref: 5
num_modality: 1
channel1: 64
channel2: 64
final_dim: 19
num_att_heads: 4
dropout: 0.2
lr: 0.0005
wd: 0
batch_size: 1
sch_param: 5
num_epoch: 50
sample_rate: 2
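
For reference, a configuration file like the one above can be loaded into a plain Python dictionary as sketched below; this is our illustration, and it assumes the training tools parse the YAML with PyYAML (which is among the listed requirements).
```
# Small sketch (ours): read the action-segmentation config into a dict.
import yaml

with open('configs/action-segmentation/50salads/SPELL_default.yaml') as f:
    cfg = yaml.safe_load(f)

print(cfg['graph_name'], cfg['final_dim'])  # ASFORMER_10_10 19
```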
@@ -2,11 +2,14 @@ exp_name: SPELL_ASD_default
model_name: SPELL
graph_name: RESNET18-TSM-AUG_csi_90.0_0.9
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 2
channel1: 64
channel2: 16
proj_dim: 64
final_dim: 1
num_att_heads: 0
dropout: 0.2
lr: 0.0005
wd: 0
@@ -2,11 +2,14 @@ exp_name: SPELL_plus_ASD_default
model_name: SPELL
graph_name: RESNET50-TSM-AUG_csi_90.0_0.9
loss_name: bce_logit
use_spf: True
use_ref: False
num_modality: 2
channel1: 64
channel2: 16
proj_dim: 64
final_dim: 1
num_att_heads: 0
dropout: 0.2
lr: 0.0003
wd: 0
@@ -35,7 +35,7 @@ def _get_time_windows(list_fts, time_span):
return twd_all


def generate_graph(data_file, path_graphs, sp):
def generate_graph(data_file, args, path_graphs, sp):
"""
Generate graphs of a single video
Time span of each graph is not greater than "time_span"
@@ -114,7 +114,7 @@ def generate_graph(data_file, path_graphs, sp):

if __name__ == "__main__":
"""
Generate graphs from the extracted features
Generate spatial-temporal graphs from the extracted features
"""

parser = argparse.ArgumentParser()
@@ -129,7 +129,6 @@ def generate_graph(data_file, path_graphs, sp):
parser.add_argument('--time_span', type=float, help='Maximum time span for each graph in seconds', required=True)
parser.add_argument('--tau', type=float, help='Maximum time difference between neighboring nodes in seconds', required=True)

global args
args = parser.parse_args()

# Iterate over train/val splits
@@ -141,6 +140,6 @@ def generate_graph(data_file, path_graphs, sp):
list_data_files = sorted(glob.glob(os.path.join(args.root_data, f'features/{args.features}/{sp}/*.pkl')))

with Pool(processes=20) as pool:
num_graph = pool.map(partial(generate_graph, path_graphs=path_graphs, sp=sp), list_data_files)
num_graph = pool.map(partial(generate_graph, args=args, path_graphs=path_graphs, sp=sp), list_data_files)

print (f'Graph generation for {sp} is finished (number of graphs: {sum(num_graph)})')
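
The change above passes `args` explicitly through `functools.partial` instead of relying on a module-level `global args`, so every worker process receives the parsed arguments directly. A minimal standalone illustration of that pattern (with a made-up function and values):
```
# Minimal illustration (ours): functools.partial binds the fixed arguments so that
# Pool.map only has to supply the one argument that varies per item.
from functools import partial
from multiprocessing import Pool

def process(item, scale):
    return item * scale

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        print(pool.map(partial(process, scale=10), [1, 2, 3]))  # [10, 20, 30]
```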
112 changes: 112 additions & 0 deletions data/generate_temporal_graphs.py
@@ -0,0 +1,112 @@
import os
import glob
import torch
import argparse
import numpy as np
from functools import partial
from multiprocessing import Pool
from torch_geometric.data import Data


def generate_temporal_graph(data_file, args, path_graphs, actions, train_ids, all_ids):
"""
Generate temporal graphs of a single video
"""

video_id = os.path.splitext(os.path.basename(data_file))[0]
feature = np.transpose(np.load(data_file))
num_frame = feature.shape[0]
skip = args.skip_factor

# Get a list of ground-truth action labels
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/groundTruth/{video_id}.txt')) as f:
label = [actions[line.strip()] for line in f]

# Get a list of the edge information: these are for edge_index and edge_attr
node_source = []
node_target = []
edge_attr = []
for i in range(num_frame):
for j in range(num_frame):
# Frame difference between the i-th and j-th nodes
frame_diff = i - j

# The edge ij connects the i-th node and j-th node
# Positive edge_attr indicates that the edge ij is backward (negative: forward)
if abs(frame_diff) <= args.tauf:
node_source.append(i)
node_target.append(j)
edge_attr.append(np.sign(frame_diff))

# Make additional connections between non-adjacent nodes
# This can help reduce over-segmentation of predictions in some cases
elif skip:
if (frame_diff % skip == 0) and (abs(frame_diff) <= skip*args.tauf):
node_source.append(i)
node_target.append(j)
edge_attr.append(np.sign(frame_diff))

# x: features
# g: global_id
# edge_index: information on how the graph nodes are connected
# edge_attr: information about whether the edge is spatial (0) or temporal (positive: backward, negative: forward)
# y: labels
graphs = Data(x = torch.tensor(np.array(feature, dtype=np.float32), dtype=torch.float32),
g = all_ids.index(video_id),
edge_index = torch.tensor(np.array([node_source, node_target], dtype=np.int64), dtype=torch.long),
edge_attr = torch.tensor(edge_attr, dtype=torch.float32),
y = torch.tensor(np.array(label, dtype=np.int64)[::args.sample_rate], dtype=torch.long))

if video_id in train_ids:
torch.save(graphs, os.path.join(path_graphs, 'train', f'{video_id}.pt'))
else:
torch.save(graphs, os.path.join(path_graphs, 'val', f'{video_id}.pt'))


if __name__ == "__main__":
"""
Generate temporal graphs from the extracted features
"""

parser = argparse.ArgumentParser()
# Default paths for the training process
parser.add_argument('--root_data', type=str, help='Root directory to the data', default='./data')
parser.add_argument('--dataset', type=str, help='Name of the dataset', default='50salads')
parser.add_argument('--features', type=str, help='Name of the features', required=True)

# Hyperparameters for the graph generation
parser.add_argument('--tauf', type=int, help='Maximum frame difference between neighboring nodes', required=True)
parser.add_argument('--skip_factor', type=int, help='Make additional connections between non-adjacent nodes', default=10)
parser.add_argument('--sample_rate', type=int, help='Downsampling rate for the input', default=2)

args = parser.parse_args()

# Build a mapping from action classes to action ids
actions = {}
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/mapping.txt')) as f:
for line in f:
aid, cls = line.strip().split(' ')
actions[cls] = int(aid)

# Get a list of all video ids
all_ids = sorted([os.path.splitext(v)[0] for v in os.listdir(os.path.join(args.root_data, f'annotations/{args.dataset}/groundTruth'))])

# Iterate over different splits
print ('This process might take a few minutes')

list_splits = sorted(os.listdir(os.path.join(args.root_data, f'features/{args.features}')))
for split in list_splits:
# Get a list of training video ids
with open(os.path.join(args.root_data, f'annotations/{args.dataset}/splits/train.{split}.bundle')) as f:
train_ids = [os.path.splitext(line.strip())[0] for line in f]

path_graphs = os.path.join(args.root_data, f'graphs/{args.features}_{args.tauf}_{args.skip_factor}/{split}')
os.makedirs(os.path.join(path_graphs, 'train'), exist_ok=True)
os.makedirs(os.path.join(path_graphs, 'val'), exist_ok=True)

list_data_files = sorted(glob.glob(os.path.join(args.root_data, f'features/{args.features}/{split}/*.npy')))

with Pool(processes=20) as pool:
pool.map(partial(generate_temporal_graph, args=args, path_graphs=path_graphs, actions=actions, train_ids=train_ids, all_ids=all_ids), list_data_files)

print (f'Graph generation for {split} is finished')
6 changes: 3 additions & 3 deletions docs/GETTING_STARTED_AL.md
@@ -1,6 +1,6 @@
## Getting Started (Action Localization)
### Annotations
Download the annotations from the official site:
Download the annotations of AVA-Actions from the official site:
```
DATA_DIR="data/annotations"
@@ -16,7 +16,7 @@ Download `SLOWFAST-64x2-R101.zip` from the Google Drive link from [SPELL](https:
> We use the features from the third-party repositories. SLOWFAST-64x2-R101 is obtained by using the official code of [SlowFast](https://github.com/facebookresearch/SlowFast) with the pretrained checkpoint ([SLOWFAST_64x2_R101_50_50.pkl](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/SLOWFAST_64x2_R101_50_50.pkl)) in [SlowFast Model Zoo](https://github.com/facebookresearch/SlowFast/blob/main/MODEL_ZOO.md).
### Directory Structure
The data directories should look like as follows:
The data directories should look as follows:
```
|-- data
|-- annotations
@@ -34,7 +34,7 @@ We can perform the experiments on action localization with the default configura
#### Step 1: Graph Generation
Run the following command to generate spatial-temporal graphs from the features:
```
python data/generate_graph.py --features SLOWFAST-64x2-R101 --ec_mode cdi --time_span 90 --tau 3
python data/generate_spatial-temporal_graphs.py --features SLOWFAST-64x2-R101 --ec_mode cdi --time_span 90 --tau 3
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video, which spans about 90 seconds (specified by `--time_span`).

49 changes: 49 additions & 0 deletions docs/GETTING_STARTED_AS.md
@@ -0,0 +1,49 @@
## Getting Started (Action Segmentation)
### Annotations
We suggest using the same set of annotations used by [MS-TCN++](https://github.com/sj-li/MS-TCN2) and [ASFormer](https://github.com/ChinaYi/ASFormer). Download the 50Salads dataset from the links provided by either of the two repositories.

### Features
We suggest extracting the features using [ASFormer](https://github.com/ChinaYi/ASFormer). Use their repository and the pre-trained model checkpoints ([link](https://github.com/ChinaYi/ASFormer/tree/main#reproduce-our-results)) to extract the frame-wise features for each split of the dataset. Extract the features from each of the four refinement layers and concatenate them: specifically, you can concatenate the 64-dimensional features from this [line](https://github.com/ChinaYi/ASFormer/blob/main/model.py#L315), which gives 256-dimensional (frame-wise) features. Similarly, you can extract MS-TCN++ features from this [line](https://github.com/sj-li/MS-TCN2/blob/master/model.py#L23).
> We use the features from the third-party repositories.
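
The sketch below shows only the concatenate-and-save step described above. It assumes you have already collected one 64-dimensional, frame-wise feature array per refinement layer while running the ASFormer (or MS-TCN++) inference code; the array shapes, file name, and output directory are illustrative rather than taken from those repositories.
```
# Hypothetical sketch (ours) of the concatenation step: four (64, num_frames) arrays
# from the refinement layers are stacked into a single (256, num_frames) feature file.
import os
import numpy as np

num_frames = 1000  # stand-in length of one video
refinement_outputs = [np.random.randn(64, num_frames).astype(np.float32) for _ in range(4)]

features = np.concatenate(refinement_outputs, axis=0)  # shape: (256, num_frames)

out_dir = 'data/features/ASFORMER/split2'
os.makedirs(out_dir, exist_ok=True)
np.save(os.path.join(out_dir, 'rgb-01-1.npy'), features)  # file name is illustrative
```
Saving the features as `(feature_dim, num_frames)` arrays matches `generate_temporal_graphs.py`, which transposes them on load.
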
### Directory Structure
The data directories should look as follows:
```
|-- data
|-- annotations
|-- 50salads
|-- groundTruth
|-- splits
|-- mapping.txt
|-- features
|-- ASFORMER
|-- split1
|-- split2
|-- split3
|-- split4
|-- split5
```

### Experiments
We can perform the experiments on action segmentation with the default configuration by following the three steps below.

#### Step 1: Graph Generation
Run the following command to generate temporal graphs from the features:
```
python data/generate_temporal_graphs.py --features ASFORMER --tauf 10
```
The generated graphs will be saved under `data/graphs`. Each graph captures long temporal context information in a video.
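
As an optional sanity check (our suggestion, not part of the official instructions), you can load one of the generated graphs and inspect it. The output directory name combines the feature name, `--tauf`, and the skip factor (e.g. `ASFORMER_10_10`, which matches `graph_name` in the default config); the video file name below is illustrative, and recent PyTorch versions may additionally require `weights_only=False` in `torch.load`.
```
# Optional sanity check (ours): inspect one generated temporal graph.
import torch

g = torch.load('data/graphs/ASFORMER_10_10/split2/val/rgb-01-1.pt')
print(g.x.shape)           # node features
print(g.edge_index.shape)  # graph connectivity: (2, num_edges)
print(g.y.shape)           # frame-wise action labels
```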

#### Step 2: Training
Next, run the training script by passing the default configuration file. You also need to specify which split to perform the experiments on:
```
python tools/train_context_reasoning.py --cfg configs/action-segmentation/50salads/SPELL_default.yaml --split 2
```
The results and logs will be saved under `results`.

#### Step 3: Evaluation
Now, we can evaluate the trained model's performance:
```
python tools/evaluate.py --exp_name SPELL_AS_default --eval_type AS
```
This will print the evaluation scores.
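
For reference, the reported numbers follow the standard action-segmentation protocol: frame-wise accuracy (Acc) and segmental F1 at an IoU threshold of 0.1 (F1@0.1). The sketch below is a compact re-implementation of that protocol for illustration only, not the code inside `tools/evaluate.py`; background-class handling is omitted for brevity.
```
# Compact sketch (ours) of the standard action-segmentation metrics:
# frame-wise accuracy and segmental F1 at IoU threshold 0.1.
import numpy as np

def to_segments(labels):
    """Split a frame-wise label sequence into (class, start, end) runs."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_iou(pred, gt, thr=0.1):
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    matched = [False] * len(g_segs)
    tp = 0
    for c, ps, pe in p_segs:
        # IoU with every ground-truth segment of the same class
        ious = [(max(0, min(pe, ge) - max(ps, gs)) / (max(pe, ge) - min(ps, gs)), j)
                for j, (gc, gs, ge) in enumerate(g_segs) if gc == c]
        best_iou, best_j = max(ious, default=(0.0, -1))
        if best_j >= 0 and best_iou >= thr and not matched[best_j]:
            tp += 1
            matched[best_j] = True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

pred = [0, 0, 1, 1, 1, 2, 2, 2]
gt   = [0, 0, 0, 1, 1, 2, 2, 2]
print('Acc:', np.mean(np.array(pred) == np.array(gt)))
print('F1@0.1:', f1_at_iou(pred, gt))
```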
2 changes: 1 addition & 1 deletion gravit/__init__.py
@@ -3,4 +3,4 @@
try:
__version__ = get_distribution('gravit').version
except:
__version__ = '1.0.0'
__version__ = '1.1.0'