
Performance evaluation of PySceneDetect in terms of both latency and accuracy #481

@awkrail

Description


Problem/Use Case

Currently, PySceneDetect does not support evaluation. However, evaluating its performance is crucial for further development. I propose a feature to integrate evaluation code into PySceneDetect. This issue describes the procedure.

Solutions

Datasets

To evaluate performance, we need datasets that consist of videos and manually annotated shots. I surveyed shot detection on Google Scholar and found that the following datasets have been proposed. I think BCC and RAI are a good starting point: they are frequently used in the shot detection literature, and their small size makes them easy to download. In addition, Kinetics-GEBD, ClipShot, and AutoShot collected their videos from YouTube, so using them in our evaluation protocol may violate YouTube's policy.

| Dataset | Conference | Domain | #videos | Avg. video length (sec) | #citations | Paper title |
|---|---|---|---|---|---|---|
| BCC | ACMMM15 | Broadcast | 11 | 2,945 | 133 | A deep siamese network for scene detection in broadcast videos |
| RAI | CAIP15 | Broadcast | 10 | 591 | 86 | Shot and scene detection via clustering for re-using broadcast video |
| Kinetics-GEBD | ICCV21 | General | 55351 | n/a | 81 | Generic Event Boundary Detection: A Benchmark for Event Detection |
| ClipShot | ACCV18 | General | 4039 | 237 | 54 | Fast Video Shot Transition Localization with Deep Structured Models |
| AutoShot | CVPR Workshop 23 | General | 853 | 39 | 13 | AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection |
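
To connect these datasets to the metric below, here is a minimal loading sketch. It assumes a hypothetical plain-text annotation format with one boundary frame number per line; the actual RAI and BCC annotation formats need to be checked before implementation.

```python
from pathlib import Path


def load_boundaries(annotation_file: str) -> list[int]:
    """Load ground-truth shot boundary frame numbers for one video
    (assumed format: one frame number per line; this is an assumption)."""
    lines = Path(annotation_file).read_text().splitlines()
    return sorted(int(line) for line in lines if line.strip())
```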

Metrics

The previous literature uses recall, precision, and F1 score to evaluate shot detection methods. Let $\hat{Y}=(\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_k, \cdots, \hat{y}_K)$ be the predicted shot boundary frame numbers and $Y=(y_1, y_2, \cdots, y_l, \cdots, y_L)$ be the manually annotated shot boundary frame numbers.
Recall, precision, and F1 are calculated as in the following Python code:

```python
def compute_f1(hat_ys, ys):
    # A prediction hat_y counts as correct if abs(hat_y - y) <= threshold
    # for some ground-truth boundary y.
    threshold = 5
    correct = 0
    for hat_y in hat_ys:
        if min(abs(hat_y - y) for y in ys) <= threshold:
            correct += 1
    recall = correct / len(ys)
    precision = correct / len(hat_ys)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

Note that this code provides only a rough overview of the evaluation process. For a precise implementation, I will need to handle edge cases (e.g., the many-to-one case in which two predictions hat_y match the same ground-truth y); one possible approach is sketched below.
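
For example, a one-to-one constraint could be enforced by letting each ground-truth boundary be matched by at most one prediction, as in the greedy sketch below. This is only one possible matching policy and not necessarily the one used in prior work.

```python
def count_matches_one_to_one(hat_ys, ys, threshold=5):
    """Greedy one-to-one matching: each ground-truth boundary y can be
    consumed by at most one prediction hat_y, so two predictions near the
    same y no longer both count as correct."""
    unmatched = set(ys)
    correct = 0
    for hat_y in sorted(hat_ys):
        # Ground-truth boundaries still available within the tolerance window.
        candidates = [y for y in unmatched if abs(hat_y - y) <= threshold]
        if candidates:
            best = min(candidates, key=lambda y: abs(hat_y - y))
            unmatched.remove(best)
            correct += 1
    return correct
```

Recall, precision, and F1 would then be computed from this match count exactly as above.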

Implementation

I believe two evaluation modes are necessary: a local mode and a CI mode.
For the local mode, I created an evaluation/ directory in the repository root and wrote Python scripts to run evaluations on a local machine.
For the CI mode, based on the evaluation/ directory, we set up GitHub Actions to run the evaluation commands automatically whenever new commits are pushed.
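
To illustrate what the local mode could look like, here is a minimal sketch of a hypothetical evaluation/evaluate.py entry point. The file name, CLI flags, annotation format, and the evaluation.metrics module are assumptions on my side; only the detect()/ContentDetector() calls are existing PySceneDetect API. The CI mode would simply invoke the same script from a GitHub Actions workflow.

```python
# Hypothetical evaluation/evaluate.py (sketch): run PySceneDetect on one video,
# measure latency, and score the detected boundaries against ground truth.
import argparse
import time

from scenedetect import ContentDetector, detect

# compute_f1 is the metric sketch from the Metrics section above; the module
# path evaluation.metrics is an assumption.
from evaluation.metrics import compute_f1


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--video", required=True, help="path to an input video")
    parser.add_argument("--annotation", required=True,
                        help="ground-truth boundary frame numbers, one per line")
    args = parser.parse_args()

    start = time.time()
    scenes = detect(args.video, ContentDetector())
    latency = time.time() - start

    # Predicted boundaries: the start frame of every scene except the first.
    hat_ys = [scene[0].get_frames() for scene in scenes[1:]]
    ys = [int(line) for line in open(args.annotation) if line.strip()]

    recall, precision, f1 = compute_f1(hat_ys, ys)
    print(f"latency={latency:.2f}s recall={recall:.3f} "
          f"precision={precision:.3f} f1={f1:.3f}")


if __name__ == "__main__":
    main()
```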

Questions

How do we store the RAI and BCC video datasets? Because the video files are larger than GitHub's file-size limit (100 MB), we need a storage service.
Zenodo is one candidate because it allows us to host datasets for academic purposes and to download them in a CLI-friendly manner (e.g., with curl or wget).
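
For illustration, downloading a hosted archive could be as simple as the sketch below; the record ID and file name are placeholders, since no Zenodo record exists yet.

```python
import urllib.request

# Placeholder URL; a real Zenodo record would need to be created first.
ZENODO_URL = "https://zenodo.org/records/<RECORD_ID>/files/rai_shots.zip"


def download_dataset(url: str = ZENODO_URL, out_path: str = "rai_shots.zip") -> None:
    """Fetch a dataset archive over HTTPS (roughly equivalent to `wget <url>`)."""
    urllib.request.urlretrieve(url, out_path)


if __name__ == "__main__":
    download_dataset()
```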
