Description
Problem/Use Case
Currently, PySceneDetect does not support evaluation. However, evaluating its performance is crucial for further development. I propose integrating evaluation code into PySceneDetect; this issue describes the procedure.
Solutions
Datasets
To evaluate performance, we need datasets consisting of videos with manually annotated shots. I surveyed shot detection on Google Scholar and found that the following datasets have been proposed. I think BCC and RAI are a good starting point: both are frequently used in the shot detection literature, and their small size makes them easy to download. In addition, Kinetics-GEBD, ClipShot, and AutoShot collected videos from YouTube, so using them in our evaluation protocol may violate YouTube's policy.
Dataset | Venue | Domain | # Videos | Avg. video length (s) | # Citations | Paper title |
---|---|---|---|---|---|---|
BCC | ACMMM15 | Broadcast | 11 | 2,945 | 133 | A deep siamese network for scene detection in broadcast videos |
RAI | CAIP15 | Broadcast | 10 | 591 | 86 | Shot and scene detection via clustering for re-using broadcast video |
Kinetics-GEBD | ICCV21 | General | 55351 | n/a | 81 | Generic Event Boundary Detection: A Benchmark for Event Detection |
ClipShot | ACCV18 | General | 4039 | 237 | 54 | Fast Video Shot Transition Localization with Deep Structured Models |
AutoShot | CVPR Workshop 23 | General | 853 | 39 | 13 | AutoShot: A Short Video Dataset and State-of-the-Art Shot Boundary Detection |
Metrics
Previous literature uses recall, precision, and F1 score to evaluate shot detection methods. Let `hat_ys` be the predicted shot boundaries and `ys` the ground-truth boundaries (both as frame indices).
Recall and precision are calculated as in the following Python code:
```python
def compute_f1(hat_ys, ys):
    threshold = 5  # a prediction hat_y is accurate if abs(hat_y - y) <= threshold for some y
    correct = 0
    for hat_y in hat_ys:
        if min(abs(hat_y - y) for y in ys) <= threshold:
            correct += 1
    recall = correct / len(ys)
    precision = correct / len(hat_ys)
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```
Note that this code provides only a rough overview of the evaluation process. For a precise implementation, I will need to work out edge cases (e.g., the many-to-one case, where two `hat_y` values correspond to the same `y`).
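One way to resolve the many-to-one case is a greedy one-to-one matching, where each ground-truth boundary can be claimed by at most one prediction. The sketch below is only a possible implementation (the function name and the greedy nearest-match strategy are my assumptions, not an agreed-upon design):

```python
def compute_f1_one_to_one(hat_ys, ys, threshold=5):
    """Recall/precision/F1 with one-to-one matching between predictions and ground truth."""
    matched = set()  # indices of ground-truth boundaries already claimed
    correct = 0
    for hat_y in sorted(hat_ys):
        # Find the nearest *unmatched* ground-truth boundary within the threshold.
        candidates = [
            (abs(hat_y - y), i)
            for i, y in enumerate(ys)
            if i not in matched and abs(hat_y - y) <= threshold
        ]
        if candidates:
            _, i = min(candidates)
            matched.add(i)
            correct += 1
    recall = correct / len(ys) if ys else 0.0
    precision = correct / len(hat_ys) if hat_ys else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1
```

For example, with `hat_ys = [10, 12]` and `ys = [11, 100]`, the naive count would mark both predictions correct against the single boundary 11, while the one-to-one version counts only one of them.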
Implementation
I believe two evaluation modes are necessary: a local mode and a CI mode.
For the local mode, I created an evaluation/ directory in the repository root and wrote Python scripts to run evaluations on a local machine.
For the CI mode, we can set up GitHub Actions on top of the evaluation/ directory to automatically run the evaluation commands whenever new commits are pushed.
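For the CI mode, a minimal GitHub Actions workflow could look like the following. Note that this is only a sketch: the workflow file name, the trigger branch, and the `evaluation/run.py` entry point are assumptions, not existing files.

```yaml
# .github/workflows/evaluation.yml (hypothetical)
name: evaluation
on:
  push:
    branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e .
      # Fetch the RAI/BCC datasets (e.g., from Zenodo), then run the evaluation scripts.
      - run: python evaluation/run.py
```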
Questions
How do we store the RAI and BCC video datasets? Because the video files are larger than GitHub's file-size limit (100 MB), we need a storage service.
Zenodo is one candidate because it allows storing datasets for academic purposes and supports CLI-friendly downloads (e.g., via curl or wget).
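Whichever host we choose, the local scripts should verify downloaded archives against published checksums. A minimal sketch, where the download URL and expected hash passed by the caller are placeholders rather than real Zenodo records:

```python
import hashlib
import urllib.request


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def download_and_verify(url: str, expected_sha256: str) -> bytes:
    """Download a dataset archive and check it against a published checksum."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    if sha256_of(data) != expected_sha256:
        raise ValueError(f"checksum mismatch for {url}")
    return data
```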