Add the S3D architecture to TorchVision #6412

Merged on Aug 19, 2022 (22 commits)


sophiazhi
Contributor

@sophiazhi sophiazhi commented Aug 12, 2022

Fixes #6402

Add S3D model for video classification.

@sophiazhi changed the title from "S3D initial commit" to "Onboard S3D model" on Aug 12, 2022
Contributor

@datumbox datumbox left a comment

Thanks @sophiazhi, the PR looks good overall. I have added a few comments, please let me know what you think.

@datumbox
Contributor

It might be worth looking into the original TF implementation to confirm you implement S3D as expected. I know you don't do the Gated variant but this reference is still valid. As you can see the final AvgPool_0a_7x7 layer uses a kernel=(2,7,7) and performs the reduction as I described at #6412 (comment).
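To make the pooling arithmetic concrete, here is a minimal sketch of what a (2,7,7) average-pooling kernel with stride 1 and no padding does to a feature map. It is not the TorchVision or TF code: the helper names and the feature-map sizes are assumptions chosen for illustration. It only shows that the 7x7 spatial map collapses to 1x1 while a longer clip leaves several temporal positions that still need to be reduced, for example by averaging.

```python
def pool_output_size(size, kernel, stride=1, padding=0):
    """Output length of a pooling window along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def avgpool_shape(feature_shape, kernel=(2, 7, 7)):
    """Shape (T, H, W) after average pooling with stride 1 and no padding."""
    return tuple(pool_output_size(s, k) for s, k in zip(feature_shape, kernel))


# Hypothetical feature-map sizes (T, H, W) entering the final pooling layer.
short_clip = avgpool_shape((2, 7, 7))  # spatial and temporal dims collapse
long_clip = avgpool_shape((8, 7, 7))   # several temporal positions remain

print(short_clip, long_clip)
```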

It's also worth noting that they are using a dropout layer at the end. See this idiom on how to pass the parameter to the network.

Finally, is there any reason we implement S3D and not its gated variant S3D-G, which is marginally more accurate?

Contributor

@datumbox datumbox left a comment

Minus reproducing the accuracy, I think we are almost there. Just minor comments left.

@jdsgomes
Contributor

2. Instead of processing all the frames (250) at once, please try dividing a video into multiple segments (each segment consists of 64 frames) and taking the average for the final prediction
3. We used the highest value of jpeg quality (minimal compression) both in the training and inference processes
4. I think the interpolation also does matter. Please try bicubic resizing.

Thank you,

@sophiazhi Thanks a lot for your work. LGTM.

I am approving this PR on the basis that the model achieves decent accuracy for its size and that it unblocks Multimodal. On the cons side, we are using ported weights from an unofficial repo, which seem to lag by 5 accuracy points compared to the paper. To mitigate this I have:

  1. Pinged the author of the repo to clarify the exact inference setup they use.
  2. Started an internal job (id 55651) to train the network from scratch and potentially replace the weights if we get better accuracy.

Since the gap in accuracy is on the borderline of what we usually consider acceptable and because I've made multiple commits on the PR, I would like a second opinion on how we should progress here.

@jdsgomes @langong347 Please let me know your thoughts.

tldr; from my point of view this can be merged to unblock but ideally we would follow up and try to close the gap.

This is a difficult one, since the gap is larger than we would normally accept (at least from recent contributions; we might find some cases where this was different). I see that @kylemin suggested some changes in the testing setup, so it might be worth trying them to see if we can close the gap (point 1 you already did as far as I know, but it's not clear if you did the others).

Another option would be to make it internal only to unblock until we can close the gap.

class S3D_Weights(WeightsEnum):
    KINETICS400_V1 = Weights(
        url="https://download.pytorch.org/models/s3d-1bd8ae63.pth",
        transforms=partial(

I think the transforms function should be modified to be consistent with ours. Please refer to this link

@datumbox
Contributor

@kylemin Thank you very much for the prompt reply. I have some additional questions to ensure we are doing the right thing on our side:

I think the transforms function should be modified to be consistent with ours. Please refer to this link

I understand that your method scales the input to [-1, 1]. It's not immediately obvious, but that's also what we do. Is there any other inconsistency you spotted that we should fix?

Instead of using the full-resolution frames, please try performing a 5-crop (each crop is 224x224) inference

So instead of resizing the video to 256x256, you propose to do a 5-crop of 224x224 each. Correct?

Instead of processing all the frames (250) at once, please try dividing a video into multiple segments (each segment consists of 64 frames) and taking the average for the final prediction

Sounds good. Do you remember how many 64-frame clips you produce for each Video?

I think the interpolation also does matter. Please try bicubic resizing.

Sounds good, I'll make the change. Thanks for pointing this out.

@kylemin

kylemin commented Aug 19, 2022

@datumbox I am not sure there is another inconsistency, but I believe that modifying the transforms function would improve the accuracy and mostly resolve the performance discrepancy. Yes, I meant a 5-crop of 224x224. I remember that this improved the inference accuracy a little bit. I think the rest (dividing a video into multiple segments, jpeg quality, bicubic interpolation) can wait and is not high priority, because these might not be the reason for the non-negligible performance discrepancy. I am unsure whether I processed 250 frames at once or not, but I remember that I once tested using four 64-frame clips for each video. Sorry for this uncertainty, but I hope the problem is resolved by modifying the transforms function and scaling the input to [-1, 1].
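As a rough illustration of the "four 64-frame clips" idea, the sketch below places a fixed number of equal-length segments evenly across a video, letting them overlap when the video is too short for disjoint segments. The function name and the even-spacing rule are assumptions; the thread does not say how the clips were actually positioned. The per-segment predictions would then be averaged to get the final prediction.

```python
def evenly_spaced_segments(num_frames, segment_len=64, num_segments=4):
    """Return (start, end) frame ranges spread evenly over the video.

    Segments may overlap; `end` is exclusive.
    """
    if num_frames < segment_len:
        raise ValueError("video is shorter than one segment")
    if num_segments == 1:
        return [(0, segment_len)]
    last_start = num_frames - segment_len
    starts = [(i * last_start) // (num_segments - 1) for i in range(num_segments)]
    return [(start, start + segment_len) for start in starts]


# A 250-frame video split into four 64-frame segments.
segments = evenly_spaced_segments(250, segment_len=64, num_segments=4)
print(segments)
```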

@datumbox
Contributor

@kylemin Thanks for all the info.

I hope the problem is resolved by modifying the transforms function and scaling the input to [-1, 1].

Unfortunately that's what we do already. I know it doesn't look like it immediately when you check the code but basically we:

  • First rescale the values from 0-255 to 0-1
  • Then we subtract 0.5 and divide by 0.5
  • This leads to rescaling the input to [-1, 1] scale

I thus believe the transform we use is completely aligned with yours. Despite doing this from the beginning, we still face a -5% accuracy gap.
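The equivalence of the two-step normalization and a direct rescale to [-1, 1] can be checked with a few lines. This is a minimal sketch on single pixel values, not the TorchVision transform itself; the function names are made up for illustration.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Map an 8-bit pixel value to [-1, 1] in two steps."""
    scaled = value / 255.0        # 0-255 -> 0-1
    return (scaled - mean) / std  # 0-1 -> [-1, 1]


def rescale_directly(value):
    """Single-step equivalent: 0-255 -> [-1, 1]."""
    return value / 127.5 - 1.0


for pixel in (0, 128, 255):
    print(pixel, normalize_pixel(pixel), rescale_directly(pixel))
```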

I fully appreciate it's been years since you did the work and it's probably hard to remember all the details. I really appreciate the answers. I'll have a look at the rest of the recommendations and let you know what we find.

@datumbox
Contributor

datumbox commented Aug 19, 2022

I am going to revert the 5/10 crop update I did on transforms because it leads to incompatible batch & collate logic in the reference scripts. The issue is that the aforementioned augmentations make new videos "pop up", and thus the lengths of labels and ids no longer match. It would require far more work to make the references work with this than I think is worth it at this point. Since I plan to hard reset the branch, you can find the full changes here. Part of the reason I'm not making the rest of the required changes is that the original paper explicitly mentions they do single central crops for inference.
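The length mismatch described above can be reproduced with a toy batch. This is only a sketch: the crop-offset helper is a simplified stand-in for a 5-crop transform, and the video ids and labels are hypothetical.

```python
def five_crop_offsets(height, width, crop):
    """Return (top, left) offsets for four corner crops and one center crop."""
    bottom, right = height - crop, width - crop
    center = (bottom // 2, right // 2)
    return [(0, 0), (0, right), (bottom, 0), (bottom, right), center]


video_ids = ["vid_a", "vid_b"]  # hypothetical batch of two videos
labels = [3, 7]                 # one label per video

# Each video turns into five samples, so the batch grows from 2 to 10.
crops = [(vid, box) for vid in video_ids for box in five_crop_offsets(256, 256, 224)]

# The labels would have to be repeated to line up with the crops again.
expanded_labels = [label for label in labels for _ in range(5)]

print(len(crops), len(labels), len(expanded_labels))
```

Any collate or scoring code that assumes one sample per video would need a similar expansion (and a later regrouping by video id), which is the extra work the comment refers to.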

Switching to BICUBIC interpolation marginally reduces the accuracy by ~0.1. Also, digging into the parent repo of Kyle's, I see that they are using CV2's default interpolation for resize, which is bilinear. I think there might be some additional detail that we are missing on the inference, possibly related to how the video was scored on the temporal dimension. But I can certainly confirm that our preprocessing is implemented exactly as described in the paper and is aligned with Kyle's repo.

Contributor

@jdsgomes jdsgomes left a comment

Checked the latest changes and LGTM!
thanks

@datumbox datumbox merged commit 6de7021 into pytorch:main Aug 19, 2022

@langong347 langong347 left a comment

Sorry didn't get a chance to review yesterday. Had a question about the test input. The code looks good to me.

As for upstreaming to torchvision: we can use our local version S3D for the moment while the weights are getting finalized. Personally, I don't think you need to make this implementation internal as long as the discrepancy of 5% in accuracy is temporarily acceptable.

@@ -312,6 +312,9 @@ def _check_input_backprop(model, inputs):
    "mvit_v2_s": {
        "input_shape": (1, 3, 16, 224, 224),
    },
    "s3d": {
        "input_shape": (1, 3, 16, 224, 224),

Why are we looking for this particular input shape during test?

Contributor

The architecture doesn't support the default input size (which is smaller) so here we define the "smallest reasonable" input to make the test work.

@datumbox
Contributor

@langong347 We have merged the PR and it should appear on the next nightly tomorrow. The weights we use here are identical to the ones you use on your local repo. This means that you don't have to wait for anything to adopt the version contributed by Sophia.

@kylemin

kylemin commented Aug 19, 2022

@datumbox Could we test if loading images in BGR instead of RGB can recover the performance? I found this log of S3D experiments and I think that how each video is scored on the temporal dimension does not affect the performance that much.
image
About the highest value of jpeg quality, I meant when you extract frames from videos. FFmpeg extracts frames of low quality by default, so I think I changed the setup. However, I don't think that explains such a large performance gap. I'll try to log in to my previous school server to see if the information about how I extracted the frames and how I performed the inference is still there. Thanks!

@kylemin

kylemin commented Aug 20, 2022

So I've found a script that was used to extract the frames using ffmpeg and to save them in jpeg format. I am not sure this helps at this point, but anyway I'll leave it here. Thanks

import os
import subprocess
from collections import OrderedDict

from joblib import delayed
from joblib import Parallel
import pandas as pd
import numpy as np
#import jpeg4py as jpeg
import cv2
import csv
import h5py


def download_clip(video_identifier, start_time, end_time, sp, lstr, label):
    tmp_filename = os.path.join('../Kinetics-400', sp, '%s'%lstr, '%s_%06d_%06d.mp4' % (video_identifier, start_time, end_time))

    proc = subprocess.Popen('ffprobe -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 "' + tmp_filename + '"', stdout=subprocess.PIPE, shell=True)
    (out, err) = proc.communicate()
    if err is not None:
        return None, None, None, 0
    else:
        if out:
            w, h = out.decode().strip().split(',')
            w = int(w)
            h = int(h)
            scale = ''
            if w >= h:
                w = int(float(w)*256./float(h))
                w = w - w%2
                h = 256
                if w > 384:
                    w = 384
            else:
                h = int(float(h)*256./float(w))
                h = h - h%2
                w = 256
                if h > 384:
                    h = 384
            scale = '%s:%s' % (w,h)
        else:
            return None, None, None, 0

    command = ['ffmpeg', '-hide_banner','-loglevel', 'quiet', '-debug', '0',#'panic'
               '-i', '"%s"' % tmp_filename,
               '-f', 'image2pipe', '-qscale:v', '1',
               '-filter:v', "scale=%s"%scale, '-r','25', #'25'
               '-threads', '1', '-pix_fmt', 'bgr24','-vcodec','rawvideo','-']
    command = ' '.join(command)

    max_num = 250
    num_frame = max_num
    np_frame = np.ndarray((1, max_num), dtype='object')
    pipe = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE, bufsize=10**7)
    for i in range(max_num):
        try:
            if i < num_frame:
                raw_image = pipe.stdout.read(h*w*3)
                # np.frombuffer / tobytes replace the deprecated fromstring / tostring
                img = np.frombuffer(raw_image, dtype='uint8')
                img = img.reshape((h, w, 3))
                jpg = cv2.imencode('.jpg', img, [cv2.IMWRITE_JPEG_QUALITY, 99])[1]
                jpg = jpg.squeeze()
                np_frame[0, i] = np.frombuffer(jpg.tobytes(), dtype='uint8')
                pipe.stdout.flush()
            else:
                np_frame[0, i] = np.array([], dtype=np.uint8)
        except Exception:
            if num_frame > i:
                num_frame = i
            np_frame[0, i] = np.array([], dtype=np.uint8)

    np_label = None
    if label >= 0:
        np_label = np.zeros((1,), dtype=int)
        np_label[0] = label
    np_shape = np.zeros((1, 4), dtype=int)
    np_shape[0] = [num_frame, h, w, 3]

    return np_label, np_shape, np_frame, num_frame


def download_clip_wrapper(i, row, sp, lstr='', label=-1):
    np_label, np_shape, np_frame, num_frame = download_clip(row['video-id'], row['start-time'], row['end-time'], sp, lstr, label)

    status = tuple([i, np_label, np_shape, np_frame, num_frame])
    return status


def parse_kinetics_annotations(input_csv, ignore_is_cc=False):
    df = pd.read_csv(input_csv)
    if 'youtube_id' in df.columns:
        columns = OrderedDict([
            ('youtube_id', 'video-id'),
            ('time_start', 'start-time'),
            ('time_end', 'end-time'),
            ('label', 'label-name')])
        df.rename(columns=columns, inplace=True)
        if ignore_is_cc:
            df = df.loc[:, df.columns.tolist()[:-1]]
    return df


def main():
    sp = 'val'
    target = '/z/home/kylemin/dataset/Kinetics-new/Kinetics-400h/'
    class_label = tuple(csv.reader(open('../class_label.csv', 'r')))
    video_list = list(csv.reader(open(os.path.join(target, sp+'_list.csv'), 'r')))

    with h5py.File(os.path.join(target, sp+'.h5'), 'w', libver='latest') as hf:
        list_input = []
        if sp == 'test':
            list_input.append(['youtube_id','time_start','time_end','split'])
        else:
            list_input.append(['label','youtube_id','time_start','time_end','split'])

        for vs in video_list:
           cls = vs[0].split('_')
           if sp == 'test':
               list_input.append([vs[0][:11],int(cls[-2]),int(cls[-1])])
           else:
               list_input.append([vs[1],vs[0][:11],int(cls[-2]),int(cls[-1]),sp])

        input_csv = 'input.csv'
        with open(input_csv, 'w') as f:
            writer = csv.writer(f)
            writer.writerows(list_input)

        dataset = parse_kinetics_annotations(input_csv)

        if sp == 'test':
            status_lst = Parallel(n_jobs=10)(delayed(download_clip_wrapper)(
                j, row, sp) for j, row in dataset.iterrows())
        else:
            status_lst = Parallel(n_jobs=10)(delayed(download_clip_wrapper)(
                j, row, sp, row['label-name'], class_label.index([row['label-name']])) for j, row in dataset.iterrows())

        #status_lst = []
        #for j, row in dataset.iterrows():
        #    status_lst.append(download_clip_wrapper(j, row, sp, row['label-name'], class_label.index([row['label-name']])))

        idx_list = []
        for k, l, s, f, nf in status_lst:
            if s is not None and f is not None and nf > 15:
                idx_list.append(k)

        num_video = len(idx_list)
        if sp != 'test':
            dset_labels = hf.create_dataset('labels', shape=(num_video,), dtype=int)
        dset_shapes = hf.create_dataset('shapes', shape=(num_video,4), dtype=int)
        dt = h5py.special_dtype(vlen=np.dtype('uint8'))
        dset_frames = hf.create_dataset('videos', shape=(num_video,250), dtype=dt)

        if sp != 'test':
            np_labels = np.zeros((num_video,), dtype=int)
        np_shapes = np.zeros((num_video, 4), dtype=int)
        np_frames = np.ndarray((num_video, 250), dtype='object')
        for k, l, s, f, nf in status_lst:
            if s is not None and f is not None and nf > 15:
                idx = idx_list.index(k)
                if sp != 'test':
                    np_labels[idx] = l
                np_shapes[idx] = s
                np_frames[idx] = f

        if sp != 'test':
            dset_labels[...] = np_labels
        dset_shapes[...] = np_shapes
        dset_frames[...] = np_frames


if __name__ == '__main__':
    main()
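As a side note on the script above, the ffmpeg call is assembled into one string and run through the shell, which is why the input path needs manual quoting. A hedged sketch of the same command as an argument list follows; it reuses only flags that appear in the script (the `-debug 0` option is dropped here), and the helper names are hypothetical. The list could be passed to `subprocess.Popen(cmd, stdout=subprocess.PIPE)` without `shell=True`.

```python
def build_ffmpeg_command(path, width, height, fps=25):
    """Build an ffmpeg argument list that writes raw BGR frames to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "quiet",
        "-i", path,  # no manual quoting needed when not going through a shell
        "-f", "image2pipe", "-qscale:v", "1",
        "-filter:v", f"scale={width}:{height}",
        "-r", str(fps),
        "-threads", "1",
        "-pix_fmt", "bgr24", "-vcodec", "rawvideo", "-",
    ]


def frame_size_bytes(width, height, channels=3):
    """Number of bytes to read from the pipe for one raw frame."""
    return width * height * channels


cmd = build_ffmpeg_command("clip with space.mp4", 340, 256)
print(cmd)
print(frame_size_bytes(340, 256))
```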

@datumbox
Contributor

datumbox commented Aug 22, 2022

@kylemin Thanks so much for digging into the logs and files to help out. I really appreciate the help!

Could we test if loading images in BGR instead of RGB can recover the performance?

I've tested that (see #6461) but it doesn't help; the accuracy is lower. I think you were scoring RGB frames because you do the conversion from BGR to RGB in your original repo in this line.
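For reference, the BGR/RGB question comes down to reversing the channel order of every pixel. The sketch below does this on a tiny nested-list "frame"; it is illustrative only and is not the code used in #6461 or in the original repo.

```python
def swap_channel_order(frame):
    """Reverse the channel order of every pixel in an H x W x 3 nested list.

    The same operation converts BGR to RGB and RGB to BGR.
    """
    return [[pixel[::-1] for pixel in row] for row in frame]


# One row with two pixels, stored as BGR: pure blue and a mixed color.
bgr_frame = [[[255, 0, 0], [0, 128, 64]]]
rgb_frame = swap_channel_order(bgr_frame)
print(rgb_frame)
```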

FFmpeg extracts frames of low quality by default so I think I changed the setup. However, I don't think that explains such a large performance gap.

I think I understand why you say this now that I can see your preprocessing script at #6412 (comment). It's because after decoding you store the frames as JPEG files. We don't do that on our side, we just directly use the decoded picture.

Another difference that I noticed is that you actually resize videos to 256 on the smallest dimension (with a hard max of 384 on the other) instead of the 256x256 used in the original paper. I tested this out but it didn't significantly affect the accuracy.
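The resize rule in the preprocessing script can be restated as a small function. This is a sketch that mirrors the arithmetic in `download_clip` (shorter side to 256, longer side rounded down to an even number and capped at 384); the function name and defaults are chosen here for illustration.

```python
def target_size(width, height, short_side=256, max_long_side=384):
    """Return (width, height) after scaling the shorter side to `short_side`.

    The longer side keeps the aspect ratio, is rounded down to an even
    number, and is capped at `max_long_side`.
    """
    if width >= height:
        new_w = int(width * short_side / height)
        new_w -= new_w % 2
        return min(new_w, max_long_side), short_side
    new_h = int(height * short_side / width)
    new_h -= new_h % 2
    return short_side, min(new_h, max_long_side)


for size in [(640, 480), (1280, 720), (480, 640), (256, 256)]:
    print(size, "->", target_size(*size))
```

Note that a 16:9 video hits the 384 cap, so its aspect ratio is not preserved, whereas a plain 256x256 resize distorts every non-square video.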

So the accuracy gap remains strange to me. If you have any other hypothesis, I'm happy to explore it. Otherwise, thanks a lot for your help!

@datumbox
Contributor

datumbox commented Aug 22, 2022

Training the network from scratch yields a network with the following accuracy using single crops with 128 frames:

Train Params:  --ngpus 8 --nodes 8 --cache-dataset --batch-size=12 --lr 0.2 --clip-len 64 --clips-per-video 5 --sync-bn --model s3d --train-resize-size 256 256 --train-crop-size 224 224 --val-resize-size 256 256 --val-crop-size 224 224
Val Params: --batch-size=16 --test-only --cache-dataset --clip-len 128 --clips-per-video 1 
Checkpoint: job55651/model_34.pth
Clip Acc@1 68.206 Clip Acc@5 87.645

Unfortunately the above leads to severe overfitting on the Training set, as we see below:

acc1: 100.0000 (99.4226)  acc5: 100.0000 (99.9334)

I think we need to add stronger regularization. The paper is a bit unclear on the hyper-params used, but in the TF code-base I see that the default value is Dropout=0.2. I'll give that a try and if I still see overfitting I'll tune the weight decay as well.

@YosuaMichael
Contributor

@datumbox I notice from @kylemin's script that the ffmpeg command uses -r 25, which means it uses a frame rate of 25 fps. I am currently running the test using --frame-rate 25 to check if this improves the result.
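To show what fixing the frame rate changes, here is a minimal sketch of resampling by index selection: it picks which source frames survive when a clip is converted to a target frame rate. The exact selection rule used by ffmpeg's `-r` or by the reference script's `--frame-rate` may differ; the function and the 30 fps source are assumptions for illustration.

```python
def resample_indices(num_frames, original_fps, new_fps):
    """Indices of the source frames kept when converting to `new_fps`.

    Uses integer arithmetic, so both frame rates must be integers.
    """
    num_out = num_frames * new_fps // original_fps
    return [(i * original_fps) // new_fps for i in range(num_out)]


# A hypothetical 10-second clip recorded at 30 fps, resampled to 25 fps.
indices = resample_indices(300, 30, 25)
print(len(indices), indices[:6], indices[-1])
```

With these numbers a 10-second clip yields 250 frames, which matches the `max_num = 250` limit in the extraction script.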

@YosuaMichael
Contributor

YosuaMichael commented Aug 25, 2022

I ran a test with BICUBIC and the following parameters:

>     --batch-size=16 --test-only \
>     --data-path="/datasets01/kinetics/070618/400/" \
>     --clip-len 64 --frame-rate 25 --clips-per-video 5 \
>     --cache-dataset \
>     --model s3d --weights="S3D_Weights.DEFAULT"

And the result I got is:

Test: Total time: 0:42:11
 * Clip Acc@1 60.389 Clip Acc@5 82.375
 * Video Acc@1 68.261 Video Acc@5 88.393

Seems like it still can't reach ~72% accuracy.
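The difference between the Clip and Video numbers above comes from where the averaging happens. The sketch below uses toy scores to show how averaging the clips of one video before taking the argmax can lift accuracy above the clip-level figure. It is an assumption about how the aggregation works in principle, not a copy of the reference script.

```python
from collections import defaultdict


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def clip_and_video_top1(clip_scores, clip_video_ids, video_labels):
    """Return (clip accuracy, video accuracy) as fractions in [0, 1]."""
    # Clip level: every clip is scored on its own.
    clip_hits = sum(
        argmax(scores) == video_labels[vid]
        for scores, vid in zip(clip_scores, clip_video_ids)
    )

    # Video level: average the scores of all clips from one video first.
    per_video = defaultdict(list)
    for scores, vid in zip(clip_scores, clip_video_ids):
        per_video[vid].append(scores)
    video_hits = 0
    for vid, all_scores in per_video.items():
        mean_scores = [sum(col) / len(all_scores) for col in zip(*all_scores)]
        video_hits += argmax(mean_scores) == video_labels[vid]

    return clip_hits / len(clip_scores), video_hits / len(per_video)


# Toy data: two videos, three clips each, two classes.
clip_scores = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3],
               [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
clip_video_ids = ["a", "a", "a", "b", "b", "b"]
video_labels = {"a": 0, "b": 1}

clip_acc, video_acc = clip_and_video_top1(clip_scores, clip_video_ids, video_labels)
print(clip_acc, video_acc)
```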

@datumbox
Contributor

@YosuaMichael Can you try with --clip-len 128 and --clips-per-video 1? Supposedly that's close to how the paper does it.

facebook-github-bot pushed a commit that referenced this pull request Aug 25, 2022
Summary:
* S3D initial commit

* add model builder code and docstrings

* change classifier submodule, populate weights enum

* fix change of block args from List[List[int]] to ints

* add VideoClassification to transforms

* edit weights url for testing, add s3d to models.video init

* norm_layer changes

* norm_layer and args fix

* Overwrite default dropout

* Remove docs from internal submodules.

* Fix tests

* Adding documentation.

* Link doc from main models.rst

* Fix min_temporal_size

* Adding crop/resize parameters in references script

* Release weights.

* Refactor dropout.

* Adding the weights table in the doc

Reviewed By: datumbox

Differential Revision: D39013679

fbshipit-source-id: 140f7531dcecf65396518e8632f639b3a2a1cfad

Co-authored-by: Vasilis Vryniotis <datumbox@users.noreply.github.com>
Co-authored-by: Vasilis Vryniotis <vvryniotis@fb.com>
@YosuaMichael
Contributor

@datumbox using --clip-len 128 --clips-per-video 1 I got a worse result:

Test: Total time: 0:55:10
 * Clip Acc@1 65.626 Clip Acc@5 86.332

Here are the full params:

>     --batch-size=16 --test-only \
>     --data-path="/datasets01/kinetics/070618/400/" \
>     --clip-len 128 --frame-rate 25 --clips-per-video 1 \
>     --cache-dataset \
>     --model s3d --weights="S3D_Weights.DEFAULT"

@datumbox
Contributor

@YosuaMichael My understanding from reading the documentation of ffprobe is that the default interpolation is bilinear, not bicubic. It's worth checking that as well to see if you get an improvement, if you have the bandwidth.

@YosuaMichael
Contributor

Hi @datumbox, I have tried with BILINEAR and here is what I got:

with --clip-len 128 --clips-per-video 1

* Clip Acc@1 65.632 Clip Acc@5 86.391

with --clip-len 64 --clips-per-video 5

 * Clip Acc@1 60.344 Clip Acc@5 82.352
 * Video Acc@1 68.220 Video Acc@5 88.372

I think BILINEAR and BICUBIC produce a relatively similar result in this case.

@datumbox
Contributor

Thanks @YosuaMichael. I've been training a new model the whole week but due to infra problems I don't have a new model. The accuracy you report is on par with the model that I have currently trained from scratch. I still have overfitting problems so I'm sure there is more to be done here to improve it. I'll report results here as I get them.

@datumbox
Contributor

datumbox commented Sep 2, 2022

Training a network with Dropout=0.2 yields the following results:

Train Params:  --ngpus 8 --nodes 8 --cache-dataset --batch-size=12 --lr 0.2 --clip-len 64 --clips-per-video 5 --sync-bn --model s3d --train-resize-size 256 256 --train-crop-size 224 224 --val-resize-size 256 256 --val-crop-size 224 224
Val Params: --batch-size=16 --test-only --cache-dataset --clip-len 128 --clips-per-video 1 
Checkpoint: job59255/model_43.pth
* Clip Acc@1 68.345 Clip Acc@5 88.050

Which is similar to Dropout=0.0. This model is about +1 Acc@1 point higher than the one we deployed but still has a gap of roughly 4 points from the original paper. I'm still observing a significant amount of overfitting, as we don't currently offer a good way to do augmentations for Video (coming up). I could keep trying to tweak the configuration to avoid overfitting, but this job requires lots of resources and I currently want to dedicate them to the testing of Transforms.

@YosuaMichael Any thoughts on whether we should deploy these weights instead of the ported ones?

@YosuaMichael
Contributor

Hi @datumbox, overall I prefer to deploy the new weights, as a 1-point difference is quite significant. However, I have no strong opinion on this since we may also update the weights later once we have more augmentations for video.

Successfully merging this pull request may close these issues.

S3D feature request
7 participants