This tool extracts frames, motion vectors, frame types and timestamps from H.264 and MPEG-4 Part 2 encoded videos.
This class is a replacement for OpenCV's VideoCapture and can be used to read and decode video frames from a H.264 or MPEG-4 Part 2 encoded video stream/file. It returns the following values for each frame:
- decoded frame as BGR image
- motion vectors
- Frame type (keyframe, P- or B-frame)
- (for RTSP streams): UTC wall time of the moment the sender sent out a frame (as opposed to an easily retrievable timestamp for the frame reception)
These additional features enable further projects, such as fast visual object tracking or synchronization of multiple RTSP streams. Both a C++ and a Python API is provided. Under the hood FFMPEG is used.
The image below shows a video frame with extracted motion vectors overlaid,
A usage example can be found in extract_mvs.py
.
- Provided PyPI package
- Added unittests in
tests/tests.py
- Updated for compatibility with Python >3.8
- Provided a script to wrap Docker run command
- Updated demo script with command line arguments for extraction and storing of motion vectors
- Changed Docker image to manylinux_2_24_x86_64 to prepare for building wheels
The mv-extractor seems to be quite popular and I want to improve it. However, I do not have the time and resources to do this alone. Hence, I gladly welcome any community contributions.
You can install the motion vector extractor via pip
pip install --upgrade pip
pip install motion-vector-extractor
Note, that we currently provide the package only for x86-64 linux, such as Ubuntu or Debian, and Python 3.8, 3.9, and 3.10. If you are on a different platform, please use the Docker image as described below.
Download the example video vid_h264.mp4
from the repo and place it somewhere. To extract the motion vectors, open a terminal at the same location and run
extract_mvs vid_h264.mp4 --preview --verbose
The extraction script provides command line options to store extracted motion vectors to disk, and to enable/disable graphical output. For all options type
extract_mvs -h
For example, to store extracted frames and motion vectors to disk without showing graphical output run
extract_mvs vid_h264.mp4 --dump
Before you can run the tests, clone the source code. To this end, change into the desired installation directory on your machine and run
git clone https://github.com/LukasBommes/mv-extractor.git mv_extractor
Now, to run the tests from the mv_extractor
directory with
python3 tests/tests.py
Confirm that all tests pass.
If you are using the Docker image instead of the PyPI package as explained below, you can invoke the tests with
sudo ./run.sh python3.10 tests/tests.py
If you want to use the motion vector extractor in your own Python script import it via
from mvextractor.videocap import VideoCap
You can then use it according to the example in extract_mvs.py
.
Generally, a video file is opened by VideoCap.open()
and frames, motion vectors, frame types and timestamps are read by calling VideoCap.read()
repeatedly. Before exiting the program, the video file has to be closed by VideoCap.release()
. For a more detailed explanation see the API documentation below.
Instead of installing the motion vector extractor via PyPI you can also use the prebuild Docker image from DockerHub. The Docker image contains the motion vector extractor and all its dependencies and comes in handy for quick testing or in case your platform is not compatible with the provided Python package.
To use the Docker image you need to install Docker. Furthermore, you need to clone the source code with
git clone https://github.com/LukasBommes/mv-extractor.git mv_extractor
Afterwards, you can run the extraction script in the mv_extractor
directory as follows
sudo ./run.sh python3.10 extract_mvs.py vid_h264.mp4 --preview --verbose
This pulls the prebuild Docker image from DockerHub and runs the extraction script inside the Docker container.
This step is not required and for faster installation, we recommend using the prebuilt image.
If you still want to build the Docker image locally, you can do so by running the following command in the mv_extractor
directory
sudo docker build . --tag=mv-extractor
Note that building can take more than one hour.
Now, run the docker container with
sudo docker run -it --ipc=host --env="DISPLAY" -v $(pwd):/home/video_cap -v /tmp/.X11-unix:/tmp/.X11-unix:rw mv-extractor /bin/bash
This module provides a Python API which is very similar to that of OpenCV VideoCapture. Using the Python API is the recommended way of using the H.264 Motion Vector Capture class.
Methods | Description |
---|---|
VideoCap() | Constructor |
open() | Open a video file or url |
grab() | Reads the next video frame and motion vectors from the stream |
retrieve() | Decodes and returns the grabbed frame and motion vectors |
read() | Convenience function which combines a call of grab() and retrieve(). |
release() | Close a video file or url and release all ressources |
Constructor. Takes no input arguments.
Open a video file or url. The stream must be H264 encoded. Otherwise, undesired behaviour is likely.
Parameter | Type | Description |
---|---|---|
url | string | Relative or fully specified file path or an url specifying the location of the video stream. Example "vid.flv" for a video file located in the same directory as the source files. Or "rtsp://xxx.xxx.xxx.xxx:554" for an IP camera streaming via RTSP. |
Returns | Type | Description |
---|---|---|
success | bool | True if video file or url could be opened successfully, false otherwise. |
Reads the next video frame and motion vectors from the stream, but does not yet decode it. Thus, grab() is fast. A subsequent call to retrieve() is needed to decode and return the frame and motion vectors. the purpose of splitting up grab() and retrieve() is to provide a means to capture frames in multi-camera scenarios which are as close in time as possible. To do so, first call grab() on all cameras and afterwards call retrieve() on all cameras.
Takes no input arguments.
Returns | Type | Description |
---|---|---|
success | bool | True if next frame and motion vectors could be grabbed successfully, false otherwise. |
Decodes and returns the grabbed frame and motion vectors. Prior to calling retrieve() on a stream, grab() needs to have been called and returned successfully.
Takes no input arguments and returns a tuple with the elements described in the table below.
Index | Name | Type | Description |
---|---|---|---|
0 | success | bool | True in case the frame and motion vectors could be retrieved sucessfully, false otherwise or in case the end of stream is reached. When false, the other tuple elements are set to empty numpy arrays or 0. |
1 | frame | numpy array | Array of dtype uint8 shape (h, w, 3) containing the decoded video frame. w and h are the width and height of this frame in pixels. Channels are in BGR order. If no frame could be decoded an empty numpy ndarray of shape (0, 0, 3) and dtype uint8 is returned. |
2 | motion vectors | numpy array | Array of dtype int64 and shape (N, 10) containing the N motion vectors of the frame. Each row of the array corresponds to one motion vector. If no motion vectors are present in a frame, e.g. if the frame is an I frame an empty numpy array of shape (0, 10) and dtype int64 is returned. The columns of each vector have the following meaning (also refer to AVMotionVector in FFMPEG documentation): - 0: source: Offset of the reference frame from the current frame. The reference frame is the frame where the motion vector points to and where the corresponding macroblock comes from. If source < 0, the reference frame is in the past. For s > 0 the it is in the future (in display order). - 3: src_x: x-location (in pixels) where the motion vector points to in the reference frame. - 4: src_y: y-location (in pixels) where the motion vector points to in the reference frame. - 5: dst_x: x-location of the vector's origin in the current frame (in pixels). Corresponds to the x-center coordinate of the correspdoning macroblock. - 6: dst_y: y-location of the vector's origin in the current frame (in pixels). Corresponds to the y-center coordinate of the correspdoning macroblock. - 7: motion_x = motion_scale * (src_x - dst_x) - 8: motion_y = motion_scale * (src_y - dst_y) - 9: motion_scale: see definiton of columns 7 and 8. Used to scale up the motion components to integer values. E.g. if motion_scale = 4, motion components can be integer values but encode a float with 1/4 pixel precision. |
3 | frame_type | string | Unicode string representing the type of frame. Can be "I" for a keyframe, "P" for a frame with references to only past frames and "B" for a frame with references to both past and future frames. A "?" string indicates an unknown frame type. |
4 | timestamp | double | UTC wall time of each frame in the format of a UNIX timestamp. In case, input is a video file, the timestamp is derived from the system time. If the input is an RTSP stream the timestamp marks the time the frame was send out by the sender (e.g. IP camera). Thus, the timestamp represents the wall time at which the frame was taken rather then the time at which the frame was received. This allows e.g. for accurate synchronization of multiple RTSP streams. In order for this to work, the RTSP sender needs to generate RTCP sender reports which contain a mapping from wall time to stream time. Not all RTSP senders will send sender reports as it is not part of the standard. If IP cameras are used which implement the ONVIF standard, sender reports are always sent and thus timestamps can always be computed. |
Convenience function which internally calls first grab() and then retrieve(). It takes no arguments and returns the same values as retrieve().
Close a video file or url and release all ressources. Takes no input arguments and returns nothing.
The C++ API differs from the Python API in what parameters the methods expect and what values they return. Refer to the docstrings in src/video_cap.hpp
.
What follows is a short explanation of the data returned by the VideoCap
class. Also refer this excellent book by Iain E. Richardson for more details.
The decoded video frame. Nothing special about that.
H.264 uses different techniques to reduce the size of a raw video frame prior to sending it over a network or storing it into a file. One of those techniques is motion estimation and prediction of future frames based on previous or future frames. Each frame is split into 16 pixel x 16 pixel large macroblocks. During encoding motion estimation matches every macroblock to a similar looking macroblock in a previously encoded frame (note that this frame can also be a future frame since encoding and playout order might differ). This allows to transmit only those motion vectors and the reference macroblock instead of all macroblocks, effectively reducing the amount of transmitted or stored data.
Motion vectors correlate directly with motion in the video scene and are useful for various computer vision tasks, such as visual object tracking.
The frame type is either "P", "B" or "I" and refers to the H.264 encoding mode of the current frame. An "I" frame is send fully over the network and serves as a reference for "P" and "B" frames for which only differences to previously decoded frames are transmitted. Those differences are encoded via motion vectors. As a consequence for an "I" frame no motion vectors are returned by this library. The difference between "P" and "B" frames is that "P" frames refer only to pas frames, whereas "B" frames have motion vectors which refer to both past and future frames. References to future frames are possible even with live streams because the decoding order of frames differs from the display order.
In addition to extracting motion vectors and frame types, the video capture class also outputs a UNIX timestamp representing UTC wall time for each frame. If the stream originates from a video file, this timestamp is simply derived from the current system time. However, when an RTSP stream is used as input, the timestamp calculation is more intricate as the timestamps represents not the time when the frame was received, but the time when the frame was send by the sender. Thus, this timestamp can be used for accurate synchronization of multiple video streams.
Computation of the frame wall time works as follows:
-
Wait for a RTCP sender report package which contains a mapping between the stream's RTP timestamp and the current UTC wall time. Now, a correlation between RTP timestamps of the stream and wall time is known. Name the RTP timestamp
T_RTP_LAST
and the corresponding UTC wall timeT_UTC_LAST
. -
For each new frame, compute the UTC timestamp as follows:
T_UTC = T_UTC_LAST + (T_RTP - T_RTP_LAST) / 90000
Here T_RTP
is the frame's RTP timestamp and T_RTP_LAST
and T_UTC_LAST
are the RTP timestamp and corresponding UTC wall time of the last RTCP sender report packet. The factor of 90000 is needd because the RTP timestamps increment with 90 kHz per RTSP specification. That means, that the RTP timestamp increments by 90000 every second.
Note, that the sender clock needs to be synchronized with a network time server (via NTP) to ensure frame timestamps are in sync with UTC. Most IP cameras provide an option for this.
This software is written by Lukas Bommes, M.Sc. - A*Star SIMTech, Singapore
It is based on MV-Tractus and OpenCV's videoio module.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use our work for academic research please cite
@INPROCEEDINGS{9248145,
author={L. {Bommes} and X. {Lin} and J. {Zhou}},
booktitle={2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA)},
title={MVmed: Fast Multi-Object Tracking in the Compressed Domain},
year={2020},
volume={},
number={},
pages={1419-1424},
doi={10.1109/ICIEA48937.2020.9248145}}