Copyright © 2021 WolframRhodium
BM3D denoising filter for VapourSynth, implemented in CUDA.
-
Please check VapourSynth-BM3D.
-
The _rtc version compiles GPU code at runtime, which might run faster than the standard version at the cost of a slight overhead.
-
The cpu version is implemented with AVX and AVX2 intrinsics and serves as a reference implementation on CPU. However, bitwise identical outputs are not guaranteed across CPU and CUDA implementations.
-
CPU with AVX support.
-
CUDA-enabled GPU(s) of compute capability 5.0 or higher (Maxwell+).
-
GPU driver 450 or newer.
The minimum requirement on compute capability is 3.5, which requires manual compilation (specifying the nvcc flag -gencode arch=compute_35,code=sm_35).
The cpu version does not require any external libraries, but it additionally requires AVX2 support on the CPU.
{bm3dcuda, bm3dcuda_rtc, bm3dcpu}.BM3D(clip clip[, clip ref=None, float[] sigma=3.0, int[] block_step=8, int[] bm_range=9, int radius=0, int[] ps_num=2, int[] ps_range=4, bint chroma=False, int device_id=0, bool fast=True, int extractor_exp=0])
-
clip:
The input clip. Must be in 32-bit float format. Each plane is denoised separately if chroma is set to False. Data of unprocessed planes is undefined. Frame properties of the output clip are copied from it.
-
ref:
The reference clip. Must have the same format, width, height, and number of frames as clip.
Used in block-matching and as the reference in empirical Wiener filtering, i.e. bm3d.Final / bm3d.VFinal:
basic = core.{bm3dcpu, bm3dcuda, bm3dcuda_rtc}.BM3D(src, radius=0)
final = core.{bm3d...}.BM3D(src, ref=basic, radius=0)
vbasic = core.{bm3d...}.BM3D(src, radius=radius_nonzero).bm3d.VAggregate(radius=radius_nonzero)
vfinal = core.{bm3d...}.BM3D(src, ref=vbasic, radius=r).bm3d.VAggregate(radius=r)
# alternatively, using the v2 interface
basic_or_vbasic = core.{bm3dcpu, bm3dcuda, bm3dcuda_rtc}.BM3Dv2(src, radius=r)
final_or_vfinal = core.{bm3d...}.BM3Dv2(src, ref=basic_or_vbasic, radius=r)
corresponds to the following, respectively (ignoring color space handling and other differences in implementation):
basic = core.bm3d.Basic(src)
final = core.bm3d.Final(src, ref=basic)
vbasic = core.bm3d.VBasic(src, radius=r).bm3d.VAggregate(radius=r, sample=1)
vfinal = core.bm3d.VFinal(src, ref=vbasic, radius=r).bm3d.VAggregate(radius=r)
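As a rough sketch of the two-pass pattern above, the helper below chains a basic and a final estimate through the BM3Dv2 interface. The function name two_pass_bm3dv2 and its plugin argument are hypothetical, and forwarding extra keyword arguments assumes that BM3Dv2 accepts the same parameters as BM3D, which is not stated explicitly here.

```python
# Hypothetical helper, not part of the plugin: builds basic and final estimates.
# It relies on VapourSynth exposing plugin namespaces as attributes of a clip,
# the same mechanism used by the ".bm3d.VAggregate(...)" chaining in this README.

SUPPORTED_PLUGINS = ("bm3dcpu", "bm3dcuda", "bm3dcuda_rtc")


def two_pass_bm3dv2(src, plugin="bm3dcuda", radius=0, **kwargs):
    """Return (basic, final) estimates for a 32-bit float clip `src`.

    `kwargs` are forwarded unchanged to both passes (assumption: BM3Dv2
    takes the same optional parameters as BM3D, e.g. sigma).
    """
    if plugin not in SUPPORTED_PLUGINS:
        raise ValueError(f"unknown plugin namespace: {plugin!r}")

    namespace = getattr(src, plugin)

    # First pass: no reference clip, so block-matching runs on the source.
    basic = namespace.BM3Dv2(radius=radius, **kwargs)

    # Second pass: the basic estimate guides block-matching and Wiener filtering.
    final = namespace.BM3Dv2(ref=basic, radius=radius, **kwargs)
    return basic, final
```

In a VapourSynth script with a 32-bit float clip src, `basic, final = two_pass_bm3dv2(src, plugin="bm3dcpu", radius=1)` would build both estimates.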
-
sigma: The strength of denoising for each plane.
The strength is similar (but not strictly equal) to that of VapourSynth-BM3D due to differences in implementation (coefficient normalization is not implemented, for example).
Default [3,3,3].
-
block_step, bm_range, radius, ps_num, ps_range:
Same as those in VapourSynth-BM3D.
If chroma is set to True, only the first value is in effect. Otherwise, an array of values may be specified, one per plane (except for radius).
Note: A large value of ps_num is generally not recommended, as the current implementations do not take duplicate block-matching candidates into account during temporal searching, which may lead to a regression in denoising quality. This issue is not present in VapourSynth-BM3D.
Note2: Lowering the value of block_step helps reduce blocking artifacts at the cost of slower processing.
-
chroma:
CBM3D algorithm. clip must be in YUV444PS format.
The Y channel is used in block-matching of chroma channels.
Default False.
-
device_id:
Sets the GPU to be used.
Default 0.
-
fast:
Multi-threaded copy between CPU and GPU at the expense of 4x memory consumption.
Default True.
-
extractor_exp:
Used for deterministic (bitwise) output. This parameter is not present in the cpu version since that implementation always produces deterministic output.
Pre-rounding is employed to make floating-point summation associative.
The value should be a positive integer not less than 3, and may need to be higher depending on the source video and filter parameters.
Default 0 (non-deterministic output).
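The idea behind pre-rounding can be illustrated with a small, generic sketch. It is not the plugin's code: it uses Python doubles rather than GPU single precision, and the constant 1.5 * 2**exp is an assumption about how an "extractor" could be derived from an exponent. Adding and then subtracting a large constant snaps each value to a fixed grid, after which the additions are exact and their order no longer matters (as long as the running total stays well below the constant).

```python
import random


def pre_round(x, exp):
    """Snap x to a fixed grid by adding and subtracting a large constant.

    Assumes |x| < 2**(exp - 1), so that x + extractor stays within one binade.
    """
    extractor = 1.5 * 2.0 ** exp  # hypothetical choice of constant
    return (x + extractor) - extractor


def deterministic_sum(values, exp):
    """Sum pre-rounded values; the result does not depend on their order."""
    total = 0.0
    for v in values:
        total += pre_round(v, exp)
    return total


# Demo: the same data summed in two different orders.
rng = random.Random(0)
data = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
shuffled = data[:]
rng.shuffle(shuffled)

same = deterministic_sum(data, 20) == deterministic_sum(shuffled, 20)
print("order-independent:", same)
```

A larger exponent leaves more headroom for the running total but makes the grid coarser, which fits the note above that the value may need to be raised for some sources and filter parameters.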
-
bm3d.VAggregate should be called after temporal filtering, as in VapourSynth-BM3D. Alternatively, you may use the BM3Dv2() interface for both spatial and temporal denoising in one step.
-
The _rtc version has three additional experimental parameters:
-
bm_error_s: (string)
Specifies the cost used for block similarity measurement.
Currently implemented costs: SSD (Sum of Squared Differences), SAD (Sum of Absolute Differences), ZSSD (Zero-mean SSD), ZSAD (Zero-mean SAD), SSD/NORM.
Default SSD.
-
transform_2d_s/transform_1d_s: (string)
Specifies the type of transform.
Currently implemented transforms: DCT (Discrete Cosine Transform), Haar (Haar Transform), WHT (Walsh–Hadamard Transform), Bior1.5 (transform based on a bi-orthogonal spline wavelet).
Default DCT.
These features are not implemented in the standard version due to performance and binary size concerns.
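To make the block similarity costs listed above concrete, here is a minimal pure-Python sketch using their textbook definitions. The function names are illustrative, blocks are passed as flat lists of floats, and the plugin's exact formulas may differ. SSD/NORM is left out because its definition is not given here.

```python
def _mean(block):
    return sum(block) / len(block)


def ssd(a, b):
    """Sum of Squared Differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sad(a, b):
    """Sum of Absolute Differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def zssd(a, b):
    """Zero-mean SSD: each block's mean is removed before comparison."""
    ma, mb = _mean(a), _mean(b)
    return sum(((x - ma) - (y - mb)) ** 2 for x, y in zip(a, b))


def zsad(a, b):
    """Zero-mean SAD: each block's mean is removed before comparison."""
    ma, mb = _mean(a), _mean(b)
    return sum(abs((x - ma) - (y - mb)) for x, y in zip(a, b))


# Two blocks that differ only by a constant brightness offset of 0.5.
block_a = [1.0, 2.0, 3.0, 4.0]
block_b = [v + 0.5 for v in block_a]

print(ssd(block_a, block_b), sad(block_a, block_b))    # offset is penalized
print(zssd(block_a, block_b), zsad(block_a, block_b))  # offset is ignored
```

The zero-mean variants treat blocks that differ only in brightness as identical, which is the usual reason for choosing them over plain SSD or SAD.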
-
GPU memory consumption:
(ref ? 4 : 3) * (chroma ? 3 : 1) * (fast ? 4 : 1) * (2 * radius + 1) * size_of_a_single_frame
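The formula above translates directly into a small estimator. The function name is hypothetical, and the example assumes that size_of_a_single_frame means one 32-bit float plane (width * height * 4 bytes), which is not spelled out here.

```python
def bm3d_gpu_memory_bytes(frame_bytes, radius=0, ref=False, chroma=False, fast=True):
    """Estimate GPU memory use from the README formula.

    frame_bytes: size_of_a_single_frame in bytes (assumed: one float32 plane).
    """
    return (
        (4 if ref else 3)
        * (3 if chroma else 1)
        * (4 if fast else 1)
        * (2 * radius + 1)
        * frame_bytes
    )


# Example: a 1920x1080 float32 plane with default settings.
plane = 1920 * 1080 * 4
estimate = bm3d_gpu_memory_bytes(plane)
print(f"{estimate / 2**20:.1f} MiB")
```

Setting fast=False divides the estimate by four, and each additional unit of radius adds two more frames' worth of buffers.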
-
The CMake configuration of BM3DCUDA_RTC links to the NVRTC static library by default, which requires CUDA 11.5 or later.
cmake -S . -B build -D CMAKE_BUILD_TYPE=Release -D CMAKE_CUDA_FLAGS="--threads 0 --use_fast_math -Wno-deprecated-gpu-targets" -D CMAKE_CUDA_ARCHITECTURES="50;61-real;75-real;86"
cmake --build build --config Release