
IBA::Unsharp_mask() speed and memory optimization #4513

Merged
merged 9 commits on Nov 6, 2024

Conversation

ssh4net
Contributor

@ssh4net ssh4net commented Oct 29, 2024

Description

Replaced the three IBA calls plus a helper function, which together generated four full-size image buffers, with a single unsharp_mask_impl() that uses parallel_image() to compute the unsharp result:
src + contr * (((src - blur) < threshold) ? 0.0 : (src - blur))

Added a two-pass 1D convolution for kernels larger than 3x3.

Tests

```
#include <chrono>
#include <iostream>
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>
using namespace OIIO;

// 'input' is an ImageBuf loaded earlier in the test program.
ImageBuf sharped(input.spec());
const int repeats = 50;

std::cout << "Start sharpening\n";
auto start = std::chrono::high_resolution_clock::now();

bool ok = true;
for (int i = 0; i < repeats; i++) {
    //ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 15.0f, 10.0f, 0.01f);
    ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 5.0f, 2.0f, 0.05f);
    std::cout << ".";
}
std::cout << "\n";

auto part1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_part1 = part1 - start;
std::cout << "Elapsed time: " << elapsed_part1.count() << " s\n";
```

Both single-threaded (one ImageBuf at a time) and multithreaded (multiple ImageBufs at a time) runs show a good speedup: roughly 30-40%, with lower memory use.

For 5x5 Gaussian kernels, the two-pass mode should add at least another 20% speedup.

(Independent benchmarks would be welcome, since I saw big differences in my own numbers depending on whether the workload was real or synthetic.)

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable. (Check if there is no
    need to update the documentation, for example if this is a bug fix that
    doesn't change the API.)
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • If I added or modified a C++ API call, I have also amended the
    corresponding Python bindings (and if altering ImageBufAlgo functions, also
    exposed the new functionality as oiiotool options).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format before submitting, I definitely will look at the CI
    test that runs clang-format and fix anything that it highlights as being
    nonconforming.

Replace the 3x IBA calls + helper function with a single unsharp_mask_impl() that uses parallel_image()

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
For kernels 5x5 or larger this can give a speedup of 40% or more, increasing with kernel size

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@ssh4net ssh4net changed the title IBA::Unsharp_mask() speed and memory optimization (after convolution code) IBA::Unsharp_mask() speed and memory optimization Oct 29, 2024
Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@ssh4net
Contributor Author

ssh4net commented Oct 31, 2024

@lgritz what should we do here?
There are 3 pixels that differ between the reference tahoe-unsharp.tif, made with the old single-pass 5x5 convolution, and the new 2x 1D convolution. This is an amplified view of those pixels:
[attached image: amplified view of the differing pixels]

The difference is 0.004 in one of the channels.

I believe the issue comes from float rounding errors when convolution_() estimates a scale for normalization and uses it later. The kernel returned by the kernel function, although already normalized, has a sum of
1.000000075,
which gives a scale of 0.999999925 in double (computed in Excel),
but C++ in float gets 0.999999881,
and this small error can be slightly amplified by the 2x 1D convolution.

As an alternative, we could use double for the kernels and normalization and cast to float before applying to the real pixels. But in any case the result will differ from the reference images, especially for higher bit-depth inputs/outputs.

Signed-off-by: Larry Gritz <lg@larrygritz.com>
Collaborator

@lgritz lgritz left a comment


LGTM

The changes resulted in 3 pixels having a single LSB difference versus the old code, which I think is totally understandable because you changed the code in a way that should be mathematically equivalent, but is a different sequence of floating point operations. I think this is acceptable, especially given the performance gain.

I took the liberty of updating the reference output and pushing that change to your tree, that should clear things up for the tests.

There are still some failed CI tests -- unrelated to your code -- in the heif test, the result of the GitHub runners upgrading the libheif version they are using. We've already fixed this in OIIO main; you branched off a little earlier than that fix, so as soon as we merge this PR, all will be ok.

I'll double check that CI passes in all cases except for the heif failures, and if that's the case, I will merge.

Sorry to take a while with this review, it's been a really busy last several days for me. But this is good work, thanks.

@lgritz lgritz merged commit 3c3f715 into AcademySoftwareFoundation:main Nov 6, 2024
29 checks passed
lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 10, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 10, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 13, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 21, 2024
…areFoundation#4513)
