
IBA::Unsharp_mask() speed and memory optimization #4513

Merged
merged 9 commits on Nov 6, 2024

Conversation

ssh4net
Contributor

@ssh4net ssh4net commented Oct 29, 2024

Description

Replaced the three IBA calls plus a helper function, which together generated four full-size image buffers, with a single unsharp_mask_impl() that uses parallel_image() to compute the unsharp result:
src + contr * (((src - blur) < threshold) ? 0.0 : (src - blur))

Added a two-pass 1D convolution for kernels larger than 3x3.

Tests

```
#include <chrono>
#include <iostream>
#include <OpenImageIO/imagebuf.h>
#include <OpenImageIO/imagebufalgo.h>
using namespace OIIO;

// 'input' is an ImageBuf loaded earlier in the test program.
ImageBuf sharped(input.spec());
const int repeats = 50;

std::cout << "Start sharpening\n";
auto start = std::chrono::high_resolution_clock::now();

bool ok = true;
for (int i = 0; i < repeats; i++) {
    //ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 15.0f, 10.0f, 0.01f);
    ok = ImageBufAlgo::unsharp_mask(sharped, input, "gaussian", 5.0f, 2.0f, 0.05f);
    std::cout << ".";
}
std::cout << "\n";

auto part1 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed_part1 = part1 - start;
std::cout << "Elapsed time: " << elapsed_part1.count() << " s\n";
```

Both single-threaded (one ImageBuf at a time) and multithreaded (multiple ImageBufs at a time) runs show a good speedup: roughly 30-40%, with lower memory use.

For 5x5 Gaussian kernels, the two-pass mode should add at least another 20% speedup.

(Independent benchmarks would be welcome, since I saw big differences in my own numbers depending on whether the workload was real or synthetic.)

Checklist:

  • I have read the contribution guidelines.
  • I have updated the documentation, if applicable. (Check if there is no
    need to update the documentation, for example if this is a bug fix that
    doesn't change the API.)
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • If I added or modified a C++ API call, I have also amended the
    corresponding Python bindings (and if altering ImageBufAlgo functions, also
    exposed the new functionality as oiiotool options).
  • My code follows the prevailing code style of this project. If I haven't
    already run clang-format before submitting, I definitely will look at the CI
    test that runs clang-format and fix anything that it highlights as being
    nonconforming.

Replace the 3x IBA calls + helper function with a single unsharp_mask_impl() that uses parallel_image()

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
For kernels 5x5 or larger this can give a speedup of 40% or more, increasing with kernel size

Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@ssh4net ssh4net changed the title IBA::Unsharp_mask() speed and memory optimization (after convolution code) IBA::Unsharp_mask() speed and memory optimization Oct 29, 2024
Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
@ssh4net
Contributor Author

ssh4net commented Oct 31, 2024

@lgritz what should we do here?
There are 3 pixels that differ between the reference tahoe-unsharp.tif, made with the old single-pass 5x5 convolution, and the new 2x 1D convolution. This is an amplified view of those pixels:
[attached image: amplified view of the differing pixels]

The difference is 0.004 in one of the channels.

I believe the issue comes from float rounding errors when convolution_() estimates a scale for normalization and uses it later. The kernel returned by the kernel function, although already normalized, has a sum of
1.000000075,
which gives a scale of 0.999999925 in double (computed in Excel),
but C++ in float gets 0.999999881,
and this small error can be slightly amplified by the 2x 1D convolution.

As an alternative, we could use double for the kernels and normalization and cast to float before applying to the real pixels. But in any case the result will differ from the reference images, especially for higher bit-depth inputs/outputs.

Signed-off-by: Larry Gritz <lg@larrygritz.com>
Collaborator

@lgritz lgritz left a comment


LGTM

The changes resulted in 3 pixels having a single LSB difference versus the old code, which I think is totally understandable because you changed the code in a way that should be mathematically equivalent, but is a different sequence of floating point operations. I think this is acceptable, especially given the performance gain.

I took the liberty of updating the reference output and pushing that change to your tree, that should clear things up for the tests.

There are still some failed CI tests -- unrelated to your code -- in the heif test, the result of the GitHub runners upgrading the libheif version they are using. We've already fixed this in OIIO main; you branched off a little earlier than that fix, so as soon as we merge this PR, all will be ok.

I'll double check that CI passes in all cases except for the heif failures, and if that's the case, I will merge.

Sorry to take a while with this review, it's been a really busy last several days for me. But this is good work, thanks.

@lgritz lgritz merged commit 3c3f715 into AcademySoftwareFoundation:main Nov 6, 2024
29 checks passed
lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 10, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 10, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 13, 2024
…areFoundation#4513)

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Nov 21, 2024
…areFoundation#4513)
