Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Math: Optimize sofm_exp_fixed() HiFi version #9631

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

singalsu
Copy link
Collaborator

No description provided.

@singalsu singalsu marked this pull request as ready for review November 1, 2024 11:35
@@ -357,20 +312,27 @@ int32_t sofm_exp_fixed(int32_t x)
xs = x;
while (xs >= SOFM_EXP_TWO_Q27 || xs <= SOFM_EXP_MINUS_TWO_Q27) {
xs >>= 1;
n++;
n <<= 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we not just count the number of bits to shift and shift by N directly? Maybe with a branch like

if (xs >= SOFM_EXP_TWO_Q27)
    xs >>= ffs(xs) - ffs(SOFM_EXP_TWO_Q27);
else if (xs <= SOFM_EXP_MINUS_TWO_Q27)
    xs >>= ...;

? The actual shift above is likely wrong, but that would be the idea

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a HiFi instruction to count bits for normalization, so maybe such instruction could be used. If counting bits is done in a loop like this it would not help.

There's also the Q27 vs. Q28 format issue, here we shift right one too much and in next we shift left. But I could not get measurable improvement from trying to optimize that, and I'm not sure I preserved right the equation so I left it out from this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's max. 3 loops for worst case large input number. So, the overhead from counting bits other way needs to be small.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu on Xtensa __builtin_ffs() uses the NSAU instruction to count those bits

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I didn't know, I'll try with it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu yep, we have definitions for all the ffs*() and clz*() variants in

#define ffs(i) __builtin_ffs(i)
#define ffsl(i) __builtin_ffsl(i)
#define ffsll(i) __builtin_ffsll(i)
#define clz(i) __builtin_clz(i)
#define clzl(i) __builtin_clzl(i)
#define clzll(i) __builtin_clzll(i)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'm able to save another 0.4 MCPS with normalize HiFi instruction based input value scaling to replace the while loop. Also even more saving seems to be possible with allowing range -4 to +4 instead of -2 to +2 for the small value exponent function. I'll update this PR tomorrow.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another great looking improvement! One question about using rand for the test case, plesae check inline.

int32_t ivalue, iexp_value;
int i;

srand((unsigned int)time(NULL));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit mixed feeling about this. Randomization can work, but then you'd have to have enough iterations to cover meaningful amount of the test space to provide repeatable and reliable results (see e.g. https://scotthannen.org/blog/2024/05/20/dont-use-random-numbers-in-tests.html).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed here testing of the other small values exp() function that uses random numbers (there was a discussion about it in review). Random numbers give good coverage over time if the random numbers are good (the failing input / output value is printed). But of course with 256 points in one run, some issue might be missed. But what would then be suitable number of points to trust that 32 bit inputs are properly tested in acceptable time. The cmocka test is run with xtensa simulator, so it could take quite long.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still kept this. If it's a blocker, also easy to change to a fixed grid of test points. As you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random is not reproducible but does offer a more varied surface area for the tests. @singalsu for the moment can we make this a static table as we dont yet have the infra to deal with test randomization.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, knowing the exponent nature, I think I could test with a dense grid of 1 lsb at min and max values and some 100 points grid for intermediate values.

I think there's a mistake in random values of the other function test in this same source file so I should fix it too. The values tested are actually only integer decimal values +0.5 and not entire 32 bit range.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have both? A fixed set of points and a few more random ones?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I can do that since I have already code for random.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed my mind again. After adding the linear grid test with 3 regions of exp() asymptotic approach of x-axis, curve up, asymptotic approach of y-axis I thought that having random added is low value, so I removed it.


return y;
}

EXPORT_SYMBOL(sofm_exp_fixed);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually the EXPORT_SYMBOL() line follows immediately the function, with no empty lines

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, is it just style issue? Looks ugly to me, but anyway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, is it just style issue? Looks ugly to me, but anyway.

@singalsu yeah, it's the same in the Linux kernel and even checkpatch requests for it to follow the exported function, but maybe it tolerates empty lines. Still, everywhere else it follows immediately, so, I wouldn't add it just in a single location

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed now!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu I still see whitespace in the updated patch...?

Should be:

	return y;
}
EXPORT_SYMBOL(sofm_exp_fixed);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, I think I just forgot to push the update branch!

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless my eyes deceive, the last style issue is still there...


return y;
}

EXPORT_SYMBOL(sofm_exp_fixed);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu I still see whitespace in the updated patch...?

Should be:

	return y;
}
EXPORT_SYMBOL(sofm_exp_fixed);

There was no unit test for the function, so it is added. The
pass criteria is tuned for current implementation to just
pass in the three defined regions. It is useful to verify
the next changes to optimize the function.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
The unnecessary shift and multiply functions can be removed
with use of normal C shift left and with use xtensa multiply,
shift, and round intrinsics directly in the function.

This change saves in TGL HiFi3 platform 1.3 MCPS in DRC
processing mode.

Signed-off-by: Seppo Ingalsuo <seppo.ingalsuo@linux.intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants