
short multiplication #2

Open
kimstik opened this issue Apr 29, 2024 · 17 comments


kimstik commented Apr 29, 2024

        for (uint32_t j = 0; j < 8; j++) {
            int32_t in=*activations_idx++;
            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            sum += tmpsum;                                  // sign*in*1
            if (weightChunk & 0x40000000) sum += tmpsum<<3; // sign*in*8
            if (weightChunk & 0x20000000) sum += tmpsum<<2; // sign*in*4
            if (weightChunk & 0x10000000) sum += tmpsum<<1; // sign*in*2
            weightChunk <<= 4;
        }

I have some concerns about the line of code `sum += tmpsum;`.
Is it perhaps redundant?
If the weight element is in SDDD format, then the sum of three shifts should be sufficient. Here's the relevant code section:

            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            if (weightChunk & 0x10000000) sum += tmpsum; // sign*in*1
            if (weightChunk & 0x20000000) sum += tmpsum<<1; // sign*in*2
            if (weightChunk & 0x40000000) sum += tmpsum<<2; // sign*in*4
            weightChunk <<= 4;

upd: It appears that I overlooked the fact that the value 0 is not used...


cpldcpu commented Apr 29, 2024

Yes, this is to scale the numbers so that the distance between the first positive and first negative element is the same as the distance between the first and second element.

e.g. the two-bit encoding is -3, -1, 1, 3 (or -1.5, -0.5, 0.5, 1.5 after scaling), with a constant delta.

With two's complement the encoding would be asymmetric: -2, -1, 0, 1.
With one's complement without the offset, you would get either a double zero (-1, 0, 0, 1) or a double step around zero (-2, -1, 1, 2).

The encoding is more regular that way and there are no redundant codes. The weights are centered symmetrically around zero for most layers, so you don't want to have a huge gap there.

The impact of this encoding on network capacity and loss is minimal, though. QAT is able to redistribute quantization errors fascinatingly well.


kimstik commented Apr 30, 2024

Did you notice that Clang generates linear code without jumps (at the cost of a slight size increase)?
They seem to mention static branch prediction.
It would be interesting to find out how expensive conditional branches are on the QingKe RISC-V2A.


cpldcpu commented Apr 30, 2024

I didn't check Clang specifically, but there were quite a few differences in code generation between optimization options. To be honest, I thought that the RV32EC code was a bit cumbersome. The only bit that can be checked without masking is the MSB.

In the end, the biggest impact on execution speed was moving the inference code to SRAM, though.


kimstik commented May 1, 2024

Interesting. So, this device has both flash memory and a processor on a single chip?
That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.


kimstik commented May 1, 2024

I achieved branchless linear code by processing two activations simultaneously in GCC:

// W packs two 4-bit weights into 8 bits by interleaving: (S1,D12,D11,D10),(S2,D22,D21,D20) -> (S1,S2,D10,D20,D11,D21,D12,D22)
static inline int vmac8x4(uint32_t *act, int32_t W) {
    static const int M[4]   = { 0x00000000, 0x00000fff, 0x0fff0000, 0x0fff0fff, };
    static const int ONE[4] = { 0x00000000, 0x00000001, 0x00010000, 0x00010001, };	//FIXME: perhaps we may get rid of this part for sign application
    int acc = 0;
    for (uint32_t j = 0; j < 2; j++) {
        uint32_t in = *act++;
        for (uint32_t k = 0; k < 2; k++) {
            int tmp = (M[(W>>(32-2))&0x3] ^ (in & 0x00ff00ff)) + ONE[(W>>(32-2))&0x3];    //    process bytes 1/3
                        acc += tmp;
            tmp<<=1;    acc += tmp ^ M[(W>>(32-2))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-4))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-6))&0x3];
            W  <<=8;
            in >>=8;    // repeat for bytes 0/2
        }
    }
    return (acc + (acc>>16))&0xfff;
}

Just a prototype. It needs to be tested.
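For testing it, a scalar golden model derived from the original branchy loop may help. This hypothetical helper applies a single SDDD weight nibble to one activation:

```c
#include <stdint.h>

/* Scalar reference for one SDDD weight nibble (bit 3 = sign,
   bits 2..0 = D bits) applied to one activation, mirroring the
   original branchy inner loop. Useful as a golden model to compare
   a branchless or packed variant against. */
static int32_t ref_mac_nibble(int32_t sum, int32_t in, uint32_t nibble) {
    int32_t tmpsum = (nibble & 8) ? -in : in;
    sum += tmpsum;                          /* sign*in*1 */
    if (nibble & 4) sum += tmpsum << 3;     /* sign*in*8 */
    if (nibble & 2) sum += tmpsum << 2;     /* sign*in*4 */
    if (nibble & 1) sum += tmpsum << 1;     /* sign*in*2 */
    return sum;
}
```

Comparing vmac8x4 against a loop over this helper with random inputs should catch packing mistakes quickly.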


cpldcpu commented May 1, 2024

> Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

Indeed, see here.
https://cpldcpu.wordpress.com/2024/05/01/decapsulating-the-ch32v203-reveals-a-separate-flash-die/

Quite surprised to see this even in a $0.30 MCU. But the CH32V003 is monolithic.


cpldcpu commented May 1, 2024

> I achieved branchless linear code by processing two activations simultaneously in GCC:

Nice! I thought about this, but never went through actually doing this. I wonder how well RV32EC deals with the lookup tables?

Btw, I don't know whether you saw the FP1.3.0 encoding:

} else if (bits_per_weight == 16 + 4 ) { // 4 bit shift

Need to write up something about it. I almost managed to get it to the same accuracy as the 4 bit encoding. It's insane that it works so well.

This may work even better when parallelized.
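For reference, a scalar sketch of how I understand one FP1.3.0 MAC step (assuming a nibble layout of sign(1) | exponent(3); names are illustrative):

```c
#include <stdint.h>

/* One FP1.3.0 multiply-accumulate step, assuming the nibble encodes
   sign(1) | exponent(3): the weight value is +/- 2^exponent, so the
   multiply reduces to a shift plus a conditional negate. */
static int32_t fp130_mac(int32_t sum, int32_t in, uint32_t nibble) {
    int32_t p = (int32_t)((uint32_t)in << (nibble & 7)); /* in * 2^exp */
    return sum + ((nibble & 8) ? -p : p);                /* apply sign */
}
```

Since exponent 0 with both signs yields +1 and -1, all 16 codes map to distinct weight values +/-1, +/-2, ..., +/-128.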


kimstik commented May 1, 2024

Nonlinear FP1.3.0 is nice, but running it in parallel doesn't look straightforward yet.
It's strange that you didn't give it much attention, as it uses only 15 out of 16 states: shifts of +0/-0 are both present.

2-bit parallel multiplication should be very interesting.


cpldcpu commented May 1, 2024

> It's strange that you didn't give it much attention, as it uses only 15 out of 16 states. There are shifts of +0/-0 present.

For an exponent of zero, the sign encodes to +1 and -1, so all states are used. The entropy of the weights is still not optimal: after training, not all codes are used by the model in all layers.


kimstik commented May 2, 2024

8 instructions per iteration of FP1.3.0 is so beautiful:

.L28:
	slli	t1,a4,8
	sll	a2,s1,a2
	srai	a1,t1,28
	lb	s1,2(a3)
	add	a2,a2,s0
	andi	a1,a1,7
	bge	t1,zero,.L29
	neg	s1,s1
.L29:
	slli	s0,a4,12
	sll	a1,s1,a1
	srai	t1,s0,28
	add	a1,a1,a2
	lb	s1,3(a3)
	andi	a2,t1,7
	bge	s0,zero,.L30
	neg	s1,s1
.L30:
	slli	s0,a4,16
	sll	a2,s1,a2
	srai	t1,s0,28
	lb	s1,4(a3)
	add	a2,a2,a1
	andi	t1,t1,7
	bge	s0,zero,.L31
	neg	s1,s1
.L31:


kimstik commented Jul 29, 2024

I managed to get a 4-in-1 activation convolution working with 5-bit input quantization.
However, it provides almost no performance benefit ;(((
It still takes around 106 instructions for 8 inputs.


cpldcpu commented Aug 2, 2024

It should be possible to parallelize 4-bit quants with mul? 2 or 4 at a time?


kimstik commented Aug 4, 2024

Yes, with mul it should be much simpler. Would you like to switch to a chip with mul?


cpldcpu commented Aug 4, 2024

The successors of the CH32V003 (V002 etc) will all have mul.


kimstik commented Aug 4, 2024

Ouch! I missed that! Is the V2C mul 1-cycle?
I hope the next successor will add Zcmp.


cpldcpu commented Aug 6, 2024

I think it is 1 cycle. They only added the mul instruction, nothing else.


cpldcpu commented Aug 6, 2024

Should be possible to implement SIMD processing. For example 6 bit activations and 2 bit weights with 4 in parallel. The weights could be unpacked with an 8 bit table lookup into 32bit.
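A sketch of the multiply-based SIMD idea for the unsigned case, using two 16-bit lanes instead of four to keep the headroom analysis simple (names are illustrative):

```c
#include <stdint.h>

/* Pack two activations into 16-bit lanes and the two weights into the
   opposite lane order; one 32-bit multiply then produces
   a0*w0 + a1*w1 in the upper half. The cross term a1*w0 wraps out of
   the 32-bit product, and a0*w1 stays below bit 16 as long as each
   product is < 65536. */
static uint32_t dot2_mul(uint32_t a0, uint32_t a1, uint32_t w0, uint32_t w1) {
    uint32_t A = a0 | (a1 << 16);   /* two activation lanes */
    uint32_t B = w1 | (w0 << 16);   /* weights in reversed lanes */
    return (A * B) >> 16;           /* middle coefficient = dot product */
}
```

With 6-bit activations and 2-bit weights each product is at most 63*3 = 189, so everything fits with plenty of headroom. Going to four 8-bit lanes is tighter: the sum of four such products can reach 756, which no longer fits in an 8-bit field, so the 4-in-parallel variant would likely need wider fields, a different split, or the high half of the product.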
