
short multiplication #2

Open
kimstik opened this issue Apr 29, 2024 · 17 comments


kimstik commented Apr 29, 2024

        for (uint32_t j = 0; j < 8; j++) {
            int32_t in=*activations_idx++;
            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            sum += tmpsum;                                  // sign*in*1
            if (weightChunk & 0x40000000) sum += tmpsum<<3; // sign*in*8
            if (weightChunk & 0x20000000) sum += tmpsum<<2; // sign*in*4
            if (weightChunk & 0x10000000) sum += tmpsum<<1; // sign*in*2
            weightChunk <<= 4;
        }

I have some concerns about the line of code `sum += tmpsum;`.
Is it perhaps redundant?
If the weight element is in SDDD format, then the sum of three shifts should be sufficient. Here's the relevant code section:

            int32_t tmpsum = (weightChunk & 0x80000000) ? -in : in; 
            if (weightChunk & 0x10000000) sum += tmpsum; // sign*in*1
            if (weightChunk & 0x20000000) sum += tmpsum<<1; // sign*in*2
            if (weightChunk & 0x40000000) sum += tmpsum<<2; // sign*in*4
            weightChunk <<= 4;

upd: It appears that I overlooked the fact that the value 0 is not used...


cpldcpu commented Apr 29, 2024

Yes, this is to scale the numbers so that the distance between the first positive and first negative element is the same as the distance between the first and second element.

e.g. the two-bit encoding is -3, -1, 1, 3 (or -1.5, -0.5, 0.5, 1.5 after scaling), with a constant delta.

With two's complement the encoding would be asymmetric: -2, -1, 0, 1.
With one's complement without the offset, you would get either a double zero (-1, 0, 0, 1) or a double step around zero (-2, -1, 1, 2).

The encoding is more regular that way and there are no redundant codes. The weights are centered symmetrically around zero for most layers, so you don't want to have a huge gap there.

The impact of this encoding on network capacity and loss is minimal, though. QAT is able to redistribute quantization errors fascinatingly well.


kimstik commented Apr 30, 2024

Did you notice that Clang generates linear code without jumps (at the cost of a slight size increase)?
They seem to mention static branch prediction.
It would be interesting to find out how expensive conditional branches are on the QingKe RISC-V2A.


cpldcpu commented Apr 30, 2024

I didn't check Clang specifically, but there were quite a few differences in code generation between optimization options. To be honest, I thought that the RV32EC code was a bit cumbersome. The only bit that can be checked without masking is the MSB.

In the end, the biggest impact on execution speed was moving the inference code to SRAM, though.


kimstik commented May 1, 2024

Interesting. So, this device has both flash memory and a processor on a single chip?
That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.


kimstik commented May 1, 2024

I achieved branchless linear code by processing two activations simultaneously in GCC:

// W packs two 4-bit weights into 8 bits by interleaving: (S1,D12,D11,D10),(S2,D22,D21,D20) -> (S1,S2,D10,D20,D11,D21,D12,D22)
static inline int vmac8x4(uint32_t *act, int32_t W) {
    static const int M[4]   = { 0x00000000, 0x00000fff, 0x0fff0000, 0x0fff0fff, };
    static const int ONE[4] = { 0x00000000, 0x00000001, 0x00010000, 0x00010001, };	//FIXME: perhaps we may get rid of this part for sign application
    int acc = 0;
    for (uint32_t j = 0; j < 2; j++) {
        uint32_t in = *act++;
        for (uint32_t k = 0; k < 2; k++) {
            int tmp = (M[(W>>(32-2))&0x3] ^ (in & 0x00ff00ff)) + ONE[(W>>(32-2))&0x3];    //    process bytes 1/3
                        acc += tmp;
            tmp<<=1;    acc += tmp ^ M[(W>>(32-2))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-4))&0x3];
            tmp<<=1;    acc += tmp ^ M[(W>>(32-6))&0x3];
            W  <<=8;
            in >>=8;    // repeat for bytes 0/2
        }
    }
    return (acc + (acc>>16))&0xfff;
}

Just a prototype. It needs to be tested.
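For testing it, a scalar golden model derived from the original branchy loop may help. This hypothetical helper applies a single SDDD weight nibble to one activation:

```c
#include <stdint.h>

/* Scalar reference for one SDDD weight nibble (bit 3 = sign,
   bits 2..0 = D bits) applied to one activation, mirroring the
   original branchy inner loop. Useful as a golden model to compare
   a branchless or packed variant against. */
static int32_t ref_mac_nibble(int32_t sum, int32_t in, uint32_t nibble) {
    int32_t tmpsum = (nibble & 8) ? -in : in;
    sum += tmpsum;                          /* sign*in*1 */
    if (nibble & 4) sum += tmpsum << 3;     /* sign*in*8 */
    if (nibble & 2) sum += tmpsum << 2;     /* sign*in*4 */
    if (nibble & 1) sum += tmpsum << 1;     /* sign*in*2 */
    return sum;
}
```

Comparing vmac8x4 against a loop over this helper with random inputs should catch packing mistakes quickly.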


cpldcpu commented May 1, 2024

> Interesting. So, this device has both flash memory and a processor on a single chip? That's unusual for modern Chinese designs. The common trend there is to use two-chip solutions, allowing for faster operation without wait states and insane overclocking capabilities.

Indeed, see here.
https://cpldcpu.wordpress.com/2024/05/01/decapsulating-the-ch32v203-reveals-a-separate-flash-die/

Quite surprised to see this even in a $0.30 MCU. But the CH32V003 is monolithic.


cpldcpu commented May 1, 2024

> I achieved branchless linear code by processing two activations simultaneously in GCC:

Nice! I thought about this, but never went through actually doing this. I wonder how well RV32EC deals with the lookup tables?

Btw, I don't know whether you saw the FP1.3.0 encoding:

} else if (bits_per_weight == 16 + 4 ) { // 4 bit shift

Need to write up something about it. I almost managed to get it to the same accuracy as the 4 bit encoding. It's insane that it works so well.

This may work even better when parallelized.
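For reference, a scalar sketch of how I understand one FP1.3.0 MAC step (assuming a nibble layout of sign(1) | exponent(3); names are illustrative):

```c
#include <stdint.h>

/* One FP1.3.0 multiply-accumulate step, assuming the nibble encodes
   sign(1) | exponent(3): the weight value is +/- 2^exponent, so the
   multiply reduces to a shift plus a conditional negate. */
static int32_t fp130_mac(int32_t sum, int32_t in, uint32_t nibble) {
    int32_t p = (int32_t)((uint32_t)in << (nibble & 7)); /* in * 2^exp */
    return sum + ((nibble & 8) ? -p : p);                /* apply sign */
}
```

Since exponent 0 with both signs yields +1 and -1, all 16 codes map to distinct weight values +/-1, +/-2, ..., +/-128.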


kimstik commented May 1, 2024

Nonlinear FP1.3.0 is nice, but running it in parallel doesn't look straightforward yet.
It's strange that you didn't give it much attention, as it uses only 15 out of 16 states: shifts of +0/-0 are both present.

2-bit parallel multiplication should be very interesting.


cpldcpu commented May 1, 2024

> It's strange that you didn't give it much attention, as it uses only 15 out of 16 states. There are shifts of +0/-0 present.

For an exponent of zero, the sign encodes to +1 and -1, so all states are used. The entropy of the weights is still not optimal: after training, not all codes are used by the model in all layers.


kimstik commented May 2, 2024

8 instructions per iteration of FP1.3.0 is so beautiful:

.L28:
	slli	t1,a4,8
	sll	a2,s1,a2
	srai	a1,t1,28
	lb	s1,2(a3)
	add	a2,a2,s0
	andi	a1,a1,7
	bge	t1,zero,.L29
	neg	s1,s1
.L29:
	slli	s0,a4,12
	sll	a1,s1,a1
	srai	t1,s0,28
	add	a1,a1,a2
	lb	s1,3(a3)
	andi	a2,t1,7
	bge	s0,zero,.L30
	neg	s1,s1
.L30:
	slli	s0,a4,16
	sll	a2,s1,a2
	srai	t1,s0,28
	lb	s1,4(a3)
	add	a2,a2,a1
	andi	t1,t1,7
	bge	s0,zero,.L31
	neg	s1,s1
.L31:


kimstik commented Jul 29, 2024

I managed to get a 4-in-1 activation convolution working with 5-bit input quantization.
However, it provides almost no performance benefit ;(((
It still takes around 106 instructions for 8 inputs.


cpldcpu commented Aug 2, 2024

It should be possible to parallelize 4-bit quants with mul? 2 or 4 at a time?


kimstik commented Aug 4, 2024

Yes, with mul it should be much simpler. Would you like to switch to a chip with mul?


cpldcpu commented Aug 4, 2024

The successors of the CH32V003 (V002 etc) will all have mul.


kimstik commented Aug 4, 2024

Ouch! I missed that! Is the V2C mul 1-cycle?
I hope the next successor will add Zcmp.


cpldcpu commented Aug 6, 2024

I think it is 1 cycle. They only added the mul instruction, nothing else.


cpldcpu commented Aug 6, 2024

Should be possible to implement SIMD processing. For example 6 bit activations and 2 bit weights with 4 in parallel. The weights could be unpacked with an 8 bit table lookup into 32bit.
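A sketch of the multiply-based SIMD idea for the unsigned case, using two 16-bit lanes instead of four to keep the headroom analysis simple (names are illustrative):

```c
#include <stdint.h>

/* Pack two activations into 16-bit lanes and the two weights into the
   opposite lane order; one 32-bit multiply then produces
   a0*w0 + a1*w1 in the upper half. The cross term a1*w0 wraps out of
   the 32-bit product, and a0*w1 stays below bit 16 as long as each
   product is < 65536. */
static uint32_t dot2_mul(uint32_t a0, uint32_t a1, uint32_t w0, uint32_t w1) {
    uint32_t A = a0 | (a1 << 16);   /* two activation lanes */
    uint32_t B = w1 | (w0 << 16);   /* weights in reversed lanes */
    return (A * B) >> 16;           /* middle coefficient = dot product */
}
```

With 6-bit activations and 2-bit weights each product is at most 63*3 = 189, so everything fits with plenty of headroom. Going to four 8-bit lanes is tighter: the sum of four such products can reach 756, which no longer fits in an 8-bit field, so the 4-in-parallel variant would likely need wider fields, a different split, or the high half of the product.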
