Skip to content

Xeon Phi (Knights Corner) Support. #6440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 213 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
213 commits
Select commit Hold shift + click to select a range
868a201
add detection of Xeon PHI: Knights Corner.
julialongtin Mar 12, 2024
7f3722b
handle the case that we have no glibc on the PHI.
julialongtin Mar 12, 2024
5a2973a
instead of checking on glibc, check on SYS_getcpu
julialongtin Mar 12, 2024
a31c936
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
aec982e
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
f346a41
try to implement one intrinsic
julialongtin Mar 13, 2024
a1ae649
use right type, and define GGML_F32_VEC_ZERO.
julialongtin Mar 13, 2024
7a57feb
import intrinsics.
julialongtin Mar 13, 2024
717e164
implement F32 dot products.
julialongtin Mar 16, 2024
257ffd9
Update ggml.c
julialongtin Mar 16, 2024
e216a2f
Update ggml.c
julialongtin Mar 16, 2024
eac00a7
Update ggml.c
julialongtin Mar 16, 2024
fe663c1
merge from upstream
julialongtin Mar 17, 2024
f882673
add a benchmark / test binary.
julialongtin Mar 17, 2024
ab6f3a8
Update ggml-phi-knc.c
julialongtin Mar 17, 2024
ee27148
remove intrinsics import, and use upConv to save 12 bytes of memory t…
julialongtin Mar 20, 2024
76e66e7
use the same header as ggml.c, and remove some warnings.
julialongtin Mar 20, 2024
ac36371
formatting changes.
julialongtin Mar 20, 2024
0979522
spacing changes.
julialongtin Mar 21, 2024
9185e14
be more specific about the length of our list of run amounts.
julialongtin Mar 21, 2024
a7bd64c
begin work on targeting dot_q5_K_q8_K.
julialongtin Mar 23, 2024
9bcb835
import stdint.h for sizeSt.
julialongtin Mar 23, 2024
8f57803
import stdio.h for size_t.
julialongtin Mar 23, 2024
cd20404
pull in ggml specific types.
julialongtin Mar 23, 2024
18f3539
tell ggml-common.h to export what we want.
julialongtin Mar 23, 2024
0b3f171
force to compile.
julialongtin Mar 23, 2024
0b012c0
allow using code from ggml-phi-knc-dot_q5_K_q8_K.c
julialongtin Mar 23, 2024
0a2051a
attempt to speed up float clearing.
julialongtin Mar 23, 2024
6face8a
first fixes.
julialongtin Mar 23, 2024
edb76ff
formatting improvement.
julialongtin Mar 23, 2024
e3503c9
promote aux16 into a vector.
julialongtin Mar 23, 2024
c72157a
promote aux16 into a vector.
julialongtin Mar 23, 2024
f092a10
promote aux16 into a vector. (part three)
julialongtin Mar 23, 2024
e43a63e
fix typo.
julialongtin Mar 23, 2024
31d4f93
copy right block.
julialongtin Mar 23, 2024
f985372
add missing variable.
julialongtin Mar 23, 2024
bd6d7e6
try to use vectorized zeroing function.
julialongtin Mar 23, 2024
9d7ca41
expand mask, and align memory.
julialongtin Mar 23, 2024
bb5eb95
use better memory save operator.
julialongtin Mar 23, 2024
f09b3ed
use quotes properly.
julialongtin Mar 23, 2024
2fdd11f
promote aux16 to a vector.
julialongtin Mar 23, 2024
f967690
add missing address of operators.
julialongtin Mar 23, 2024
ea1edb0
promote aux32 to a vector.
julialongtin Mar 23, 2024
4477b8e
add I32 vector memory clearing.
julialongtin Mar 23, 2024
a5132a1
attempt our first FMA.
julialongtin Mar 23, 2024
5935bb3
use proper mov operator, and pass addresses.
julialongtin Mar 23, 2024
03a3e0e
perform 16 operations at a time.
julialongtin Mar 24, 2024
ba4f412
better comments, and fix some small errors.
julialongtin Mar 24, 2024
c28bfe4
spacing changes, eliminate dead references to k1 or zero, and use the…
julialongtin Mar 24, 2024
169a145
fix our reference to src in the second place, and use a more accurate…
julialongtin Mar 24, 2024
cf481cf
promote aux8 into a vector.
julialongtin Mar 24, 2024
ca0dc26
loosen alignment requirements for zeros, add missing function, and pr…
julialongtin Mar 24, 2024
bc3d6db
separate filling aux16 from consuming aux16 by making it an array of …
julialongtin Mar 24, 2024
12c9576
fix vector sizes.
julialongtin Mar 25, 2024
9f569ca
massively rewrite assembly routines.
julialongtin Apr 2, 2024
8c17353
minor changes.
julialongtin Apr 2, 2024
47190a7
formatting.
julialongtin Apr 2, 2024
cb44226
Merge pull request #1 from julialongtin/k1om
julialongtin Apr 2, 2024
96fdd21
indent headers consistently.
julialongtin Apr 3, 2024
6f67ea8
formatting changes.
julialongtin Apr 3, 2024
9412572
add Makefile rule for generation .s file, for manual inspection.
julialongtin Apr 3, 2024
84df774
whoops. missing tab.
julialongtin Apr 3, 2024
9ad5efa
use GGML_F32_EPR, and remove some dead code.
julialongtin Apr 3, 2024
9152143
reformat, and label what these files are.
julialongtin Apr 3, 2024
53773e0
replace tabs with spaces.
julialongtin Apr 3, 2024
e298d9e
further optimizations. 0.99 tokens per second.
julialongtin Apr 22, 2024
6d16090
fix some small errors.
julialongtin Apr 22, 2024
90e99ea
fix an offset error, and get rid of tabs.
julialongtin Apr 22, 2024
8cae9a9
comment and spacing fixes.
julialongtin Apr 24, 2024
d69cf87
use or, instead of and. bug fix?
julialongtin Apr 24, 2024
77d4ca9
spacing and capitalization changes.
julialongtin Apr 25, 2024
047291f
spacing and capitalization changes. Fix the register list of GGML_5bi…
julialongtin Apr 26, 2024
81ca166
minor spacing and comment changes.
julialongtin May 9, 2024
af4ee51
add batch fp16<->fp32 conversion functions.
julialongtin May 9, 2024
a283551
remove a warning.
julialongtin May 9, 2024
e1fdfaa
fix typo
julialongtin May 9, 2024
867de5e
use different restrict syntax, to make g++ happy.
julialongtin May 9, 2024
2282ac4
broadcast a single int8, instead of 4 of them.
julialongtin May 10, 2024
f6edcc4
Use a vectorized assembly function to handle remaining chunks less th…
julialongtin May 10, 2024
b00607d
use vbroadcastss in place of vbroadcast32x4.
julialongtin May 10, 2024
0ff7d5d
perform better prefetches, and invert the test of our clear flag for …
julialongtin May 10, 2024
650094e
remove useless prefetches.
julialongtin May 10, 2024
7966c8e
spacing and comment changes.
julialongtin May 10, 2024
7e44eab
move sub earlier, and move the compare of iterations to outside, and …
julialongtin May 10, 2024
21a1e74
fix loop.
julialongtin May 10, 2024
8064727
use values inside of the loop as soon as we have them.
julialongtin May 10, 2024
4a3c42c
correct a comment, and use jz when comparing to zero.
julialongtin May 10, 2024
a82ada7
comment clarification.
julialongtin May 10, 2024
3156e63
change from handling three iterations per loop to four.
julialongtin May 11, 2024
fba57c1
subtract the correct amount.
julialongtin May 11, 2024
fa0226c
look at the right final memory location.
julialongtin May 11, 2024
b34575b
add missing jump.
julialongtin May 11, 2024
6c4e687
spacing changes.
julialongtin May 11, 2024
9d7f967
spacing changes.
julialongtin May 11, 2024
0a0bb9b
introduce r10 and r11, for vloadunpackhd.
julialongtin May 11, 2024
a1d0da6
rename label 1 to 3.
julialongtin May 11, 2024
047defe
rename some labels.
julialongtin May 11, 2024
7fa2d73
relabel some other labels.
julialongtin May 11, 2024
653a565
fill and increment r12 and r13.
julialongtin May 11, 2024
9550ca5
add missing vector.
julialongtin May 11, 2024
efdb411
make the offset of q4 available.
julialongtin May 11, 2024
3449b0f
minor comment fixes.
julialongtin May 11, 2024
1072686
load from identical addresses for low and high side.
julialongtin May 11, 2024
b23ab86
make offset available in a register.
julialongtin May 11, 2024
a20edbf
do 2 rounds of 4, instead of 4 rounds of 2. and properly offset unall…
julialongtin May 11, 2024
0add310
spacing changes.
julialongtin May 12, 2024
9ec8635
add detection of Xeon PHI: Knights Corner.
julialongtin Mar 12, 2024
a83e2ca
handle the case that we have no glibc on the PHI.
julialongtin Mar 12, 2024
5c0d49c
instead of checking on glibc, check on SYS_getcpu
julialongtin Mar 12, 2024
366279e
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
7fb8d47
try to detect the PHI cross compiler in make.
julialongtin Mar 12, 2024
429d69f
try to implement one intrinsic
julialongtin Mar 13, 2024
b5ea05f
use right type, and define GGML_F32_VEC_ZERO.
julialongtin Mar 13, 2024
7fce3f6
import intrinsics.
julialongtin Mar 13, 2024
192e4ad
implement F32 dot products.
julialongtin Mar 16, 2024
83be3db
Update ggml.c
julialongtin Mar 16, 2024
114e7dd
Update ggml.c
julialongtin Mar 16, 2024
c70b5f2
Update ggml.c
julialongtin Mar 16, 2024
d7d679e
merge from upstream
julialongtin Mar 17, 2024
a56a6f3
add a benchmark / test binary.
julialongtin Mar 17, 2024
d095d8e
Update ggml-phi-knc.c
julialongtin Mar 17, 2024
5a9d2f5
remove intrinsics import, and use upConv to save 12 bytes of memory t…
julialongtin Mar 20, 2024
a06fa4b
use the same header as ggml.c, and remove some warnings.
julialongtin Mar 20, 2024
bb73cb3
formatting changes.
julialongtin Mar 20, 2024
a48d3b9
spacing changes.
julialongtin Mar 21, 2024
c9730c0
be more specific about the length of our list of run amounts.
julialongtin Mar 21, 2024
669ce9b
begin work on targeting dot_q5_K_q8_K.
julialongtin Mar 23, 2024
3edaaca
import stdint.h for sizeSt.
julialongtin Mar 23, 2024
62e3543
import stdio.h for size_t.
julialongtin Mar 23, 2024
8703abe
pull in ggml specific types.
julialongtin Mar 23, 2024
a7f8abe
tell ggml-common.h to export what we want.
julialongtin Mar 23, 2024
aee550a
force to compile.
julialongtin Mar 23, 2024
a015d84
allow using code from ggml-phi-knc-dot_q5_K_q8_K.c
julialongtin Mar 23, 2024
7f5adf3
attempt to speed up float clearing.
julialongtin Mar 23, 2024
b3ec86e
first fixes.
julialongtin Mar 23, 2024
ff29b65
formatting improvement.
julialongtin Mar 23, 2024
2f0a949
promote aux16 into a vector.
julialongtin Mar 23, 2024
66d26d4
promote aux16 into a vector.
julialongtin Mar 23, 2024
84093a6
promote aux16 into a vector. (part three)
julialongtin Mar 23, 2024
e99f3a9
fix typo.
julialongtin Mar 23, 2024
656bf28
copy right block.
julialongtin Mar 23, 2024
2870bfc
add missing variable.
julialongtin Mar 23, 2024
7a00422
try to use vectorized zeroing function.
julialongtin Mar 23, 2024
5c010f7
expand mask, and align memory.
julialongtin Mar 23, 2024
ed639a6
use better memory save operator.
julialongtin Mar 23, 2024
31b8a5a
use quotes properly.
julialongtin Mar 23, 2024
45c94bd
promote aux16 to a vector.
julialongtin Mar 23, 2024
3c29fd5
add missing address of operators.
julialongtin Mar 23, 2024
10237df
promote aux32 to a vector.
julialongtin Mar 23, 2024
da69ed5
add I32 vector memory clearing.
julialongtin Mar 23, 2024
e3468e0
attempt our first FMA.
julialongtin Mar 23, 2024
d34e0ff
use proper mov operator, and pass addresses.
julialongtin Mar 23, 2024
0c01d07
perform 16 operations at a time.
julialongtin Mar 24, 2024
98c9b69
better comments, and fix some small errors.
julialongtin Mar 24, 2024
3cdfc9c
spacing changes, eliminate dead references to k1 or zero, and use the…
julialongtin Mar 24, 2024
3fef54f
fix our reference to src in the second place, and use a more accurate…
julialongtin Mar 24, 2024
1c182a3
promote aux8 into a vector.
julialongtin Mar 24, 2024
e579af1
loosen alignment requirements for zeros, add missing function, and pr…
julialongtin Mar 24, 2024
2a47e5f
separate filling aux16 from consuming aux16 by making it an array of …
julialongtin Mar 24, 2024
20c2bc5
fix vector sizes.
julialongtin Mar 25, 2024
33cc1d8
massively rewrite assembly routines.
julialongtin Apr 2, 2024
90498c1
minor changes.
julialongtin Apr 2, 2024
3cf6eb0
formatting.
julialongtin Apr 2, 2024
3ff0924
indent headers consistently.
julialongtin Apr 3, 2024
aeb5ae8
formatting changes.
julialongtin Apr 3, 2024
ded4da4
add Makefile rule for generation .s file, for manual inspection.
julialongtin Apr 3, 2024
f84859a
whoops. missing tab.
julialongtin Apr 3, 2024
b8abefb
use GGML_F32_EPR, and remove some dead code.
julialongtin Apr 3, 2024
fb83cd9
reformat, and label what these files are.
julialongtin Apr 3, 2024
d966ac2
replace tabs with spaces.
julialongtin Apr 3, 2024
c3d438b
further optimizations. 0.99 tokens per second.
julialongtin Apr 22, 2024
e37b7f8
fix some small errors.
julialongtin Apr 22, 2024
4fb1547
fix an offset error, and get rid of tabs.
julialongtin Apr 22, 2024
dc1f639
comment and spacing fixes.
julialongtin Apr 24, 2024
0124f7a
use or, instead of and. bug fix?
julialongtin Apr 24, 2024
9a799eb
spacing and capitalization changes.
julialongtin Apr 25, 2024
54f181d
spacing and capitalization changes. Fix the register list of GGML_5bi…
julialongtin Apr 26, 2024
1c2fdc3
minor spacing and comment changes.
julialongtin May 9, 2024
9fa06f4
add batch fp16<->fp32 conversion functions.
julialongtin May 9, 2024
c39fa8b
remove a warning.
julialongtin May 9, 2024
2cf193e
fix typo
julialongtin May 9, 2024
664a602
use different restrict syntax, to make g++ happy.
julialongtin May 9, 2024
6e0258a
broadcast a single int8, instead of 4 of them.
julialongtin May 10, 2024
b1c9622
Use a vectorized assembly function to handle remaining chunks less th…
julialongtin May 10, 2024
a14fe02
use vbroadcastss in place of vbroadcast32x4.
julialongtin May 10, 2024
d8d574c
perform better prefetches, and invert the test of our clear flag for …
julialongtin May 10, 2024
204bc1f
remove useless prefetches.
julialongtin May 10, 2024
f555f9d
spacing and comment changes.
julialongtin May 10, 2024
dda250f
move sub earlier, and move the compare of iterations to outside, and …
julialongtin May 10, 2024
270204e
fix loop.
julialongtin May 10, 2024
9a1a53b
use values inside of the loop as soon as we have them.
julialongtin May 10, 2024
f3b86eb
correct a comment, and use jz when comparing to zero.
julialongtin May 10, 2024
4097cde
comment clarification.
julialongtin May 10, 2024
511ad80
change from handling three iterations per loop to four.
julialongtin May 11, 2024
47ca67a
subtract the correct amount.
julialongtin May 11, 2024
1b7ca0b
look at the right final memory location.
julialongtin May 11, 2024
4d94831
add missing jump.
julialongtin May 11, 2024
fc23c22
spacing changes.
julialongtin May 11, 2024
a273a9e
spacing changes.
julialongtin May 11, 2024
9f3623f
introduce r10 and r11, for vloadunpackhd.
julialongtin May 11, 2024
9aa34c8
rename label 1 to 3.
julialongtin May 11, 2024
eefa650
rename some labels.
julialongtin May 11, 2024
0c0137e
relabel some other labels.
julialongtin May 11, 2024
50887fc
fill and increment r12 and r13.
julialongtin May 11, 2024
257c06b
add missing vector.
julialongtin May 11, 2024
3d39d61
make the offset of q4 available.
julialongtin May 11, 2024
420e9db
minor comment fixes.
julialongtin May 11, 2024
084e368
load from identical addresses for low and high side.
julialongtin May 11, 2024
7925fb1
make offset available in a register.
julialongtin May 11, 2024
bd22e9d
do 2 rounds of 4, instead of 4 rounds of 2. and properly offset unall…
julialongtin May 11, 2024
aede2f5
spacing changes.
julialongtin May 12, 2024
ded062c
Merge branch 'master' into 0.99-rebase
julialongtin Jun 12, 2024
03a8b80
Merge pull request #4 from julialongtin/0.99-rebase
julialongtin Jun 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 27 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,8 @@ CC := riscv64-unknown-linux-gnu-gcc
CXX := riscv64-unknown-linux-gnu-g++
endif

K1OM := $(shell echo | $(CC) -dM -E - | grep __k1om__)

#
# Compile flags
#
Expand Down Expand Up @@ -279,6 +281,10 @@ endif
ifndef RISCV

ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64))

# detect the PHI cross compiler.
ifeq "${K1OM}" ""

# Use all CPU extensions that are available:
MK_CFLAGS += -march=native -mtune=native
HOST_CXXFLAGS += -march=native -mtune=native
Expand All @@ -290,6 +296,11 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686 amd64))
# Usage SSSE3-only (Not is SSE3!)
#MK_CFLAGS += -mssse3
#MK_CXXFLAGS += -mssse3
else
OBJS += ggml-phi-knc.o ggml-phi-knc-dot_q5_K_q8_K.o
MK_CFLAGS += -march=knc -mtune=knc
endif

endif

ifneq '' '$(findstring mingw,$(shell $(CC) -dumpmachine))'
Expand Down Expand Up @@ -733,13 +744,29 @@ clean:
# Helper function that replaces .c, .cpp, and .cu file endings with .o:
GET_OBJ_FILE = $(patsubst %.c,%.o,$(patsubst %.cpp,%.o,$(patsubst %.cu,%.o,$(1))))

# Helper function that replaces .c, .cpp, and .cu file endings with .s:
GET_ASM_FILE = $(patsubst %.c,%.s,$(patsubst %.cpp,%.s,$(patsubst %.cu,%.s,$(1))))

main: examples/main/main.cpp ggml.o llama.o $(COMMON_DEPS) console.o grammar-parser.o $(OBJS)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
@echo
@echo '==== Run ./main -h for help. ===='
@echo

bench-phi-knc.s: bench-phi-knc.c
$(CC) $(CFLAGS) -S $< -o $(call GET_ASM_FILE, $<)

ggml-phi-knc.s: ggml-phi-knc.c
$(CC) $(CFLAGS) -S $< -o $(call GET_ASM_FILE, $<)

bench-phi-knc: bench-phi-knc.c ggml-phi-knc.o ggml-phi-knc-dot_q5_K_q8_K.o
$(CC) $(CFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CC) $(CFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)

ggml-phi-knc-dot_q5_K_q8_K.s: ggml-phi-knc-dot_q5_K_q8_K.c
$(CC) $(CFLAGS) -S $< -o $(call GET_ASM_FILE, $<)

infill: examples/infill/infill.cpp ggml.o llama.o $(COMMON_DEPS) console.o grammar-parser.o $(OBJS)
$(CXX) $(CXXFLAGS) -c $< -o $(call GET_OBJ_FILE, $<)
$(CXX) $(CXXFLAGS) $(filter-out %.h $<,$^) $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS)
Expand Down
213 changes: 213 additions & 0 deletions bench-phi-knc.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,213 @@
/* bench-phi-knc.c: benchmarks and tests for the Xeon PHI Knights Corner optimizations. */

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* For CLOCK_REALTIME? */
#include <unistd.h>
#include <time.h>

/* For memcpy */
#include <string.h>

/* include the increasingly inacurately named header for our F32 dot product code. */
#include "ggml-phi-knc.h"

/* include the header for our Q8K_Q5K dot product code. */
#include "ggml-phi-knc-dot_q5_K_q8_K.h"

// largest Float32 vectors to get the dot product of.
#define F32_MAXVEC 1024768
// how many benchmarks we will run in total.
#define F32_RUNCOUNT 12
#define F32_ITEMS_PER_RUN {10, 16, 17, 32, 33, 48, 49, 64, 65, 80, 81, 1024768}

int main(void)
{
int vecRuns[F32_RUNCOUNT] = F32_ITEMS_PER_RUN;

// seed the random number generator.
srand(time(NULL));

// Run benchmarks for our F32 dot product functions. Benchmark them against a naieve implementation.
for (uint8_t runCount = 0; runCount < F32_RUNCOUNT; ++runCount)
{
struct timespec start, middle, end;
double vector_time;
double scalar_time;
float scalar = 0.0f;
float vector = 0.0f;

// Generate random input vector of [-1, 1] values.
float vec1[F32_MAXVEC] __attribute__((aligned(64)));
for (int i = 0; i < vecRuns[runCount]; i++)
vec1[i] = 2 * (0.5 - rand() / (float)RAND_MAX);

// Generate a second random input vector of [-1, 1] values.
float vec2[F32_MAXVEC] __attribute__((aligned(64)));
for (int i = 0; i < vecRuns[runCount]; i++)
vec2[i] = 2 * (0.5 - rand() / (float)RAND_MAX);

// on your mark..
clock_gettime(CLOCK_MONOTONIC, &start);

// call dot product
ggml_vec_dot_f32(vecRuns[runCount], &vector, 0, vec1, 0, vec2, 0, 0);

// save the middle point..
clock_gettime(CLOCK_MONOTONIC, &middle);

// do the same work by hand;
for (int i = 0; i < vecRuns[runCount]; ++i)
scalar += vec1[i]*vec2[i];

clock_gettime(CLOCK_MONOTONIC, &end);

printf("vector\tvs\tscalar (%d items)\n", vecRuns[runCount]);
printf("%.9f\tvs\t%.9f\n", vector, scalar);

vector_time = middle.tv_sec - start.tv_sec;
vector_time += (middle.tv_nsec - start.tv_nsec) / 1000000000.0;

scalar_time = end.tv_sec - middle.tv_sec;
scalar_time += (end.tv_nsec - middle.tv_nsec) / 1000000000.0;

printf("%.9f\tvs\t%.9f\n", vector_time, scalar_time);
}

fflush(stdout);

// Generate a random input vector of 256 4 bit values.
uint8x16_t q4[8];
uint8_t * q4ptr = (uint8_t *)q4;
for (int i = 0; i < 128; i++)
q4ptr[i] = rand() && 0xFF;

// Generate a random input vector of 256 1 bit values.
uint8x16_t q1[2];
uint8_t * q1ptr = (uint8_t *)q1;
for (int i = 0; i < 32; i++)
q1ptr[i] = rand() && 0xFF;

// Get our reference, unshifted result.
uint8x16_t q5[16];
GGML_5bit_Unpack_Unaligned(q4, (uint8_t *)q1, q5);

printf("successfully got a Q5.\n");

// Perform alignment tests, for GGML_5bit_Unpack_Unaligned.
// Try to run GGML_5bit_Unpack_Unaligned with all possible misalignments, and get it to fail.
for (uint8_t shiftCount = 1; shiftCount < 16; ++shiftCount)
{
uint8x16_t q5new[16];
uint8x16_t q4Shifted[9];

// create an off-by-shiftCount copy of q4.
q4ptr = ((uint8_t *)q4Shifted) + shiftCount;
memcpy (q4ptr, q4, 128);

// call the unaligned form of this function:
GGML_5bit_Unpack_Unaligned((uint8x16_t *)q4ptr, (uint8_t *)q1, q5new);

for (uint32_t byteCount = 0; byteCount < 256; ++byteCount)
{
if ( ((uint8_t *)q5new)[byteCount] != ((uint8_t *)q5)[byteCount] )
{
printf("whoops!\nshiftCount: %d\nbyteCount: %d\n", shiftCount, byteCount);
exit (-1);
}
}

printf("Got a Q5 offset by %d\n", shiftCount);
}

// Generate a random input vector of 256 8 bit values.
int8x16_t q8[16];
int8_t * q8ptr = (int8_t *)q8;
for (int i = 0; i < 256; i++)
q8ptr[i] = rand() && 0xFF;

// Generate eight random scales, one for each pair of sums.
uint8_t scale[8];
for (int i = 0; i < 8; i++)
scale[i] = rand() && 0xFF;

// Generate a random X scale.
float rndScaleX = 2 * (0.5 - rand() / (float)RAND_MAX);
ggml_fp16_t scaleX = GGML_PHI_FP32_TO_FP16(rndScaleX);

// Display the random X scale. Verifies FP32_TO_FP16_TO_FP32 is working.
printf("rndScaleX: %f\n", rndScaleX);
printf("scaleX: %x\n", scaleX);
printf("newScaleX: %f\n", GGML_PHI_FP16_TO_FP32(scaleX));

// Generate a random Y scale.
float scaleY = 2 * (0.5 - rand() / (float)RAND_MAX);
printf("scaleY: %f\n", scaleY);

// Create a place for our golden result.
float32x16_t res;

// Clear res.
GGML_F32x16_VEC_ZERO(&res);

// Generate an initial result, to compare to.
GGML_8X_2xI8x16_2xI8x16_MUL_2xI16x16_S_FMA_I32x16_Unaligned (q8, q5, scale, scaleX, scaleY, &res);

// Generate a sum of the result.
float sum = 0.0f;
for (int l = 0; l < 16; ++l) sum += ((float *)&res)[l];

printf("Got a res: %f\n", sum);

// Perform alignment tests, for GGML_8X_2xI8x16_2xI8x16_MUL_2xI16x16_S_FMA_I32x16_Unaligned.
// try to run GGML_8X_2xI8x16_2xI8x16_MUL_2xI16x16_S_FMA_I32x16_Unaligned with all possible mis-alignments, and get it to fail.
for (uint8_t shiftCount = 1; shiftCount < 16; ++shiftCount)
{
float32x16_t resNew1;
int8x16_t q8Shifted[17];

// Create an off-by-shiftCount copy of q8.
q8ptr = ((int8_t *)q8Shifted)+shiftCount;
memcpy (q8ptr, q8, 256);

// Clear resNew.
GGML_F32x16_VEC_ZERO(&resNew1);

// Call the unaligned form of this function:
GGML_8X_2xI8x16_2xI8x16_MUL_2xI16x16_S_FMA_I32x16_Unaligned ((int8x16_t *)q8ptr, q5, scale, scaleX, scaleY, &resNew1);

// check the result against our reference.
for (uint32_t floatCount = 0; floatCount < 64; ++floatCount)
{
if ( ((int8_t *)&resNew1)[floatCount] != ((int8_t *)&res)[floatCount] )
{
printf("whoops!\nshiftCount: %d\nfloatCount: %d\n", shiftCount, floatCount);
for (uint32_t row = 0; row < 16 ; ++row)
{
for (int col1 = 0; col1 < 4; ++col1)
{
printf("%2.2x\t", ((int8_t *)&resNew1)[(4*row)+col1]);
}
printf(" vs ");
for (int col2 = 0; col2 < 4; ++col2)
{
printf("%2.2x\t", ((int8_t *)&res)[(4*row)+col2]);
}
printf ("\n");
}
exit (-1);
}
}

// Generate a sum of our new result.
float sumf = 0.0f;
for (int l = 0; l < 16; ++l) sumf += ((float *)&resNew1)[l];

printf("Got a res from a Q8 offset by %d: %f\n", ((uint64_t) q8ptr) & 0x3F, sumf);
}

return 0;
}
Loading