More sixel renderer optimizations. #13
@ismail-yilmaz Wow, sounds promising - and ofc I am really curious about these improvements on the asm level and whether they also apply to wasm to some degree or could be ported back. Wasm's pure stack machine design makes some optimizations behave worse (even pointer arithmetic tends to run slower, from my findings). |
@ismail-yilmaz Minor note on what I've found worth looking into, but did not get down to it yet:
Both ideas are somewhat branching related, and I think that is promising ground for further optimizations. Edit: Btw my decoder function kinda invites to use computed gotos, sadly they are not yet natively supported in wasm. If you want to squeeze 10 - 20% more perf, this might be worth a look as well. |
@ismail-yilmaz I implemented the "write always something" in |
@jerch ,
I apologize for my late reply (really busy now; I'll comment on the things you've written tonight, GMT+3). But I think I finally understand why you get such high results (which is very impressive by the way, congrats!). When I do the same thing, the performance of my decoder drops to ~25 MiB/s (from 42). I now know why (at least, I know the main culprit). I'll write the whole thing tonight and share some findings, with a suggestion on your decoder's main loop. Best regards, |
Well, I wouldn't call 30 min "late" 😸 Yes, I am really interested in your asm findings and whether I can backport some ideas to wasm as well. Wasm itself is pretty reduced in op codes and its stack mechanics, so some things might not work out there.
Oh, that's weird. Note that those numbers are raw decoder speed; the whole turnover in xterm.js with the sixel-bench test is just at 36 MB/s (gets slowed down by the JS handling pre/posthand). Furthermore my test machine is an i7-2760QM with 2.40GHz and 1333 MHz (0.8 ns) double bank memory. The CPU is kinda old, but I remember seeing big differences in i3/i5 types vs. i7, esp. due to their different cache settings. If we want comparable metrics across machines, maybe we should switch to cycles instead. |
A disclaimer: my experience with, and knowledge of, webassembly is pretty limited at the moment. (Not so with Asm/C/C++, though.) My findings suggest that there are three critical sections that impact sixel rendering performance. Let's go step by step, shall we?
1. Parameter collection
This was the first interesting part. The fact that higher color palette settings (>= 256) and sixel compression rely on a supplied parameter list makes this section a good candidate for optimization. Throughout my tests and benchmarking I came to the conclusion that for effective cpu cache utilization it is better to collect the consecutive parameters in a single, separate loop at once (i.e. via a dedicated inline function, for better instruction and data cache utilization). The code of the said three functions can be inspected here. Here are the results (with -O3 optimization):
While the difference between the faster and unrolled versions can be considered negligible, it is still there. And it also shows that conditional branching may not always be the worst path to take. Which brings me to the question: does this also apply to wasm? Did you already consider this approach for your decoder? A side note: a code line from
Now, I don't know which one would be faster, as I don't have in-depth knowledge of wasm internals. To be continued... |
@ismail-yilmaz Thx a bunch, I'll check these things out. About critical sections - I kinda have a hard time profiling it in wasm; I simply don't know any tools that can achieve that within wasm. So I went the faulty route of deactivating code paths and comparing the runtimes, which ofc carries the risk of creating bigger dead-code branches that get eliminated by the compiler. Still, my findings in wasm are these:
Note that the big spread in the runtimes is a result of one test image with 4096 colors and random pixels. This image has almost no compression, thus very high throughput numbers. Normal images are grouped around the lower number with only small variations (< +20 MB/s). Interpretation of the numbers:
So I'm gonna try to find a simpler state model, as it promises great savings. My original JS implementation uses a DFA switching on every input byte. While this is fairly equal in JS, it actually performs worse in wasm. The current wasm state model is a direct transfer of the DFA states into switch and if-else conditions, with less switching per byte but deeper nested side paths. Edit: Fixed the mixed numbers above. |
@jerch , I will hopefully write the second and third parts this weekend, and try to clear up some points with the numbers and profiling.
This is my main loop, by the way:
Each function is inlined and the only other two inner loops are in |
@ismail-yilmaz 61 MiB/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, that's far beyond what I can achieve (xterm.js input processing alone is slower, lol). May I ask you to run the benchmarks of my lib? I have the feeling that our machines are quite different in the throughput numbers; sadly I have no clue how to measure wasm execution in terms of CPU cycles, which would be a better number for comparison. If you are up to testing my wasm decoder I can give you the compiled version, so you won't need to install the emscripten SDK. |
Unfortunately no. It is sequence parsing + sixel decoding.
The drop has several reasons. But the most important one is that I have to stick with the gui calls and rules provided by our framework to ensure backend compatibility, as our vte is a widget, not an app. But the raw throughput (when only the image display is disabled) for
But this "extremely fast" rendering is not feasible for me, because I need to use a static buffer with fixed length to get it. You can see where I'm getting at... :)) It appears that |
Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test, I will run the existing tests and compare the results with mine (I'll share the results within the next week). I am really interested in optimizing sixel rendering now, and I'd really like to see xterm.js with top-notch sixel support. Besides, your |
I can prepare a script for the sixel-bench test, sure thing. If you pulled the whole repo:
cd node-sixel
git checkout faster_decode
# edit wasm/build.sh to point EMSCRIPTEN_PATH to your emsdk_env.sh
# then run
npm install # will abort, just ignore it
npm run build-wasm
npm install
# benchmark
node_modules/.bin/xterm-benchmark ./lib/index.benchmark.js
Have not tested that yet on a clean checkout, so bear with me if it's not working out of the box. Also my newer optimizations are not yet checked in, thus you will see slightly lower numbers. Edit: Fixed the checkout/build commands. |
I'd be grateful, thanks! I will benchmark it on this sunday. |
Damn switch statement - with a flat switch statement over byte values I get worse numbers (80 - 150 MB/s), but reshaped into sorted if-else cascades I get slightly higher numbers (100 - 180 MB/s). Lol, not a good sign if the conditional cascade actually performs better; normally the switch statement should outperform it by far, with proper jump-table optimization for the multiple cases, grrrrr. So the neck-breaker for the current complicated state model seems to be the extensive usage of switch. Will see if I can get a simpler branching model from the cascading thingy (it still misses several edge-case transitions). |
This is the best compromise I could find with reduced state handling:
inline void maybe_color() {
if (ps.state == ST_COLOR) {
if (ps.p_length == 1) {
ps.color = ps.palette[ps.params[0] % ps.palette_length];
} else if (ps.p_length == 5) {
if (ps.params[1] < 3
&& (ps.params[1] == 1 ? ps.params[2] <= 360 : ps.params[2] <= 100)
&& ps.params[3] <= 100
&& ps.params[4] <= 100) {
switch (ps.params[1]) {
case 2: // RGB
ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_rgb(
ps.params[2], ps.params[3], ps.params[4]);
break;
case 1: // HLS
ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_hls(
ps.params[2], ps.params[3], ps.params[4]);
break;
case 0: // illegal, only apply color switch
ps.color = ps.palette[ps.params[0] % ps.palette_length];
}
}
}
}
}
void decode(int length) {
if (ps.not_aborted && ps.y_offset < ps.height) {
for (int i = 0; i < length; ++i) {
int code = ps.chunk[i] & 0x7F;
if (62 < code && code < 127) {
switch (ps.state) {
case ST_COMPRESSION:
put(code - 63, ps.color, ps.params[0]);
ps.cursor += ps.params[0];
ps.state = ST_DATA;
break;
case ST_COLOR:
maybe_color();
ps.state = ST_DATA;
// fall through: also paint the current sixel byte
default:
put_single(code - 63, ps.color);
ps.cursor++;
}
} else if (47 < code && code < 58) {
params_add_digit(code - 48);
} else
switch (code) {
case 59:
params_add_param();
break;
case 33:
maybe_color();
params_reset();
ps.state = ST_COMPRESSION;
break;
case 35:
maybe_color();
params_reset();
ps.state = ST_COLOR;
break;
case 36:
ps.cursor = 0;
break;
case 45:
ps.y_offset += 6;
ps.offset = ps.y_offset * ps.width + 8;
ps.cursor = 0;
break;
case 34:
maybe_color();
params_reset();
ps.state = ST_ATTR;
break;
}
}
}
}
While if-cascades are slightly faster in wasm, they kinda make the code a lot uglier. Furthermore, I don't want to optimize for shortcomings of wasm runtimes (only tested with nodeJS so far); if they have a hard time optimizing to real jump tables, they better get that fixed, imho. Throughput numbers are between 92 - 180 MB/s; binary size dropped from ~6KB to ~3.8KB.
Well, it seems I cannot find a significantly faster version in wasm even with the reduced state model, and I don't think the states can be further reduced without introducing faulty handling. I did not yet try your suggested optimizations above. |
In the meantime, some preliminary benchmarks, including wasm (sixel tests, on my development decoder, the one that gets ~42 MiB/s on sixel-bench):
The "fringe" decoder, the 61 MiB/s version (static RGBA buffer with a fixed size of 1920x1080). Yeah, kinda insane. But this can happen when we make compilers really happy. Making compilers happy can make life miserable for a developer, though - hence this decoder is not really usable.
And
and
I'll continue to explore the optimization strategies, part 2, with some more findings, tomorrow. Have a nice weekend. |
@jerch , Another treat: a modified version of your wasm decoder (I adapted the logic of mine to yours; seems slightly faster, but contains less checking). You can compare it with the above benchmarks. (Warning: the patch code is ugly - by C++ standards, at least...) Results:
Please find attached the modified code.. |
I've made some corrections to my patch, and as a result it seems a bit faster than the previous one:
The code can be easily modified and unrolled, which might let you squeeze out even more speed, and test the assembly behavior for your final decoder. |
@ismail-yilmaz Hmm weird, with your last code I get slightly worse runtimes than with my latest optimizations (still have to clean up and push that). Btw in line 187 you do this:
for(int n = 0; n < 6; ++n) {
if(c & (1 << n))
ps.canvas[p + ps.jump_offsets[n]] = ps.color;
}
Replaced with:
for(int n = 0; n < 6; ++n) {
ps.canvas[((c >> n) & 1) * (p + ps.jump_offsets[n])] = ps.color;
}
I get much higher throughput. It is again the "write always something" trick - can you try if this gives you higher numbers as well? (Because you said above that this is worse for you, so I am not sure if it misuses some CPU cache settings.) Btw my numbers are constantly higher than yours, so your 42-ish number is more like 65 on my machine. Maybe CPU and bus speed differences can explain most of that; not sure if there are bigger differences in cache loading times between AMD and Intel. I also think that the different pipeline lengths will equal out in the end (well, I don't really have an informed opinion in that field anymore - lost interest in CPU differences around the Pentium 4, which was known to have insanely long pipelines). |
@jerch , Sure (not much difference here, interesting...):
Well, for one, it is because you are using a static integer array with fixed size. Compilers usually vectorize the hell out of those (on the x86 arch). I allocate the memory on the heap (resizeable) and use an aligned RGBA structure. Integer operations can use registers; setting an RGBA buffer (even if the lengths are the same) is usually done using memcpy. That's why the performance of my renderer takes a hit if I "always write". Here's a synthetic benchmark stressing the difference between RGBA/integer and static vs dynamic allocation:
Possibly, because it appears that Athlon FX 6100 is somewhere in between i5 and i7, closer to i5. |
@jerch
|
@ismail-yilmaz Pushed my latest optimizations, which are actually the fastest with all proper state transitions. That's the one where I get >90 MB/s for all tests of the benchmark, and ~170 MB/s for the 12bit noise image (which should not be taken too seriously, as it is "very opinionated" in its data). The reduced state model is also contained. Edit: Ah I see, your last numbers are more like mine, only showing a small gap now. |
Yes, the static nature is much easier to deal with, but that's all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed. Ofc with the pixels on the heap you cannot get the pseudo cache locality from chunk bytes, as they could be allocated elsewhere (and the chunk bytes prolly live elsewhere as well). |
Yeah, that's unfortunately not really affordable for me. That's why I'd better explore other options... |
I did not like the static memory in the first place, because it creates tons of friction for the higher-level JS integration. But emscripten currently does not allow "memory.grow" for direct wasm builds (it's simply not yet implemented, lol), thus removing the allocators and going fully static was the easiest way to overcome the shortcomings. And since it shows better performance I am not really mad about it and will reshape the JS integration to suit that model. Edit: But can't you do something alike - like allocating a bigger area once and doing the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with |
I'll update the node-sixel shortly and report the results later today.
Of course. I already did some "cool/clumsy" tricks that "just work" and failed miserably at others, but they were not feasible in general, because the real problem is that our vte is a widget and it can, and should continue to, work on a variety of devices and backends, ranging from limited hobby hardware and Raspberry Pi to hi-perf desktops, on SDL/linuxfb and even in web browsers. That's another reason why I find your work on wasm and your opinions important and very productive for me. You see, I am also the co-author and maintainer of our HTML5 backend, which allows apps that use our framework to run inside web browsers (canvas + websockets). It is called |
Oh I see, well that's interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions. I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance-critical JS sections, even if called over and over on a hot path. |
Well, turtle is pretty lightweight, as it does not try to recreate a gui with JS. What it does is simply redirect the app window's output (and an associated background) to a web browser via a simple base64-encoded binary protocol. It gets decoded by javascript and displayed, and key + mouse input are tracked. This all happens in a simple loop, so it can work on all major browsers. So all it does is some image processing and blitting partial or complete images from a server application to a client web browser capable of canvas and web sockets. That's why I think it can be easily optimized using webassembly (I may be dead wrong, of course). To give you a clear idea of what I am talking about, here is an example. (Note that this is from two years ago, and neither TerminalCtrl nor turtle are as slow as this now.) But then again, I don't really want to pollute our sixel and wasm optimization discussion further with other stuff. Later, when I implement this, I'd love to discuss it and learn more about webassembly from your experience with it. |
Oh thx for finding this one, yeah the unaligned load is a lot slower (had to get rid of it with |
@ismail-yilmaz Making baby steps towards SIMD parsing. Got the easy part working - slicing data at single state-changing bytes (CR, LF, color/attribs/compression introducer). This high-level lexing runs at 2 GB/s native, and 700 MB/s in wasm (heavy usage of
The real challenge now is not to waste too many cycles on the actual parsing of the subchunk data. 😸
Normally I am a big fan of less data moving, but the penalty of unaligned SIMD access seems rather high, so an additional copy step might help if many SIMD instructions follow. Also not sure if my SIMD-fu is good enough to get number/color parsing done with it; I might just drop to the old code here (which is no biggy, as those are low on the runtime meter). Still, the sixel data bytes will need special care and better get moved over to SIMD paint in aligned batches. Regarding parsing numbers, this might help: http://0x80.pl/articles/simd-parsing-int-sequences.html |
Those numbers are impressive and sound very promising!
Yeah, this is what I have in mind too. For the next round, my strategy will be to focus on batching single sixels with color information (map them), without modifying the original loop much, then flush them at once using SIMD where count % 4 == 0. I'd like to see the difference, for better or worse. |
Did some thinking/modelling of parsing with SIMD. My idea is currently the following:
These are "typed" fragments containing:
The first fragment might be continuing from the previous vector, thus needs some sort of state carry. In the same sense the last fragment might not be finished yet; not sure yet how to efficiently deal with that. All other fragments are "final", thus have proper outer state boundaries and can be processed atomically. The main problem I currently have is to efficiently identify whether a fragment has follow-up sixel bytes or not. In SIMD that could be done as follows:
__m128i data = _mm_loadu_si128(fragment);
__m128i lesser_63 = _mm_cmplt_epi8(data, _mm_set1_epi8(63));
__m128i lesser_127 = _mm_cmplt_epi8(data, _mm_set1_epi8(127));
__m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);
While this works, it creates a rather big penalty and is not yet helpful in the later sixel processing. And there is another problem with that fixed-fragment idea - the biggest atomic directive the sixel format can have is a non-disturbed color definition like:
This can only be processed in 128-bit SIMD in one go if we limit the register numbering to one digit - bummer. Meaning, at least for this one, it is not possible without looping again. |
Modified my top-level parser a bit, which makes it slower (at 605 MB/s native), but now also precalculates the sixel byte offsets:
void decode(int length) {
__m128i LF = _mm_set1_epi8(45);
__m128i CR = _mm_set1_epi8(36);
__m128i COLOR = _mm_set1_epi8(35);
__m128i COMP = _mm_set1_epi8(33);
int l = length / 16 * 16;
int rem = length - l;
for (int i = 0; i < l; i += 16) {
__m128i part = _mm_load_si128((__m128i *) &ps.chunk[i]); // this is actually very slow in wasm!
int start = i;
int end = i;
// test for singles: CR, LF, COMPRESSION and COLOR introducer
__m128i testLF = _mm_cmpeq_epi8(part, LF);
__m128i testCR = _mm_cmpeq_epi8(part, CR);
__m128i testCRLF = _mm_or_si128(testLF, testCR);
__m128i testCOLOR = _mm_cmpeq_epi8(part, COLOR);
__m128i testCOMP = _mm_cmpeq_epi8(part, COMP);
__m128i testCOLORCOMP = _mm_or_si128(testCOLOR, testCOMP);
__m128i testSINGLE = _mm_or_si128(testCRLF, testCOLORCOMP);
int hasSINGLE = _mm_movemask_epi8(testSINGLE);
// identify sixels
__m128i lesser_63 = _mm_cmplt_epi8(part, _mm_set1_epi8(63));
__m128i lesser_127 = _mm_cmplt_epi8(part, _mm_set1_epi8(127));
__m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);
int sixelPos = _mm_movemask_epi8(sixels_marked);
while (hasSINGLE) {
// get LSB: __builtin_ctz (better with _tzcnt_u32, but not supported in wasm)
int adv = __builtin_ctz(hasSINGLE);
end = i + adv;
if (end - start) parse_fragment(start, end, i + (sixelPos ? __builtin_ctz(sixelPos) : 16));
handle_single(i + adv, ps.chunk[i + adv]);
start = end + 1;
hasSINGLE &= ~(1 << adv);
sixelPos &= ~((1 << adv) - 1);
}
end = i + 16;
if (end - start) parse_fragment(start, end, (sixelPos ? __builtin_ctz(sixelPos) : 16) + i);
}
// TODO: rem handling...
}
Now Edit: Fixed lousy do-while loop in code. |
@ismail-yilmaz The |
More findings about sixel parsing - my basic mixed sixel painter looks like this now: inline void simd_paint(__m128i sixels, int cur) {
__m128i colors = _mm_set1_epi32(ps.color);
__m128i ones = _mm_set1_epi32(1);
int p = cur * 4 + ps.offset;
for (int i = 0; i < 6; ++i) {
__m128i singles = _mm_and_si128(sixels, ones);
__m128i bitmask = _mm_cmpeq_epi32(ones, singles);
__m128i updated = _mm_and_si128(colors, bitmask);
__m128i prev = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
__m128i keep = _mm_andnot_si128(bitmask, prev);
__m128i final = _mm_or_si128(keep, updated);
_mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);
sixels = _mm_srai_epi32(sixels, 1);
}
}
It is meant to be fed with 4 consecutive sixels (minus 63) and the current 4-pixel-aligned cursor position (128-bit progression). The color blending with the previous color is fixed with a select mask. On the caller level I found 2 ways suitable to digest the mixed sixels:
The second variant is in general ~20% faster (prolly due to omitting the extra memory roundtrip of the register array), but degrades for images with highly scattered sixel bytes. The reason is obvious - if you always do the shift variant but only one sixel byte at a time comes in, you have to do the shift correction and dummy paints over and over. Imho a combination of both will help here, with local sixel byte state:
About paint performance: Things not yet covered:
Cheers 😸 Edit: |
@jerch Just to be clear: I can't comment or give much feedback these days because of my job right now. I am reading your findings and really appreciate your hard work. I think we should later gather these findings into a recommendation guide. Also maybe, just maybe, we can later use these findings to create a reference SIMD-optimized encoder and decoder. (Btw. last night I started to implement the parser in SIMD, but now it has to wait for next week until I am free to finish & test it and give you some feedback.) |
Ah no worries, no need to feel pushed or anything; to me this is mostly optimization for fun. If we get somewhere down to a ref decoder/encoder - awesome, but if not - I don't care much either. |
One more idea to further lower half-filled paint calls - by stacking the colors in a ringbuffer[4], we could lower paint calls to a real 4-pixel progression (including CRLF jumps). Not sure if this will show a real benefit (or even run worse because of the ringbuffer overhead); as far as I know most encoder libs do a CR reset on a color change, not mixing colors from the current line cursor position onwards. Well, would need some investigation into typical sixel encoding schemes... |
Some callgrind profiling data (shortened):
|
Started to wonder how a simple thing like the RGB color conversion would fare, thus went on and tried SIMD versions of it:
int normalize_rgb(int r, int g, int b) {
return 0xFF000000 | ((b * 255 + 99) / 100) << 16 | ((g * 255 + 99) / 100) << 8 | ((r * 255 + 99) / 100);
}
int normalize_rgb_simd_int(int r, int g, int b) {
// algo: ((x * 255 + 99) * 0xA3D7 + 0x8000) >> 22
__m128i reg = _mm_set_epi32(r, g, b, 100);
reg = _mm_mullo_epi32(reg, _mm_set1_epi32(255));
reg = _mm_add_epi32(reg, _mm_set1_epi32(99));
reg = _mm_mullo_epi32(reg, _mm_set1_epi32(0xA3D7));
reg = _mm_add_epi32(reg, _mm_set1_epi32(0x8000));
reg = _mm_srli_epi32(reg, 22);
__m128i result = _mm_shuffle_epi8(reg, _mm_set_epi8(
0x80, 0x80, 0x80, 0x80,
0x80, 0x80, 0x80, 0x80,
0x80, 0x80, 0x80, 0x80,
0x00, 0x04, 0x08, 0x0C
));
return _mm_cvtsi128_si32(result);
}
int normalize_rgb_simd_float(int r, int g, int b) {
__m128 reg = _mm_set_ps(r, g, b, 100);
reg = _mm_mul_ps(reg, _mm_set1_ps(2.55f));
reg = _mm_round_ps(reg, _MM_FROUND_TO_NEAREST_INT);
__m128i result = _mm_cvtps_epi32(reg);
result = _mm_shuffle_epi8(result, _mm_set_epi8(
0x80, 0x80, 0x80, 0x80,
0x80, 0x80, 0x80, 0x80,
0x80, 0x80, 0x80, 0x80,
0x00, 0x04, 0x08, 0x0C
));
return _mm_cvtsi128_si32(result);
}
and here is the asm for those (gcc 11) with cycle counts in front (taken from the intel optimization guide):
normalize_rgb(int, int, int):
1 mov r8d, edx
1 mov edx, esi
1 sal edx, 8
1 sub edx, esi
1 add edx, 99
1 movsx rax, edx
1 sar edx, 31
3 imul rax, rax, 1374389535
1 sar rax, 37
1 sub eax, edx
1 mov edx, edi
1 sal edx, 8
1 sal eax, 8
1 sub edx, edi
1 add edx, 99
1 movsx rcx, edx
1 sar edx, 31
3 imul rcx, rcx, 1374389535
1 sar rcx, 37
1 sub ecx, edx
1 or eax, ecx
1 mov ecx, r8d
1 sal ecx, 8
1 sub ecx, r8d
1 add ecx, 99
1 movsx rdx, ecx
1 sar ecx, 31
3 imul rdx, rdx, 1374389535
1 sar rdx, 37
1 sub edx, ecx
1 sal edx, 16
1 or eax, edx
1 or eax, -16777216
39 ret
normalize_rgb_simd_int(int, int, int):
1 mov eax, 100
1 movd xmm0, esi
1 movd xmm1, eax
2 pinsrd xmm0, edi, 1
2 pinsrd xmm1, edx, 1
1 punpcklqdq xmm1, xmm0
1 movdqa xmm0, xmm1
1 pslld xmm0, 8
1 psubd xmm0, xmm1
1 paddd xmm0, XMMWORD PTR .LC0[rip]
10 pmulld xmm0, XMMWORD PTR .LC1[rip]
1 paddd xmm0, XMMWORD PTR .LC2[rip]
1 psrld xmm0, 22
1 pshufb xmm0, XMMWORD PTR .LC3[rip]
1 movd eax, xmm0
26 ret
normalize_rgb_simd_float(int, int, int):
1 pxor xmm2, xmm2
1 pxor xmm1, xmm1
1 pxor xmm3, xmm3
1 movss xmm0, DWORD PTR .LC4[rip]
5 cvtsi2ss xmm2, edx
5 cvtsi2ss xmm1, esi
5 cvtsi2ss xmm3, edi
1 unpcklps xmm0, xmm2
1 unpcklps xmm1, xmm3
1 movlhps xmm0, xmm1
5 mulps xmm0, XMMWORD PTR .LC5[rip]
6 roundps xmm0, xmm0, 0
3 cvtps2dq xmm0, xmm0
1 pshufb xmm0, XMMWORD PTR .LC3[rip]
1 movd eax, xmm0
38 ret
Performance: the normal version wins. Edit: the float version drops down to 20 cycles if already fed with floats (and currently wins the race in the decoder). Still think the int variant will be the winner in the end, as the values are not meant to be floats. I really wonder, if I could get rid of that |
@ismail-yilmaz I kinda give up on the high-level SIMD tokenization. I tried 3 different approaches now (with the last being the fastest, but still 25% slower than my fastest byte-by-byte loop). It turns out that the high fragmentation of sixel data needs many state changes within one 16-byte slice (one register load), which creates lots of single-byte extractions (cumbersome with SIMD) and alignment corrections on number and sixel bytes (the only ones that are consecutive to some degree). But again, those consecutive bytes cannot be processed directly in SIMD, as they need special preparation:
Nonetheless here is my last and fastest top-level SIMD attempt, in case you want to play with it or have a better idea how to approach the fragment data:
typedef union Vec128 {
__m128i vector;
long long i64[2];
int i32[4];
short i16[8];
char byte[16];
} Vec128;
inline void handle_unclear(char code) {
switch (code) { // <---- thats the speed showstopper...
case '!':
ps.state = ST_COMPRESSION;
break;
case '#':
ps.state = ST_COLOR;
break;
case '$':
ps.cursor = 0;
break;
case '-':
ps.y_offset += 6;
ps.offset = ps.y_offset * ps.width + 16;
ps.cursor = 0;
break;
case ';':
break;
}
}
static int sixels_or_numbers_counter = 0;
inline void handle_sixels_or_numbers(Vec128 reg, int start, int end, int number_bits, int sixel_bits) {
// to not get removed by optimization
sixels_or_numbers_counter++;
// follow-up code removed for less cluttering
}
// like BLSR, but not available in wasm
inline int clear_bits_below(int x) {
return x & (x - 1);
}
inline __m128i mask_sixels(const __m128i input) {
const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(65));
return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-64));
}
inline __m128i mask_numbers(const __m128i input) {
const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(80));
return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-118));
}
void decode____(int length) {
Vec128 reg;
int l = length / 16 * 16;
int rem = length - l;
if (rem) {
for (int k = l + rem; k < l + 16; ++k) ps.chunk[k] = 0;
l += 16;
}
for (int i = 0; i < l; i += 16) {
reg.vector = _mm_lddqu_si128((__m128i *) &ps.chunk[i]);
// strip high bit
reg.vector = _mm_and_si128(reg.vector, _mm_set1_epi8(0x7F));
// identify sixel & numbers
int sixels = _mm_movemask_epi8(mask_sixels(reg.vector));
int numbers = _mm_movemask_epi8(mask_numbers(reg.vector));
// identify unclear bytes
int unclear = 0xFFFF & ~(sixels | numbers);
int pos = 0;
while (unclear) {
int adv = __builtin_ctz(unclear);
if (pos < adv) {
handle_sixels_or_numbers(reg, pos, adv, numbers, sixels);
}
handle_unclear(reg.byte[adv]);
pos = adv + 1;
unclear = clear_bits_below(unclear);
}
if (pos < 16) {
handle_sixels_or_numbers(reg, pos, 16, numbers, sixels);
}
}
}
This top-level SIMD loop needs a lot fewer instructions per 16-byte slice than my first version, and groups slices into typed fragments. When applying the real data parsing (color/compression/sixels) down the code paths, things get ugly due to the extra work needed for padding and alignment (not shown above). So far the only ground that shows a real speed improvement with SIMD is the sixel-to-pixel path. I'm gonna stop those top-level SIMD attempts for now, until you have a fundamentally better idea how to approach the data. In the meantime I'm gonna try to micro-optimize my byte-by-byte loop (repeated SIMD painting still missing). The last trick I have in mind is to reduce the dummy paints* to almost 0, at least across compression-sixels-compression progression (not really useful across color changes, as they almost always contain cursor movements with CR).
[*] You might have wondered in my last posts about "dummy paints". I call any paint a "dummy paint" if the 4-pixel range is underfull, thus sixels are missing, which creates additional "nonsense" paints with
In total |
@jerch ,
That sounds like a very good number for a JS sixel decoder (or so I assume, because the only one I know of in the wild is yours). I'll take your findings and try moving them forward on a test setup, starting this Monday, as I will have a window to work on something fun (we'll have a national holiday season this whole week). Will post my findings on x86_64. So stay tuned. :) |
Latest numbers:
While I missed my >130 MB/s goal with "test2_clean.sixel" (not even sure why), I got a nice boost on the fullHD noise image, which is now at 12 fps. That's still very low; on the other hand, the image is really hard to chew on, with its 4096 colors and the single-pixel addressing (the cursor kinda moves from 0 to the target position for tons of 1-bit sixels). I have no test data for normal fullHD stuff, but I think it is capable of staying >30 fps for less degenerated data. |
@ismail-yilmaz Not sure how well you know callgrind, but I have an interesting case which I don't quite understand; maybe you have an idea or can explain what's going on. Playing around with the byte-by-byte loop I found this loop schematic to be the fastest:

int decode(int length) {
for (int i = 0; i < length; ++i) {
int code = ps.chunk[i] & 0x7F;
switch (ps.state) {
case ST_DATA:
while (62 < code && code < 127) {
... // process sixel bytes
if (++i >= length) break; // <--------
code = ps.chunk[i] & 0x7F;
}
break;
case ST_COMPRESSION:
while (47 < code && code < 58) {
... // process number bytes
if (++i >= length) break; // <--------
code = ps.chunk[i] & 0x7F;
}
break;
}
}
}

It basically sub-loops at all positions where multiple occurrences are possible, skipping the outer loop and switch statement. Ofc now I need those inner break conditions on i >= length. Now looking into callgrind I see many instructions being burnt on those length-break checks. Thus I went further and restructured the break conditions as an outer fall-through:

int decode(int length) {
ps.chunk[length] = 0xFF; // new fall-through break condition in data
for (int i = 0; i < length; ++i) {
int code = ps.chunk[i] & 0x7F;
switch (ps.state) {
case ST_DATA:
while (62 < code && code < 127) {
... // process sixel bytes
code = ps.chunk[++i] & 0x7F;
}
break;
case ST_COMPRESSION:
while (47 < code && code < 58) {
... // process number bytes
code = ps.chunk[++i] & 0x7F;
}
break;
}
} // 0x7F in code will reach here as NOOP breaking the for loop eventually on next i < length check
}

This works great and reduces the counted instructions for …
I haven't checked it yet (I will), but it is likely a pipeline effect (on x86, at least). Remember this? That is exactly how I process parameters (and also how I optimized my sequence parser) on x86_64. In almost all scenarios the while loop works faster than the for loop for operations that are likely to loop more than once. E.g. two styles of reading parameter chunks (size: 95.37 MB):

For loop: …

While loop: …

Timing (20 rounds, O3): …
The problem with this approach, in my experience at least, is that it performs worse with single sixel paints (branch misses tend to be high, and winding/unwinding the stack for the loop at that "hot" point has a greater cost than its potential gains).
Yes that's true, but the image triggers that case only 157 times. Overall the sub-loops are much faster in wasm (not so much in native). Ofc the stack unwinding back from the inner switch case might be much more expensive here (br_table moves quite some stuff on the stack in wasm). Well, I don't have something like callgrind for wasm, thus I can only profile things indirectly. Nonetheless I will try to reshape the outer thing into a while loop as well. It is a bit annoying that I cannot use pointer stuff that easily in wasm; no clue why I get such a bad penalty for using it.
@ismail-yilmaz Note there is an issue with your while loop above: your loop condition tests for a non-zero byte, so it would stop early on NUL bytes in the data. (Some background on this: NUL was used by some very early terminals as "time padding bytes" (to give the device some time to keep up), thus it is theoretically possible to see cascades of NULs in old data.)
@jerch , Ah, that's just a synthetic test loop to display the difference, the actual parser is based on Upp::Stream and uses Stream::Get() and Stream::Peek(), which return -1 on eof or stream error (meaning, 0 is considered a valid byte. Besides, that loop can be used to parse the parameters in my use-cases, because our parser, firstcollects and then splits the parsed valid parameter chunks into a vector of Strings. Strings are null terminated. So at that point it is guaranteed to have no '\0' in the string. except for the null terminator. |
Another optimization idea I had while looking at callgrind metrics: currently …

… and a final copy step on LF: …

This has several advantages in terms of cache and SIMD utilisation: …
Ofc this has a major drawback: the additional copy step itself, which might just be more expensive than the savings from the better cache utilization achieved. That remains uncertain until actually tried. Another similar idea would be to spread the sixel bits into SIMD registers (horizontal spreading), instead of the current shift-or looping (vertical spreading):
Well, I don't expect that to be any faster, for 3 reasons:
Not sure if I will try any of these ideas; they need quite some canvas construction refactoring before they would work at all...
Some more micro-optimizations here and there:

Also tested against 5 fullHD screenshots; they all run at 40-50 FPS, and the degraded noise example is now at 15 FPS. Furthermore I partially tested the first idea of my last post: well, it is not worth the trouble. A cache-optimized load/store in … Will push the code once I've cleaned it up a bit.
Extended the benchmark script with more image-relevant metrics to get some more explanation of the bottlenecks:
To make sense of these numbers, we need to know some facts about the images themselves:
Interpretation:
Summary: …
@ismail-yilmaz Some updates from my side. I have not yet pushed the code, as the changes and integration still take time because of the level 1 vs. level 2 handling and the question whether to truncate at raster dimensions. While trying to find a working API to cover all these details I made an interesting observation that might also help you to get better cache utilization: My wasm implementation started as level 2, truncating only, as it promised way greater optimizations. In the end we are only interested in the final pixel array, thus it seemed obvious to do that as one big blob of memory to avoid copy costs (even static, to save pointer indirections). … Imho this is a pretty big cache side effect overcompensating the additional copy step (prolly due to the big blob going into far memory regions for big images, with bad cache locality again). I still have to understand the details, and I am also not settled yet on whether to keep the level 1/2 distinction or go with just one general purpose decoder in the end (that would be way easier to maintain). If you want to try something similar, I did the following:
At first sight it is not really obvious why this might lead to better cache utilization, as the memory usage is quite spread across those 0.75 MB. Imho the following happens: really every memory interaction (sixel-to-pixel write, off-copy, flush) touches these 6 memory areas, prolly keeping them hot in the cache. Only for very wide images will thrashing occur, when the cursor gets too far to the right. Just a theory atm; if it is true, there should be a significant throughput drop at some line width depending on the cache size.
Hello @jerch, Thanks for the update! I am reading your posts, passively. As you prolly noted, I've taken a short break after a long and very exhausting season. I'll be back and try to implement + continue testing + reporting back as usual, starting from Aug 29 on.
Trying to get a hold of the cache effects: with a very simple autogenerated image across various widths I see the following behavior in my benchmark:
It seems the throughput peaks somewhere around 128px width and has a bigger drop between 1024 and 4096. Inspecting these areas in more detail reveals that the throughput peaks between 120-180px with 125 MB/s max, and around 1200-1400px the throughput drops from ~107 to ~95 MB/s all of a sudden. The drop around 1300px is pretty close to my L1 cache size of 32kB, which strongly suggests that the cache behavior changed here. I have no good explanation for the lower peak yet; it simply might be a summation effect of overall buffer/cache utilization. Furthermore my tests are partially skewed by GC actions, which might show up non-deterministically (might be the reason for the rather big ranges across the runs). Trying to test cache behavior with JS is def. not a good idea 😅. (Btw very small images <64px width are much slower in throughput; I guess the function call overhead for off-copying pixels gets significant here.)
The new line-based decoder opened the field for further optimizations, which kinda makes the SIMD attempts obsolete under wasm.
It seems SIMD is not yet that usable and well optimized for wasm, thus I will stick with the generic version for now, which also covers more browsers. Note that for x86-native the sixel SIMD decoder runs "test1_clean.sixel" at 220 MB/s (~40% faster).

offtopic: …
@ismail-yilmaz FYI - pushed my latest wasm code to the PR: jerch/node-sixel#20.
Hello @jerch,
I don't want to pollute your repos or other active discussion threads with my findings until I can come up with some practical optimization suggestions for your renderer. So I'll put this here.
I've been doing some heavy profiling on my SixelRenderer since last week. Here are some preliminary results:
- 42 MiB/s (2.90 secs) (parsing + rendering) on your sixel-bench animation, which means a ~12 MiB/s gain for me (sixel-bench video, 38.00 MiB/s, 3.10 secs, on average).

As you can guess, this does not translate 1:1 to final terminal performance, because other variables also affect the vte rendering performance. Still, it can achieve slightly higher performance than MLTerm now.

The above improvements are all due to some really simple tricks (one of which I borrowed from you) and the examination of the produced assembly code. I am confident that we can raise the throughput bar even higher, because there is still room for optimization.
In the following days this week I'd like to share my findings (code changes/alterations) with you, which you might find interesting and, hopefully, useful for your c/wasm decoder.