chafa sixel-encodes AllRGB images substantially faster than notcurses #2573
this is a case ripe for threading, which …
also, gcc12 introduced autovectorization at O2, from which we might benefit. if we don't, it's probably time to take a look at SIMDizing this shit ourselves via intrinsics. it's always good to have competition! even better to then blow said competition out of the water.
i'm thinking we ought be able to do a fixed-size three-dimensional pseudometric and eliminate the green preference in constant time. if the difference is beyond first order, it's barely going to show up anyway. there's no need to search the entire space. right now we have constant time with a single dimension by checking high and low over the channel sum. expanding that into three dimensions would mean eight bounds instead of two, but all of them are in cache, and you could likely SIMDize the actual comparisons, thus eliminating any branches. hell, we could probably do that and improve our performance as it exists simply by reducing two comparisons to a single SIMD instruction and killing the branch. hot damn, pepper your angus!
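something like this for the bounds test -- a minimal SSE2 sketch of the idea only, not the notcurses code; `in_box()`, `chan_lo`, and `chan_hi` are made-up names, and the fourth lane is just padding:

```c
#include <emmintrin.h>
#include <stdbool.h>

// branchless bounds check: one lane per channel (r, g, b, plus a pad lane
// whose value must sit inside its own bounds, e.g. 0 within [0, 0]).
static inline bool
in_box(__m128i rgb, __m128i chan_lo, __m128i chan_hi){
  __m128i below = _mm_cmplt_epi32(rgb, chan_lo); // all-ones where rgb < lo
  __m128i above = _mm_cmpgt_epi32(rgb, chan_hi); // all-ones where rgb > hi
  __m128i out = _mm_or_si128(below, above);
  return _mm_movemask_epi8(out) == 0;            // no lane out of bounds -> inside
}
```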
looking at a single-instance run over the folder, we get:

… for chafa and

… for notcurses
so yeah, i think a good first step would be...let's see, we get 1024 RGBA pixels to a page. that's a 32x32 chunk, or any equivalent rectangle. we'd want contiguous sections in memory to go to the same thread, so each thread just takes 1024 pixels (or some multiple thereof, N). the next thread takes the next N. you can work out the starting pixel in O(1) from your pixel offset (we'll need to adapt this a bit for linestrides). threads grab chunks until it's done. we could spin up some worker threads ahead of time, maybe upon detecting sixel support, so they're ready by the time we actually get a sixel encode request. they can be applied to any sixel encoding, even multiple concurrent ones, so long as job info is present in the work queue. there are three phases for the purposes of a concurrency graph: …
so it would be good to see how much of our time each phase is occupying. if a majority of it is refinement/merging, threads probably aren't going to get the job done (Amdahl's Law and all that). if it's 1 and 3, especially 3, they'll help very much. i'm inclined to think 8192 or 16384 are good chunk sizes (2048 or 4096 pixels). they don't trounce the entire L1 no matter what uarch properties it has (so long as it's 32KB+). we don't want an abundance of threads--the more threads, the more contention in phase 1.
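the chunk-grab could look something like this -- a hedged sketch only, with made-up names (`sixeljob`, `grab_chunk()`, `CHUNKPIXELS`), assuming a C11 atomic counter as the work queue:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNKPIXELS 4096 // 16KB of RGBA -- comfortably under a 32KB L1

struct sixeljob {
  const uint32_t* rgba;    // source pixels
  int pixy, pixx;          // pixel geometry
  size_t linestride;       // pixels per row in memory (may exceed pixx)
  atomic_size_t nextchunk; // next unclaimed pixel offset
};

// each worker calls this until it returns false. the atomic fetch-add is the
// only point of contention, and the starting pixel falls out in O(1).
static bool
grab_chunk(struct sixeljob* j, size_t* y, size_t* x, size_t* npixels){
  const size_t total = (size_t)j->pixy * j->pixx;
  size_t off = atomic_fetch_add(&j->nextchunk, CHUNKPIXELS);
  if(off >= total){
    return false; // all chunks claimed
  }
  *y = off / j->pixx; // index into rgba via linestride, not pixx
  *x = off % j->pixx;
  *npixels = total - off < CHUNKPIXELS ? total - off : CHUNKPIXELS;
  return true;
}
```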
then, within a thread, it would be a fine thing to vectorize at least …

shifting to …

hrmmm, that's not going to be the best way to do it given how …
so we'd need to parallelize the TAM crap as well. which i think we can do -- there are no carried dependencies. but we'd need to arrange the workers differently; we'd want each one to do a cell at a time, or some multiple thereof. so if we wanted to do that, i think the thing to do would be to factor out an …

but overall this is good -- it means more of the total computation ought be parallelizable. remember, we have to do a lot more stuff than chafa, what with those other planes and such. alright, let's get to it, you've got your marching orders!
so the first question -- can we eliminate the sixel-based loop in …
i've restructured …

while doing this, i found what i thought to be a bug (…)
oh ok

oh, motherfucker, i bet i had the …

yep, works perfectly now, w00t w00t

alright, we now have an …
hrmmm. initial profiling looks...not promising. the majority of our time seems to be spent in … (actually, we spend more time in …). here's a profile for …
and AllRGB:
so for the former case, infinite ideal parallelism in … well, good thing we ran that. so the big wins are going to come from parallelizing …

so, what is …?
the data is stored as vectors for each color, with each vector equal in size to the total number of sixels. cache performance ought be pretty reasonable, then -- we're reading left to right for the duration of a sixel line (one byte per sixel, so 64 pixels == cacheline). worst case is horizontal width congruent to 1 modulo 64, in which case we're doing a cacheline fill at 1.5% efficiency, per color, per sixel row. when width is congruent to 0 modulo 64, cacheline fills are 100% efficient. we could raise that efficiency by storing by sixel row, and then by color within said sixel row. something to consider.

normally we would expect to make significant use of sixel RLE (not applicable to e.g. the AllRGB images, though). if we admitted dynamic lengths, we could pre-RLE this data. this helps us not at all in the worst case, though.

in the case where we have many colors (and thus the slowest case), it's unlikely that any given color will be in a sixel row. can we not preprocess these away? one bit per color per sixel row, set in the previous step, would allow us to skip the full width for that color. this might be a very nice, very easy pickup. try it. you could place these in their own bitvector for cache friendliness.

so my first thought would be parallelizing on sixel rows (sketch below). a chunk is then width * colors data, each of size 1B. for 1024 colors, you hit an x86 page at 4 pixels wide (geez, no wonder we're going through so many pages -- this is a huge structure!). so a thread has plenty of work. each writes to its own internal buffer, and then you just lay those into the fbuf. ought be perfect parallelism.

alright. between these last two ideas, i think we can get somewhere. beyond that, i think we need to just shrink this elephantine structure down. too big! 819KB per sixel row for 800 pixels across @ 1024 colors -- too mas!
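to make the per-sixel-row scheme concrete, a rough sketch under those assumptions (`rowjob` and `splice_rows()` are invented names, not notcurses API); only the final splice into the fbuf is serial:

```c
#include <stddef.h>
#include <string.h>

struct rowjob {
  char* buf;     // private output buffer for one sixel row
  size_t buflen; // bytes emitted into it
};

// the only serial step: concatenate per-row output in order. everything
// upstream of this (per-row, per-color emission) parallelizes cleanly.
static void
splice_rows(const struct rowjob* jobs, size_t nrows, char* out, size_t* outlen){
  size_t off = 0;
  for(size_t r = 0 ; r < nrows ; ++r){
    memcpy(out + off, jobs[r].buf, jobs[r].buflen);
    off += jobs[r].buflen;
  }
  *outlen = off;
}
```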
so if we wanted to add the 1-bit "color absent" signal, that's best done in …
so you need (colors * sixel rows) bits total for this map, right? yeah. initialize it to 0. whenever you load …

so you know we've really built a three-pass encoder, not a two-pass. can we not integrate these two pieces of shit?
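a minimal sketch of that map, assuming one bit per (sixel row, color) set during the load pass; `actionmap_create()`/`actionmap_set()`/`actionmap_test()` and `colorregs` are hypothetical names, not the notcurses internals:

```c
#include <stdbool.h>
#include <stdlib.h>

// one bit per (sixel row, color); calloc gives the all-absent initialization.
static unsigned char*
actionmap_create(size_t colorregs, size_t sixelrows){
  return calloc((colorregs * sixelrows + 7) / 8, 1);
}

// set during the earlier pass, whenever a color lands in a sixel row...
static inline void
actionmap_set(unsigned char* map, size_t colorregs, size_t srow, size_t color){
  const size_t bit = srow * colorregs + color;
  map[bit / 8] |= 1u << (bit % 8);
}

// ...and tested in the emission loop, letting us skip the full width for any
// color that never appears in this sixel row.
static inline bool
actionmap_test(const unsigned char* map, size_t colorregs, size_t srow, size_t color){
  const size_t bit = srow * colorregs + color;
  return map[bit / 8] & (1u << (bit % 8));
}
```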
ok, the actionmap proposed above is looking good initially -- about a 20% pickup on AllRGB. need to flush out a bug though, so who knows, maybe it actually makes us 30000 times slower
alright, i have parallelized all this now, and check this out.

3.0.5 allrgb -a: …

3.0.5 allrgb: …
@joseluis it has been a long struggle, but finally, Victory |
Hot damn! |
we have a safe worker engine now. committing! |
@dankamongmen Congrats so much on your release! An epic story indeed. (My only regret about nuking the deadname account was the loss of all my reaction emojis in prior threads, including this one. :( )
Eating your lunch again, man ;-) |
very nice! i haven't worked on notcurses much in several years, but i do hope to come back to it and get things gassed back up! this was a fun little competition back in the day. |
Sure was - cheers! |
In the course of stunting on hpjansson/chafa#27, I discovered that chafa 1.8.0 renders AllRGB/balloon.jpg in about half the time of ncplayer on my 3970X. Representative timings: … vs …

in addition, the chafa output is slightly better -- look at the purple sections, and their borders with the blue sections to their right. we have some weird green pixels in there, unlike chafa.

ncplayer: …

chafa: …

note that this is on amd64, where chafa AFAIK uses hand-coded SIMD. i'm not sure if that's true on other architectures, but who cares; i tolerate defeat on no platform. well, i suppose there's only one thing to do now. THERE CAN BE ONLY ONE