yEnc SSE decode ideas #4
I was amazed to find this repo. I have been thinking of some way to do yEnc decoding (as a Python C extension) using SSE instructions, but my knowledge of C is just too rudimentary for now.
Do you think SSE can help compared to regular char-by-char decoding of the yEnc body?
How would you go about the decoding-escaping problem? I can imagine finding the escape chars, but how to remove them later on when building the output string?
I tried to grasp your encoding code, but I think I probably miss the main idea due to the included edge cases and optimizations.
Thanks!
EDIT: I think I am getting more and more of the code and how you handle the encoding-escaping here: https://github.com/animetosho/node-yencode/blob/master/yencode.cc#L718-L752
I don't completely understand the shuffle operations just yet and how they handle the extra chars. What are shufMixLUT and shufLUT?
SIMD isn't really designed for text processing tasks like yEnc, so an implementation isn't so straightforward, unfortunately. Because of this, a scalar (byte-by-byte) implementation usually isn't so bad, but doing a SIMD implementation is intellectually interesting and works for the epeen factor (as for Nyuu, I intend to integrate PAR2 generation, so reducing CPU usage may help a bit on fast connections). Having said that, a SIMD implementation of yEnc encoding can usually beat the naive implementation by several factors, due to corner cases that an encoder has to deal with.
I've been thinking about writing up how the algorithm works. Unfortunately the code isn't the easiest to read (I'm a little lazy there, admittedly), but here's a rough diagram showing an overview of the main loop:
Note that the byte shuffle instruction requires a CPU with SSSE3 support (which most people should have these days). If the CPU doesn't have SSSE3 support, the algorithm just switches to doing the escaping using standard non-SIMD methods if special characters are found (the 16-bit mask != 0). Despite this, it's still a bit faster than a pure scalar implementation, because most bytes aren't escaped in yEnc and hence end up hitting the fast route of: load 16 bytes, add 42 to all bytes, check for special characters, store to memory.
Line endings are a bit complicated, since they have different rules to the rest of the line (more characters to escape). When the algorithm crosses a line boundary, it needs to 'revert' (back-track) to the last character of the line, process the line boundary, then go back to the main loop. A large portion of the code is there just to deal with this case.
Generating vectors for the shuffle and add operations isn't exactly fast, so we pre-generate these instead and just do a lookup when encoding data. Hope that all makes sense.
As for decoding, I've actually thought about it, and have tried a few things. A naive implementation is surprisingly fast, since decoding is somewhat simpler than encoding, though you could still probably beat it by a few factors with SIMD. Here's my scalar implementation:
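(The actual implementation is linked in the repo; as an illustrative reconstruction only, a scalar yEnc decoder boils down to something like the following, with dot handling and =ybegin/=yend parsing omitted:)

#include <stddef.h>

// Illustrative scalar yEnc decode: strip CR/LF, handle '=' escapes,
// subtract the yEnc offset. Escapes split across buffers are ignored here.
size_t decode_scalar(const unsigned char* src, size_t len, unsigned char* dest) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c == '\r' || c == '\n') continue;   // line breaks carry no data
        if (c == '=') {                         // escape: next byte is offset by 64
            if (++i >= len) break;
            c = (unsigned char)(src[i] - 64);
        }
        dest[out++] = (unsigned char)(c - 42);  // undo the +42 applied on encode
    }
    return out;
}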
I see that in your implementation you handle two dots after a newline, which I don't. yEnc doesn't actually allow that to happen (the first dot must be escaped), but I suppose a valid decoder should still deal with it. A general workflow for a SIMD version may be:
I actually have a somewhat working implementation of the above. I haven't yet thought of a way to deal with such invalid sequences well, but it may be doable. A compromise may be to detect such cases instead (rather than actually handle them) and, if encountered, fall back to scalar processing. As I expect most yEnc implementations to be reasonably valid, it's unlikely the slower scalar fallback will ever be invoked, but it remains there for spec compliance. Hope that's useful. I may eventually add decoding support to this library. Welcome any ideas.
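(To make that workflow concrete, here is a rough SSSE3 sketch of one 16-byte step, using GCC/Clang intrinsics. The single 65536-entry table is a simplification - a real implementation would rather use two small 256-entry tables for the two 8-byte halves - and cross-block escapes, dots and line endings are all ignored:)

#include <tmmintrin.h>   // SSSE3
#include <stdint.h>
#include <stddef.h>

// Hypothetical pre-generated table: for each 16-bit '=' mask, a shuffle
// pattern that squeezes the flagged bytes out.
extern const uint8_t shuf_lut[65536][16];

// Decode one 16-byte block; dest needs 16 bytes of slack.
size_t decode_block(const uint8_t* src, uint8_t* dest) {
    __m128i data = _mm_loadu_si128((const __m128i*)src);

    // 1. locate '=' escape characters
    uint32_t mask = _mm_movemask_epi8(
        _mm_cmpeq_epi8(data, _mm_set1_epi8('=')));
    uint32_t next = (mask << 1) & 0xFFFF;    // bytes right after an '='

    // 2. expand `next` back into a 0xFF/0x00 byte vector
    __m128i bits = _mm_set_epi8(-128,64,32,16,8,4,2,1,
                                -128,64,32,16,8,4,2,1);
    __m128i nv = _mm_shuffle_epi8(_mm_cvtsi32_si128((int)next),
        _mm_set_epi8(1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0));
    nv = _mm_cmpeq_epi8(_mm_and_si128(nv, bits), bits);

    // 3. subtract 42 everywhere, plus another 64 right after each '='
    data = _mm_sub_epi8(data, _mm_add_epi8(_mm_set1_epi8(42),
        _mm_and_si128(nv, _mm_set1_epi8(64))));

    // 4. squeeze out the '=' bytes themselves and store
    data = _mm_shuffle_epi8(data,
        _mm_loadu_si128((const __m128i*)shuf_lut[mask]));
    _mm_storeu_si128((__m128i*)dest, data);
    return 16 - __builtin_popcount(mask);    // bytes actually produced
}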
That's a great overview, thank you! 👍 What would you do with the case where the … The double-dot thing is unrelated to yEnc but inherent to NNTP: the servers add this extra dot when transmitting the message. While I initially ignored it as an edge case, it actually happens quite often (almost once per file). I was experimenting a bit with implementing this yesterday, but I am sort of stuck in the "finally understanding what pointers do and how to use them" stage, so this stuff is a bit above my head.
That's not too hard to deal with. Once you have the bit mask indicating the location of …
That's interesting, because a correctly yEnc encoded message should never have a dot as the first character of a line.
I'm sure you'll get it. It's likely a bit different to what you're used to, but you seem to be a very experienced programmer, so it won't take you long to understand.
No, it's not yEnc et al. It's the NNTP server that has to add a dot to every line that starts with a dot, per NNTP spec, during the transmission; some leftover from ancient times, I think.
But there should never be a line that starts with a dot, no?
Why not? It's not a special character in the already encoded data, just a char from which 42 can be subtracted.
Looks like I made a mistake - I had thought that yEnc requires dots to be escaped if it's the first character in a line, but re-reading the spec, it's actually optional. Sorry about that.
So I've been thinking about this, and I think that dealing with invalid sequences can be solved. But the double-dot issue looks challenging. Looking at RFC3977:
This seems to define how invalid sequences such as … should be treated. I had a quick look at NZBGet's code and it doesn't appear to handle lines starting with a dot. Presumably SABnzbd didn't do this for a long time either? All yEnc decoders I've seen, apart from SABYenc, don't handle this case (which isn't an issue if they go through some NNTP layer, but I have doubts that typically happens). Interestingly, I haven't yet seen a yEnc encoder that doesn't escape dots which start a line, but there are obviously posts out there that have it. I'll probably take the approach of stripping out all instances of … This does also make incremental processing a little more difficult, as it may need to know the preceding two characters to be correct.
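(A scalar version of that stripping, carrying state across chunk boundaries, could look like this - illustrative only:)

#include <stddef.h>

// Strip NNTP dot-stuffing ("\r\n." -> "\r\n") from one chunk; `prev` carries
// the last two bytes across calls so split sequences still get caught.
// Initialize prev to "\r\n" so a dot at the very start of the body works too.
size_t unstuff_dots(const unsigned char* src, size_t len,
                    unsigned char* dest, unsigned char prev[2]) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (!(c == '.' && prev[0] == '\r' && prev[1] == '\n'))
            dest[out++] = c;              // drop only the stuffed dot
        prev[0] = prev[1];
        prev[1] = c;
    }
    return out;
}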
NZBGet does it while receiving the data in the NNTP layer: I just checked some test articles, and it seems that usually less than 1% of all 16-byte blocks contain a dot. EDIT: All usenet downloaders must correct for this; 60% of posts I tried have the …
Not finished, mostly a PoC for now Ref #4
I see, thanks for pointing that out! I think a SIMD version should be fast enough to make threading not that needed. I mean, this yEnc library is single threaded and synchronous, as I feel that it's fast enough to not really warrant the need (I'll probably eventually add async support though). I've committed a very rough implementation. From my quick testing, it seems to work and handles all edge cases I've tried so far. It still needs work though, including more testing, tidying up, optimizations etc. I've just pushed it up in case you're interested. Unfortunately the algorithm is a bit convoluted, but maybe it can be streamlined. The algorithm supports incremental processing by supporting 4 possible previous states (preceding character is a … ). Quick speed test of the current code on a 3.8GHz Haswell CPU, using random input data: … Once this is tidied up, maybe I can write up something to help others port the code across (though it's mostly copying the functions across, a few defines, and the lookup-table generation code).
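(The exact state set got lost above; the committed code has the real one, but the idea is presumably along these lines - names hypothetical:)

// Carry-over state for incremental decoding: what the previous chunk ended on.
enum decoder_state {
    STATE_NONE,   // ended mid-line, nothing pending
    STATE_EQ,     // ended on '=': next byte still needs the extra -64
    STATE_CR,     // ended on '\r': a following '\n' completes a line break
    STATE_CRLF    // ended on "\r\n": next byte may be a stuffed dot
};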
Coool! 💯
Wow, very impressive & cool. I wanted to ask a noob question: is …
So: also on Linux, so generic Intel SIMD/SSE. 👍
The Intel Intrinsics Guide (obviously lacking AMD extensions) is a handy reference. The movemask instruction is one useful instruction SSE has over ARM NEON. I just realized a mistake: I've been treating …
Super interesting topic 👍 I've tried the SSE decoder, but for me it produces incorrect output. Am I doing something wrong or is there a bug in the code? I've made a test program to illustrate the issue:
curl https://gist.githubusercontent.com/hugbug/fd7d95d53e3a2ca4aafb0f811d929bfc/raw/457d22331669372bd6c92bab812730e2d18338be/decoder_test.cpp > decoder_test.cpp
g++ -std=c++11 decoder_test.cpp
./a.out
Testing scalar
Test 1: OK
Test 2: OK
Testing sse
Test 1: OK
Test 2: FAILURE
Source:
0f 1a b1 a4 0c d4 15 2a 61 47 c8 bf bf e4 d3 e9 e2 b2 1f e0 99 1d 79 9a 38 26 c0 8b 3d 40 42 cf c6 f9 85 34 8d 9c f2 55 ce 16 ec 4d 38 29 3d 4d 22 d8 bb cc ce 2a 91 c9 93 87 6f 0f fb 5b 2c d7 90 3c 22 4a ac a7 1a 57 1a bb 6b 64 23 e0 87 8f b2 3d 7d 94 30 c5 eb 2f cb e5 78 35 8e bc d0 0b 57 15 58 69 e3 9d fc f3 da 6b c1 07 3d 4d d2 6a 60 6f 43 a4 3d 4a 81 dc b7 ca 04 8a c1 f6 8d b8
Expected:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 88 f5 b6 6f f3 4f 70 0e fc 96 61 d6 18 a5 9c cf 5b 0a 63 72 c8 2b a4 ec c2 23 0e ff e3 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 13 6a 06 9b c1 05 a1 bb 4e 0b 64 92 a6 e1 2d eb 2e 3f b9 73 d2 c9 b0 41 97 dd e3 a8 40 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e
Result:
e5 f0 87 7a e2 aa eb 00 37 1d 9e 95 95 ba a9 bf b8 b8 b8 b8 b8 b8 b8 b8 0e 0e 0e 0e 0e 0e 0e 9c 9c 9c 9c 9c 9c 9c 9c a4 a4 a4 a4 a4 a4 a4 f8 ae 91 a2 a4 00 67 9f 69 5d 45 e5 d1 31 02 ad 66 12 f8 20 82 7d f0 2d f0 91 41 3a f9 b6 5d 65 88 88 88 88 88 88 88 a1 a1 a1 a1 a1 a1 a1 a1 2d 2d 2d 2d 2d 2d 2d 2d b0 b0 b0 b0 b0 b0 b0 36 45 19 7a e0 57 b2 8d a0 da 60 97 cc 63 8e
NOTE: My system doesn't have … Another issue is that the code fails to compile if …
Thanks for the comment! Yeah, the SSE2 path was broken. I've pushed up the changes that I have been working on, which should have that fixed. As for your test failures, the SSE code requires some lookup tables to be generated before running the main function. In the old code this was done in the node initialization function; in the new code there's a … I've modified your program to use the new code. You'll need to copy common.h, decoder.cc and decoder.h over from the src directory as well. Compiles via …
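(For anyone else hitting this, usage is roughly as follows; the init function's real name is in decoder.h - `decoder_init` is assumed here:)

#include "decoder.h"

// The lookup tables must be generated before the first decode call.
int main() {
    decoder_init();   // assumed name: pre-generates the shuffle lookup tables
    // ... then run the tests / call the decode function as in the gist
    return 0;
}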
Cool, that works for me:
with result:
I've tested the SSE decoder in NZBGet. Conditions: …
Test case 1: decoding off. Article decoding was disabled via compiler define …
Test case 2: classic decoder. That test case shows how fast the program can download using the current yEnc-decoder.
Test case 3: SSE decoder. Now we use the SSE decoder. With SSSE3 but without …
Test case 4: improved classic decoder. During experiments I've realised that I can improve the current (classic) decoder. It checks for …
Test case 5: CRC check off. This test is for reference: both decoding and CRC check disabled.
Test results. The numbers are download speed (MB/s) and download time (s). The download speed is calculated based on download time in seconds, which is an integer value; fractions of seconds unfortunately cannot be measured here.
Update: I've checked CPU capabilities using an Intel tool and it showed that the CPU supports POPCNT. As a side note: now, knowing the CPU also supports PCLMULQDQ, I can try the fast CRC routine, which didn't compile before.
@hugbug Did you use @animetosho's version or a modified one for SSE?
The SSE decoder can be parametrized to work in raw mode (where it processes dots) or in clean mode. I used the latter. It still filters out … I think the SSE decoder can't show its best due to line-by-line processing in NZBGet. With a typical line length of 128 bytes, that's the portion the decoder deals with; the lookup tables are probably never in CPU cache. Although I must say I also tested with a post encoded with 2048 bytes per line: all decoders speed up considerably, but the difference remains similar. I suppose the SSE decoder can perform much better on larger inputs. I guess in SAB you can feed it the whole article body (~500KB) and it should show its strengths. My attempt was to replace the decoder without much rework in NZBGet and I wanted to share the results. Also take into account that I tested on one CPU only; the results may be different on other (newer) CPUs.
A lookup is likely faster everywhere else Ref #4
I'm guessing your x86 CPU there is a Core i5 520M. I can't tell what the ARM CPU is from the details given. The decoder definitely should work better on larger inputs, as the overhead from start/end alignment gets amortized more, so it isn't really designed for ~128 byte inputs. The fact that it can handle newlines as well as dot unstuffing means you don't have to do it elsewhere, which is another speed benefit. I haven't done enough investigation into whether using the … I think I've got in most of the optimizations now, and can't think of many ideas on how to change the algorithm. So some quick speed tests: tested by encoding 750KB of fixed random data into yEnc, then decoding that data and timing the decode. There'll obviously be some variation depending on how much escaping the random data generates. Measured speed is relative to the decoded output produced (not input to the decoder).
Cleaned decoder: Intel Ivy Bridge @3.8GHz (i5 3570S): 2007 - 2118 MB/s
Raw decoder: Intel Ivy Bridge @3.8GHz (i5 3570S): 1943 - 1999 MB/s
I've ported the algorithm to use ARM NEON. I've been wondering whether it'd actually be faster or not, considering limitations of most ARM CPUs, but on my Cortex A53 running in armv7 mode, decoding does seem to be about 2x faster than my scalar code. Encoding actually gets more of a speed boost, most likely because there's less scalar <-> NEON interaction. There's a number of issues to consider with ARM/NEON though:
The code is otherwise basically identical to the SSE code, it just uses NEON instructions instead.
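(One of the mentioned NEON pain points is the missing movemask. SSE's _mm_movemask_epi8 has no single-instruction NEON equivalent; the usual pairwise-add emulation is something like this, which is one reason the NEON port gains less:)

#include <arm_neon.h>
#include <stdint.h>

// Emulate _mm_movemask_epi8: collapse the MSB of each of 16 lanes into bits.
static inline uint16_t neon_movemask(uint8x16_t in) {
    static const uint8_t bits_tbl[16] =
        { 1,2,4,8,16,32,64,128, 1,2,4,8,16,32,64,128 };
    uint8x16_t bits = vandq_u8(in, vld1q_u8(bits_tbl));
    // three rounds of pairwise adds reduce each 8-byte half to one byte
    uint8x8_t sum = vpadd_u8(vget_low_u8(bits), vget_high_u8(bits));
    sum = vpadd_u8(sum, sum);
    sum = vpadd_u8(sum, sum);
    return (uint16_t)(vget_lane_u8(sum, 0) | (vget_lane_u8(sum, 1) << 8));
}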
Thanks a lot! In the meantime I've done many tests for performance optimisations in NZBGet, including the SSE decoder and SSE CRC routine. I'm currently in the process of integrating them for good. Producing binaries that work on all systems was the difficult part; I cannot compile the whole program with … I was just about to ask you regarding ARM support, but you have that already. Just in time! 🥇 I absolutely need to do a runtime feature check on ARM too. I'm planning to achieve this by parsing … My ARM device doesn't have …
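(On Linux there's a cheaper alternative to parsing /proc/cpuinfo, assuming glibc 2.16+:)

#include <sys/auxv.h>
#include <asm/hwcap.h>

// Runtime NEON check via the ELF auxiliary vector.
int have_neon(void) {
#if defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;  // AArch64 always has it
#elif defined(__arm__)
    return (getauxval(AT_HWCAP) & HWCAP_NEON) != 0;
#else
    return 0;
#endif
}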
Please note: ARM64 / Aarch64 always has NEON, but does not mention that in /proc/cpuinfo:
I believe that when compiling, you should NOT mention "neon". I'll check. ... about compiling (#1): I tried #4 (comment) on my ARM64, but that didn't work: see errors below. The GCC on my ARM64 is older (5.4.0 20160609) than on my x86 Ubuntu (6.3.0 20170406) ... is that the cause?
Am I doing something wrong? UPDATE: After upgrading to GCC 6.3.0 20170519, fewer error messages, but still two fatal:
In the test app, change do_decode_sse to do_decode_neon.
The input for SABYenc is chunks of data from the socket, so not lines. Whenever the socket reports data can be read, we read it and put it in a Python list until we reach end-of-article. Then this list of chunks we give to …
I've added support for a memory cache to NServ, which slightly improved overall performance in tests (since NServ runs on the same computer and consumes CPU), but the proportions remained. To estimate the possible performance improvement, I suggest you disable decoding in SABYenc and measure the speed. You can't get more speed with the SSE decoder than without any decoder at all.
Oh I see. If you can simplify the main loop down, removing the end check from it, you could then replace your main loop with one that just loops through chunks and feeds them to the decoder (see the sketch below). I've wondered whether I should add support for detecting terminator sequences (…
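(A sketch of such a chunk-feeding loop; the decoder entry point here is hypothetical and stands in for whatever the library actually exposes:)

#include <vector>
#include <cstddef>

// Hypothetical incremental decoder call; `state` carries split-sequence
// state (trailing '=', "\r\n", ...) between chunks.
size_t do_decode_chunk(const unsigned char* src, size_t len,
                       unsigned char* dest, int* state);

// Feed raw socket chunks straight to the decoder: no line splitting and
// no joining of chunks into one big buffer first.
size_t decode_chunks(const std::vector<std::vector<unsigned char>>& chunks,
                     unsigned char* dest) {
    int state = 0;                 // e.g. "just after a line break" initially
    unsigned char* out = dest;
    for (const auto& c : chunks)
        out += do_decode_chunk(c.data(), c.size(), out, &state);
    return (size_t)(out - dest);
}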
@animetosho If I compile the decoder unit with … The situation is even more problematic in the case of ARM: the CRC unit for ARM must be compiled with … To solve that issue I'm putting the CPU detection code into a separate unit compiled with default settings. I wonder if I should put the SSSE3 decoder into a separate unit to make sure that the SSE2 decoder is compiled with … And by the way, is the SSSE3 decoder faster in your tests? In my limited tests I see no difference between the SSE2 and SSSE3 decoders. Maybe I should keep only the SSE2 version; that would make things easier.
Technically, … It's a pain to do this though. In practice, (S)SSE3 additions are quite purpose specific, and intrinsics are just wrappers around single assembly instructions, so compilers generally follow them. As such, I can't see the compiler using SSSE3 instructions in the SSE2 version, even if you compile with … I'm not sure whether the same can be said for SSE2 over i686, as SSE2 is somewhat more general purpose. I get significantly faster speeds for SSSE3 (like 1380MB/s SSE2 vs 2200MB/s SSSE3 on a modern Intel CPU). It only differs when removing characters though, so in your case, since you'll never get \r or \n characters, the only characters to remove at the … To my understanding, 32-bit ARMv8 is largely the same as ARMv7, but the documentation lists some differences, so it can't really be relied on unfortunately. Though I'm running in armv7 mode, code compiled using armv8-a+crc seems to work... maybe it's because the CPU can understand the ARMv8 stuff. I have to say that compiler tooling/support for dynamic dispatch is rather primitive, and a bit of a pain to deal with.
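(The separate-unit approach then pairs with a tiny runtime chooser, e.g. the following; function names are illustrative, and __builtin_cpu_supports needs GCC 4.8+ or Clang:)

#include <cstddef>

// Each implementation lives in its own translation unit, compiled with
// -mssse3 / -msse2 respectively; only this chooser (built with default
// flags) must be runnable on every CPU.
size_t do_decode_ssse3(const unsigned char* src, size_t len, unsigned char* dest);
size_t do_decode_sse2 (const unsigned char* src, size_t len, unsigned char* dest);
size_t do_decode_scalar(const unsigned char* src, size_t len, unsigned char* dest);

typedef size_t (*decode_fn)(const unsigned char*, size_t, unsigned char*);

decode_fn select_decoder() {
    if (__builtin_cpu_supports("ssse3")) return do_decode_ssse3;
    if (__builtin_cpu_supports("sse2"))  return do_decode_sse2;
    return do_decode_scalar;
}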
Found the relevant GCC doc:
So, yeah, I have to split up SSE2 and SSSE3 into separate units. Not a big deal, I've already put the initialization part into a separate unit. I guess I have to eliminate the extra optimizations with tune- and AVX-ifdefs though. I'm currently in the process of reworking the code which fetches data from the news server and feeds it into the decoder. My plan is to eliminate my own line detection and use the raw decoder instead (on large blocks like 10KB). That saves data copying and allows decoding of data directly from the buffer which receives from the socket. All done in one pass instead of two (line detection, then decoding). There is however one thing that stops me at the moment: the decoder must detect end of yEnc data or end of article. Currently I need to tell the decoder where to stop using … In yEnc, end of data is marked with … Once it has detected end of data (yend or article end), the decoder should stop processing and should report that (for example via a new state value). It would also be helpful if it could report back the end position, since we need to process the yEnc trailing data (parse CRC). It's not necessary for the decoder to deal with …
This doesn't work unfortunately because I can't detect the last chunk at the transmission level. Once the news server has sent the last chunk, my attempts to receive more data result in hanging, as no data is available (and the server doesn't close the connection, waiting for further commands). That's why I need to scan the received data for the end-of-article marker before attempting to receive more data. Do you think you could extend the decoder with end-of-stream detection? That'd be great and highly appreciated. If implemented, we would be able to do all the heavy number crunching using SIMD in one pass. Well, in two passes actually, as we need to calculate the CRC of the decoded data separately. Theoretically decoding and CRC calculation could be done in one pass, but that would be too much effort I guess, especially as the Intel routine isn't designed for such use. So I'm not asking about this, just thinking out loud. Thanks again. This discussion topic is a great pleasure to participate in.
Yeah, detecting terminator sequences is something I've been wondering about. I didn't realize that … So there'll need to be a second version of the function, with a different signature (since input may only be partially consumed) which stops at the end point. The non-raw decoder will only look for …
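(Concretely, such a signature might take a shape like this; everything here is hypothetical, just to illustrate the partial-consumption idea:)

// The end-detecting variant may stop early, so it reports how far it got
// in both buffers via in/out pointer arguments.
typedef enum { DEC_NEED_MORE, DEC_END_YENC, DEC_END_ARTICLE } dec_result;

dec_result do_decode_end(const unsigned char** src, size_t src_len,
                         unsigned char** dest, int* state);
// On return, *src and *dest point just past the consumed/produced bytes, so
// the caller can parse the "=yend" trailer (size, CRC) from *src onward.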
Yes, this only works if the end sequence has been scanned elsewhere. SABYenc seems to take that approach.
I've been thinking about the possibility of such optimizations via function stitching. It's a little awkward because yEnc is variable length, and may require the algorithm to go back to memory, though I think it could still yield a net benefit. It's a little complicated to implement/maintain, and I don't know how much it'd yield over running it separately (perhaps with loop tiling). It's something I may look into eventually nonetheless.
Same here! I'm glad to be a part of it!
Indeed, NNTP mandates …
Regarding better CPU cache usage: if we choose a block size large enough for efficient SIMD processing but small enough to remain in cache between the decoder and CRC passes, we can process the data in two passes without unnecessarily complicating the decoder, right? I mean the caller should execute the CRC function just after decoding, and you don't have to deal with integrating CRC calculation into the decoder. That's actually how it is currently processed in NZBGet (although in very small one-line chunks). The only thing we need is an incremental CRC function (see the sketch below). The ARM function is already incremental; crc_fold is not yet. The Linux kernel has a CRC function which also uses PCLMULQDQ but supports an initial CRC parameter. Have you evaluated it yet? It's pure asm, which scares me; probably not easily portable. From the amount of code it seems to be much smaller than crc_fold, probably not as sophisticated and not as fast.
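(For reference, the interface such an incremental CRC needs is tiny. A minimal bitwise version with the same shape - slice-by-4 or PCLMUL variants just swap out the body; start with crc = 0 and feed each decoded block:)

#include <stdint.h>
#include <stddef.h>

// Incremental CRC-32 (zlib/IEEE polynomial, reflected), deliberately slow.
uint32_t crc32_update(uint32_t crc, const unsigned char* buf, size_t len) {
    crc = ~crc;
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1)));
    }
    return ~crc;
}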
My current plan is to bail out of the SIMD decoder when it sees … Limiting the processing size is useful for cache if multiple passes are necessary. Function stitching is still beneficial for a number of reasons though:
I'm not intending this decoder to be the fastest one can possibly be; I'm mostly looking for easy wins (and stitching CRC into the decoder is a little complex). A modern CPU can already run this at around 2GB/s per core, and PCLMUL CRC32 runs in excess of 10GB/s on one core of a Haswell CPU, which is a fair bit faster than what a 10Gbps connection would give you. I can't see exceeding 10Gbps being an issue for a while, and one can always multi-thread. The CRC folding approach is incremental, if you look at the original code I took it from; the state isn't stored as a CRC32 though, rather it's stored in 512 bits which can be reduced to a 32-bit CRC. I haven't tested the Linux code yet, thanks for linking to that. It looks like it largely uses the same idea. One could port it to C intrinsics if assembly is not desired.
Done! Since the current version of the raw decoder doesn't support end-of-stream detection yet, an extra scan of incoming data before feeding the decoder is implemented to properly process the data, so it's not one pass at the moment. Nonetheless the overall efficiency is greatly improved: 268 MB/s -> 350 MB/s. Detailed benchmark results: …
Please note that these speeds represent overall download speed in NZBGet, not just decoding speed (the program of course has to do way more work in addition to decoding). For test conditions see this topic. I'm still using the non-SIMD CRC32 calculation routine (slice by 4); improvement on that front is the next item on my list.
That's great news. I'll adopt it.
Cool, nice to see my random thought bubble making this issue resulted in something useful. If it's possible for NZBGet, SABnzbd will also be able to take advantage of it 👍 All thanks to @animetosho! @hugbug Did you possibly also have test results on ARM? Would be very interested to see how much NEON adds! I've shifted my focus now to first converting SABnzbd to Python 3, so that it can also use VS2017 for compiling the C extensions (in 2.7 you're locked to VS2008) and use …
Reposting test results from the NZBGet topic here as this is very much related to the discussion. When making tests on two ARM devices I discovered and fixed a performance bottleneck in NServ. The CPU usage of NServ has been greatly reduced, which in turn gives more CPU time to NZBGet and increases speed. All tests were rerun with the improved NServ (including tests on Mac). Test devices: …
Test results. All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. The overhead of CRC calculation shows how much improvement potential is still there - the CRC routine isn't optimised for SIMD yet. Once again a reminder that the speeds below represent overall download speed in NZBGet, not just decoding speed.
Observations: …
Thanks for those benchmarks - very interesting to know! The ARM results are interesting. I don't really know ARM uArchs anywhere near as well as I do x86, but I suspect lower performance on ARM cores which have a separate NEON unit (I believe older designs do this). Interestingly though, from these slides it seems that the Cortex A15 fully integrates NEON into the regular pipeline, so I'm a bit surprised by the lack of speed gain there. Perhaps this could mean that the SIMD decoder is slower than the scalar decoder on something like a Cortex A7. I can't find much information on SIMD width on ARM chips. I believe the Cortex A57 and later use 128-bit units, so I'm guessing the A53 and A15 possibly use 64-bit units. That in itself will reduce the benefit of SIMD (half the throughput).
On the Wikipedia page Comparison of ARMv7-A cores, the Cortex A12, A15 and A17 are listed as having 128-bit NEON. What puzzles me in the benchmark results is that the performance difference between the slowest (new decoder, per-line mode, CRC on) and the fastest (no decoder, raw mode, CRC off) is only 29% (68 vs 88) on ARMv7, whereas on Intel it's 125% (307 vs 693). As if the CPU spends far more time doing other things and therefore doesn't respond so well to optimisations in the decoder. Maybe it's not the CPU itself but other system components such as slow RAM, I don't know.
@hugbug do you run NServ on the ARM devices themselves too? What if you run it on your Mac/Windows and then connect to the ARMv7 via Gigabit?
That's an interesting and useful comparison table. Somewhat interestingly though, I just found this article which seems to imply that the A15 is not 128b:
Similarly, the same site mentions that the A7 and A53 are single-issue 64-bit units. Of course, the site could be wrong, but if it's true, it'd make a little bit more sense. Your PVR benchmarks do still seem a little odd - I'd expect a 1.7GHz 3-wide out-of-order core to easily beat the 1.5GHz 2-wide in-order CPU at almost everything, but it seems to be significantly weaker. ARM claims that the A53 should be about as performant as an A9, but the A15 should be more powerful than an A9. The NEO2 benchmarks make more sense to me, though your point about there being much less gain with everything disabled is certainly still on the mark.
Gigabit maxes out at 125MB/s, which I suspect could be a bit of a bottleneck (perhaps not for the PVR). I'm not exactly knowledgeable about the test server, but if it runs on a different core, hopefully it doesn't have much of an effect. Unless there's a memory bottleneck or similar - might be worth a try nonetheless (you can get an idea of how network transfers affect the device).
I'm testing in NZBGet, which is multithreaded, and for the tests 10 connections were configured, meaning 10 threads all doing similar jobs (on different articles): downloading, decoding, computing CRC, then normally also writing to disk, but that last part was disabled during tests. Therefore 4 cores @ 1.5GHz are better than 2 cores @ 1.7GHz.
Tried that too and got similar results. Now the process …
When redoing tests on the PVR I've noticed that I'm now getting lower CPU usage in NServ and better speeds in the benchmark (97 MB/s in …). I'm redoing the tests on the PVR and will post updated results.
Results for ARMv7 (NServ) updated directly in the benchmark post. ARMv7 and ARMv8 can now be better compared because the same nzbget binaries and the same NServ were used in all tests.
If you are not tired of benchmarks, here are numbers for SIMD CRC, which I've just integrated (reposting from the NZBGet issue tracker). SIMD CRC: integrated SIMD CRC routines for Intel and ARMv8 into NZBGet.
All numbers are in MB/s. For each decoder two test cases were measured - with and without CRC calculation; the latter is shown in parentheses. All tests were performed 4 times, the worst result was discarded and the average of the remaining three results was taken. For convenience the table also includes all previous measurements with the scalar CRC routine.
Conclusion: …
Thanks again for the benchmarks - they're quite interesting.
Oops, forgot about that, thanks for the correction! I've gotten around to implementing the end-scanning version of the decoder. Have been a little busy with other stuff, so it took longer than expected. Function signatures have changed to accommodate signaling how much input/output is consumed. I've also moved code around … Turns out that … Searching for the end sequence is sometimes noticeably slower than not doing it, but hopefully faster than a second scan of the data.
Thanks so much! I'll integrate the new version and report back. In the meantime I've done more tests, in particular on the Dell 2015 notebook when running Linux. The numbers are crazy high (MB/s):
For a description of the devices, test conditions and more results (not related to SIMD) please see the original post.
Results for one-pass simd decoder with end-of-stream detection (simd-end):
Cool to see that within 1 month of creating this issue it now has a working implementation in NZBGet. So I would say this issue has served its purpose, and in case I have specific implementation questions for SABnzbd I will open another topic!
It has been interesting - thanks for creating the topic! Are you planning to migrate to Python 3 before using this decoder? I imagine that SABYenc could be changed to use it as-is, but I'd imagine that Python 3's API would be different - if that's the goal.