AsyncIO compression part 2 - added async read and asyncio to compression code #3022
Conversation
- Extracted asyncio code to fileio_asyncio.c/.h
- Moved type definitions to fileio_types.h
- Moved common macro definitions needed by both fileio.c and fileio_asyncio.c to fileio_common.h
Force-pushed from 61adb40 to f231a36
- Added asyncio functionality for compression flow
- Added ReadPool for async reads, embedded to both comp and decomp flows
Force-pushed from 8e52ecd to 7366119
Maybe rebase on top of …
…ter_refactor

* origin/dev:
  AsyncIO compression part 1 - refactor of existing asyncio code (facebook#3021)
  cleanup double word in comment.
  slightly shortened status and summary lines in very verbose mode
  Change zstdless behavior to align with zless (facebook#2909)
  slightly shortened compression status update line
  More descriptive exclusion error; updated docs and copyright
  Typo (and missing commit)
  Suggestion from code review
  Python style change
  Fixed bugs found in other projects
  Updated README
  Test and tidy
  Feature parity with original shell script; needs further testing
  Work-in-progress; annotated types, added docs, parsed and resolved excluded files
  Using faster Python script to amalgamate
Merged, changes now belong only to part 2.
programs/fileio_asyncio.c (Outdated)
```c
} else
EXM_THROW(37, "Unexpected short read");
```
nit: Please enclose the `else` in brackets & indent.
Fixed
programs/fileio_asyncio.c (Outdated)
```c
if(job->usedBufferSize > srcBufferRemainingSpace) {
    memmove(ctx->srcBufferBase, ctx->srcBuffer, ctx->srcBufferLoaded);
    ctx->srcBuffer = ctx->srcBufferBase;
}
memcpy(ctx->srcBuffer + ctx->srcBufferLoaded, job->buffer, job->usedBufferSize);
```
Is there something you can do to avoid these memcpy()?
Do they actually happen in the normal case? E.g. if zstd is consuming the entire buffer do we ever hit this case?
If they don't happen for zstd, and only happen for other formats, then it probably isn't a big deal.
Two possible ways to avoid memcpy (at the cost of code complexity) are:

- Continue holding the `IOJob_t` and referencing the buffer it owns. This doesn't remove the need for memcpy completely, as there are cases where the code would ask for more bytes than are currently available, so we would need to coalesce two buffers. For that we could have another buffer allocated that is used only when strictly needed.
- Implement a ring buffer and have each `IOJob_t` reference a portion of this ring buffer.

The memcpy/memmove bothered me, but I believe these two solutions increase the complexity of the code without actually improving performance. memcpy should be cheap compared to our other actions (compression/decompression and IO). Am I wrong to assume memcpy is relatively cheap?
> Am I wrong to assume memcpy is relatively cheap?
Relatively cheap, yes. Negligible, no.
Decompression runs at > 1GB/s. So adding an extra memcpy can be a 10% slowdown.
Compression can run at a few hundred MB/s, or GBs/s with multithreading. So an extra memcpy can be a 5% slowdown.
Definitely worth benchmarking.
Could you have jobs+1 buffers, and reserve 1 buffer for reading? When you're finished reading, swap that buffer with the next job's buffer?
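A sketch of what that swap could look like, assuming a hypothetical pool layout (these types and fields are invented, not taken from the patch):

```c
/* Illustrative sketch only: these types and fields are assumed. */
typedef struct { void* buffer; size_t usedBufferSize; } IOJob_t;
typedef struct { void* readBuffer; } ReadPool_t;  /* the extra (jobs+1) buffer */

/* After a read completes into pool->readBuffer, hand the filled buffer to the
 * next job and take that job's idle buffer for the following read.
 * Only pointers change owners; no bytes are copied. */
static void swapReadBuffer(ReadPool_t* pool, IOJob_t* nextJob, size_t bytesRead)
{
    void* const filled = pool->readBuffer;
    pool->readBuffer = nextJob->buffer;   /* reuse the job's buffer for the next read */
    nextJob->buffer = filled;             /* job now references the freshly read data */
    nextJob->usedBufferSize = bytesRead;
}
```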
> Decompression runs at > 1GB/s. So adding an extra memcpy can be a 10% slowdown.
The memcpy can happen only for input, not for output, which should have a lower rate.
Other than that, up to now each such memcpy/memmove was instead an `fread`, which is a memcpy in the best case but might also involve a syscall (depending on buffering). While the memcpy can have a non-negligible cost, it is definitely cheaper than the previous implementation.
> Could you have jobs+1 buffers, and reserve 1 buffer for reading? When you're finished reading, swap that buffer with the next job's buffer?
As mentioned on our second thread, this is problematic: there are cases where the user consumes some bytes, leaving us with fewer bytes available than required by the next read. In such cases we'd need to take some bytes from the current buffer and join them with bytes from the next buffer. This is only possible using memcpy, or if the buffers are all part of one contiguous ring buffer (which also has edge cases toward the end of the buffer, unless we use some mmap tricks).
In any case, I will give it a try, and we can judge it afterwards.
> In such cases we'd need to take some bytes from the current buffer and join them with bytes from the next buffer
Not necessarily. You could loosen the contract that you always return the requested # of bytes. All compressors should have the ability to consume the input, even if it is smaller than the expected size.
So in the case where you have leftover bytes, instead of joining with the next job, just return a shorter read. Then, once the tail of the current buffer is consumed, grab the next job's buffer.
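Under that loosened contract, the read-pool fill path could reduce to something like the sketch below; the context layout and the helper are assumptions, not the patch's code:

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Illustrative sketch only: fields and helper are assumed. */
typedef struct { size_t srcBufferLoaded; int reachedEof; } ReadPoolCtx_t;
static void takeNextJobBuffer(ReadPoolCtx_t* ctx);  /* hypothetical: swap in the next completed job */

/* Return up to `n` bytes, possibly fewer: never coalesce across job buffers. */
static size_t fillSrcBuffer(ReadPoolCtx_t* ctx, size_t n)
{
    if (ctx->srcBufferLoaded == 0 && !ctx->reachedEof)
        takeNextJobBuffer(ctx);           /* tail consumed: grab the next buffer */
    return MIN(n, ctx->srcBufferLoaded);  /* short read at a job boundary */
}
```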
Wouldn't that increase chances of memcpy to internal decompressor buffers though?
> Relatively cheap, yes. Negligible, no.
You are right there. Perf shows 1-3.5% of time spent on memcpy.
Also interesting is that `ZSTD_(de)compressStream` spends even more time on memcpy. I wonder if this is related to the code getting less data than expected.
Force-pushed from 76f5cd2 to 4cac292
Added benchmarking information and removed most of the memcpy calls.
programs/fileio.c (Outdated)
```diff
@@ -1130,13 +1143,15 @@ FIO_compressLz4Frame(cRess_t* ress,
     LZ4F_preferences_t prefs;
     LZ4F_compressionContext_t ctx;

+    IOJob_t *writeJob =AIO_WritePool_acquireJob(ress->writeCtx);
```
minor (coding style): if the `writeJob` pointer never changes after initialization, you should `const` it: `IOJob_t* const writeJob = ...`.
It sends a signal to the reviewer (or the next contributor) that this pointer will never change its value later on. When applied consistently, its absence becomes a signal that this pointer will change in the future, and should therefore receive extra attention.
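As a minimal illustration of the convention, using the names from this diff:

```c
IOJob_t* const writeJob = AIO_WritePool_acquireJob(ress->writeCtx); /* never reassigned */
IOJob_t* job = AIO_WritePool_acquireJob(ress->writeCtx);            /* may be reassigned later */
```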
It actually does change later on, with `AIO_WritePool_enqueueAndReacquireWriteJob(&writeJob);`
programs/fileio_asyncio.c (Outdated)
```c
/* AIO_ReadPool_executeReadJob:
 * Executes a read job synchronously. Can be used as a function for a thread pool. */
static void AIO_ReadPool_executeReadJob(void* opaque){
    IOJob_t* job = (IOJob_t*) opaque;
```
minor: it looks to me that this pointer never changes, so it should be `IOJob_t* const`.
Added `const`s throughout the new patch; I believe I found all occurrences it could apply to.
…ter_refactor

* origin/dev:
  fixed minor compression difference in btlazy2
  updated all names to offBase convention
  change the offset|repcode sumtype format to match offBase
tests/playTests.sh (Outdated)
```diff
@@ -186,7 +186,7 @@ fi

 println "\n===> simple tests "

-datagen > tmp
+datagen -g500K > tmp
```
any specific reason for this change?
Yes.
One of the following tests checks that we fail with "-M2K". This test started failing (i.e. decompression succeeded) when I introduced the changes in this branch.
The reason is that I increased the CLI's buffer sizes (and changed the buffer usage pattern), resulting in fewer mallocs during decompression. Increasing the data size forces the lib to allocate space for the window. (Note: this is as far as I currently understand the code; I might have missed something in this explanation.)
OK, so maybe it is this `-M2K` test (or series of tests) that needs to generate and use its own specific source, rather than impacting a generic sample used by many other tests.
Done
This is a good rewrite.
Only some very minor comments on coding style.
Force-pushed from bed2a44 to 0f5f6ed
programs/fileio.c (Outdated)
```c
if (ret != Z_OK)
EXM_THROW(72, "zstd: %s: deflate error %d \n", srcFileName, ret);
{ size_t const cSize = ress->dstBufferSize - strm.avail_out;
EXM_THROW(72, "zstd: %s: deflate error %d \n", srcFileName, ret);
```
nit: This needs indentation
Fixed.
programs/fileio.c (Outdated)
```diff
@@ -1146,27 +1161,27 @@ FIO_compressLz4Frame(cRess_t* ress,
 #if LZ4_VERSION_NUMBER >= 10600
     prefs.frameInfo.contentSize = (srcFileSize==UTIL_FILESIZE_UNKNOWN) ? 0 : srcFileSize;
 #endif
-    assert(LZ4F_compressBound(blockSize, &prefs) <= ress->dstBufferSize);
+    assert(LZ4F_compressBound(blockSize, &prefs) <= writeJob->bufferSize);

     {
         size_t readSize;
```
nit: Please rename this variable, it is super close to the parameter `U64* readsize`.
Good point!
Just removed it as it was actually useless here.
```c
if (ferror(srcFile)) {
    EXM_THROW(26, "Read error : I/O error");
}
```
Based on my understanding of the code, we should be checking `ferror()` whenever we call `fread()` in the current implementation, no matter if we use `--async-io` or `--no-async-io`. Is that correct?
I just want to verify, because this check is absolutely critical. And by moving it to the `fread()`, we now get coverage for other formats.
> Based on my understanding of the code, we should be checking ferror() whenever we call fread() in the current implementation. No matter if we use --async-io or --no-async-io. Is that correct?

Yes, the flows are almost the same for both cases. The difference is that the async code queues the execution function to the worker thread, while the sync version calls the execution function directly (see `AIO_ReadPool_executeReadJob` and `AIO_IOPool_enqueueJob`).
```diff
-assert(feof(finput));

 AIO_fwriteSparseEnd(prefs, foutput, storedSkips);
+assert(ress->readCtx->reachedEof);
```
This can't happen, right? Assuming the code is correct.
Actually, the code wasn't correct, but it didn't happen because 64K is less than our in size (128K). I fixed it.
As it is now, it shouldn't trigger and might be redundant (I just replaced the previous assert because it was easy).
We should only exit the loop if `ress->readCtx->srcBufferLoaded == 0`. Since we fill the buffer right before this is checked, that can only happen if we reached EOF or encountered a read error. Read errors, in turn, will panic before we get to this point.
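The reasoning above corresponds to a loop shaped roughly like this; it is a sketch of the shape only, with assumed names, not the patch's exact code:

```c
/* Fill before testing, so an empty buffer can only mean EOF:
 * read errors abort inside the fill via EXM_THROW. */
for (;;) {
    AIO_ReadPool_fillBuffer(readCtx, inBuffSize);
    if (readCtx->srcBufferLoaded == 0)
        break;                    /* nothing left: must be EOF */
    consumeInput(readCtx);        /* hypothetical compress/decompress step */
}
assert(readCtx->reachedEof);      /* hence this assert cannot fire */
```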
programs/fileio_asyncio.c (Outdated)
```diff
-job = (IOJob_t*) malloc(sizeof(IOJob_t));
-buffer = malloc(bufferSize);
+IOJob_t* const job = (IOJob_t*) malloc(sizeof(IOJob_t));
+void* const buffer = malloc(bufferSize);
 if(!job || !buffer)
     EXM_THROW(101, "Allocation error : not enough memory");
```
nit: Indentation
Fixed.
I think these annoying indentation issues happened when I split my original branch into two: while copy-pasting some code, my editor decided that no indentation is the correct indentation. :(
programs/fileio_asyncio.c (Outdated)
```c
if (ctx->srcBufferLoaded >= n)
    return 0;

/* We still have bytes loaded, but enough to satisfy caller. We need to get the next job
```
nit: "but not enough"
Fixed
Bytef and uInt are zlib types, not available when zlib is disabled

Fixes: 1598e6c ("Async write for decompression")
Fixes: cc0657f ("AsyncIO compression part 2 - added async read and asyncio to compression code (facebook#3022)")
PR 2/2 for asyncio compression.
This PR builds on the previous PR and adds:

- ReadPool - a thread-pool-based reader for async input.

Benchmarking from a single desktop compares this branch and dev. "No cache" means I used `dd` to remove the file from fs caching before each run. `multiple.zst` is a 111MB file with multiple frames in it.
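For orientation, the intended consumption pattern of the new ReadPool looks roughly like the sketch below; the function names and signatures are assumptions based on the discussion above, not quoted from the patch:

```c
/* Illustrative sketch only: names and signatures are assumed. */
ReadPoolCtx_t* const readCtx = AIO_ReadPool_create(prefs, bufferSize);
AIO_ReadPool_setFile(readCtx, srcFile);               /* begin queuing async reads */

while (!readCtx->reachedEof || readCtx->srcBufferLoaded > 0) {
    AIO_ReadPool_fillBuffer(readCtx, inBuffSize);     /* top up from completed jobs */
    {   size_t const consumed = compressStep(readCtx->srcBuffer,
                                             readCtx->srcBufferLoaded); /* hypothetical */
        AIO_ReadPool_consumeBytes(readCtx, consumed); /* release consumed bytes */
    }
}
AIO_ReadPool_free(readCtx);
```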