
Optimize processing small files #1958

Closed
lfcnassif opened this issue Oct 28, 2023 · 3 comments · Fixed by #1957

lfcnassif commented Oct 28, 2023

When indexTempOnSSD = true, TempFileTask creates temp files for most files smaller than 1 GB, except for subitems (already stored in the case data storage) and carved files whose parent already has a temp file. This avoids decompressing the same file multiple times from E01/Ex01 evidence and also caches data from other image formats located on network shares.

For small files, we can cache the content in memory, avoiding unneeded writes to and reads from the temp directory for items that can be processed without a temp file.

What file size limit would be reasonable to keep in memory while an item is being processed (after it is taken from the queue)? We already use a buffer of up to 8 MB in the Item.getBufferedInputStream() method, which would use up to 400 MB of memory on a 50-thread machine.
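A minimal sketch of the idea, just for illustration (class, method, and threshold names here are hypothetical, not the actual TempFileTask/Item API): keep the bytes of small items on the heap and only stream from disk above a size threshold.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper sketching the in-memory cache for small items.
public class SmallFileCache {

    // Hypothetical threshold; the issue mentions an existing 8 MB buffer limit.
    private static final long MAX_IN_MEMORY_SIZE = 8L * 1024 * 1024;

    private byte[] cachedBytes; // populated only for small items

    public InputStream openStream(Path source, long knownLength) throws IOException {
        if (knownLength >= 0 && knownLength <= MAX_IN_MEMORY_SIZE) {
            if (cachedBytes == null) {
                // Read once; later consumers reuse the heap copy,
                // avoiding repeated temp-file writes and reads.
                cachedBytes = Files.readAllBytes(source);
            }
            return new ByteArrayInputStream(cachedBytes);
        }
        // Large item: fall back to streaming from the original or temp file.
        return Files.newInputStream(source);
    }
}
```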

lfcnassif self-assigned this Oct 28, 2023
lfcnassif (Member Author) commented:

Caching small subitems in memory would also avoid decompressing them multiple times from the internal case storage, where they are kept compressed. In the past I also tested creating uncompressed temp files for them, but I couldn't conclude whether it improved processing speed, since creating temp files has a cost of its own. Keeping them in memory, however, should speed things up a bit.
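For illustration, a hedged sketch of the subitem case, assuming a deflate-style compression for the case storage (the real storage format and API may differ): inflate the blob once, keep the bytes on the heap, and serve later reads from memory instead of decompressing again.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: decompress a subitem's blob from the case storage once
// and reuse the bytes for every task that reads the item afterwards.
public class SubitemMemoryCache {

    private byte[] uncompressed;

    public synchronized InputStream open(InputStream compressedFromStorage) throws IOException {
        if (uncompressed == null) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = new InflaterInputStream(compressedFromStorage)) {
                in.transferTo(out); // inflate only once
            }
            uncompressed = out.toByteArray();
        }
        return new ByteArrayInputStream(uncompressed);
    }
}
```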

lfcnassif (Member Author) commented:

So far, I haven't seen clear differences with this approach, tested with a huge UFDR, one small E01, and one medium-size E01. I'll repeat the tests using a non-SSD temp disk and maybe more evidence files...

lfcnassif added a commit that referenced this issue Oct 31, 2023

lfcnassif commented Oct 31, 2023

Conclusions after many tests on a few evidence files (3 E01s and 2 UFDRs):

  • For E01 processing using a non-SSD disk for temp, I got up to a 33% speed-up using the default profile;
  • For UFDR processing using a non-SSD disk for temp, I got up to a 50% speed-up (for a UFDR with a small WhatsApp database; expanding WhatsApp messages is the bottleneck for the other UFDRs). I think that's because UFDRs are decompressed with a Java library while E01s use the faster native zlib library;
  • For E01 processing using a common SSD disk for temp, I got a minor speed-up, up to 10%, when there was any at all;
  • For UFDR processing using a common SSD disk for temp, I got up to a 12% speed-up, when there was any at all;
  • For E01 and UFDR processing using an NVMe disk for temp, I got no noticeable/conclusive speed-up.

A few thoughts:

  • If users forget to set indexTempOnSSD = true, temp files for compressed files won't be created, and this change should help a lot;
  • If users set indexTempOnSSD = true by mistake, keeping small files on the heap and writing less to the temp disk is better; I tested this with one UFDR and processing was 13% faster with the memory cache;
  • If users forget to exclude the temp disk from antivirus scanning, I think writing fewer files to temp should be much faster (not tested);
  • Using a memory cache for small files means less writing to the temp SSD, which may extend its lifespan a bit.

So, I'll merge the proposed change, put together with #1224.

Currently the memory buffer limit is 8 MB; we may decrease it if someone thinks it is too large, so please let me know.
