LibraryScanner: Improve hashing of directory contents #2497
Conversation
macOS SCons build failures are unrelated. We need to get rid of SCons asap!
#2498 should fix the SCons/OSX errors btw.
Code looks good, tests pass, library rescan didn't cause any issues. LGTM.
@@ -54,7 +56,7 @@ void RecursiveScanDirectoryTask::run() {
     if (currentFileInfo.isFile()) {
         const QString& fileName = currentFileInfo.fileName();
         if (supportedExtensionsRegex.indexIn(fileName) != -1) {
-            newHashStr.append(currentFile);
+            hasher.addData(currentFile.toUtf8());
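For orientation, here is a minimal sketch of the overall approach this diff is part of: feed the UTF-8 bytes of every supported file name into one SHA-256 hasher. The function name `hashDirectoryContents` and the iteration details are illustrative assumptions, not the actual Mixxx code:

```cpp
#include <QCryptographicHash>
#include <QDirIterator>
#include <QRegExp>

// Illustrative sketch (not the actual Mixxx code): hash the names of all
// supported files in a directory into a single SHA-256 digest.
QByteArray hashDirectoryContents(
        const QString& dirPath,
        const QRegExp& supportedExtensionsRegex) {
    QCryptographicHash hasher(QCryptographicHash::Sha256);
    // Note: for a reproducible hash the entries must be fed in a stable
    // order; QDirIterator's order is filesystem-dependent.
    QDirIterator it(dirPath, QDir::Files | QDir::NoDotAndDotDot);
    while (it.hasNext()) {
        const QString filePath = it.next();
        if (supportedExtensionsRegex.indexIn(it.fileName()) != -1) {
            // UTF-8 is platform- and endianness-independent, so the
            // resulting hash is portable across architectures.
            hasher.addData(filePath.toUtf8());
        }
    }
    return hasher.result(); // 32 bytes
}
```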
Did you do a performance check for toUtf8() compared to using the raw QString bytes?
I can imagine that the latter is faster.
The raw bytes of a QString are not platform-independent when considering endianness, but UTF-8 is. Does this matter?
There is no function to obtain the raw bytes of a QString. Using toLatin1() may lose characters, and toLocal8Bit() is platform-dependent while the database is portable. Please clarify which option you have in mind.
It was just the idea to check whether something like
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()), currentFile.size() * sizeof(QChar))
is faster or not.
This way there is twice as much data to hash, but no UTF-8 conversion.
I don't care about endianness, because this would only apply if the DB is on an external HD accessed by the same OS on different CPU architectures.
Thinking about it, an endianness-independent solution is not bad.
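For clarity, the two variants being weighed here, side by side; the wrapper function names are illustrative only:

```cpp
#include <QCryptographicHash>
#include <QString>

// Variant 1: convert to UTF-8 first.
void addPathUtf8(QCryptographicHash* hasher, const QString& path) {
    // Allocates a temporary QByteArray, but the bytes are portable
    // across platforms and CPU architectures.
    hasher->addData(path.toUtf8());
}

// Variant 2: hash the raw UTF-16 representation.
void addPathUtf16(QCryptographicHash* hasher, const QString& path) {
    // No conversion and no allocation, but the byte layout depends on
    // the host's endianness, so the resulting hash is not portable.
    hasher->addData(reinterpret_cast<const char*>(path.utf16()),
            path.size() * static_cast<int>(sizeof(ushort)));
}
```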
I don't think that the additional allocations really matter here, and I won't invest hours to figure out how to do a realistic performance test to prove this. The optimized solution would also produce non-portable data, in contrast to all other data in the database.
It was just the idea to check whether something like
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()), currentFile.size() * sizeof(QChar))
is faster or not.
This way there is twice as much data to hash, but no UTF-8 conversion.
Double the data should not be an issue. SHA-256 runtime does not increase 1:1 with the amount of input data. I'm on my phone, but you should see the same effect on desktop PCs:
$ openssl speed sha256
The 'numbers' are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
sha256 180714.92k 532643.80k 1088833.54k 1466255.70k 1633716.91k 1647209.13k
The actual number of bytes that we feed into QCryptographicHash should not be an issue. But QString might need to allocate a temporary QByteArray for the UTF-8 representation, and the conversion itself also takes some time.
Yes, utf16() would be faster, but it is non-portable. If someone is able to prove that it is much faster and worth the drawback, please go ahead ;) I will not do it.
Just out of curiosity, I did the test. Here are the results:
10 x 30 characters
6040 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
8683 nsec hasher.addData(lorem.toUtf8());
12836 nsec hasher.addData(lorem.toUtf8()); "ööööö.."
10 x 888 characters
149882 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
82302 nsec hasher.addData(lorem.toUtf8());
257479 nsec hasher.addData(lorem.toUtf8()); "ööööö.."
Conclusion:
As expected, there is a performance gain only for ASCII characters. Except in non-Latin character sets, two-byte UTF-8 characters are a minority.
For a typical file name of 30 characters, the unicode() version is way faster. If we append all strings first, the UTF-8 version catches up at some point.
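For reference, a hypothetical reconstruction of such a micro-benchmark using QElapsedTimer; the sample string, its length, and the repetition count are assumptions, so absolute numbers will differ:

```cpp
#include <QCryptographicHash>
#include <QDebug>
#include <QElapsedTimer>
#include <QString>

// Hypothetical reconstruction of the micro-benchmark quoted above;
// the ~30-character sample string and 10 repetitions are assumptions.
int main() {
    const QString lorem = QStringLiteral("Lorem ipsum dolor sit amet qq.");
    const int repetitions = 10;

    QElapsedTimer timer;

    timer.start();
    {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        for (int i = 0; i < repetitions; ++i) {
            // Hash the raw UTF-16 bytes, no conversion.
            hasher.addData(reinterpret_cast<const char*>(lorem.unicode()),
                    lorem.size() * static_cast<int>(sizeof(QChar)));
        }
        hasher.result();
    }
    qDebug() << "unicode() raw bytes:" << timer.nsecsElapsed() << "nsec";

    timer.start();
    {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        for (int i = 0; i < repetitions; ++i) {
            // Convert to UTF-8 first (allocates a temporary QByteArray).
            hasher.addData(lorem.toUtf8());
        }
        hasher.result();
    }
    qDebug() << "toUtf8():" << timer.nsecsElapsed() << "nsec";

    return 0;
}
```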
Those numbers also need to be put in relation to the file system access. I guess that the differences in the overall performance are negligible. That's what I had in mind with "realistic performance test".
Yes, of course. I was just interested in whether we have a low-hanging fruit here.
LGTM, thank you.
Use truncated SHA256 bytes instead of 32-bit integers from qHash().
Remark: The new cache_key_t typedef (a primitive type) and the utility functions could be reused for caching artwork images. The aoide branch already contains the code for hashing images. Unfortunately this migration will require discarding and recalculating the hashes of all artwork images. But we need to do it at some point if we want to get rid of the crippled 16-bit integer hashes that cause many hash collisions and wrong artwork display.
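A sketch of how such a primitive cache key might be derived from the digest; cache_key_t is named in the remark above, but this exact extraction (leading 8 bytes, fixed big-endian byte order) is an assumption:

```cpp
#include <QByteArray>
#include <QCryptographicHash>
#include <QtEndian>

// Sketch: derive a fixed-size primitive cache key by truncating a
// SHA-256 digest to its first 8 bytes. The typedef name cache_key_t
// comes from the remark above; the extraction itself is illustrative.
typedef quint64 cache_key_t;

cache_key_t cacheKeyFromMessageDigest(const QByteArray& digest) {
    Q_ASSERT(digest.size() >= static_cast<int>(sizeof(cache_key_t)));
    // Read the leading bytes in a fixed (big-endian) byte order so the
    // key is identical on all platforms regardless of host endianness.
    return qFromBigEndian<quint64>(digest.constData());
}
```

A 64-bit key collides far less often than qHash()'s 32-bit output while remaining a cheap primitive type for database columns.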