Change Sha1HashKey to CityHash128Key for generating PreparedStatement handle and metadata cache keys #717
Codecov Report
@@             Coverage Diff              @@
##                dev     #717      +/-   ##
============================================
+ Coverage       48.52%   48.61%   +0.09%
- Complexity       2747     2775      +28
============================================
  Files             115      116       +1
  Lines           27156    27342     +186
  Branches         4547     4562      +15
============================================
+ Hits            13177    13293     +116
- Misses          11809    11879      +70
  Partials         2170     2170
Continue to review full report at Codecov.
@@ -1096,3 +1096,364 @@ else if (databaseName.length() > 0)
        return fullName.toString();
    }
}
Maybe a separate file for this?
- Sha1HashKey(String s) {
-     bytes = getSha1Digest().digest(s.getBytes());
+ CityHash128Key(String s) {
+     segments = CityHash.cityHash128(s.getBytes(), 0, s.length());
Performance of this can be improved, especially in the case of large SQL strings. s.getBytes() ends up calling StringCoding.encode(...), which converts the internal String chars to bytes using the platform default encoding.
I recommend using s.getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin), which, while deprecated (but will never be removed), has substantially higher performance.
CityHash128Key(String s) {
    byte[] bytes = new byte[s.length()];
    // Deprecated overload: copies the low 8 bits of each char, with no charset encoding.
    s.getBytes(0, s.length(), bytes, 0);
    segments = CityHash.cityHash128(bytes, 0, s.length());
}
s.getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin) discards the upper 8 bits of each char in Java's internal UTF-16 representation. Since this is SQL before inserting parameter values, it should be mostly 7-bit ASCII -- the only exception I can think of would be database names/table names/column names, at least some of which can be full Unicode. For European or other alphabetic languages, their blocks in the UTF-16 code space tend to be 256 characters wide, so the odds of this truncation causing a spurious collision seem very small. I'd be less confident for a language such as Chinese: you could have two SQL strings that differ only by one ideogram in a table/column name, where the two ideograms happen to share the same lower 8 bits. It's not very likely, but it's far more likely than a 128-bit hash collision.
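To make the concern concrete, here is a minimal sketch (the class and method names are made up for illustration, they are not part of the driver). It shows two different ideographs whose UTF-16 code units share the same low byte, so the deprecated overload maps both SQL strings to identical byte arrays.

```java
import java.util.Arrays;

final class LowByteTruncationDemo {
    // Copies only the low 8 bits of each char, as the deprecated overload does.
    @SuppressWarnings("deprecation")
    static byte[] lowBytes(String s) {
        byte[] bytes = new byte[s.length()];
        s.getBytes(0, s.length(), bytes, 0);
        return bytes;
    }

    // True when two different strings become indistinguishable after truncation.
    static boolean collide(String a, String b) {
        return !a.equals(b) && Arrays.equals(lowBytes(a), lowBytes(b));
    }
}
```

With this helper, `collide("SELECT * FROM \u4E2D", "SELECT * FROM \u4F2D")` is true, because U+4E2D and U+4F2D both end in 0x2D.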
So yes, we need to avoid character reencoding, but I think we also need to avoid throwing away hash entropy while we do so. I was wondering about something like getBytes(StandardCharsets.UTF_16BE) -- if it matches the internal representation used by String, will it skip reencoding? So maybe something like:
byte[] bytes = s.getBytes(StandardCharsets.UTF_16BE); // avoid character reencoding
segments = CityHash.cityHash128(bytes, 0, bytes.length); // bytes.length == 2 * s.length(); UTF_16BE writes no byte order mark
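On the byte order mark question, a small sketch (hypothetical class name) comparing the two standard charsets: in the JDK, `UTF_16BE` emits exactly two bytes per char, while plain `UTF_16` prepends a 2-byte big-endian byte order mark when encoding a non-empty string.

```java
import java.nio.charset.StandardCharsets;

final class Utf16LengthDemo {
    // UTF-16BE: two bytes per char, no byte order mark.
    static int utf16beLength(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-16: the same payload preceded by a 2-byte big-endian byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16).length;
    }
}
```

This only settles the array length; it says nothing about whether the encoder skips the reencoding work.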
@RDearnaley Looking at the code of StringCoding.encode() and the various encoder subclasses, I don't find any optimization for UTF_16BE. 😞
@RDearnaley It appears the only option would be a custom Charset and CharsetEncoder. It is possible, as there is a CharsetProvider SPI available. The "raw encoder" would simply "encode" chars as pairs of bytes (upper 8 bits, lower 8 bits). The cost is a 2x sized byte array due to the naive approach, but possibly worth it as the byte array is quickly garbage collected, or could potentially be cached as a WeakReference in a ThreadLocal inside of the CityHash128Key class.
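A rough sketch of what such a "raw" charset could look like, using only the public java.nio.charset API. The class name `RawCharset` and the charset name `X-RAW-CHAR16` are invented for this example, the decoder is deliberately left unimplemented, and whether String.getBytes avoids an internal copy with a user-defined charset depends on the JDK version and is not verified here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical charset: every char becomes two bytes (high, then low), with no validation.
final class RawCharset extends Charset {
    static final RawCharset INSTANCE = new RawCharset();

    private RawCharset() {
        super("X-RAW-CHAR16", new String[0]); // made-up name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        return cs == this;
    }

    @Override
    public CharsetDecoder newDecoder() {
        throw new UnsupportedOperationException("encode-only sketch");
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 2.0f, 2.0f) {
            @Override
            public boolean isLegalReplacement(byte[] repl) {
                return true; // bypass the default check, which would need a decoder
            }

            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (out.remaining() < 2) {
                        return CoderResult.OVERFLOW; // caller must supply more space
                    }
                    char c = in.get();
                    out.put((byte) (c >>> 8)).put((byte) c);
                }
                return CoderResult.UNDERFLOW; // all input consumed
            }
        };
    }
}
```

A caller would then write `s.getBytes(RawCharset.INSTANCE)`; registering the charset through the CharsetProvider SPI is only needed if it has to be looked up by name.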
How strange -- yes, taking a look in the Java code for them, it appears that none of UTF_16BE, UTF_16, etc. make use of the obvious optimization that one of them has to be a no-op. I'm wondering if someone might have already created a raw double-byte encoding to speed this up -- it seems a fairly obvious performance optimization for cases like this where you care about the entropy content rather than the specific values. If not, it might make a good addition to the Java version of CityHash.
There is another possible approach whose speed would be worth testing:
private static byte[] toBytes( final String str )
{
    final int len = str.length();
    final char[] chars = str.toCharArray(); // copies the String's chars into a new array
    final byte[] buff = new byte[2 * len];
    for (int i = 0, j = 0; i < len; i++) {
        buff[j++] = (byte) chars[i]; // low byte; or str.charAt(i) and skipping str.toCharArray() might be faster?
        buff[j++] = (byte) (chars[i] >>> 8); // high byte; or >> might be faster?
    }
    return buff;
}
or just inline this since we only use it once. Old-school, I know.
Of course it would work, but extracting the chars into an array first (String.toCharArray()) creates an additional copy, incurring a CPU hit as well as increasing GC pressure. Regarding inlining, it is probably unnecessary, as the JIT will inline the method when it gets hot.
I think a custom encoder would offer better performance, as the internal char array is passed without copy.
Would the str.charAt(i) approach also be making a copy, or would that just go on the stack/in a register?
str.charAt(i) is just a simple array access, with the value passed on the stack. If the compiler inlines the call, it could be a viable option. You'd have to find it in the output of -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:+PrintCompilation.
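For reference, the charAt variant under discussion might look like the sketch below (hypothetical class name). It keeps all 16 bits of every char and writes the low byte first, matching the byte order of the toBytes proposal earlier in this thread. Its actual speed relative to the other options would still have to be measured.

```java
final class CharAtBytes {
    // Packs each char as two bytes, low byte first, without an intermediate char[].
    static byte[] toBytes(String str) {
        final int len = str.length();
        final byte[] buff = new byte[2 * len];
        for (int i = 0, j = 0; i < len; i++) {
            final char c = str.charAt(i);
            buff[j++] = (byte) c;         // low 8 bits
            buff[j++] = (byte) (c >>> 8); // high 8 bits
        }
        return buff;
    }
}
```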
- Sha1HashKey(String sql,
+ CityHash128Key(String sql,
      String parametersDefinition) {
      this(String.format("%s%s", sql, parametersDefinition));
String.format() is a terrible way to concatenate two Strings. If I wrote this in my original commit, I must have been high. Just change to this(sql + parametersDefinition).
Having taken another look, yes, this showed up in my YourKit runs -- in fact I saw it roughly as much as the character encoding issue we've been discussing above.
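As a quick sanity check that the two forms build the same key text, here is a small sketch with hypothetical helper names. Both produce identical strings, including for SQL that contains a literal percent sign, since only the format string itself is parsed for conversions.

```java
final class KeyTextDemo {
    // Original form: parses the format string on every call.
    static String viaFormat(String sql, String parametersDefinition) {
        return String.format("%s%s", sql, parametersDefinition);
    }

    // Suggested form: plain concatenation.
    static String viaConcat(String sql, String parametersDefinition) {
        return sql + parametersDefinition;
    }
}
```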
1) Further speedups to prepared statement hashing 2) Caching of '?' character positions in prepared statements to speed parameter substitution
@cheenamalhotra I have sent you a pull request cheenamalhotra#11 with many of the additional speedups we've discussed above.
Prepared statement performance fixes
Pull latest changes from Microsoft:dev branch
# Conflicts: # src/main/java/com/microsoft/sqlserver/jdbc/SQLServerConnection.java # src/main/java/com/microsoft/sqlserver/jdbc/SQLServerPreparedStatement.java # src/main/java/com/microsoft/sqlserver/jdbc/Util.java
added missing line for bulkcopy tests.
…by Rene) Add String comparison with CityHashKey and fix test failures
- ParsedSQLCacheItem cacheItem = new ParsedSQLCacheItem(parsedSql, paramCount, procName, returnValueSyntax);
+ ParsedSQLCacheItem cacheItem = new ParsedSQLCacheItem (parsedSql, parameterPositions, procName, returnValueSyntax);
Seems to have an extra space? Might need to apply formatter. Same line 267.
 *        SQL text to parse for positions of parameters to initialize.
 */
private static int[] locateParams(String sql) {
    List<Integer> parameterPositions = new ArrayList<Integer>(0);
Template type inferred from LHS, not necessary to declare it a second time.
Proposed change:
LinkedList<Integer> parameterPositions = new LinkedList<>();
List<Integer> parameterPositions = new LinkedList<>(); would be better imo
int i = 0;
for (Integer parameterPosition : parameterPositions) {
    result[i++] = parameterPosition;
}
Use a LinkedList to make adding/iterating faster OR use a parallel stream with ArrayList.
Edit: The stream option isn't very viable, and since order matters, wouldn't offer much performance gain. LinkedList is probably the way to go.
Edit2: int[] result = parameterPositions.stream().mapToInt(Integer::intValue).toArray(); is a good streaming alternative
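A self-contained sketch of the streaming conversion, under loud assumptions: the scan below is a naive stand-in for the driver's locateParams (it records every '?' and does not skip string literals, comments or quoted identifiers), and ArrayList is used only to keep the example short, which leaves the list-type question in this thread open.

```java
import java.util.ArrayList;
import java.util.List;

final class ParamPositionsSketch {
    // Naive scan: records the index of every '?' in the SQL text.
    static int[] locateParams(String sql) {
        List<Integer> parameterPositions = new ArrayList<>();
        for (int i = sql.indexOf('?'); i >= 0; i = sql.indexOf('?', i + 1)) {
            parameterPositions.add(i);
        }
        // A sequential stream preserves encounter order, so positions stay sorted.
        return parameterPositions.stream().mapToInt(Integer::intValue).toArray();
    }
}
```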
}
else {
    srcEnd = paramPositions[paramIndex];
}
5388 - 5393 can be replaced with
srcEnd = (paramIndex >= paramPositions.length) ? sqlSrc.length() : paramPositions[paramIndex];
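To show the ternary in context, here is a sketch of walking the cached positions to cut the SQL into the literal pieces between parameters. The method `literalSegments` and its class are hypothetical; the driver's real substitution loop is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;

final class SegmentSketch {
    // Returns the text before, between and after the '?' markers.
    static List<String> literalSegments(String sqlSrc, int[] paramPositions) {
        List<String> segments = new ArrayList<>();
        int srcBegin = 0;
        for (int paramIndex = 0; paramIndex <= paramPositions.length; paramIndex++) {
            // Past the last parameter, the segment runs to the end of the string.
            int srcEnd = (paramIndex >= paramPositions.length) ? sqlSrc.length() : paramPositions[paramIndex];
            segments.add(sqlSrc.substring(srcBegin, srcEnd));
            srcBegin = srcEnd + 1; // skip the '?' itself
        }
        return segments;
    }
}
```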
I need to test this and see how the performance is with the new string comparison added (IMO unnecessary for many customers, including us). If necessary we might ship with a version patched to remove the string comparison. But this is definitely an improvement, so I'm marking it 'Approve'.
cleaner code & logic
CityHash128Key(String s) {
    unhashedString = s;
    byte[] bytes = new byte[s.length()];
    s.getBytes(0, s.length(), bytes, 0);
getBytes is a deprecated method - suppress warning or replace?
Suppress the warning -- there is an excellent performance reason why we're using it. And as Brett has pointed out, it's never going away, because it's sometimes the right thing to do.
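Putting the pieces together, here is a minimal sketch of a key that suppresses the deprecation warning and falls back to a string comparison in equals. It is an illustration only: the class name is invented, and a 64-bit FNV-1a hash stands in for CityHash128, which is not part of the JDK. The string comparison is what keeps two keys apart when truncation to the low byte makes their hashes identical.

```java
final class HashedKeySketch {
    private final String unhashedString;
    private final long hash; // stand-in for the CityHash128 segments

    @SuppressWarnings("deprecation") // intentional: fast low-byte copy, no charset encoding
    HashedKeySketch(String s) {
        unhashedString = s;
        byte[] bytes = new byte[s.length()];
        s.getBytes(0, s.length(), bytes, 0);
        hash = fnv1a64(bytes);
    }

    // 64-bit FNV-1a, used here only as a placeholder hash function.
    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof HashedKeySketch)) {
            return false;
        }
        HashedKeySketch other = (HashedKeySketch) obj;
        // Cheap hash check first; the string comparison rules out collisions.
        return hash == other.hash && unhashedString.equals(other.unhashedString);
    }

    @Override
    public int hashCode() {
        return (int) (hash ^ (hash >>> 32));
    }
}
```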
Fix for review comments
* Update Snapshot for upcoming RTW release v7.0.0
* Change order of logic for checking the condition for using Bulk Copy API (#736) Fix | Change order of logic for checking the condition for using Bulk Copy API (#736)
* Update CHANGELOG.md
* Merge pull request #732 from cheenamalhotra/module (Export driver in automatic module) Introduce Automatic Module Name in POM.
* Update CHANGELOG.md
* Change Sha1HashKey to CityHash128Key for generating PreparedStatement handle and metadata cache keys (#717)
* Change Sha1HashKey to CityHash128Key
* Formatted code
* Prepared statement performance fixes 1) Further speedups to prepared statement hashing 2) Caching of '?' character positions in prepared statements to speed parameter substitution
* String compare for hash keys added missing line for bulkcopy tests.
* comment change
* Move CityHash class to a separate file
* spacings fixes cleaner code & logic
* Add | Adding useBulkCopyForBatchInsert property to Request Boundary methods (#739)
* Apply the collation name change to UTF8SupportTest
* Package changes for CityHash with license information (#740)
* Reformatted Code + Updated formatter (#742)
* Reformatted Code + Updated formatter
* Fix policheck issue with 'Country' keyword (#745)
* Adding a new test for beginRequest()/endRequest() (#746)
* Add | Adding a new test to notify the developers to consider beginRequest()/endRequest() when adding a new Connection API
* Fix | Fixes for issues reported by static analysis tools (SonarQube + Fortify) (#747)
* handle buffer reading for invalid buffer input by user
* Revert "handle buffer reading" This reverts commit 11e2bf4.
* updated javadocs (#754)
* fixed some typos in javadocs (#760)
* API and JavaDoc changes for Spatial Datatypes (#752) Add | API and JavaDoc changes for Spatial Datatypes (#752)
* Disallow non-parameterized queries for Bulk Copy API for batch insert (#756) fix | Disallow non-parameterized queries for Bulk Copy API for batch insert (#756)
* Formatting | Change scope of unwanted Public APIs + Code Format (#757)
* Fix unwanted Public APIs.
* Updated formatter to add new line to EOF + formatted project
* Release | Release 7.0 changelog and version update (#748)
* Updated Changelog + release version changes
* Changelog entry updated as per review.
* Updated changelog for new changes
* Terminology update: request boundary declaration APIs
* Trigger Appveyor
* Update Samples and add new samples for new features (#761)
* Update Samples and add new Samples for new features
* Update samples from Peter
* Updated JavaDocs
* Switch to block comment
* Update License copyright (#767)
Fixes issue #716