Performance problems with statement caching due to SHA1 hash #716
@RDearnaley @TobiasSQL If prepared StatementA is in the cache, and someone engineers the submission of a SQL string that has a hash collision with it, then StatementA may be executed instead of their engineered SQL. It would also be possible to pre-poison the cache by submitting the malicious SQL first, before StatementA has a chance to execute; in that case, when StatementA is executed, the engineered SQL would be executed instead due to the hash collision. However, this assumes that someone can submit arbitrary SQL anyway! So nothing is gained by engineering a collision in the first place. Also note that SHA-1 is essentially insecure these days; thanks to hardware acceleration, malicious actors can generate billions of hashes per day. Please do not conclude from this that the cache key needs to be more cryptographically secure. That would be insane. Given that such an attack would require allowing arbitrary SQL execution, nothing is being protected. The cache key algorithm should be as fast as possible while still providing extremely low collision probabilities. In the universe of "natural" SQL, as opposed to specifically engineered malicious SQL, the odds of a CityHash128 collision -- as I have pointed out elsewhere -- make it more likely that the Earth will be destroyed by an asteroid strike than that the cache will incur a collision. I would like to see the use of CityHash128 return to the driver. I have to believe that it was simply a mis-merge rather than a conscious decision. The CityHash128 implementation used was under Apache License 2.0, so there is no licensing issue. |
Hi @RDearnaley and @brettwooldridge I see there was an agreement over using CityHash128, but I'm also not sure why it wasn't merged with the feature; maybe it just got missed. I also agree with the potential performance improvement it would bring to this feature, and have the changes ready in PR #717 and in the attached Jar for you to test. I tested on a small-scale application, so the difference in performance wasn't that alarming to me. I'd be interested in getting repro code if possible, or the query that is prepared in your application. Let me also know if the change resolves the performance issues with your application and whether any changes are needed in the PR. |
@cheenamalhotra Thanks for the quick response -- we will give that a test, and will get back to you once we have results. Our Hibernate-generated queries are generally somewhere around 4kb to 16kb in size per query -- I should be able to get you some sample queries tomorrow. Meanwhile I've been doing a more detailed performance analysis with the YourKit Java performance profiler:
|
Working from memory here, but I think the statement is hashed regardless of prepared statement caching to avoid a re-parse. I could be mistaken. I think the idea of caching the parameter positions is a good one, again regardless of statement caching/reuse. Edit: and of course returning to the use of CityHash128 globally. |
Here is a sample of our SQL queries. Look for the last " | " string on each log line -- the SQL starts after that. Let me know if you also need parameter bind values or any schema details, but I assume the key performance effects here depend just on total SQL length and number of parameters (and possibly character-set issues). We've done some preliminary performance tests on the PR #717 code you sent out -- it's definitely better: for our test load it's now only about one and a half times as slow as the old driver, so there has been a significant improvement -- hopefully some more progress can be made to bring it even closer to parity. I haven't yet done a detailed YourKit analysis of the new code, but I did take a brief look, and based on that I have one new suggestion: I'm still seeing a significant amount of time spent inside the CityHash128Key.<init>(String) code. Some of that looks unavoidable, but three sample traces I saw at com.microsoft.sqlserver.jdbc.SQLServerConnection$CityHash128Key.<init>(String) look like it's copying the string to a byte array and altering its encoding in the process. If it were possible to avoid that and just hash the raw bytes (or, if that's not an option, to do a simple byte-for-byte copy of them), it would speed things up further. |
Looking at SQLServerConnection.java line 130, in com.microsoft.sqlserver.jdbc.SQLServerConnection$CityHash128Key.<init>(String), in PR #717:
I assume the third parameter there is supposed to be a number of bytes? Because you're passing a number of characters, which (depending on the default character set) may well be half the number of bytes -- Java uses UTF-16 internally, I think always the big-endian variant -- so you're currently only hashing the first part of the string, which is a bug. (If the default character set is an 8-bit character set and the contents of the string are 8-bit clean, then you'll get away with it, but the former definitely isn't a safe assumption.) The description of String.getBytes() says it "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array".
So if the string isn't already in the platform's default character set (I think we're using UTF-8 internally in our code, though I'm uncertain whether that would be the case for the JDBC driver -- for testing/reproducing this, I suggest you experiment with a couple of different default character-set values), it sounds like that will trigger character-set conversion, which in my brief look with YourKit is clearly happening for us, and seems to be where a significant proportion of the time inside the CityHash128Key(String) code was being spent (3 out of 5 random samples -- not a large sample size, I'm afraid, but clearly some of the time, likely at least half). Sadly we can't use the deprecated String.getBytes(int srcBegin, int srcEnd, byte[] dst, int dstBegin), because it discards the high-order bits (if the string isn't 8-bit clean). So I think we need to either use something like String.toCharArray()
and then figure out how to convert the resulting char[] to a byte[] (as quickly as possible, and if possible without copying it again), or it's probably faster (if it avoids triggering character-set re-encoding) to instead use String.getBytes() with an explicit UTF-16 charset. Or possibly that should be StandardCharsets.UTF_16LE or UTF_16, whichever avoids triggering character conversions and is fastest for the endianness of the platform in use and/or of Java. You'll need to experiment and figure out which of these is the fastest approach (and how it varies depending on endianness, if it does). With some of those you'll also need to watch out for a possible byte order mark, so some of the byte lengths may be off by 2 from the obvious lengths -- again, you'll need to experiment (probably in a debugger). (It would be nice if CityHash128 had handled all this mess for you and provided a correctly implemented and efficient interface taking a String.) |
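To make the char-count-versus-byte-count pitfall concrete, here is a small sketch (the class and method names are illustrative, not from the driver): a String's length() counts UTF-16 code units, so its byte length under UTF-16 is twice that, and the raw UTF-16 bytes can be extracted without going through a Charset encoder at all.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BytesSketch {
    // Copy a String's UTF-16 code units into a little-endian byte array
    // without invoking a Charset encoder (illustrative helper).
    static byte[] utf16leBytes(String s) {
        int len = s.length();
        byte[] out = new byte[len * 2];
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) c;              // low-order byte
            out[2 * i + 1] = (byte) (c >>> 8);  // high-order byte
        }
        return out;
    }

    public static void main(String[] args) {
        String sql = "SELECT * FROM t WHERE a = ?";
        // length() counts UTF-16 code units (chars), not bytes:
        System.out.println(sql.length());                                   // 27
        System.out.println(sql.getBytes(StandardCharsets.UTF_16LE).length); // 54
        System.out.println(utf16leBytes(sql).length);                       // 54
    }
}
```

Note that StandardCharsets.UTF_16LE emits no byte order mark, while UTF_16 prepends one, which is one source of the off-by-2 lengths mentioned above.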
Another stack trace where I see a noticeable amount of time being spent is: com.microsoft.sqlserver.jdbc.SQLServerResultSet$1CursorInitializer.onColMetaData(TDSReader) or a slight variant: com.microsoft.sqlserver.jdbc.SQLServerResultSet$1CursorInitializer.onColMetaData(TDSReader) Looking at the code for this area I don't immediately see any way to speed it up, but I'm not sure I really understand it -- it might be worth someone more familiar with the codebase taking a look in this region for possible optimizations. |
Given that what's being hashed is the SQL before parameter values are inserted, in our case I believe that's going to be entirely 7-bit ASCII -- or at least, we don't have any non-ASCII characters in our table names or column names (the database name is customer-configurable, so I'm less certain there, but it's not going to vary, so truncating bits from its characters before hashing is harmless). So for our scenario, using getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) should be exactly the right thing to do (if it's fast), and would produce the shortest byte array for CityHash128 to hash. However, what I don't know is whether other JDBC driver users could have table names or column names that aren't 7-bit ASCII -- I gather MS SQL Server allows Unicode characters in table names (I didn't check for column names). |
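For the all-ASCII case described above, the "truncate each char to one byte" approach can be sketched as follows (hypothetical helper name; this is only safe when the input is known to be 7-bit ASCII, and lossy otherwise):

```java
public class AsciiBytesSketch {
    // For SQL known to be 7-bit ASCII, each UTF-16 char fits in one byte,
    // so this yields a byte array half the size of a UTF-16 encoding.
    // Illustrative only: non-ASCII chars silently lose their high bits.
    static byte[] asciiBytes(String s) {
        int len = s.length();
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(asciiBytes("SELECT 1").length); // 8
    }
}
```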
@RDearnaley @cheenamalhotra Added a line comment. Regarding the onColMetaData(TDSReader) calls: I have a change sitting somewhere on one of my branches that adds caching of column metadata, but as I recall, I never completed the work. This can have a huge impact on performance, which is why I had started implementing it. Long ago, many years, I implemented the same in the now-defunct jTDS driver. I do not know when I will have time to revisit -- possibly toward the end of the summer. |
Brett:
That sounds like an excellent project to complete, and should significantly improve the HikariCP + mssql-jdbc combination. As and when it happens, please let us know -- I strongly suspect we’ll be interested in redoing the config + testing work to update our JDBC driver version to pick it up.
…--Roger
From: Brett Wooldridge [mailto:notifications@github.com]
Sent: Thursday, June 14, 2018 12:18 PM
To: Microsoft/mssql-jdbc
Cc: Roger Dearnaley; Mention
Subject: Re: [Microsoft/mssql-jdbc] Performance problems with statement caching due to SHA1 hash (#716)
|
@RDearnaley This is open source, feel free to jump in! 😁 I think it only took me a couple of nights to acquaint myself with the code and start making changes. Just so you know, I don't actually use SQL Server. I worked on the parse caching and PreparedStatement caching simply because HikariCP users expressed interest. You clearly have the skills and the tools to pinpoint bottlenecks, so jump on in, the water's warm! 💦 |
My management team are the people you need to persuade :-) |
@RDearnaley It's 5am here in Tokyo. Two words: free time. 😀 |
What sort of timescale should we expect for further work on the remaining parts of this issue (other than Brett's work on caching of column metadata, which as already discussed might happen later this summer, and which anyway doesn't appear to be the biggest remaining contributor)? |
Hi @RDearnaley We're currently in discussion with the security team about this change, while also considering performance improvements for newer driver versions. Will keep you and @brettwooldridge posted on any developments! |
@cheenamalhotra Having worked at Microsoft for more than five years at one point in my career, I have some idea what's involved in discussions with the security team, so understood. |
I wanted to echo something @brettwooldridge said -- the potential performance improvement from caching metadata is huge. In our YourKit testing, for our test load, roughly 2/3 of the CPU cycles spent in the JDBC driver are being spent at stack traces that are underneath:
So that suggests to me that adding metadata caching could potentially speed the JDBC driver up by up to a factor of three (depending on the cache hit rate and the time taken by cache management)! @brettwooldridge You said that that enhancement was something you had been working on and had put aside, but were hoping to get back to (possibly towards the end of the summer) -- what is the current status of that work (barely started, roughly half done, almost finished, ...)? Is the current code for this on GitHub, and if so, where? We don't have any specific expertise with JDBC drivers, but we're quite keen to see this finished, and I want to explore our options for doing so. If we wanted to make this happen sooner, how easy or hard do you think it would be for us to jump in and try to contribute on top of what you currently have: how much effort would you expect getting up to speed on the relevant parts of the codebase to be, and what sort of amount of work remains to be done? How tricky is this kind of work to get right and to test? |
I took a look at the ParameterHash.scanSQLForChar() code: it is already running scanSQLForChar repeatedly and then caching the number of '?' characters for parameter substitutions -- switching that to caching an array of their positions is easy. I have the code change for it done and tested, and am hoping to figure out how to send it to you as a pull request. On the issue of metadata caching, a more detailed measurement with a larger sample size suggests the potential speedup for our test load is more like a factor of two (rather than three, as I'd said above), but it's still huge (if not quite as huge as I'd thought). |
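The change described -- caching the positions of the '?' markers rather than just their count -- can be sketched like this (simplified: the driver's real scan also has to skip over quoted literals and comments, and the names here are illustrative, not from the PR):

```java
import java.util.ArrayList;
import java.util.List;

public class ParamScanSketch {
    // Record the index of every parameter marker in one pass, so later
    // parameter substitution can jump straight to each position instead
    // of re-scanning the SQL text once per parameter.
    static int[] parameterPositions(String sql) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < sql.length(); i++) {
            if (sql.charAt(i) == '?') {
                positions.add(i);
            }
        }
        int[] out = new int[positions.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = positions.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] p = parameterPositions("SELECT * FROM t WHERE a = ? AND b = ?");
        System.out.println(p.length); // 2
    }
}
```

The position array is computed once when the statement is prepared and then reused on every execution, which is what makes it a win independent of statement caching.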
@RDearnaley Sorry for the late response. The code for metadata caching is on my speedify branch. As I recall, it was pretty far along ... but master has likely drifted some since that work, so there are likely to be merge failures to clean up if you attempt to rebase it. You might have better luck cherry-picking the commits. |
Hi @RDearnaley @brettwooldridge, we've finished discussions with the security team and are currently at a crossroads implementation-wise. Seeing how there is significant interest in this change, we would like to share some information and get your opinions regarding certain decisions. SHA-1 (160-bit) will be removed from the driver starting from the next release; it is deemed unsafe and is not a security-team-approved hash function. The security team is firm regarding the criteria for the hash function we decide to use. Our solution must adhere to one of the following: use an approved hash function (SHA-256, SHA-384, SHA-512, etc.), or use an unapproved but reliable (low collision chance) hash function and verify that the strings are the same after the hash lookup. The team is currently looking at the following 3 approaches, ranked by preference.
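The "unapproved but fast hash plus string verification" criterion could be met with a cache key along these lines (a sketch under assumptions: the class and field names are hypothetical, and the two 64-bit halves would come from whatever 128-bit fast hash is chosen):

```java
public final class SqlHashKey {
    private final long h1, h2;   // halves of an assumed 128-bit fast hash
    private final String sql;    // retained so equals() can verify exactly

    public SqlHashKey(String sql, long h1, long h2) {
        this.sql = sql;
        this.h1 = h1;
        this.h2 = h2;
    }

    @Override
    public int hashCode() {
        return (int) (h1 ^ (h1 >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SqlHashKey)) {
            return false;
        }
        SqlHashKey k = (SqlHashKey) o;
        // Cheap 128-bit comparison runs first; the full string compare
        // only executes when both hash halves already agree.
        return k.h1 == h1 && k.h2 == h2 && k.sql.equals(sql);
    }
}
```

Because the string compare is short-circuited behind the 128-bit compare, it only costs anything on a genuine cache hit, where it is a single pass over memory that is about to be used anyway.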
Please let us know what you would like to see in the driver, or any alternatives which would satisfy all parties. As for the speedify branch changes, we can take a look in a separate PR. This hash function change is confirmed to be in the next official release, but any other optimizations likely will not be. Seeing as @RDearnaley requires a production-ready release, we don't suggest rushing the speedify changes. |
Hi @RDearnaley, thanks for contributing to the PR! The changes look good to me; I'm testing them today and shall merge the PR tomorrow if everything goes fine. And as @rene-ye mentioned above, we'd like to know your opinion on the HashKey changes. Going ahead with the 2nd option would allow you to attach a custom CityHash implementation to the driver via the connection string, and the driver would then use the implemented methods to generate HashKeys. |
@rene-ye I'm afraid I think your security team has misunderstood the situation. There are basically two likely security scenarios for customers using the JDBC driver:
From a security point of view, if you think that you may have some customers in category 2), then the correct thing to do is to provide a JDBC driver setting to control this choice: do customers want the secure mode, or the fast mode? It is generally a good idea to be secure by default, so this should default to option 2), and the section in the manual describing the option used to switch to 1) should make it clear that if your end users can generate an arbitrary number of different SQL queries (not just the parameters in them, but the actual SQL), then using this option is unsafe for you. So the option would either switch between SHA-256 (default) and CityHash128 (for us), or would switch between CityHash128 with (default) and without (for us) keeping and comparing a full-SQL-string copy. So what I'm suggesting is most similar to your option 2 above, but avoids the need for the customer to roll their own implementation -- you just provide them with a switch to choose between two implementations: a performant non-crypto implementation for customers who are sure that they don't have any concern about arbitrary SQL being used to deliberately engineer cache hash collisions, and a less performant but safer implementation that uses a cryptographically strong hash to block that attack. If you aren't willing to do that, then the only one of the three options you describe above that we could live with is your option 2, and only if you could make the interface for it sufficiently fast that it doesn't add significant overhead compared to the actual CityHash128 hashing; otherwise we're likely to fork your driver and build our own patched version with this very slow and (for us) completely unnecessary security ripped out. @brettwooldridge I'd love to hear your input on this too. |
@cheenamalhotra I am in agreement with @RDearnaley. I too believe that the best option is to simply allow a driver/DataSource property to select between SHA-xxx (whatever ends up performing best -- I would suggest benchmarking on 64-bit) and CityHash128. Something like a useFastHash property, whose default value is false. This lowers the burden for users to employ a high-performance non-cryptographic hash. Even if an off-the-shelf open source implementation of some kind of extension/SPI model became available, the user would still have to wrestle with classpath/classloading issues in various containers (Spring, OSGi, JEE, Servlet, Play Framework, etc.). |
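A minimal sketch of how such a property might gate the choice (only the useFastHash name comes from the suggestion above; the interface, class names, and the stand-in fast hash are all hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Properties;

public class HashChoiceSketch {
    interface SqlHasher {
        byte[] hash(String sql) throws Exception;
    }

    // Secure by default: SHA-256 unless the user explicitly opts in
    // to the fast, non-cryptographic path.
    static SqlHasher forProperties(Properties props) {
        boolean fast = Boolean.parseBoolean(props.getProperty("useFastHash", "false"));
        if (fast) {
            // Stand-in for a CityHash128-style non-cryptographic hash.
            return sql -> longToBytes(sql.hashCode());
        }
        return sql -> MessageDigest.getInstance("SHA-256")
                .digest(sql.getBytes(StandardCharsets.UTF_16LE));
    }

    static byte[] longToBytes(long v) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (v >>> (8 * i));
        }
        return out;
    }
}
```

The point of the sketch is only the selection mechanism: one documented boolean property, resolved once at connection setup, with the secure algorithm as the default.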
Hi @RDearnaley @dancernetworks, I understand your concerns and the need for a fast hashing algorithm. However, we cannot officially offer CityHash128 as an option without also including a string comparison, which would degrade the performance this change is intended to enhance. We thought of option 2 because, if customers were allowed to implement any hashing they wanted, they would not be bound by the restrictions the driver team has to follow. If you require the hash to be as performant as possible, then officially integrating CityHash128 will not yield those results. |
@rene-ye @RDearnaley Ok, I have a counter-proposal to address the concerns of the Microsoft security team. Forget SHA-256 and CityHash. The solution we are looking for is SipHash. Instead of hashing purely on the SQL, a “secret” 128-bit key is used as well. Since the cache is ephemeral (not surviving restarts), this key can be generated in memory at startup. It is impossible to generate intentional hash collisions with SipHash because of this key — the security characteristics are at least as secure as SHA. Yet, the hashing performance characteristics are similar to CityHash, also without requiring original text comparison. All that is left is an implementation of SipHash for Java... |
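The keying idea can be illustrated like this (NOT a SipHash implementation -- just a toy FNV-style mixer showing how a per-process random key makes collisions unpredictable to an attacker; a real integration would use actual SipHash rounds):

```java
import java.security.SecureRandom;

public final class KeyedHashSketch {
    // 128-bit key generated once at startup; the cache does not survive
    // restarts, so the key never needs to be persisted or shared.
    private static final long K0, K1;
    static {
        SecureRandom r = new SecureRandom();
        K0 = r.nextLong();
        K1 = r.nextLong();
    }

    static long hash(CharSequence s) {
        long h = K0;                                // key-dependent start state
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x100000001b3L; // FNV-1a style mixing step
            h ^= K1;                                // fold in the second key half
        }
        return h;
    }

    public static void main(String[] args) {
        // Stable within one process, unpredictable across processes.
        System.out.println(hash("SELECT 1") == hash("SELECT 1")); // true
    }
}
```

Without knowing K0/K1, an attacker cannot precompute two SQL strings that collide, which is exactly the property that makes the defensive string comparison unnecessary in Brett's proposal.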
Hi @brettwooldridge @RDearnaley We've been approved to implement CityHash in the driver, given that we also compare strings when needed, as @rene-ye mentioned above. The string compare doesn't seem to cost a big performance deficit; performance is almost the same. For SipHash, too, we'd have to implement string comparison, as it doesn't fall in the "Approved HashKey" list from Microsoft. So unless it's SHA-xxx, SipHash/CityHash/(anything else) wouldn't matter, and if the performance is the same as CityHash, we might as well go with this one. With regards to the performance improvements in metadata caching, we can track that separately from Brett's speedify branch, for which we'd have to cherry-pick changes. We won't be targeting those changes for our upcoming release, but PR #717 has been updated with all requests and we are looking forward to merging it soon. We request you to kindly review and test your scenarios and let us know if it gives you the desired performance improvement. |
Closing issue as PR #717 merged. |
Driver version or jar name
mssql-jdbc version 6.4.0.jre8
SQL Server version
We support MS SQL Server versions back to 2012, and have been testing on version 11.0.5058.0
Client operating system
We support Windows/Linux/Mac/various other flavors of UNIX; we have been testing with the client + JDBC driver on Linux (with the SQL Server on Windows, obviously)
Java/JVM version
1.8
Table schema
(large, complex, and irrelevant to this issue)
Problem description
We recently switched our code from using the old sqljdbc42 4.2.6420.100 MS SQL Server JDBC driver to using mssql-jdbc version 6.4.0.jre8, and encountered a significant (roughly 20-35%) performance hit when doing so, which we have been investigating. Our performance test setup uses a repeating pattern of on the order of 100 distinct read queries of large SQL text length (up to around 16 KB) that were generated by Hibernate, with some queries repeated many times within each overall repeat pattern, each individual query returning only a small amount of data, and with the client and database on two different machines in the same datacenter. Thus prepared statement caching is a big win -- the connection pool we are using no longer provides this, which is why we switched to mssql-jdbc version 6.4.0.jre8.
Using the YourKit Java profiling tool, I tracked down part of the problem to the hashing code for the JDBC driver's prepared statement cache, which is using SHA-1 hashing (which it delegates to the Java security system, which in turn delegates it to our chosen security provider library). For this test setup, 10-15% of the total test time was being spent performing SHA-1 hashing in order to check the prepared statement cache -- quite unacceptable overhead for a JDBC driver! (We experimented with simply turning the caching off, but alas that made things even slower -- despite its current heavy hashing overhead, the prepared statement cache is still 5-10% faster on average than re-preparing every statement.)
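A rough micro-benchmark sketch of the per-statement SHA-1 cost (the 16 KB size is assumed to match our workload's SQL text; absolute numbers will vary by hardware and security provider):

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class Sha1CostSketch {
    public static void main(String[] args) throws Exception {
        byte[] sql = new byte[16 * 1024];           // stand-in for a 16 KB SQL string
        Arrays.fill(sql, (byte) 'x');
        MessageDigest md = MessageDigest.getInstance("SHA-1");

        // Warm up, then time repeated digests of the same buffer.
        for (int i = 0; i < 10_000; i++) {
            md.digest(sql);
        }
        long t0 = System.nanoTime();
        int iterations = 100_000;
        for (int i = 0; i < iterations; i++) {
            md.digest(sql);                         // digest() resets the engine
        }
        long perCall = (System.nanoTime() - t0) / iterations;
        System.out.println("~" + perCall + " ns per 16 KB SHA-1 digest");
    }
}
```

Multiplying the per-digest time by the per-test query volume gives an estimate you can compare against the profiler's 10-15% figure.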
Here is the relevant section of the stack trace (in caller-first order) when this SHA-1 behavior was happening:
... (Hikari connection pool library) ...
ProxyPreparedStatement.java:52 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery()
SQLServerPreparedStatement.java:401 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(TDSCommand)
SQLServerStatement.java:204 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(TDSCommand)
SQLServerStatement.java:224 com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(TDSCommand)
SQLServerConnection.java:2713 com.microsoft.sqlserver.jdbc.TDSCommand.execute(TDSWriter, TDSReader)
IOBuffer.java:7344 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute()
SQLServerPreparedStatement.java:479 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement$PrepStmtExecCmd)
SQLServerPreparedStatement.java:536 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.reuseCachedHandle(boolean, boolean, String)
SQLServerPreparedStatement.java:960 com.microsoft.sqlserver.jdbc.SQLServerConnection$Sha1HashKey.<init>(String, String, String)
SQLServerConnection.java:126 com.microsoft.sqlserver.jdbc.SQLServerConnection$Sha1HashKey.<init>(String)
SQLServerConnection.java:130 java.security.MessageDigest.digest(byte[])
MessageDigest.java:410 java.security.MessageDigest.update(byte[])
... (BouncyCastle security provider library) ...
Since what is being hashed is obviously the SQL statement before parameter values are applied, it should not be under untrusted end-user control (or if it is, it's extremely hard not to leave yourself open to DROP-TABLE-type SQL injection attacks from your end users). So there should be no concern over a malicious end user deliberately forcing cache collisions by generating two arbitrary SQL queries carefully chosen so that they SHA-1 hash to the same value (as opposed to choosing arbitrary parameter values). Thus using a cryptographic-strength hash is wasteful in this context -- any hash with sufficient size and sufficiently good hashing properties to avoid accidental collisions will do, and the selection should be based on performance, on which any crypto-strength hash is going to lose.
I brought this issue to the attention of Brett Wooldridge, who I understand originally contributed this code, see:
https://groups.google.com/forum/#!topic/hikari-cp/OTsKIFd3joY
and he tells me that he submitted it using CityHash128, a fast (high-throughput-speed) non-cryptographic-strength 128-bit hash, see:
#291 (review)
It appears that somehow this got switched (back?) to SHA-1 before 6.4 was released. I strongly recommend that you switch back to using CityHash128 as he contributed [or, if there is some license issue with CityHash128, then to some comparable high-throughput-speed non-cryptographic-strength hash] -- the performance with SHA-1 is unacceptably slow.
[If for some reason you feel that a few customers may have a security profile where the use of SHA-1 might actually be necessary, since their end users can issue arbitrary SQL queries (but are somehow blocked from SQL injection attacks), then I strongly recommend that you make the choice of whether to use fast hashing or cryptographic-strength hashing be configurable e.g. by a JDBC URL parameter, and that you mention in the documentation for that parameter that the use of cryptographic-strength hashing imposes a significant performance overhead, and should be avoided except in security scenarios where it is actually necessary (and that constructing a security scenario where that is the case but that isn't open to SQL injection attacks is challenging).]
Expected behavior and actual behavior
Expected behavior: SQL hashing for the prepared statement cache takes negligible time
Actual behavior: SHA-1 SQL hashing for the prepared statement cache takes the majority of all time used by the JDBC driver, and 10-15% of the entire run-time in our performance test case
Repro code
See problem description above -- in short, turn prepared statement caching on and make use of the prepared statement cache for any realistic Hibernate-like read load that includes repeating fairly long read queries that individually retrieve fairly small amounts of data.