Performance problems with statement caching due to SHA1 hash #716
@RDearnaley @TobiasSQL If prepared StatementA is in the cache, and someone engineers the submission of a SQL string that has a hash collision with it, then StatementA may be executed instead of their engineered SQL. It would also be possible to pre-poison the cache by submitting the malicious SQL first, before StatementA has a chance to execute; in that case, when StatementA is executed, the engineered SQL would be executed instead due to the hash collision. However, this assumes that someone can submit arbitrary SQL anyway! So nothing is gained by engineering a collision in the first place. Also note that SHA-1 is essentially insecure these days; thanks to hardware acceleration, malicious actors can generate billions of hashes per day. Please do not conclude from this that the cache key needs to be more cryptographically secure. That would be insane. Given that such an attack would require allowing arbitrary SQL execution, nothing is being protected. The cache key algorithm should be as fast as possible while still providing extremely low collision probabilities. In the universe of "natural" SQL, as opposed to specifically engineered malicious SQL, the odds of a CityHash128 collision -- as I have pointed out elsewhere -- make it more likely that the Earth will be destroyed by an asteroid strike than that the cache will incur a collision. I would like to see the use of CityHash128 return to the driver. I have to believe that it was simply a mis-merge rather than a conscious decision. The CityHash128 implementation used was under Apache License 2.0, so there is no licensing issue. |
Hi @RDearnaley and @brettwooldridge I see there was an agreement over using CityHash128, but I'm also not sure why it wasn't merged with the feature; maybe it just got missed. I also agree with the potential performance improvement it would bring to this feature, and have the changes ready in PR #717 and in the attached Jar for you to test. I tested on a small-scale application, so the difference in performance wasn't that alarming to me. I'd be interested in getting repro code if possible, or the query that is prepared in your application. Let me also know if the change resolves the performance issues with your application and whether any changes are needed in the PR. |
@cheenamalhotra Thanks for the quick response -- we will give that a test, and will get back to you once we have results. Our Hibernate-generated queries are generally somewhere around 4kb to 16kb in size per query -- I should be able to get you some sample queries tomorrow. Meanwhile I've been doing a more detailed performance analysis with the YourKit Java performance profiler:
|
Working from memory here, but I think the statement is hashed regardless of prepared statement caching to avoid a re-parse. I could be mistaken. I think the idea of caching the parameter positions is a good one, again regardless of statement caching/reuse. Edit: and of course returning to the use of CityHash128 globally. |
Here is a sample of our SQL queries. Look for the last " | " string on each log line -- the SQL starts after that. Let me know if you also need parameter bind values or any schema details, but I assume the key performance effects here depend just on total SQL length and number of parameters (and possibly character-set issues). We've done some preliminary performance tests on the PR #717 code you sent out -- it's definitely better: for our test load it's now only about one and a half times as slow as the old driver, so there has been a significant improvement -- hopefully some more progress can be made to bring it even closer to parity. I haven't yet done a detailed YourKit analysis of the new code, but I did take a brief look, and based on that I have one new suggestion: I'm still seeing a significant amount of time spent inside the CityHash128Key.<init>(String) code. Some of that looks unavoidable, but three sample traces I saw at com.microsoft.sqlserver.jdbc.SQLServerConnection$CityHash128Key.<init>(String) look like it's copying the string to a byte array and altering its encoding in the process. If it were possible to avoid that and just hash the raw bytes (or, if that's not an option, to do a simple byte-for-byte copy of them), it would speed things up further. |
Looking at SQLServerConnection.java line 130, in com.microsoft.sqlserver.jdbc.SQLServerConnection$CityHash128Key.<init>(String), in PR #717:
I assume the third parameter there is supposed to be a number of bytes? Because you're passing a number of characters, which (depending on the default character set) may well be half the number of bytes -- Java uses UTF-16 internally, I think always the big-endian variant -- so you're currently only hashing the first part of the string, which is a bug. (If the default character set is an 8-bit character set and the contents of the string are 8-bit clean, then you'll get away with it, but the former definitely isn't a safe assumption.) The description of String.getBytes() says it "encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array".
So if the string isn't already in the platform's default character set (I think we're using UTF-8 internally in our code, though I'm uncertain whether that would be the case for the JDBC driver -- for testing/reproducing this, I suggest you experiment with a couple of different default character-set values), it sounds like that will trigger character-set conversion, which in my brief look with YourKit is clearly happening for us, and seems to be where a significant proportion of the time inside the CityHash128Key(String) code was being spent (3 out of 5 random samples -- not a large sample size, I'm afraid, but clearly some of the time, likely at least half). Sadly we can't use the deprecated String.getBytes(int srcBegin, int srcEnd, byte[] dst, int dstBegin), because it discards the high-order bits (if the string isn't 8-bit clean). So I think we need to either use something like String.toCharArray()
and then figure out how to convert the resulting char[] to a byte[] (as quickly as possible, and if possible without copying it again), or it's probably faster (if it avoids triggering character-set re-encoding) to instead use String.getBytes() with an explicit UTF-16 charset. Or possibly that should be StandardCharsets.UTF_16LE or UTF_16, whichever avoids triggering character conversions and is fastest for the endianness of the platform in use and/or of Java. You'll need to experiment and figure out which of these is the fastest approach (and how it varies depending on endianness, if it does). With some of those you'll also need to watch out for a possible byte order mark, so some of the byte lengths may be off by 2 from the obvious lengths -- again, you'll need to experiment (probably in a debugger). (It would be nice if CityHash128 had handled all this mess for you and provided a correctly implemented and efficient interface taking a String.) |
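To make the char-count-versus-byte-count pitfall concrete, here is a small sketch (the class and method names are illustrative, not from the driver): a String's length() counts UTF-16 code units, so its byte length under UTF-16 is twice that, and the raw UTF-16 bytes can be extracted without going through a Charset encoder at all.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BytesSketch {
    // Copy a String's UTF-16 code units into a little-endian byte array
    // without invoking a Charset encoder (illustrative helper).
    static byte[] utf16leBytes(String s) {
        int len = s.length();
        byte[] out = new byte[len * 2];
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) c;              // low-order byte
            out[2 * i + 1] = (byte) (c >>> 8);  // high-order byte
        }
        return out;
    }

    public static void main(String[] args) {
        String sql = "SELECT * FROM t WHERE a = ?";
        // length() counts UTF-16 code units (chars), not bytes:
        System.out.println(sql.length());                                   // 27
        System.out.println(sql.getBytes(StandardCharsets.UTF_16LE).length); // 54
        System.out.println(utf16leBytes(sql).length);                       // 54
    }
}
```

Note that StandardCharsets.UTF_16LE emits no byte order mark, while UTF_16 prepends one, which is one source of the off-by-2 lengths mentioned above.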
Another stack trace where I see a noticeable amount of time being spent is: com.microsoft.sqlserver.jdbc.SQLServerResultSet$1CursorInitializer.onColMetaData(TDSReader) or a slight variant: com.microsoft.sqlserver.jdbc.SQLServerResultSet$1CursorInitializer.onColMetaData(TDSReader) Looking at the code for this area I don't immediately see any way to speed it up, but I'm not sure I really understand it -- it might be worth someone more familiar with the codebase taking a look in this region for possible optimizations. |
Given that what's being hashed is the SQL before parameter values are inserted, in our case I believe that's going to be entirely 7-bit ASCII -- or at least, we don't have any non-ASCII characters in our table names or column names (the database name is customer-configurable, so I'm less certain there, but it's not going to vary, so truncating bits from its characters before hashing is harmless). So for our scenario, using getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) should be exactly the right thing to do (if it's fast), and would produce the shortest byte array for CityHash128 to hash. However, what I don't know is whether other JDBC driver users could have table names or column names that aren't 7-bit ASCII -- I gather MS SQL Server allows Unicode characters in table names (I didn't check for column names). |
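For the all-ASCII case described above, the "truncate each char to one byte" approach can be sketched as follows (hypothetical helper name; this is only safe when the input is known to be 7-bit ASCII, and lossy otherwise):

```java
public class AsciiBytesSketch {
    // For SQL known to be 7-bit ASCII, each UTF-16 char fits in one byte,
    // so this yields a byte array half the size of a UTF-16 encoding.
    // Illustrative only: non-ASCII chars silently lose their high bits.
    static byte[] asciiBytes(String s) {
        int len = s.length();
        byte[] out = new byte[len];
        for (int i = 0; i < len; i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(asciiBytes("SELECT 1").length); // 8
    }
}
```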
@RDearnaley @cheenamalhotra Added a line comment. Regarding the onColMetaData(TDSReader) calls: I have a change sitting somewhere on one of my branches that adds caching of column metadata, but as I recall, I never completed the work. This can have a huge impact on performance, which is why I had started implementing it. Long ago, many years, I implemented the same in the now-defunct jTDS driver. I do not know when I will have time to revisit -- possibly toward the end of the summer. |
Brett:
That sounds like an excellent project to complete, and should significantly improve the HikariCP + mssql-jdbc combination. As and when it happens, please let us know -- I strongly suspect we’ll be interested in redoing the config + testing work to update our JDBC driver version to pick it up.
…--Roger
From: Brett Wooldridge [mailto:notifications@github.com]
Sent: Thursday, June 14, 2018 12:18 PM
To: Microsoft/mssql-jdbc
Cc: Roger Dearnaley; Mention
Subject: Re: [Microsoft/mssql-jdbc] Performance problems with statement caching due to SHA1 hash (#716)
|
@RDearnaley This is open source, feel free to jump in! 😁 I think it only took me a couple of nights to acquaint myself with the code and start making changes. Just so you know, I don't actually use SQL Server. I worked on the parse caching and PreparedStatement caching simply because HikariCP users expressed interest. You clearly have the skills and the tools to pinpoint bottlenecks, so jump on in, the water's warm! 💦 |
My management team are the people you need to persuade :-) |
@RDearnaley It's 5am here in Tokyo. Two words: free time. 😀 |
What sort of timescale should we expect for further work on the remaining parts of this issue (other than Brett's work on caching of column metadata, which as already discussed might happen later this summer, and which anyway doesn't appear to be the biggest remaining contributor)? |
Hi @RDearnaley We're currently in discussion with the security team about this change, while also considering performance improvements for newer driver versions. Will keep you and @brettwooldridge posted on any developments! |
@cheenamalhotra Having worked at Microsoft for more than five years at one point in my career, I have some idea what's involved in discussions with the security team, so understood. |
I wanted to echo something @brettwooldridge said -- the potential performance improvement from caching metadata is huge. In our YourKit testing, for our test load, roughly 2/3 of the CPU cycles spent in the JDBC driver are being spent at stack traces that are underneath:
So that suggests to me that adding metadata caching could potentially speed the JDBC driver up by up to a factor of three (depending on the cache hit rate and the time taken by cache management)! @brettwooldridge You said that that enhancement was something you had been working on and had put aside, but were hoping to get back to (possibly towards the end of the summer) -- what is the current status of that work (barely started, roughly half done, almost finished, ...)? Is the current code for this on GitHub, and if so, where? We don't have any specific expertise with JDBC drivers, but we're quite keen to see this finished, and I want to explore our options for doing so. If we wanted to make this happen sooner, how easy or hard do you think it would be for us to jump in and try to contribute on top of what you currently have: how much effort would you expect getting up to speed on the relevant parts of the codebase to be, and what sort of amount of work remains to be done? How tricky is this kind of work to get right and to test? |
I took a look at the ParameterHash.scanSQLForChar() code: it is already running scanSQLForChar repeatedly and then caching the number of '?' characters for parameter substitutions -- switching that to caching an array of their positions is easy. I have the code change for it done and tested, and am hoping to figure out how to send it to you as a pull request. On the issue of metadata caching, a more detailed measurement with a larger sample size suggests the potential speedup for our test load is more like a factor of two (rather than three, as I'd said above), but it's still huge (if not quite as huge as I'd thought). |
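The change described -- caching the positions of the '?' markers rather than just their count -- can be sketched like this (simplified: the driver's real scan also has to skip over quoted literals and comments, and the names here are illustrative, not from the PR):

```java
import java.util.ArrayList;
import java.util.List;

public class ParamScanSketch {
    // Record the index of every parameter marker in one pass, so later
    // parameter substitution can jump straight to each position instead
    // of re-scanning the SQL text once per parameter.
    static int[] parameterPositions(String sql) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < sql.length(); i++) {
            if (sql.charAt(i) == '?') {
                positions.add(i);
            }
        }
        int[] out = new int[positions.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = positions.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] p = parameterPositions("SELECT * FROM t WHERE a = ? AND b = ?");
        System.out.println(p.length); // 2
    }
}
```

The position array is computed once when the statement is prepared and then reused on every execution, which is what makes it a win independent of statement caching.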
@RDearnaley Sorry for the late response. The code for metadata caching is on my speedify branch. As I recall, it was pretty far along ... but master has likely drifted some since that work, so there are likely to be merge failures to clean up if you attempt to rebase it. You might have better luck cherry-picking the commits. |
Hi @RDearnaley @brettwooldridge, we've finished discussions with the security team and are currently at a crossroads implementation-wise. Seeing how there is significant interest in this change, we would like to share some information and get your opinions regarding certain decisions. SHA-1 (160-bit) will be removed from the driver starting from the next release; it is deemed unsafe and is not a security-team-approved hash function. The security team is firm regarding the criteria for the hash function we decide to use. Our solution must adhere to one of the following: use an approved hash function (SHA-256, SHA-384, SHA-512, etc.), or use an unapproved but reliable (low collision chance) hash function and verify that the strings are the same after the hash lookup. The team is currently looking at the following 3 approaches, ranked by preference.
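The "unapproved but fast hash plus string verification" criterion could be met with a cache key along these lines (a sketch under assumptions: the class and field names are hypothetical, and the two 64-bit halves would come from whatever 128-bit fast hash is chosen):

```java
public final class SqlHashKey {
    private final long h1, h2;   // halves of an assumed 128-bit fast hash
    private final String sql;    // retained so equals() can verify exactly

    public SqlHashKey(String sql, long h1, long h2) {
        this.sql = sql;
        this.h1 = h1;
        this.h2 = h2;
    }

    @Override
    public int hashCode() {
        return (int) (h1 ^ (h1 >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SqlHashKey)) {
            return false;
        }
        SqlHashKey k = (SqlHashKey) o;
        // Cheap 128-bit comparison runs first; the full string compare
        // only executes when both hash halves already agree.
        return k.h1 == h1 && k.h2 == h2 && k.sql.equals(sql);
    }
}
```

Because the string compare is short-circuited behind the 128-bit compare, it only costs anything on a genuine cache hit, where it is a single pass over memory that is about to be used anyway.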
Please let us know what you would like to see in the driver, or any alternatives which would satisfy all parties. As for the speedify branch changes, we can take a look in a separate PR. This hash function change is confirmed to be in the next official release, but any other optimizations likely will not be. Seeing as @RDearnaley requires a production-ready release, we don't suggest rushing the speedify changes. |
Hi @RDearnaley, thanks for contributing to the PR! The changes look good to me; I'm testing them today and shall merge the PR tomorrow if everything goes fine. And as @rene-ye mentioned above, we'd like to know your opinion on the HashKey changes. Going ahead with the 2nd option would allow you to attach a custom CityHash implementation to the driver via the connection string, and the driver would then use the implemented methods to generate HashKeys. |
@rene-ye I'm afraid I think your security team has misunderstood the situation. There are basically two likely security scenarios for customers using the JDBC driver:
From a security point of view, if you think that you may have some customers in category 2), then the correct thing to do is to provide a JDBC driver setting to control this choice: do customers want the secure mode, or the fast mode? It is generally a good idea to be secure by default, so this should default to option 2), and the section in the manual describing the option used to switch to 1) should make it clear that if your end users can generate an arbitrary number of different SQL queries (not just the parameters in them, but the actual SQL), then using this option is unsafe for you. So the option would either switch between SHA-256 (default) and CityHash128 (for us), or would switch between CityHash128 with (default) and without (for us) keeping and comparing a full-SQL-string copy. So what I'm suggesting is most similar to your option 2 above, but avoids the need for the customer to roll their own implementation -- you just provide them with a switch to choose between two implementations: a performant non-crypto implementation for customers who are sure that they don't have any concern about arbitrary SQL being used to deliberately engineer cache hash collisions, and a less performant but safer implementation that uses a cryptographically strong hash to block that attack. If you aren't willing to do that, then the only one of the three options you describe above that we could live with is your option 2, and only if you could make the interface for it sufficiently fast that it doesn't add significant overhead compared to the actual CityHash128 hashing; otherwise we're likely to fork your driver and build our own patched version with this very slow and (for us) completely unnecessary security ripped out. @brettwooldridge I'd love to hear your input on this too. |
@cheenamalhotra I am in agreement with @RDearnaley. I too believe that the best option is to simply allow a driver/DataSource property to select between SHA-xxx (whatever ends up performing best -- I would suggest benchmarking on 64-bit) and CityHash128. Something like a useFastHash property, whose default value is false. This lowers the burden for users to employ a high-performance non-cryptographic hash. Even if an off-the-shelf open source implementation of some kind of extension/SPI model became available, the user would still have to wrestle with classpath/classloading issues in various containers (Spring, OSGi, JEE, Servlet, Play Framework, etc.). |
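A minimal sketch of how such a property might gate the choice (only the useFastHash name comes from the suggestion above; the interface, class names, and the stand-in fast hash are all hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Properties;

public class HashChoiceSketch {
    interface SqlHasher {
        byte[] hash(String sql) throws Exception;
    }

    // Secure by default: SHA-256 unless the user explicitly opts in
    // to the fast, non-cryptographic path.
    static SqlHasher forProperties(Properties props) {
        boolean fast = Boolean.parseBoolean(props.getProperty("useFastHash", "false"));
        if (fast) {
            // Stand-in for a CityHash128-style non-cryptographic hash.
            return sql -> longToBytes(sql.hashCode());
        }
        return sql -> MessageDigest.getInstance("SHA-256")
                .digest(sql.getBytes(StandardCharsets.UTF_16LE));
    }

    static byte[] longToBytes(long v) {
        byte[] out = new byte[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (byte) (v >>> (8 * i));
        }
        return out;
    }
}
```

The point of the sketch is only the selection mechanism: one documented boolean property, resolved once at connection setup, with the secure algorithm as the default.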
Hi @RDearnaley @dancernetworks, I understand your concerns and the need for a fast hashing algorithm. However, we cannot officially offer CityHash128 as an option without also including a string comparison, which would degrade the performance this change is intended to enhance. We thought of option 2 because, if customers were allowed to implement any hashing they wanted, they would not be bound by the restrictions the driver team has to follow. If you require the hash to be as performant as possible, then officially integrating CityHash128 will not yield those results. |
@rene-ye @RDearnaley Ok, I have a counter-proposal to address the concerns of the Microsoft security team. Forget SHA-256 and CityHash. The solution we are looking for is SipHash. Instead of hashing purely on the SQL, a “secret” 128-bit key is used as well. Since the cache is ephemeral (not surviving restarts), this key can be generated in memory at startup. It is impossible to generate intentional hash collisions with SipHash because of this key — the security characteristics are at least as secure as SHA. Yet, the hashing performance characteristics are similar to CityHash, also without requiring original text comparison. All that is left is an implementation of SipHash for Java... |
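The keying idea can be illustrated like this (NOT a SipHash implementation -- just a toy FNV-style mixer showing how a per-process random key makes collisions unpredictable to an attacker; a real integration would use actual SipHash rounds):

```java
import java.security.SecureRandom;

public final class KeyedHashSketch {
    // 128-bit key generated once at startup; the cache does not survive
    // restarts, so the key never needs to be persisted or shared.
    private static final long K0, K1;
    static {
        SecureRandom r = new SecureRandom();
        K0 = r.nextLong();
        K1 = r.nextLong();
    }

    static long hash(CharSequence s) {
        long h = K0;                                // key-dependent start state
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x100000001b3L; // FNV-1a style mixing step
            h ^= K1;                                // fold in the second key half
        }
        return h;
    }

    public static void main(String[] args) {
        // Stable within one process, unpredictable across processes.
        System.out.println(hash("SELECT 1") == hash("SELECT 1")); // true
    }
}
```

Without knowing K0/K1, an attacker cannot precompute two SQL strings that collide, which is exactly the property that makes the defensive string comparison unnecessary in Brett's proposal.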
Hi @brettwooldridge @RDearnaley We've been approved to implement CityHash in the driver, given that we also compare strings when needed, as @rene-ye mentioned above. The string compare doesn't seem to cost a big performance deficit; performance is almost the same. For SipHash, too, we'd have to implement string comparison, as it doesn't fall in the "Approved HashKey" list from Microsoft. So unless it's SHA-xxx, SipHash/CityHash/(anything else) wouldn't matter, and if the performance is the same as CityHash, we might as well go with this one. With regards to the performance improvements in metadata caching, we can track that separately from Brett's speedify branch, for which we'd have to cherry-pick changes. We won't be targeting those changes for our upcoming release, but PR #717 has been updated with all requests and we are looking forward to merging it soon. We request you to kindly review and test your scenarios and let us know if it gives you the desired performance improvement. |
Closing issue as PR #717 merged. |
Driver version or jar name
mssql-jdbc version 6.4.0.jre8
SQL Server version
We support MS SQL Server versions back to 2012, and have been testing on version 11.0.5058.0
Client operating system
We support Windows/Linux/Mac/various other flavors of UNIX; we have been testing with the client + JDBC driver on Linux (with the SQL Server on Windows, obviously)
Java/JVM version
1.8
Table schema
(large, complex, and irrelevant to this issue)
Problem description
We recently switched our code from using the old sqljdbc42 4.2.6420.100 MS SQL Server JDBC driver to using mssql-jdbc version 6.4.0.jre8, and encountered a significant (roughly 20-35%) performance hit when doing so, which we have been investigating. Our performance test setup uses a repeating pattern of on the order of 100 distinct read queries of large SQL text length (up to around 16 KB) that were generated by Hibernate, with some queries repeated many times within each overall repeat pattern, each individual query returning only a small amount of data, and with the client and database on two different machines in the same datacenter. Thus prepared statement caching is a big win -- the connection pool we are using no longer provides this, which is why we switched to mssql-jdbc version 6.4.0.jre8.
Using the YourKit Java profiling tool, I tracked down part of the problem to the hashing code for the JDBC driver's prepared statement cache, which is using SHA-1 hashing (which it delegates to the Java security system, which in turn delegates it to our chosen security provider library). For this test setup, 10-15% of the total test time was being spent performing SHA-1 hashing in order to check the prepared statement cache -- quite unacceptable overhead for a JDBC driver! (We experimented with simply turning the caching off, but alas that made things even slower -- despite its current heavy hashing overhead, the prepared statement cache is still 5-10% faster on average than re-preparing every statement.)
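A rough micro-benchmark sketch of the per-statement SHA-1 cost (the 16 KB size is assumed to match our workload's SQL text; absolute numbers will vary by hardware and security provider):

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class Sha1CostSketch {
    public static void main(String[] args) throws Exception {
        byte[] sql = new byte[16 * 1024];           // stand-in for a 16 KB SQL string
        Arrays.fill(sql, (byte) 'x');
        MessageDigest md = MessageDigest.getInstance("SHA-1");

        // Warm up, then time repeated digests of the same buffer.
        for (int i = 0; i < 10_000; i++) {
            md.digest(sql);
        }
        long t0 = System.nanoTime();
        int iterations = 100_000;
        for (int i = 0; i < iterations; i++) {
            md.digest(sql);                         // digest() resets the engine
        }
        long perCall = (System.nanoTime() - t0) / iterations;
        System.out.println("~" + perCall + " ns per 16 KB SHA-1 digest");
    }
}
```

Multiplying the per-digest time by the per-test query volume gives an estimate you can compare against the profiler's 10-15% figure.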
Here is the relevant section of the stack trace (in caller-first order) when this SHA-1 behavior was happening:
... (Hikari connection pool library) ...
ProxyPreparedStatement.java:52 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery()
SQLServerPreparedStatement.java:401 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(TDSCommand)
SQLServerStatement.java:204 com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(TDSCommand)
SQLServerStatement.java:224 com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(TDSCommand)
SQLServerConnection.java:2713 com.microsoft.sqlserver.jdbc.TDSCommand.execute(TDSWriter, TDSReader)
IOBuffer.java:7344 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute()
SQLServerPreparedStatement.java:479 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement$PrepStmtExecCmd)
SQLServerPreparedStatement.java:536 com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.reuseCachedHandle(boolean, boolean, String)
SQLServerPreparedStatement.java:960 com.microsoft.sqlserver.jdbc.SQLServerConnection$Sha1HashKey.<init>(String, String, String)
SQLServerConnection.java:126 com.microsoft.sqlserver.jdbc.SQLServerConnection$Sha1HashKey.<init>(String)
SQLServerConnection.java:130 java.security.MessageDigest.digest(byte[])
MessageDigest.java:410 java.security.MessageDigest.update(byte[])
... (BouncyCastle security provider library) ...
Since what is being hashed is obviously the SQL statement before parameter values are applied, it should not be under untrusted end-user control (or if it is, it's extremely hard not to leave yourself open to DROP-TABLE-type SQL injection attacks from your end users). So there should be no concern over a malicious end user deliberately forcing cache collisions by generating two arbitrary SQL queries carefully chosen so that they SHA-1 hash to the same value (as opposed to choosing arbitrary parameter values). Thus using a cryptographic-strength hash is wasteful in this context -- any hash with sufficient size and sufficiently good hashing properties to avoid accidental collisions will do, and the selection should be based on performance, on which any crypto-strength hash is going to lose.
I brought this issue to the attention of Brett Wooldridge, who I understand originally contributed this code, see:
https://groups.google.com/forum/#!topic/hikari-cp/OTsKIFd3joY
and he tells me that he submitted it using CityHash128, a fast (high-throughput-speed) non-cryptographic-strength 128-bit hash, see:
#291 (review)
It appears that somehow this got switched (back?) to SHA-1 before 6.4 was released. I strongly recommend that you switch back to using CityHash128 as he contributed [or, if there is some license issue with CityHash128, then to some comparable high-throughput-speed non-cryptographic-strength hash] -- the performance with SHA-1 is unacceptably slow.
[If for some reason you feel that a few customers may have a security profile where the use of SHA-1 might actually be necessary, since their end users can issue arbitrary SQL queries (but are somehow blocked from SQL injection attacks), then I strongly recommend that you make the choice of whether to use fast hashing or cryptographic-strength hashing be configurable e.g. by a JDBC URL parameter, and that you mention in the documentation for that parameter that the use of cryptographic-strength hashing imposes a significant performance overhead, and should be avoided except in security scenarios where it is actually necessary (and that constructing a security scenario where that is the case but that isn't open to SQL injection attacks is challenging).]
Expected behavior and actual behavior
Expected behavior: SQL hashing for the prepared statement cache takes negligible time
Actual behavior: SHA-1 SQL hashing for the prepared statement cache takes the majority of all time used by the JDBC driver, and 10-15% of the entire run-time in our performance test case
Repro code
See problem description above -- in short, turn prepared statement caching on and make use of the prepared statement cache for any realistic Hibernate-like read load that includes repeating fairly long read queries that individually retrieve fairly small amounts of data.