Precision or encoding error storing JS numbers in MySQL DOUBLE #1525
More investigation: 0.4549425384311901 is how Number.toFixed() generally displays 0.454942538431190091063882618982461281120777130126953125, which is encoded in 64-bit IEEE 754 as 0x3FDD1DC74F07C17D. These are adjacent numbers on the IEEE 754 number line, i.e. one bit of difference. Both numbers are represented in 32-bit IEEE 754 as 0x3EE8EE3A. Conclusion: the mysql2 code is probably passing MySQL
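A quick way to check the adjacency claim (a minimal sketch assuming Node.js; doubleBits is a hypothetical helper, not from mysql2) is to compare the two 64-bit encodings directly:

```js
// Return the IEEE 754 bit pattern of a double as a BigInt for easy comparison.
function doubleBits(n) {
  const buf = Buffer.alloc(8);
  buf.writeDoubleBE(n, 0);
  return BigInt('0x' + buf.toString('hex'));
}

const a = doubleBits(0.45494253843119004);
const b = doubleBits(0.4549425384311901);
console.log(b.toString(16));     // expected: 3fdd1dc74f07c17d, per the encoding quoted above
console.log((b - a).toString()); // expected: 1, i.e. the two values are adjacent doubles
```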
thanks for the report @wesgarland. Non-prepared statement results come back as plain text; I suspect the reading code is here: node-mysql2/lib/packets/packet.js, line 648 (commit 943b8b2).
Can you compare results to what is returned with
Good hypothesis: the correct value IS returned from
And the correct value arrives in parseFloat. So there must be a bug in parseFloat...
Okay, I've given parseFloat some thought. It's not clear to me if the current strategy in there can ever be correct, especially if we look at some of the corner cases in IEEE 754 like

I think this is correct. On my machine, it is 4.8x slower than the existing implementation:

```js
const _parseFloat = global.parseFloat;

function parseFloat(len)
{
  const ret = _parseFloat(String.fromCharCode.apply(null, this.buffer.slice(this.offset, this.offset + len)));
  this.offset += len;
  return ret;
}
```

Did I miss anything? Do you have unit tests for this?
Faster - I forgot that buffer has a native toString:

```js
const _parseFloat = global.parseFloat;

function parseFloat(len)
{
  const ret = _parseFloat(this.buffer.slice(this.offset, this.offset + len).toString('ascii'));
  this.offset += len;
  return ret;
}
```

Now 3.75x slower than the incorrect code, which is 200ms over 1E6 iterations on my test machine. About 35% of that seems to be the call to buffer.slice.
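For reference, a rough micro-benchmark along these lines could look like the sketch below (illustrative only; the buffer contents, function name and 1E6 iteration count are assumptions, not the benchmark actually used in this thread):

```js
// Time one million parses of a single stringified double via slice + toString.
const buf = Buffer.from('0.45494253843119004', 'ascii');

function parseViaToString(buffer, offset, len) {
  return parseFloat(buffer.slice(offset, offset + len).toString('ascii'));
}

console.time('1e6 parses');
for (let i = 0; i < 1e6; i++) {
  parseViaToString(buf, 0, buf.length);
}
console.timeEnd('1e6 parses');
```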
PR incoming; just going to test it on a few million random numbers first. Do you expect Infinity and -Infinity to work? They throw,
0, Math.E, Math.PI and NaN pass fine. Not quite sure how to test -0.
thanks for the PR @wesgarland! Yes, the reason for manual parsing is performance; the slow part used to be the buffer -> string conversion (native parseFloat was and is faster than a bespoke implementation). I'd really like to see before/after performance numbers - can you share your benchmark script? Ideally we want to have a large number of rows and some number of float fields in the schema, then select that multiple times. Maybe even better to remove the server and network from the performance timing: use a fake server and a pre-computed result set buffer, that way it's mostly parsing that is benchmarked.
@sidorares Here is a gist showing my benchmark code: https://gist.githubusercontent.com/wesgarland/06cb328984e38a1450618194b3d03983/raw/ab66131039ab0c2d18eaf5914cbda96c488a95d5/parse-float.js

I actually tried several approaches before settling on the buffer.toString() code. My first reimplementation started at around 1,400ms. A couple of versions got to around 31,000ms (Buffer.forEach etc). This version runs in about 275ms on that machine (work linux box, node 12), with the old code running around 78ms. I am home now, testing on macOS Catalina. Times are in ms, for 1,000,000 calls:

(What the heck is going on with NodeJS getting slower?! I am using versions from nvm fwiw.)

The key issue with the old code is that it sacrifices precision by doing its intermediate math in float64. The IEEE 754 number line is not continuous, the "finite numbers" in the ES spec are absolutely not the same as our real numbers, and they are represented internally in binary, not base 10. So a "place shift" by /10 or *10 is lossy: it lands on the nearest IEEE 754 number, not the exact value as though we had bit-shifted in binary.

A faster way to do this would be to use the strtod() C library call; it would be able to look directly at the buffer's backing store and perform the conversion without copying, since strtod takes both a start and end pointer. This is the Achilles heel here: in order to use the JS engine's native parseFloat(), we have to copy the Buffer in order to pin the length. Of course, that has the ugly property of throwing us into native-code land with the attendant node-gyp and cross-platform pain.

So, I'm clocking 1,000,000 parse operations in 300-400ms; this patch would add about 1ms of program time for every 3000 doubles parsed as part of a prepared statement. It's slower than the old code, but still very fast. The old code has errors in about 10-15% of numbers generated by Math.random(). It's a big problem from our POV.
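To make the place-shift argument concrete, here is a small sketch (naiveParse below is an illustrative stand-in, not the actual mysql2 parser) showing how accumulating digits with base-10 float64 arithmetic can land one bit away from the correctly rounded result:

```js
// Build a float by summing digit * (0.1, 0.01, ...); every step can round.
function naiveParse(str) {
  const [intPart, fracPart = ''] = str.split('.');
  let result = Number(intPart);
  let scale = 0.1;
  for (const ch of fracPart) {
    result += (ch.charCodeAt(0) - 48) * scale; // each addition rounds to the nearest double
    scale /= 10;                               // the scale factors are themselves inexact
  }
  return result;
}

const s = '0.45494253843119004';
console.log(parseFloat(s)); // 0.45494253843119004 (correctly rounded)
console.log(naiveParse(s)); // may differ in the last bit, e.g. 0.4549425384311901
```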
so current also - is this again a real server or fake js? Is the server running on the same computer (and also directly vs docker)?
you mean using some sort of native module for that? Probably not worth it, because 1) the cost of crossing the js -> native boundary is probably higher than the performance gains, 2) the pain of supporting a native module. Maybe worth exploring a wasm version of the strtod code?
but in general, while performance is a feature, IMO secure code > correct code > fast code; if there's no other good solution I'm happy to just use buffer.toString() + global.parseFloat
idk. How are they sent from the server via the text protocol? The literal string "Infinity"?
Yes, the old code is still the fastest. It's hard to beat from a memory access POV; I suspect the JIT is able to keep the intermediate value in a single register, possibly not loading any memory from even L2 cache during that loop. The input string is likely in L1. The machine I ran those benchmarks on is a 2019 iMac; 3.6 GHz Quad-Core Intel Core i3 with 2400 MHz DDR4 RAM.

It's plausible that a WASM solution would be faster, but my gut tells me it won't be. I'm pretty sure that as soon as we have to get memory out of the Node Buffer into something non-native code can understand, we drop off the fast(est) path... I also explored text_encoder+DataView options; these were 50% slower than my best, suggesting that getting the memory out of a Buffer and into something we can deal with on a generic JS basis costs about 150ms in this benchmark.
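For what it's worth, a decode-based variant in the spirit of those experiments might look like the sketch below (an assumption on my part using TextDecoder and subarray, not the author's actual code); subarray() avoids copying the bytes, but decode() still has to materialise a JS string before parseFloat can run:

```js
const decoder = new TextDecoder(); // UTF-8 default, which covers ASCII digit strings

function parseFloatViaDecoder(buffer, offset, len) {
  // Decode only the relevant byte range, then hand the string to native parseFloat.
  return parseFloat(decoder.decode(buffer.subarray(offset, offset + len)));
}

console.log(parseFloatViaDecoder(Buffer.from('0.45494253843119004'), 0, 19)); // 0.45494253843119004
```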
To try and understand the perf bounds of this problem, I wrote a C program to measure strtod() under similar circumstances. 275ms, built with clang and -O2: https://gist.github.com/wesgarland/682ed83db6a3f60b6fcb543a1878ae07
I guess I shouldn't mention that atof() in C is 68 times faster. Still convinced that this is probably the best solution available at this time. |
my benchmark of manual parse buf -> number vs almost 10x difference (
It's faster, but it doesn't fix the bug --

```js
const numBuf = Buffer.from("0.45494253843119004");
console.log(bufToFloat(numBuf, 0, numBuf.length));
```

yields the incorrect value. Thanks for showing me
yeah, it's a copy-paste from the driver's code, which you already showed is not always correct
tbh
Both JS and MySQL support 64-bit IEEE 754 numbers. You don't need arbitrary precision, you need exactly that precision. This is a very well-defined set of numbers. The problem with the old parsing code is that it assumes that for every N on the finite number line there exist numbers N*10 and N/10. This is simply not how floating-point math works, and accurately parsing decimal is very tricky, with many edge cases, because of this.

I was reading the MySQL protocol documentation last night, and it directly supports 64-bit IEEE 754 numbers as ProtocolBinary::MYSQL_TYPE_DOUBLE. Why are we even converting to decimal strings and then parsing them back into doubles? Is there any easy way to get it to just use this 8-byte representation on the wire instead? That's what's stored on disk...
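If the wire format were used directly, consuming it could be as simple as the sketch below (assumptions: the 8 bytes arrive little-endian, as the MySQL binary protocol encodes MYSQL_TYPE_DOUBLE, and the wire buffer here is simulated; this is not the text-protocol path mysql2 takes for plain query()):

```js
// Recover a double from its 8-byte wire representation: exact bits, no decimal round-trip.
function readWireDouble(buffer, offset) {
  return buffer.readDoubleLE(offset);
}

// Simulate an 8-byte double on the wire.
const wire = Buffer.alloc(8);
wire.writeDoubleLE(0.45494253843119004, 0);
console.log(readWireDouble(wire, 0)); // 0.45494253843119004, bit-for-bit
```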
In my case, I use a large int as the id, of the kind introduced by Twitter. For example, the id value in MySQL is
Here is the query before ↓

Then we use

It works for me since we're using it as an id without doing any calculations on it. For your case, it's still a good solution to use the result string with
My personal recommendation: if you don't do math with the numbers (ids etc.), always store them as strings. There are many other potential, difficult-to-debug rounding problems; for example the built-in JSON.stringify / JSON.parse would corrupt data if

Still, what @wesgarland described here is a legitimate bug and I'd really like to fix it, ideally without a big performance hit.
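As an illustration of that JSON round-trip hazard (using the id value that appears later in this thread), an integer id that does not fit exactly in a double is silently rounded:

```js
// JSON.parse produces a Number, so the 64-bit id is rounded to the nearest double.
const payload = '{"id": 375418676418970374}';
console.log(JSON.parse(payload).id);              // 375418676418970400
console.log(JSON.stringify(JSON.parse(payload))); // {"id":375418676418970400}
```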
@sidorares Re. "without big performance hit": I commented earlier that "this patch would add about 1ms of program time for every 3000 doubles parsed as part of a prepared statement." How fast does it need to be for correct data transmission to be acceptable? As a programmer, I would certainly accept a 300 nanosecond delay in exchange for receiving the actual floating point number that was stored in the database every time I asked for it.

@Calvin-Huang your situation is completely different; you are hitting the limits of 64-bit floating point numbers. You should be using BigNumber if you need this. Note -- BigNumber is far, far, far slower than 64-bit. Your problem/example has nothing to do with JavaScript and can be demonstrated with pure mysql:

```
mysql> create table testFloat (num double);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into testFloat values (375418676418970374);
Query OK, 1 row affected (0.01 sec)

mysql> select cast(num as decimal(65)) from testFloat;
+--------------------------+
| cast(num as decimal(65)) |
+--------------------------+
|       375418676418970400 |
+--------------------------+
1 row in set (0.00 sec)
```
@sidorares re

So, when you ask JS to parse the string 375418676418970374 into a number, it selects the nearest* number and stores that as a number. That nearest* number is 375418676418970400. The same thing happens in every other programming language, and in mysql itself, when using floating point numbers.

However, when you ask it to store a number which DOES exist on the IEEE-754 number line -- for example, 375418676418970400 -- that exact number is stored, and when you ask mysql to retrieve it, it retrieves exactly that number. The bug that is described in this issue is that when retrieving exact IEEE-754 numbers which do exist and are stored correctly, the mysql2
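A one-liner shows the same parsing behaviour in JS itself:

```js
console.log(Number('375418676418970374'));             // 375418676418970400, the nearest double
console.log(Number.isSafeInteger(375418676418970400)); // false: above 2^53, though exactly representable
```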
Hi @sidorares, apologies for awakening an old thread, but we think we have stumbled across the same issue described above with a less 'extreme' number. In our case, the number
When logging from the

I don't have an alternative solution to the one proposed by @wesgarland, but I'm hoping that further discussion and/or another look at the proposed MR (#1527) might be possible.
@emma-cw do you know why the result is 601730000000000000000 with factor -1000000000000000000, and not 60173 / -100? What is the input string in your case?
@wesgarland I don't think I answered this question yet:
the response to "query" is returned back as text. Even if your column has the MYSQL_TYPE_DOUBLE type, you get back bytes corresponding to the stringified ASCII view of the number, so we have to parse it back. And the reason we do it manually in mysql2 instead of with the built-in parseFloat is that the buffer -> js string conversion is the costliest part here, so I'm trying to avoid it. So basically we need a fast (and correct) implementation of "node.js buffer containing a number as a string" -> JS number. A simple and correct implementation would be parseFloat(buffer.toString('ascii'))
re speed - I doubt you are going to do any better than the native code in v8, which is what the patch I submitted two years ago does (parseFloat). Google's V8 engineers have already created hand-optimized C++ code to do this as quickly as possible while remaining standards-compliant (see: InternalStringToDouble in src/numbers/conversions.cc). This code is good enough that the Firefox developers have imported it wholesale from Chromium (mfbt/double-conversion/double-conversion/strtod.cc).

There has been research in the IEEE-754 community into faster float parsing, e.g. https://arxiv.org/abs/2101.11408 - but implementing these algorithms in JS isn't going to help. If this is faster than Google's current implementation (and surely the Google guys have read these papers?) then the solution will be to implement this inside v8 and submit an upstream patch.

The window to getting something *faster* is pretty small. On my medium-speed machine running Node 10, I timed the patch with parseFloat() as taking approximately 200 ns per float, compared to 53 ns with the incorrect code. This means that a query with 6800 floats in the result will take 1ms longer to parse.

The *fastest* way would be to somehow change the query code to use binary mode, because then you could send floats (and other data types) over the wire in native representation and then recover them from the network buffer via Float64Array. I don't know if that's even possible, but if it is, I'm sure it's a very heavy lift.

Wes
@wesgarland I agree with you. I'll try to do some additional benchmarks and see what % of the time the perf difference is in a real-life query, and probably just switch to other options to get correct float parsing while avoiding the creation of JS strings:

Direct links to the libraries you mentioned:
This bug is unique to the mysql2 npm package - it does not happen with the mysql npm package.
I have a db schema which declares a column as a DOUBLE; mysql docs say this is an eight byte floating point number, presumably a 64-bit IEEE 754 quantity. JS uses, by definition, IEEE 754-2019 numbers.
When I store certain values in my database and select them back out, I sometimes get different numbers. One such example is 0.45494253843119004, which gets stored correctly but selected back out as 0.4549425384311901.

My platform is mysql2@2.3.3, node v12.22.9 on Linux wes-linux-kds 5.4.0-94-generic #106~18.04.1-Ubuntu SMP Fri Jan 7 07:23:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux.
We have an ORM in the way, so I haven't written a minimal test case yet. If the team wants to follow up on this bug, I can produce one. If you want help fixing the issue, point me in the right direction...I am a C and JS programmer with some experience with the mysql and v8 APIs.
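For reference, a minimal test case along those lines might look something like the sketch below (hypothetical: the table name, connection options and use of the mysql2 promise API are placeholders, not the internal test case mentioned here):

```js
const mysql = require('mysql2/promise');

async function main() {
  const conn = await mysql.createConnection({ host: 'localhost', user: 'root', database: 'test' });
  await conn.query('CREATE TABLE IF NOT EXISTS float_repro (num DOUBLE)');
  await conn.query('INSERT INTO float_repro (num) VALUES (?)', [0.45494253843119004]);
  const [rows] = await conn.query('SELECT num FROM float_repro');
  console.log(rows[0].num); // reportedly comes back as 0.4549425384311901
  await conn.end();
}

main().catch(console.error);
```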
This is a complete record of statements executed in our internal test case:
The callback invoked from
mysql2/lib/commands/query.js:86:15
received the following arguments -- the fields object has the wrong value in it:Schema --