Data damage when inserting or fetching #18
Which DBD::ODBC did you use? What platform are you running on, and how did you build DBD::ODBC? If you want to use Unicode you need to build it with Unicode support.
Hi @mjegh! We are using RHEL7. We have also tried a Unicode build of DBD::ODBC, and the result is interesting: some tests from the above script started to work with the Unicode build, but some tests that pass now (with the normal build) started failing with the Unicode build of DBD::ODBC. And as you can see, this problem is independent of the ODBC driver used: it is reproducible with MySQL ODBC, MariaDB ODBC, SQLite ODBC, Pg ODBC and Vertica ODBC. If you want output of that test script from a Unicode build of DBD::ODBC on the RHEL7 box as well, I can provide it later next week. But you can also try to reproduce this problem yourself with any ODBC driver, to see that this really is a problem in DBD::ODBC and not in RHEL7 or our build.
Just to note, I have already tried to debug this problem and found out that it really is in the DBD::ODBC source code, and moreover, in both the normal and Unicode builds. DBD::ODBC is affected by Perl's Unicode Bug, as described in the explanation linked below.
It isn't as simple as that on Unix. The ODBC drivers you use need to support the ODBC wide APIs. I will try your code next week with an ODBC driver I know supports the wide APIs and get back to you.
Anyway, by coincidence @Grinnz posted a nice explanation of this problem to the #p5p IRC channel yesterday: https://gist.github.com/Grinnz/704b03f4d2f0dc1b84d935efc0e3c077
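For readers following along, the gist's point can be shown in a few lines (my own minimal sketch, not taken from the gist): two equal strings can differ in internal representation, and only code that wrongly peeks at the internal buffer can tell them apart.

```perl
use strict;
use warnings;

my $x = "\xE1";       # the character á, stored downgraded (Latin-1 internally)
my $y = "\xE1";
utf8::upgrade($y);    # same character, now stored upgraded (UTF-8 internally)

print $x eq $y ? "equal\n" : "different\n";  # equal: the representation is invisible
print length($x), " ", length($y), "\n";     # 1 1: both are one character

# A driver that reads the internal buffer directly sees 1 byte for $x but
# 2 bytes for $y -- exactly the class of bug discussed in this issue.
```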
@mjegh Thanks! I checked that MySQL ODBC, MariaDB ODBC, Pg ODBC, Vertica ODBC and also SQLite ODBC support the wide UTF-16 API on Linux (the functions with the -W suffix). So this should not be a problem.
There is also a possible issue with picking a varchar column type and not setting the character-set or collation - I missed that when I first glanced down the code. |
In addition, DBD::ODBC will only use wide characters for column types reported as wide by the SQLDescribeCol call.
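For anyone reproducing this, DBI exposes the column types the driver describes, so you can check whether a column is reported as wide. A sketch, assuming a connected $dbh and a hypothetical table test with a text column s:

```perl
use DBI qw(:sql_types);

my $sth = $dbh->prepare('SELECT s FROM test');
$sth->execute;

# $sth->{TYPE} holds one SQL type code per column as reported by the driver;
# wide columns show up as SQL_WCHAR (-8), SQL_WVARCHAR (-9) or SQL_WLONGVARCHAR (-10).
my $type = $sth->{TYPE}[0];
my $wide = $type == SQL_WCHAR || $type == SQL_WVARCHAR || $type == SQL_WLONGVARCHAR;
printf "column type %d, wide: %s\n", $type, $wide ? "yes" : "no";
```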
Also, which ODBC driver manager are you using? If it is iODBC you have no chance, as that uses wchar_t types.
We are using the unixODBC manager, which is the default on RHEL7.
On Windows (using Postgres) the following code and output suggest it is working for me:
Have you tried the script from the first post, which covers all the different cases (Unicode and also binary) of DBI usage? Does it work on Windows?
Not yet. That will take a bit more work. However, are you saying my code doesn't work in your scenario?
The main problem with your script is that it requires the use of bind_param.
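For context, the two styles at issue look roughly like this (a sketch, assuming a connected $dbh and a hypothetical table t with a binary column b):

```perl
use DBI qw(:sql_types);

my $bytes = "\xC3\xA1";
my $sth   = $dbh->prepare('INSERT INTO t (b) VALUES (?)');

# explicit binding with a type attribute, as the script above requires:
$sth->bind_param(1, $bytes, { TYPE => SQL_VARBINARY });
$sth->execute;

# versus simply letting DBI and the driver pick the type:
$sth->execute($bytes);
```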
My script works fine without bind_param on Windows with Postgres, i.e.,
outputs
I omitted the output of the returned strings as I was using the Windows console, which is flawed with UTF-8.
The original script with Postgres ODBC on Windows outputs the following. I've yet to understand the upgraded failures:
You are using a Unicode build of DBD::ODBC, as can be seen in the output (mode=UNICODE). Some tests that failed in our case (as described in the first post) pass for you, but some tests that pass in our case fail for you. In the first post we did not use a Unicode build, but I remember that something similar happened when I tried that script with a Unicode build of DBD::ODBC (on RHEL7). If you need it, I can provide output from that script with a Unicode build of DBD::ODBC on RHEL7 later this week. Also, could you install a recent version of DBD::Pg on Windows and run that script, to compare the output of DBD::Pg and DBD::ODBC (with the ODBC PostgreSQL driver)? I think this could be a good candidate for comparison.
It looks to me like the cases where it fails are binary columns with upgraded values, where the bytes passed to DBD::ODBC are twice the length they should be. For instance, for "\N{U+C3}\N{U+A1}", $bins is set to encode('UTF-8', $val) and is 4 characters and 4 bytes when downgraded, but 8 bytes when upgraded. If you pass 8 bytes on the insert with type SQL_VARBINARY, that is what is going to end up in the database. I cannot for the life of me see how this could work with other drivers.
outputs
So how can you expect the returned column to compare in:
+rebind_param 1 "�¡" (size SvCUR=8/SvLEN=16/max=0) svtype:7, value type:1, sql type:0
It does not matter whether a Perl scalar is upgraded or downgraded: it always stores the same value, as can be seen in the output shown below. Really, look at @Grinnz's explanation of the scalars-versus-characters problem. Other DBI drivers (DBD::Pg, DBD::SQLite, DBD::MariaDB) implement this correctly.
I understand the difference between bytes and characters. The point is:
So if both cases (prior and after upgrade)
There must be some misunderstanding here and you are going to have to explain it better to me. The Devel::Peek output above doesn't show 4 bytes in both cases.
So you seem to be suggesting $bins is the same before and after upgrade, but it doesn't look that way to me.
Yes, it is the same. This is a common mistake in interpreting output from Devel::Peek, which is why I suggested looking at different output. Let's walk through it:

```perl
use Devel::Peek;
use Encode;

my $val = "\N{U+C3}\N{U+A1}";
Dump($val);
```

The important part of the Dump here is that the UTF8 flag is set and CUR is 4. It means that $val holds two characters (code points 195 and 161), stored internally in UTF-8 as four bytes.

```perl
my $bins = encode('UTF-8', $val); # this is 4 encoded bytes now
Dump($bins);
```

The important part of the Dump here is that the UTF8 flag is not set and CUR is 4. It means that $bins holds four characters (195, 131, 194, 161), stored internally as one byte each.

```perl
utf8::upgrade($bins);
Dump($bins);
```

And here the important part of the Dump is that the UTF8 flag is now set and CUR is 8. Which means that $bins still holds the same four characters (195, 131, 194, 161), now stored internally in UTF-8, which takes eight bytes; the value is unchanged, only the internal representation changed.

You can also check that the Perl scalar before and after the upgrade is the same:

```perl
use Encode;

my $val1 = "\N{U+C3}\N{U+A1}";
my $bins1 = encode('UTF-8', $val1);
utf8::upgrade($bins1);
my $val2 = "\N{U+C3}\N{U+A1}";
my $bins2 = encode('UTF-8', $val2);
if ($bins1 eq $bins2) { print "bins1 and bins2 are same\n" } else { print "bins1 and bins2 are different\n" }
```

It prints "bins1 and bins2 are same".

The problem here is the interpretation of the Devel::Peek output; there is a common mistake in how to read it correctly. If a scalar has the UTF8 flag, you need to look at the UTF8 part of the Dump output; if it does not have the UTF8 flag set, you must not look at the UTF8 part. LEN and CUR contain the size of the actual internal representation (UTF-8 or not-UTF-8), not the number of characters stored in the scalar (another common mistake). There are similar mistakes when reading floating point numbers from a scalar: if the scalar does not have the NOK flag set, you must not look at the NV part of the output, and similarly, if it does not have the IOK flag set, you must not look at the IV part. Basically, Devel::Peek::Dump shows the internal memory of the scalar, including (probably) unused or uninitialized data, and you need to look at the scalar's flags to determine which slots are valid and filled. To avoid any confusion from the above problem, I'm using the following helper (Perl's split function correctly handles all the internals above):

```perl
sub ords { '(' . (join ', ', map ord, split //, $_[0] // '') . ')' }
```

So let's look at that:

```perl
sub ords { '(' . (join ', ', map ord, split //, $_[0] // '') . ')' }

use Encode;

my $val1 = "\N{U+C3}\N{U+A1}";
my $bins1 = encode('UTF-8', $val1);
utf8::upgrade($bins1);
my $val2 = "\N{U+C3}\N{U+A1}";
my $bins2 = encode('UTF-8', $val2);

print "val1: " . ords($val1) . "\n";
print "val2: " . ords($val2) . "\n";
print "bins1: " . ords($bins1) . "\n";
print "bins2: " . ords($bins2) . "\n";
```

It prints val1: (195, 161), val2: (195, 161), bins1: (195, 131, 194, 161) and bins2: (195, 131, 194, 161).

Which is correct: val1 and val2 are not modified and contain the Unicode code points (195, 161). Then encode converts the Unicode code points to their UTF-8 representation, the 4 numbers (195, 131, 194, 161), in bins1 and bins2. And additionally bins1 is upgraded, and it still contains the same content. I hope it is clearer now. Reading Devel::Peek::Dump output is sometimes really a pain (you need to look at FLAGS first and then at the slots, and sometimes it prints output in octal and sometimes in hexadecimal).
Thanks @pali for the explanation. To move on, I need to work out how, in XS, to know this and how to convert it to the bytes (which are required by ODBC) in the SQLBindParam call.
In XS: the SvPV() macro returns a raw read-only char* buffer and the STRLEN length of that buffer. A following SvUTF8() call (it must be made after SvPV()) tells you the internal representation of the raw char* buffer just returned: either UTF-8 or Latin-1. Based on this SvUTF8() result you need to convert the returned raw char* buffer to whatever representation you need (UTF-16, UTF-32, UTF-8, Latin-1, whatever...). There is a function in XS for that conversion as well. If you want the char* buffer always in UTF-8 you can use the SvPVutf8() macro (so you do not have to make the following SvUTF8() call and do the conversion yourself); it again returns a read-only buffer. Similarly, there is an SvPVbyte() macro which returns the char* buffer always in Latin-1. I hope this helps you construct char* buffers from an SV* Perl scalar correctly.
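A Perl-level way to see the buffers those macros hand back (my own sketch; utf8::encode and utf8::downgrade are rough pure-Perl counterparts):

```perl
my $s = "\N{U+E1}";          # one character: á

my $as_utf8 = $s;
utf8::encode($as_utf8);      # like reading the buffer via SvPVutf8():
                             # $as_utf8 is now the two octets 0xC3 0xA1

my $as_bytes = $s;
utf8::downgrade($as_bytes);  # forces the Latin-1 internal form, so the buffer
                             # SvPVbyte() would return is the single octet 0xE1

print length($as_utf8), " ", length($as_bytes), "\n";   # prints: 2 1
```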
BTW an easy built-in way to achieve what the "ords" function does is:
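The suggestion itself is missing from this page; one common built-in idiom (my assumption, not necessarily what was posted) is the %vd format:

```perl
my $s = "\N{U+C3}\N{U+A1}";
printf "%vd\n", $s;   # prints 195.161 -- each code point, joined by dots
```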
I've singularly failed to change the code to make this work for binaries. Can you point me to where DBD::SQLite or DBD::Pg does this properly?
OK, with a small change very specific to binding a binary type (it probably doesn't work for inout parameters yet):
It is far more complex to write a fix for binary columns in this case when it is an in/out parameter, but if you'd like to try the change for in parameters, let me know and I'll do a dev release. Thanks for your help.
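For reference, in/out binding in DBI passes a reference so the driver can write a result back, which is part of why it is harder to fix; a generic sketch (assuming a procedure the driver supports):

```perl
my $sth   = $dbh->prepare('{call my_proc(?)}');   # hypothetical procedure
my $value = "\xC3\xA1";

# bind_param_inout takes a reference so the driver can overwrite $value;
# 100 is the maximum length the driver should allow for the returned value
$sth->bind_param_inout(1, \$value, 100);
$sth->execute;

print "returned: $value\n";
```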
I have extended the test script from the first post to include more combinations for testing; download it here: db-utf8.pl.gz. Output from that script on our RHEL7 box is here: output.log. I also compiled a Unicode build of DBD::ODBC and ran it on the same RHEL7 box, but for some unknown reason DBD::ODBC froze while executing statements via the MySQL ODBC driver. So I excluded MySQL ODBC from the test script; the output is here: output_unicode.log. If you are interested, here is the stack trace from when DBD::ODBC froze during execution with MySQL ODBC: gdb.log. It happened only with the Unicode build of DBD::ODBC; with the normal build there was no freeze. Could you run your tests on Windows with the above extended test script and your changes? Because on our RHEL7 boxes that script discovered further issues with upgraded/downgraded scalars in DBD::ODBC (which do not affect DBD::SQLite, DBD::MariaDB or DBD::Pg).
Thanks @pali. I'll try and take a look today.
Postgres ODBC driver on Windows with the new test code - all tests succeeded.
Great! Would you publish the changes to DBD::ODBC so we can try them on RHEL7 too?
Sure. I'll commit the changes to the master branch of the repository now. It will be 1.62_1, experimental for now, as although it is a minor change code-wise, it is a major change in behaviour.
@mjegh Thank you for your changes! I compiled the version from git (with Unicode enabled) and ran the extended script from the #18 (comment) post. But on the RHEL7 box with the PostgreSQL ODBC driver it failed the following tests:
The whole output, with all tested drivers and all passed/failed tests, is attached here: full.log
Thanks @pali. That must be a difference in the Unix ODBC driver manager or the Postgres ODBC driver for Unix. I'll try and have a look at it.
If you look at the full output log, it is interesting that there are tests which pass with Pg ODBC and Vertica ODBC but fail with MariaDB ODBC and SQLite ODBC.
Characters outside 7-bit ASCII are often garbled when inserting into or fetching from a database.
We used the following program to check the behaviour. Interestingly, it shows that DBD::Pg, DBD::SQLite and DBD::MariaDB work correctly, while DBD::ODBC fails with any driver we tried (Postgres, Vertica, MySQL); MySQL and SQLite through ODBC were not included in the output below, but they fail too.
Note the test doesn't do anything fancy: it just stores a string, retrieves it back and checks that it is identical.
Output on our RHEL7 box:
You can see that DBD::MariaDB, DBD::Pg and DBD::SQLite work correctly. It is critical to fix DBD::ODBC to behave correctly, like the other DBD drivers, to prevent data loss.
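To illustrate, a minimal sketch of the round-trip check the script performs (my own reconstruction, assuming a connected $dbh and a hypothetical table test with a text column s):

```perl
use utf8;

my $in = "\x{E1}";   # a character outside 7-bit ASCII: á

# store the string, fetch it back, and compare for identity
$dbh->do('INSERT INTO test (s) VALUES (?)', undef, $in);
my ($out) = $dbh->selectrow_array('SELECT s FROM test');
print $in eq $out ? "ok\n" : "DATA DAMAGED\n";
```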