Support for Wide Strings in SOCI for Enhanced Unicode Handling #1133
base: master
Conversation
Converting from UTF-16 to UTF-8 is no problem when retrieving data, because the column data type is known. I'm thinking of adding another argument to `soci::use()` that lets the developer override the data type used for the underlying ODBC call. Another open issue is the currently missing `N''` enclosure for Unicode string literals with MSSQL in the case of `soci::use()`. A further issue is the stream interface: `std::wstring` is currently not supported there, and as far as I understand, supporting it would require widening the query to UTF-16 before sending it to the DB.
Thanks! This looks good overall, but there are 2 issues:
- The new functionality needs to be documented; notably, it should be clearly stated that `wstring` and `wchar_t` are only supported in the ODBC backend (and only when using SQL Server?).
- The use of/checks for C++17 are confusing, as it's not clear whether it is required for wide char support or whether it's just some kind of optimization (in the latter case I'd drop it, it's not worth the extra code complexity).
…hub.com/ORDIS-Co-Ltd/soci into wstring_support_with_unicode_conversion
This commit updates the Unicode conversion functions to handle UTF-16 on Windows and UTF-32 on other platforms. The changes include:
1. Updating the `utf8_to_wide` and `wide_to_utf8` functions to handle UTF-32 on Unix/Linux platforms.
2. Updating the `copy_from_string` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
3. Updating the `bind_by_pos` function to handle UTF-16 on Windows and convert UTF-32 to UTF-16 on other platforms.
4. Adding a test case for wide strings in the ODBC MSSQL tests.

Please note that I updated the FreeBSD image for Cirrus from 13.2 to 13.3.
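For illustration, here is a minimal sketch (not the PR's actual code; names are simplified and there is no input validation) of the platform split these conversion functions rely on: `wchar_t` is treated as UTF-16 where it is 2 bytes wide (Windows) and as UTF-32 where it is 4 bytes wide (most Unix-like systems).

```cpp
#include <cstddef>
#include <string>

namespace sketch
{

// Append one Unicode code point to a UTF-8 encoded string.
inline void append_utf8(std::string &out, char32_t cp)
{
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Simplified wide-to-UTF-8 conversion: wchar_t holds UTF-16 code units on
// Windows (2 bytes, surrogate pairs possible) and UTF-32 elsewhere (4 bytes).
inline std::string wide_to_utf8(const std::wstring &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = static_cast<char32_t>(in[i]);

        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
            && i + 1 < in.size())
        {
            // Combine a UTF-16 surrogate pair into one code point.
            char32_t low = static_cast<char32_t>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        // With a 4-byte wchar_t the value is already a full UTF-32
        // code point and can be encoded directly.
        append_utf8(out, cp);
    }
    return out;
}

} // namespace sketch
```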
…version' into wstring_support
I'm adding better UTF conversion first.
@vadz Maybe this can be an optional feature, similar to boost.
Thanks, this looks mostly good to me and the limitation (lack of support for combined forms) can be addressed later.
I have some minor comments below, and I admit I didn't read all the code in detail, but it looks superficially fine (if a bit verbose) and the tests look good, thank you.
tests/odbc/test-odbc-mssql.cpp (outdated diff), on the line:
TEST_CASE("UTF-8 validation tests", "[unicode]")
All these tests are neither MSSQL- nor ODBC-specific; they should ideally be in their own file.
I have moved them to the "empty" test module, as it contains other non-backend-specific tests.
I don't think this is the best solution, but I would need more information on how you want a separate Unicode test file to be treated in the context of the CMake files. The backend tests use the CMake macro `soci_backend_test`.
I can just "nail it in", but I assume a more elegant solution is preferred.
Oh, I forgot to ask: why do you think this should be an option? AFAICS this doesn't affect the existing API, so I see no reason not to enable this unconditionally for people who need it. Am I missing something?
Co-authored-by: VZ <vz-github@zeitlins.org>
I was referring to the need to link against ICU or iconv to have combining character support right away. But it's not necessary if we can take care of the normalization later.
Add UTF-8 BOM handling to unicode conversion functions.
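As a hedged illustration of what BOM handling can look like (not necessarily the PR's exact code): a leading UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF and can simply be stripped before conversion.

```cpp
#include <string>

// Remove a leading UTF-8 BOM (0xEF 0xBB 0xBF), if present, so that the
// subsequent conversion does not treat it as character data.
inline void strip_utf8_bom(std::string &s)
{
    if (s.size() >= 3
        && static_cast<unsigned char>(s[0]) == 0xEF
        && static_cast<unsigned char>(s[1]) == 0xBB
        && static_cast<unsigned char>(s[2]) == 0xBF)
    {
        s.erase(0, 3);
    }
}
```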
This pull request adds comprehensive support for wide strings (`wchar_t`, `std::wstring`) to the SOCI database library, significantly improving its support for Unicode string types such as SQL Server's `NVARCHAR` and `NTEXT`. This enhancement is crucial for applications that require robust handling of international and multi-language data.

Key Changes:
- Introduced `exchange_type_traits` and `exchange_traits` specializations.
- Updated the ODBC backend to support `wchar_t` and `std::wstring`.
- Enhanced buffer management.
- Improved Unicode support.
- Extended test coverage.

Notes:
This update significantly bolsters SOCI's capabilities in handling Unicode data, making it a more versatile and powerful tool for database interactions in multi-language applications.
Example usage
Here are a few examples showing how the new wide string features can be used with the ODBC backend.
Example 1: Handling `std::wstring` in SQL Queries
Inserting and Selecting `std::wstring` Data
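The original example code is not reproduced here; the following is a hedged sketch of how it could look with this PR applied. The DSN, table name, and sample text are placeholders.

```cpp
#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    // Placeholder connection string; adjust for your environment.
    soci::session sql(soci::odbc, "DSN=mssql_test");

    sql << "CREATE TABLE wide_text (content NVARCHAR(100))";

    // Insert a std::wstring through a bound parameter.
    std::wstring in = L"Wide text: äöü 日本語";
    sql << "INSERT INTO wide_text (content) VALUES (:content)", soci::use(in);

    // Read it back into a std::wstring.
    std::wstring out;
    sql << "SELECT content FROM wide_text", soci::into(out);

    std::wcout << out << std::endl;
}
```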
Example 2: Working with `wchar_t` Vectors
Inserting and Selecting Wide Characters
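Again a hedged sketch rather than the original code: it assumes the PR's `std::vector<wchar_t>` support follows SOCI's usual bulk insert/fetch semantics, and uses a placeholder `NCHAR(1)` column.

```cpp
#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <vector>

int main()
{
    soci::session sql(soci::odbc, "DSN=mssql_test");  // placeholder DSN

    sql << "CREATE TABLE wide_chars (c NCHAR(1))";

    // Bulk-insert one row per wide character.
    std::vector<wchar_t> in = { L'ä', L'ö', L'ü', L'€' };
    sql << "INSERT INTO wide_chars (c) VALUES (:c)", soci::use(in);

    // Bulk-fetch them back; the vector's size limits how many rows are read.
    std::vector<wchar_t> out(10);
    sql << "SELECT c FROM wide_chars", soci::into(out);

    for (wchar_t wc : out)
        std::wcout << wc << L'\n';
}
```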
Example 3: Using `std::wstring` with the `sql` Stream Operator
Inserting and Selecting `std::wstring` Data with Stream Operator
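The sketch below reconstructs this example from the description that follows; it is not the original code. The value is bound with `soci::use()` here; as noted below, when a Unicode literal is embedded directly in the SQL text, SQL Server requires the `N'...'` prefix.

```cpp
#include <soci/soci.h>
#include <soci/odbc/soci-odbc.h>
#include <iostream>
#include <string>

int main()
{
    // A soci::session object is created to connect to the database.
    soci::session sql(soci::odbc, "DSN=mssql_test");  // placeholder DSN

    // The target table has an NVARCHAR column.
    sql << "CREATE TABLE wide_stream (name NVARCHAR(50))";

    // A std::wstring is defined for insertion and written with the sql
    // stream operator. An embedded literal would need the N'...' prefix,
    // e.g. VALUES (N'José'), to be treated as Unicode by SQL Server.
    std::wstring name_in = L"José";
    sql << "INSERT INTO wide_stream (name) VALUES (:name)", soci::use(name_in);

    // The std::wstring is retrieved via the stream operator and soci::into.
    std::wstring name_out;
    sql << "SELECT name FROM wide_stream", soci::into(name_out);

    // The result is printed with std::wcout.
    std::wcout << name_out << std::endl;
}
```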
In this example:
- A `soci::session` object is created to connect to the database.
- A table with an `NVARCHAR` column is created.
- A `std::wstring` is defined for insertion.
- The `sql` stream operator is used to insert the `std::wstring` into the database. Note the use of `N'` to indicate a Unicode string in SQL Server.
- The `std::wstring` is retrieved from the database using the `sql` stream operator and the `soci::into` function.
- The result is printed with `std::wcout`.
These examples demonstrate how to insert and retrieve wide strings and wide characters using SOCI's newly added features for handling wide strings (`wchar_t`, `std::wstring`).

Limitation: The current implementation does not handle combining characters correctly. Combining characters, such as accents or diacritical marks, are treated separately instead of being combined with their base characters. This limitation may result in incorrect conversions for strings containing combining characters. A potential solution would be to incorporate Unicode normalization before the conversion process to ensure that combining characters are properly combined with their base characters.
Unicode defines several normalization forms (e.g., NFC, NFD, NFKC, NFKD), each with its own set of rules and behaviors. Choosing the appropriate normalization form is crucial, as different forms may produce different results.
To have full Unicode support, linking against a library like ICU or iconv is necessary. It can be made optional.
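As a rough illustration of what such optional ICU-based normalization could look like (this is an assumption, not part of the PR): the string is normalized to NFC before being handed to the conversion routines. This requires linking against ICU (e.g. `icuuc`).

```cpp
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

// Normalize a UTF-8 string to NFC so that combining marks are composed
// with their base characters before any further conversion.
std::string normalize_nfc(const std::string &utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 *nfc = icu::Normalizer2::getNFCInstance(status);
    if (U_FAILURE(status))
        return utf8;  // fall back to the original string on error

    icu::UnicodeString input = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString normalized = nfc->normalize(input, status);
    if (U_FAILURE(status))
        return utf8;

    std::string result;
    normalized.toUTF8String(result);
    return result;
}
```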
Disclaimer: This text is partially AI generated.