FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters #444
Open
subrata-ms wants to merge 8 commits into main from
Conversation
📊 Code Coverage Report
Diff Coverage (Diff: main...HEAD, staged and unstaged changes)
Summary
mssql_python/pybind/ddbc_bindings.cpp

Lines 1165-1174
   1165      if (_type != SQL_HANDLE_STMT) {
   1166          // Log error but don't throw - we're likely in cleanup/destructor path
   1167          LOG_ERROR("SAFETY VIOLATION: Attempted to mark non-STMT handle as implicitly freed. "
   1168                    "Handle type=%d. This will cause handle leak. Only STMT handles are "
 ! 1169                    "automatically freed by parent DBC handles.",
 ! 1170                    _type);
   1171          return; // Refuse to mark - let normal free() handle it
   1172      }
   1173      _implicitly_freed = true;
   1174  }

Lines 2899-2907
   2899              effectiveCharEncoding.c_str(), buffer.size(), py::len(decoded), colIndex);
   2900          return decoded;
   2901      } catch (const py::error_already_set& e) {
   2902          LOG_ERROR("FetchLobColumnData: Failed to decode with '%s' for column %d: %s",
 ! 2903                    effectiveCharEncoding.c_str(), colIndex, e.what());
   2904          // Return raw bytes as fallback
   2905          return raw_bytes;
   2906      }
   2907  }

Lines 2989-2997
   2989                  py::len(decoded));
   2990          } catch (const py::error_already_set& e) {
   2991              LOG_ERROR(
   2992                  "SQLGetData: Failed to decode CHAR column %d with '%s': %s",
 ! 2993                  i, decodeEncoding.c_str(), e.what());
   2994              // Return raw bytes as fallback
   2995              row.append(raw_bytes);
   2996          }
   2997      } else {

mssql_python/pybind/ddbc_bindings.h

📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.cursor.py: 84.7%
mssql_python.__init__.py: 84.9%
added 2 commits (February 23, 2026 11:50)
Contributor
Pull request overview
This PR fixes a cross-platform fetch bug for CHAR/VARCHAR values at the exact column-size boundary when non-ASCII characters are involved (notably with CP1252), by adjusting buffer sizing and ensuring consistent, platform-appropriate decoding in the ODBC C++ bindings. It also adds regression coverage to prevent the issue from recurring.
Changes:
- Add platform-aware "effective" decoding selection for SQL_C_CHAR (UTF-8 on Linux/macOS, user-configured encoding on Windows) and propagate it through fetch paths.
- Increase SQL_C_CHAR fetch/bind buffer sizes on Linux/macOS to handle worst-case UTF-8 expansion at column boundaries.
- Add a dedicated regression test suite for CP1252 VARCHAR boundary cases and a small decoding config update in the encoding test suite.
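The platform-aware selection described above can be sketched in a few lines of Python mirroring the C++ GetEffectiveCharDecoding logic the PR describes (the function name and shape here are illustrative, not the actual binding code):

```python
import sys

def effective_char_decoding(user_encoding: str) -> str:
    """Pick the decoding for SQL_C_CHAR column data.

    On Linux/macOS the ODBC driver has already converted server data to
    UTF-8, so the user-configured encoding must be ignored there; on
    Windows the driver returns the native encoding (e.g. cp1252), so the
    user's setting applies.
    """
    if sys.platform == "win32":
        return user_encoding
    return "utf-8"
```

The key design point is that the user's setdecoding configuration is only meaningful where the driver actually hands back native-encoding bytes.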
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/test_017_varchar_cp1252_boundary.py | New regression tests for CP1252 non-ASCII VARCHAR values at exact-length boundaries across multiple fetch paths. |
| tests/test_013_encoding_decoding.py | Adds setdecoding(SQL_CHAR, "utf-8") to align decoding configuration in an existing parameter encoding test. |
| mssql_python/pybind/ddbc_bindings.h | Updates ProcessChar fast/slow paths to decode correctly on Windows and to pass encoding through LOB fetch. |
| mssql_python/pybind/ddbc_bindings.cpp | Introduces GetEffectiveCharDecoding, adjusts buffer sizing and decoding in SQLGetData_wrap, and propagates encoding into batch fetch metadata. |
Work Item / Issue Reference
Summary
This pull request improves the handling of character encoding and buffer sizing for SQL CHAR/VARCHAR data in the ODBC Python bindings, especially for cross-platform compatibility between Linux/macOS and Windows. The changes ensure that character data is decoded correctly and that buffer sizes are sufficient to prevent corruption or truncation when dealing with multi-byte UTF-8 data returned by the ODBC driver on non-Windows systems.

Character Encoding Handling:
Introduces a GetEffectiveCharDecoding function to determine the correct decoding to use for SQL CHAR data: always UTF-8 on Linux/macOS (since the ODBC driver returns UTF-8), and the user-specified encoding on Windows. This function is now used consistently throughout the codebase to select the decoding method.

Buffer Sizing for UTF-8:
Buffer allocation now uses columnSize * 4 + 1 for SQL CHAR/VARCHAR columns on Linux/macOS, accounting for the worst-case UTF-8 expansion (up to 4 bytes per character), preventing data truncation when multi-byte characters are present at the column boundary.

Decoding and Data Fetching:
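The sizing arithmetic can be checked in a few lines (illustrative; char_buffer_bytes is a stand-in name, not the actual binding function):

```python
def char_buffer_bytes(column_size: int) -> int:
    # columnSize * 4 + 1: worst-case 4 bytes per character in UTF-8,
    # plus one byte for the NUL terminator.
    return column_size * 4 + 1

value = "café René!"                    # exactly 10 characters, fills VARCHAR(10)
utf8_len = len(value.encode("utf-8"))   # 12 bytes: each é expands to 2 bytes

old_data_capacity = 10                  # the old columnSize + 1 buffer, minus NUL

assert utf8_len == 12
assert utf8_len > old_data_capacity       # old sizing truncates
assert utf8_len < char_buffer_bytes(10)   # new sizing fits comfortably
```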
Updates all relevant decoding and data-fetching paths (FetchLobColumnData, SQLGetData_wrap, ProcessChar, and the batch fetch functions) to use the effective character encoding and the correct buffer sizes, ensuring consistent and correct decoding regardless of platform.

API Changes:
Updates the FetchBatchData function and its callers to accept and propagate the character encoding parameter, ensuring the encoding context is preserved throughout the data-fetching stack.

Minor Fixes:
These changes collectively improve correctness and reliability when handling string data from SQL databases, especially in multi-platform environments.
CP1252 VARCHAR Boundary Fix — Summary
Problem: VARCHAR columns with CP1252 non-ASCII characters (e.g., é, ñ, ö) returned corrupted data when the string length exactly equaled the column size. Inserting "café René!" into VARCHAR(10) returned "©!".
Root Cause:
Three bugs in ddbc_bindings.cpp:
1. Undersized buffer — SQLGetData / SQLBindCol allocated columnSize + 1 bytes, but on Linux/macOS the ODBC driver converts server data to UTF-8, where CP1252 é (1 byte) becomes 0xC3 0xA9 (2 bytes). A 10-char string with 2 accented characters needs 12 bytes, exceeding the 11-byte buffer → truncation → LOB fallback re-reads consumed data → corruption.
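The truncation failure mode is easy to reproduce: a columnSize + 1 buffer leaves 10 data bytes for a 12-byte UTF-8 payload, and the cut lands mid-sequence (pure-Python sketch, independent of the bindings):

```python
payload = "café René!".encode("utf-8")  # 12 bytes: each é is 0xC3 0xA9
assert len(payload) == 12

# Old sizing: VARCHAR(10) -> 11-byte buffer -> 10 data bytes + NUL terminator.
truncated = payload[:10]  # ends on 0xC3, the lead byte of the second é

try:
    truncated.decode("utf-8")
    intact = True
except UnicodeDecodeError:
    intact = False  # the split multi-byte sequence is unreadable

assert not intact
```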
2. Wrong decode encoding — After fixing the buffer, data arrived intact but was decoded with the user's charEncoding (CP1252) instead of UTF-8. Since ODBC on Linux/macOS already converts to UTF-8, double-interpreting as CP1252 produced mojibake (cafÃ© RenÃ©!).
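The double-interpretation can be demonstrated directly (pure Python, independent of the bindings):

```python
raw = "café René!".encode("utf-8")  # on Linux/macOS the driver has already produced UTF-8

# Decoding those UTF-8 bytes with the user's cp1252 setting re-interprets
# each é (0xC3 0xA9) as two cp1252 characters: Ã and ©.
mojibake = raw.decode("cp1252")

assert mojibake == "cafÃ© RenÃ©!"
assert raw.decode("utf-8") == "café René!"  # the correct decoding round-trips
```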
3. ProcessChar assumed UTF-8 on all platforms — The batch/fetchall hot path used PyUnicode_FromStringAndSize, which assumes UTF-8 input. Correct on Linux (ODBC returns UTF-8), but wrong on Windows (ODBC returns a native encoding like CP1252).
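And the converse on Windows: the driver hands back native cp1252 bytes, which a UTF-8-only fast path like PyUnicode_FromStringAndSize cannot decode (sketched in Python):

```python
win_bytes = "café René!".encode("cp1252")  # native-encoding bytes, as on Windows; é is 0xE9

# A UTF-8-only decode chokes: 0xE9 starts a 3-byte UTF-8 sequence,
# but the following byte (0x20, the space) is not a continuation byte.
try:
    win_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

assert not decoded_ok
assert win_bytes.decode("cp1252") == "café René!"  # platform-aware decoding works
```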