Skip to content

Comments

FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters#444

Open
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252_Conversion
Open

FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters#444
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252_Conversion

Conversation

@subrata-ms
Copy link
Contributor

@subrata-ms subrata-ms commented Feb 23, 2026

Work Item / Issue Reference

AB#42604

GitHub Issue: #435


Summary

This pull request improves the handling of character encoding and buffer sizing for SQL CHAR/VARCHAR data in the ODBC Python bindings, especially for cross-platform compatibility between Linux/macOS and Windows. The changes ensure that character data is decoded correctly and that buffer sizes are sufficient to prevent corruption or truncation when dealing with multi-byte UTF-8 data returned by the ODBC driver on non-Windows systems.

Character Encoding Handling:

  • Introduced the GetEffectiveCharDecoding function to determine the correct decoding to use for SQL CHAR data: always UTF-8 on Linux/macOS (since the ODBC driver returns UTF-8), and the user-specified encoding on Windows. This function is now used consistently throughout the codebase to select the decoding method. [1] [2] [3] [4] [5] [6]

Buffer Sizing for UTF-8:

  • Updated buffer allocation logic to use columnSize * 4 + 1 for SQL CHAR/VARCHAR columns on Linux/macOS, accounting for the worst-case UTF-8 expansion (up to 4 bytes per character), preventing data truncation when multi-byte characters are present at the column boundary. [1] [2] [3]

Decoding and Data Fetching:

  • Modified all data fetching and decoding paths (FetchLobColumnData, SQLGetData_wrap, ProcessChar, and batch fetch functions) to use the effective character encoding and the correct buffer sizes, ensuring consistent and correct decoding regardless of platform. [1] [2] [3] [4]

API Changes:

  • Updated the FetchBatchData function and its callers to accept and propagate the character encoding parameter, ensuring the encoding context is preserved throughout the data-fetching stack. [1] [2] [3]

Minor Fixes:

  • Minor formatting and logging improvements for error messages and function signatures.

These changes collectively improve correctness and reliability when handling string data from SQL databases, especially in multi-platform environments.

CP1252 VARCHAR Boundary Fix — Summary
Problem:: VARCHAR columns with CP1252 non-ASCII characters (e.g., é, ñ, ö) returned corrupted data when the string length exactly equaled the column size. Inserting "café René!" into VARCHAR(10) returned "©!".

Root Cause::
Three bugs in ddbc_bindings.cpp:

Undersized buffer — SQLGetData / SQLBindCol allocated columnSize + 1 bytes, but on Linux/macOS the ODBC driver converts server data to UTF-8 where CP1252 é (1 byte) becomes 0xC3 0xA9 (2 bytes). A 10-char string with 2 accented characters needs 12 bytes, exceeding the 11-byte buffer → truncation → LOB fallback re-reads consumed data → corruption.

Wrong decode encoding — After fixing the buffer, data arrived intact but was decoded with the user's charEncoding (CP1252) instead of UTF-8. Since ODBC on Linux/macOS already converts to UTF-8, double-interpreting as CP1252 produced mojibake (café René!).

ProcessChar assumed UTF-8 on all platforms — The batch/fetchall hot path used PyUnicode_FromStringAndSize which assumes UTF-8 input. Correct on Linux (ODBC returns UTF-8), but wrong on Windows (ODBC returns native encoding like CP1252).

@github-actions github-actions bot added the pr-size: medium Moderate update size label Feb 23, 2026
@github-actions
Copy link

github-actions bot commented Feb 23, 2026

📊 Code Coverage Report

🔥 Diff Coverage

67%


🎯 Overall Coverage

76%


📈 Total Lines Covered: 5542 out of 7218
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (89.2%): Missing lines 1169-1170,2903,2993
  • mssql_python/pybind/ddbc_bindings.h (26.3%): Missing lines 838-846,852-856

Summary

  • Total: 56 lines
  • Missing: 18 lines
  • Coverage: 67%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1165-1174

  1165     if (_type != SQL_HANDLE_STMT) {
  1166         // Log error but don't throw - we're likely in cleanup/destructor path
  1167         LOG_ERROR("SAFETY VIOLATION: Attempted to mark non-STMT handle as implicitly freed. "
  1168                   "Handle type=%d. This will cause handle leak. Only STMT handles are "
! 1169                   "automatically freed by parent DBC handles.",
! 1170                   _type);
  1171         return;  // Refuse to mark - let normal free() handle it
  1172     }
  1173     _implicitly_freed = true;
  1174 }

Lines 2899-2907

  2899             effectiveCharEncoding.c_str(), buffer.size(), py::len(decoded), colIndex);
  2900         return decoded;
  2901     } catch (const py::error_already_set& e) {
  2902         LOG_ERROR("FetchLobColumnData: Failed to decode with '%s' for column %d: %s",
! 2903                   effectiveCharEncoding.c_str(), colIndex, e.what());
  2904         // Return raw bytes as fallback
  2905         return raw_bytes;
  2906     }
  2907 }

Lines 2989-2997

  2989                                         py::len(decoded));
  2990                                 } catch (const py::error_already_set& e) {
  2991                                     LOG_ERROR(
  2992                                         "SQLGetData: Failed to decode CHAR column %d with '%s': %s",
! 2993                                         i, decodeEncoding.c_str(), e.what());
  2994                                     // Return raw bytes as fallback
  2995                                     row.append(raw_bytes);
  2996                                 }
  2997                             } else {

mssql_python/pybind/ddbc_bindings.h

  834         if (!pyStr) {
  835             // Decode failed — fall back to returning raw bytes (consistent with
  836             // FetchLobColumnData and SQLGetData_wrap which also return raw bytes
  837             // on decode failure instead of silently converting to None).
! 838             PyErr_Clear();
! 839             PyObject* pyBytes = PyBytes_FromStringAndSize(dataPtr, numCharsInData);
! 840             if (pyBytes) {
! 841                 PyList_SET_ITEM(row, col - 1, pyBytes);
! 842             } else {
! 843                 PyErr_Clear();
! 844                 Py_INCREF(Py_None);
! 845                 PyList_SET_ITEM(row, col - 1, Py_None);
! 846             }
  847         } else {
  848             PyList_SET_ITEM(row, col - 1, pyStr);
  849         }
  850     } else {

  848             PyList_SET_ITEM(row, col - 1, pyStr);
  849         }
  850     } else {
  851         // Slow path: LOB data requires separate fetch call
! 852         PyList_SET_ITEM(
! 853             row, col - 1,
! 854             FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false, colInfo->charEncoding)
! 855                 .release()
! 856                 .ptr());
  857     }
  858 }
  859 
  860 // Process SQL NCHAR/NVARCHAR (wide/Unicode string) column into Python str


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.cursor.py: 84.7%
mssql_python.__init__.py: 84.9%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a cross-platform fetch bug for CHAR/VARCHAR values at the exact column-size boundary when non-ASCII characters are involved (notably with CP1252), by adjusting buffer sizing and ensuring consistent, platform-appropriate decoding in the ODBC C++ bindings. It also adds regression coverage to prevent the issue from recurring.

Changes:

  • Add platform-aware “effective” decoding selection for SQL_C_CHAR (UTF-8 on Linux/macOS, user-configured encoding on Windows) and propagate it through fetch paths.
  • Increase SQL_C_CHAR fetch/bind buffer sizes on Linux/macOS to handle worst-case UTF-8 expansion at column boundaries.
  • Add a dedicated regression test suite for CP1252 VARCHAR boundary cases and a small decoding config update in the encoding test suite.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
tests/test_017_varchar_cp1252_boundary.py New regression tests for CP1252 non-ASCII VARCHAR values at exact-length boundaries across multiple fetch paths.
tests/test_013_encoding_decoding.py Adds setdecoding(SQL_CHAR, "utf-8") to align decoding configuration in an existing parameter encoding test.
mssql_python/pybind/ddbc_bindings.h Updates ProcessChar fast/slow paths to decode correctly on Windows and to pass encoding through LOB fetch.
mssql_python/pybind/ddbc_bindings.cpp Introduces GetEffectiveCharDecoding, adjusts buffer sizing and decoding in SQLGetData_wrap, and propagates encoding into batch fetch metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels Feb 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant