FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters by subrata-ms · Pull Request #444 · microsoft/mssql-python

subrata-ms · 2026-02-23T11:28:06Z

Work Item / Issue Reference

AB#42604

GitHub Issue: #435

Summary

This pull request improves the handling of character encoding and buffer sizing for SQL CHAR/VARCHAR data in the ODBC Python bindings, especially for cross-platform compatibility between Linux/macOS and Windows. The changes ensure that character data is decoded correctly and that buffer sizes are sufficient to prevent corruption or truncation when dealing with multi-byte UTF-8 data returned by the ODBC driver on non-Windows systems.

Character Encoding Handling:

Introduced the GetEffectiveCharDecoding function to determine the correct decoding to use for SQL CHAR data: always UTF-8 on Linux/macOS (since the ODBC driver returns UTF-8), and the user-specified encoding on Windows. This function is now used consistently throughout the codebase to select the decoding method. [1] [2] [3] [4] [5] [6]

Buffer Sizing for UTF-8:

Updated buffer allocation logic to use columnSize * 4 + 1 for SQL CHAR/VARCHAR columns on Linux/macOS, accounting for the worst-case UTF-8 expansion (up to 4 bytes per character), preventing data truncation when multi-byte characters are present at the column boundary. [1] [2] [3]

Decoding and Data Fetching:

Modified all data fetching and decoding paths (FetchLobColumnData, SQLGetData_wrap, ProcessChar, and batch fetch functions) to use the effective character encoding and the correct buffer sizes, ensuring consistent and correct decoding regardless of platform. [1] [2] [3] [4]

API Changes:

Updated the FetchBatchData function and its callers to accept and propagate the character encoding parameter, ensuring the encoding context is preserved throughout the data-fetching stack. [1] [2] [3]

Minor Fixes:

Minor formatting and logging improvements for error messages and function signatures.

These changes collectively improve correctness and reliability when handling string data from SQL databases, especially in multi-platform environments.

CP1252 VARCHAR Boundary Fix — Summary
Problem:: VARCHAR columns with CP1252 non-ASCII characters (e.g., é, ñ, ö) returned corrupted data when the string length exactly equaled the column size. Inserting "café René!" into VARCHAR(10) returned "©!".

Root Cause::
Three bugs in ddbc_bindings.cpp:

Undersized buffer — SQLGetData / SQLBindCol allocated columnSize + 1 bytes, but on Linux/macOS the ODBC driver converts server data to UTF-8 where CP1252 é (1 byte) becomes 0xC3 0xA9 (2 bytes). A 10-char string with 2 accented characters needs 12 bytes, exceeding the 11-byte buffer → truncation → LOB fallback re-reads consumed data → corruption.

Wrong decode encoding — After fixing the buffer, data arrived intact but was decoded with the user's charEncoding (CP1252) instead of UTF-8. Since ODBC on Linux/macOS already converts to UTF-8, double-interpreting as CP1252 produced mojibake (cafÃ© RenÃ©!).

ProcessChar assumed UTF-8 on all platforms — The batch/fetchall hot path used PyUnicode_FromStringAndSize which assumes UTF-8 input. Correct on Linux (ODBC returns UTF-8), but wrong on Windows (ODBC returns native encoding like CP1252).

github-actions · 2026-02-23T11:49:39Z

📊 Code Coverage Report

🔥 Diff Coverage 67%	🎯 Overall Coverage 76%	📈 Total Lines Covered: `5542` out of `7218` 📁 Project: `mssql-python`

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

mssql_python/pybind/ddbc_bindings.cpp (89.2%): Missing lines 1169-1170,2903,2993
mssql_python/pybind/ddbc_bindings.h (26.3%): Missing lines 838-846,852-856

Summary

Total: 56 lines
Missing: 18 lines
Coverage: 67%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1165-1174

  1165     if (_type != SQL_HANDLE_STMT) {
  1166         // Log error but don't throw - we're likely in cleanup/destructor path
  1167         LOG_ERROR("SAFETY VIOLATION: Attempted to mark non-STMT handle as implicitly freed. "
  1168                   "Handle type=%d. This will cause handle leak. Only STMT handles are "
! 1169                   "automatically freed by parent DBC handles.",
! 1170                   _type);
  1171         return;  // Refuse to mark - let normal free() handle it
  1172     }
  1173     _implicitly_freed = true;
  1174 }

Lines 2899-2907

  2899             effectiveCharEncoding.c_str(), buffer.size(), py::len(decoded), colIndex);
  2900         return decoded;
  2901     } catch (const py::error_already_set& e) {
  2902         LOG_ERROR("FetchLobColumnData: Failed to decode with '%s' for column %d: %s",
! 2903                   effectiveCharEncoding.c_str(), colIndex, e.what());
  2904         // Return raw bytes as fallback
  2905         return raw_bytes;
  2906     }
  2907 }

Lines 2989-2997

  2989                                         py::len(decoded));
  2990                                 } catch (const py::error_already_set& e) {
  2991                                     LOG_ERROR(
  2992                                         "SQLGetData: Failed to decode CHAR column %d with '%s': %s",
! 2993                                         i, decodeEncoding.c_str(), e.what());
  2994                                     // Return raw bytes as fallback
  2995                                     row.append(raw_bytes);
  2996                                 }
  2997                             } else {

mssql_python/pybind/ddbc_bindings.h

  834         if (!pyStr) {
  835             // Decode failed — fall back to returning raw bytes (consistent with
  836             // FetchLobColumnData and SQLGetData_wrap which also return raw bytes
  837             // on decode failure instead of silently converting to None).
! 838             PyErr_Clear();
! 839             PyObject* pyBytes = PyBytes_FromStringAndSize(dataPtr, numCharsInData);
! 840             if (pyBytes) {
! 841                 PyList_SET_ITEM(row, col - 1, pyBytes);
! 842             } else {
! 843                 PyErr_Clear();
! 844                 Py_INCREF(Py_None);
! 845                 PyList_SET_ITEM(row, col - 1, Py_None);
! 846             }
  847         } else {
  848             PyList_SET_ITEM(row, col - 1, pyStr);
  849         }
  850     } else {

  848             PyList_SET_ITEM(row, col - 1, pyStr);
  849         }
  850     } else {
  851         // Slow path: LOB data requires separate fetch call
! 852         PyList_SET_ITEM(
! 853             row, col - 1,
! 854             FetchLobColumnData(hStmt, col, SQL_C_CHAR, false, false, colInfo->charEncoding)
! 855                 .release()
! 856                 .ptr());
  857     }
  858 }
  859 
  860 // Process SQL NCHAR/NVARCHAR (wide/Unicode string) column into Python str

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)

mssql_python.pybind.logger_bridge.hpp: 58.8%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.row.py: 66.2%
mssql_python.pybind.ddbc_bindings.h: 67.8%
mssql_python.pybind.ddbc_bindings.cpp: 69.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.cursor.py: 84.7%
mssql_python.__init__.py: 84.9%

🔗 Quick Links

⚙️ Build Summary	📋 Coverage Details
View Azure DevOps Build	Browse Full Coverage Report

Copilot

Pull request overview

This PR fixes a cross-platform fetch bug for CHAR/VARCHAR values at the exact column-size boundary when non-ASCII characters are involved (notably with CP1252), by adjusting buffer sizing and ensuring consistent, platform-appropriate decoding in the ODBC C++ bindings. It also adds regression coverage to prevent the issue from recurring.

Changes:

Add platform-aware “effective” decoding selection for SQL_C_CHAR (UTF-8 on Linux/macOS, user-configured encoding on Windows) and propagate it through fetch paths.
Increase SQL_C_CHAR fetch/bind buffer sizes on Linux/macOS to handle worst-case UTF-8 expansion at column boundaries.
Add a dedicated regression test suite for CP1252 VARCHAR boundary cases and a small decoding config update in the encoding test suite.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
tests/test_017_varchar_cp1252_boundary.py	New regression tests for CP1252 non-ASCII `VARCHAR` values at exact-length boundaries across multiple fetch paths.
tests/test_013_encoding_decoding.py	Adds `setdecoding(SQL_CHAR, "utf-8")` to align decoding configuration in an existing parameter encoding test.
mssql_python/pybind/ddbc_bindings.h	Updates `ProcessChar` fast/slow paths to decode correctly on Windows and to pass encoding through LOB fetch.
mssql_python/pybind/ddbc_bindings.cpp	Introduces `GetEffectiveCharDecoding`, adjusts buffer sizing and decoding in `SQLGetData_wrap`, and propagates encoding into batch fetch metadata.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/pybind/ddbc_bindings.h

mssql_python/pybind/ddbc_bindings.cpp

tests/test_017_varchar_cp1252_boundary.py

base fix for the issue

82c580f

github-actions bot added the pr-size: medium Moderate update size label Feb 23, 2026

subrata-ms and others added 2 commits February 23, 2026 17:00

Merge branch 'main' into subrata-ms/CP1252_Conversion

70c00bc

fix for linting and additional changes

016e549

subrata-ms added 2 commits February 23, 2026 11:50

fixing build failure

e324c29

test fix for windows

90d0919

subrata-ms marked this pull request as ready for review February 23, 2026 12:50

subrata-ms requested review from bewithgaurav, Copilot, gargsaumya, jahnvi480 and sumitmsft February 23, 2026 12:50

Copilot started reviewing on behalf of subrata-ms February 23, 2026 12:50 View session

Copilot AI reviewed Feb 23, 2026

View reviewed changes

subrata-ms and others added 3 commits February 24, 2026 10:10

Merge branch 'main' into subrata-ms/CP1252_Conversion

f89ad95

copilot review comments

ef8af68

improving test coverage

645ff6a

github-actions bot added pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels Feb 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters#444

FIX: VARCHAR fetch fails when data length equals column size with non-ASCII CP1252 characters#444
subrata-ms wants to merge 8 commits intomainfrom
subrata-ms/CP1252_Conversion

subrata-ms commented Feb 23, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 23, 2026 •

edited

Loading

🔥 Diff Coverage

67%

🎯 Overall Coverage

76%

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

subrata-ms commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work Item / Issue Reference

Summary

Uh oh!

github-actions bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Code Coverage Report

🔥 Diff Coverage

67%

🎯 Overall Coverage

76%

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

Summary

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/pybind/ddbc_bindings.h

📋 Files Needing Attention

🔗 Quick Links

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

subrata-ms commented Feb 23, 2026 •

edited

Loading

github-actions bot commented Feb 23, 2026 •

edited

Loading