
UnicodeDecodeError - pyodbc.cursor.columns #328

Closed

bergerod opened this issue Jan 12, 2018 · 23 comments

Comments
@bergerod

Environment

To diagnose, we usually need to know the following, including version numbers. On Windows, be
sure to specify 32-bit Python or 64-bit:

  • Python: 3.5.1 32bit
  • pyodbc: 4.0.21
  • OS: Windows 7 64 bit
  • DB: MS Access
  • driver: Microsoft Access Driver (*.mdb, *.accdb) 32 bit driver

def build_access_table_list(self):
    table_list = list()
    # Generate list of tables in access database
    for table in self.access_cur.tables():
        if table.table_type == "TABLE":
            table_list.append(table.table_name)
    return table_list

def create_fields(self, table):
    for column in self.access_cur.columns(table):
        ...

It fails randomly while looping through the columns. It does not happen every time: the same process can run 10 times in a row without a problem and then raise a Unicode error. There are no special characters in the Access table or column names.

Error for the log:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 84-85: illegal UTF-16 surrogate
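The failure mode can be reproduced in isolation: UTF-16-LE decoding fails as soon as the trailing junk happens to contain an unpaired surrogate code unit. A minimal sketch (the junk byte values are invented):

```python
# A high surrogate (0xD800) not followed by a low surrogate is illegal UTF-16.
# In little-endian byte order, 0x00 0xD8 encodes the code unit 0xD800.
good = "LastName".encode("utf-16-le")
junk = good + b"\x00\xd8" + b"7\x00"   # invented trailing garbage

print(good.decode("utf-16-le"))        # decodes fine: LastName
try:
    junk.decode("utf-16-le")
except UnicodeDecodeError as e:
    print("failed:", e.reason)
```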

@v-chojas
Contributor

When it does work, do you occasionally see column names with additional characters at the end? That may help determine where the problem is.

Also, please preserve indentation when posting Python code... you can enclose it in three backticks

to
 preserve
  the
   indentation
    like
     this.

@bergerod
Author

When it does work there are no additional characters at the end. The function will print out the Insert SQL generated after it makes it through the columns. Anything that I can provide to help?

@v-chojas
Contributor

You can get an ODBC trace of a failed run, and also a successful run.

@yunruse

yunruse commented Feb 22, 2018

After experimenting with MS Access databases, I found that UnicodeDecodeError only seems to occur for me when a column has a description (labelled Description (optional) in Access). Your mileage may vary, but a test goes as follows:

  1. Create a blank database with a table 'x' and a single column (of any kind) with no description.
  2. Try the following code, adding a character to the description each time you run:
import pyodbc
with pyodbc.connect(
        dbq='N:/test.accdb',
        driver='Microsoft Access Driver (*.mdb, *.accdb)') as db:
    for i in db.cursor().columns('x'):
        desc, c, rest = i[11].partition('\x00')
        print(desc, list(bytes(c + rest, 'utf8')))

Your mileage may vary with the format you get, but it appears to me as if Access appends the same number of junk bytes as there are characters in the description. Certain descriptions simply produce byte values that are invalid UTF-16.

@bergerod: as a workaround, does UnicodeDecodeError go away if you remove descriptions from the columns of the table you're doing Cursor.columns on?

@bergerod
Author

Thank you! As a workaround, this worked. I didn't think to look at the descriptions.

@eianlei

eianlei commented Apr 11, 2018

I had the same issue and found this thread via Google.
I am reading an Access *.mdb file (used by a Windows application) with a Python script, and I get this UnicodeDecodeError randomly when handling data from cursor.columns.
I printed the .remarks field in my test script and there was random junk at the end of those fields. It is a matter of luck whether and when the junk causes decode errors.

Then I opened the Access DB with Access itself and deleted the description fields, making them empty. After that there were no more problems, the .remarks field on my python output are empty and no more junk is seen, no more crashes.

So this is a good workaround.
Would be good to find a more robust solution though :-)

@gordthompson
Collaborator

@v-chojas

I can reproduce this issue with the Access_2010 ODBC driver (ACEODBC.DLL 14.00.7180.5000). It really is garbage characters at the end of the remarks column as returned by SQLColumnsW.

For a table named "Clients" as shown in Design View

Field Name  Data Type   Description
----------  ----------  --------------------
ID          AutoNumber  identity primary key
LastName    Text        Family name
FirstName   Text        Given name(s)
DOB         Date/Time   Date of Birth

excerpts from the ODBC trace log show

main            1624-1c40   EXIT  SQLColumnsW  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 
        WCHAR *             0x0000000000537AA0 [      -3] "Clients\ 0"
        SWORD                       -3 
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 

...

main            1624-1c40   EXIT  SQLDescribeColW  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        UWORD                       12 
        WCHAR *             0x000000000037F420 [       7] "REMARKS"
        SWORD                      300 
        SWORD *             0x000000000037F3E8 (7)
        SWORD *             0x000000000037F3E4 (-9)
        SQLULEN *           0x000000000037F3F8 (254)
        SWORD *             0x000000000037F3F0 (0)
        SWORD *             0x000000000037F3EC (1)

...

main            1624-1c40   EXIT  SQLGetData  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        UWORD                       12 
        SWORD                       -8 <SQL_C_WCHAR>
        PTR                 0x0000000002879FE0 [      80] "identity primary key\ 0\ 0??\ 0\ 0\ 1\ 0\ 0\ 0??\ 0\ 0??\ 0\ 0?7\ 0\ 0"
        SQLLEN                  4096
        SQLLEN *            0x000000000037F680 (80)

When printed to the PyCharm console the remarks look something like this

ID: identity primary key  ᥰ˼  �   ᥰ˼  䒰˩  7

and the garbage characters are slightly different each time the test code runs, so it is only a matter of time before those extra bytes contain an illegal UTF-16LE code unit, whereupon we get

Traceback (most recent call last):
  File "C:/Users/Gord/PycharmProjects/py3pyodbc_demo/main.py", line 91, in <module>
    for row in crsr.columns("Clients").fetchall():
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 80-81: illegal encoding

To me this really looks like a bug in the Access ODBC driver, but I don't know where I might report such a thing.

@v-chojas
Contributor

v-chojas commented Sep 7, 2018

I agree that appears to be a bug; SQLGetData is returning a length of 80, but the valid data is shorter than that. The log doesn't show the raw bytes but perhaps the data is null-terminated, and it is only the length which is wrong? Or is the driver actually writing 80 bytes of data into the buffer?

Observe that the valid data is 20 ASCII characters, which is 40 bytes of UTF-16. 80 looks suspiciously like a confusion between bytes/characters.
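The bytes-versus-characters suspicion checks out arithmetically; a quick sketch:

```python
remarks = "identity primary key"
assert len(remarks) == 20                  # 20 characters
encoded = remarks.encode("utf-16-le")
assert len(encoded) == 40                  # 40 bytes (2 per ASCII character)
# A driver that treats the 40-byte length as a character count, and then
# converts those "characters" to bytes, would report 2 * 40 = 80 -- exactly
# the length SQLGetData returned in the trace.
assert 2 * len(encoded) == 80
```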

To me this really looks like a bug in the Access ODBC driver, but I don't know where I might report such a thing.

You can try here: https://social.msdn.microsoft.com/forums/office/en-us/home?forum=accessdev

@gordthompson
Collaborator

I also was able to reproduce the issue in C# with System.Data.Odbc:

System.Data.DataTable dt = conn.GetSchema("Columns", new string[] { null, null, "Clients", null });

(screenshot of the returned schema table omitted)

This issue can probably be closed since it is not a pyodbc problem, it's an Access ODBC problem.

@yunruse

yunruse commented Sep 8, 2018

The function in question is at line 1302 in cursor.cpp. Since patching pyodbc would be a little hacky for this one specific instance where Microsoft forgot to terminate their string (I know the feeling), here's a little workaround wrapper for Cursor.columns():

def columns(cur, table_name):
    for line in cur.columns(table_name):
        line = list(line)
        line[11], null_terminator, garbage = line[11].partition('\x00')
        yield tuple(line)

for foo in columns(cur, table_name):
    pass #stuff you were going to do
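A hypothetical demonstration of the wrapper above against a stubbed cursor (FakeCursor and its junk suffix are invented; note this cleanup only helps when the decode itself succeeded and the junk merely trails the null terminator):

```python
def columns(cur, table_name):
    for line in cur.columns(table_name):
        line = list(line)
        line[11], null_terminator, garbage = line[11].partition('\x00')
        yield tuple(line)

class FakeCursor:
    """Stand-in for a pyodbc cursor whose remarks field (index 11) has junk."""
    def columns(self, table_name):
        row = [None] * 12
        row[3] = "ID"
        row[11] = "identity primary key\x00\u1970\u02fc7"  # invented junk
        yield tuple(row)

for row in columns(FakeCursor(), "Clients"):
    print(row[3], "->", row[11])   # ID -> identity primary key
```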

@gordthompson
Collaborator

gordthompson commented Sep 8, 2018

Another possible workaround is to use an output converter function, e.g.,

def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# ...

prev_converter = cnxn.get_output_converter(pyodbc.SQL_WVARCHAR)
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, decode_sketchy_utf16)
col_info = crsr.columns("Clients").fetchall()
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, prev_converter)  # restore previous behaviour
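For illustration, the converter can be exercised against a simulated buffer (the junk bytes below are invented to mimic the trace output):

```python
def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# valid UTF-16-LE text, a null terminator, then junk with a lone surrogate
raw = "Family name".encode("utf-16-le") + b"\x00\x00" + b"\x00\xd8\x37\x00"
print(decode_sketchy_utf16(raw))  # -> Family name
```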

@renaatd

renaatd commented May 16, 2020

On my system (Windows 10 ver 1909, Python 3.8.1 32-bit, pyodbc 4.0.30) the workarounds from yunruse and gordthompson don't work. With yunruse's workaround, the error is raised inside the pyodbc module (in the call to columns), before the Python-level cleanup runs. With gordthompson's workaround, decode_sketchy_utf16 is never called. On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR, or another decoder on SQL_CHAR (via cnxn.setdecoding), must be used.

def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')

cnxn.add_output_converter(pyodbc.SQL_VARCHAR, decode_sketchy_utf8)

# do something with columns

cnxn.remove_output_converter(pyodbc.SQL_VARCHAR) 
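The same simulated-buffer check works for this byte-level variant (the junk bytes are invented):

```python
def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')

# valid single-byte text, a null terminator, then arbitrary junk bytes
raw = b"Date of Birth\x00\xe1\xa5\xb0\x37"
print(decode_sketchy_utf8(raw))   # -> Date of Birth
```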

@gordthompson
Collaborator

gordthompson commented May 16, 2020

On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR or another decoder on SQL_CHAR (via cnxn.setdecoding) must be used.

That's odd. Microsoft ODBC drivers are pretty consistently UTF-16LE, and no, 32-bit Python doesn't affect that. Are you using setdecoding on your connection? (Hint: You shouldn't.)

@renaatd

renaatd commented May 16, 2020 via email

@gordthompson
Collaborator

What do you get when you run this using 32-bit cscript.exe (after updating the file path)?

Option Explicit

Dim PathToMyDatabase
PathToMyDatabase = "C:\Users\Gord\Desktop\zzz2007.accdb"

Dim objAccess
Set objAccess = CreateObject("Access.Application")
objAccess.OpenCurrentDatabase PathToMyDatabase

Dim intFormat
intFormat = objAccess.CurrentProject.FileFormat

Select Case intFormat
    Case 2 : WScript.Echo "Microsoft Access 2"
    Case 7 : WScript.Echo "Microsoft Access 95"
    Case 8 : WScript.Echo "Microsoft Access 97"
    Case 9 : WScript.Echo "Microsoft Access 2000"
    Case 10 : WScript.Echo "Microsoft Access 2003"
    Case 12 : WScript.Echo "Microsoft Access 2007/2010"
    Case Else : WScript.Echo "Unknown FileFormat value: " & intFormat
End Select

objAccess.CloseCurrentDatabase

@renaatd

renaatd commented May 17, 2020 via email

@gordthompson
Collaborator

Closed due to inactivity. Feel free to re-open with current information if necessary.

@lwolf-sagetechs

I really appreciate this fix. Could not figure out why I was getting random errors when reading an Access database.
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position xxx

But @gordthompson, your suggested workaround on Sep 8, 2018 works great. Thanks!

@padhr2810

Hack: setting 'latin-1' encoding seems to work for me. I used a try/except block: if I encounter a UnicodeDecodeError, the following code is used:

con = pyodbc.connect(f'DRIVER={self.DRV};DBQ={self.MDB};PWD={self.PWD}')
con.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
con.setdecoding(pyodbc.SQL_WCHAR, encoding='latin-1')
con.setencoding(encoding='latin-1')
				
cur = con.cursor()
SQL = f'SELECT * FROM "TableName";'
rows = cur.execute(SQL).fetchall()
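A note on why the latin-1 hack never raises: latin-1 maps every possible byte value 0-255 to a code point, so decoding cannot fail, though genuine non-ASCII UTF-16 text will come out as mojibake. A quick check:

```python
every_byte = bytes(range(256))
s = every_byte.decode("latin-1")           # never raises: one char per byte
assert len(s) == 256
assert s.encode("latin-1") == every_byte   # and it round-trips losslessly
```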

@average-everyman

Another possible workaround is to use an output converter function, e.g.,

def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# ...

prev_converter = cnxn.get_output_converter(pyodbc.SQL_WVARCHAR)
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, decode_sketchy_utf16)
col_info = crsr.columns("Clients").fetchall()
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, prev_converter)  # restore previous behaviour

Dear @gordthompson : how could I apply this under sqlalchemy-acess?

@gordthompson
Collaborator

sqlalchemy-access already uses that fix:

https://github.com/gordthompson/sqlalchemy-access/blob/dbf69a7057aa8c2166ea9e2b319fe187ebccefa5/sqlalchemy_access/base.py#L720

@average-everyman

sqlalchemy-access already uses that fix:

https://github.com/gordthompson/sqlalchemy-access/blob/dbf69a7057aa8c2166ea9e2b319fe187ebccefa5/sqlalchemy_access/base.py#L720

Thank you @gordthompson! I still did not manage to get this to work. I also tried another route with jaydebeapi (as per your reply here: https://stackoverflow.com/a/25614063/11727912) and that worked; however, the downside is that there does not seem to be a SQLAlchemy dialect for it, so I can't fetch tables using pandas read_sql...

@magnusfarstad

magnusfarstad commented Jul 11, 2023

One of my columns was in a geography::Position(POINT(lat, long)) format and was represented as 0xE6... I tried everything here and more, but nothing seemed to work. Adding my SQL quick fix in case anyone is in need of an additional option: SELECT *, Cast([Position] As NVARCHAR(max)) AS PositionText FROM Locations

Make sure to add:

from binascii import hexlify

# (method on the class that owns `connection`)
def HandleSpatialData(self, v):
    return f"0x{hexlify(v).decode().upper()}"

connection.add_output_converter(-151, self.HandleSpatialData)

EDIT: Just discovered I could simply SELECT Position.Lat 'Latitude', Position.Long 'Longitude' From Locations. When you're too deep in it, you forget KISS...
