
UnicodeDecodeError - pyodbc.cursor.columns #328

Closed

bergerod opened this issue Jan 12, 2018 · 23 comments

Comments
@bergerod

Environment

To diagnose, we usually need to know the following, including version numbers. On Windows, be
sure to specify 32-bit Python or 64-bit:

  • Python: 3.5.1 32bit
  • pyodbc: 4.0.21
  • OS: Windows 7 64 bit
  • DB: MS Access
  • driver: Microsoft Access Driver (*.mdb, *.accdb) 32 bit driver

def build_access_table_list(self):
    table_list = list()
    # Generate list of tables in access database
    for table in self.access_cur.tables():
        if table.table_type == "TABLE":
            table_list.append(table.table_name)
    return table_list

def create_fields(self, table):
    for column in self.access_cur.columns(table):
        ...

It fails randomly while looping through the columns. It does not happen every time: the same process can run 10 times in a row without a problem and then raise a Unicode error. There are no special characters in the Access table or column names.

Error for the log:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 84-85: illegal UTF-16 surrogate
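The failure mode can be reproduced in isolation: UTF-16-LE decoding fails as soon as the trailing junk happens to contain an unpaired surrogate code unit. A minimal sketch (the junk byte values are invented):

```python
# A high surrogate (0xD800) not followed by a low surrogate is illegal UTF-16.
# In little-endian byte order, 0x00 0xD8 encodes the code unit 0xD800.
good = "LastName".encode("utf-16-le")
junk = good + b"\x00\xd8" + b"7\x00"   # invented trailing garbage

print(good.decode("utf-16-le"))        # decodes fine: LastName
try:
    junk.decode("utf-16-le")
except UnicodeDecodeError as e:
    print("failed:", e.reason)
```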

@v-chojas
Contributor

When it does work, do you occasionally see column names with additional characters at the end? That may help determine where the problem is.

Also, please preserve indentation when posting Python code... you can enclose it in three backticks

to
 preserve
  the
   indentation
    like
     this.

@bergerod
Author

When it does work there are no additional characters at the end. The function will print out the Insert SQL generated after it makes it through the columns. Anything that I can provide to help?

@v-chojas
Contributor

You can get an ODBC trace of a failed run, and also a successful run.

@yunruse

yunruse commented Feb 22, 2018

After experimenting with MS Access databases, I found that UnicodeDecodeError only seems to occur for me when a column has a description (labelled Description (optional) in Access). Your mileage may vary, but a test goes as follows:

  1. Create a blank database with a table 'x' and a single column (of any kind) with no description.
  2. Try the following code, adding a character to the description each time you run:
import pyodbc
with pyodbc.connect(
        dbq='N:/test.accdb',
        driver='Microsoft Access Driver (*.mdb, *.accdb)') as db:
    for i in db.cursor().columns('x'):
        desc, c, rest = i[11].partition('\x00')
        print(desc, list(bytes(c + rest, 'utf8')))

Your mileage may vary with the format you get, but it appears to me as if Access appends the same number of junk bytes as there are characters in the description. Certain descriptions simply produce byte values that are invalid UTF-16.

@bergerod: as a workaround, does UnicodeDecodeError go away if you remove descriptions from the columns of the table you're doing Cursor.columns on?

@bergerod
Author

Thank you! As a workaround, this worked. I didn't think to look at the descriptions.

@eianlei

eianlei commented Apr 11, 2018

I had the same issue and found this thread via Google.
I am reading an Access *.mdb file (used by a Windows application) with a Python script, and I get this UnicodeDecodeError randomly when handling data from cursor.columns.
I printed the .remarks field in my test script and there was random junk at the end of those fields. It is a matter of luck whether and when the junk causes decode errors.

Then I opened the Access DB with Access itself and deleted the description fields, making them empty. After that there were no more problems, the .remarks field on my python output are empty and no more junk is seen, no more crashes.

So this is a good workaround.
Would be good to find a more robust solution though :-)

@gordthompson
Collaborator

@v-chojas

I can reproduce this issue with the Access_2010 ODBC driver (ACEODBC.DLL 14.00.7180.5000). It really is garbage characters at the end of the remarks column as returned by SQLColumnsW.

For a table named "Clients" as shown in Design View

Field Name  Data Type   Description
----------  ----------  --------------------
ID          AutoNumber  identity primary key
LastName    Text        Family name
FirstName   Text        Given name(s)
DOB         Date/Time   Date of Birth

excerpts from the ODBC trace log show

main            1624-1c40   EXIT  SQLColumnsW  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 
        WCHAR *             0x0000000000537AA0 [      -3] "Clients\ 0"
        SWORD                       -3 
        WCHAR *             0x0000000000000000 <null pointer>
        SWORD                       -3 

...

main            1624-1c40   EXIT  SQLDescribeColW  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        UWORD                       12 
        WCHAR *             0x000000000037F420 [       7] "REMARKS"
        SWORD                      300 
        SWORD *             0x000000000037F3E8 (7)
        SWORD *             0x000000000037F3E4 (-9)
        SQLULEN *           0x000000000037F3F8 (254)
        SWORD *             0x000000000037F3F0 (0)
        SWORD *             0x000000000037F3EC (1)

...

main            1624-1c40   EXIT  SQLGetData  with return code 0 (SQL_SUCCESS)
        HSTMT               0x000000000018E440
        UWORD                       12 
        SWORD                       -8 <SQL_C_WCHAR>
        PTR                 0x0000000002879FE0 [      80] "identity primary key\ 0\ 0??\ 0\ 0\ 1\ 0\ 0\ 0??\ 0\ 0??\ 0\ 0?7\ 0\ 0"
        SQLLEN                  4096
        SQLLEN *            0x000000000037F680 (80)

When printed to the PyCharm console the remarks look something like this

ID: identity primary key  ᥰ˼  �   ᥰ˼  䒰˩  7

and the garbage characters are slightly different each time the test code runs, so it is only a matter of time before those extra bytes contain an illegal UTF-16LE code unit, whereupon we get

Traceback (most recent call last):
  File "C:/Users/Gord/PycharmProjects/py3pyodbc_demo/main.py", line 91, in <module>
    for row in crsr.columns("Clients").fetchall():
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 80-81: illegal encoding

To me this really looks like a bug in the Access ODBC driver, but I don't know where I might report such a thing.

@v-chojas
Contributor

v-chojas commented Sep 7, 2018

I agree that appears to be a bug; SQLGetData is returning a length of 80, but the valid data is shorter than that. The log doesn't show the raw bytes but perhaps the data is null-terminated, and it is only the length which is wrong? Or is the driver actually writing 80 bytes of data into the buffer?

Observe that the valid data is 20 ASCII characters, which is 40 bytes of UTF-16. 80 looks suspiciously like a confusion between bytes/characters.
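The bytes-versus-characters suspicion checks out arithmetically; a quick sketch:

```python
remarks = "identity primary key"
assert len(remarks) == 20                  # 20 characters
encoded = remarks.encode("utf-16-le")
assert len(encoded) == 40                  # 40 bytes (2 per ASCII character)
# A driver that treats the 40-byte length as a character count, and then
# converts those "characters" to bytes, would report 2 * 40 = 80 -- exactly
# the length SQLGetData returned in the trace.
assert 2 * len(encoded) == 80
```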

To me this really looks like a bug in the Access ODBC driver, but I don't know where I might report such a thing.

You can try here: https://social.msdn.microsoft.com/forums/office/en-us/home?forum=accessdev

@gordthompson
Collaborator

I also was able to reproduce the issue in C# with System.Data.Odbc:

System.Data.DataTable dt = conn.GetSchema("Columns", new string[] { null, null, "Clients", null });

(screenshot of the returned schema table omitted)

This issue can probably be closed since it is not a pyodbc problem, it's an Access ODBC problem.

@yunruse

yunruse commented Sep 8, 2018

The function in question is at line 1302 in cursor.cpp. Since patching pyodbc would be a little hacky for this one specific instance where Microsoft forgot to terminate their string (I know the feeling), here's a little workaround wrapper for Cursor.columns():

def columns(cur, table_name):
    for line in cur.columns(table_name):
        line = list(line)
        line[11], null_terminator, garbage = line[11].partition('\x00')
        yield tuple(line)

for foo in columns(cur, table_name):
    pass #stuff you were going to do
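A hypothetical demonstration of the wrapper above against a stubbed cursor (FakeCursor and its junk suffix are invented; note this cleanup only helps when the decode itself succeeded and the junk merely trails the null terminator):

```python
def columns(cur, table_name):
    for line in cur.columns(table_name):
        line = list(line)
        line[11], null_terminator, garbage = line[11].partition('\x00')
        yield tuple(line)

class FakeCursor:
    """Stand-in for a pyodbc cursor whose remarks field (index 11) has junk."""
    def columns(self, table_name):
        row = [None] * 12
        row[3] = "ID"
        row[11] = "identity primary key\x00\u1970\u02fc7"  # invented junk
        yield tuple(row)

for row in columns(FakeCursor(), "Clients"):
    print(row[3], "->", row[11])   # ID -> identity primary key
```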

@gordthompson
Collaborator

gordthompson commented Sep 8, 2018

Another possible workaround is to use an output converter function, e.g.,

def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# ...

prev_converter = cnxn.get_output_converter(pyodbc.SQL_WVARCHAR)
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, decode_sketchy_utf16)
col_info = crsr.columns("Clients").fetchall()
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, prev_converter)  # restore previous behaviour
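For illustration, the converter can be exercised against a simulated buffer (the junk bytes below are invented to mimic the trace output):

```python
def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# valid UTF-16-LE text, a null terminator, then junk with a lone surrogate
raw = "Family name".encode("utf-16-le") + b"\x00\x00" + b"\x00\xd8\x37\x00"
print(decode_sketchy_utf16(raw))  # -> Family name
```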

@renaatd

renaatd commented May 16, 2020

On my system (Windows 10 ver 1909, Python 3.8.1 32-bit, pyodbc 4.0.30) the workarounds from yunruse and gordthompson don't work. With yunruse's workaround, the error is raised inside the pyodbc module (in the call to columns), before the Python-level cleanup runs. With gordthompson's workaround, decode_sketchy_utf16 is never called. On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR, or another decoder on SQL_CHAR (via cnxn.setdecoding), must be used.

def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')

cnxn.add_output_converter(pyodbc.SQL_VARCHAR, decode_sketchy_utf8)

# do something with columns

cnxn.remove_output_converter(pyodbc.SQL_VARCHAR) 
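The same simulated-buffer check works for this byte-level variant (the junk bytes are invented):

```python
def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')

# valid single-byte text, a null terminator, then arbitrary junk bytes
raw = b"Date of Birth\x00\xe1\xa5\xb0\x37"
print(decode_sketchy_utf8(raw))   # -> Date of Birth
```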

@gordthompson
Collaborator

gordthompson commented May 16, 2020

On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR or another decoder on SQL_CHAR (via cnxn.setdecoding) must be used.

That's odd. Microsoft ODBC drivers are pretty consistently UTF-16LE, and no, 32-bit Python doesn't affect that. Are you using setdecoding on your connection? (Hint: You shouldn't.)

@renaatd

renaatd commented May 16, 2020 via email

@gordthompson
Collaborator

What do you get when you run this using 32-bit cscript.exe (after updating the file path)?

Option Explicit

Dim PathToMyDatabase
PathToMyDatabase = "C:\Users\Gord\Desktop\zzz2007.accdb"

Dim objAccess
Set objAccess = CreateObject("Access.Application")
objAccess.OpenCurrentDatabase PathToMyDatabase

Dim intFormat
intFormat = objAccess.CurrentProject.FileFormat

Select Case intFormat
    Case 2 : WScript.Echo "Microsoft Access 2"
    Case 7 : WScript.Echo "Microsoft Access 95"
    Case 8 : WScript.Echo "Microsoft Access 97"
    Case 9 : WScript.Echo "Microsoft Access 2000"
    Case 10 : WScript.Echo "Microsoft Access 2003"
    Case 12 : WScript.Echo "Microsoft Access 2007/2010"
    Case Else : WScript.Echo "Unknown FileFormat value: " & intFormat
End Select

objAccess.CloseCurrentDatabase

@renaatd

renaatd commented May 17, 2020 via email

@gordthompson
Collaborator

Closed due to inactivity. Feel free to re-open with current information if necessary.

@lwolf-sagetechs

I really appreciate this fix. Could not figure out why I was getting random errors when reading an Access database.
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position xxx

But @gordthompson, your suggested workaround on Sep 8, 2018 works great. Thanks!

@padhr2810

Hack: setting 'latin-1' encoding seems to work for me. I used a try/except block: if I encounter a UnicodeDecodeError, the following code is used:

con = pyodbc.connect(f'DRIVER={self.DRV};DBQ={self.MDB};PWD={self.PWD}')
con.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
con.setdecoding(pyodbc.SQL_WCHAR, encoding='latin-1')
con.setencoding(encoding='latin-1')
				
cur = con.cursor()
SQL = f'SELECT * FROM "TableName";'
rows = cur.execute(SQL).fetchall()
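A note on why the latin-1 hack never raises: latin-1 maps every possible byte value 0-255 to a code point, so decoding cannot fail, though genuine non-ASCII UTF-16 text will come out as mojibake. A quick check:

```python
every_byte = bytes(range(256))
s = every_byte.decode("latin-1")           # never raises: one char per byte
assert len(s) == 256
assert s.encode("latin-1") == every_byte   # and it round-trips losslessly
```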

@average-everyman

Another possible workaround is to use an output converter function, e.g.,

def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# ...

prev_converter = cnxn.get_output_converter(pyodbc.SQL_WVARCHAR)
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, decode_sketchy_utf16)
col_info = crsr.columns("Clients").fetchall()
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, prev_converter)  # restore previous behaviour

Dear @gordthompson : how could I apply this under sqlalchemy-acess?

@gordthompson
Collaborator

sqlalchemy-access already uses that fix:

https://github.com/gordthompson/sqlalchemy-access/blob/dbf69a7057aa8c2166ea9e2b319fe187ebccefa5/sqlalchemy_access/base.py#L720

@average-everyman

sqlalchemy-access already uses that fix:

https://github.com/gordthompson/sqlalchemy-access/blob/dbf69a7057aa8c2166ea9e2b319fe187ebccefa5/sqlalchemy_access/base.py#L720

Thank you @gordthompson! I still did not manage to get this to work. I also tried another route with jaydebeapi (as per your reply here: https://stackoverflow.com/a/25614063/11727912) and that worked; however, the downside is that there does not seem to be a SQLAlchemy dialect for it, so I can't fetch tables using pandas read_sql...

@magnusfarstad

magnusfarstad commented Jul 11, 2023

One of my columns was in a geography::Position(POINT(lat, long)) format and was represented as 0xE6... I tried everything here and more, but nothing seemed to work. Adding my SQL quick fix in case anyone is in need of an additional option: SELECT *, Cast([Position] As NVARCHAR(max)) AS PositionText FROM Locations

Make sure to add:

from binascii import hexlify

# (method on the class that owns `connection`)
def HandleSpatialData(self, v):
    return f"0x{hexlify(v).decode().upper()}"

connection.add_output_converter(-151, self.HandleSpatialData)

EDIT: Just discovered I could simply SELECT Position.Lat 'Latitude', Position.Long 'Longitude' From Locations. When you're too deep in it, you forget KISS...
