UnicodeDecodeError - pyodbc.cursor.columns #328
Comments
When it does work, do you occasionally see column names with additional characters at the end? That may help determine where the problem is. Also, please preserve indentation when posting Python code... you can enclose it in three backticks
|
When it does work there are no additional characters at the end. The function will print out the Insert SQL generated after it makes it through the columns. Anything that I can provide to help? |
You can get an ODBC trace of a failed run, and also a successful run. |
After experimenting with MS Access databases, I found that UnicodeDecodeError only seems to occur for me when a column contains a description (labelled "Description (optional)" in Access). Your mileage may vary, but a test goes as:
import pyodbc

with pyodbc.connect(
        dbq='N:/test.accdb',
        driver='Microsoft Access Driver (*.mdb, *.accdb)') as db:
    for i in db.cursor().columns('x'):
        # i[11] is the REMARKS column, which holds the Access description
        desc, c, rest = i[11].partition('\x00')
        print(desc, list(bytes(c + rest, 'utf8')))
Your mileage may vary with the format you get, but it appears to me as if Access adds on the same number of junk bytes as there are characters in the description. Certain descriptions simply have bytes so high that they're illegible under UTF-16. @bergerod: as a workaround, does |
Thank you! As a workaround, this worked. I didn't think to look at the descriptions. |
I had the same issue and found this thread via Google. I then opened the Access DB with Access itself and deleted the description fields, making them empty. After that there were no more problems: the .remarks field in my Python output is empty, no more junk is seen, and there are no more crashes. So this is a good workaround. |
I can reproduce this issue with the Access_2010 ODBC driver (ACEODBC.DLL 14.00.7180.5000). It really is garbage characters at the end of the For a table named "Clients" as shown in Design View
excerpts from the ODBC trace log show
When printed to the PyCharm console the
and the garbage characters are slightly different each time the test code runs, so it is only a matter of time before those extra bytes contain an illegal UTF-16LE code unit, whereupon we get
To me this really looks like a bug in the Access ODBC driver, but I don't know where I might report such a thing. |
I agree that appears to be a bug; SQLGetData is returning a length of 80, but the valid data is shorter than that. The log doesn't show the raw bytes but perhaps the data is null-terminated, and it is only the length which is wrong? Or is the driver actually writing 80 bytes of data into the buffer? Observe that the valid data is 20 ASCII characters, which is 40 bytes of UTF-16. 80 looks suspiciously like a confusion between bytes/characters.
You can try here: https://social.msdn.microsoft.com/forums/office/en-us/home?forum%3Daccessdev |
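The byte/character arithmetic in the comment above can be checked without a database. This is only a minimal sketch, assuming a hypothetical 20-character ASCII description and made-up junk bytes (the driver's real buffer contents are not shown in this thread); `truncate_at_utf16_null` is an illustrative helper, not part of pyodbc.

```python
def truncate_at_utf16_null(raw: bytes) -> bytes:
    """Cut a UTF-16LE buffer at the first code-unit-aligned null terminator."""
    for i in range(0, len(raw) - 1, 2):
        if raw[i:i + 2] == b"\x00\x00":
            return raw[:i]
    return raw


# Hypothetical description: 20 ASCII characters, i.e. 40 bytes of UTF-16LE.
description = "x" * 20
valid = description.encode("utf-16-le")

# Made-up junk: 38 bytes that include a lone high surrogate (0xD800),
# which is not legal UTF-16 on its own.
junk = b"\x41\x00" * 17 + b"\x00\xd8\x41\x00"

# Valid data, a null terminator, then junk: 80 bytes, matching the reported length.
buffer = valid + b"\x00\x00" + junk

try:
    buffer.decode("utf-16-le")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# Honouring the null terminator recovers the description.
recovered = truncate_at_utf16_null(buffer).decode("utf-16-le")
```

If the reported length really is a character count used as a byte count, decoding all 80 bytes only succeeds when the trailing bytes happen to form legal UTF-16, which would be consistent with the intermittent failures described in this thread.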
The function in question is at line 1302 in
def columns(cur, table_name):
    for line in cur.columns(table_name):
        line = list(line)
        line[11], null_terminator, garbage = line[11].partition('\x00')
        yield tuple(line)

for foo in columns(cur, table_name):
    pass  # stuff you were going to do
|
Another possible workaround is to use an output converter function, e.g.,
def decode_sketchy_utf16(raw_bytes):
    s = raw_bytes.decode("utf-16le", "ignore")
    try:
        n = s.index('\u0000')
        s = s[:n]  # respect null terminator
    except ValueError:
        pass
    return s

# ...

prev_converter = cnxn.get_output_converter(pyodbc.SQL_WVARCHAR)
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, decode_sketchy_utf16)
col_info = crsr.columns("Clients").fetchall()
cnxn.add_output_converter(pyodbc.SQL_WVARCHAR, prev_converter)  # restore previous behaviour
|
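The save/install/restore steps in the workaround above could be wrapped in a context manager so the previous converter is restored even if the metadata call raises. This is a sketch only: `temporary_output_converter` and `FakeConnection` are hypothetical names, and the stand-in class exists solely so the example runs without a database. With a real pyodbc connection you would pass `pyodbc.SQL_WVARCHAR` instead of the literal constant.

```python
from contextlib import contextmanager

SQL_WVARCHAR = -9  # ODBC type code; pyodbc exposes it as pyodbc.SQL_WVARCHAR


def decode_sketchy_utf16(raw_bytes):
    """Decode UTF-16LE, skipping illegal code units and stopping at the null."""
    s = raw_bytes.decode("utf-16le", "ignore")
    return s.split("\u0000", 1)[0]


@contextmanager
def temporary_output_converter(cnxn, sqltype, func):
    """Install func for sqltype, then put back whatever was there before."""
    prev = cnxn.get_output_converter(sqltype)
    cnxn.add_output_converter(sqltype, func)
    try:
        yield cnxn
    finally:
        if prev is None:
            cnxn.remove_output_converter(sqltype)
        else:
            cnxn.add_output_converter(sqltype, prev)


class FakeConnection:
    """Stand-in with the three converter methods used above (demo only)."""

    def __init__(self):
        self.converters = {}

    def get_output_converter(self, sqltype):
        return self.converters.get(sqltype)

    def add_output_converter(self, sqltype, func):
        self.converters[sqltype] = func

    def remove_output_converter(self, sqltype):
        self.converters.pop(sqltype, None)


cnxn = FakeConnection()
with temporary_output_converter(cnxn, SQL_WVARCHAR, decode_sketchy_utf16):
    # With a real connection: col_info = crsr.columns("Clients").fetchall()
    active = cnxn.get_output_converter(SQL_WVARCHAR)
after = cnxn.get_output_converter(SQL_WVARCHAR)

# Made-up buffer: valid text, null terminator, then a lone surrogate and junk.
sample = "Name".encode("utf-16-le") + b"\x00\x00" + b"\x00\xd8\x41\x00"
decoded = decode_sketchy_utf16(sample)
```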
On my system (Windows 10 ver 1909, Python 3.8.1 32-bit, pyodbc 4.0.30), the workarounds from yunruse and gordthompson don't work.
yunruse's workaround: the error is raised inside the pyodbc module (in the call to columns), so before the Python part runs.
gordthompson's workaround: "decode_sketchy_utf16" is never called.
On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR or another decoder on SQL_CHAR (via cnxn.setdecoding) must be used.
def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')

cnxn.add_output_converter(pyodbc.SQL_VARCHAR, decode_sketchy_utf8)
# do something with columns
cnxn.remove_output_converter(pyodbc.SQL_VARCHAR)
|
That's odd. Microsoft ODBC drivers are pretty consistently UTF-16LE, and no, 32-bit Python doesn't affect that. Are you using `setdecoding` on your connection? |
I've made two variants,
* setdecoding() on connection, no output converter
* no setdecoding, with output converter
Both are attached, and I can send a small test database if desired.
Version used for testing:
Python version: 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)]
PyOdbc version: 4.0.30
Regards,
Renaat
On Sat, May 16, 2020, at 19:08, Gord Thompson wrote:
> On my system (because of 32-bit Python?), an output converter on SQL_VARCHAR or another decoder on SQL_CHAR (via cnxn.setdecoding) must be used.
That's odd. Microsoft ODBC drivers are pretty consistently UTF-16LE, and no, 32-bit Python doesn't affect that. Are you using `setdecoding` on your connection?
import codecs
import sys
from pathlib import Path

import pyodbc

filename = r'lpemptydb.mdb'
pwd = '20KHhl00'


def custom_encode(text: str, errors):
    result = text.encode('utf-16le')
    return result, len(result)


def custom_decode(binary: bytes, errors):
    null_terminated_bytes = binary.tobytes().split(b'\x00')[0]
    result = null_terminated_bytes.decode('utf-8')
    return result, len(result)


def custom_search_function(encoding_name):
    print(f"custom_search: {encoding_name}")
    return codecs.CodecInfo(custom_encode, custom_decode, name='sketchy_utf8')


print(f"Python version: {sys.version}")
print(f"PyOdbc version: {pyodbc.version}")

codecs.register(custom_search_function)

conn_str = 'DRIVER={{Microsoft Access Driver (*.mdb)}};DBQ={0};PWD={1};'.format(Path(filename).absolute(), pwd)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
tables = [t.table_name for t in cursor.tables(tableType='TABLE')]

# method 1
# Note: NOT setdecoding(pyodbc.SQL_WMETADATA, ...)
cnxn.setdecoding(pyodbc.SQL_CHAR, encoding='sketchy_utf8')

# #328
# https://github.com/mkleehammer/pyodbc/wiki/Unicode
for table in tables:
    print("Table {0}:".format(table))
    for x in cursor.columns(table):
        print(x)

print("========")
print("FileUsage")
print("========")
for row in cursor.execute("SELECT SourceFile, RefereceFile, SourceType, LastWritten, Status FROM FileUsage"):
    print(f"{row.SourceFile},{row.RefereceFile},{row.SourceType},{row.LastWritten},{row.Status}")

import codecs
import sys
from pathlib import Path

import pyodbc

filename = r'lpemptydb.mdb'
pwd = '20KHhl00'

# followed instructions in https://github.com/mkleehammer/pyodbc/wiki/Using-an-Output-Converter-function, never called
# def decode_sketchy_utf16(raw_bytes):
#     print("decode_sketchy_utf16")
#     s = raw_bytes.decode("utf-16le", "ignore")
#     try:
#         n = s.index('\u0000')
#         s = s[:n]  # respect null terminator
#     except ValueError:
#         pass
#     return s


def decode_sketchy_utf8(raw_bytes):
    null_terminated_bytes = raw_bytes.split(b'\x00')[0]
    return null_terminated_bytes.decode('utf-8')


print(f"Python version: {sys.version}")
print(f"PyOdbc version: {pyodbc.version}")

conn_str = 'DRIVER={{Microsoft Access Driver (*.mdb)}};DBQ={0};PWD={1};'.format(Path(filename).absolute(), pwd)
cnxn = pyodbc.connect(conn_str)
cursor = cnxn.cursor()
tables = [t.table_name for t in cursor.tables(tableType='TABLE')]

# method 2
cnxn.add_output_converter(pyodbc.SQL_VARCHAR, decode_sketchy_utf8)

# #328
# https://github.com/mkleehammer/pyodbc/wiki/Unicode
for table in tables:
    print("Table {0}:".format(table))
    for x in cursor.columns(table):
        print(x)

# restore previous behaviour
cnxn.remove_output_converter(pyodbc.SQL_VARCHAR)

print("========")
print("FileUsage")
print("========")
for row in cursor.execute("SELECT SourceFile, RefereceFile, SourceType, LastWritten, Status FROM FileUsage"):
    print(f"{row.SourceFile},{row.RefereceFile},{row.SourceType},{row.LastWritten},{row.Status}")
|
What do you get when you run this using 32-bit cscript.exe (after updating the file path)?
Option Explicit
' Option Explicit requires every variable to be declared before use
Dim PathToMyDatabase
PathToMyDatabase = "C:\Users\Gord\Desktop\zzz2007.accdb"
Dim objAccess
Set objAccess = CreateObject("Access.Application")
objAccess.OpenCurrentDatabase PathToMyDatabase
Dim intFormat
intFormat = objAccess.CurrentProject.FileFormat
Select Case intFormat
    Case 2: Wscript.Echo "Microsoft Access 2"
    Case 7: Wscript.Echo "Microsoft Access 95"
    Case 8: Wscript.Echo "Microsoft Access 97"
    Case 9: Wscript.Echo "Microsoft Access 2000"
    Case 10: Wscript.Echo "Microsoft Access 2003"
    Case 12: Wscript.Echo "Microsoft Access 2007/2010"
    Case Else: Wscript.Echo "Unknown FileFormat value: " & intFormat
End Select
objAccess.CloseCurrentDatabase
|
I couldn't get the script running ("The expression you entered refers to an object that is closed or doesn't exist" on the intFormat line), but mdbtools mdb-ver reports "JET3" as database type. I guess that's Access 97 or older in your script.
And that pretty much explains the difference between your results and mine. AFAIK JET3 has no Unicode support. The data returned is probably in whatever codepage was used to create the database. In this case UTF-8 for decoding is probably good enough because this database doesn't contain non-ASCII characters in column info.
The first version of the application creating this database file was originally written at the end of the nineties, and it creates new database files by copying a template. So that's why it still uses JET3.
The real problem is of course the Access ODBC driver returning null-terminated strings followed by junk, but that's not something PyODBC can solve.
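The codepage point above can be illustrated with a small sketch. It assumes cp1252 as the codepage purely for illustration (the thread does not say which codepage created the database), uses made-up sample bytes, and `decode_narrow_remarks` is a hypothetical helper name.

```python
def decode_narrow_remarks(raw: bytes, encoding: str = "cp1252") -> str:
    """Keep the bytes before the first null, then decode with the given codec."""
    terminated = raw.split(b"\x00", 1)[0]
    return terminated.decode(encoding)


# ASCII-only text followed by a null terminator and made-up junk bytes.
ascii_raw = b"Client name\x00\xfe\x12junk"
# The same idea with one non-ASCII character ("Caf\u00e9" in cp1252 is b"Caf\xe9").
accented_raw = "Caf\u00e9".encode("cp1252") + b"\x00\x9d\x81"

# For pure ASCII, UTF-8 and cp1252 give the same result.
ascii_as_utf8 = decode_narrow_remarks(ascii_raw, "utf-8")
ascii_as_cp1252 = decode_narrow_remarks(ascii_raw, "cp1252")

# A lone 0xE9 byte is not valid UTF-8, so UTF-8 stops being "good enough".
try:
    decode_narrow_remarks(accented_raw, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

accented = decode_narrow_remarks(accented_raw, "cp1252")
```

Under these assumptions, UTF-8 is only a safe choice while the column metadata stays within ASCII; otherwise the codec would need to match the database's actual codepage.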
|
Closed due to inactivity. Feel free to re-open with current information if necessary. |
I really appreciate this fix. Could not figure out why I was getting random errors when reading an Access database. But @gordthompson, your suggested workaround on Sep 8, 2018 works great. Thanks! |
Hack: Setting 'latin-1' encoding seems to work for me. I used a try / except block, if I encounter a UnicodeDecodeError, then the following code is used:
|
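The code that accompanied the comment above is not preserved here. As a rough illustration only (this is not the commenter's code), the sketch below shows why a latin-1 fallback never raises: latin-1 maps every byte value to a code point, so any buffer decodes. `decode_with_latin1_fallback` and the sample bytes are hypothetical.

```python
def decode_with_latin1_fallback(raw: bytes) -> str:
    """Try strict UTF-16LE first; fall back to latin-1, which cannot fail."""
    try:
        return raw.decode("utf-16-le")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


good = "Name".encode("utf-16-le")
# Made-up bad buffer: valid text, null terminator, then a lone surrogate.
bad = good + b"\x00\x00" + b"\x00\xd8\x41\x00"

decoded_good = decode_with_latin1_fallback(good)
decoded_bad = decode_with_latin1_fallback(bad)

# Every possible byte value decodes under latin-1.
all_bytes_decoded = bytes(range(256)).decode("latin-1")
```

The trade-off is that the fallback result is one character per byte, so UTF-16 text comes back interleaved with NUL characters and the junk is kept rather than dropped. Truncating at the null terminator, as in the earlier workarounds, gives cleaner output.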
Dear @gordthompson : how could I apply this under |
sqlalchemy-access already uses that fix: |
Thank you @gordthompson! I still did not manage to get this to work. I also tried another route with jaydebeapi (as per your reply here: https://stackoverflow.com/a/25614063/11727912) and this worked. However, the downside is that there does not seem to be a dialect for sqlalchemy to get tables using pandas read_sql... |
One of my columns was in a format of
Make sure to add:
from binascii import hexlify

connection.add_output_converter(-151, self.HandleSpatialData)

def HandleSpatialData(self, v):
    return f"0x{hexlify(v).decode().upper()}"
EDIT: Just discovered I could simply |
Environment
To diagnose, we usually need to know the following, including version numbers. On Windows, be
sure to specify 32-bit Python or 64-bit:
def build_access_table_list(self):
    table_list = list()
    # Generate list of tables in access database
    for table in self.access_cur.tables():
        if table.table_type == "TABLE":
            table_list.append(table.table_name)

def create_fields(self, table):
    for column in self.access_cur.columns(table):
It fails randomly at the point of looping through the columns. It does not happen all the time: this process can execute 10 times in a row without a problem, but then it will raise a Unicode error. There are no special characters in the Access table or column names.
Error from the log:
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 84-85: illegal UTF-16 surrogate