I was trying to use the data for some experiments, but when reading it directly with open in Python 3, I encountered an encoding error for the file R121.html:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 1213: invalid continuation byte
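As a possible workaround on the reading side, here is a minimal sketch that tries a short list of encodings in order instead of assuming UTF-8. The function name `read_text_with_fallback` and the default encoding list are my own assumptions, not something from the dataset or its tooling. Note that `latin-1` accepts every byte, so it never fails but may silently produce wrong characters.

```python
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used), trying each encoding in order.

    Hypothetical helper: the encoding list is an assumption and should be
    adjusted to whatever the files are expected to contain.
    """
    raw = Path(path).read_bytes()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes, try the next
    raise ValueError(f"none of {encodings} could decode {path}")
```

Returning the encoding that succeeded makes it easy to log which files were not UTF-8.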
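I wrote a small script to check the encoding of the failing files. It reads the charset reported by file -i and, for anything that is not us-ascii or utf-8, falls back to chardet's chardetect utility. The code is below.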
for f in *.html; do
    encoding=$(file -i "$f" | cut -d"=" -f 2)  # charset reported by file
    if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
        res=$(chardetect "$f")  # otherwise, let chardet guess the encoding
        encoding=$(echo "$res" | cut -d" " -f 2)
        echo "$res"
    fi
done
It produces the following result:
R121.html: Windows-1254 with confidence 0.5434633906826465
R17.html: ISO-8859-1 with confidence 0.73
R736.html: Windows-1252 with confidence 0.73
R757.html: Windows-1252 with confidence 0.73
R826.html: Windows-1252 with confidence 0.73
R827.html: Windows-1252 with confidence 0.73
T156.html: windows-1251 with confidence 0.7538428528079772
T19.html: Windows-1252 with confidence 0.73
T2.html: Windows-1254 with confidence 0.5434729438118417
T31.html: Windows-1254 with confidence 0.5239184224706976
T97.html: ISO-8859-1 with confidence 0.73
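The same check can be approximated from Python without `file` or `chardetect`, by attempting a strict UTF-8 decode of each file. This is only a sketch: `find_non_utf8` is a hypothetical helper name, and it reports where decoding fails rather than guessing the actual encoding. Pure ASCII files pass, since ASCII is a subset of UTF-8.

```python
from pathlib import Path


def find_non_utf8(directory, pattern="*.html"):
    """Return {filename: description} for files that fail strict UTF-8 decoding."""
    failures = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError as exc:
            bad_byte = exc.object[exc.start]  # the first offending byte, as an int
            failures[path.name] = (
                f"byte {bad_byte:#04x} at position {exc.start}: {exc.reason}"
            )
    return failures
```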
These inconsistencies are not major, and I managed to fix most of them afterwards with a few changes to the detection script. A few files (R121.html, T19.html, T2.html, T31.html) still failed even with recode, so I had to remove them. Here is the script I used to convert the inconsistent ones.
for f in *.html; do
    encoding=$(file -i "$f" | cut -d"=" -f 2)  # charset reported by file
    if [ "$encoding" != "us-ascii" ] && [ "$encoding" != "utf-8" ]; then
        res=$(chardetect "$f")  # otherwise, let chardet guess the encoding
        encoding=$(echo "$res" | cut -d" " -f 2)
        echo "$res - CONVERTING TO UTF-8"
        recode "${encoding}..utf-8" "$f"  # convert the file in place
    fi
done
This might be an issue on my side, so I'm curious whether this has come to your attention before.
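For reference, the conversion step can also be sketched in Python using only the standard library. `transcode_to_utf8` is a hypothetical helper, and the source encoding is assumed to come from a detector such as chardet. If the guessed encoding is wrong or unknown to Python, the strict decode fails and the file is left untouched, which may be similar to what happened with the files recode could not convert.

```python
from pathlib import Path


def transcode_to_utf8(path, source_encoding):
    """Rewrite `path` as UTF-8; return False and leave it untouched on failure."""
    path = Path(path)
    raw = path.read_bytes()
    try:
        text = raw.decode(source_encoding)  # strict: raises on undecodable bytes
    except (UnicodeDecodeError, LookupError):
        # LookupError covers encoding names Python does not recognise.
        return False
    path.write_bytes(text.encode("utf-8"))
    return True
```

Working on bytes rather than text mode avoids any newline translation when the file is written back.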