Incorrect encoding detection #41

jmacura · 2021-11-27T23:47:16Z

Hello,

First, let me thank you for this great reconciliation tool!

I've been trying to use csv-reconcile with the csv-reconcile-geo plugin. I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.

The problem was that the "budovy_wdqs.tsv" file I provided was encoded in UTF-8, while the program apparently expects it to be in cp1250 for some reason. Once I re-saved the .tsv in cp1250, the error went away.

(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
    initdb.init_db_with_context()
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
    return init_db(db,
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
    header = next(reader)
  File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>
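
For illustration, the same kind of failure can be reproduced in a few lines. This is only a sketch: the character "ň" is an assumption, chosen because its UTF-8 encoding happens to contain the byte 0x88; the thread does not say which character sat at position 2094 of the real file.

```python
# "ň" (U+0148) is a hypothetical example character, not taken from the file.
raw = "ň".encode("utf-8")
print(raw)  # b'\xc5\x88'

decode_error = None
try:
    # cp1250 has no character assigned to byte 0x88, so decoding fails.
    raw.decode("cp1250")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```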


gitonthescene · 2021-11-28T00:57:32Z

Hi there,

It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following:

CSVENCODING=<encoding>

where <encoding> is replaced with the encoding you need. I’m assuming you want cp1250 but possibly utf-8.
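
As a concrete sketch, a config file for a UTF-8 input could contain the single line below. The file name config.txt is taken from the command in the report; the quoted-string form assumes the config file is read as Python-style assignments, which matches the form confirmed to work later in this thread.

```python
# config.txt (passed via --config config.txt)
# Assumption: the file is parsed as Python-style assignments.
CSVENCODING = "utf-8"
```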

I’m not 100% sure what’s happening from your description. If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default? This is all handled by Python’s csv module. The config allows you to supply an explicit encoding to use.

Please let me know if this helps.

Regards

gitonthescene · 2021-11-28T01:16:30Z

FWIW, Google turned up this description of how to determine the file encoding that might be worth trying.

b2m · 2021-12-01T08:06:59Z

At the current version (0.3.0) csv-reconcile doesn't try to guess the encoding of a CSV file.

There is a separate python library for that called chardet.
It is already in the dependency tree of csv-reconcile as it is a direct dependency of normality.

It may be worth a try to guess the encoding of a file when no user specific encoding is given.
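
A minimal sketch of what such a guess could look like follows. It is not csv-reconcile's implementation: the helper name guess_encoding and its fallback parameter are hypothetical, and the only chardet call used is chardet.detect, which returns a dict whose "encoding" entry may be None.

```python
def guess_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Guess the text encoding of raw bytes (hypothetical helper)."""
    # Bytes that decode cleanly as UTF-8 are very likely UTF-8.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise ask chardet, if it is installed.
    try:
        import chardet
    except ImportError:
        return fallback
    guess = chardet.detect(raw)["encoding"]
    return guess or fallback


utf8_guess = guess_encoding("kůň".encode("utf-8"))
legacy_guess = guess_encoding("Příliš žluťoučký kůň".encode("cp1250"))
print(utf8_guess, legacy_guess)
```

Detection on short or ambiguous input is a heuristic, so an explicit user setting should still take precedence over the guess.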

There is also the csv.Sniffer class that helps detecting the correct delimiter without relying on user parameters for every deviation from the defaults.
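
As a rough sketch of that idea, csv.Sniffer can be given a sample of the file and a set of candidate delimiters. The sample rows here are made up for illustration, and the candidate set "\t,;" is an assumption, not a csv-reconcile default.

```python
import csv
import io

# Made-up sample resembling a small TSV with an item and a coordinate column.
sample = "item\tcoords\nQ1\tPoint(14.42 50.09)\nQ2\tPoint(16.61 49.20)\n"

# Restrict the candidates so punctuation inside values is not mistaken
# for a delimiter.
dialect = csv.Sniffer().sniff(sample, delimiters="\t,;")
rows = list(csv.reader(io.StringIO(sample), dialect))
print(repr(dialect.delimiter), rows[0])
```

Sniffing is heuristic and raises csv.Error when it cannot settle on a delimiter, so callers usually want a fallback to the configured or default dialect.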

gitonthescene · 2021-12-01T09:00:15Z

@b2m - Thanks for the tips. I’ll have a look. I don’t believe the last release did either unless something changed in Python’s csv module.

b2m · 2021-12-01T09:40:04Z

I don’t believe the last release did either [...]

Exactly, the comment was meant as a set of tips for improving the usability of csv-reconcile and avoiding most CSV encoding/reading problems =)

jmacura · 2021-12-01T21:10:30Z

It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following

Oh, thank you! I wasn't aware of this config option. I guess this closes this issue.

If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default?

The file is not large, I am attaching it whole in its original form (i.e., before I re-encoded it).
query.zip

Perhaps the right question here is, what is the default encoding csv-reconcile is expecting in the .tsv file? For me, it was apparently cp-1250, but this must be wrong for the vast majority of files, so could it be platform-dependent?
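
For reference, when Python's open() is called without an encoding argument it uses a platform-dependent default, which can be inspected as shown below. This sketch only illustrates standard-library behaviour under that assumption; it says nothing about how csv-reconcile itself opens files, and the sample data is made up.

```python
import csv
import locale
import os
import tempfile

# The encoding open() falls back to when none is given
# (reported as UTF-8 when Python's UTF-8 mode is enabled).
default_encoding = locale.getpreferredencoding(False)
print("platform default:", default_encoding)

# Write a small UTF-8 TSV with made-up data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", encoding="utf-8", newline="", delete=False
) as tmp:
    tmp.write("item\tlabel\nQ1\tkůň\n")
    path = tmp.name

try:
    # Passing encoding explicitly makes the result independent of the platform.
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        first_row = next(reader)
finally:
    os.remove(path)

print(header, first_row)
```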

gitonthescene · 2021-12-01T23:59:34Z

Thanks. I’ll try to work in @b2m’s tips to auto-discover the encoding, but for now you should be able to use the override. Would you please let me know if you’re up and running using the override so I can close this issue?

Also, thanks for the file. I’ll use it to test the suggested features.

gitonthescene · 2021-12-04T01:21:24Z

@jmacura FWIW, I did check that the tsv in the file you posted is using utf-8. I'm not sure why your system thought it should be encoded cp1250. In any event, I implemented @b2m's suggestions above to add encoding detection and that will be in the next release. This issue will close once that gets merged back into master.

jmacura · 2021-12-04T21:20:40Z

@gitonthescene Great, thank you for this improvement! Besides that, I can confirm that adding the line CSVENCODING = "utf-8" to config.txt (and passing --config config.txt) works around the problem as well. Thank you for the hint.

woody544 · 2022-04-23T04:27:52Z

FYSA: Windows defaults to a locale-dependent legacy code page (cp1250 on Central European systems), which can cause hiccups like this, and I ran into this problem with csv-reconcile as well.

I had solved it before I saw the above issue/solution, by opening the file in the text editor, and saving the reps.tsv file as 'UTF-8 with BOM'. However, I expect changing the configuration file as suggested in the issue is a more robust and lasting solution.

gitonthescene · 2022-04-29T06:45:47Z

FWIW, this has gone out in the latest release.

gitonthescene pushed a commit that referenced this issue Dec 4, 2021

Fixes #41. Try to detect CSV file encoding and dialect. Allow overrides.

5f34626

gitonthescene mentioned this issue Dec 4, 2021

Feature/smart csv #42

Merged

gitonthescene mentioned this issue Apr 23, 2022

ValueError: 'item' is not in list #65

Closed

gitonthescene closed this as completed in 9a9e8d4 Apr 29, 2022

b2m mentioned this issue Apr 29, 2022

Problem with CSV dialect sniffing #68

Closed

dchoi127 mentioned this issue May 3, 2022

localhost:5000/reconcile not displaying properly #70

Closed
