Incorrect encoding detection #41

Closed
jmacura opened this issue Nov 27, 2021 · 11 comments

@jmacura

jmacura commented Nov 27, 2021

Hello,

First, let me thank you for this great reconciliation tool!

I've been trying to use csv-reconcile with the csv-reconcile-geo plugin, and I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.

The problem was that the "budovy_wdqs.tsv" file I was providing uses UTF-8 encoding, while the program apparently expects it to be in cp-1250 for some reason. When I re-saved the .tsv in cp-1250, the error went away.

(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
    initdb.init_db_with_context()
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
    return init_db(db,
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
    header = next(reader)
  File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>
@gitonthescene
Owner

Hi there,

It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following:

CSVENCODING=<encoding>

where <encoding> is replaced with the encoding you need. I’m assuming you want cp1250 but possibly utf-8.

I’m not 100% sure what’s happening from your description. If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default? This is all handled by Python’s csv module. The config allows you to supply an explicit encoding to use.
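
To illustrate why an explicit encoding matters, here is a minimal sketch that reproduces the same kind of `UnicodeDecodeError` outside of csv-reconcile. It is not csv-reconcile's own code: the file name `sample.tsv`, the place name, and the coordinate values are made up for the example. In practice the encoding is applied when the file is opened, and the csv module then parses the already-decoded text.

```python
import csv
import os
import tempfile

# Hypothetical sample data: a tab-separated file with a Czech place name.
rows = [["item", "coords"], ["Plzeň", "Point(13.38 49.75)"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")

    # Write the file as UTF-8, the way the reporter's file was saved.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)

    # Reading with the matching encoding works.
    with open(path, encoding="utf-8", newline="") as f:
        header = next(csv.reader(f, delimiter="\t"))

    # Reading with cp1250 fails: "ň" is stored as bytes 0xC5 0x88 in UTF-8,
    # and byte 0x88 is unassigned in cp1250.
    try:
        with open(path, encoding="cp1250", newline="") as f:
            list(csv.reader(f, delimiter="\t"))
        decode_error = None
    except UnicodeDecodeError as exc:
        decode_error = exc

print(header)
print(decode_error)
```

Passing `encoding=` explicitly, as the `CSVENCODING` override presumably does internally, removes the dependence on whatever the platform default happens to be.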

Please let me know if this helps.

Regards

@gitonthescene
Owner

FWIW, Google turned up this description of how to determine the file encoding that might be worth trying.

@b2m
Contributor

b2m commented Dec 1, 2021

As of the current version (0.3.0), csv-reconcile doesn't try to guess the encoding of a CSV file.

There is a separate Python library for that called chardet.
It is already in the dependency tree of csv-reconcile, as it is a direct dependency of normality.

It may be worth trying to guess the encoding of a file when no user-specified encoding is given.
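
A rough sketch of what such a guess could look like follows. This is an assumption about one possible approach, not what csv-reconcile actually implements: the helper name `guess_encoding` and its `fallback` parameter are hypothetical, and trying strict UTF-8 before asking chardet is a design choice of this example. Only `chardet.detect` is a real library call.

```python
try:
    import chardet  # third-party; assumed to be installed alongside normality
except ImportError:
    chardet = None


def guess_encoding(raw: bytes, fallback: str = "utf-8") -> str:
    """Hypothetical helper: return a best-guess encoding name for raw bytes."""
    # Strict UTF-8 decoding rarely succeeds by accident on non-UTF-8 text,
    # so try it before any statistical guess.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if chardet is not None:
        # chardet.detect returns a dict with "encoding" and "confidence" keys;
        # "encoding" can be None when it cannot decide.
        guess = chardet.detect(raw).get("encoding")
        if guess:
            return guess
    return fallback


print(guess_encoding("Plzeň".encode("utf-8")))
```

Statistical detection on short inputs can be unreliable, so an explicit user setting should probably still take precedence over any guess.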

There is also the csv.Sniffer class, which helps detect the correct delimiter without relying on user parameters for every deviation from the defaults.
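
As a small illustration of `csv.Sniffer`, the sketch below detects a tab delimiter from a text sample. The column names are taken from the command in the original report; the data rows are invented for the example, and restricting the candidate delimiters is an assumption of this sketch rather than something csv-reconcile is known to do.

```python
import csv

# Hypothetical sample; the column names come from the command in the report,
# the rows are made up.
sample = "item\tcoords\nQ1\tPoint(14.42 50.09)\nQ2\tPoint(16.61 49.19)\n"

# Restricting the candidate delimiters keeps the guess from latching onto
# characters such as spaces or parentheses inside the values.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")

rows = list(csv.reader(sample.splitlines(), dialect))
print(repr(dialect.delimiter))
print(rows[0])
```

`sniff` raises `csv.Error` when it cannot determine a delimiter, so real code would want a fallback to the configured or default dialect.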

@gitonthescene
Owner

gitonthescene commented Dec 1, 2021

@b2m - Thanks for the tips. I’ll have a look. I don’t believe the last release did either unless something changed in Python’s csv module.

@b2m
Contributor

b2m commented Dec 1, 2021

> I don’t believe the last release did either [...]

Exactly, the comment was meant as tips for improving the usability of csv-reconcile, to avoid most of the CSV encoding/reading problems =)

@jmacura
Author

jmacura commented Dec 1, 2021

> It looks like you need to specify the encoding in a config file. Just add the --config option with a file which contains the following

Oh, thank you! I wasn't aware of this config option. I guess this closes this issue.

> If you attach the first few lines of the file I can try it out for myself. Perhaps this is a system default?

The file is not large, I am attaching it whole in its original form (i.e., before I re-encoded it).
query.zip

Perhaps the right question here is: what default encoding does csv-reconcile expect for the .tsv file? For me, it was apparently cp-1250, but this must be wrong for the vast majority of files, so could it be platform-dependent?
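
One way to check this on a given machine is sketched below. When `open()` is called in text mode without an `encoding=` argument, Python has traditionally used the locale's preferred encoding, which on Windows is the system's legacy code page. The traceback above goes through `encodings\cp1250.py`, which is consistent with that. Whether csv-reconcile 0.3.0 relies on this default is an assumption here, not something confirmed in this thread.

```python
import locale
import sys

# Encoding that open() falls back to in text mode when no encoding= is given
# (for Python versions where the locale still decides the default).
default_encoding = locale.getpreferredencoding(False)

print("platform:", sys.platform)
print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
```

If the default turns out to be a legacy code page, running Python with the `PYTHONUTF8=1` environment variable (UTF-8 mode) is another possible workaround besides setting the encoding explicitly.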

@gitonthescene
Owner

Thanks. I’ll try to work in @b2m’s tips to auto-discover the encoding, but for now you should be able to use the override. Would you please let me know if you’re up and running using the override so I can close this issue?

Also, thanks for the file. I’ll use it to test the suggested features.

@gitonthescene
Owner

@jmacura FWIW, I did check that the tsv in the file you posted is using utf-8. I'm not sure why your system thought it should be encoded cp1250. In any event, I implemented @b2m's suggestions above to add encoding detection and that will be in the next release. This issue will close once that gets merged back into master.

@jmacura
Author

jmacura commented Dec 4, 2021

@gitonthescene Great, thank you for this improvement! Besides that, I can confirm that appending the line CSVENCODING = "utf-8" to config.txt (and passing --config config.txt) works around the problem as well. Thank you for the hint.

@woody544

FYSA: Windows uses a legacy code page such as cp1250 as its default encoding, which can cause hiccups like this; I ran into this problem with csv-reconcile as well.

I had solved it before I saw the above issue/solution by opening the reps.tsv file in a text editor and saving it as 'UTF-8 with BOM'. However, I expect that changing the configuration file as suggested in the issue is a more robust and lasting solution.

@gitonthescene
Owner

FWIW, this has gone out in the latest release.
