UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in Concept #345

Closed
mzeidhassan opened this issue Oct 13, 2021 · 2 comments

mzeidhassan commented Oct 13, 2021

Hi textacy team,

I run into this problem only when I use ConceptNet to get antonyms, meronyms, and hyponyms; it works just fine with synonyms. I have no clue why this is happening.

steps to reproduce

I am using the same code as in the documentation.

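import textacy.resources
rs = textacy.resources.ConceptNet()  # setup as in the textacy docs; the data must already be downloaded
rs.get_antonyms("love", lang="en", sense="v")
rs.get_hyponyms("marriage", lang="en", sense="n")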

environment

  • operating system: Windows 10
  • python version: 3.7.9
  • spacy version: 3.0.6
  • installed spacy models: en_core_web_sm
  • textacy version: 0.11.0

error:

0it [00:00, ?it/s]
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-322-2fc8ec20a447> in <module>
      1 # rs.get_synonyms("spouse", lang="en", sense="n")
      2 # ['mate', 'married person', 'better half', 'partner']
----> 3 rs.get_antonyms("love", lang="en", sense="v")
      4 # ['detest', 'hate', 'loathe']
      5 # >>> rs.get_hyponyms("marriage", lang="en", sense="n")

c:\users\m-lap\miniconda3\lib\site-packages\textacy\resources\concept_net.py in get_antonyms(self, term, lang, sense)
    273             List[str]
    274         """
--> 275         return self._get_relation_values(self.antonyms, term, lang=lang, sense=sense)
    276 
    277     @property

c:\users\m-lap\miniconda3\lib\site-packages\textacy\resources\concept_net.py in antonyms(self)
    258         """
    259         if not self._antonyms:
--> 260             self._antonyms = self._get_relation_data("/r/Antonym", is_symmetric=True)
    261         return self._antonyms
    262 

c:\users\m-lap\miniconda3\lib\site-packages\textacy\resources\concept_net.py in _get_relation_data(self, relation, is_symmetric)
    163             rows = tio.read_csv(self.filepath, delimiter="\t", quoting=1, encoding='cp437')
    164             with tqdm() as pbar:
--> 165                 for row in rows:
    166                     pbar.update(1)
    167                     _, rel_type, start_uri, end_uri, _ = row

c:\users\m-lap\miniconda3\lib\site-packages\textacy\io\csv.py in read_csv(filepath, encoding, fieldnames, dialect, delimiter, quoting)
     96                 f, dialect=dialect, delimiter=delimiter, quoting=quoting, encoding='cp437'
     97             )
---> 98         for row in csv_reader:
     99             yield row
    100 

c:\users\m-lap\miniconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 49: character maps to <undefined>

Thanks

@bdewilde
Collaborator

Hi @mzeidhassan, thanks for submitting; this is definitely a bug! I'm not able to test against Windows, so this sort of error is hard for me to catch. The issue here is that textacy opens the raw ConceptNet CSV file without specifying an encoding, so it falls back on the system's default. On macOS and Linux systems, the default "utf-8" decodes the data just fine; on Windows, I believe the default is CP1252, which apparently can't decode all of these bytes.
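
To make the mismatch concrete, here is a minimal sketch (plain Python, not textacy code) that decodes the same bytes with both codecs. The sample string and the `try_decode` helper are assumptions for illustration; the string was picked only because its UTF-8 encoding contains byte 0x81, which Python's CP1252 codec leaves undefined.

```python
import locale

# Hypothetical sample: U+00C1 encodes to the two bytes 0xC3 0x81 in UTF-8.
raw = "\u00c1gua".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed: {exc}"


utf8_result = try_decode(raw, "utf-8")
cp1252_result = try_decode(raw, "cp1252")

print("utf-8 :", utf8_result)
print("cp1252:", cp1252_result)

# The encoding that open() falls back on when none is given depends on the
# platform and locale, so this line prints different values on different machines.
print("locale default:", locale.getpreferredencoding(False))
```

The CP1252 attempt should fail with the same kind of 'charmap' message as in the traceback above, while the UTF-8 attempt returns the original text.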

I'm pretty sure that the solution is to add encoding="utf-8" to this line: https://github.com/chartbeat-labs/textacy/blob/main/src/textacy/resources/concept_net.py#L163. I'm going to push that commit to the develop branch, but again, I can't test against Windows. If you're able, please make the change on your local copy (or fetch textacy from the develop branch), then try again. Please let me know if it doesn't solve your issue!
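
As a rough sketch of the idea (not the actual textacy change), reading a tab-separated file with an explicit encoding looks something like the following. The `read_tsv_rows` helper, the file name, and the sample row are all made up for illustration; only the standard-library `csv` and `open()` calls are real.

```python
import csv
import tempfile
from pathlib import Path


def read_tsv_rows(filepath, encoding="utf-8"):
    """Yield rows from a tab-separated file, decoding with an explicit encoding."""
    # Passing encoding explicitly avoids the platform-dependent default
    # that open() would otherwise use.
    with open(filepath, mode="r", encoding=encoding, newline="") as f:
        yield from csv.reader(f, delimiter="\t")


# Demo with made-up data: write a small UTF-8 file containing a non-ASCII
# term, then read it back while the temporary directory still exists.
with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "sample.tsv"
    path.write_bytes("/r/Antonym\t/c/pt/\u00e1gua\t/c/pt/fogo\n".encode("utf-8"))
    rows = list(read_tsv_rows(path))

print(rows)
```

Because the encoding is fixed in the call rather than inherited from the system, the same file should decode identically on Windows, macOS, and Linux.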

@mzeidhassan
Author

Hi @bdewilde and welcome back!
This was indeed the fix. I tested it and it works perfectly now.
Thanks for your support!
