utf8 encoding bug in Finding Places in Text #2783

Closed
srappel opened this issue Dec 13, 2022 · 3 comments · Fixed by #2857

Comments

@srappel

srappel commented Dec 13, 2022

I'm reporting a bug and a potential workaround in Finding Places in Text with the World Historical Gazetteer.

The specific part of the lesson is Section 6.1

The following line of Python code from the lesson produces a UnicodeDecodeError for me:
gazetteer = Path("gazetteer.txt").read_text()

Specifying the UTF-8 encoding explicitly resolves the error:
gazetteer = Path("gazetteer.txt").read_text('utf8')

I'm using a Windows PC and Python 3.9 (Anaconda).
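
For readers who want to see the behaviour described above, here is a minimal sketch. It assumes the Windows default encoding in play was cp1252, which the report does not state, and it uses a small throwaway file with made-up place names rather than the lesson's actual gazetteer.txt. It shows that UTF-8 bytes can fail to decode as cp1252, while naming the encoding in read_text works regardless of platform.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the lesson's gazetteer.txt, saved as UTF-8.
tmp_dir = Path(tempfile.mkdtemp())
gazetteer_path = tmp_dir / "gazetteer.txt"
gazetteer_path.write_text("Москва\nKöln\n", encoding="utf-8")

# Assumption: the Windows default was cp1252. Decoding the UTF-8 bytes
# with that codec fails because some byte values are undefined in cp1252.
raw = gazetteer_path.read_bytes()
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Naming the encoding makes the read independent of the platform default.
gazetteer = gazetteer_path.read_text(encoding="utf-8")
places = gazetteer.splitlines()
print(cp1252_failed, places)
```

Passing the encoding as a keyword (encoding="utf-8") is equivalent to the positional 'utf8' shown in the report and is slightly easier to read.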

@anisa-hawes
Contributor

Thank you for letting us know about this, @srappel!

I will test this. If I encounter the same error, I'll check whether the workaround you suggest solves it on my side too.

Many thanks,
Anisa

@anisa-hawes
Contributor

anisa-hawes commented Feb 3, 2023

I am able to work through the steps of the lesson on Google Colab including step 6.1, and can successfully load and print the gazetteer in Google Colab. Here's the link to the notebook I worked in.

I also tested these steps on my Mac, running macOS Big Sur and Python 3.10. I wrote my code in Atom and ran it from the command line. I was able to work through tokenising the sentence Berlin ist eine Stadt in Deutschland, but then ran into a few unexpected problems at step 6.1 too.

Traceback (most recent call last):
  File "/Users/anisahawes/Desktop/test-whg.py", line 3, in <module>
    gazetteer = Path("gazetteer.txt").read_text()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1132, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1117, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'gazetteer.txt'

I tried your suggestion of changing the line gazetteer = Path("gazetteer.txt").read_text() to gazetteer = Path("gazetteer.txt").read_text('utf8'), but it didn't work for me. I encountered the following error:

  File "/Users/anisahawes/Desktop/test-whg.py", line 3, in <module>
    gazetteer = Path("gazetteer.txt").read_text('utf8')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1132, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1117, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'gazetteer.txt'

I wonder if @hawc2 might be able to advise? One of the authors is Andy Janco, who we might be able to reach out to with this question.

--

Notes to myself:

This Stack Overflow post could be useful, but I don't understand it well enough to implement it at the moment.

This comment I found on a Python community GitHub repository could also be relevant.

Looking at the traceback errors I got, I think this could be the key line: with self.open(mode='r', encoding=encoding, errors=errors) as f:

@hawc2
Contributor

hawc2 commented Feb 6, 2023

@anisa-hawes the solution proposed by @srappel makes sense to me. Specifying the encoding in the read function is pretty good standard practice.

I believe the error you're running into, Anisa, just means the gazetteer.txt file can't be located by the Python script you're running, and it's probably not something to worry about for updating the tutorial. It sounds like a quick fix to me: just specify UTF-8 explicitly in that line of code.
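
As an illustration of the file-location point, here is a hedged sketch. The load_gazetteer helper and the throwaway demo folder are hypothetical and not part of the lesson. A relative path such as "gazetteer.txt" is resolved against the current working directory, so the sketch builds the path from an explicit folder and reports where it looked if the file is missing.

```python
import tempfile
from pathlib import Path


def load_gazetteer(base_dir, filename="gazetteer.txt"):
    """Hypothetical helper: read the gazetteer from an explicit folder."""
    path = Path(base_dir) / filename
    if not path.is_file():
        # Report the full path that was tried and the current working directory.
        raise FileNotFoundError(
            f"Could not find {path} (working directory: {Path.cwd()})"
        )
    return path.read_text(encoding="utf-8")


# Demo with a throwaway folder standing in for the lesson folder.
# In a script, base_dir could be Path(__file__).resolve().parent instead.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "gazetteer.txt").write_text("Berlin\nDeutschland\n", encoding="utf-8")
text = load_gazetteer(demo_dir)
print(text.splitlines())
```

Running the script from the folder that contains gazetteer.txt, or saving the file next to the script, would likely avoid the FileNotFoundError shown in the tracebacks above.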
