utf8 encoding bug in Finding Places in Text #2783

srappel · 2022-12-13T16:43:20Z

I'm reporting a bug and a potential workaround in Finding Places in Text with the World Historical Gazetteer.

The specific part of the lesson is Section 6.1

The following line of Python code from the lesson produces a UnicodeDecodeError for me:
gazetteer = Path("gazetteer.txt").read_text()

But by simply specifying UTF-8 encoding, as in:
gazetteer = Path("gazetteer.txt").read_text('utf8')
the error is then resolved.

I'm using a Windows PC and Python 3.9 (Anaconda).
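A minimal sketch of why the explicit encoding helps, assuming the lesson's gazetteer.txt is saved as UTF-8. When no encoding is given, read_text() falls back to the locale's preferred encoding, which on Windows is often not UTF-8. The sample place names and the temporary file below are hypothetical stand-ins for the lesson's data, and the ASCII decode only simulates a non-UTF-8 default; it is not the exact codec a given Windows machine would use.

```python
import tempfile
from pathlib import Path

# Hypothetical sample data standing in for the lesson's gazetteer.txt.
sample = "Köln\nMünchen\nSaint-Étienne\n"

with tempfile.TemporaryDirectory() as tmp:
    gazetteer_path = Path(tmp) / "gazetteer.txt"
    gazetteer_path.write_text(sample, encoding="utf-8")

    # Explicit encoding: does not depend on the platform's default.
    gazetteer = gazetteer_path.read_text(encoding="utf-8")

    # Simulate a non-UTF-8 default by decoding the same bytes as ASCII.
    # (Assumption: this stands in for whatever the locale default is.)
    try:
        gazetteer_path.read_bytes().decode("ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

print(gazetteer)
print("ASCII decode failed:", ascii_failed)
```

Passing the encoding by keyword, as in read_text(encoding="utf-8"), does the same thing as the positional read_text('utf8') and makes the intent a little easier to see.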


anisa-hawes · 2022-12-14T14:16:49Z

Thank you for letting us know about this, @srappel!

I will test this. If I encounter the same error, I'll check whether the workaround you suggest solves it on my side too.

Many thanks,
Anisa

anisa-hawes · 2023-02-03T20:19:18Z

I am able to work through the steps of the lesson on Google Colab including step 6.1, and can successfully load and print the gazetteer in Google Colab. Here's the link to the notebook I worked in.

I also tested these steps on my Mac, running macOS Big Sur and Python 3.10. I wrote my code in Atom and ran it from the command line. I was able to work through to tokenising the sentence Berlin ist eine Stadt in Deutschland, but then ran into unexpected problems at step 6.1 too.

Traceback (most recent call last):
  File "/Users/anisahawes/Desktop/test-whg.py", line 3, in <module>
    gazetteer = Path("gazetteer.txt").read_text()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1132, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1117, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'gazetteer.txt'

I tried your suggestion of changing the line gazetteer = Path("gazetteer.txt").read_text() to gazetteer = Path("gazetteer.txt").read_text('utf8') but it didn't work for me. I encountered the following errors:

  File "/Users/anisahawes/Desktop/test-whg.py", line 3, in <module>
    gazetteer = Path("gazetteer.txt").read_text('utf8')
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1132, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1117, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: 'gazetteer.txt'

I wonder if @hawc2 might be able to advise? One of the authors is Andy Janco, who we might be able to reach out to with this question.

--

Notes to myself:

This Stack Overflow post could be useful, but I don't understand it well enough to implement it at the moment.

This comment I found on a Python community GitHub repository could also be relevant.

Looking at the traceback errors I got, I think this could be the key line: with self.open(mode='r', encoding=encoding, errors=errors) as f:

hawc2 · 2023-02-06T22:17:02Z

@anisa-hawes the solution proposed by @srappel makes sense to me. It's good standard practice to specify the encoding when calling the read function.

I believe the error you're running into, Anisa, just relates to the gazetteer.txt file not being locatable by the Python script you're running, and is probably not a problem to worry about for updating the tutorial. It sounds like a quick fix to me: just specify UTF-8 in that line of code.
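For anyone hitting the same FileNotFoundError, here is a rough sketch of one way to make the lookup more explicit. A relative path such as "gazetteer.txt" is resolved against the current working directory, not the folder containing the script, so running the script from another directory can fail even when the file sits next to it. The helper name read_gazetteer and its base_dir parameter are hypothetical and not part of the lesson.

```python
from pathlib import Path


def read_gazetteer(filename="gazetteer.txt", base_dir=None):
    """Hypothetical helper: read a UTF-8 gazetteer file with a clearer error.

    base_dir defaults to the current working directory, which is where
    a bare relative path like "gazetteer.txt" is looked up anyway.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = base / filename
    if not path.is_file():
        # Report both the path that was tried and the working directory.
        raise FileNotFoundError(
            f"Could not find {path}. "
            f"Current working directory: {Path.cwd()}"
        )
    return path.read_text(encoding="utf-8")
```

In a script, passing base_dir=Path(__file__).parent would look for the file next to the script itself, regardless of where the command is run from.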

anisa-hawes self-assigned this Dec 14, 2022

anisa-hawes added Lesson Maintenance English labels Dec 14, 2022

anisa-hawes mentioned this issue Feb 9, 2023

Update finding-places-world-historical-gazetteer.md #2857

Merged

7 tasks

anisa-hawes closed this as completed in #2857 Feb 10, 2023
