diff --git a/en/lessons/transliterating.md b/en/lessons/transliterating.md
old mode 100755
new mode 100644
index cf09e0ecd..4feebbfeb
--- a/en/lessons/transliterating.md
+++ b/en/lessons/transliterating.md
@@ -61,23 +61,6 @@ used in texts. Finally, transliteration can be more practical for authors
 who can type more fluently with Latin letters than in the native alphabet
 of a language that does not use Latin characters.
 
-Programming languages like Python also benefit from transliteration.
-Python handles Cyrillic relatively well in certain environments, like
-[Terminal][] for MacOS or Linux, or in Windows, [IDLE][], the official
-Python integrated development environment. However, even in these Python
-converts non-ASCII characters into code. Other environments, like the
-Python shell for Windows (command line) or [Komodo Edit][], know Unicode
-but will not print the Cyrillic characters that Unicode represents
-without tricky additional configuration. In environments that do support
-Cyrillic, switching between a Latin character set to write code and a
-non-Latin character set to handle inputs can be tedious. Thus, creating
-a program to transliterate evidence automatically eliminates the step of
-transliteration for researchers and it converts the text into a format
-that Python can handle more readily. **This lesson was built and tested
-using IDLE for Windows and Terminal for MacOS. The author strongly
-recommends that you follow along using the program tested on your
-operating system rather Windows Command Prompt or Komodo Edit.**
-
 This lesson will be particularly useful for research in fields that use
 a standardized transliteration format, such as Russian history field,
 where the convention is to use a simplified version of the American
@@ -125,17 +108,17 @@ our purposes, what is important is that the encoding is stored under the
 
 ``` python
 #transliterator.py
-import urllib2
+from urllib.request import urlopen
 
-page = urllib2.urlopen('http://lists.memo.ru/d1/f1.htm')
+page = urlopen('http://lists.memo.ru/d1/f1.htm')
 
 #what is the encoding?
-print page.headers['content-type']
+print(page.headers['content-type'])
 ```
 
 Under the ‘content-type’ key we find this information:
 
-``` python
+```
 text/html; charset=windows-1251
 ```
 
@@ -144,7 +127,7 @@ accessed is in HTML and that its encoding (after ‘charset=’, meaning
 character set) is ‘windows-1251′, a common encoding for Cyrillic
 characters. You can visit the webpage and view the Page Source and see
 for yourself that the first line does in fact contain a ‘content-type’
-variable with the value text/html; charset=windows-1251. It would not be
+variable with the value `text/html; charset=windows-1251`. It would not be
 so hard to work with the ‘windows-1251′ encoding. However,
 ‘windows-1251′ is specifically for Cyrillic and will not handle all
 languages. For the sake of learning a standard method, what we want is
@@ -157,7 +140,7 @@ allow.
 
 How do you convert the characters to Unicode? First, Python needs to
 know the original encoding of the source, ‘windows-1251.’ We could just
-assign ‘windows-1251′ to a variable by typing it manually but the
+assign ‘windows-1251’ to a variable by typing it manually but the
 encoding may not always be ‘windows-1251.’ There are other character
 sets for Cyrillic, not to mention other languages. Let’s find a way to
 make the process more automatic for those cases. It helps that the
@@ -179,9 +162,10 @@ encoding = page.headers['content-type'].split('charset=')[1]
 The encoding is assigned to the variable called ‘*encoding*’. You can
 check to see if this worked by printing the ‘*encoding*’ variable. Now
 we can tell Python how to read the page as Unicode. Using the
-`unicode(object [, encoding])` method turns a string of characters into a
-Unicode object. A Unicode object is similar to a string but it can
-contain special characters. If they are in a non-ASCII character set,
+`str(object [, encoding])` method turns a text encoded in a specific encoding
+into a generic Unicode string. A Unicode string cannot only contain ASCII
+characters, but also
+special characters. If the original text is in a non-ASCII character set,
 like here with ‘windows-1251’, we have to use the optional encoding
 parameter.
 
@@ -190,44 +174,32 @@ parameter.
 content = page.read()
 
 # the unicode method tries to use ASCII so we need to tell it the encoding
-content = unicode(content, encoding)
+content = str(content, encoding)
 
 content[200:300]
 ```
 
- 
-
 ``` python
-u'"list-right">\r\n
-