
04.08 Preparing translated texts for SuttaCentral


Introduction

This page gives general guidelines for transforming a text from its original format for use on SuttaCentral. Obviously the exact process will vary tremendously depending on the original format. Since the process is complex and differs in each case, SuttaCentral is happy to offer what assistance we can. But of course, we appreciate it if helpers show initiative and resourcefulness!

There is a complementary description of the process of preparing texts at suttacentral.net/tools.

Identification

  • Identify what texts exist in the relevant language.

  • Assess the quality of the translations. We aim for the highest standards of fidelity and readability. Rarely are these two things found well balanced. We prefer translations made directly from the original language. If a translation is made from the English, we prefer one based on Bodhi rather than Thanissaro.

Copyright

  • Determine the copyright status of the translation. Generally speaking, original texts are not subject to copyright, but translations generally are regarded as being so. If the licence allows reproduction, there is no problem; if not, we get in touch with the copyright holder and ask for permission. We must ensure that an appropriate copyright notice is included with the translation.

Sources

  • Obtain source files for digital texts. These will often be HTML webpages. In some cases they may be PDFs, Word documents, and so on.

  • Type or OCR printed texts. In some cases the translations may only exist as printed texts. In such cases we will create digital files from these. This will be either via OCR or by typing afresh. We can use outsourcing services to do this.

  • Determine encoding. SuttaCentral is in Unicode (UTF-8). If the source file is not UTF-8, we find out how to convert the encoding.

  • Determine the structure of the files. Hopefully the files will have a meaningful structure, with marked-up headings and the like. Often this will not be properly semantic, as we usually find with HTML; however, it is often possible to convert it to semantic HTML programmatically. The main thing is that there should be something in the files that indicates the headings, verses, uddānas, and other structural elements. If not, these must be added.

Tools

SuttaCentral has developed a range of tools to help with this process. See suttacentral.net/tools.

Programmers will of course use whatever tools they are used to. For others, we recommend using a good text editor, such as Sublime Text, Notepad++, Geany, Kate, etc.

It is not necessary to be a programmer to prepare these texts; however, a reasonable facility with IT is required. Regular expressions are a wonderful way to make large-scale changes on massive bodies of text.
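For instance, here is a minimal Python sketch of the kind of change a regex makes easy, using the third-party regex module (the standard re module works just as well here); the patterns are deliberately simplistic and would need refining against a real text:

import regex

text = 'He said, "It is so." See sections 4-6.'

# A straight double quote after the start of text, whitespace, or an
# opening parenthesis is an opening quote; any other is a closing quote.
text = regex.sub(r'(^|[\s(])"', '\\1“', text)
text = regex.sub(r'"', '”', text)

# En dash for numeric ranges such as 4-6.
text = regex.sub(r'(?<=\d)-(?=\d)', '–', text)

print(text)  # He said, “It is so.” See sections 4–6.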

Random note: Sublime Text has a bug that prevents it from inputting special characters via the Linux Compose key. The fix is here: http://whitequark.org/blog/2014/04/14/xcompose-support-in-sublime-text/

Conversion

  • Basics. Run scripts to convert to Unicode if necessary. In the case of Word documents, poorly formatted HTML, and the like, we need to clean out all the cruft, using HTML Tidy and other tools (see the sketch after this list). Sujato and Nandiya have experience doing this.

  • Create HTML. Each text for SuttaCentral must be in completely clean and semantic HTML. Please follow the template given here to the letter. This means NO inline styles and the like, no new lines except on block-level elements, no using <br> tags to make paragraphs, and so on. Also note that we use HTML5 exclusively, so don't use XHTML-style self-closing forms such as <br/> and <hr/>.
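As a first pass at the cleaning mentioned above, here is a sketch using the Cleaner class from lxml; it strips scripts, comments, <style> blocks, and inline style attributes, but producing properly semantic markup still takes hand work. The filenames are stand-ins for your actual files.

from lxml.html.clean import Cleaner

cleaner = Cleaner(style=True,            # remove <style> blocks and style="" attributes
                  scripts=True,          # remove <script> elements
                  comments=True,         # remove HTML comments
                  page_structure=False)  # keep <html>, <head>, <body> for now

with open('sutta.html', encoding='utf8') as f:
    cleaned = cleaner.clean_html(f.read())

with open('sutta-clean.html', 'w', encoding='utf8') as f:
    f.write(cleaned)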

Process

Here is a description of the process used to prepare many texts on SC, as developed by Bhante Sujato. Each case is different, so this represents some useful tips rather than hard-and-fast rules. All tools are for Linux. If you use inferior so-called “operating systems”, this is your kamma, deal with it!

First, get all the texts in one file. Usually I work on a nikaya or equivalent at a time. To download all the suttas from a site, use wget. Something like:

wget -r -l 1 -A htm,html siteurl

This will download all the links on the page “siteurl”, to a depth of one link, taking only HTML pages.

Once the suttas are in a folder, check that they are unicode. If not convert them, using something like:

find . -name '*.html' -exec iconv -f cp1251 -t utf8 -o "{}.utf8" "{}" \;

This will convert all files in the folder with the suffix .html from the encoding cp1251 to Unicode (UTF-8), adding the suffix .utf8.
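If you prefer, the same conversion can be done with a few lines of Python; cp1251 here is just an example, so substitute whatever encoding your source actually uses:

import pathlib

# Convert every .html file in the current folder from cp1251 to UTF-8,
# writing the result alongside the original with a .utf8 suffix.
for path in pathlib.Path('.').glob('*.html'):
    text = path.read_text(encoding='cp1251')
    path.with_name(path.name + '.utf8').write_text(text, encoding='utf8')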

Next, if you like, you can subject the texts to an initial clean using HTML Tidy. Either use it locally, or use the version at http://suttacentral.net/tools/cleanup-bad-html. Try a few different settings, but for the first go, “merciless” is usually best. Fix any problems that Tidy finds.

Then put all the files into a single HTML file. Use

cat `ls -v *.html` >all.html

This will concatenate all the files in “natural order” into all.html.

Then go through and clean the HTML using find-and-replace and regexes.

There are two methods that are useful here. One is to use the HTML tags provided in the files. If they use, say, heading tags, keep them and ensure that they are at the proper level. Sometimes you can discern regular patterns in the HTML that can be converted to SC usage.

However, often the HTML in source files is so poorly formed that this method is of dubious value. In such cases, you will be better off relying on patterns in the text itself. For example, verses will tend to be sequences of short lines; headings will tend to be a single short line, without punctuation at the end; and so on. You can recognize these patterns, even in a language you don't understand, and use them in your regexes. I prepared the entire canon in Portuguese using this method, lacking any HTML as a guide.
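A sketch of what that can look like in Python: mark a lone short line between blank lines as a heading. The length cut-off and the <h2> level are guesses that must be tuned for each source, and runs of short lines (likely verse) would need a similar rule of their own.

import pathlib
import regex

def looks_like_heading(line):
    # Headings tend to be a single short line with no punctuation at the end.
    line = line.strip()
    return 0 < len(line) < 60 and not regex.search(r'[.,;:!?]$', line)

lines = pathlib.Path('all.html').read_text(encoding='utf8').splitlines()
out = []
for i, line in enumerate(lines):
    prev_blank = i == 0 or not lines[i - 1].strip()
    next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
    if prev_blank and next_blank and looks_like_heading(line):
        out.append('<h2>{}</h2>'.format(line.strip()))
    else:
        out.append(line)

pathlib.Path('all-marked.html').write_text('\n'.join(out), encoding='utf8')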

You will also have to locate and remove any footnotes. A simple regex will usually clean off the footnote anchors, but the footnote text can be difficult to isolate, so be careful.
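For instance, a sketch that removes anchors of one hypothetical shape; inspect the source to see what it actually emits, and note that the footnote text itself, usually gathered at the bottom of the file, is safer to isolate by hand:

import regex

text = 'Thus have I heard.<sup><a href="#fn1">1</a></sup>'

# Remove superscripted footnote anchors of this particular shape.
text = regex.sub(r'<sup><a[^>]*>\d+</a></sup>', '', text)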

When the text is ready, ensure that the file structure is good. Each sutta must, in accordance with the template, begin and end with a single enclosing tag, which is marked with the ID of that text. To split the files into their individual suttas, you can use a small Python utility called split.py. The code is pasted at the end of this page; alternatively, if you like, just email the texts to Ven Sujato or Ven Nandiya and we can do it. How it works is, each sutta must be enclosed in a set of tags <split name="ID.html">...</split>. Then you just run split.py and it will create individual files with the appropriate names (ID.html). There are a few finicky things here: first, split.py doesn't handle multiple instances of tags that are meant to be singletons, like <!doctype>, <head>, and so on, so add these later. Also, you must specify <meta charset="utf8"> at the start of the file.

Finally, add the <!doctype> and so on, and you should be done.

Marked up references

In many cases reference data is included as plain text in a sutta file. This will typically include such things as the page number of the printed edition, and so on. In some cases there are several kinds of reference data in one text. For SuttaCentral all such reference data is marked up using distinct classes.

If your text uses a common form of reference, for example the page numbers of the PTS or Wisdom Publications editions, then use an existing tag. If it is unique, then assign your own class for the tag.

So, let us assume this is the same as a text published by the Pali Text Society <a class="pts" id="pts45"></a> or one published by Wisdom <a class="wp" id="wp5"></a>. If it is a unique case then assign some meaningful class to it, such as the initials of the author <a class="rp" id="rp34"></a>. That's all you have to do in the text itself.

Note that the numbers will be absolutely positioned, which means that it doesn't matter where they occur in the line; they will appear in the correct place in the margin. If you are unsure how to use the tags, consult with us.

In addition to using the tags, maintain a separate list of all the tags you use. That list should consist of pairs: the tag, and a description of the tag. This tells us exactly how the tag is used, and the description can also be inserted as extra information in the text. For example:

wp    Paragraph number in the Wisdom Publications edition.
rp    Section number in the Random Person edition.

Technical details for reference tags markup

A new paragraph number class has to be registered in three files:

  • css/utf8.css
  • css/text/paragraphnum.scss (Once or twice if you want custom color/position)
  • js/sc_init.js

We'd like to move to a more unified scheme where the details are defined only once (presumably in JSON), but that's how it works at the moment.

The 'canonical' way of markup for paragraph numbers is this:

<a class="bps" id="bps1"></a>

A little bit of magic happens: the content of "id" has the class name stripped off (so it becomes just 1) and is used as the text value; then you use the CSS :before for the prefix:

<a class="bps" id="bps1">1</a>

If the magic can't happen because the id does not start with the class name, e.g.:

<a class="pnum" id="BPS1"></a>

Then it will use the id as-is for the text content:

<a class="pnum" id="BPS1">BPS1</a>
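The real logic lives in the site's JavaScript; restated here as a Python sketch of the rule just described:

def pnum_text(class_name, id_value):
    # The id minus the class name becomes the visible text (the CSS
    # :before rule supplies the prefix). If the id does not start with
    # the class name, the id is used as-is.
    if id_value.startswith(class_name):
        return id_value[len(class_name):]
    return id_value

pnum_text('bps', 'bps1')    # -> '1'
pnum_text('pnum', 'BPS1')   # -> 'BPS1'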

It will also automatically insert a title attribute, if an entry is found in sc_init.js in the mapping of class names to titles. This saves needing to duplicate the title and means it is defined only once.

Niceties

Texts on SuttaCentral are distinguished by attention to detail in all matters of presentation. Take the time and care to ensure that the texts you are working on meet our standards.

  • Punctuation. Ensure all punctuation is correct. Main examples include “proper quotation marks”, not "straight marks"; correct “nesting of ‘quote marks’”; correct use of dashes (a hyphen to connect run-on words, an en dash for ranges of numbers, such as sections 4–6, and an em dash to offset text—like this—from the main sentence; none of these are spaced); a single space after a period; and probably a few more. Nandiya's script at suttacentral.net/tools does a great job of cleaning and correcting these things, but you need to check by hand.

  • Local usage. Different languages use different punctuation styles for quotes and the like. Ensure that your text uses the correct style for your language. For Chinese texts we use proper full-width punctuation. Note that what you see on the web is, if it's anything like English, not always the correct style. If in doubt, consult a proper style guide for your language. If there are some special items that SuttaCentral does not currently support, let us know and we will fix it. In addition, ensure that you use the correct Unicode glyph for any special characters in your language. If any of these are absent from the fonts we use on SuttaCentral, we will work with the font designers to include them.

  • Diacritical marks. Ensure that if your text uses any Pali terms, it has the correct diacritical marks. Do not accept substitutes, such as Velthuis with its doubled aa's. If there are no diacriticals, they should be added. If there are a lot of them, we can help: we run the file against a dictionary in your language, extract the non-native words, add the marks, then put them back again (see the sketch after this list).
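A sketch of the first step of that process: list the words in the text that a wordlist for the translation language doesn't know, which will mostly be Pali terms needing their diacritics restored. Both wordlist.txt (one word per line) and all.html are hypothetical filenames.

import regex

with open('wordlist.txt', encoding='utf8') as f:
    known = {w.strip().lower() for w in f}

with open('all.html', encoding='utf8') as f:
    words = regex.findall(r'\p{L}+', f.read())

# Report the unfamiliar words, sorted and deduplicated.
unknown = sorted({w for w in words if w.lower() not in known})
print('\n'.join(unknown))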

split.py

#!/usr/bin/env python3
"""Split HTML or XML files into sub-files.

The contents of every <split>...</split> tag range will be put into
individual files, with no header or footer. The <split> tags themselves
will not appear in the output files. Do not include multiple instances
of singleton tags, like <!doctype>, <head>, etc. Also, your file must
include <meta charset="utf8"> at the start, otherwise the encoding will
be corrupted. This will work even if there is no <head>.

"""

import regex
import pathlib
import argparse
import lxml.html
import lxml.etree
import logging

logger = logging.getLogger(__name__)

def parseargs():
    parser = argparse.ArgumentParser(description='Split a file on <split name="foo">')
    parser.add_argument('file', type=pathlib.Path)
    parser.add_argument('--out', type=pathlib.Path, default=None, help="Defaults to same location as source file")
    return parser.parse_args()


args = parseargs()
if not args.file.exists():
    logger.error('File `{}` does not exist'.format(args.file))
    exit(1)

if args.out is None:
    args.out = args.file.parent

if not args.out.exists():
    args.out.mkdir(parents=True)

XML = '.xml'
HTML = '.html'

modes = {XML, HTML}

mode = args.file.suffix

if mode not in modes:
    logger.error('Only works for .html and .xml files')
    exit(1)

if mode == HTML:
    doc = lxml.html.parse(str(args.file))
elif mode == XML:
    doc = lxml.etree.parse(str(args.file))
    
splits = doc.findall('.//split')
if not splits:
    logger.error('No <split> tags found')
    exit(1)
    
for i, split in enumerate(splits):
    name = split.get('name', None)
    if not name:
        # Fall back to the source file's name plus a counter.
        name = '{}-{}'.format(args.file.stem, i)
    outfile = (args.out / name).with_suffix(args.file.suffix)
    string = lxml.etree.tostring(split, encoding='unicode',
                                 method={XML: 'xml', HTML: 'html'}[mode])
    # Drop the enclosing <split> tags, keeping only their contents.
    string = regex.sub(r'(?s)<split[^>]*>(.*)</split>', r'\1', string)
    with outfile.open('w', encoding='utf8') as f:
        f.write(string)