Bitextor uses a configuration file to define the variables required by the pipeline. Depending on the options defined in this configuration file, the pipeline can behave differently, running alternative tools and functionalities. The following is an exhaustive overview of all the options that can be set in the configuration file and how they affect the pipeline.
Suggestion: a configuration wizard called `bitextor-config` gets installed with Bitextor to help with this task. Furthermore, a minimalist configuration file sample is provided in this repository. You can take it as a starting point by changing all the paths to match your environment.
The current pipeline consists of the following steps:
- Crawling
- Plain text extraction
- Sharding
- Sentence splitting
- Translation
- Tokenisation (source and translated target)
- Document alignment
- Segment alignment
- Cleaning and filtering
The following is a description of the configuration related to each step, as well as the basic variables.
There are a few variables that are mandatory for running Bitextor, independently of the task to be carried out, namely the ones related to where final & intermediate files should be stored.
```yaml
permanentDir: ~/permanent/bitextor-output
dataDir: ~/permanent/data
transientDir: ~/transient
tempDir: ~/transient
```
- `permanentDir`: will contain the final results of the run, i.e. the parallel corpus built
- `dataDir`: will contain the results of crawling (WARC files) and files generated during preprocessing (plain text extraction, sharding, sentence splitting, tokenisation and translation), i.e. every step up to document alignment
- `transientDir`: will contain the results of intermediate steps related to document and sentence alignment, as well as cleaning
- `tempDir`: will contain temporary files that are needed by some steps and removed immediately after they are no longer required
There are some optional parameters that allow for finer control of the execution of the pipeline: it is possible to configure some jobs to use more than one core, and it is possible to run Bitextor only partially by specifying which step should be the last one.
```yaml
until: preprocess
parallelJobs: {translate: 1, docalign: 2, segalign: 2, bicleaner: 1}
parallelWorkers: {translate: 4, docalign: 8, segalign: 8, bicleaner: 2, mgiza: 2}
profiling: True
verbose: True
```
- `until`: the pipeline executes until the specified step and stops. The resulting files will not necessarily be in `permanentDir`; they can also be found in `dataDir` or `transientDir` depending on the rule. Allowed values: `crawl`, `preprocess`, `shard`, `split`, `translate`, `tokenise`, `tokenise_src`, `tokenise_trg`, `docalign`, `segalign`, `bifixer`, `bicleaner`, `filter`
- `parallelJobs`: a dictionary specifying the number of Snakemake jobs which will be running in parallel. By default, all the jobs will run in parallel, limited only by the number of cores or threads provided to Bitextor (check `-c` and `-j` among the Snakemake CLI arguments). This option might be useful for cases where, e.g., you need to limit the resources for a specific job (e.g. running Bicleaner on GPU when only 1 GPU is available). Allowed values: `split`, `translate`, `tokenise`, `docalign`, `segalign`, `bifixer` and `bicleaner`.
- `parallelWorkers`: a dictionary specifying the number of cores or threads that should be used for a tool (this might be done through `parallel` or the native configuration of the specific tool). Allowed values: `split`, `translate`, `tokenise`, `docalign`, `segalign`, `bifixer`, `bicleaner`, `filter`, `sents` and `mgiza`. Be aware that, if the value provided for `mgiza` is greater than 1, the result will not be deterministic (check out this issue for more information).
- `profiling`: use the `/usr/bin/time` tool to obtain profiling information about each step.
- `verbose`: output more details about the pipeline execution.
The next set of options refers to the sources from which data will be harvested. It is possible to specify a list of websites to be crawled and/or a list of WARC files that contain pre-crawled websites. Both can be specified either via a list of sources directly in the config file, or via a separate gzipped file that contains one source per line. It is also possible to specify a local directory containing files in different formats (pdf, docx, doc...). The directory can contain subdirectories with more documents.
hosts: ["www.elisabethtea.com","vade-antea.fr"]
hostsFile: ~/hosts.gz
warcs: ["/path/to/a.warc.gz", "/path/to/b.warc.gz"]
warcsFile: ~/warcs.gz
preverticals: ["/path/to/a.prevert.gz", "/path/to/b.prevert.gz"]
preverticalsFile: ~/preverticals.gz
directories: ["/path/to/dir_1", "/path/to/dir_2"]
directoriesFile: ~/directories.gz
- `hosts`: list of hosts to be crawled; the host is the part of the URL of a website that identifies the web domain, i.e. the URL without the protocol and the path. For example, in the case of the URL https://github.com/bitextor/bitextor the host would be github.com
- `hostsFile`: a path to a file that contains a list of hosts to be crawled; in this file each line should contain a single host, written in the format described above (a sketch of how to build such a file is shown after this list)
- `warcs`: specify one or multiple WARC files to use; WARC files must contain individually compressed records
- `warcsFile`: a path to a file that contains a list of WARC files to be included in parallel text mining (similar to `hosts` and `hostsFile`)
- `preverticals`: specify one or multiple prevertical files to use; prevertical files are the output of the SpiderLing crawler
- `preverticalsFile`: a path to a file that contains a list of prevertical files to be included in parallel text mining (similar to `hosts` and `hostsFile`)
- `directories`: list of directories with files to be included in parallel text mining. All files in the directories will be processed. Files can be in office, openoffice, epub, pdf, txt and html formats
- `directoriesFile`: a path to a file that contains a list of directories to be included in parallel text mining (similar to `hosts` and `hostsFile`)
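The gzipped list files simply contain one entry per line. As a reference, a minimal Python sketch for building a `hostsFile` (hosts and paths are illustrative) could be:

```python
# Build a gzip-compressed hosts file with one host per line.
# The same layout applies to warcsFile, preverticalsFile and directoriesFile,
# just with WARC paths, prevertical paths or directory paths instead of hosts.
import gzip

hosts = ["www.elisabethtea.com", "vade-antea.fr"]  # illustrative hosts

with gzip.open("/home/user/hosts.gz", "wt", encoding="utf-8") as f:
    for host in hosts:
        f.write(host + "\n")
```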
Two crawlers are supported by Bitextor: Heritrix and the `wget` tool. The basic options are:
```yaml
crawler: wget
crawlTimeLimit: 1h
```
- `crawler`: set which crawler is used (`heritrix` or `wget`)
- `crawlTimeLimit`: time for which a website can be crawled; the format of this field is an integer number followed by a suffix indicating the units (accepted units are s (seconds), m (minutes), h (hours), d (days), w (weeks)), for example: `86400s`, `1440m`, `24h` or `1d`
`wget` is the most basic of the provided crawling tools; it will launch a crawling job for each specified host, which will finish either when there is nothing more to download or when the specified time limit has been reached. The following parameters may be configured when using this tool:
crawlUserAgent: "Mozilla/5.0 (compatible; Bitextor/8 +https://github.com/bitextor/bitextor)"
crawlWait: 5
crawlFileTypes: ["html", "pdf"]
- `crawlerUserAgent`: user agent to be added to the header of the crawler when doing requests to a web server (identifies your crawler when downloading a website)
- `crawlWait`: time (in seconds) to wait between retrievals; it is intended to prevent a website from cutting the crawler's connection because of too many requests in a short interval of time
- `crawlFileTypes`: file types that should be retrieved; `wget` will check the extension of the document
Finally, to use Heritrix, these parameters must be set:
```yaml
crawler: heritrix
heritrixPath: /home/user/heritrix-3.4.0-20190418
heritrixUrl: "https://localhost:8443"
heritrixUser: "admin:admin"
```
- `heritrixPath`: the installation folder of Heritrix
- `heritrixUrl`: the URL where the Heritrix service is running, `https://localhost:8443` by default
- `heritrixUser`: the credentials needed to access the Heritrix service, in the format `login:password` (`admin:admin` by default)
The Heritrix crawler will check if there is a checkpoint in its 'jobs' folder and resume from the latest one. If the crawl takes longer than the crawl time limit, it will automatically create a checkpoint for a future incremental crawl.
After crawling, the downloaded websites are processed to extract clean text, detect language, etc.
After plain text extraction, the extracted data is sharded via giashard in order to create balanced jobs. Crawled websites and WARCs are distributed in shards for more balanced processing, where each shard contains one or more complete domain(s). Shards, in turn, are split into batches of a specified size to keep memory consumption in check. Document alignment works within shards, i.e. all documents in a shard will be compared for document alignment.
The following set of options defines how these processes are carried out.
```yaml
# preprocessing
preprocessor: warc2text
langs: [en, es, fr]
## with warc2preprocess only
parser: "bs4"
ftfy: False
cleanHTML: False
langID: cld2
## remove boilerplate, only warc2preprocess in WARC processing and prevertical2text in prevertical files
boilerplateCleaning: true
## identify paragraphs
paragraphIdentification: true
## language identification at paragraph level
preverticals_cld2: true
## other metadata
additionalMetadata: true
# sharding
shards: 8 # 2^8 shards
batches: 1024 # batches of up to 1024MB
```
- `preprocessor`: this option allows selecting one of two text extraction tools: `warc2text` (default) or `warc2preprocess`. `warc2text` is faster but less flexible (fewer options) than `warc2preprocess`. There is a third preprocessor, `prevertical2text`, which cannot be set explicitly: it is used automatically when prevertical files (the output format of the SpiderLing crawler) are provided. The reason it cannot be set is that it is not a generic preprocessor, but one specific to SpiderLing files.
- `preverticals_cld2`: by default, `prevertical2text` looks for cld2 paragraph language identification. If the preverticals used do not have this mark, `preverticals_cld2` must be `False` in order to use the trigram model language identification
- `langs`: list of languages that will be processed in addition to `lang1` and `lang2`
- `PDFprocessing`: option that allows selecting a specific PDF processor. It is possible to use `pdfextract` or `apacheTika` instead of the poppler `pdf2html` converter
- `PDFextract_configfile`: set a path for a PDFExtract config file, especially for language models for better sentence splitting (see more info)
- `PDFextract_sentence_join_path`: set a path for the `sentence-join.py` script; otherwise, the one included with Bitextor will be used
- `PDFextract_kenlm_path`: set the path for the KenLM binaries
Options specific to `warc2preprocess`:
- `langID`: the model that should be used for language identification, `cld2` (default) or `cld3`; `cld2` is faster, but `cld3` can be more accurate for certain languages
- `ftfy`: ftfy is a tool that solves encoding errors (disabled by default)
- `cleanHTML`: attempt to remove some parts of the HTML that don't contain text (such as CSS, embedded scripts or special tags) before running ftfy, which is quite slow, in order to improve overall speed; this has the unwanted side effect of removing too much content if the HTML document is malformed (disabled by default)
- `html5lib`: extra parsing with `html5lib`, which is slow but is the cleanest option and parses HTML the same way as modern browsers, which is interesting for broken HTML (disabled by default)
- `parser`: select the HTML parsing library for text extraction; options are: `bs4` (default), `modest`, `lxml` (uses `html5lib`) or `simple` (very basic HTML tokenizer)
Options specific to `warc2text`:
- `multilang`: option to detect and separate multiple languages in a single document
Boilerplate:
- `boilerplateCleaning`: if `preprocessor: warc2preprocess`, enables boilerpipe to remove boilerplate from HTML documents. If you have provided `preverticals` files, entries detected as boilerplate by `prevertical2text` will be discarded automatically. `warc2text` does not support this option. It is disabled by default
- `boilerpipeMaxHeapSize`: in order to run `boilerpipe`, we use a library that has `jpype` as one of its dependencies. `jpype` takes the default maximum heap size of the JVM and does not take into account the environment variable `JAVA_OPTS` (the usual environment variable for providing options to the JVM). If big documents are being processed, you may want to increase the maximum heap size in order to be able to process them with `boilerpipe`
Metadata:
- `paragraphIdentification`: if this option is enabled, the selected `preprocessor` will generate information identifying the paragraphs. This information will be used to link every sentence to the position it had in the original paragraph.
- `additionalMetadata`: if this option is enabled, the selected `preprocessor` will generate metadata which will be propagated through the execution (currently, this option only generates metadata when `preverticals` are provided).
Sharding options:
- `shards`: set the number of shards, where a value of 'n' results in 2^n shards; the default is 8 (2^8 = 256 shards). `shards: 0` will force all domains to be compared for alignment
- `batches`: batch size in MB; the default is 1024. Large batches will increase memory consumption during document alignment, but will reduce time overhead
By default, a Python wrapper of Loomchild Segment will be used for sentence splitting. This is recommended even without language support, since it is possible to provide custom non-breaking prefixes. An external sentence splitter can be used via the `sentenceSplitters` parameter (less efficient).
Custom sentence splitters must read plain text documents from standard input and write one sentence per line to standard output.
```yaml
sentenceSplitters: {
  'fr': '/home/user/bitextor/preprocess/moses/ems/support/split-sentences.perl -q -b -l fr',
  'default': '/home/user/bitextor/bitextor/example/nltk-sent-tokeniser.py english'
}
customNBPs: {
  'fr': '/home/user/bitextor/myfrenchnbp.txt'
}
```
- `sentenceSplitters`: provide custom scripts for sentence segmentation per language; the script specified under `default` will be applied to all languages
- `customNBPs`: provide a set of files with custom non-breaking prefixes for the default sentence splitter; see the already existing files for examples
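As a reference, a custom sentence splitter along the lines of the `nltk-sent-tokeniser.py` example shipped with Bitextor could look like the following sketch. It assumes NLTK with its `punkt` data is installed; it reads plain text documents from standard input and writes one sentence per line to standard output:

```python
#!/usr/bin/env python3
# Minimal custom sentence splitter sketch: reads plain text from stdin and
# writes one sentence per line to stdout. Assumes NLTK with the 'punkt' data
# already downloaded (e.g. python -m nltk.downloader punkt).
import sys

from nltk.tokenize import sent_tokenize

language = sys.argv[1] if len(sys.argv) > 1 else "english"

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    for sentence in sent_tokenize(line, language=language):
        print(sentence)
```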
Moses `tokenizer.perl` is the default tokeniser, which is used through an efficient Python wrapper. This is the recommended option unless a language is not supported.
Custom scripts for tokenisation must read sentences from standard input and write the same number of tokenised sentences to standard output.
```yaml
wordTokenizers: {
  'fr': '/home/user/bitextor/mytokenizer -l fr',
  'default': '/home/user/bitextor/moses/tokenizer/my-modified-tokenizer.perl -q -b -a -l en'
}
morphologicalAnalysers: {
  'lang1': 'path/to/morph-analyser1',
  'lang2': 'path/to/morph-analyser2'
}
```
- `wordTokenizers`: scripts for word tokenisation per language; the `default` script will be applied to all languages
- `morphologicalAnalysers`: scripts for morphological analysis (lemmatizer/stemmer). They will only be applied to the specified languages after tokenisation, or to all of them if a `default` script is also provided.
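A custom tokeniser plugged in through `wordTokenizers` can be as simple as the following sketch; it uses the `sacremoses` package as an illustrative stand-in for the Moses `tokenizer.perl` script (an assumption, not what Bitextor uses internally). It reads one sentence per line from standard input and writes exactly the same number of tokenised lines to standard output:

```python
#!/usr/bin/env python3
# Minimal custom word tokeniser sketch: reads one sentence per line from stdin
# and writes the same number of tokenised lines to stdout. Uses sacremoses as
# an illustrative stand-in for the Moses tokenizer.perl script.
import sys

from sacremoses import MosesTokenizer

lang = sys.argv[1] if len(sys.argv) > 1 else "en"
tokenizer = MosesTokenizer(lang=lang)

for line in sys.stdin:
    print(tokenizer.tokenize(line.rstrip("\n"), return_str=True))
```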
From this step forward, Bitextor works with a pair of languages, which are specified through the `lang1` and `lang2` parameters. The output will contain the sentence pairs in that order.
```yaml
lang1: es
lang2: en
```
Different strategies are implemented in Bitextor for document alignment:
- It uses a bilingual lexicon to compute word-overlapping-based similarity metrics; these metrics are combined with other features that are extracted from HTML files and used by a linear regressor to obtain a similarity score.
- It uses external machine translation (MT) and a TF/IDF similarity metric computed between the original documents in one of the languages and the translation of the documents in the other language.
- Neural Document Aligner (NDA): it embeds each sentence of each document into an embedding space and merges them in order to obtain a semantic representation of the document in the resulting embedding.
```yaml
documentAligner: externalMT
```
The variable `documentAligner` can take different values, each of them selecting a different document alignment strategy:
- `DIC`: selects the strategy using bilingual lexica and a linear regressor
- `externalMT`: selects the strategy using MT, in this case using an external MT script (provided by the user) that reads source-language text from the standard input and writes the translations to the standard output
- `NDA`: selects the strategy using embeddings from SentenceTransformers
```yaml
documentAligner: DIC
dic: /home/user/en-fr.dic
```
The option `dic` specifies the path to the bilingual lexicon to be used for document alignment. This dictionary should have words in `lang1` in the first column, and words in `lang2` in the second one.
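As an illustration of that layout (not an actual lexicon; check the files in the bitextor-data repository for the exact expected format, which may also include a header line with the language codes), an `en-fr.dic` with `lang1: en` could contain tab-separated entries such as:

```
house	maison
cat	chat
friend	ami
```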
If the lexicon specified does not exist, you can set the option `generateDic` in order to build it, using a parallel corpus provided through the variable `initCorpusTrainingPrefix` and the `mgiza` tools:
```yaml
generateDic: True
initCorpusTrainingPrefix: ['/home/user/Europarl.en-fr.train']
```
This variable must contain one or more corpus prefixes. For a given prefix (`/home/user/Europarl.en-fr.train` in the example) the pipeline expects to find one file '`prefix.lang1`' and another '`prefix.lang2`' (in the example, `/home/user/Europarl.en-fr.train.en` and `/home/user/Europarl.en-fr.train.fr`). If several training prefixes are provided, the corresponding files will be concatenated before building the bilingual lexicon.
Suggestion: a number of pre-built bilingual lexica are available in the bitextor-data repository. It is also possible to use other already available lexica, such as those in OPUS, as long as their format is the same as that of the lexica in the repository.
```yaml
documentAligner: externalMT
alignerCmd: "example/dummy-translate.sh"
translationDirection: "es2en"
documentAlignerThreshold: 0.1
```
- `alignerCmd`: command to call the external MT script; the MT system must read documents (one sentence per line) from standard input, and write the translation, with the same number of lines, to standard output (a minimal sketch of such a script is shown after this list)
- `translationDirection`: the direction of the translation system, specified as '`srcLang`2`trgLang`' (e.g. `es2en`); the default is `lang1` to `lang2`
- `documentAlignerThreshold`: threshold for discarding document pairs with a very low TF/IDF similarity score; this option takes values in [0,1] and is 0.1 by default
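The script given in `alignerCmd` can be anything that fulfils the stdin/stdout contract described above. A minimal Python sketch of such a wrapper follows; the `translate` function is a placeholder that simply echoes its input (like a dummy translator), and you would replace it with a call to your actual MT system:

```python
#!/usr/bin/env python3
# Skeleton for an external MT script usable as alignerCmd: it must read
# source-language sentences (one per line) from stdin and write the same
# number of translated lines to stdout.
import sys

def translate(sentence):
    # Placeholder: replace this with a call to your actual MT system
    # (e.g. a Marian / OPUS-MT model or a translation API).
    return sentence

for line in sys.stdin:
    print(translate(line.rstrip("\n")))
```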
```yaml
documentAligner: NDA
embeddingsBatchSize: 64
embeddingsModel: LaBSE
```
- `embeddingsBatchSize`: the batch size used when computing the embeddings. This allows you to control the amount of memory used on your device, which may be very useful for GPUs.
- `embeddingsModel`: the model which will be used to generate the embeddings. There are different models available from SentenceTransformers, but a multilingual model should be used. This option affects the `vecalign` segment aligner as well (see the illustration after this list).
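For reference, these two options map onto the SentenceTransformers API that the embedding-based aligners build on. The following standalone sketch (illustration only, not Bitextor code) shows what `embeddingsModel: LaBSE` and `embeddingsBatchSize: 64` correspond to:

```python
# Standalone illustration of the embeddings configuration (not Bitextor code):
# embeddingsModel selects the SentenceTransformers model, embeddingsBatchSize
# controls how many sentences are encoded at once (and thus GPU memory usage).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("LaBSE")  # embeddingsModel
embeddings = model.encode(
    ["Hello world", "Bonjour le monde"],
    batch_size=64,                    # embeddingsBatchSize
)
print(embeddings.shape)               # one multilingual vector per sentence
```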
After document alignment, the next step in the pipeline is segment alignment. There are different tools available:
- hunalign: it uses a bilingual lexicon and is best suited for the `DIC` option of `documentAligner`.
- bleualign: it uses MT and is only available if one of the options based on MT has been specified in `documentAligner`.
- vecalign: it uses an embedding space in order to look for the closest semantically related sentences (it is only available if `NDA` has been specified in `documentAligner`).
```yaml
sentenceAligner: bleualign
sentenceAlignerThreshold: 0.1
```
- `sentenceAligner`: segment aligner tool: `bleualign`, `hunalign` or `vecalign`.
- `sentenceAlignerThreshold`: threshold for filtering pairs of sentences with a score that is too low; values in the [0,1] range; default is 0.0
Parallel data filtering is carried out with Bicleaner or Bicleaner AI; these tools use a pre-trained regression model to filter out pairs of segments with a low confidence score.
A number of pre-trained models for Bicleaner are available here. They are ready to be downloaded and decompressed. The pre-trained models for Bicleaner AI are available here.
The options required to make it work are:
```yaml
bicleaner: True
bicleanerModel: /home/user/bicleaner-model/en-fr/training.en-fr.yaml
```
- `bicleaner`: use Bicleaner to filter out pairs of segments
- `bicleanerFlavour`: select which version to use. The allowed values are `classic` for Bicleaner and `ai` for Bicleaner AI (default value)
- `bicleanerModel`: path to the YAML configuration file of a pre-trained model
- `bicleanerThreshold`: threshold to filter low-confidence segment pairs; accepts values in the [0,1] range; default is 0.0 (no filtering). It is recommended to set it to values in [0.5,0.7]
If the Bicleaner model is not available, you can specify the option `bicleanerGenerateModel` in order to train one automatically from the data provided through the config file option `bicleanerCorpusTrainingPrefix`. If you need to train a Bicleaner model, you will need to specify `initCorpusTrainingPrefix` as well. If you are using Bicleaner AI instead, you will need to specify the config options `bicleanerParallelCorpusDevPrefix` and `bicleanerMonoCorpusPrefix`. Be aware that the direction of the generated model will be the same as the one specified in `translationDirection`, if specified, or `lang1` to `lang2`.
```yaml
bicleanerGenerateModel: True
bicleanerCorpusTrainingPrefix: ['/home/user/RF.en-fr']
# Bicleaner
initCorpusTrainingPrefix: ['/home/user/Europarl.en-fr.train']
# Bicleaner AI
bicleanerParallelCorpusDevPrefix: ['/home/user/DGT.en-fr.train']
bicleanerMonoCorpusPrefix: ['/home/user/en-fr.mono']
```
- `bicleanerCorpusTrainingPrefix`: prefix to the parallel corpus that will be used to train the regressor that obtains the confidence score in Bicleaner or Bicleaner AI
- Bicleaner:
  - `initCorpusTrainingPrefix`: prefix to the parallel corpus (see Variables for bilingual lexica) that will be used to train the statistical dictionaries which are part of the Bicleaner model. If `dic` is provided and does not exist, you will need to generate one with `generateDic`. Even if `dic` exists because you downloaded it, the whole process of generating it might be carried out anyway, since what is really needed to build the model is the statistical information from the dictionary, which might not be available if you downloaded it
- Bicleaner AI:
  - `bicleanerParallelCorpusDevPrefix`: prefix to the parallel corpus that will be used for the evaluation of the trained model
  - `bicleanerMonoCorpusPrefix`: prefix to the monolingual corpus that will be used to obtain noise sentences and train SentencePiece embeddings
For Bicleaner, it is important to provide different parallel corpora for these options, as this helps Bicleaner deal with unknown words (those that do not appear in the statistical dictionaries) during scoring. In the case of Bicleaner AI, this applies to the monolingual data as well, and the evaluation corpus should be of high quality.
Some other options can be configured to specify the output format of the parallel corpus:
```yaml
bifixer: True
bifixerAggressiveDedup: False
bifixerIgnoreSegmentation: False
deferred: False
elrc: True
tmx: True
deduped: False
granularity: ["sentences", "documents"]
biroamer: True
biroamerOmitRandomSentences: True
biroamerMixFiles: ["/home/user/file-to-mix1", "/home/user/file-to-mix2"]
biroamerImproveAlignmentCorpus: /home/user/Europarl.en-fr.txt
```
- `bifixer`: use Bifixer to fix parallel sentences and tag near-duplicates for removal
- `bifixerAggressiveDedup`: marks near-duplicate sentences as duplicates, so they can be removed in the deduplication step (i.e. `deduped: True`). This step is enabled by default if not specified and `bifixer: True`
- `bifixerIgnoreSegmentation`: does not re-split long sentences. This step is enabled by default if not specified and `bifixer: True`
- `deferred`: if this option is set, segment contents (plain text or TMX) are deferred to their original location given a MurmurHash2 64-bit checksum
- `elrc`: include some ELRC quality indicators in the final corpus, such as the ratio of target length to source length; these indicators can be used later to filter out some segment pairs manually
- `tmx`: generate a TMX translation memory of the output corpus
- `deduped`: generate de-duplicated TMX and regular versions of the corpus; the TMX corpus will contain a list of URLs for the sentence pairs that were found in multiple websites
- `granularity`: by default, Bitextor generates a file with parallel sentences. With this option it is possible to add an additional output file containing the full parallel documents. For this output, two documents are parallel when the `{lang1}-{lang2}.sents.gz` file contains at least one pair of sentences extracted from these documents.
- `biroamer`: use Biroamer to ROAM (randomize, omit, anonymize and mix) the parallel corpus; in order to use this feature, `tmx: True` or `deduped: True` will be necessary
- `biroamerOmitRandomSentences`: omit close to 10% of the TMX corpus
- `biroamerMixFiles`: use extra sentences to improve anonymization; this option accepts a list of files whose sentences will be added; the files are expected to be in Moses format
- `biroamerImproveAlignmentCorpus`: an alignment corpus can be provided in order to improve entity detection; expected to be in Moses format.
NOTE: In case you need to convert a TMX to a tab-separated plain-text file (Moses format), you could use the TMXT tool.
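If you prefer not to depend on TMXT, the conversion itself is straightforward; the following hand-rolled Python sketch is illustrative only (it ignores TMX details such as inline markup inside `<seg>`, duplicated language variants or metadata), so use TMXT for anything beyond a quick look:

```python
#!/usr/bin/env python3
# Minimal TMX -> tab-separated (Moses format) converter sketch.
# Usage: python3 tmx2moses.py corpus.tmx en fr > corpus.en-fr.txt
import sys
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

tmx_path, lang1, lang2 = sys.argv[1], sys.argv[2], sys.argv[3]

for _, elem in ET.iterparse(tmx_path):
    if elem.tag != "tu":
        continue
    segs = {}
    for tuv in elem.iter("tuv"):
        lang = tuv.get(XML_LANG, "").split("-")[0].lower()
        seg = tuv.find("seg")
        if seg is not None and seg.text:
            segs[lang] = " ".join(seg.text.split())
    if lang1 in segs and lang2 in segs:
        print(f"{segs[lang1]}\t{segs[lang2]}")
    elem.clear()  # keep memory usage low on large files
```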