A CLI application that parses tldr pages from the tldr-pages/tldr repository and produces a dataset that maps strings across localized pages. The motivation was to provide an additional corpus for OPUS; see What is Opus? for more context.
You can install the tool by running the following commands:
# Clone the repository
git clone https://github.com/tldr-pages/tldr-translation-pairs-gen.git
# Enter the directory that git created when cloning
cd tldr-translation-pairs-gen
# Install dependencies
npm install
# Build the project
npm run build
# Install the project on your machine
npm install -g .
You should now have tldr-translation-pairs-gen on your path. Try the help command to see the available options:
tldr-translation-pairs-gen --help
First, obtain a copy of the tldr-pages repository. The easiest way is to use Git:
git clone https://github.com/tldr-pages/tldr.git
Point tldr-translation-pairs-gen to the directory using the --source argument. This writes one file for every combination of languages to the dataset/ directory, containing all alignments that can be found between localized pages.
tldr-translation-pairs-gen --source {{path/to/tldr_dir}}
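To get a feel for how many files to expect, here is a minimal sketch that enumerates language combinations. It assumes "every combination" means every unordered pair of languages, which the text above does not spell out; `languagePairs` is a hypothetical helper and not part of the tool.

```typescript
// Enumerate every unordered pair of language codes. This is an assumption
// about what "every combination of languages" means; the tool may differ.
function languagePairs(languages: string[]): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  languages.forEach((first, i) => {
    // Only look at later entries so each pair is produced once.
    for (const second of languages.slice(i + 1)) {
      pairs.push([first, second]);
    }
  });
  return pairs;
}

const examplePairs = languagePairs(["en", "fr", "de"]);
// examplePairs is [["en", "fr"], ["en", "de"], ["fr", "de"]]
```

Under that assumption, n languages yield n(n-1)/2 files, so the number of outputs grows quickly as more localizations are added.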
You can also pass the --format argument to specify a different output format. The supported file formats are TMX (Translation Memory eXchange), XML, CSV, and JSON.
tldr-translation-pairs-gen --source {{path/to/tldr_dir}} --format csv
When generating the dataset, you'll find that not all strings are included. Because of how the project is structured and how the current translation workflow operates, the order or number of examples sometimes differs between languages. This results in the localized pages falling out of sync.
Each example in a page consists of two strings: the description of the command and the command itself. To work around the aforementioned issue, we parse each example and use the command as an identifier.
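As a rough illustration of this parsing step, the sketch below extracts description and command pairs from a page's Markdown. It assumes the usual tldr page layout, where a description is a line starting with `- ` and ending with a colon, followed by the command wrapped in backticks. `ParsedExample`, `parseExamples`, and the sample page are hypothetical and are not taken from the tool's source.

```typescript
// Hypothetical shape for one parsed example; not taken from the tool's source.
interface ParsedExample {
  description: string;
  command: string;
}

// Assumes each example is a "- description:" line followed (after optional
// blank lines) by a line holding the command wrapped in single backticks.
function parseExamples(page: string): ParsedExample[] {
  const examples: ParsedExample[] = [];
  let pending: string | null = null;

  for (const rawLine of page.split(/\r?\n/)) {
    const line = rawLine.trim();
    const bullet = /^- (.*?)\s*:$/.exec(line);
    const code = /^`(.+)`$/.exec(line);
    if (bullet !== null) {
      // Remember the description until its command line shows up.
      pending = bullet[1];
    } else if (code !== null && pending !== null) {
      examples.push({ description: pending, command: code[1] });
      pending = null;
    }
  }
  return examples;
}

// A made-up page fragment used only to exercise the parser.
const samplePage = [
  "# tldr",
  "",
  "> Sample description line.",
  "",
  "- [u]pdate the local cache of tldr pages:",
  "",
  "`tldr -u`",
].join("\n");

const parsedExamples = parseExamples(samplePage);
// parsedExamples holds one entry whose command is "tldr -u".
```

Allowing optional whitespace before the trailing colon lets the same pattern handle French descriptions, which put a space before the colon.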
To map strings between languages, we parse all examples, remove the tokens between curly braces (e.g. {{path/to/file}}), since they may be localized, and then look for the matching example in the same page of each other language, if it exists.
After removing the content between curly braces, two or more examples in the same page may become identical because they differed only in their tokens. In these cases, we omit them from the corpus, as there is no way to determine the translation pair unambiguously.
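The matching strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `PageExample`, `commandKey`, `indexByKey`, and `pairDescriptions` are hypothetical names, the sample pages are made up, and emptying each placeholder to `{{}}` is just one way to remove the content between curly braces.

```typescript
// Hypothetical shape for one parsed example (description plus command).
interface PageExample {
  description: string;
  command: string;
}

// Empty every {{...}} placeholder so only the fixed part of the command remains.
function commandKey(command: string): string {
  return command.replace(/\{\{.*?\}\}/g, "{{}}");
}

// Index a page's examples by key, leaving out keys shared by several examples.
function indexByKey(examples: PageExample[]): Map<string, PageExample> {
  const counts = new Map<string, number>();
  for (const example of examples) {
    const key = commandKey(example.command);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const index = new Map<string, PageExample>();
  for (const example of examples) {
    const key = commandKey(example.command);
    if (counts.get(key) === 1) {
      index.set(key, example);
    }
  }
  return index;
}

// Pair the descriptions of examples whose keys appear in both pages.
function pairDescriptions(source: PageExample[], target: PageExample[]): Array<[string, string]> {
  const targetIndex = indexByKey(target);
  const pairs: Array<[string, string]> = [];
  indexByKey(source).forEach((example, key) => {
    const match = targetIndex.get(key);
    if (match !== undefined) {
      pairs.push([example.description, match.description]);
    }
  });
  return pairs;
}

// Made-up pages: the English page has two examples that differ only in tokens,
// and the French page lists its examples in a different order.
const enPage: PageExample[] = [
  { description: "Copy a file", command: "cp {{path/to/source}} {{path/to/target}}" },
  { description: "Copy a directory recursively", command: "cp -r {{path/to/source}} {{path/to/target}}" },
  { description: "Copy a text file", command: "cp {{file.txt}} {{copy.txt}}" },
];
const frPage: PageExample[] = [
  { description: "Copier un répertoire récursivement", command: "cp -r {{chemin/vers/source}} {{chemin/vers/cible}}" },
  { description: "Copier un fichier", command: "cp {{chemin/vers/source}} {{chemin/vers/cible}}" },
];

const translationPairs = pairDescriptions(enPage, frPage);
// translationPairs is [["Copy a directory recursively", "Copier un répertoire récursivement"]]
```

In this sketch the two plain `cp` examples on the English page collapse to the same key, so neither is paired, even though the French page has a single unambiguous counterpart. The recursive copy is still matched despite sitting at a different index on each page.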
Here is a real-world example of the problem: the English version was modified after the French translation was made, so the pages have fallen out of sync. If we paired examples by index, we would create mismatches.
| EN | FR |
|---|---|
| - Print the tldr page for a specific command (hint: this is how you got here!): tldr {{command}} | - Affiche la page tldr d'une commande (indice : c'est comme ça que vous êtes arrivé ici !) : tldr {{commande}} |
| - Print the tldr page for a specific subcommand: tldr {{command}}-{{subcommand}} | - Affiche la page tldr de cd, en forçant la plateforme par défaut : tldr -p {{android|linux|osx|sunos|windows}} {{cd}} |
| - Print the tldr page for a command for a specific [p]latform: tldr {{command}} | - Affiche la page tldr d'une sous-commande : tldr {{git-checkout}} |
| - [u]pdate the local cache of tldr pages: tldr -u | - Met à jour les pages enregistrées localement (si le client supporte la mise en cache) : tldr -u |
OPUS is a public dataset of translated resources from the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions. These datasets are helpful for a variety of applications such as research and machine learning.
A notable project that uses the OPUS corpora is LibreTranslate, powered by argos-translate. It's a free, open-source, and self-hostable machine translation API that doesn't depend on third-party services. By translating tldr-pages, we're collectively contributing more data to improve open-source machine translation!