A collection of NLP pipelines powered by Nextflow

proycon/aNtiLoPe

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

aNtiLoPe: Natural Language Processing pipelines that run!

aNtiLoPe offers various NLP workflows that build on a variety of tools. This repository hosts the workflows themselves, powered by Nextflow. The tools the workflows depend on are not included in this repository, but aNtiLoPe itself and all its dependencies are shipped as part of our LaMachine software distribution.

Some related but more specialised workflows are available as standalone projects:

  • PICCL - A set of workflows for corpus building through OCR, post-correction and normalisation.
  • Nederlab Pipeline - Linguistic enrichment pipeline for historical Dutch, as used in the Nederlab project.
  • Quoll - NLP text classification pipeline.

Running these workflows, rather than manually invoking the underlying NLP tools that do the actual work, takes less effort on the user's part and offers more portability and scalability: the pipelines can be executed across multiple computing nodes on a high-performance cluster or cloud platform, using executors such as SGE, LSF, SLURM, PBS, HTCondor, Kubernetes and Amazon AWS. Parallelisation is handled automatically. Consult the Nextflow documentation for details.
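
As a rough illustration of how a cluster executor might be selected, the sketch below writes a minimal nextflow.config into the current directory. Nextflow reads a nextflow.config from the launch directory, so workflows started from there would pick it up. The choice of SLURM and the queue name are assumptions for illustration only; consult the Nextflow documentation and your cluster's setup for the settings that actually apply.

```shell
# Write a minimal Nextflow configuration into the current directory.
# Assumption: a SLURM cluster; 'slurm' is one of Nextflow's executor names.
cat > nextflow.config <<'EOF'
process {
    executor = 'slurm'
    // Hypothetical queue name; replace with one that exists on your cluster.
    queue = 'short'
}
EOF

# Show what was written.
cat nextflow.config
```

Keeping the executor choice in a configuration file rather than in the workflow leaves the workflows themselves unchanged between a laptop and a cluster.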

aNtiLoPe makes extensive use of the FoLiA format, a rich XML-based format for linguistic annotation.

Important Note: This is beta software still in development; for the old and deprecated version consult this repository.

Installation

aNtiLoPe is already shipped as part of LaMachine; if you already have a LaMachine instance running, you may need to add it explicitly using lamachine-add antilope. The workflows are invoked on the command line, and their names end with the extension .nf.

It's also possible to use Nextflow directly and have it install and use the Docker flavour of LaMachine. In this case, make sure to always run it with the -with-docker proycon/lamachine parameter:

$ nextflow run proycon/aNtiLoPe -with-docker proycon/lamachine

Workflows

  • tokenize.nf - A tokenisation workflow using the ucto tokeniser; takes either plaintext or untokenised FoLiA documents (e.g. output from ticcl), and produces tokenised FoLiA documents.
  • frog.nf - An NLP workflow for Dutch using the frog NLP suite; takes either plaintext or FoLiA documents and produces linguistically enriched FoLiA documents. It takes care of tokenisation as well.
  • foliavalidator.nf - A simple validation workflow to validate FoLiA documents. Uses the FoLiA tools.
  • foliaupgrader.nf - An upgrade tool to upgrade FoLiA documents to FoLiA v2. Uses the FoLiA tools.

Running any of these workflows with the --help parameter, or without any parameters, outputs usage information.
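
A small sketch of how the usage information for each workflow could be requested in turn. It assumes the workflow scripts are available on the PATH, as would be expected inside an activated LaMachine environment; if they are not found, it only reports that and runs nothing.

```shell
# Workflow script names as listed in this README.
workflows="tokenize.nf frog.nf foliavalidator.nf foliaupgrader.nf"

for wf in $workflows; do
    if command -v "$wf" >/dev/null 2>&1; then
        # Assumption: the workflow script is directly executable from the PATH.
        "$wf" --help
    else
        echo "$wf not found on PATH; is the LaMachine environment active?" >&2
    fi
done
```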

Technical Details & Contributing

Please see CONTRIBUTE.md for technical details and information on how to contribute.
