A set of (low-level) Natural Language Encoder-Decoders (codecs) that are useful in the preprocessing stage of an NLP pipeline. These codecs encode sequences at one of the following levels:
- Character
- Word
- BPE based subwords
- Class (for multiclass classification)
- Byte: a character is a Unicode codepoint (which can be higher than 255), whereas bytes are in [0-255]; the byte level is a proxy over the UTF-8 scheme (see the snippet after this list)
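The character-versus-byte distinction can be seen with plain Python (a quick illustration, not specific to nlcodec):

text = "café"
# one Unicode codepoint may expand to several UTF-8 bytes
print(list(text))                  # 4 characters: ['c', 'a', 'f', 'é']
print(list(text.encode('utf-8')))  # 5 bytes: [99, 97, 102, 195, 169]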
It provides both a Python API (so you can embed it into your application) and a CLI (so you can use it as a standalone tool).
There are many BPE implementations available already, but this one differs:
- Pure Python implementation that is easy to modify to try out new ideas (other implementations require C++ expertise to modify the core).
- The BPE model is a simple text file that can be inspected with less or cut; it includes info on which pieces were merged together and with what frequencies (see the sketch after this list).
- Reasonably faster than other pure Python implementations; the speed in Python comes at the cost of extra memory due to indexing.
- PySpark backend for extracting term frequencies from large datasets.
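Because the model file is plain text, standard command-line tools work on it; for example (the column layout assumed here is only an illustration, not a documented contract):

# page through the learned vocabulary
less bpe.model
# peek at the first field of each entry, assuming tab-separated columns
cut -f1 bpe.model | head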
Please run only one of these:

# Install from pypi (recommended)
$ pip install nlcodec

# Clone repo for development mode
$ git clone https://github.com/isi-nlp/nlcodec
$ cd nlcodec
$ pip install --editable .
The pip installer registers these CLI tools in your PATH:
- nlcodec -- CLI for learn, encode, decode. Same as python -m nlcodec
- nlcodec-learn -- CLI for learning BPE with the PySpark backend. Same as python -m nlcodec.learn
- nlcodec-db -- CLI for bitextdb. Same as python -m nlcodec.bitextdb
- nlcodec-freq -- CLI for extracting word and char frequencies from a corpus using the Spark backend.
$ python -m nlcodec -h
usage: __main__.py [-h] [-i INP] [-o OUT] -m MODEL [-idx] [-vs VOCAB_SIZE]
                   [-l {char,word,bpe,class}] [-mf MIN_FREQ]
                   {learn,encode,decode,estimate}

positional arguments:
  {learn,encode,decode,estimate}
                        "task" or sub-command.
                        "learn"    - learns vocabulary; use --level and vocab_size for type and size
                        "encode"   - encodes a dataset
                        "decode"   - decodes an already encoded dataset
                        "estimate" - estimates quality attributes of an encoding

optional arguments:
  -h, --help            show this help message and exit
  -i INP, --inp INP     Input file path (default: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='UTF-8'>)
  -o OUT, --out OUT     Output file path. Not valid for "learn" or "estimate" task
                        (default: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>)
  -m MODEL, --model MODEL
                        Path to model aka vocabulary file (default: None)
  -idx, --indices       Indices instead of strings. Valid for task=encode and task=decode
                        (default: None)

args for task=learn:
  -vs VOCAB_SIZE, --vocab_size VOCAB_SIZE
                        Vocabulary size. Valid only for task=learn. This is required for "bpe",
                        but optional for "word" and "char" models; specifying it will trim the
                        vocabulary at given top most frequent types. (default: -1)
  -l {char,word,bpe}, --level {char,word,bpe}
                        Encoding Level; Valid only for task=learn (default: None)
  -mf MIN_FREQ, --min_freq MIN_FREQ
                        Minimum frequency of types for considering inclusion in vocabulary.
                        Types fewer than this frequency will be ignored. For --level=word,
                        freq is type freq and default is 2. For --level=char or --level=bpe,
                        characters fewer than this value will be excluded. default=20
                        (default: None)
Example:
# learn
head -2000 somefile.tok | nlcodec learn -l bpe -m bpe.model --vocab_size 2000
# encode with text pieces
head somefile.tok | nlcodec encode -m bpe.model
# encode with indexes
head somefile.tok | nlcodec encode -m bpe.model -idx
# decode -- undo encoding
head somefile.tok | nlcodec decode -m bpe.model
head somefile.tok | nlcodec decode -m bpe.model -idx
# estimate quality
head somefile.tok | nlcodec estimate -m bpe.model
from nlcodec import load_scheme
path = 'path/to/vocab.model'
vocab = load_scheme(path)
line = 'this is a sample sentence'
# encode a line of text into list of ids
vocab.encode(line)
# parallel encode a bunch of lines using multiple cpus
vocab.encode_parallel(seqs=[line], n_cpus=2)
# encode a line of text into pieces
vocab.encode_str(line)
# decode
vocab.decode(vocab.encode(line))
vocab.decode_str(vocab.encode_str(line))
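Building on the calls above, a minimal sketch that encodes a whole tokenized file into id sequences (file names are placeholders):

# encode each line of a tokenized file into space-separated ids
with open('somefile.tok') as inp, open('somefile.ids', 'w') as out:
    for line in inp:
        ids = vocab.encode(line.strip())
        out.write(' '.join(map(str, ids)) + '\n')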
from nlcodec import load_scheme, BPEScheme
path = 'path/to/bpe-vocab.model'
bpe: BPEScheme = load_scheme(path)
some_type = bpe.table[1000] # select some bpe piece type
# get stochastic split
some_type.get_stochastic_split(split_ratio=0.5, name=False)
# get all possible permutations
some_type.get_permutations(name=False)
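Since the split is stochastic, repeated calls may yield different segmentations; a quick sketch reusing the call above (the index and ratio are arbitrary choices):

# sample a few alternative segmentations of the same vocabulary type
for _ in range(3):
    print(some_type.get_stochastic_split(split_ratio=0.5, name=False))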
For larger datasets, you may take advantage of PySpark.
Note: please install PySpark using pip install pyspark
python -m nlcodec.learn    # i.e. nlcodec-learn
(same as the "python -m nlcodec learn" sub-command, but with these extra options:)
-spark SPARK_MASTER, --spark-master SPARK_MASTER
Spark master (default: local[*])
-dm DRIVER_MEM, --driver-mem DRIVER_MEM
Spark driver memory (default: 4g)
-dd, --dedup Deduplicate the sentences: use only unique sequences
(default: True)
-ndd, --no-dedup Do not deduplicate. (default: False)
$ python -m nlcodec.learn -i train.eng.tok train.kan.tok -l bpe -vs 8000 -m ~/tmp/bpe.8k.model
# Tip: this also created two intermediate files:
~/tmp/bpe.8k.charfreq.gz
~/tmp/bpe.8k.wordfreq.gz
# these can be reused again with "nlcodec learn -tfs -i <path>"
To compute term frequencies as a separate step:
$ python -m nlcodec.term_freq -h
usage: term_freq.py [-h] [-i INP [INP ...]] [-wf WORD_FREQS] [-cf CHAR_FREQS]
[-dd] [-ndd]
optional arguments:
-h, --help show this help message and exit
-i INP [INP ...], --inp INP [INP ...]
Input file paths (default: None)
-wf WORD_FREQS, --word_freqs WORD_FREQS
Output file path for word frequencies (default: None)
-cf CHAR_FREQS, --char_freqs CHAR_FREQS
Output file path for character frequencies (default:
None)
-dd, --dedup Deduplicate the sentences: use only unique sequences
(default: True)
-ndd, --no-dedup Do not deduplicate. (default: False)
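For example, frequencies could be extracted once and reused for several vocabularies (file names are illustrative):

$ python -m nlcodec.term_freq -i train.eng.tok train.kan.tok -wf words.tsv -cf chars.tsv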
Then learn the vocabulary from the extracted frequencies. Example:
# word vocab of 32K
python -m nlcodec learn -i words.tsv -tfs -l word -vs 32000 -m word.model
# Character vocab of 99.95% coverage
python -m nlcodec learn -i chars.tsv -tfs -l char -mf 1 -cv 0.9995 -m char.model
# BPE vocab of 8K
python -m nlcodec learn -i words.tsv -tfs -l bpe -vs 8000 -m bpe.model
# BPE vocab until minimum merge frequency is 100; set -vs=64000 as some large number
python -m nlcodec learn -i words.tsv -tfs -l bpe -vs 64000 -m bpe.model -cv 0.99995 -mce 100
from typing import List
from nlcodec import learn_vocab, term_freq
from pathlib import Path
import logging as log
def train(model_type: str, vocab_size: int, model_path: str, files: List[str],
          char_coverage: float = 0, dedup=True, spark=None):
    """
    :param model_type: word, char, bpe
    :param vocab_size: vocabulary size
    :param model_path: where to store the vocabulary model
    :param files: text files for creating the vocabulary
    :param char_coverage: character coverage (0, 1]; value <= 0 => default coverage
    :param spark: an instance of spark.sql.SparkSession (optional)
    :return:
    """
    kwargs = dict(char_coverage=char_coverage) if char_coverage > 0 else {}
    stats_file = Path(model_path + '.termfreqs')
    if not stats_file.exists():
        log.info("Extracting term frequencies... ")
        paths = [f if isinstance(f, Path) else Path(f) for f in files]
        wfs, chfs, n_lines = term_freq.word_counts(paths=paths, dedup=dedup, spark=spark)
        log.info(f"Lines = {n_lines:,}, Word Types: {len(wfs):,} Char Types: {len(chfs):,}")
        stats = chfs if model_type == 'char' else wfs
        log.info(f"Writing frequencies to {stats_file}")
        with stats_file.open('w') as out:
            term_freq.write_stats(stats=stats, out=out, line_count=n_lines)
    kwargs['term_freqs'] = True
    inp = stats_file.read_text().splitlines()
    learn_vocab(inp=inp, level=model_type, model=model_path, vocab_size=vocab_size, **kwargs)
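A possible invocation of the helper above (file names and sizes are placeholders):

train(model_type='bpe', vocab_size=8000, model_path='bpe.8k.model',
      files=['train.eng.tok', 'train.kan.tok'])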
In the above example, if you already have a spark.sql.SparkSession instance, pass it via the spark argument.
By default, a local SparkSession is created and shut down afterwards.
To control that default Spark backend, set these environment variables before calling the above code:
import os
os.environ["SPARK_DRIVER_MEM"]="4g"
os.environ["SPARK_MATSER"]="local[*]"