seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and to offer an API that is as similar to scikit-learn's as possible.
Get NumPy >=1.6, SciPy >=0.11, Cython >=0.20.2 and a recent version of scikit-learn. Then issue:
python setup.py install
to install seqlearn.
If you want to use seqlearn from its source directory without installing, you have to compile first:
python setup.py build_ext --inplace
The easiest way to start using seqlearn is to fetch a dataset in CoNLL 2000 format. Define a task-specific feature extraction function, e.g.:
>>> def features(sequence, i):
...     yield "word=" + sequence[i].lower()
...     if sequence[i].isupper():
...         yield "Uppercase"
...
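To see what such a feature function produces, it can be called by hand on a single sentence. The sketch below assumes that a sequence is a list of token strings and that the function is called once per position; the sample sentence is made up for illustration.

```python
def features(sequence, i):
    # One feature for the lowercased word form.
    yield "word=" + sequence[i].lower()
    # str.isupper() is true only for all-caps tokens such as "NATO",
    # not for merely capitalized ones such as "Brussels".
    if sequence[i].isupper():
        yield "Uppercase"

# Hypothetical example sentence, already split into tokens.
sentence = ["NATO", "met", "in", "Brussels"]

# Collect the feature strings emitted for every position.
extracted = [list(features(sentence, i)) for i in range(len(sentence))]

for token, feats in zip(sentence, extracted):
    print(token, feats)
```

Note that "Uppercase" fires only for all-caps tokens, so a capitalization feature would need a different test, such as sequence[i][0].isupper().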
Load the training file, say train.txt:
>>> from seqlearn.datasets import load_conll
>>> X_train, y_train, lengths_train = load_conll("train.txt", features)
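The three return values are easier to use once the role of the lengths array is clear. The sketch below assumes that the samples of all sequences are stored back to back and that lengths holds the number of tokens in each sequence. It uses plain Python lists and made-up labels rather than seqlearn's own data structures.

```python
from itertools import accumulate

# Hypothetical toy data: labels for two sentences (3 and 2 tokens), concatenated.
y = ["B-NP", "I-NP", "O", "B-NP", "O"]
lengths = [3, 2]

# The running sum of the lengths gives the end offset of each sequence.
ends = list(accumulate(lengths))
starts = [end - n for end, n in zip(ends, lengths)]

# Slice the flat label list back into one list per sentence.
sequences = [y[start:end] for start, end in zip(starts, ends)]
print(sequences)
```

The same offsets apply to the rows of the feature matrix, since it has one row per token.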
Train a model:
>>> from seqlearn.perceptron import StructuredPerceptron
>>> clf = StructuredPerceptron()
>>> clf.fit(X_train, y_train, lengths_train)
Check how well you did on a validation set, say validation.txt:
>>> X_test, y_test, lengths_test = load_conll("validation.txt", features)
>>> from seqlearn.evaluation import bio_f_score
>>> y_pred = clf.predict(X_test, lengths_test)
>>> print(bio_f_score(y_test, y_pred))
For more information, see the documentation.
If you use the .score() method, note that it may not score at the granularity you care about: for example, it may evaluate letter by letter when your model outputs whole words. Also check out whole_sequence_accuracy().
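The difference in granularity can be illustrated without seqlearn at all. The sketch below is not seqlearn's implementation; token_accuracy and sequence_accuracy are hypothetical helpers written for this example, and the labels are made up. It assumes per-sample scoring counts each position separately, while whole-sequence scoring counts a sequence as correct only if every label in it matches.

```python
from itertools import accumulate

def token_accuracy(y_true, y_pred):
    # Fraction of individual positions labeled correctly.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def sequence_accuracy(y_true, y_pred, lengths):
    # Fraction of sequences in which every position is labeled correctly.
    ends = list(accumulate(lengths))
    starts = [end - n for end, n in zip(ends, lengths)]
    exact = sum(y_true[s:e] == y_pred[s:e] for s, e in zip(starts, ends))
    return exact / len(lengths)

# Hypothetical labels for two sequences of 3 and 2 positions.
y_true = ["B", "I", "O", "B", "O"]
y_pred = ["B", "I", "O", "B", "B"]  # one mistake, in the second sequence
lengths = [3, 2]

tok = token_accuracy(y_true, y_pred)
seq = sequence_accuracy(y_true, y_pred, lengths)
print(tok, seq)
```

Here a single wrong label costs one position out of five at the token level but one sequence out of two at the sequence level, so the two scores can diverge sharply.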