NLP pipeline for rule-based pattern matching, heavily inspired by Spacy. This project supports only a very limited subset of patterns, so it is a showcase of possible improvements over Spacy, not a viable ready-to-use replacement.
The main advantages of lazy_nlp_pipeline over Spacy are lazy evaluation and the ability to create nested patterns.
Lazy evaluation allows skipping computationally expensive steps that are unnecessary for matching a pattern. In Spacy the full pipeline is applied to all documents; lazy_nlp_pipeline applies each step only when it is needed. For example, if a pattern starts with any token whose length is between X and Y and ends with a token with a particular lemma, there is no need to lemmatize every token in every document - only the tokens that were not already filtered out by the much cheaper length check at the start of the pattern.
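A minimal sketch of such a pattern, using only the TokenPattern/WordPattern parameters shown in the examples below (the project name, length bounds, lemma and test text are illustrative):
from lazy_nlp_pipeline import NLP, Pattern as P, TokenPattern as TP, WordPattern as WP
nlp = NLP(project_name='lazy_evaluation_sketch')
pattern = P(
    TP(min_len=5, max_len=20),  # cheap length check, applied to every candidate token
    WP(lemma='энциклопедия'),   # lemmas are computed only for candidates left after the cheap check
)
for span in nlp.match_patterns([pattern], texts=['Википедия — общедоступная энциклопедия']):
    print(span)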
Nested patterns make it possible to create more sophisticated rules than a mere list of all possible end-to-end sequences of token rules, as in Spacy. They also allow defining subpatterns in separate variables and reusing them in different parts of different patterns.
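For instance, a short sketch reusing the Pattern/TokenPattern API from the date example below (the names year, year_range and since_year are illustrative, and wrapping a single TokenPattern in a Pattern is assumed to be allowed):
from lazy_nlp_pipeline import Pattern as P, TokenPattern as TP
year = P(TP(isnumeric=True, min_len=4, max_len=4))         # subpattern defined once
year_range = P(year, TP('-'), year, allow_inbetween=None)  # ...nested in one pattern
since_year = P(TP('since', ignore_case=True), year)        # ...and reused in another
The date example below applies the same idea: ymd_date and dmy_date are defined once and reused inside a larger pattern.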
See basic usage examples in BasicUsage.ipynb
See an example and a comparison with Spacy in Example.ipynb
- Basic example
from lazy_nlp_pipeline import (NLP, Pattern as P,
                               TokenPattern as TP, WordPattern as WP,
                               )
nlp = NLP(project_name='simplest_token_patterns')
pattern = P(
    TP('1'),
    TP('a'),
)
test_texts = [
    '1 a',
    'Something 1 a something',
    'Something Something2',
]
for span in nlp.match_patterns([pattern], texts=test_texts):
    print(span)
# Span('1 a')[0:3 doc=140386624266448]
# Span('1 a')[10:13 doc=140386624411344]
- Date example
from lazy_nlp_pipeline import NLP, Pattern as P, TokenPattern as TP
nlp = NLP(project_name='simplest_token_patterns')
ymd_date = P(
    TP(isnumeric=True, min_len=4, max_len=4),
    TP('-'),
    TP(isnumeric=True, min_len=2, max_len=2),
    TP('-'),
    TP(isnumeric=True, min_len=2, max_len=2),
    allow_inbetween=None,
)
dmy_date = P(
    TP(isnumeric=True, min_len=2, max_len=2),
    TP('-'),
    TP(isnumeric=True, min_len=2, max_len=2),
    TP('-'),
    TP(isnumeric=True, min_len=4, max_len=4),
    allow_inbetween=None,
)
pattern = P(
    TP('from', ignore_case=True)[0:1],
    ymd_date | dmy_date,
    TP('to', ignore_case=True),
    ymd_date | dmy_date,
)
test_texts = [
    'From 2001-01-10 to 2009-01-10',
    'Something 10-01-2001 to 2009-01-10 Something2',
]
for span in nlp.match_patterns([pattern], texts=test_texts):
    print(span)
# Span('From 2001-01-10 to 2009-01-10')[0:29 doc=139710858833808]
# Span('2001-01-10 to 2009-01-10')[5:29 doc=139710858833808]
# Span('10-01-2001 to 2009-01-10')[10:34 doc=139710858245392]
- Russian lemmatization
from lazy_nlp_pipeline import NLP, Pattern as P, TokenPattern as TP, WordPattern as WP
nlp = NLP(project_name='simplest_token_patterns')
pattern = P(
    WP(lemma='общедоступный'),
    TP(isspace=False)[1:],
)
test_texts = [
    'Википедия (англ. Wikipedia) — общедоступная интернет-энциклопедия реализованная на принципах вики',
]
for span in nlp.match_patterns([pattern], texts=test_texts):
    print(span)
# Span('общедоступная интернет')[30:52 doc=139710859896592]
# Span('общедоступная интернет-')[30:53 doc=139710859896592]
# Span('общедоступная интернет-энциклопедия')[30:65 doc=139710859896592]