PmatchContainer
A class for performing pattern matching.
Probably the easiest way to perform pattern matching is with the functions hfst.compile_pmatch_expression and hfst.compile_pmatch_file.
Initialize a PmatchContainer. Is this needed?
Create a PmatchContainer based on definitions defs.
- defs: A tuple of transducers in HFST_OLW_TYPE defining how pmatch is done.
An example:
If we have a file named streets.txt that contains:
define CapWord UppercaseAlpha Alpha* ;
define StreetWordFr [{avenue} | {boulevard} | {rue}] ;
define DeFr [ [{de} | {du} | {des} | {de la}] Whitespace ] | [{d'} | {l'}] ;
define StreetFr StreetWordFr (Whitespace DeFr) CapWord+ ;
regex StreetFr EndTag(FrenchStreetName) ;
and which has earlier been compiled and stored in the file streets.pmatch.hfst.ol:
import hfst
# Compile the pmatch definitions and store them in HFST_OLW_TYPE format.
defs = hfst.compile_pmatch_file('streets.txt')
ostr = hfst.HfstOutputStream(filename='streets.pmatch.hfst.ol', type=hfst.ImplementationType.HFST_OLW_TYPE)
for tr in defs:
    ostr.write(tr)
ostr.close()
we can read the pmatch definitions from file and perform string matching with:
istr = hfst.HfstInputStream('streets.pmatch.hfst.ol')
defs = []
# Read every stored transducer back from the file.
while not istr.is_eof():
    defs.append(istr.read())
istr.close()
cont = hfst.PmatchContainer(defs)
assert cont.match("Je marche seul dans l'avenue des Ternes.") == "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
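To get a feel for what the StreetFr rule accepts without compiling anything, the definitions can be approximated with Python's re module. This is a regular-expression analogue, not pmatch itself. The constant and function names are hypothetical, and the sketch assumes that UppercaseAlpha and Alpha can be narrowed to ASCII letters and that Whitespace behaves like \s, which may be narrower than pmatch's own character classes.

```python
import re

# Hypothetical regex counterparts of the pmatch definitions in streets.txt.
# Assumption: ASCII letters stand in for UppercaseAlpha/Alpha and \s for Whitespace.
CAP_WORD = r"[A-Z][A-Za-z]*"
STREET_WORD_FR = r"(?:avenue|boulevard|rue)"
DE_FR = r"(?:(?:de la|des|de|du)\s|d'|l')"
# StreetFr: street word, optional whitespace plus DeFr, then one or more CapWords.
STREET_FR = re.compile(rf"{STREET_WORD_FR}(?:\s{DE_FR})?(?:{CAP_WORD})+")


def tag_french_streets(text: str) -> str:
    """Wrap every span matched by the approximated StreetFr rule in tags."""
    return STREET_FR.sub(
        lambda m: f"<FrenchStreetName>{m.group(0)}</FrenchStreetName>", text
    )


print(tag_french_streets("Je marche seul dans l'avenue des Ternes."))
```

The regex engine picks the leftmost match and extends it greedily, which agrees with the documented pmatch result for this sentence but is not guaranteed to agree for every input.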
See also: hfst.compile_pmatch_file, hfst.compile_pmatch_expression
Match input input.
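The string returned by match carries its results as inline tags, as in the FrenchStreetName example above. A minimal sketch of pulling those tagged spans back out is given here. The helper names are hypothetical, and the sketch assumes tags are flat (not nested) and shaped like `<Name>...</Name>`.

```python
import re

# Matches <Tag>content</Tag> where the opening and closing names agree.
# Assumption: tags are not nested inside each other.
TAGGED_SPAN = re.compile(r"<(\w+)>(.*?)</\1>")


def extract_tagged(matched: str) -> list[tuple[str, str]]:
    """Return (tag, content) pairs found in a tagged string."""
    return TAGGED_SPAN.findall(matched)


def strip_tags(matched: str) -> str:
    """Remove the tags while keeping the content they enclose."""
    return TAGGED_SPAN.sub(r"\2", matched)


tagged = "Je marche seul dans l'<FrenchStreetName>avenue des Ternes</FrenchStreetName>."
print(extract_tagged(tagged))
print(strip_tags(tagged))
```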
todo
todo
todo
todo
Tokenize input and return a list of tokens, i.e. strings.
- input: The string to be tokenized.
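Because the result is a plain list of strings, it can be passed directly to standard-library tools. The sketch below counts token frequencies. It assumes the method is exposed as `tokenize` on the container (the method name is not shown in this section), and it uses a hypothetical whitespace-splitting stub in place of a real PmatchContainer so that it runs without hfst installed.

```python
from collections import Counter


def token_counts(container, text):
    """Count how often each token occurs in text.

    Assumption: container.tokenize(text) returns a list of strings.
    """
    return Counter(container.tokenize(text))


class WhitespaceStub:
    """Hypothetical stand-in for a PmatchContainer; splits on whitespace only."""

    def tokenize(self, text):
        return text.split()


counts = token_counts(WhitespaceStub(), "rue de la Paix rue")
print(counts.most_common(1))
```

With a real container, the tokens would follow the boundaries defined by the pmatch rules rather than whitespace.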
Tokenize input and get a string representation of the tokenization (essentially the same as the command line tool hfst-tokenize would give).
- input: The input string to be tokenized.
- kwargs: Possible parameters are: output_format, max_weight_classes, dedupe, print_weights, print_all, time_cutoff, verbose, beam, tokenize_multichar.
- output_format: The format of output; possible values are 'tokenize', 'xerox', 'cg', 'finnpos', 'giellacg', 'conllu' and 'visl'; 'tokenize' being the default.
- max_weight_classes: Maximum number of best weight classes to output (where analyses with equal weight constitute a class), defaults to None, i.e. no limit.
- dedupe: Whether duplicate analyses are removed, defaults to False.
- print_weights: Whether weights are printed, defaults to False.
- print_all: Whether nonmatching text is printed, defaults to False.
- time_cutoff: Maximum number of seconds used per input after limiting the search.
- verbose: Whether input is processed verbosely, defaults to True.
- beam: Beam within which analyses must be to get printed.
- tokenize_multichar: Tokenize input into multicharacter symbols present in the transducer, defaults to False.
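Since the options are passed as free-form keyword arguments, a misspelled name is easy to miss. The following is a minimal sketch of a pre-check built only from the parameter names and output_format values listed above. The helper `check_tokenize_kwargs` is hypothetical and not part of hfst.

```python
# Parameter names and output formats as listed in the documentation above.
ALLOWED_KEYS = {
    "output_format", "max_weight_classes", "dedupe", "print_weights",
    "print_all", "time_cutoff", "verbose", "beam", "tokenize_multichar",
}
OUTPUT_FORMATS = {"tokenize", "xerox", "cg", "finnpos", "giellacg", "conllu", "visl"}


def check_tokenize_kwargs(**kwargs):
    """Reject unknown option names and unknown output formats.

    Returns the kwargs unchanged so they can be forwarded afterwards.
    """
    unknown = sorted(set(kwargs) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unknown tokenization options: {unknown}")
    fmt = kwargs.get("output_format", "tokenize")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output_format: {fmt!r}")
    return kwargs


options = check_tokenize_kwargs(output_format="cg", dedupe=True)
print(options)
```

The checked dictionary can then be unpacked with `**options` when calling the tokenization method on a container.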