Skip to content

frameword for creating filters based on properties in text

Notifications You must be signed in to change notification settings

FID-Judaica/filtermaker

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

filtermaker

filtermaker is a tiny framework for creating string-property testing suites. I will forego an explanation of how it works because it's one of those sausage factory things.

So to start, you get your filtering namespace:

>>> import filtermaker
>>> fs = filtermaker.get_filterspace()

Let's say I'm trying to determine if a certain string is part of a transliteration standard I'm working with (which is why this exists), first I will define a set which contains all the characters in that standard. It could be a sequence type as well, but a set will be faster for lookups, so it is recommended to use them. Set arithmetic is also helpful for defining subsets and supersets of things you've already defined so... use sets.

Once you have your set, you can register a couple of tests with it.

>>> all_chars = set('ʾʿăaāâbdĕeēêfghḥiîkḵlmnŏoōôpqrsṣśštṭuûvwyz')
>>> special_chars = all_chars - set(string.ascii_lowercase)
>>> fs.haschars(special_chars, 'trans chars')
>>> fs.onlycharset(all_chars, 'only trans chars')
  • create a set with all characters in the transliteration standard.
  • create a subset that contains only the non-ascii characters.
  • register a test called trans chars that checks to see if a string contains any of the non-ascii characters in the standard.
  • register a test that only returns true if all letter-characters in the string are part of the transliteration standard. Characters like punctuation and so forth will be skipped.

From there, we will want to create an instance of Filter:

>>> line = fs.Filter('Šel-lô be-derek ham-melek')
>>> line.has('trans chars')
True
>>> line.has('only trans chars')
True

One can also combine the tests:

>>> line.has('trans chars', 'only trans chars')
True

Note that this will return false as soon as the first test returns false.

>>> line = fs.Filter('Ḳef ṿe-Śimḥah') # ḳ isn't part of the standard!
>>> line.has('trans chars')
True
>>> line.has('only trans_chars')
False
>>> line.has('only trans chars', 'trans chars')
False

There are a couple of other built-in test generators:

>>> # some characters in our transliteration are expressed as digraphs.
>>> fs.hascluster({'kh', 'sh', 'ts'}, 'digraphs')
>>> fs.Filter('Ani Okhel Shoḳolad be-Śimḥah').has('digraphs')
True

as you see, .hascluster() creates a test that will check in the string contains certain clusters.

>>> fs.hasregex(re.compile('(\W|^)al-[^p]'), 'arabic article')
>>> # Some of the arabic transliteration is almost impossible to
>>> # distinguish from Hebrew. This at least checks for the presence
>>> # of the Arabic article.

This is basically the same as testing with re.search(). It just makes it easier to integrate regex searches into the rest of the framework.

So, those are the four built-in test types. What if you want to do a test that doesn't fit into any of those models? You can create your own tests with a decorator:

>>> @fs.register
... def only_western(line):
...    return all(ord(c) < 256 for c in line.data)
...
>>> fs.Filter('Šel-lô be-derek ham-melek').has('only_western')
False

The wrapped function should take one argument as input, which is going to be the Filter instance itself, which is a subclass of collections.UserString, so you access the real string with the .data attribute. The return value should be a bool.

If you want to get very fancy, you can create an entire class of tests with the .registrar decorator. Here is the internal implementation of the .haschars function to show how it works:

@registrar
def haschars(charset):
    return lambda line: any(c for c in line.data if c in charset)

The wrapped function to take one object as input (of any type), and it should return a closure that takes the Filter instance to be checked as input and returns a bool. The returned function will take the object to be passed to the wrapped function as the first argument and the name of the test as the second argument.

About

frameword for creating filters based on properties in text

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages