Gorgona allows you to preprocess your text datasets without writing a lot of code.
Currently, only Python 3.8 is supported. Other versions may work, but this is not guaranteed.
Clone the repository:
git clone https://github.com/theseus-automl/gorgona
cd gorgona
Install the package:
python setup.py install
First, create a YAML configuration file. Let's start by reviewing the stages section:
stages:
  - type: "unicode"
    form: "nfkd"
  - type: "html"
    repl: ""
  - type: "email"
    repl: ""
  - type: "phone"
    repl: ""
  - type: "url"
    repl: ""
  - type: "emoji"
    repl: ""
  - type: "whitespace"
    repl: ""
  - type: "strip"
Each stage may include the following parameters:
- type - stage type
- name - optional stage name. It can be useful in debug mode
- repl - replacement string for stages based on replacing
- join_on - string to join on for stages based on splitting
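To make the difference between repl and join_on concrete, here is a minimal sketch in plain Python. It does not use Gorgona itself: the helper names (replace_stage, split_join_stage) and the URL pattern are hypothetical stand-ins for a replacing stage and a splitting stage, not the library's actual implementation.

```python
import re

# Hypothetical pattern for illustration; Gorgona's own "url" stage may differ.
URL_RE = re.compile(r'https?://\S+')


def replace_stage(text: str, repl: str) -> str:
    """Replacing stage: every match of the pattern is substituted with `repl`."""
    return URL_RE.sub(repl, text)


def split_join_stage(text: str, join_on: str) -> str:
    """Splitting stage: split on runs of whitespace, then join with `join_on`."""
    return join_on.join(text.split())


print(repr(replace_stage('see https://example.com now', repl='')))
print(repr(split_join_stage('too    many \t spaces', join_on=' ')))
```

Note that an empty repl removes the match but leaves the surrounding spaces in place, which is why a whitespace-normalizing stage is usually placed after the replacing ones.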
The language detection stage is a bit more complex and may include the following parameters:
- model_path - path to the FastText model for language detection. You can download the model yourself or leave it to Gorgona
- target_lang - texts in languages other than the target are replaced with empty strings
- threshold - maximum detection confidence at which the language is still set to unknown
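The interaction between target_lang and threshold can be sketched as follows. This is only an illustration under stated assumptions: the detector is a fake stand-in for a FastText model, the function name filter_by_language is hypothetical, and treating a confidence at or below the threshold as "unknown" is one plausible reading of the parameter, not confirmed library behavior.

```python
from typing import Callable, Tuple

# A detector returns (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def fake_detect(text: str) -> Tuple[str, float]:
    """Stand-in for a real model: ASCII text is 'en', everything else 'ru'."""
    return ('en', 0.9) if text.isascii() else ('ru', 0.95)


def filter_by_language(
    text: str,
    detect: Detector,
    target_lang: str,
    threshold: float,
) -> str:
    """Keep the text only if it is confidently detected as `target_lang`."""
    lang, confidence = detect(text)
    if confidence <= threshold:
        # Assumption: low-confidence predictions are treated as unknown.
        lang = 'unknown'
    return text if lang == target_lang else ''


print(repr(filter_by_language('hello', fake_detect, 'en', 0.5)))
print(repr(filter_by_language('привет', fake_detect, 'en', 0.5)))
```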
You can use the defaults section to set custom default values of repl and join_on for all stages:
defaults:
  repl: ""
  join_on: " "
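One way to picture how defaults combine with per-stage settings is a simple dictionary merge. The sketch below is an assumption about the semantics (values set on a stage take precedence over defaults); the function name apply_defaults is hypothetical and is not part of Gorgona's API.

```python
from typing import Any, Dict

DEFAULTS: Dict[str, Any] = {'repl': '', 'join_on': ' '}


def apply_defaults(stage: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new stage config; keys set on the stage override the defaults."""
    return {**defaults, **stage}


# A stage without its own repl picks up the default one.
print(apply_defaults({'type': 'email'}, DEFAULTS))
# A stage with an explicit repl keeps it.
print(apply_defaults({'type': 'url', 'repl': '<URL>'}, DEFAULTS))
```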
Now, you can import Runner and start preprocessing:
from pathlib import Path
from gorgona import Runner
# create Runner instance, set path to config and number of workers
r = Runner(Path('config.yaml'), 4)
# get your texts
texts = [...]
# start preprocessing
res = r.run(texts)
The previous example uses multiprocessing as the backend, but Ray is also supported (see the Ray documentation if you don't know how to set up a cluster):
r = Runner(Path('config.yaml'), 4, backend='ray', ray_cluster_address='<address>')
You can also use Preprocessor separately in your code:
from pathlib import Path
from gorgona import Preprocessor
# create Preprocessor instance and set path to config
pr = Preprocessor(Path('config.yaml'))
# call Preprocessor on your text
pr('hello, world!')
Preprocessor also supports debug mode, where you can view every stage result:
pr('hello, world!', True)
# output:
# GORGONA DEBUG MODE
# RUNNING STAGE #0: stage 0
# BEFORE: ...
# AFTER: ...
# -------------------------
# RUNNING STAGE #1: stage 1
# BEFORE: ...
# AFTER: ...
# -------------------------
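Conceptually, debug mode amounts to running the stages one by one and printing the text before and after each of them. The following is a minimal sketch of that idea in plain Python; the stage list and the run function are hypothetical and only mimic the shape of the output above, not Gorgona's actual implementation.

```python
import re
from typing import Callable, List, Tuple

# Each stage is a (name, callable) pair; both stages here are illustrative.
Stage = Tuple[str, Callable[[str], str]]

STAGES: List[Stage] = [
    ('stage 0', lambda text: re.sub(r'<.*?>', '', text)),  # drop HTML tags
    ('stage 1', str.strip),                                # trim both ends
]


def run(text: str, debug: bool = False) -> str:
    """Apply all stages in order, optionally printing every intermediate result."""
    for index, (name, stage) in enumerate(STAGES):
        before = text
        text = stage(text)
        if debug:
            print(f'RUNNING STAGE #{index}: {name}')
            print(f'BEFORE: {before!r}')
            print(f'AFTER: {text!r}')
            print('-' * 25)
    return text


run('  <b>hello</b>, world!  ', debug=True)
```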
Important notes for writing your own stages:
- You must inherit from any of the base stages
- You must use the register decorator on your class. The provided alias is used to identify your stage when parsing the config
- No duplicate aliases are allowed
There are two base stages to choose from:
- BaseStage is the most flexible one. You can do anything in the __call__ method, for example:
from gorgona.stages import BaseStage, register


@register(alias='my_stage')
class MyStage(BaseStage):
    def __init__(self, name, regexp):
        super().__init__(name, regexp)

    def __call__(self, text, *args, **kwargs):
        print('this is my stage!')
        text = self._regexp.sub(':)', text)
        return text
- Replacer allows you to create convenient text replacement stages. For example, the built-in HtmlCleaner is a Replacer:
from gorgona.stages import Replacer


class HtmlCleaner(Replacer):
    def __init__(
        self,
        name: str,
        repl: str,
    ) -> None:
        super().__init__(
            name,
            r'<.*?>',
            repl,
        )
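To see what such a stage does to a string without installing anything, here is a self-contained sketch. ReplacerSketch is a hypothetical stand-in that assumes a Replacer simply compiles the pattern and substitutes every match with repl; the real class in gorgona.stages may do more.

```python
import re


class ReplacerSketch:
    """Stand-in for a Replacer: substitute every regexp match with `repl`."""

    def __init__(self, name: str, regexp: str, repl: str) -> None:
        self.name = name
        self._regexp = re.compile(regexp)
        self._repl = repl

    def __call__(self, text: str) -> str:
        return self._regexp.sub(self._repl, text)


class HtmlCleanerSketch(ReplacerSketch):
    def __init__(self, name: str, repl: str) -> None:
        # Non-greedy match, so each tag is removed separately.
        super().__init__(name, r'<.*?>', repl)


cleaner = HtmlCleanerSketch('html', '')
print(cleaner('<p>Hello, <b>world</b>!</p>'))  # prints: Hello, world!
```

Because the pattern is non-greedy and `.` does not match newlines by default, a tag that spans several lines would be left untouched by this sketch.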
Feel free to open issues, send pull requests and ask any questions!