TEI - TimeBankPT Event Identification
TEI is an event trigger identification system for sentences in Portuguese: given a sentence, it locates the terms that trigger events. The model was trained on the TimeBankPT corpus (COSTA; BRANCO, 2012).
The system outputs the identified events in the following JSON format:
[
{
"text": "Vazamentos",
"start": 0,
"end": 10
},
{
"text": "expõem",
"start": 20,
"end": 26
},
{
"text": "diz",
"start": 62,
"end": 65
}
]
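Judging from the sample above, `start` and `end` appear to be character offsets into the input sentence, with `end` exclusive. That reading is inferred from the example only, not from a documented contract. A minimal sketch that checks it, using a hypothetical helper `check_events`:

```python
import json

# Sentence and output copied from the example in this README.
sentence = "Vazamentos de dados expõem senhas de funcionários do governo, diz relatório."
output = """
[
  {"text": "Vazamentos", "start": 0, "end": 10},
  {"text": "expõem", "start": 20, "end": 26},
  {"text": "diz", "start": 62, "end": 65}
]
"""

def check_events(sentence, events):
    """Return the events whose [start, end) slice differs from their text.

    Hypothetical helper, not part of TEI. Assumes character offsets
    with an exclusive end, as the sample output suggests.
    """
    return [e for e in events if sentence[e["start"]:e["end"]] != e["text"]]

events = json.loads(output)
mismatches = check_events(sentence, events)
print(mismatches)
```

An empty list means every reported span can be recovered by slicing the original sentence.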
- Download and place the BERTimbau Base (SOUZA; NOGUEIRA; LOTUFO, 2020) model and vocabulary file:
$ wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_tensorflow_checkpoint.zip
$ wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt
Then unzip the checkpoint and place its contents, together with vocab.txt, in the models directory as follows:
├── models
|   └── BERTimbau
|       └── bert_config.json
|       └── bert_model.ckpt.data-00000-of-00001
|       └── bert_model.ckpt.index
|       └── bert_model.ckpt.meta
|       └── vocab.txt
|
|...
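Before running the system, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming it runs from the repository root. The helper `missing_model_files` is hypothetical and not part of TEI.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "bert_config.json",
    "bert_model.ckpt.data-00000-of-00001",
    "bert_model.ckpt.index",
    "bert_model.ckpt.meta",
    "vocab.txt",
]

def missing_model_files(models_dir="models"):
    """Return the expected BERTimbau files that are absent.

    Hypothetical helper: looks inside <models_dir>/BERTimbau.
    """
    base = Path(models_dir) / "BERTimbau"
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]

if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All BERTimbau files found.")
```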
- Install the packages.
$ pip install -r requirements.txt
-h, --help                  Print this help text and exit
--sentence SENTENCE         Sentence string to identify events from
--dir INPUT-DIR OUTPUT-DIR  Identify events from files of input directory
                            (one sentence per line) and write output json
                            files on output directory.
The text files in the input directory are expected to have the format:
* all text files end with the extension .txt
* sentences are separated by newlines
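An input file in this format could be produced with a short script. This is a sketch only: the helper `write_input_file` is hypothetical, and UTF-8 encoding is an assumption, since the expected encoding is not stated here.

```python
from pathlib import Path

def write_input_file(input_dir, name, sentences):
    """Write sentences to <input_dir>/<name>.txt, one sentence per line.

    Hypothetical helper, not part of TEI. UTF-8 is an assumption.
    """
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.txt"
    # Collapse runs of whitespace (including newlines) to single spaces
    # so that each sentence occupies exactly one line.
    lines = [" ".join(s.split()) for s in sentences]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path

if __name__ == "__main__":
    write_input_file(
        "/tmp/input-dir",
        "example",
        ["Vazamentos de dados expõem senhas de funcionários do governo, diz relatório."],
    )
```

Note that collapsing whitespace changes character positions if a sentence contained repeated spaces, so any offsets in the output refer to the text as written to the file.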
$ python3 src/tei.py --dir /tmp/input-dir /tmp/output-dir
$ python3 src/tei.py --sentence 'Vazamentos de dados expõem senhas de funcionários do governo, diz relatório.'
Peer-reviewed accepted paper:
- Sacramento, A., Souza, M.: Joint Event Extraction with Contextualized Word Embeddings for the Portuguese Language. In: 10th Brazilian Conference on Intelligent Systems (BRACIS), São Paulo, Brazil, November 29 to December 3, 2021.