UGCNormal

UGCNormal is a normalizer for user-generated content (UGC) in Brazilian Portuguese. To use it as a service, see ugcnormal_interface. A dockerized service that exposes the UGCNormal features is also available: ugcnormal-microservice.


                            UGC-Normalizer

INPUT
|
|    -----------------------------      -------------      -----------
---> | SentenceBoundaryDetection | ---> | tokenizer | ---> | speller | ----
     -----------------------------      -------------      -----------    |
                                                                          |
                                                                          |
     ----------------------------------------------------------------------
     |
     |    --------------      ------------------      ----------
     ---> | siglas_map | ---> | internetes_map | ---> | np_map | ---> OUTPUT
          --------------      ------------------      ----------



>>> HOW TO USE:

Before anything else, run the ./configure.sh script to check and resolve all
dependencies. After that, you can run the normalizer.

The main script is ugc_norm.sh, which applies the normalization pipeline. Run it
with two parameters, INPUT_dir and OUTPUT_dir. INPUT_dir must contain all the
text files to be processed.

You can test the normalizer using the data in directory "test":

./ugc_norm.sh ./test/input/ ./test/output/
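
If you prefer to drive the normalizer from Python, the call above can be wrapped
as in the minimal sketch below. The helper names build_command and
run_normalizer are hypothetical (they are not part of this repository); the
sketch only assumes that ugc_norm.sh takes the input and output directories as
its two positional parameters, as described above.

```python
import subprocess
from pathlib import Path


def build_command(input_dir, output_dir, script="./ugc_norm.sh"):
    """Return the argument list for one normalizer run."""
    return [script, str(input_dir), str(output_dir)]


def run_normalizer(input_dir, output_dir, script="./ugc_norm.sh"):
    """Run the pipeline; fail early if the input directory does not exist."""
    if not Path(input_dir).is_dir():
        raise NotADirectoryError(f"input directory not found: {input_dir}")
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_command(input_dir, output_dir, script), check=True)
```

From the repository root, run_normalizer("./test/input/", "./test/output/")
would then be equivalent to the shell command above.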



>>> MORE INFO:

******************************* test
Input and output directories for testing the normalizer. The output directory
tree holds the output of each pipeline step (sent -> tok -> checked -> siglas
-> internetes -> nomes). The deepest directory ('nomes') holds the result of
the full pipeline, which is probably the only result you need.
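
Since the exact nesting of the step directories is not spelled out here, one way
to locate the final result is to search the output tree for the 'nomes'
directory. The function find_step_dir below is a hypothetical helper, shown
only as a sketch.

```python
import os


def find_step_dir(output_root, step="nomes"):
    """Return the first directory named `step` under output_root, or None."""
    for current, dirnames, _filenames in os.walk(output_root):
        if step in dirnames:
            return os.path.join(current, step)
    return None
```

Passing another step name (for example "tok") returns the directory with that
intermediate output instead.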


******************************* internetes_map.pl
Perl script that translates web language (internet slang) into standard forms
using a dictionary
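
The idea of a dictionary-based translation can be sketched in Python as below.
This is not the Perl implementation: the function name map_internetes and the
four dictionary entries are illustrative assumptions, and the real dictionary
is the one shipped in the resources directory.

```python
import re

# Illustrative entries only; not taken from the project's dictionary.
INTERNETES = {"vc": "você", "tb": "também", "q": "que", "blz": "beleza"}


def map_internetes(text, mapping=INTERNETES):
    """Replace each whole word found in `mapping`; leave other words untouched."""
    def replace(match):
        word = match.group(0)
        return mapping.get(word.lower(), word)
    return re.sub(r"\w+", replace, text)


print(map_internetes("vc viu q eu tb fui"))  # você viu que eu também fui
```

Matching whole words (rather than substrings) keeps a short key such as "q"
from rewriting the inside of longer words.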


******************************* np_map.pl
Perl script that normalizes proper names (NPs) listed in
./resources/np_data.txt. It just capitalizes the first letter
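
A minimal Python sketch of this behavior, assuming the input is already
tokenized. The function capitalize_names and the sample names are
hypothetical; the real list comes from ./resources/np_data.txt.

```python
# Illustrative names only; not taken from np_data.txt.
NAMES = {"maria", "brasil", "samsung"}


def capitalize_names(tokens, names=NAMES):
    """Capitalize the first letter of known names, keeping the rest of each token."""
    return [
        tok[:1].upper() + tok[1:] if tok.lower() in names else tok
        for tok in tokens
    ]


print(capitalize_names(["a", "maria", "comprou", "um", "samsung"]))
# ['a', 'Maria', 'comprou', 'um', 'Samsung']
```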


******************************* siglas_map.pl
Script that converts a word to upper case if it is listed in
./resources/lexico_siglas.txt (acronyms)
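
The lookup can be sketched in Python as follows. The names load_lexicon and
uppercase_acronyms are hypothetical, and the sketch assumes the lexicon has one
entry per line, which this README does not confirm.

```python
def load_lexicon(lines):
    """Build a lowercase lookup set, assuming one entry per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def uppercase_acronyms(tokens, lexicon):
    """Upper-case every token found in the lexicon; leave the rest untouched."""
    return [tok.upper() if tok.lower() in lexicon else tok for tok in tokens]


# Illustrative entries; the real list is ./resources/lexico_siglas.txt.
lexicon = load_lexicon(["cpf\n", "usb\n", "sac\n"])
print(uppercase_acronyms(["o", "cabo", "usb", "quebrou"], lexicon))
# ['o', 'cabo', 'USB', 'quebrou']
```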


******************************* upper_handler.py
Checks whether a text file is written entirely in uppercase. If it is, only
words that follow punctuation are capitalized and all others are set to
lowercase
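
A rough Python sketch of this rule, not the code of upper_handler.py itself.
It works on a string rather than a file, and it makes two assumptions that are
not stated above: "punctuation" means '.', '!' or '?', and the very first word
of the text is capitalized as well.

```python
import re


def handle_uppercase(text):
    """Lowercase an all-uppercase text and re-capitalize sentence starts."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters or any(ch.islower() for ch in letters):
        return text  # mixed case (or no letters at all): leave unchanged
    lowered = text.lower()
    # Upper-case the first character of the text and the first character
    # after '.', '!' or '?' followed by whitespace.
    return re.sub(
        r"(^|[.!?]\s+)(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        lowered,
    )


print(handle_uppercase("O PRODUTO É BOM. RECOMENDO!"))  # O produto é bom. Recomendo!
```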


******************************* upper_periods.py
It capitalizes words after periods


******************************* README.txt
This file.


******************************* resources
Directory with dictionaries for NPs and web language


******************************* SentenceBoundaryDetection
Sentence boundary detection tool. It appends an <S> tag at the end of each
sentence
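
To illustrate the output format only, the naive Python sketch below splits on
'.', '!' or '?' followed by whitespace and appends the tag. The function
tag_sentences is hypothetical; the actual tool may use a different method, and
the exact spacing around <S> is an assumption.

```python
import re


def tag_sentences(text):
    """Append ' <S>' to each sentence found by a naive punctuation split."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [part.strip() for part in parts if part.strip()]
    return " ".join(sentence + " <S>" for sentence in sentences)


print(tag_sentences("Chegou rápido. Gostei muito!"))
# Chegou rápido. <S> Gostei muito! <S>
```

A split this simple breaks on abbreviations and decimal numbers, which is one
reason a dedicated tool is used for this step.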


******************************* speller
Speller tool directory


******************************* tokenizer
Tokenizer tool directory. You can change the lex rules in webtok.lex and then
rebuild the tokenizer by running make (a Makefile is provided)


******************************* utils
- ./utils/extract.sh
This script extracts all opinions (text files) from a corpus organized in many
subdirectories
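
The same idea can be sketched in Python as below. This is not a port of
extract.sh: the function extract_text_files, the "*.txt" pattern and the
numeric prefix on the copied files are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def extract_text_files(corpus_dir, target_dir, pattern="*.txt"):
    """Copy every file matching `pattern` under corpus_dir into one flat directory."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() collects the full list before copying starts.
    for index, source in enumerate(sorted(Path(corpus_dir).rglob(pattern))):
        if not source.is_file():
            continue
        # Prefix with a counter so files that share a name in different
        # subdirectories do not overwrite each other.
        destination = target / f"{index:05d}_{source.name}"
        shutil.copy2(source, destination)
        copied.append(destination)
    return copied
```

The resulting flat directory can then be passed to ugc_norm.sh as INPUT_dir.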

References

Duran, M. S.; Avanço, L. V.; Nunes, M. G. V. (2015). A Normalizer for UGC in Brazilian Portuguese. In: ACL 2015, Workshop on Noisy User-generated Text - WNUT, 2015, Beijing, China, p. 38-47. http://aclanthology.info/papers/W15-4305/a-normalizer-for-ugc-in-brazilian-portuguese
