HAREM collection
André Pires edited this page Jun 28, 2017 · 12 revisions
All of HAREM's resources can be downloaded here, including the dataset and the golden collection, all of the participants' results, and the extra programs made available at the HAREM conference.
The golden collection comprises 129 annotated documents, with texts in both European Portuguese (pt-PT, ~60%) and Brazilian Portuguese (pt-BR, ~40%).
- ABSTRACCAO: DISCIPLINA, ESTADO, IDEIA, NOME, OUTRO
- ACONTECIMENTO: EFEMERIDE, EVENTO, ORGANIZADO, OUTRO
- COISA: CLASSE, MEMBROCLASSE, OBJECTO, SUBSTANCIA, OUTRO
- LOCAL: FISICO (ILHA, AGUACURSO, PLANETA, REGIAO, RELEVO, AGUAMASSA, OUTRO), HUMANO (RUA, PAIS, DIVISAO, REGIAO, CONSTRUCAO, OUTRO), VIRTUAL (COMSOCIAL, SITIO, OBRA, OUTRO), OUTRO
- OBRA: ARTE, PLANO, REPRODUZIDA, OUTRO
- ORGANIZACAO: ADMINISTRACAO, EMPRESA, INSTITUICAO, OUTRO
- PESSOA: CARGO, GRUPOCARGO, GRUPOIND, GRUPOMEMBRO, INDIVIDUAL, MEMBRO, POVO, OUTRO
- TEMPO: DURACAO, FREQUENCIA, GENERICO, TEMPO_CALEND (HORA, INTERVALO, DATA, OUTRO), OUTRO
- VALOR: CLASSIFICACAO, MOEDA, QUANTIDADE, OUTRO
- OUTRO
Table form here.
Examples for each one here.
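The hierarchy listed above can also be held as a nested mapping, which makes it easy to check whether a category/type/subtype combination exists. This is only a sketch: the names `TAXONOMY`, `_types` and `is_valid` are hypothetical and do not come from the HAREM tools, and ABSTRACCAO is spelled as in the removed-categories list further down this page.

```python
def _types(*names, **with_subtypes):
    """Map each type name to its list of subtypes (empty when it has none)."""
    table = {name: [] for name in names}
    table.update(with_subtypes)
    return table


# Category -> type -> subtypes, transcribed from the list on this page.
TAXONOMY = {
    "ABSTRACCAO": _types("DISCIPLINA", "ESTADO", "IDEIA", "NOME", "OUTRO"),
    "ACONTECIMENTO": _types("EFEMERIDE", "EVENTO", "ORGANIZADO", "OUTRO"),
    "COISA": _types("CLASSE", "MEMBROCLASSE", "OBJECTO", "SUBSTANCIA", "OUTRO"),
    "LOCAL": _types(
        "OUTRO",
        FISICO=["ILHA", "AGUACURSO", "PLANETA", "REGIAO", "RELEVO",
                "AGUAMASSA", "OUTRO"],
        HUMANO=["RUA", "PAIS", "DIVISAO", "REGIAO", "CONSTRUCAO", "OUTRO"],
        VIRTUAL=["COMSOCIAL", "SITIO", "OBRA", "OUTRO"],
    ),
    "OBRA": _types("ARTE", "PLANO", "REPRODUZIDA", "OUTRO"),
    "ORGANIZACAO": _types("ADMINISTRACAO", "EMPRESA", "INSTITUICAO", "OUTRO"),
    "PESSOA": _types("CARGO", "GRUPOCARGO", "GRUPOIND", "GRUPOMEMBRO",
                     "INDIVIDUAL", "MEMBRO", "POVO", "OUTRO"),
    "TEMPO": _types(
        "DURACAO", "FREQUENCIA", "GENERICO", "OUTRO",
        TEMPO_CALEND=["HORA", "INTERVALO", "DATA", "OUTRO"],
    ),
    "VALOR": _types("CLASSIFICACAO", "MOEDA", "QUANTIDADE", "OUTRO"),
    "OUTRO": {},
}


def is_valid(category, type_=None, subtype=None):
    """Return True when the combination appears in TAXONOMY."""
    types = TAXONOMY.get(category)
    if types is None:
        return False
    if type_ is None:
        return subtype is None  # a subtype without a type is not valid
    if type_ not in types:
        return False
    return subtype is None or subtype in types[type_]
```

For example, `is_valid("LOCAL", "FISICO", "ILHA")` holds, while `is_valid("LOCAL", "FISICO", "RUA")` does not, because RUA is listed under HUMANO.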
lxml was used for the XML-related methods.
- Strip tags from unnecessary categories, types and subtypes (for filtered level)
  - Removed categories: ['OBRA','COISA','ABSTRACCAO','OUTRO']
  - Removed types: ['CARGO','GRUPOCARGO','GRUPOMEMBRO','MEMBRO','GRUPOIND','POVO', 'EFEMERIDE','VIRTUAL']
  - Removed subtypes: ['REGIAO','OUTRO','AGUAMASSA','AGUACURSO','RELEVO','PLANETA','ADMINISTRACAO']
- For the remaining elements, remove unnecessary attributes
  - Removed: ['TIPO','SUBTIPO','COREL','TIPOREL','ID','COMENT']
- Strip the OMITIDO tag and everything inside it
- Deal with multiple category, type or subtype assignments
  - Select the first option in each alternative
- Deal with the ALT tag (script)
  - Select all ALT tags
  - For the ALT tags which don't have entities inside, select the first alternative
  - For the rest, count the entities inside each alternative and select the alternative with the highest number of entities
  - Strip all ALT tags
- Output to file
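The steps above can be sketched roughly as follows. This is not the project's script. It uses the standard library's `xml.etree.ElementTree` and `re` instead of lxml so that it runs without extra dependencies, and it rests on several assumptions: entities are `EM` elements with `CATEG`, `TIPO` and `SUBTIPO` attributes, multiple assignments and ALT alternatives are separated by `|`, `ALT` tags carry no attributes, and an entity loses its annotation when any of its levels appears in a removed list. The first option is picked before filtering, which is a simplification of the order given above. All function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Lists copied from the steps above.
REMOVED_CATEGORIES = {"OBRA", "COISA", "ABSTRACCAO", "OUTRO"}
REMOVED_TYPES = {"CARGO", "GRUPOCARGO", "GRUPOMEMBRO", "MEMBRO",
                 "GRUPOIND", "POVO", "EFEMERIDE", "VIRTUAL"}
REMOVED_SUBTYPES = {"REGIAO", "OUTRO", "AGUAMASSA", "AGUACURSO",
                    "RELEVO", "PLANETA", "ADMINISTRACAO"}
REMOVED_ATTRIBUTES = ("TIPO", "SUBTIPO", "COREL", "TIPOREL", "ID", "COMENT")

ALT_RE = re.compile(r"<ALT>(.*?)</ALT>", re.DOTALL)
# Split on "|" only outside a tag, so attribute values are left alone.
PIPE_OUTSIDE_TAG = re.compile(r"\|(?![^<>]*>)")


def first_option(value):
    """Return the first of several '|'-separated assignments ('' if absent)."""
    return (value or "").split("|")[0].strip()


def replace_with_text(parent, child, text):
    """Remove `child` from `parent`, leaving `text` plus its tail in place."""
    siblings = list(parent)
    index = siblings.index(child)
    text += child.tail or ""
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + text
    parent.remove(child)


def filter_entities(root):
    """Apply the tag, attribute and OMITIDO steps to a parsed collection."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "OMITIDO":
                replace_with_text(parent, child, "")  # drop tag and contents
            elif child.tag == "EM":
                categ = first_option(child.get("CATEG"))
                tipo = first_option(child.get("TIPO"))
                subtipo = first_option(child.get("SUBTIPO"))
                if (categ in REMOVED_CATEGORIES or tipo in REMOVED_TYPES
                        or subtipo in REMOVED_SUBTYPES):
                    # Keep the words, drop the annotation.
                    replace_with_text(parent, child, "".join(child.itertext()))
                else:
                    child.set("CATEG", categ)
                    for name in REMOVED_ATTRIBUTES:
                        child.attrib.pop(name, None)
    return root


def resolve_alt(xml_text):
    """Keep the alternative with the most entities and strip the ALT tags."""
    def pick(match):
        alternatives = PIPE_OUTSIDE_TAG.split(match.group(1))
        # max() returns the first maximal item, so ties (including ALT tags
        # with no entities at all) fall back to the first alternative.
        best = max(alternatives, key=lambda alt: alt.count("<EM"))
        return best.strip()
    return ALT_RE.sub(pick, xml_text)
```

A possible way to chain the two: parse with `ET.fromstring`, call `filter_entities`, serialize with `ET.tostring(root, encoding="unicode")`, and pass the resulting string to `resolve_alt` before writing it to a file.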
Other processes:
- Remove unwanted spaces (script)
- Split dataset between train and test sets (script)
- To output the dataset with only categories, only types or only subtypes, set the category to the desired level
- Replace & with &
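Two of the processes above, choosing the output level and splitting the dataset, might look like the sketch below. It is one possible reading, not the project's scripts: it assumes "set the category to the desired level" means overwriting each entity's `CATEG` attribute with its `TIPO` or `SUBTIPO` value (so it has to run before those attributes are removed), and that the split is made per `DOC` element. The function names, `test_fraction` and `seed` are hypothetical; this page does not state the split ratio that was used.

```python
import random
import xml.etree.ElementTree as ET


def set_level(root, level):
    """Overwrite CATEG on every EM with the value stored at `level`."""
    if level not in ("CATEG", "TIPO", "SUBTIPO"):
        raise ValueError(f"unknown level: {level}")
    for em in root.iter("EM"):
        value = em.get(level)
        if value:  # entities without a value at this level are left as they are
            em.set("CATEG", value)
    return root


def split_documents(root, test_fraction=0.3, seed=0):
    """Shuffle the DOC elements and return (train_docs, test_docs)."""
    docs = list(root.iter("DOC"))
    random.Random(seed).shuffle(docs)  # seeded so the split is repeatable
    n_test = round(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]
```

Splitting by document rather than by sentence keeps all sentences of a text on the same side of the split, which is a design choice of this sketch rather than something stated on this page.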
Check scripts for filtration folder. Use these commands to run the scripts.