
dhfbk/simpitiki


SIMPITIKI: a Simplification corpus for Italian

SIMPITIKI is a simplification corpus for Italian consisting of two sets of simplified pairs: the first is harvested from the Italian Wikipedia in a semi-automatic way; the second is manually annotated sentence by sentence from documents in the administrative domain.

The first part is the result of a study assessing whether a simplification corpus can be built from Wikipedia in a semi-automated way, starting from Wikipedia edits. The study is inspired by the work presented in (Woodsend and Lapata 2011), in which a set of parallel sentences was extracted from the Simple Wikipedia revision history. The present work differs in two respects: (i) we use the Italian Wikipedia revision history, demonstrating that the approach also applies to languages other than English and to edits of Wikipedia that were not created for educational purposes, and (ii) we manually select the actual simplifications and label them following the annotation scheme already applied to other Italian corpora. This makes it possible to compare SIMPITIKI with other resources for text simplification and allows seamless integration between different corpora. Our methodology can be summarised as follows: we first select the edited sentence pairs that were commented as 'simplified' in Wikipedia edits, filtering out some specific simplification types (for example, template pages). We then manually check the extracted pairs and, when a pair is an actual simplification, annotate its types in compliance with the existing annotation scheme for Italian (see below).
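The automatic selection step described above can be sketched roughly as follows. This is only an illustration, not the code used to build the corpus: the `Edit` record, the `candidate_pairs` function, the `semplific` keyword stem and the `Template:` title prefix are all assumptions introduced for the example. The manual check and annotation that follow this step are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical record for one edited sentence pair from a revision history."""
    page_title: str
    comment: str
    before: str
    after: str


def candidate_pairs(edits, keyword="semplific", excluded_prefixes=("Template:",)):
    """Keep sentence pairs whose edit comment mentions simplification.

    `keyword` is an assumed stem (it would match e.g. "semplificato");
    the actual filter used for SIMPITIKI is not documented here.
    """
    selected = []
    for edit in edits:
        if keyword.lower() not in edit.comment.lower():
            continue  # the edit was not commented as a simplification
        if edit.page_title.startswith(excluded_prefixes):
            continue  # skip excluded page types, e.g. template pages
        if edit.before.strip() == edit.after.strip():
            continue  # nothing actually changed
        selected.append((edit.before, edit.after))
    return selected


# Invented data, only to show the call.
edits = [
    Edit("Trento", "Semplificato il testo", "Frase lunga e complessa.", "Frase breve."),
    Edit("Trento", "Corretto refuso", "Una frse.", "Una frase."),
    Edit("Template:Bio", "semplificazione", "Vecchio testo.", "Nuovo testo."),
]
pairs = candidate_pairs(edits)
```

The pairs returned here are only candidates: in the methodology above, each one is still checked by hand before being annotated.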

The second part was created manually, using the same annotation paradigm, from documents in the administrative domain downloaded from the Municipality of Trento website.

The corpus

The corpus folder contains both versions of the corpus. The data in version 2 has better sentence boundaries.

To develop a corpus compliant with the annotation scheme already used in previous work on simplification, we followed the simplification types described in (Brunato et al., 2015). The tagset is included in the XML using the <legenda> tag, and can be summarized as follows (columns 2 to 4 give the number of instances of each type in each resource and overall):

| Type | Count (part one) | Count (part two) | Total |
| --- | --- | --- | --- |
| Split | 20 | 18 | 38 |
| Merge | 22 | 0 | 22 |
| Reordering | 14 | 20 | 34 |
| Insert - Verb | 11 | 5 | 16 |
| Insert - Subject | 5 | 1 | 6 |
| Insert - Other | 58 | 21 | 79 |
| Delete - Verb | 12 | 1 | 13 |
| Delete - Subject | 17 | 1 | 18 |
| Delete - Other | 146 | 31 | 177 |
| Transformation - Lexical Substitution (word level) | 96 | 253 | 349 |
| Transformation - Lexical Substitution (phrase level) | 143 | 184 | 327 |
| Transformation - Anaphoric replacement | 14 | 3 | 17 |
| Transformation - Noun to Verb | 3 | 32 | 35 |
| Transformation - Verb to Noun (nominalization) | 2 | 0 | 2 |
| Transformation - Verbal Voice | 2 | 1 | 3 |
| Transformation - Verbal Features | 10 | 20 | 30 |
| Total | 575 | 591 | 1166 |

The <simplifications> tag introduces the list of simplification pairs. Each pair uses the <simplification> tag: the type attribute links the pair to the corresponding simplification type; the origin attribute specifies the resource (itwiki for Wikipedia, tn for the Municipality of Trento); the <before> and <after> tags contain the text before and after the simplification, respectively. Inside them, <ins> and <del> tags highlight the parts where the text has been modified (<ins> means 'insert', <del> means 'delete').
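As an illustration of this layout, the pairs can be read with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch under stated assumptions: the `SAMPLE` snippet, its `<resource>` root element, the `type` values and the sentences are invented for the example, and the real corpus files may nest the elements differently. The sketch also assumes that `<del>` appears inside `<before>` and `<ins>` inside `<after>`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample that follows the element names described above.
SAMPLE = """<resource>
  <simplifications>
    <simplification type="1" origin="itwiki">
      <before>Il testo <del>complicato</del> originale.</before>
      <after>Il testo <ins>semplice</ins> originale.</after>
    </simplification>
    <simplification type="2" origin="tn">
      <before>Si <del>comunica</del> che.</before>
      <after>Si <ins>informa</ins> che.</after>
    </simplification>
  </simplifications>
</resource>"""


def read_pairs(root):
    """Collect every <simplification> pair as a plain dictionary."""
    pairs = []
    for node in root.iter("simplification"):
        before = node.find("before")
        after = node.find("after")
        if before is None or after is None:
            continue  # skip incomplete pairs
        pairs.append({
            "type": node.get("type"),
            "origin": node.get("origin"),
            # itertext() flattens the inline <del>/<ins> markup into plain text
            "before": "".join(before.itertext()).strip(),
            "after": "".join(after.itertext()).strip(),
            "deleted": [d.text or "" for d in before.iter("del")],
            "inserted": [i.text or "" for i in after.iter("ins")],
        })
    return pairs


root = ET.fromstring(SAMPLE)
pairs = read_pairs(root)
by_origin = Counter(p["origin"] for p in pairs)
```

To read a corpus file instead of the inline sample, `ET.parse(path).getroot()` can replace `ET.fromstring(SAMPLE)`, where `path` points to one of the XML files in the corpus folder.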

Credits

This resource has been developed in the Digital Humanities Unit at Fondazione Bruno Kessler by Sara Tonelli, Alessio Palmero Aprosio and Francesca Saltori.

The research leading to this corpus is partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

If you use SIMPITIKI in your work or research, please cite the following paper:

Tonelli, Sara, Alessio Palmero Aprosio, and Francesca Saltori. "SIMPITIKI: a Simplification corpus for Italian." Proceedings of CLiC-it (2016).

@article{tonelli2016simpitiki,
  title={SIMPITIKI: a Simplification corpus for Italian},
  author={Tonelli, Sara and Aprosio, Alessio Palmero and Saltori, Francesca},
  journal={Proceedings of CLiC-it},
  year={2016}
}

For more information, please send an e-mail to aprosio@fbk.eu.

License

The SIMPITIKI corpus is released under the CC-BY 4.0 license.
