SIMPITIKI: a Simplification corpus for Italian

SIMPITIKI is a simplification corpus for Italian. It consists of two sets of simplification pairs: the first is harvested from the Italian Wikipedia in a semi-automatic way; the second is manually annotated sentence by sentence from documents in the administrative domain.

The first part is the result of a study assessing whether a simplification corpus can be built from Wikipedia in a semi-automated way, starting from Wikipedia edits. The study is inspired by the work presented in (Woodsend and Lapata, 2011), in which a set of parallel sentences was extracted from the Simple Wikipedia revision history. The present work differs in two respects: (i) we use the Italian Wikipedia revision history, showing that the approach also applies to languages other than English and to Wikipedia editions that were not created for educational purposes, and (ii) we manually select the actual simplifications and label them following the annotation scheme already applied to other Italian corpora. This makes comparison with other text simplification resources possible and allows seamless integration between different corpora.

Our methodology can be summarised as follows: we first select the edited sentence pairs whose Wikipedia edits were commented as 'simplified', filtering out some specific cases (for example, template pages). Then we manually check the extracted pairs and, when a pair is an actual simplification, we annotate its types in compliance with the existing annotation scheme for Italian (see below).
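The automatic selection step described above can be sketched roughly as follows. This is only an illustration and not the code used to build the corpus: the `Edit` record, the `candidate_pairs` function, the `"simplif"` keyword and the `"Template:"` title prefix are all assumptions made for the example, and the sample edits are invented.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    """Hypothetical record for one sentence-level Wikipedia edit."""
    page_title: str
    comment: str
    before: str
    after: str


def candidate_pairs(edits, keyword="simplif", excluded_prefixes=("Template:",)):
    """Keep edits whose comment mentions the keyword, skipping excluded pages.

    The keyword and the excluded prefixes are assumptions; the README only
    says that edits commented as 'simplified' were selected and that some
    cases, such as template pages, were filtered out.
    """
    selected = []
    for edit in edits:
        if keyword not in edit.comment.lower():
            continue  # the edit comment does not mention simplification
        if edit.page_title.startswith(excluded_prefixes):
            continue  # e.g. template pages
        if edit.before == edit.after:
            continue  # nothing actually changed
        selected.append((edit.before, edit.after))
    return selected


# Invented sample data, for illustration only.
edits = [
    Edit("Roma", "Simplified wording", "Frase lunga e complessa.", "Frase semplice."),
    Edit("Template:Box", "simplified", "vecchio", "nuovo"),
    Edit("Milano", "fixed typo", "Frsae.", "Frase."),
]
candidates = candidate_pairs(edits)
```

The pairs returned by such a filter are only candidates: in the workflow described above they are then checked by hand and annotated with simplification types.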

The second part was created manually, using the same annotation scheme, starting from administrative documents downloaded from the Municipality of Trento website.

The corpus

The corpus folder contains both versions of the corpus. The data in version 2 has better sentence boundaries.

To keep the corpus compliant with the annotation scheme used in previous work on simplification, we followed the simplification types described in (Brunato et al., 2015). The tagset is included in the XML using the <legenda> tag, and can be summarized as follows (columns 2 to 4 give the number of instances of each type in part one, in part two and overall):

Type	Count (part one)	Count (part two)	Total
Split	20	18	38
Merge	22	0	22
Reordering	14	20	34
Insert - Verb	11	5	16
Insert - Subject	5	1	6
Insert - Other	58	21	79
Delete - Verb	12	1	13
Delete - Subject	17	1	18
Delete - Other	146	31	177
Transformation - Lexical Substitution (word level)	96	253	349
Transformation - Lexical Substitution (phrase level)	143	184	327
Transformation - Anaphoric replacement	14	3	17
Transformation - Noun to Verb	3	32	35
Transformation - Verb to Noun (nominalization)	2	0	2
Transformation - Verbal Voice	2	1	3
Transformation - Verbal Features	10	20	30
Total	575	591	1166

The <simplifications> tag introduces the list of simplification pairs. Each pair uses the <simplification> tag: the type attribute links the pair to the corresponding simplification type; the origin attribute specifies the resource (itwiki for Wikipedia, tn for the Municipality of Trento); the <before> and <after> tags contain the text before and after the simplification, respectively. Inside them, <ins> and <del> tags highlight the parts where the text has been modified (<ins> marks inserted text, <del> marks deleted text).
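As a minimal sketch of how this structure could be read, the following Python snippet parses an inline sample with the standard library. The tag and attribute names (simplifications, simplification, type, origin, before, after, ins, del) come from the description above; the root element name `resource`, the type values `1` and `2` and the sample sentences are assumptions made for the example, so check them against the actual corpus files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; the root tag and the type values are assumptions.
SAMPLE = """<resource>
  <simplifications>
    <simplification type="1" origin="itwiki">
      <before>Il testo <del>originale</del> della frase.</before>
      <after>Il testo <ins>semplice</ins> della frase.</after>
    </simplification>
    <simplification type="2" origin="tn">
      <before>Si <del>comunica</del> che l'ufficio rimane chiuso.</before>
      <after>Si <ins>avvisa</ins> che l'ufficio rimane chiuso.</after>
    </simplification>
  </simplifications>
</resource>"""


def plain_text(element):
    """Return the element's text with inline markup removed and spaces normalised."""
    return " ".join("".join(element.itertext()).split())


def read_pairs(root):
    """Collect one dictionary per <simplification> element."""
    pairs = []
    for node in root.iter("simplification"):
        before = node.find("before")
        after = node.find("after")
        pairs.append({
            "type": node.get("type"),
            "origin": node.get("origin"),
            "before": plain_text(before),
            "after": plain_text(after),
            "deleted": [d.text for d in before.iter("del")],
            "inserted": [i.text for i in after.iter("ins")],
        })
    return pairs


root = ET.fromstring(SAMPLE)
pairs = read_pairs(root)
by_origin = Counter(pair["origin"] for pair in pairs)
```

For a real corpus file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`, where `path` points to one of the XML files in the corpus folder.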

Credits

This resource has been developed in the Digital Humanities Unit at Fondazione Bruno Kessler by Sara Tonelli, Alessio Palmero Aprosio and Francesca Saltori.

The research leading to this corpus is partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

If you use SIMPITIKI in your work or research, please cite the following paper:

Tonelli, Sara, Alessio Palmero Aprosio, and Francesca Saltori. "SIMPITIKI: a Simplification corpus for Italian." Proceedings of CLiC-it (2016).

@article{tonelli2016simpitiki,
  title={SIMPITIKI: a Simplification corpus for Italian},
  author={Tonelli, Sara and Aprosio, Alessio Palmero and Saltori, Francesca},
  journal={Proceedings of CLiC-it},
  year={2016}
}

For more information, please send an e-mail to aprosio@fbk.eu.

License

The SIMPITIKI corpus is released under the CC-BY 4.0 license.
