The SiNER dataset contains 1,338 news articles and more than 1.35 million tokens collected from the Sindhi newspapers Kawish and Awami Awaz. The dataset is labelled using the begin-inside-outside (BIO) tagging scheme. The proposed dataset is likely to be a significant resource for statistical Sindhi language processing. The ultimate goal of developing SiNER is to provide a gold-standard dataset for Sindhi named entity recognition (NER) along with quality baselines. We implement several baseline approaches based on conditional random fields (CRF) and recent state-of-the-art bi-directional long short-term memory (Bi-LSTM) models.
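To illustrate what BIO labelling means in practice, the following is a minimal sketch of how BIO tags can be decoded into entity spans. It is not taken from the SiNER release: the function name `bio_to_spans`, the example tokens, and the tag names `PERSON` and `LOC` are assumptions made purely for illustration, and the actual SiNER tag set and file layout may differ.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert a BIO tag sequence into (entity_type, start, end) spans.

    `end` is exclusive. A stray I- tag with no matching B- tag is
    treated leniently as the start of a new entity.
    """
    spans: List[Tuple[str, int, int]] = []
    current = None  # entity type of the span being built, if any
    start = None    # index where the current span began
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # A B- tag always closes any open span and opens a new one.
            if current is not None:
                spans.append((current, start, i))
            current, start = tag[2:], i
        elif tag.startswith("I-") and current == tag[2:]:
            # Continuation of the open span.
            continue
        else:
            # "O" or a mismatched I- tag closes the open span.
            if current is not None:
                spans.append((current, start, i))
            if tag.startswith("I-"):
                current, start = tag[2:], i
            else:
                current, start = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical example sentence; tokens and tag names are illustrative only.
tokens = ["Ali", "Khan", "visited", "Hyderabad", "yesterday"]
tags = ["B-PERSON", "I-PERSON", "O", "B-LOC", "O"]

entities = [(label, " ".join(tokens[s:e])) for label, s, e in bio_to_spans(tags)]
print(entities)  # [('PERSON', 'Ali Khan'), ('LOC', 'Hyderabad')]
```

In this scheme, the B- prefix marks the first token of an entity, I- marks a continuation token, and O marks tokens outside any entity, so multi-token names such as the two-token person name above are recovered as a single span.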
Sindhi belongs to the Indo-Aryan language family and has passed through many literary evolutions. It has some unique linguistic characteristics, such as a rich morphological structure, multiple writing systems, and dialects with a historical linguistic and cultural background. Presently, Sindhi is an official language of the Sindh province of Pakistan, where it is also taught as a compulsory subject from primary to higher education. It is also one of the national languages of India, where it is written in the Devanagari (सिन्धी) script. However, Sindhi Persian-Arabic (سنڌي) is the standard writing system. The two variants differ from each other in script, grammar, and vocabulary. The Persian and Arabic languages influence Sindhi Persian-Arabic, while the writing system of Hindi influences the Sindhi-Devanagari script.
The Sindh province of Pakistan is home to the largest population of native Sindhi speakers. A sizeable number of native speakers also reside in Rajasthan, Ulhasnagar, Maharashtra, and Gujarat in India. Moreover, Sindhi is the first language of native speakers who migrated to America, the United Kingdom, Tanzania, Hong Kong, Canada, Singapore, the Philippines, Kenya, Uganda, and South and East Africa. The total number of Sindhi speakers worldwide is around 75 million. At present, many news, literary, academic, and official blogs and websites in Pakistan and India have become a good source of text. Sindhi is a morphologically rich, cursive language like Arabic and Urdu. Its alphabet consists of 52 letters: 29 borrowed from Arabic, four from Persian, and 18 modified letters. Sindhi words can have multiple meanings. Moreover, the absence of diacritic marks and the many possible ways of forming words make it a morphologically complex language.
SiNER was published in LREC 2020.
SiNER: A Large Dataset for Sindhi Named Entity Recognition (Ali et al., LREC 2020)