Shellshal

Shell scripts for Arabic language processing

Description

This project is a collection of small shell scripts for processing Arabic text. They can:

Tokenize a text file into words, either keeping duplicates or keeping unique words only
Strip Tashkeel from text (all diacritics, all Harakat while keeping Shadda, or the last Haraka only)
Strip only the last mark from every word in a text
Convert Alef Wasla into plain Alef
Build a list from a CSV file

Naming

شَلْشَلَ (shalshala, root sh-l-sh-l) is a quadriliteral Arabic verb, both intransitive and transitive: شَلْشَلْتُ, أُشَلْشِلُ, شَلْشِلْ; verbal noun شَلْشَلَةٌ. Its dictionary senses are:

1. شَلْشَلَ الْمَاءَ: he poured the water out in a continuous stream.
2. شَلْشَلَ الْمَاءُ: the water dripped and flowed continuously.
3. شَلْشَلَ السَّيْفُ الدَّمَ: the sword shed the blood.

The name was chosen for its resemblance to the word "shell", meaning the command line, and the sense of شَلْشَلَة is continuous succession.

Usage

Install

make install

Test

make test

Scripts

Display all available commands by running:

shellshal

Commands

Tokenize

1- Tokenize a text file with the following script:

tokenize.sh filename

source

sed 's/[[:punct:][:space:]×،؛]/\n/g' < "$1" | sed '/^\s*$/d'
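
The same pipeline can be wrapped in a function that reads standard input, which makes it easier to chain with the other scripts. This is only a sketch: `tokenize_stdin` is a hypothetical name that is not part of shellshal, and it assumes GNU sed (for `\n` in the replacement and for `\s`) and, for Arabic input, a UTF-8 locale.

```shell
#!/bin/sh
# tokenize_stdin: hypothetical helper, not shipped with shellshal.
# It applies the tokenize.sh pipeline to standard input:
#   1. replace punctuation, whitespace, and the characters ×،؛ with newlines
#   2. delete the empty lines that step 1 leaves behind
tokenize_stdin() {
    sed 's/[[:punct:][:space:]×،؛]/\n/g' | sed '/^\s*$/d'
}

# Example: prints one token per line
printf 'one, two;three  four\n' | tokenize_stdin
```

With this helper, a file can be tokenized with `tokenize_stdin < filename`.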

2- Tokenize, sort, remove duplicates, and count word frequencies in a file. The result is written to filename.unq.

tokenize_uniq.sh filename

source

sed 's/[[:punct:][:space:]×،؛]/\n/g' < "$1" | sed '/^\s*$/d' | sort | uniq -c | sort -nr > "$1.unq"
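
Each line of filename.unq holds a count followed by a token, most frequent first (the layout produced by `uniq -c`). As a minimal sketch of reading that file back, the function below keeps only the tokens that reach a minimum count. `frequent_tokens` is a hypothetical helper, not part of shellshal.

```shell
#!/bin/sh
# frequent_tokens MIN: hypothetical helper, not shipped with shellshal.
# Reads "count token" lines (the uniq -c layout of filename.unq) on
# standard input and prints the tokens seen at least MIN times.
frequent_tokens() {
    awk -v min="$1" '$1 >= min { print $2 }'
}

# Example with three counted tokens; only the first two reach the threshold
printf '      3 b\n      2 a\n      1 c\n' | frequent_tokens 2
```

For a real file, this would be called as `frequent_tokens 5 < filename.unq`.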

Tashkeel Removal

1- Remove Harakat (diacritics), Tatweel and Shadda from text

strip_tashkeel.sh filename

source

# Tatweel U+0640, Tanween and Harakat U+064B..U+0650, Shadda U+0651, Sukun U+0652
CHARS=$(python3 -c 'print("\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")')
sed 's/['"$CHARS"']//g' < "$1"
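
A bracket expression that contains Arabic characters only behaves as intended when sed runs in a UTF-8 locale. As an alternative sketch (not the project's script), the marks can be matched as raw UTF-8 bytes under `LC_ALL=C`, which needs neither Python nor a UTF-8 locale. It assumes GNU sed (for the `\xHH` escapes) and UTF-8 encoded input; `strip_marks_bytes` is a hypothetical name, and Tatweel is not covered here.

```shell
#!/bin/sh
# strip_marks_bytes: hypothetical helper, not shipped with shellshal.
# U+064B..U+0652 (Tanween, Harakat, Shadda, Sukun) are encoded in UTF-8 as
# the byte 0xD9 followed by one byte in the range 0x8B..0x92, so in the C
# locale they can be removed byte by byte.
strip_marks_bytes() {
    LC_ALL=C sed 's/\xd9[\x8b-\x92]//g'
}

# Example: prints the sentence without its diacritics
printf 'كَتَبَ الوَلَدُ\n' | strip_marks_bytes
```

Arabic letters that also start with the byte 0xD9 (for example ك, ل, م, ن) have a second byte below 0x8B, so they are left untouched.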

2- Remove Harakat (diacritics) and Tatweel from text, but keep Shadda

strip_harakat.sh filename

source

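# Tatweel U+0640, Tanween and Harakat U+064B..U+0650, Sukun U+0652, superscript Alef U+0670; Shadda U+0651 is kept
CHARS=$(python3 -c 'print("\u0640\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0670")')
sed 's/['"$CHARS"']//g' < "$1"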

3- Remove the last Haraka (diacritic) from the end of every word in the text

strip_lastmark.sh filename

source

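# Tanween and Harakat U+064B..U+0650, Shadda U+0651, Sukun U+0652
CHARS=$(python3 -c 'print("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")')
# Drop a mark only when it is followed by whitespace, punctuation, or end of line
sed -E 's/['"$CHARS"']([[:space:][:punct:]،؛]|$)/\1/g' < "$1"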

4- Replace Alef Wasla with plain Alef in the words of a text

replace_wasla.sh filename

source

CHARS=$(python3 -c 'print("\u0671")')  # Alef Wasla
TO=$(python3 -c 'print("\u0627")')     # plain Alef
sed 's/['"$CHARS"']/'"$TO"'/g' < "$1"
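
Because only a single character is replaced, the substitution can also be written with the two letters typed literally in the sed script, without the Python step. The sketch below assumes the script file itself is saved as UTF-8; `replace_wasla_stdin` is a hypothetical name, not part of shellshal.

```shell
#!/bin/sh
# replace_wasla_stdin: hypothetical helper, not shipped with shellshal.
# Replaces every Alef Wasla (U+0671) with a plain Alef (U+0627).
replace_wasla_stdin() {
    sed 's/ٱ/ا/g'
}

# Example: the leading Alef Wasla becomes a plain Alef
printf 'ٱلرحمن\n' | replace_wasla_stdin
```

Since the pattern is a fixed sequence rather than a bracket expression, sed matches it as the same byte sequence in any locale.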

Build lists and dictionary

1- makelist: convert a file (a CSV file or one word per line) into a Python list

shellshal/makelist.sh testfile.csv

source

awk 'BEGIN{print "MyList=["};/^[^#]/{printf "u\"%s\",\n",$1};END{print "]"}' "$1"
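
To illustrate what the awk program produces, the sketch below runs the same three rules on a small input read from standard input. `makelist_stdin` is a hypothetical wrapper, not part of shellshal; the rules are only spread over separate lines for readability.

```shell
#!/bin/sh
# makelist_stdin: hypothetical helper, not shipped with shellshal.
# Prints the first whitespace-separated field of every non-empty line that
# does not start with "#", wrapped as a Python list named MyList.
makelist_stdin() {
    awk 'BEGIN { print "MyList=[" }
         /^[^#]/ { printf "u\"%s\",\n", $1 }
         END { print "]" }'
}

# Example: the comment line is skipped and only the first field is kept
printf '# comment\nfoo bar\nbaz\n' | makelist_stdin
# Output:
# MyList=[
# u"foo",
# u"baz",
# ]
```

awk splits fields on whitespace by default, so for a comma-separated file the `-F,` option would be needed to select only the first column.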
