Semantic dependency relationship extractor untuk bahasa Indonesia (termasuk bahasa gaul dan alay ;))
Optional: mendukung ekstraksi dari bahasa Indonesia ke struktur sentence RelEx (English).
i(first, singular) 'aku', saya' you(second, singular) 'kamu', 'Anda' he(third, singular) 'dia', 'ia' she(third, singular) 'dia', 'ia' we_sm(first, plural) 'kami' we_lg(first, plural) 'kita' you_plural(second, plural) 'kalian' they(third, plural) 'mereka'
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix owl: <http://www.w3.org/2002/07/owl#> . @prefix lemon: <http://lemon-model.net/lemon#> . @prefix dbpedia-owl: <http://dbpedia.org/ontology/> . @prefix dbpedia: <http://dbpedia.org/resource/> . @prefix schema: <http://schema.org> . @prefix wordnet-ontology: <http://wordnet-rdf.princeton.edu/ontology#> @prefix wn31: <http://wordnet-rdf.princeton.edu/wn31/> . @prefix wn20: <http://www.w3.org/2006/03/wn/wn20/instances/> . @prefix uby: <http://lemon-model.net/lexica/uby/wn/> .
Note: it’s hard to find dataset for @prefix wn30: http://wordnet-rdf.princeton.edu/wn30/ .
and nobody uses the wn30 purl.org version, so we’re using wn31.
If we need to use wn-msa someday then we’ll need to make it wn31-compliant.
But currently wn31 already provides ind
translations. :)
BTW http://wordnet-rdf.princeton.edu/wn30/ namespace always redirects to wn31
.
dbpedia-owl:Animal 'binatang', 'hewan'
Semua resources di sini diasumsikan a dbpedia-owl:Animal
.
dbpedia:Elephant 'gajah'
Aku melihat binatang gajah di kebun binatang.
noun verb 'binatang' dbpedia-owl:Animal 'di' 'kebun binatang' . => S( (NP 1) (VP 2 (NP 3 4)) (at dbpedia:Zoo) . )
Kamu melihat binatang apa di kebun binatang?
noun verb 'binatang' 'apa' 'di' 'kebun binatang' ? => S( (NP 1) (VP 2 (NP 3 what=dbpedia-owl:Animal)) (at dbpedia:Zoo) ? )
what
di sini ditentukan memiliki jenis dbpedia-owl:Animal
,
untuk membatasi search space jawaban.
Gajah memiliki 4 jumlah kaki.
Animal 'memiliki' int 'jumlah' 'kaki' . => S( (NP 1) (VP has (NP 4 feet)) .)
Gajah suaranya "auuuk…"
Gajah memiliki hidung panjang yang sering disebut belalai.
Hewan di samping namanya ular.
Tubuhnya panjang dan tidak memiliki kaki dan tangan.
Ular memiliki racun yang sering disebut bisa.
Warna ular disamping adalah hijau.
Kakak menunggangi hewan yang bernama kuda.
Kuda suaranya "yihheeekk".
Kuda suka mengangkut andong / bendi.
Kuda suka makan rumput.
Tubuh Nelly berwarna oranye/jingga/oren/kuning yang indah.
Nelly memiliki kuku/cakar yang runcing/tajam untuk mencakar mangsa.
Nelly suka makan ikan asin.
Hewan di samping namanya adalah kucing.
Cuicit adalah burung pipit Kesayangan Icha
Warnanya kuning dan oranye.
Cuicit suka berkicau/bernyanyi/bersuara di pagi hari.
Suara nya cit cit cit / cuit cuit cuit / merdu. :)
[WordNet 3.1 RDF](http://wordnet-rdf.princeton.edu/) (77 MB gzip, 1.2 GB expanded) comes in N-Triples format which is too big to parse anyway. So please convert it first to TURTLE using [rdfcat](http://semapps.blogspot.com/2010/05/using-jena-to-convert-rdfowl-file.html) using following techniques.
This is slow and generates unusable data anyway. Skip to TDB + arq.
time JVM_ARGS='-Xms6g -Xmx6g' ~/apache-jena-2.11.2/bin/rdfcat -out ttl -rdfxml ~/Downloads/wn31.nt > ~/Downloads/wn31.ttl
That will take 7m15s on i7, you can use 4 GB heap too, but no less. And generates 480 MB TURTLE file without nsPrefixes (sigh!). :(
You need to use tdbloader2
to load the WordNet 3.1 data.
ceefour@amanah:/media/ceefour/passport/project_passport/Lumen/wn31 > tdbloader2 --loc ~/tmp/wn31 wn31.nt
This took 108 seconds on i7 :) (8,574,807 tuples!) and generates 735 MB data.
Test:
ceefour@amanah:~ > tdbquery --loc=$HOME/wn31_tdb --file ~/git/relex-id/core/elephant.sparql ------------------------------------------------------------------ | y | z | ================================================================== | rdf:type | wordnet-ontology:Synset | | wordnet-ontology:translation | "象"@zho | | wordnet-ontology:translation | "éléphant"@fra | | wordnet-ontology:translation | "elefante"@glg | | wordnet-ontology:translation | "elefante"@ita | | wordnet-ontology:translation | "biram"@zsm | | wordnet-ontology:translation | "elefante"@por | | wordnet-ontology:translation | "elefant"@dan | | wordnet-ontology:translation | "elefanta"@por | | wordnet-ontology:translation | "biram"@ind | | wordnet-ontology:translation | "ゾウ"@jpn | | wordnet-ontology:translation | "elefant"@nob | | wordnet-ontology:translation | "ช้าง"@tha | | wordnet-ontology:translation | "فیل"@fas | | wordnet-ontology:translation | "gajah"@zsm | | wordnet-ontology:translation | "elefante"@spa | | wordnet-ontology:translation | "ช้างสาร"@tha | | wordnet-ontology:translation | "פִּיל"@heb | | wordnet-ontology:translation | "象さん"@jpn | | wordnet-ontology:translation | "elefante"@eus | | wordnet-ontology:translation | "gajah"@ind | | wordnet-ontology:translation | "象"@jpn | | wordnet-ontology:translation | "norsu"@fin | | wordnet-ontology:translation | "elefantti"@fin | | wordnet-ontology:translation | "پیل"@fas | | wordnet-ontology:translation | "Elefantes"@por | | wordnet-ontology:translation | "elephantidae"@spa | | wordnet-ontology:translation | "éléphantidés"@fra | | wordnet-ontology:translation | "elefant"@nno | | wordnet-ontology:translation | "elefant"@cat | | wordnet-ontology:translation | "หัตถี"@tha | | wordnet-ontology:hyponym | wn31:102507401-n | | wordnet-ontology:hyponym | wn31:102506644-n | | wordnet-ontology:hyponym | wn31:102509414-n | | wordnet-ontology:hyponym | wn31:102506387-n | | wordnet-ontology:hyponym | wn31:102507089-n | | wordnet-ontology:synset_member | wn31:elephant-n | | wordnet-ontology:gloss | "five-toed pachyderm"@eng | | wordnet-ontology:part_of_speech | wordnet-ontology:noun | | owl:sameAs | wn20:synset-elephant-noun-1 | | owl:sameAs | uby:WN_Synset_13287 | | rdfs:label | "elephant"@eng | | wordnet-ontology:lexical_domain | wordnet-ontology:noun.animal | | wordnet-ontology:hypernym | wn31:102505758-n | | wordnet-ontology:hypernym | wn31:102455739-n | | wordnet-ontology:part_holonym | wn31:101468354-n | | wordnet-ontology:part_holonym | wn31:102455598-n | | wordnet-ontology:member_meronym | wn31:102505944-n | ------------------------------------------------------------------
Yay! :)
Put in ~/.bashrc
:
export PATH=$PATH:$HOME/apache-jena-2.11.2/bin:$HOME/jena-fuseki-1.0.2
Then execute:
chmod +x ~/jena-fuseki-1.0.2/s-* ~/jena-fuseki-1.0.2/fuseki-server --update --loc ~/wn31_tdb /ds
Go to http://localhost:3030/sparql.tpl and upload WordNet 3.1 data.
(You can also use tdbloader2
to load the WordNet 3.1 data.)
Test:
> s-query --output text --service http://localhost:3030/ds/query --file ~/git/relex-id/core/elephant.sparql
--------------------------------------------------------------------------------------------------------------------------- | s | p | o | =========================================================================================================================== | | rdf:type | wordnet-ontology:Synset | | | wordnet-ontology:translation | "象"@zho | | | wordnet-ontology:translation | "éléphant"@fra | | | wordnet-ontology:translation | "elefante"@glg | | | wordnet-ontology:translation | "elefante"@ita | | | wordnet-ontology:translation | "biram"@zsm | | | wordnet-ontology:translation | "elefante"@por | | | wordnet-ontology:translation | "elefant"@dan | | | wordnet-ontology:translation | "elefanta"@por | | | wordnet-ontology:translation | "biram"@ind | | | wordnet-ontology:translation | "ゾウ"@jpn | | | wordnet-ontology:translation | "elefant"@nob | | | wordnet-ontology:translation | "ช้าง"@tha | | | wordnet-ontology:translation | "فیل"@fas | | | wordnet-ontology:translation | "gajah"@zsm | | | wordnet-ontology:translation | "elefante"@spa | | | wordnet-ontology:translation | "ช้างสาร"@tha | | | wordnet-ontology:translation | "פִּיל"@heb | | | wordnet-ontology:translation | "象さん"@jpn | | | wordnet-ontology:translation | "elefante"@eus | | | wordnet-ontology:translation | "gajah"@ind | | | wordnet-ontology:translation | "象"@jpn | | | wordnet-ontology:translation | "norsu"@fin | | | wordnet-ontology:translation | "elefantti"@fin | | | wordnet-ontology:translation | "پیل"@fas | | | wordnet-ontology:translation | "Elefantes"@por | | | wordnet-ontology:translation | "elephantidae"@spa | | | wordnet-ontology:translation | "éléphantidés"@fra | | | wordnet-ontology:translation | "elefant"@nno | | | wordnet-ontology:translation | "elefant"@cat | | | wordnet-ontology:translation | "หัตถี"@tha | | | wordnet-ontology:hyponym | wn31:102507401-n | | | wordnet-ontology:hyponym | wn31:102506644-n | | | wordnet-ontology:hyponym | wn31:102509414-n | | | wordnet-ontology:hyponym | wn31:102506387-n | | | wordnet-ontology:hyponym | wn31:102507089-n | | | wordnet-ontology:synset_member | wn31:elephant-n | | | wordnet-ontology:gloss | "five-toed pachyderm"@eng | | | wordnet-ontology:part_of_speech | wordnet-ontology:noun | | | owl:sameAs | wn20:synset-elephant-noun-1 | | | owl:sameAs | uby:WN_Synset_13287 | | | rdfs:label | "elephant"@eng | | | wordnet-ontology:lexical_domain | wordnet-ontology:noun.animal | | | wordnet-ontology:hypernym | wn31:102505758-n | | | wordnet-ontology:hypernym | wn31:102455739-n | | | wordnet-ontology:part_holonym | wn31:101468354-n | | | wordnet-ontology:part_holonym | wn31:102455598-n | | | wordnet-ontology:member_meronym | wn31:102505944-n | | wn31:101468354-n | wordnet-ontology:part_meronym | | | wn31:102505944-n | wordnet-ontology:member_holonym | | | wn31:102505758-n | wordnet-ontology:hyponym | | | wn31:102455739-n | wordnet-ontology:hyponym | | | wn31:102455598-n | wordnet-ontology:part_meronym | | | wn31:102507401-n | wordnet-ontology:hypernym | | | wn31:102506644-n | wordnet-ontology:hypernym | | | wn31:102509414-n | wordnet-ontology:hypernym | | | wn31:102506387-n | wordnet-ontology:hypernym | | | wn31:102507089-n | wordnet-ontology:hypernym | | | <http://wordnet-rdf.princeton.edu/wn31/elephant-n#1-n> | lemon:reference | | ---------------------------------------------------------------------------------------------------------------------------
Yay! :)
WordNet only contains nouns, verbs, adjectives, and adverbs. For other part-of-speeches, we need to use something else (probably [DBpedia Wiktionary](http://datahub.io/dataset/wiktionary-dbpedia-org)) or create our own data (but still using lemon-model.net ontology).
However there still needs to be corrections, especially false inclusions:
TDB:
tdbupdate -v --loc ~/wn31_tdb --update ~/git/relex-id/core/wn31patch.sparql
Fuseki:
s-update -v --service http://localhost:3030/ds/update --file ~/git/relex-id/core/wn31patch.sparql
Check:
tdbquery --results text --loc ~/wn31_tdb --file ~/git/relex-id/core/me.sparql
Required to run id.ac.itb.ee.lskk.relexid.core.BabelNetTest
-
Extract [BabelNet-API-2.5.zip](http://babelnet.org/download.jsp) to
$HOME/BabelNet-API-2.5
-
Extract the indexes to $HOME (will create subdirectories inside
$HOME/BabelNet-2.5
. For testing you can use the small indexes only:-
babelnet-2.5-APACHE-20-index.tar.bz2
-
babelnet-2.5-CC-BY-30-index.tar.bz2
-
babelnet-2.5-CC-BY-NC-SA-30-index.tar.bz2
-
babelnet-2.5-CECILL-C-index.tar.bz2
-
-
BabelNet API v1.0.1 + Path indexes v1.0.1: (I think we can use 1.1.1 instead, but not 2.0+)
-
Edit
$HOME/BabelNet-API-2.5/config/babelnet.var.properties
and setbabelnet.dir
to${user.home}/BabelNet-2.5
. -
Edit
$HOME/BabelNet-API-2.5/config/knowledge.var.properties
and setknowledge.graph.pathIndex
to${user.home}/BabelNet-1.0.1
.
-
[WordNet 3.1 RDF](http://wordnet-rdf.princeton.edu/)
-
[Lexical Resources & NLP Tools Bahasa Indonesia - Universitas Indonesia](http://bahasa.cs.ui.ac.id/resources.php)
-
[Membangun Tree Parse untuk Parsing di Stanford Parser Menggunakan Java - Yuita Arum Sari](http://arumsha.wordpress.com/2012/12/15/membangun-tree-parse-untuk-parsing-di-stanford-parser-menggunakan-java/)
-
[WordNet Bahasa Melayu/Malaysia/Indonesia](http://wn-msa.sourceforge.net/)
-
[NLP resource yang tersedia untuk bahasa Indonesia](http://alfan-farizki.blogspot.com/2010/04/nlp-resource-yang-tersedia-untuk-bahasa.html)
-
[Open Multilingual Wordnet](http://compling.hss.ntu.edu.sg/omw/)
-
[ConceptNet](http://conceptnet5.media.mit.edu/) is a semantic network containing lots of things computers should know about the world, especially when understanding text written by people.
-
[Indonesian WordNet](http://bahasa.cs.ui.ac.id/iwn)
-
[Indonesian Dictionary (Kamus Besar Bahasa Indonesia)](http://bahasa.cs.ui.ac.id/kbbi)
-
[LexInfo](http://lexinfo.net/) builds on the [lemon model](http://lemon-model.net/) to represent lexical information attached to ontologies on the semantic web.
-
Porter stemmer for Indonesian
-
[Symbolic Parser](http://bahasa.cs.ui.ac.id/tools/SymbolicParser.zip) — Bahasa Indonesia Symbolic Parser is a parser is a tool that will create a parse tree structures for Indonesian sentences. Developed by defining Context-Free Grammar (CFG) Rules for Bahasa Indonesia grammar, complete with a simple lexicon of words in Bahasa Indonesia, and run on PCPATR application, available at "http://www.sil.org/pcpatr/".
-
Statistical parser
-
Named entity tagger
-
[Semantic Analyzer](http://bahasa.cs.ui.ac.id/tools/SemanticAnalyzer.zip) — Bahasa Indonesia Semantic Analyzer is a tool that will create a semantic representation of Bahasa Indonesia sentences in first order predicate logic form. This Semantic Analyzer uses Syntax-Driven Semantic Analysis approach and developed using Indonesian Grammar (Symbolic Parser) that has been translated into PROLOG source. It runs in SWI-PROLOG application available at "http://www.swi-prolog.org/".
-
Semantic Analyzer with Axioms
-
[Morphological Analyzer](http://bahasa.cs.ui.ac.id/tools/MorphologicalAnalyzerIndonesia.zip) — Morphological Analyzer is a words recognition tools that split word into one or more morphems and also make the corresponding morphological analysis.
-
[Reasoning over Semantic Networks](http://ileriseviye.wordpress.com/2009/03/28/reasoning-over-semantic-networks/)
-
[OpenCog Concept formation](http://wiki.opencog.org/w/Concept_formation)