Skip to content

An NLP framework for large scale processing using Hadoop

Notifications You must be signed in to change notification settings

peterexner/KOSHIK

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

KOSHIK

An NLP framework for large scale processing using Hadoop. KOSHIK supports parsing of text in multiple languages including English, Swedish, and Chinese.

USAGE

Before processing a corpus, the corpus must be imported into Koshik. Koshik supports import from plain text, CoNLL2006/2009, and Wikipedia XML dumps. To import from a Wikipedia XML dump file, run:

hadoop jar Koshik-1.0.1.jar se.lth.cs.koshik.util.Import -input /enwiki-20140102-pages-articles.xml -inputformat wikipedia -language eng -charset utf-8 -output /enwiki_avro

The imported documents can then be parsed using the analysis tools in Koshik. To parse using an English semantic role labeler, run:

hadoop jar Koshik-1.0.1.jar se.lth.cs.koshik.util.EnglishPipeline -D mapred.reduce.tasks=12 -D mapred.child.java.opts=-Xmx8G -archives model.zip -input /enwiki_avro -output /enwiki_semantic

Querying data through HIVE

  • Importing data into Hive:

    CREATE EXTERNAL TABLE koshikdocs ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/hivetablekoshik' TBLPROPERTIES('avro.schema.url'='hdfs:///AvroDocument.avsc');
    LOAD DATA INPATH '/enwiki_semantic/*.avro' INTO TABLE koshikdocs;

  • Number of articles:

    SELECT count(identifier) from koshikdocs;

  • Number of sentences:

    SELECT count(ann) FROM koshikdocs LATERAL VIEW explode(annotations.layer) annTable as ann WHERE ann LIKE '%Sentence';

  • Number of tokens:

    SELECT count(ann) FROM koshikdocs LATERAL VIEW explode(annotations.layer) annTable as ann WHERE ann LIKE '%Token';

  • Number of nouns:

    SELECT count(key) FROM (SELECT explode(ann) AS (key,value) FROM (SELECT ann FROM koshikdocs LATERAL VIEW explode(annotations.features) annTable as ann) annmap) decmap WHERE key='POSTAG' AND value LIKE 'NN%';

NLP Model files

The language model files for the tools used in KOSHIK can be downloaded from the following sites:

References

Please cite the following paper, if you use KOSHIK:

About

An NLP framework for large scale processing using Hadoop

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages