A Python library that includes several functionalities for dealing with NAF/KAF files.
The library is a Python package divided into several subpackages.
List of subpackages:
- VUA_pylib.feature_extractor: functions for extracting elaborated information from KAF/NAF files
- VUA_pylib.lexicon: functions for accessing lexicons
- VUA_pylib.common: commonly used functions and utilities
- VUA_pylib.io: reading and writing files in different formats, feature files...
- VUA_pylib.corpus_reader: loading and querying corpora
The VUA_pylib.feature_extractor subpackage contains functions to extract information from a NAF/KAF file.
###class Cconstituency_extractor###
Extract information from the constituency layer in a NAF file
- get_deepest_phrase_for_termid(termid): gets the deepest phrase type for a term identifier AND the list of termids in that same chunk
- get_least_common_subsumer(termid1, termid2): gets the least common subsumer of both ids in the constituency tree
- get_path_from_to(term1,term2): path in the constituency tree from term1 to term2
- get_path_for_termid(termid): constituency type path from termid to the sentence root
- get_chunks(chunk_type): gets all the chunks for that type
- get_all_chunks_for_term(termid): gets all pairs (chunk_type, list_ids) of all chunks where the termid is contained
from NafParserPy import NafParser
from VUA_pylib.feature_extractor.constituency import Cconstituency_extractor
naf_obj = NafParser(file)
extractor = Cconstituency_extractor(naf_obj)
print extractor.get_deepest_phrase_for_termid('t363')
print extractor.get_path_from_to('t363','t365')
for ch in extractor.get_chunks('NP'):
    print ch
    print [naf_obj.get_term(i).get_lemma() for i in ch]
for type_chunk, list_ids in extractor.get_all_chunks_for_term('t713'):
    print type_chunk, list_ids
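
To illustrate what a least common subsumer and a tree path mean here, the following is a minimal, self-contained sketch on a toy tree. It does not use VUA_pylib: the `PARENT` table, the node labels and the three helper functions are hypothetical and only mirror the idea behind `get_least_common_subsumer` and `get_path_from_to`. The real methods may return different structures.

```python
# Toy constituency tree stored as child -> parent links (hypothetical data).
PARENT = {
    't1': 'NP1', 't2': 'NP1',
    't3': 'VP1',
    'NP1': 'S', 'VP1': 'S',
}


def path_to_root(node, parent=PARENT):
    """Return the nodes from `node` up to the root, both included."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def least_common_subsumer(node1, node2, parent=PARENT):
    """Return the lowest node that dominates both nodes, or None."""
    ancestors1 = set(path_to_root(node1, parent))
    for candidate in path_to_root(node2, parent):
        if candidate in ancestors1:
            return candidate
    return None


def path_from_to(node1, node2, parent=PARENT):
    """Return the path that climbs from node1 to the subsumer and descends to node2."""
    lcs = least_common_subsumer(node1, node2, parent)
    if lcs is None:
        return None
    up = path_to_root(node1, parent)
    up = up[:up.index(lcs) + 1]        # node1 ... subsumer
    down = path_to_root(node2, parent)
    down = down[:down.index(lcs)]      # node2 ... child of the subsumer
    return up + down[::-1]


if __name__ == '__main__':
    print(least_common_subsumer('t1', 't2'))
    print(path_from_to('t1', 't3'))
```

Storing only parent links keeps the sketch short: both queries reduce to comparing two root paths.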
###class Cdependency_extractor###
Extract information from the dependency layer in a NAF file
- get_shortest_path(term1,term2) --> gets the shortest dependency path from term1 to term2
- get_shortest_path_spans(span1,span2) --> gets the shortest dependency path between two spans of term ids
- get_path_to_root(termid) --> gets the shortest dependency path from the termid to the sentence root
- get_shortest_path_to_root_span(span) --> gets the shortest dependency path from the span of termids to the sentence root
from NafParserPy import NafParser
from VUA_pylib.feature_extractor import Cdependency_extractor
naf_obj = NafParser(file)
extractor = Cdependency_extractor(naf_obj)
p = extractor.get_shortest_path('t446','t453')
p2 = extractor.get_shortest_path_spans(['t444','t445','t446'], ['t451','t452','t454'])
p3 = extractor.get_path_to_root('t460')
p4 = extractor.get_shortest_path_to_root_span(['t444','t445','t446'])
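
As a rough illustration of the idea behind these methods, the sketch below finds shortest paths with a breadth-first search over a toy dependency graph. It is independent of VUA_pylib: the `DEPS` triples, the relation labels and all function names are hypothetical, and treating the dependencies as undirected edges is an assumption of this sketch. What the real methods return (term ids, relation labels, or both) is not shown in this document.

```python
from collections import deque

# Hypothetical dependency triples: (head term id, dependent term id, relation).
DEPS = [('t2', 't1', 'nsubj'), ('t2', 't4', 'dobj'), ('t4', 't3', 'det')]


def build_graph(deps):
    """Build an undirected adjacency map from (head, dependent, relation) triples."""
    graph = {}
    for head, dependent, _relation in deps:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of term ids or None if unreachable."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in sorted(graph[path[-1]]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


def shortest_path_spans(graph, span1, span2):
    """Return the shortest path over all pairs (id in span1, id in span2)."""
    best = None
    for id1 in span1:
        for id2 in span2:
            path = shortest_path(graph, id1, id2)
            if path is not None and (best is None or len(path) < len(best)):
                best = path
    return best


if __name__ == '__main__':
    graph = build_graph(DEPS)
    print(shortest_path(graph, 't1', 't3'))
    print(shortest_path_spans(graph, ['t1', 't2'], ['t3', 't4']))
```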
The VUA_pylib.lexicon subpackage encapsulates different lexicons.
###class MPQA_subjectivity_lexicon###
Provides access to the MPQA subjectivity lexicon
- get_type_and_polarity(word,pos=None): returns the type and polarity for the given word (and optionally pos)
>>> from VUA_pylib.lexicon import MPQA_subjectivity_lexicon
>>> my_lex = MPQA_subjectivity_lexicon()
>>> my_lex.get_type_and_polarity('abidance','noun')
('strongsubj', 'positive')
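
The lookup can be pictured with the following standalone sketch. It assumes the lexicon is stored one entry per line as `key=value` fields, which is how the MPQA subjectivity clues are commonly distributed; the two `SAMPLE_LINES` are illustrative, and `parse_line`, `load_lexicon` and this `get_type_and_polarity` function are hypothetical helpers, not the code of the class above.

```python
# Illustrative entries in an assumed key=value line format.
SAMPLE_LINES = [
    'type=strongsubj len=1 word1=abidance pos1=noun stemmed1=n priorpolarity=positive',
    'type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative',
]


def parse_line(line):
    """Split one lexicon line into a dict of its key=value fields."""
    fields = {}
    for token in line.split():
        if '=' in token:
            key, value = token.split('=', 1)
            fields[key] = value
    return fields


def load_lexicon(lines):
    """Index entries by (word, pos) and by (word, None) for pos-less lookups."""
    lexicon = {}
    for line in lines:
        fields = parse_line(line)
        if 'word1' not in fields:
            continue
        entry = (fields.get('type'), fields.get('priorpolarity'))
        lexicon.setdefault((fields['word1'], fields.get('pos1')), entry)
        # The first entry seen for a word answers lookups without a pos.
        lexicon.setdefault((fields['word1'], None), entry)
    return lexicon


def get_type_and_polarity(lexicon, word, pos=None):
    """Return (type, polarity) for the word, or None if it is not listed."""
    return lexicon.get((word, pos))


if __name__ == '__main__':
    lexicon = load_lexicon(SAMPLE_LINES)
    print(get_type_and_polarity(lexicon, 'abidance', 'noun'))
```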
The VUA_pylib.common subpackage contains widely used common functions.
- get_max_distr_dict(my_dict): gets the max (key,count) from a dict like my_dict = {'a':20,'b':1,'c':50}
- normalize_pos(pos): normalize different POS tags to --> a/r/n/v/*
>>> from VUA_pylib.common import *
>>> print get_max_distr_dict({'a':20,'b':1,'c':50})
('c', 50)
>>> print normalize_pos('noun')
>>> print normalize_pos('AdVeRb')
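
For readers who want to see how such helpers could work, here is a small sketch. The two functions carry a `_sketch` suffix to make clear they are not the library's own code, and the `POS_MAP` table is an assumption: the actual tag sets and rules handled by `normalize_pos` are not described in this document.

```python
# Hypothetical mapping from a few POS names to the a/r/n/v scheme.
POS_MAP = {
    'adjective': 'a', 'adj': 'a',
    'adverb': 'r', 'adv': 'r',
    'noun': 'n',
    'verb': 'v',
}


def get_max_distr_dict_sketch(my_dict):
    """Return the (key, count) pair with the highest count, or None if empty."""
    if not my_dict:
        return None
    return max(my_dict.items(), key=lambda item: item[1])


def normalize_pos_sketch(pos):
    """Map a POS label to a/r/n/v, falling back to '*' for anything unknown."""
    return POS_MAP.get(pos.lower(), '*')


if __name__ == '__main__':
    print(get_max_distr_dict_sketch({'a': 20, 'b': 1, 'c': 50}))
    print(normalize_pos_sketch('AdVeRb'))
```

Lowercasing before the lookup is what makes the mapping case-insensitive in this sketch.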
The VUA_pylib.corpus_reader subpackage provides access to corpora.
###class Cgoogle_web_nl###
Provides access to the Dutch Google Web 5-gram corpus through http://www.let.rug.nl/gosse/bin/Web1T5_freq.perl
- query(this_query): runs a query like "interessante *"
- get_items(): returns items which are objects of the class Citem
- set_limit(l): set the maximum limit of results
- set_min_freq(m): set the minimum frequency for n-grams
from VUA_pylib.corpus_reader import Cgoogle_web_nl
google = Cgoogle_web_nl()
google.query('interessante *')
for res in google.get_items():
    print res
    print res.get_hits()
    print res.get_word()
    print res.get_tokens()
###class Citem###
Encapsulates the information for an item: number of hits, word string and tokens
Main methods:
- get_hits(): returns the number of hits
- get_word(): returns the word
- get_tokens(): returns the list of tokens in the word
- Ruben Izquierdo
- Vrije Universiteit Amsterdam
- ruben.izquierdobevia@vu.nl