macocu/prevert

Prevert iterator

To use the prevert parser, copy the file prevert.py into your working directory.

Use

# import libraries
from prevert import dataset
import pandas as pd

If you are using the MaCoCu corpora in the XML format, dataset() needs only the path to the file as its argument:

# Open the dataset with the prevert parser 
dset = dataset("/data/monolingual/mk.xml")

dset is an iterable of documents; each document's metadata is available as doc.meta['attribute_name']. Each document is in turn an iterable of paragraphs, whose metadata is available as par.meta['attribute_name'].
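Because documents and paragraphs both carry metadata, it is often convenient to flatten the corpus into one record per paragraph. The following is a minimal sketch, not part of prevert.py: the helper name paragraph_rows and the doc_/par_ key prefixes are hypothetical, and it assumes that meta behaves like a dict on both documents and paragraphs.

```python
from typing import Any, Dict, Iterable, List


def paragraph_rows(docs: Iterable[Any]) -> List[Dict[str, Any]]:
    """Flatten documents into one row per paragraph.

    Assumption: every document exposes a dict-like `meta` and iterates over
    its paragraphs; every paragraph exposes a dict-like `meta` and yields
    its text through str().
    """
    rows = []
    for doc in docs:
        for par in doc:
            # Prefix keys so document and paragraph attributes cannot collide
            # (both levels may have an 'id' attribute, for example).
            row = {f"doc_{key}": value for key, value in doc.meta.items()}
            row.update({f"par_{key}": value for key, value in par.meta.items()})
            row["text"] = str(par)
            rows.append(row)
    return rows
```

With the parser loaded, paragraph_rows(dset) would return a list of dicts that can be passed directly to pd.DataFrame(...) for filtering or export.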

Basic use:

import ast # safe parsing of the serialized lang_distr attribute (assumes it is a Python-literal list)

for doc in dset: # iterate through the documents of a dataset
    print(doc.meta) # all document attributes
    print(ast.literal_eval(doc.meta['lang_distr'])[0][0]) # most prominent language in the document
    print(str(doc)) # whole document text
    for par in doc: # iterate through the paragraphs of a document
        print(par.meta['id']) # a specific paragraph attribute
        print(str(par)) # whole paragraph text
    print(doc.to_prevert()) # the document in the original prevert format
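Reading the most prominent language can be wrapped in a small helper that also copes with an empty distribution. This is a sketch under assumptions: top_language is a hypothetical name, and the lang_distr attribute is assumed to be a string holding a Python-literal list of (language, share) pairs ordered from most to least prominent, as the [0][0] indexing in the loop suggests.

```python
import ast
from typing import Optional


def top_language(lang_distr: str) -> Optional[str]:
    """Return the first language code of a serialized language distribution.

    Assumption: `lang_distr` looks like "[('mk', 0.9), ('en', 0.1)]",
    with the most prominent language listed first.
    """
    # literal_eval accepts only Python literals, so arbitrary expressions
    # in the attribute value are rejected instead of being executed.
    pairs = ast.literal_eval(lang_distr)
    return pairs[0][0] if pairs else None
```

Inside the loop this would be called as top_language(doc.meta['lang_distr']).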
