Simple Python parser #4

Open
joelkuiper opened this issue Jul 17, 2019 · 1 comment

joelkuiper commented Jul 17, 2019

In case anyone is interested, here is a simple parser:



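import gzip
import re

# Title/abstract line: "<pmid>|t|<text>" or "<pmid>|a|<text>".
HEADER = re.compile(r"(?P<pmid>[0-9]+)\|(?P<t>[ta])\|(?P<content>.*)")
# Annotation line, tab-separated: pmid, start, end, mention text, semantic type(s), CUI.
MENTIONS = re.compile(r"(?P<pmid>[0-9]+)\t(?P<start>[0-9]+)\t(?P<end>[0-9]+)\t(?P<content>.*)\t(?P<tui>(T.+|UnknownType))\t(?P<cui>C[0-9]+)")


def read(ifile):
    """Yield one dict per document from a gzipped annotation file."""
    obj = {"mention": []}
    # Text mode ("rt") decodes each line, so no manual .decode() is needed.
    with gzip.open(ifile, "rt", encoding="utf-8") as fin:
        for line in fin:
            h = HEADER.match(line)
            if h:
                obj["pmid"] = int(h.group("pmid"))
                obj[h.group("t")] = h.group("content")
                continue
            m = MENTIONS.match(line)
            if m:
                # Offsets are converted to int, like the pmid above.
                mention = {"start": int(m.group("start")),
                           "end": int(m.group("end")),
                           "content": m.group("content"),
                           "tui": m.group("tui").split(","),
                           "cui": m.group("cui")}
                obj["mention"].append(mention)
                continue
            # Any other line (normally the blank separator) ends the document.
            if "pmid" in obj:
                yield obj
            obj = {"mention": []}
    # Emit the last document if the file does not end with a blank line.
    if "pmid" in obj:
        yield obj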

The code isn't pretty, and I'm not sure how useful it is outside my own use case, but I thought it would be nice to share.


GregSilverman commented Oct 13, 2020

@joelkuiper, thanks! This is extremely helpful.

However, the text offsets of many of the manual annotations do not seem to align with the text (that is, the title + abstract).
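
A rough sketch of how such a mismatch could be checked, assuming each document is a dict with the keys used by the parser above ("t", "a", "mention"). The function name `misaligned_mentions`, the `sep` parameter, and the toy document are hypothetical, and joining title and abstract with a single separator character is an assumption about the offset convention, not something confirmed in this thread:

```python
def misaligned_mentions(doc, sep="\n"):
    """Return (mention, found_text) pairs whose offsets do not match.

    Assumption: offsets index into title + sep + abstract. Change `sep`
    if the annotations follow a different joining convention.
    """
    text = doc.get("t", "") + sep + doc.get("a", "")
    bad = []
    for m in doc["mention"]:
        # int() accepts offsets stored either as strings or as integers.
        start, end = int(m["start"]), int(m["end"])
        found = text[start:end]
        if found != m["content"]:
            bad.append((m, found))
    return bad


# Hypothetical toy document, not taken from the dataset.
doc = {
    "pmid": 1,
    "t": "Aspirin use",
    "a": "Aspirin reduces fever.",
    "mention": [
        {"start": "0", "end": "7", "content": "Aspirin"},
        {"start": "12", "end": "19", "content": "Aspirin"},
        {"start": "29", "end": "34", "content": "fever"},  # deliberately off by one
    ],
}

for mention, found in misaligned_mentions(doc):
    print(mention["content"], "!=", repr(found))
```

If most mentions fail with one separator but pass with another (for example `sep=" "` or `sep=""`), the problem is probably the joining convention rather than the annotations themselves.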
