Extracting entities inside an entity #32

Open
gihanpanapitiya opened this issue Aug 29, 2020 · 4 comments

gihanpanapitiya commented Aug 29, 2020

Does anyone know how to write a custom parser that extracts a named entity nested inside another entity?

For example, from the following sentence I want to extract 'boiling', which sits inside the prefix entity.

d = Sentence('Synthesis of 2,4,6-trinitrotoluene (3a).The procedure was followed to yield a pale yellow solid (boiling point 240 °C)')

This is my attempt to write the parser:

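# Imports assumed from the ChemDataExtractor 1.x module layout; adjust for your version.
import re

from chemdataextractor.doc import Sentence
from chemdataextractor.model import BaseModel, Compound, StringType, ListType, ModelType
from chemdataextractor.parse import R, I, W, Optional
from chemdataextractor.parse.actions import join, merge
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    prefix = StringType()
    name = StringType()

Compound.boiling_points = ListType(ModelType(BoilingPoint))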


# Raw strings keep the regex backslashes from being read as string escapes.
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling')(u'name') + I(u'point')).add_action(join)(u'prefix')
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(r'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')


class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()')),
                    prefix=first(result.xpath('./prefix/text()')),
                    name=first(result.xpath('./name/text()')),
                )
            ]
        )
        yield compound

Sentence.parsers = [BpParser()]

However, d.records.serialize() produces the following, with no 'name' field:

[{'boiling_points': [{'value': '240',
'units': '°C',
'prefix': 'boiling point'}]}]

@maddenfederico

All you have to do is tweak the xpath you use to access the result from the name element. Element results are returned as a tree with whatever you assign to root as the root and all the elements that form a part of root as child nodes, and so on.

So you would write name = first(result.xpath('./prefix/name/text()')), since name is a child of prefix
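To make the parent/child point concrete, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree. The tree is hand-built and hypothetical: it stands in for a parse result and is not produced by ChemDataExtractor. The helper first_text is also made up for illustration. It only shows why a path that skips the prefix level finds nothing.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for a parse result (assumption: the real tree
# comes from ChemDataExtractor and may be shaped differently).
bp = ET.Element("bp")
prefix = ET.SubElement(bp, "prefix")
ET.SubElement(prefix, "name").text = "boiling"
ET.SubElement(bp, "value").text = "240"
ET.SubElement(bp, "units").text = "°C"


def first_text(node, path):
    """Return the text of the first element matching path, or None."""
    found = node.find(path)
    return None if found is None else found.text


# 'name' is not a direct child of the root, so this lookup finds nothing.
direct = first_text(bp, "./name")
# Walking through 'prefix' first reaches the nested element.
nested = first_text(bp, "./prefix/name")
print(direct, nested)
```

This only holds if the nested element survives into the result tree; an action that rewrites the prefix node could remove it.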

@gihanpanapitiya

I tried that, but I am still getting the same output as before.

@maddenfederico

It might be the .add_action(join) then. That seems to merge all of the tokens and put them in a single node, so the nested name element no longer exists by the time interpret() runs. It may not be the best solution, but the first thing that comes to mind is to capture boiling and point as separate elements and then join them within interpret(). I'm actually curious, so I'm about to run my own tests.
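A rough sketch of that idea, again with plain xml.etree.ElementTree instead of ChemDataExtractor itself. Everything here is an assumption made for illustration: flatten_join only mimics what a join-style action appears to do, the 'point' tag is a hypothetical name for the second token, and interpret_prefix is a stand-in for the relevant part of interpret().

```python
import xml.etree.ElementTree as ET


def build_prefix():
    """Hypothetical prefix node with two separately named children."""
    prefix = ET.Element("prefix")
    ET.SubElement(prefix, "name").text = "boiling"
    ET.SubElement(prefix, "point").text = "point"  # tag name is an assumption
    return prefix


def flatten_join(parent):
    """Mimic a join-style action: collapse all text into one childless node."""
    texts = [node.text for node in parent.iter() if node.text]
    flat = ET.Element(parent.tag)
    flat.text = " ".join(texts)
    return flat


def interpret_prefix(prefix):
    """Read the nested name first, then join the pieces by hand."""
    name = prefix.findtext("./name")
    full = " ".join(child.text for child in prefix)
    return {"name": name, "prefix": full}


# After flattening, the text is intact but the 'name' child is gone.
joined = flatten_join(build_prefix())
# Keeping the children and joining late preserves both values.
record = interpret_prefix(build_prefix())
print(joined.text, joined.find("name"), record)
```

The trade-off is that the joining logic moves out of the grammar and into the interpretation step, which is a little more code but keeps the nested element addressable.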

@gihanpanapitiya

Thanks for the suggestion! I haven't worked with interpret(). I am going to start experimenting with it.
