Add Paragraph and Section Contexts #76
fonduer/parser/parser.py
Outdated
@@ -269,13 +283,117 @@ def _parse_figure_node(self, node, state):

 return state

-def _parse_sentence(self, node, state):
+def _parse_sentence(self, paragraph, node, text, field, state):
I'm not a big fan of how many arguments this function has. Any suggestions on simplifying this would be welcome. I don't want to have to alter the state for the parent, so I pass the Paragraph directly, and unfortunately parsing sentences uses the node and field.
Can we put text and field into state? Then, when we are done with _parse_sentence, we can remove those two.
Yes. Good idea.
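A minimal sketch of the suggestion above, assuming `state` is a plain dict (as the `state["document"]` lookups elsewhere in this diff suggest). The `"text"` and `"field"` keys, the function names, and the naive period split are hypothetical stand-ins, not Fonduer's actual code.

```python
def parse_sentence(paragraph, node, state):
    """Hypothetical stand-in for _parse_sentence: reads text/field from state."""
    text = state["text"]
    field = state["field"]
    # Naive period split, purely for illustration.
    return [
        {"paragraph": paragraph, "node": node, "field": field, "text": s.strip()}
        for s in text.split(".")
        if s.strip()
    ]


def parse_paragraph(paragraph, node, text, field, state):
    """Stash the per-call values in state, then remove them afterwards."""
    state["text"] = text
    state["field"] = field
    try:
        return parse_sentence(paragraph, node, state)
    finally:
        # Clean up so the temporary keys do not leak into later parsing steps.
        state.pop("text", None)
        state.pop("field", None)


state = {"document": "doc-1"}
sentences = parse_paragraph("p0", "node0", "First one. Second one.", "text", state)
```

The `try`/`finally` keeps the temporary keys from outliving the call even if sentence parsing raises, which is the main risk of routing arguments through shared state.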
<body>
<h1 id="sample-markdown">Sample Markdown</h1>
<p>This is some basic, sample markdown. Unlike the other markdown document, however, this document actually contains paragraphs of text. That is, larger amounts of text that are all present in a single HTML node like the one you are currently reading.</p>
Just added this document so that we have a real paragraph to test.
return self.__repr__() > other.__repr__()


class Row(Context):
Dropping Row/Col
👍
@@ -10,13 +10,15 @@ test: dev check

check:
isort -rc -c fonduer/
Even if we don't enforce flake8, add style checks for the tests.
👍
fonduer/parser/models/sentence.py
Outdated
self.document.name,
self.section.position,
self.paragraph.position,
self.sentence_num,
Can you use position everywhere instead of some position and some num?
Good idea.
fonduer/parser/parser.py
Outdated
document=state["document"],
# TODO: This just takes the one and only Section in a document
# and assigns it as the Table's parent.
section=state["document"].sections[0],
Why sections[0]?
Right now there is only one section in a document. This just grabs that one Section directly, rather than dealing with the logic of traversing up a node's parents.
fonduer/parser/parser.py
Outdated
parts["stable_id"] = stable_id
parts["document"] = state["document"]
parts["position"] = state["paragraph"]["idx"]
parts["section"] = state["document"].sections[0]
Same here. Why sections[0] instead of the current section from the parents?
To expand on this a little: a Paragraph can be in a Section, Table, or Cell, so we would need to add several lines of logic to navigate up to the Section.
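For reference, the "navigate up to the Section" logic could look roughly like the sketch below. It assumes every context keeps a single `parent` link; the classes and the `enclosing_section` helper are hypothetical illustrations, not Fonduer's actual models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical minimal context node with a single parent link."""
    name: str
    parent: Optional["Context"] = None


class Section(Context):
    """Top-level container directly under the document."""


class Table(Context):
    """Table nested in a Section."""


class Cell(Context):
    """Cell nested in a Table."""


class Paragraph(Context):
    """Paragraph nested in a Section, Table, or Cell."""


def enclosing_section(context):
    """Walk up the parent links until a Section is found (None if there is none)."""
    current = context
    while current is not None:
        if isinstance(current, Section):
            return current
        current = current.parent
    return None


section = Section("s0")
table = Table("t0", parent=section)
cell = Cell("c0", parent=table)
p_in_section = Paragraph("p0", parent=section)
p_in_table = Paragraph("p1", parent=table)
p_in_cell = Paragraph("p2", parent=cell)
```

With a helper like this, `sections[0]` would only be needed as a fallback, and the code would keep working once a document can hold more than one Section.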
state["sentence"]["abs_offset"],
abs_sentence_offset_end,
# Process the Paragraph
stable_id = "{}::{}:{}".format(
)
state["sentence"]["abs_offset"] = abs_sentence_offset_end
if self.structural:
context_node = node.getparent() if field == "tail" else node
I know this is old code. Just make sure the logic here is correct. Will the tail node be the sibling of the text node or not?
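On the tail question: in the ElementTree data model (which lxml also follows), an element's `tail` is the text that comes after its closing tag, so that text sits inside the element's parent rather than inside the element itself. The sketch below uses the standard library's `xml.etree.ElementTree` to illustrate this, on the assumption that the parser's nodes behave the same way. The `parent_of` map and the `context_node` function are illustrative only; lxml exposes `getparent()` directly.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>bold</b> trailing text</p>")
bold = root.find("b")

# The stdlib ElementTree has no getparent(), so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def context_node(node, field):
    """Mirror the diff's rule: tail text belongs to the parent's content."""
    return parent_of[node] if field == "tail" else node


text_context = context_node(bold, "text")  # the <b> element itself
tail_context = context_node(bold, "tail")  # the enclosing <p> element
```

Here `bold.text` is the content inside `<b>`, while `bold.tail` is the text that follows `</b>` inside `<p>`, which is why the parent is the structural context for tail text.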
@@ -103,6 +105,74 @@ def test_parse_md_details(caplog):
assert header.dep_labels == ["compound", "ROOT"]


def test_parse_md_paragraphs(caplog):
Add some checks for the hierarchy logic, e.g. that a given sentence's parent is the expected paragraph, section, etc.
Sentence can only have a Paragraph parent, which is tested.
Yes, I just want to check that the parser returns the correct answer (paragraph, section index).
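One possible shape for such a check is sketched below. The dataclasses are hypothetical stand-ins for the parsed contexts, the parser output is built by hand, and whether sentence positions restart per paragraph is an assumption made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Section:
    position: int


@dataclass
class Paragraph:
    position: int
    section: Section


@dataclass
class Sentence:
    position: int
    paragraph: Paragraph


def hierarchy_indices(sentences):
    """Return (section, paragraph, sentence) positions for each sentence."""
    return [
        (s.paragraph.section.position, s.paragraph.position, s.position)
        for s in sentences
    ]


# Hand-built stand-in for parser output: one section, two paragraphs.
section = Section(position=0)
first, second = Paragraph(0, section), Paragraph(1, section)
sentences = [Sentence(0, first), Sentence(1, first), Sentence(2, second)]

indices = hierarchy_indices(sentences)
```

In a real test, `sentences` would come from the parsed document, and `indices` would be compared against the expected tuples for the sample markdown file.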
LGTM.