Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Paragraph and Section Contexts #76

Merged
merged 6 commits into from
Jul 18, 2018
Merged

Add Paragraph and Section Contexts #76

merged 6 commits into from
Jul 18, 2018

Conversation

lukehsiao
Copy link
Contributor

No description provided.

@lukehsiao lukehsiao added the enhancement New feature or request label Jul 18, 2018
@lukehsiao lukehsiao added this to the v0.2.0 milestone Jul 18, 2018
@lukehsiao lukehsiao self-assigned this Jul 18, 2018
@lukehsiao lukehsiao requested a review from senwu July 18, 2018 10:20
@@ -269,13 +283,117 @@ def _parse_figure_node(self, node, state):

return state

def _parse_sentence(self, node, state):
def _parse_sentence(self, paragraph, node, text, field, state):
Copy link
Contributor Author

@lukehsiao lukehsiao Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of how many arguments this function has. Any suggestions on simplifying this would be welcome. I don't want to have to alter the state for the parent so I pass the Paragraph directly, and unfortunately parsing sentences uses the node, and field.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put text and field into state? Then when we done with _parse_sentence we can remove those two.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Good idea.


<body>
<h1 id="sample-markdown">Sample Markdown</h1>
<p>This is some basic, sample markdown. Unlike the other markdown document, however, this document actually contains paragraphs of text. That is, larger amounts of text that are all present in a single HTML node like the one you are currently reading.</p>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added this document so that we have a real paragraph to test.

return self.__repr__() > other.__repr__()


class Row(Context):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping Row/Col

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -10,13 +10,15 @@ test: dev check

check:
isort -rc -c fonduer/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we don't enforce flake8, add style checks for the tests.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -10,13 +10,15 @@ test: dev check

check:
isort -rc -c fonduer/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

self.document.name,
self.section.position,
self.paragraph.position,
self.sentence_num,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you all use position instead of some position and some num?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

document=state["document"],
# TODO: This just takes the one and only Section in a document
# and assigns it as the Table's parent.
section=state["document"].sections[0],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why sections[0]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now there is only one section in a document. This just grabs that one Section directly, rather than dealing with the logic of traversing up a node's parents.

parts["stable_id"] = stable_id
parts["document"] = state["document"]
parts["position"] = state["paragraph"]["idx"]
parts["section"] = state["document"].sections[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here. why sections[0] here instead of the current section from parents?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to expand on this a little. Paragraph can be in Section, Table, or Cell. Would need to add several lines of logic to navigate up to the Section.

state["sentence"]["abs_offset"],
abs_sentence_offset_end,
# Process the Paragraph
stable_id = "{}::{}:{}".format(

This comment was marked as resolved.

This comment was marked as resolved.

)
state["sentence"]["abs_offset"] = abs_sentence_offset_end
if self.structural:
context_node = node.getparent() if field == "tail" else node
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is old code. Just make sure the logic here is correct. Will the the tail node be the sibling of the text node or now?

@@ -269,13 +283,117 @@ def _parse_figure_node(self, node, state):

return state

def _parse_sentence(self, node, state):
def _parse_sentence(self, paragraph, node, text, field, state):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put text and field into state? Then when we done with _parse_sentence we can remove those two.

return self.__repr__() > other.__repr__()


class Row(Context):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -103,6 +105,74 @@ def test_parse_md_details(caplog):
assert header.dep_labels == ["compound", "ROOT"]


def test_parse_md_paragraphs(caplog):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some checks for logic like some sentence's parent is some paragraph, section etc..

Copy link
Contributor Author

@lukehsiao lukehsiao Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentence can only have a Paragraph parent, which is tested.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, just want to check the parser results the correct answer (paragraph, section index).

@senwu
Copy link
Collaborator

senwu commented Jul 18, 2018

LGTM.

@senwu senwu closed this Jul 18, 2018
@senwu senwu reopened this Jul 18, 2018
@senwu senwu merged commit 2b51778 into master Jul 18, 2018
@senwu senwu deleted the data-model branch July 18, 2018 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants