Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ptb.py (new file), convert.py (lines of code added inside) #8

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

matanboaz
Copy link

  • Added ptb.py file: conversion from ptb to UCCA (to xml)

  • In convert.py:

  1. I added the support to use *.mrg files to convert ptb to UCCA.
  2. I added from_ptb and to_ptb functions (to_ptb is not implemented yet)

- In convert.py:
1. I added the support to use *.mrg files to convert ptb to UCCA.
2. I added from_ptb and to_ptb functions (to_ptb is not implemented yet)
Copy link
Member

@danielhers danielhers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but please address the comments I left and then commit your changes and just push again. Your pull request will be updated automatically when you push.

Also, could you add an example mrg file to test_files and a test to tests? See examples from other formats there. Thanks!

self.terminals = []
self.pending_nodes = []
self.node_ids_with_children = set()
self.sentence_id = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is self.sentence_id used anywhere? If not, please remove it.

self.pending_nodes = []
self.node_ids_with_children = set()
self.sentence_id = 1
self.passage_id = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better initialize self.passage_id to None or remove it if it's not needed.

l1 = layer1.Layer1(p)
paragraph = 1

next(self.parse(stream))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems to parse the first expression from stream but does not assign it to anything. What is it for then? To initialize self.pending_nodes? Please document it appropriately.

next(self.parse(stream))

# add normal nodes
self.pending_nodes = list(reversed(self.pending_nodes))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for reversing the nodes? Please document appropriately.

while self.pending_nodes:
for i in reversed(range(len(self.pending_nodes))):
parent_id, edge_tag, node_id = self.pending_nodes[i]
parent = self.node_by_id.get(parent_id, -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use None instead of -1 here. Note that None is the default in the dict get method, so you can just call self.node_by_id.get(parent_id) without a second argument to get.


PUNC = [',', '.']
class Leaf:
def __init__(self, word, pos, parent_id, edge_tag = "Terminal", node_id = None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use the constant layer1.EdgeTags.Terminal


def is_punc(self, word):
if word in PUNC:
return "Punctuation"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use the constants from layer0.NodeTags

def __str__(self):
return '({} {})'.format(self.pos, self.word)

def is_punc(self, word):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method's name is a little misleading, since it implies that it returns True or False. Maybe rename it to get_node_tag or something. Also, I think it's better to use the PTB tag to determine if a token is punctuation. See the list of tags: https://www.clips.uantwerpen.be/pages/mbsp-tags

@@ -171,6 +171,34 @@ def to_sdp(passage, test=False, tree=False, mark_aux=False, **kwargs):
return SdpConverter(mark_aux=mark_aux, tree=tree).to_format(passage, test=test, format=kwargs.get("format"))


def from_ptb(lines, passage_id=None, return_original=False, **kwargs):
#def from_ptb(passage, passage_id=None, return_original=False, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented-out lines unless there's a reason to keep them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.



def to_ptb(passage, test=False, tree=False, **kwargs):
""" Convert from a Passage object to a string in Penn TreeBank mrg format (export)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not export, that's the name of another format.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, forgot to change it after copying it. I removed it.

@danielhers danielhers force-pushed the master branch 4 times, most recently from 4c69e1f to f1e55fc Compare June 2, 2020 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants