-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ptb.py (new file), convert.py (lines of code added inside) #8
base: master
Are you sure you want to change the base?
Conversation
- In convert.py: 1. I added the support to use *.mrg files to convert ptb to UCCA. 2. I added from_ptb and to_ptb functions (to_ptb is not implemented yet)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, but please address the comments I left and then commit your changes and just push again. Your pull request will be updated automatically when you push.
Also, could you add an example mrg file to test_files
and a test to tests
? See examples from other formats there. Thanks!
self.terminals = [] | ||
self.pending_nodes = [] | ||
self.node_ids_with_children = set() | ||
self.sentence_id = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is self.sentence_id
used anywhere? If not, please remove it.
self.pending_nodes = [] | ||
self.node_ids_with_children = set() | ||
self.sentence_id = 1 | ||
self.passage_id = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better initialize self.passage_id
to None
or remove it if it's not needed.
l1 = layer1.Layer1(p) | ||
paragraph = 1 | ||
|
||
next(self.parse(stream)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line seems to parse the first expression from stream
but does not assign it to anything. What is it for then? To initialize self.pending_nodes
? Please document it appropriately.
next(self.parse(stream)) | ||
|
||
# add normal nodes | ||
self.pending_nodes = list(reversed(self.pending_nodes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for reversing the nodes? Please document appropriately.
while self.pending_nodes: | ||
for i in reversed(range(len(self.pending_nodes))): | ||
parent_id, edge_tag, node_id = self.pending_nodes[i] | ||
parent = self.node_by_id.get(parent_id, -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to use None
instead of -1
here. Note that None
is the default in the dict get
method, so you can just call self.node_by_id.get(parent_id)
without a second argument to get
.
|
||
PUNC = [',', '.'] | ||
class Leaf: | ||
def __init__(self, word, pos, parent_id, edge_tag = "Terminal", node_id = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use the constant layer1.EdgeTags.Terminal
|
||
def is_punc(self, word): | ||
if word in PUNC: | ||
return "Punctuation" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use the constants from layer0.NodeTags
def __str__(self): | ||
return '({} {})'.format(self.pos, self.word) | ||
|
||
def is_punc(self, word): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method's name is a little misleading, since it implies that it returns True
or False
. Maybe rename it to get_node_tag
or something. Also, I think it's better to use the PTB tag to determine if a token is punctuation. See the list of tags: https://www.clips.uantwerpen.be/pages/mbsp-tags
@@ -171,6 +171,34 @@ def to_sdp(passage, test=False, tree=False, mark_aux=False, **kwargs): | |||
return SdpConverter(mark_aux=mark_aux, tree=tree).to_format(passage, test=test, format=kwargs.get("format")) | |||
|
|||
|
|||
def from_ptb(lines, passage_id=None, return_original=False, **kwargs): | |||
#def from_ptb(passage, passage_id=None, return_original=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented-out lines unless there's a reason to keep them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
|
||
|
||
def to_ptb(passage, test=False, tree=False, **kwargs): | ||
""" Convert from a Passage object to a string in Penn TreeBank mrg format (export) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not export
, that's the name of another format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, forgot to change it after copying it. I removed it.
4c69e1f
to
f1e55fc
Compare
Added ptb.py file: conversion from ptb to UCCA (to xml)
In convert.py: