ptb.py (new file), convert.py (lines of code added inside) #8

matanboaz · 2019-05-06T10:30:18Z

Added ptb.py: conversion from PTB to UCCA (output as XML).
In convert.py:

1. I added support for *.mrg files as input for converting PTB to UCCA.
2. I added the from_ptb and to_ptb functions (to_ptb is not implemented yet).


danielhers

Looks good overall, but please address the comments I left, then commit your changes and push again. The pull request will be updated automatically when you push.

Also, could you add an example mrg file to test_files and a test to tests? See examples from other formats there. Thanks!
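To make the request concrete: an mrg test file holds bracketed Penn Treebank trees. The block below is a minimal sketch, not the converter from this pull request. The sample sentence is made up, and `read_leaves` is a hypothetical helper that only pulls the (POS, word) pairs out of one tree, which is the kind of thing a test could compare against the converted passage's terminals.

```python
import re

# Made-up sample in Penn Treebank bracketed (mrg) style; not taken from the PR.
SAMPLE_MRG = "( (S (NP (DT The) (NN cat)) (VP (VBD sat)) (. .)) )"

# A leaf has the form "(TAG word)" with no nested parentheses inside it.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def read_leaves(tree):
    """Hypothetical helper: return the (pos, word) pairs of a bracketed tree, in order."""
    return LEAF.findall(tree)


leaves = read_leaves(SAMPLE_MRG)
print(leaves)
```

A real test would presumably call from_ptb on the file's lines instead; its exact return value is not shown in this thread, so that part is left out here.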

danielhers · 2019-05-06T12:12:01Z

semstr/conversion/ptb.py

+        self.terminals = []
+        self.pending_nodes = []
+        self.node_ids_with_children = set()
+        self.sentence_id = 1


Is self.sentence_id used anywhere? If not, please remove it.

danielhers · 2019-05-06T12:12:42Z

semstr/conversion/ptb.py

+        self.pending_nodes = []
+        self.node_ids_with_children = set()
+        self.sentence_id = 1
+        self.passage_id = 3


Better to initialize self.passage_id to None, or remove it if it's not needed.

danielhers · 2019-05-06T12:15:24Z

semstr/conversion/ptb.py

+        l1 = layer1.Layer1(p)
+        paragraph = 1
+
+        next(self.parse(stream))


This line seems to parse the first expression from stream but does not assign it to anything. What is it for then? To initialize self.pending_nodes? Please document it appropriately.
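One way such a call could be documented, assuming (this is a guess, not confirmed by the diff) that parse is a generator which fills self.pending_nodes as a side effect. The `Parser` class below is a hypothetical stand-in, not the class from ptb.py:

```python
class Parser:
    """Hypothetical stand-in for the converter class; not the code from ptb.py."""

    def __init__(self):
        self.pending_nodes = []

    def parse(self, stream):
        """Yield one expression per line; as a side effect, record it in pending_nodes."""
        for line in stream:
            expression = line.strip()
            self.pending_nodes.append(expression)
            yield expression


parser = Parser()
# Advance the generator once, only for its side effect on parser.pending_nodes.
# The yielded value is deliberately discarded.
next(parser.parse(["(S (NP x))\n", "(S (VP y))\n"]))
print(parser.pending_nodes)
```

If the side effect is the only purpose, a comment like the one above the next call makes that explicit to the reader.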

danielhers · 2019-05-06T12:15:53Z

semstr/conversion/ptb.py

+        next(self.parse(stream))
+
+        # add normal nodes
+        self.pending_nodes = list(reversed(self.pending_nodes))


What is the reason for reversing the nodes? Please document appropriately.

danielhers · 2019-05-06T12:17:52Z

semstr/conversion/ptb.py

+        while self.pending_nodes:
+            for i in reversed(range(len(self.pending_nodes))):
+                parent_id, edge_tag, node_id = self.pending_nodes[i]
+                parent = self.node_by_id.get(parent_id, -1)


I think it's better to use None instead of -1 here. Note that None is the default in the dict get method, so you can just call self.node_by_id.get(parent_id) without a second argument to get.
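A small sketch of the suggestion, using a plain dict as a stand-in for self.node_by_id (the keys and values here are made up):

```python
# Stand-in for self.node_by_id; keys and values are invented for illustration.
node_by_id = {"1.1": "root"}

# dict.get returns None for a missing key when no default is given.
missing = node_by_id.get("9.9")
found = node_by_id.get("1.1")

# An explicit identity check distinguishes "not found" from any real node.
if missing is None:
    print("parent not created yet")
print(found)
```

One practical difference: -1 is truthy in Python, so a check such as `if parent:` would treat the sentinel as a valid node, while None fails that check.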

danielhers · 2019-05-06T12:33:17Z

semstr/conversion/ptb.py

+
+PUNC = [',', '.']
+class Leaf:
+    def __init__(self, word, pos, parent_id, edge_tag = "Terminal", node_id = None):


Maybe use the constant layer1.EdgeTags.Terminal instead of the string literal.

danielhers · 2019-05-06T12:33:22Z

semstr/conversion/ptb.py

+
+    def is_punc(self, word):
+        if word in PUNC:
+            return "Punctuation"


Maybe use the constants from layer0.NodeTags instead of the string literal.

danielhers · 2019-05-06T12:37:55Z

semstr/conversion/ptb.py

+    def __str__(self):
+        return '({} {})'.format(self.pos, self.word)
+
+    def is_punc(self, word):


This method's name is a little misleading, since it implies that it returns True or False. Maybe rename it to get_node_tag or something. Also, I think it's better to use the PTB tag to determine if a token is punctuation. See the list of tags: https://www.clips.uantwerpen.be/pages/mbsp-tags
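A rough sketch of what the renamed method could look like when it decides by POS tag rather than by word. The tag set below is an assumption and not exhaustive, and the two string constants are stand-ins for the values in layer0.NodeTags so that the block runs without the ucca package:

```python
# Assumed, non-exhaustive set of PTB POS tags that mark punctuation.
PUNCT_POS_TAGS = frozenset({",", ".", ":", "``", "''", "-LRB-", "-RRB-"})

# Stand-ins for the layer0.NodeTags constants (assumed values).
PUNCT_TAG = "Punctuation"
WORD_TAG = "Word"


def get_node_tag(pos):
    """Return the layer-0 node tag for a token, based on its PTB POS tag."""
    return PUNCT_TAG if pos in PUNCT_POS_TAGS else WORD_TAG


print(get_node_tag("."))   # POS tag of a sentence-final period
print(get_node_tag("NN"))  # POS tag of a singular noun
```

Deciding by tag also covers tokens such as brackets, which a word list like `[',', '.']` would miss.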

danielhers · 2019-05-06T12:39:01Z

semstr/convert.py

@@ -171,6 +171,34 @@ def to_sdp(passage, test=False, tree=False, mark_aux=False, **kwargs):
    return SdpConverter(mark_aux=mark_aux, tree=tree).to_format(passage, test=test, format=kwargs.get("format"))


+def from_ptb(lines, passage_id=None, return_original=False, **kwargs):
+#def from_ptb(passage, passage_id=None, return_original=False, **kwargs):


Remove commented-out lines unless there's a reason to keep them.

danielhers · 2019-05-06T12:39:45Z

semstr/convert.py

+
+
+def to_ptb(passage, test=False, tree=False, **kwargs):
+    """ Convert from a Passage object to a string in Penn TreeBank mrg format (export)


It's not export, that's the name of another format.

matanboaz · 2019-05-12

My bad, I forgot to change it after copying it. I removed it.

- Added ptb.py file: conversion from ptb to UCCA (to xml)

d48eb3d

- In convert.py: 1. I added the support to use *.mrg files to convert ptb to UCCA. 2. I added from_ptb and to_ptb functions (to_ptb is not implemented yet)

danielhers requested changes May 6, 2019


danielhers force-pushed the master branch 4 times, most recently from 4c69e1f to f1e55fc, on June 2, 2020 12:36
