To make this process as transparent as possible, here is some of the (relevant) reviewer feedback from the LREC submission process.
Feedback relevant only to the paper and its writing is not listed here, but feel free to comment if any points need clarification.
(R1) "I do not find their reasons for doing so (“avoid encoding errors”) convincing: They could simply use a character encoding defined in the Unicode standard, e.g. UTF-8." Comment: This is a fair point, but the main problem was that Klexikon does not use non-Latin characters. This means that cities like "Århus" will never appear as such, and instead have "Aarhus" in Klexikon. Unfortunately, Python does not have any sufficient libraries for dealing with this, as it would additionally turn the German Umlauts (Ä, Ö, Ü) into (A, O, U), which is an incorrect transformation that likely would happen more frequently than other non-Latin characters. Further, and I'm not sure if this is explained sufficiently well, I have made sure to replace the topmost-occurring characters in a manual "translation table" to ensure the correct treatment of most of the letter characters at least (or merging " '' etc.)
(R2) "It is clear why the authors chose to disregard Wikipedia articles with less than 15 paragraphs given their specific goal, however, the dataset would be useful to a much wider audience (e.g. researchers interested in TS only) if all Wikipedia-Klexiko alignments were kept regardless of the posterior case-specific filtering by length" Comment: This is actually a good idea for a raw corpus. The original reasoning is based on the fact that shorter articles were mostly uninformative in my personal opinion. Since we remove the list-like elements, it creates a certain bias towards texts that only contain the descriptor of a subsequent list, which generally is not very applicable. Other examples included biology-related articles, where generally the article would consist only of several explanations of sub-species, without actual content information. However, a raw corpus could be re-crawled either way with the most recent number of articles. Then again, this would require re-matching all ambiguous articles, which is a bit more time-consuming.
(R2) "It would be very informative to give some statistical information in 4.1.1., i.e. what was the starting number of documents in Klexikon and how was it affected by each of the steps 1 to 4 to get to the final 2,898 documents." Comment: I'll have to see if I can produce that information again, but would be a nice addition.
(R3) "I wonder why the authors only consider lead-3, lead-k and the full input text as baseline sets, and do not attempt a simple extractive summarization algorithm such as the classic Luhn algorithm to gain a more reliable set" Comment: Actually a good idea. I have some intermediate results that go towards that direction, so it should be fairly easy to generate these results.
In case any of the reviewers read this: thanks for your constructive feedback; I genuinely appreciated the helpful comments!