What about Dates? #113
Comments
I would say that in "17 November 2014", "November" is NOUN and "17" and "2014" are NUM. In "17. 11. 2014", the numbers are NUM and the dots are PUNCT. Alternatively, one could consider "17" an ordinal numeral, in which case it would become an ADJ with the feature NumType=Ord.
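To make the proposal above concrete, here is a minimal Python sketch that writes the three analyses down as data. The `(form, upos, feats)` tuple layout and the helper `to_rows` are assumptions made for this illustration only; the rows it prints are a simplified four-column view, not full CoNLL-U.

```python
# Illustrative only: the analyses proposed in the comment above, as plain data.
# The (FORM, UPOS, FEATS) layout is an assumption of this sketch.

# "17 November 2014": month as NOUN, numbers as NUM
DATE_WORDS = [
    ("17", "NUM", "_"),
    ("November", "NOUN", "_"),
    ("2014", "NUM", "_"),
]

# "17. 11. 2014": numbers as NUM, dots as PUNCT
DATE_NUMERIC = [
    ("17", "NUM", "_"),
    (".", "PUNCT", "_"),
    ("11", "NUM", "_"),
    (".", "PUNCT", "_"),
    ("2014", "NUM", "_"),
]

# Alternative reading: the day is an ordinal, so ADJ with NumType=Ord
DATE_ORDINAL = [
    ("17", "ADJ", "NumType=Ord"),
    ("November", "NOUN", "_"),
    ("2014", "NUM", "_"),
]


def to_rows(tokens):
    """Render tokens as simplified tab-separated rows: ID, FORM, UPOS, FEATS."""
    return [
        "\t".join([str(i), form, upos, feats])
        for i, (form, upos, feats) in enumerate(tokens, start=1)
    ]


if __name__ == "__main__":
    for name, toks in [("words", DATE_WORDS), ("numeric", DATE_NUMERIC), ("ordinal", DATE_ORDINAL)]:
        print(f"# {name}")
        print("\n".join(to_rows(toks)))
```

None of the three variants needs a dedicated date tag; the date reading is left to later processing.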
BTW, if you have a link to a source describing the EAGLES standard, I'd love to know about it. In the past, I tried several times to learn more about EAGLES but I found it tough to find resources online.
Yes, that annotation is logical. I was just wondering: since some standards annotate dates with a PoS tag of their own, it may be worth considering adding one, rather than postponing date detection to a later stage of the pipeline. A couple of links on EAGLES: The official website: http://www.ilc.cnr.it/EAGLES96/home.html The official information is hard to follow, though.
Thanks, @rcostu! So it looks like the Freeling analyzer generally tries to follow the EAGLES approach to annotation, but the "W" category is their extension of the EAGLES standard, unless I am missing something. It does not appear on the EAGLES site here: http://www.ilc.cnr.it/EAGLES96/annotate/node20.html#SECTION00063000000000000000 It is true that some corpora have a special tag for date/time expressions. (E.g. AnCora (Catalan and Spanish) has a "w" tag :-)) I personally am not much in favor of that, since a date is always a compound expression made of "normal" words that have their own morphological and syntactic properties. Dates can occur in various structures and they can be incomplete ("Jan 6" vs. "6 January 2005"), so I think it is better to capture the relations and leave the interpretation to a specialized module. But apparently that is not the only possible way to do it.
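As a rough illustration of the "specialized module" idea, the following Python sketch looks for date-like spans in tokens that already carry ordinary UPOS tags. Everything in it is hypothetical: the function name `find_date_spans`, the small English month lexicon, and the decision to accept months tagged either NOUN or PROPN are assumptions for this example, not part of UD or of any existing tool.

```python
# Hypothetical post-processing step: detect date-like spans from normal
# UPOS tags plus a month lexicon, without a dedicated DATE tag.

# Assumed English month lexicon (full names and common abbreviations).
MONTHS = {
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
    "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec",
}


def find_date_spans(tagged):
    """Return (start, end) index pairs, end exclusive, of date-like spans.

    `tagged` is a list of (form, upos) pairs. A span is a month word,
    optionally preceded by one digit token and followed by one digit token,
    which covers both "Jan 6" and "6 January 2005".
    """
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        form, upos = tagged[i]
        if upos in {"NOUN", "PROPN"} and form.lower() in MONTHS:
            start, end = i, i + 1
            # Optional number before the month (NUM, or ADJ for an ordinal day).
            if i > 0 and tagged[i - 1][1] in {"NUM", "ADJ"} and tagged[i - 1][0].isdigit():
                start = i - 1
            # Optional number after the month (a day or a year).
            if end < n and tagged[end][1] == "NUM" and tagged[end][0].isdigit():
                end += 1
            spans.append((start, end))
            i = end
        else:
            i += 1
    return spans


if __name__ == "__main__":
    sentence = [("Signed", "VERB"), ("on", "ADP"), ("6", "NUM"),
                ("January", "NOUN"), ("2005", "NUM"), (".", "PUNCT")]
    print(find_date_spans(sentence))
```

A real module would more likely work on the dependency relations than on flat token patterns; this sketch only shows that the date reading can be recovered after tagging.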
+1 for "normal" POS tags for the words that make up dates. (Also, UD v1, including the POS tags, is frozen until at least Oct 2015.) On a related note, I would argue that in languages using the period as the ordinal indicator (such as Finnish, see e.g. http://en.wikipedia.org/wiki/Ordinal_indicator#Finnish) the period is part of the token and the analysis |
Yes, @dan-zeman. It does seem that the Freeling team decided to add it themselves. In fact, they use the AnCora corpus, which is developed by the same university. I was asking just in case it had been considered, or in case any other language uses similar tags to mark this information. I also support using normal POS tags for each word in a date, with further processing done just to extract the knowledge that it is a date and so on. OT: Is there any Spanish contribution to this project yet, or am I the first one?
Great, I believe the original issue is then resolved, closing.
@rcostu: Yes, I believe you are the first to work on UD for Spanish. You can have a look at the "stanfordized" AnCora we have in HamleDT 2.0, but it predates UD and it is an automatic conversion only. For POS tags and morphosyntactic features, you can have a look at the tagset conversion table that I uploaded here: http://universaldependencies.github.io/docs/tagset-conversion/es-conll2009-uposf.html (also an automatic approximation, see the disclaimer; "w" is one of the tags for which it did not do a good job).
@dan-zeman Thanks for the info. I will look at them, as I am working on getting corpora such as AnCora to work with UD and on getting a proper, standard constituency-to-dependency conversion. I have made my own conversion from EAGLES to UD, and I will contribute the list when I can. I am also interested in contributing to the Spanish conversion and documentation of UD. I could review that automatic conversion to get a proper CoNLL-2009 -> UD conversion too. What is the best way to contribute? Forking and sending pull requests?
If you're interested in contributing an entire treebank (or several), I'd suggest first proposing this so that current project members have an opportunity to comment (to avoid overlap etc.). You might want to open a separate issue for this or just contact people by email (comments on this issue are unlikely to be widely read).
Hi @rcostu and all, I worked with Christopher Manning this summer on Spanish NLP, and as a side effect of some of that work I produced some documentation for Spanish relations: https://github.com/UniversalDependencies/docs/tree/spanish/_es Unfortunately, I don't have the bandwidth to continue this at the moment, but I figure I should mention it somewhere so the docs don't get lost. Most of the examples are short excerpts from AnCora. We were thinking it might be possible to produce a reliable UD Spanish corpus by combining the HamleDT output (which is lossy and at times very incorrect) with the original AnCora dependency treebank.
Hi,
I have been working lately on part-of-speech tagging in Spanish, and we tend to follow the EAGLES standard, which uses the tag "w" to mark dates.
What about using a special tag to mark dates, such as DATE?
If not, how are they supposed to be tagged within this new standard?