What about Dates? #113

Closed
rcostu opened this issue Nov 17, 2014 · 11 comments

Comments

@rcostu
rcostu commented Nov 17, 2014

Hi,

I have been working lately on Part-of-Speech tagging in Spanish and we tend to follow the EAGLES standard which uses tag "w" to define dates.

What about using any special tag to define dates? Such as DATE?

If not, how are they supposed to be tagged within this new standard?

@dan-zeman
Member

I would say that in "17 November 2014", "November" is NOUN and "17" and "2014" are NUM. In "17. 11. 2014", the numbers are NUM and the dots are PUNCT. Alternatively, one could consider "17" an ordinal numeral, in which case it would become an ADJ with the feature NumType=Ord.
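
A minimal sketch of the analysis described above, assuming pre-tokenized input and treating the first numeric token as the day. `MONTHS` and `tag_date_tokens` are hypothetical names for illustration, not part of any UD tool:

```python
# Illustrative sketch only; MONTHS and tag_date_tokens are hypothetical names.
MONTHS = {
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
}


def tag_date_tokens(tokens, ordinal_day=False):
    """Assign (form, UPOS, features) to each token of a date expression."""
    tagged = []
    day_seen = False  # assumption: the first numeric token is the day
    for tok in tokens:
        if tok.lower() in MONTHS:
            tagged.append((tok, "NOUN", "_"))
        elif tok.isdigit():
            if ordinal_day and not day_seen:
                # Alternative analysis: the day is an ordinal numeral.
                tagged.append((tok, "ADJ", "NumType=Ord"))
            else:
                tagged.append((tok, "NUM", "_"))
            day_seen = True
        elif tok == ".":
            tagged.append((tok, "PUNCT", "_"))
        else:
            # Anything this sketch does not cover.
            tagged.append((tok, "X", "_"))
    return tagged


for row in tag_date_tokens(["17", "November", "2014"]):
    print("\t".join(row))
```

Passing `ordinal_day=True` switches the day to the ADJ analysis with `NumType=Ord` and leaves the year as NUM.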

@dan-zeman
Member

BTW, if you have a link to a source describing the EAGLES standard, I'd love to know about it. In the past, I tried several times to learn more about EAGLES but I found it tough to find resources on-line.

@rcostu
Author

rcostu commented Nov 17, 2014

Yes, that annotation is logical. I was just wondering whether, since some standards annotate dates with a dedicated PoS tag, it might be worth adding one rather than deferring date detection to a later stage of the pipeline.

A couple of links of EAGLES:

The official website: http://www.ilc.cnr.it/EAGLES96/home.html
Spanish tagset of EAGLES: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html

The official information is hard to come by, though.

@dan-zeman
Member

Thanks, @rcostu ! So it looks like the Freeling analyzer tries to generally follow the EAGLES approach to annotation, but the "W" category is their extension over the EAGLES standard, unless I am missing something. It does not appear at the EAGLES site here:

http://www.ilc.cnr.it/EAGLES96/annotate/node20.html#SECTION00063000000000000000
http://www.ilc.cnr.it/EAGLES96/annotate/node16.html#cmobli

It is true that some corpora have a special tag for date/time expressions. (E.g. AnCora (Catalan and Spanish) has a "w" tag :-)) I personally am not much in favor of that, since a date is always a compound expression made of "normal" words that have their own morphological and syntactic properties. Dates can occur in various structures and they can be incomplete ("Jan 6" vs. "6 January 2005"), so I think it is better to capture the relations and leave the interpretation to a specialized module. But apparently that is not the only way to do it.
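
To illustrate what such a specialized module might look like, here is a rough sketch that finds date-like spans in already tagged text, covering both complete and incomplete dates. The function name, the month list and the span heuristic are assumptions made for illustration only:

```python
# Illustrative sketch only: a downstream date spotter over (form, UPOS) pairs.
_MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Full names plus three-letter abbreviations such as "jan".
MONTHS = set(_MONTH_NAMES) | {name[:3] for name in _MONTH_NAMES}


def find_date_spans(tagged):
    """Return (start, end) index pairs, end exclusive, of date-like runs."""
    spans = []
    i, n = 0, len(tagged)
    while i < n:
        form, upos = tagged[i]
        if upos == "NOUN" and form.lower().rstrip(".") in MONTHS:
            start, end = i, i + 1
            # A numeral directly before the month is taken as the day.
            if start > 0 and tagged[start - 1][1] == "NUM":
                start -= 1
            # Up to two numerals after the month (day and/or year).
            while end < n and end - i <= 2 and tagged[end][1] == "NUM":
                end += 1
            spans.append((start, end))
            i = end
        else:
            i += 1
    return spans


sentence = [("on", "ADP"), ("6", "NUM"), ("January", "NOUN"), ("2005", "NUM")]
print(find_date_spans(sentence))
```

The tagger itself stays unaware of dates; only this later step decides that a run of NUM and NOUN tokens forms one.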

@spyysalo
Member

+1 for "normal" POS tags for the words that make up dates. (Also, UD v1, including the POS tags, is frozen until at least Oct 2015.)

On a related note, I would argue that in languages using the period as the ordinal indicator (such as Finnish, see e.g. http://en.wikipedia.org/wiki/Ordinal_indicator#Finnish) the period is part of the token and the analysis ADJ[NumType=Ord] is most appropriate for e.g. 5. in 5. maaliskuuta "5th of March".
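
A small sketch of that analysis, assuming the tokenizer has already kept the period attached to the digits. `tag_numeral` is a hypothetical helper, and it ignores the ambiguity with a sentence-final cardinal followed by a full stop:

```python
import re

# Digits followed by a period, e.g. "5." as written in Finnish dates.
ORDINAL_TOKEN = re.compile(r"\d+\.")


def tag_numeral(token):
    """Return (UPOS, features) for a numeric token, or None otherwise."""
    if ORDINAL_TOKEN.fullmatch(token):
        # The period is the ordinal indicator and stays inside the token.
        return ("ADJ", "NumType=Ord")
    if token.isdigit():
        return ("NUM", "NumType=Card")
    return None


print(tag_numeral("5."))
print(tag_numeral("2014"))
```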

@rcostu
Author

rcostu commented Nov 20, 2014

Yes, @dan-zeman. It does seem that the Freeling team decided to add it themselves. In fact, they use the AnCora corpus, which is developed by the same university.

I was asking just in case it had been considered, or in case other languages use similar tags for this information.

I also support using normal POS tags for each word in a date, with further processing done afterwards to recognize that the expression is a date.

OT: Is there any Spanish contribution to this project, or am I the first one?

@spyysalo
Member

Great, I believe the original issue is then resolved, closing.

@dan-zeman
Member

@rcostu: Yes, I believe you are the first to work on UD for Spanish. You can have a look at the "stanfordized" AnCora we have in HamleDT 2.0, but it predates UD and it is an automatic conversion only. For POS tags and morphosyntactic features, you can have a look at the tagset conversion table that I uploaded here: http://universaldependencies.github.io/docs/tagset-conversion/es-conll2009-uposf.html (also an automatic approximation, see the disclaimer; "w" is one of the tags for which it did not do a good job).
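
For orientation, a coarse sketch of what a first-letter mapping from EAGLES-style Spanish tags to UD v1 POS tags could look like. The mapping below is an assumption based on the category letters of the Freeling tagset linked earlier in the thread. It is not the linked conversion table, and it deliberately leaves "W" unmapped:

```python
# Illustrative sketch only: coarse first-letter mapping, not the linked table.
EAGLES_TO_UPOS = {
    "A": "ADJ",    # adjective
    "R": "ADV",    # adverb
    "D": "DET",    # determiner
    "N": "NOUN",   # noun (proper nouns handled below)
    "V": "VERB",   # verb
    "P": "PRON",   # pronoun
    "C": "CONJ",   # conjunction (subordinating handled below)
    "I": "INTJ",   # interjection
    "S": "ADP",    # adposition
    "F": "PUNCT",  # punctuation
    "Z": "NUM",    # numeral
}


def coarse_upos(tag):
    """Map an EAGLES-style tag to a UPOS tag, or None if there is no match."""
    if not tag:
        return None
    tag = tag.upper()
    if tag.startswith("NP"):
        return "PROPN"
    if tag.startswith("CS"):
        return "SCONJ"
    # "W" (date) has no single UPOS equivalent: its words need retagging.
    return EAGLES_TO_UPOS.get(tag[0])


print(coarse_upos("NCMS000"), coarse_upos("W"))
```

A `None` result for "W" signals that the date expression has to be split into ordinary tokens and tagged word by word.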

@rcostu
Author

rcostu commented Nov 20, 2014

@dan-zeman Thanks for the info. I will look at them, as I am working on getting corpora such as AnCora to work with UD and on a proper, standard constituency-to-dependency conversion.

I have made my own conversion from EAGLES to UD, and I will contribute the list when I can. I am also interested in contributing to the Spanish conversion and documentation of UD.

I could also review that automatic conversion to get a proper CONLL09 -> UD conversion.

What is the best way to contribute? Forking and pull-requesting?

@spyysalo
Member

> What is the best way to contribute? Forking and pull-requesting?

If you're interested in contributing an entire treebank (or several), I'd suggest first proposing this so that current project members have an opportunity to comment (to avoid overlap etc.). You might want to open a separate issue for this or just contact people by email (comments on this issue are unlikely to be widely read).

@hans
Contributor

hans commented Dec 21, 2014

Hi @rcostu and all — I worked with Christopher Manning this summer on Spanish NLP, and as a side-effect of some of the work I produced some documentation for Spanish relations: https://github.com/UniversalDependencies/docs/tree/spanish/_es

I unfortunately don't have the bandwidth to continue this at the moment, but I figure I should mention it somewhere so the docs don't get lost.

Most of the examples are short excerpts from AnCora. We were thinking it might be possible to produce a reliable UD Spanish corpus by synthesizing the HamleDT output (which is lossy and at times very incorrect) with the original AnCora dependency treebank.
