Feature request: Rule-based Annotator for Cardinal/Ordinal #1429

Open
alaindesilets opened this issue Nov 21, 2019 · 15 comments

@alaindesilets
Contributor

While DKPro has UIMA types for Cardinal and Ordinal, it seems there are no annotators that can produce them.

So I implemented my own CardOrdAnnotator for English based on the Stanford NLP QuantifiableEntityNormalizer class.

If you are interested, I could roll that into dkpro-core-api-ner-asl, or whatever module you think is appropriate.

I attach the classes and tests that I wrote for that. Note that you won't be able to run them as they use some utilities that I wrote for myself, but it should give you an idea of how they work.

Basically, the annotator uses a class CardOrdParser, which I wrote based on QuantifiableEntityNormalizer. This means that the annotator would have to be GPLed.

Note that at the moment the parser is only available for English, but it would probably be relatively easy to implement it for other languages. To do that, however, we would have to rewrite (or extend) QuantifiableEntityNormalizer, because its current implementation uses static variables to store the words for cardinals and ordinals (e.g. "first", "one", etc.). As a result, you cannot have different instances of QuantifiableEntityNormalizer for different languages. I guess we could rewrite QuantifiableEntityNormalizer altogether (using its code as "inspiration"). I am not sure whether that would be sufficient to remove the GPL constraint on CardOrdParser.
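To make the static-variable point concrete, here is a minimal sketch of what an instance-based word table could look like. It is not the attached CardOrdParser and not CoreNLP code: the class `NumberWordLexicon`, its `Kind` enum, the factory methods and the tiny word lists are all hypothetical names invented for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: each instance owns its word lists, so languages can coexist. */
final class NumberWordLexicon {
    enum Kind { CARDINAL, ORDINAL, NONE }

    private final Locale locale;
    private final Map<String, Long> cardinals;
    private final Map<String, Long> ordinals;

    NumberWordLexicon(Locale locale, Map<String, Long> cardinals, Map<String, Long> ordinals) {
        this.locale = locale;
        // Defensive copies keep one instance's tables independent of another's.
        this.cardinals = Collections.unmodifiableMap(new HashMap<>(cardinals));
        this.ordinals = Collections.unmodifiableMap(new HashMap<>(ordinals));
    }

    /** Classifies a single token by looking it up in this instance's tables. */
    Kind classify(String token) {
        String key = token.toLowerCase(locale);
        if (cardinals.containsKey(key)) {
            return Kind.CARDINAL;
        }
        if (ordinals.containsKey(key)) {
            return Kind.ORDINAL;
        }
        return Kind.NONE;
    }

    /** Tiny illustrative English tables; a real lexicon would be far larger. */
    static NumberWordLexicon english() {
        Map<String, Long> card = new HashMap<>();
        card.put("one", 1L);
        card.put("two", 2L);
        Map<String, Long> ord = new HashMap<>();
        ord.put("first", 1L);
        ord.put("second", 2L);
        return new NumberWordLexicon(Locale.ENGLISH, card, ord);
    }

    /** Tiny illustrative French tables, only to show a second instance side by side. */
    static NumberWordLexicon french() {
        Map<String, Long> card = new HashMap<>();
        card.put("un", 1L);
        card.put("deux", 2L);
        Map<String, Long> ord = new HashMap<>();
        ord.put("premier", 1L);
        ord.put("second", 2L);
        return new NumberWordLexicon(Locale.FRENCH, card, ord);
    }
}
```

With the tables held in instance fields, `NumberWordLexicon.english()` and `NumberWordLexicon.french()` can be used in the same JVM without interfering with each other, which is what static word lists rule out.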

Let me know if you are interested.

CardOrdAnnotator_files.zip

@reckart
Member

reckart commented Nov 21, 2019

Is GPL vs ASL an issue for you?

If you run a compatible POS tagger before the CoreNlpNamedEntityRecognizer (i.e. the CoreNlpPosTagger), then you can also get e.g. ORDINAL tags:

    @Test
    public void thatOrdinalNumbersAreRecognized() throws Exception
    {
        JCas jcas = runTest("en", "John made the second place in the run .");
        
        String[] ne = {
                "[  0,  4]Person(PERSON) (John)",
                "[ 14, 20]NamedEntity(ORDINAL) (second)" };

        AssertAnnotations.assertNamedEntity(ne, select(jcas, NamedEntity.class));
    }

    private JCas runTest(String language, String testDocument)
        throws Exception
    {
        AnalysisEngineDescription engine = createEngineDescription(
                createEngineDescription(CoreNlpPosTagger.class),
                createEngineDescription(CoreNlpNamedEntityRecognizer.class));

        return TestRunner.runTest(engine, language, testDocument);
    }

@alaindesilets
Contributor Author

alaindesilets commented Nov 21, 2019 via email

@reckart
Member

reckart commented Nov 21, 2019

        JCas jcas = runTest("en", "John bought one hundred laptops .");
        
        String[] ne = {
                "[  0,  4]Person(PERSON) (John)",
                "[ 12, 15]NamedEntity(NUMBER) (one)",
                "[ 16, 23]NamedEntity(NUMBER) (hundred)" };

Looks like they are simply tagged as NUMBER. I'm not sure if CARDINAL is even produced by CoreNLP - in the code, references to it never seem to be assignments.

@alaindesilets
Contributor Author

alaindesilets commented Nov 21, 2019 via email

@zesch
Member

zesch commented Nov 22, 2019 via email

@reckart
Member

reckart commented Nov 22, 2019

I have committed the extended test, which combines the POS tagger and the NER from CoreNLP, here:

https://github.com/dkpro/dkpro-core/blob/4f8d74fdb003c90fdef8ccff7039a799ab471699/dkpro-core-corenlp-gpl/src/test/java/org/dkpro/core/corenlp/CoreNlpPosTaggerAndNamedEntityRecognizerTest.java

Feel free to play around with it and test additional number types (durations, percentages, etc.).

If I saw it correctly, the components you implemented work without requiring POS tags. If you would like to contribute them, it would be best if you create a PR. Since the classes depend on CoreNLP, the CoreNLP (GPL) module would be the best place for them.

I saw in the CoreNLP code that there is also some support for normalizing quantities. If normalization is also something you are after, we might consider extending the DKPro Core type system with a way of storing such normalizations and of transferring them out of components such as CoreNLP which produce them.
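For readers unsure what "normalizing" would mean here, a rough sketch follows: it turns a run of English number words such as "one hundred" into the value 100. This is an assumption-laden illustration, not CoreNLP's normalizer. The class `CardinalNormalizer` and its word tables are hypothetical, English-only for brevity, and cover only a handful of words.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of cardinal normalization; not the CoreNLP implementation. */
final class CardinalNormalizer {
    private static final Map<String, Long> UNITS = new HashMap<>();
    private static final Map<String, Long> SCALES = new HashMap<>();

    static {
        // Deliberately tiny tables, just enough for the examples below.
        UNITS.put("one", 1L);
        UNITS.put("two", 2L);
        UNITS.put("three", 3L);
        UNITS.put("nineteen", 19L);
        UNITS.put("twenty", 20L);
        SCALES.put("thousand", 1_000L);
        SCALES.put("million", 1_000_000L);
    }

    /** Combines number words left to right into a single numeric value. */
    static long normalize(List<String> tokens) {
        long total = 0;    // value of completed groups (thousands, millions)
        long current = 0;  // value of the group being read
        for (String token : tokens) {
            String word = token.toLowerCase(Locale.ROOT);
            if (UNITS.containsKey(word)) {
                current += UNITS.get(word);
            } else if (word.equals("hundred")) {
                // A bare "hundred" counts as "one hundred".
                current = Math.max(current, 1) * 100;
            } else if (SCALES.containsKey(word)) {
                total += Math.max(current, 1) * SCALES.get(word);
                current = 0;
            } else {
                throw new IllegalArgumentException("Not a number word: " + token);
            }
        }
        return total + current;
    }
}
```

Applied to the two NUMBER tokens from the test above, `normalize(Arrays.asList("one", "hundred"))` yields one value for the whole expression, which is the kind of result a normalization feature in the type system would have to store.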

@alaindesilets
Contributor Author

alaindesilets commented Nov 22, 2019 via email

@reckart
Member

reckart commented Nov 22, 2019

I'd probably simply use a string feature even for numeric/boolean values...

IMHO having an "Object" attribute also isn't a great solution because it would also require type-casting.

The equivalent to an "Object" attribute in UIMA would be a "Feature Structure"-type attribute which could then point to e.g. a to-be-defined "DoubleValue" Feature structure which simply has a feature "value" of the type "double".
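The two options discussed so far (a plain string feature versus a reference to a small value holder) can be compared in a plain-Java analogy. None of the classes below are UIMA API: `StringNormalization`, `NormalizedValue`, `DoubleValue` and `QuantityAnnotation` are hypothetical stand-ins that only mimic the shape of the corresponding features.

```java
// Plain-Java analogy only: none of these classes are UIMA API.

/** Option 1: keep the normalized value as a string and parse it on read. */
final class StringNormalization {
    final String value;

    StringNormalization(String value) {
        this.value = value;
    }

    double asDouble() {
        return Double.parseDouble(value);
    }
}

/** Option 2: a common supertype for value holders, like a feature-structure-typed feature. */
abstract class NormalizedValue {
}

/** Holder with a single "value" field of type double. */
final class DoubleValue extends NormalizedValue {
    final double value;

    DoubleValue(double value) {
        this.value = value;
    }
}

/** Stand-in for an annotation that points to one of the holders. */
final class QuantityAnnotation {
    final int begin;
    final int end;
    final NormalizedValue normalized;

    QuantityAnnotation(int begin, int end, NormalizedValue normalized) {
        this.begin = begin;
        this.end = end;
        this.normalized = normalized;
    }

    /** The reader still has to check which holder type it received. */
    double normalizedAsDouble() {
        if (normalized instanceof DoubleValue) {
            return ((DoubleValue) normalized).value;
        }
        throw new IllegalStateException("Normalization is not a DoubleValue");
    }
}
```

Either way the consumer does some conversion: a parse in the string case, a type check in the holder case. That is the trade-off behind preferring a simple string feature.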

UIMAv3 also has new features to store custom objects in the CAS - but I have never tried this out so far: https://uima.apache.org/d/uimaj-current/version_3_users_guide.html#uv3.custom_java_objects - might be worth a look.

As for contributing via a PR - see here: https://dkpro.github.io/contributing/

@reckart
Member

reckart commented Nov 22, 2019

I wonder why time expressions, monetary expressions and so on are even considered as named entities / handled by the CoreNLP NER tools. They are not really entities... in particular not named ones.

@jcklie
Contributor

jcklie commented Nov 22, 2019

They are annotated in Ontonotes at least.

@alaindesilets
Contributor Author

alaindesilets commented Nov 22, 2019 via email

@alaindesilets
Contributor Author

alaindesilets commented Nov 22, 2019 via email

@reckart
Member

reckart commented Nov 22, 2019

> Very awkward in my opinion. I am sure the UIMA people had a good reason for inventing their own type system instead of just going with Java's, but I have never seen an explanation of the rationale.

UIMA is supposed to be cross-platform. There is a C++ implementation provided by the Apache UIMA project. There are also some outdated Python bindings and the more recent DKPro Cassis library which implements the CAS in Python. So just using the full Java type system wouldn't really do.

@alaindesilets
Contributor Author

alaindesilets commented Nov 22, 2019 via email

@reckart
Member

reckart commented Nov 22, 2019

> Too bad the Python bindings are outdated. There are lots of excellent Python NLP frameworks out there (spaCy in particular).

That's why we have built DKPro Cassis :) We use it, among other things, to connect tools such as spaCy to the UIMA-based INCEpTION annotation editor.

Wrt. object serialization - the best place to discuss this would be the UIMA user's mailing list.
