The Dictionary Matcher Transformer is an information extraction tool that extracts words and phrases from text based on SKOS taxonomies. It supports multiple languages and performs very fast keyword matching even with huge taxonomies. The transformer is based on a modified version of the Aho-Corasick string-matching algorithm.
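To give a feel for the matching step, here is a minimal sketch of the textbook Aho-Corasick algorithm in Python. It is not the transformer's own (modified) implementation: the function names are hypothetical, matching is exact and case-sensitive, and there is no stemming or word-boundary handling.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists for the patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns recognised when a state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: failure links of shallow states are known
    # before deeper states need them.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_matches(text, patterns):
    """Return (begin, end, pattern) for every occurrence; end is exclusive."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, i + 1, pattern))
    return matches
```

The text is scanned once, whatever the number of dictionary entries, which is why this family of algorithms copes well with large taxonomies. For example, `find_matches("ushers", ["he", "she", "his", "hers"])` reports `she`, `he` and `hers` in a single pass.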
To try it out, you can download the binaries of the latest release from the release page.
Then start the application with
java -jar dictionary-matcher-transformer-v1.0.0-*-jar-with-dependencies.jar
Once it has started, use cURL to test it:
curl -X POST -d "Frauds and Swindlings cause significant concerns with regards to Ethics." "http://localhost:8301/?taxonomy=http://data.nytimes.com/descriptors.rdf"
Compile the source code and run the application using Maven with
mvn clean install exec:java
Or start the application from the binary
java -jar dictionary-matcher-transformer-v1.0.0-*-jar-with-dependencies.jar
This will start the Dictionary Matcher Transformer on port 8301 by default.
java -Xmx{size} -jar {jar-name} [options]
-P|--port int The port on which the proxy shall listen
-C|--enableCors Enable a liberal CORS policy
-H|--help Show help on command line arguments
The supported input and output formats of the transformer can be retrieved with the following GET request:
curl -X GET "http://localhost:8301/"
<http://localhost:8301/>
<http://vocab.fusepool.info/transformer#supportedInputFormat>
"text/turtle"^^<http://www.w3.org/2001/XMLSchema#string> , "text/plain"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://vocab.fusepool.info/transformer#supportedOutputFormat>
"text/turtle"^^<http://www.w3.org/2001/XMLSchema#string> .
The transformer accepts the input data enclosed in the request message’s body, and expects the URI of the taxonomy (and additional options) in the query string.
curl -X POST --data-binary <data> "http://localhost:8301/?taxonomy=<taxonomy_URI>&stemming=<stemming_language>&casesensitive=true"
taxonomy
- URI of the taxonomy (it must be a valid resource location)
stemming
- optional - if present, enables stemming (supported languages: danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, turkish)
casesensitive
- optional - if present, enables case sensitivity
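As an illustration of how these parameters combine, the following Python sketch builds the POST request with the standard library only. The helper name `build_request`, its keyword arguments and the `Content-Type: text/plain` header are assumptions made for this example (the header is chosen from the supported input formats listed above); the optional parameters are added to the query string only when requested, since their presence is what enables them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(text, taxonomy, stemming=None, case_sensitive=False,
                  base_url="http://localhost:8301/"):
    """Build (but do not send) a POST request for the transformer."""
    params = {"taxonomy": taxonomy}
    if stemming:
        # One of the supported stemming languages, e.g. "english".
        params["stemming"] = stemming
    if case_sensitive:
        params["casesensitive"] = "true"
    # urlencode percent-encodes the taxonomy URI so it is safe in a query string.
    url = base_url + "?" + urlencode(params)
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},  # assumed input format
        method="POST",
    )


# Sending the request requires a running instance, for example:
#   from urllib.request import urlopen
#   turtle = urlopen(build_request(text, taxonomy)).read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect the resulting URL before calling a live instance.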
The following curl example invokes a locally running instance of the Dictionary Matcher Transformer:
curl -X POST --data-binary "Frauds and Swindlings cause significant concerns with regards to Ethics." "http://localhost:8301/?taxonomy=http://data.nytimes.com/descriptors.rdf&stemming=english"
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body1>
a <http://vocab.fusepool.info/fam#LinkedEntity> ;
<http://vocab.fusepool.info/fam#entity-label>
"Frauds and Swindling"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://vocab.fusepool.info/fam#entity-mention>
"Frauds and Swindlings"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://vocab.fusepool.info/fam#entity-reference>
<http://data.nytimes.com/N38522309997148425060> ;
<http://vocab.fusepool.info/fam#extracted-from>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897> ;
<http://vocab.fusepool.info/fam#selector>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=0,21> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation1>
a <http://www.w3.org/ns/oa#Annotation> ;
<http://www.w3.org/ns/oa#annotatedAt>
"2014-10-30T14:35:26+0100"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://www.w3.org/ns/oa#annotatedBy>
<p3-dictionary-matcher-transformer> ;
<http://www.w3.org/ns/oa#hasBody>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body1> ;
<http://www.w3.org/ns/oa#hasTarget>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#sp-resource1> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#sp-resource1>
a <http://www.w3.org/ns/oa#SpecificResource> ;
<http://www.w3.org/ns/oa#hasSelector>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=0,21> ;
<http://www.w3.org/ns/oa#hasSource>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body1> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=0,21>
a <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#String> , <http://vocab.fusepool.info/fam#NifSelector> ;
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#beginIndex>
"0"^^<http://www.w3.org/2001/XMLSchema#int> ;
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#endIndex>
"21"^^<http://www.w3.org/2001/XMLSchema#int> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body2>
a <http://vocab.fusepool.info/fam#LinkedEntity> ;
<http://vocab.fusepool.info/fam#entity-label>
"Ethics"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://vocab.fusepool.info/fam#entity-mention>
"Ethics"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://vocab.fusepool.info/fam#entity-reference>
<http://data.nytimes.com/48662871776634757120> ;
<http://vocab.fusepool.info/fam#extracted-from>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897> ;
<http://vocab.fusepool.info/fam#selector>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=65,71> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation2>
a <http://www.w3.org/ns/oa#Annotation> ;
<http://www.w3.org/ns/oa#annotatedAt>
"2014-10-30T14:35:26+0100"^^<http://www.w3.org/2001/XMLSchema#string> ;
<http://www.w3.org/ns/oa#annotatedBy>
<p3-dictionary-matcher-transformer> ;
<http://www.w3.org/ns/oa#hasBody>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body2> ;
<http://www.w3.org/ns/oa#hasTarget>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#sp-resource2> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#sp-resource2>
a <http://www.w3.org/ns/oa#SpecificResource> ;
<http://www.w3.org/ns/oa#hasSelector>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=65,71> ;
<http://www.w3.org/ns/oa#hasSource>
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#annotation-body2> .
<http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897#char=65,71>
a <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#String> , <http://vocab.fusepool.info/fam#NifSelector> ;
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#beginIndex>
"65"^^<http://www.w3.org/2001/XMLSchema#int> ;
<http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#endIndex>
"71"^^<http://www.w3.org/2001/XMLSchema#int> .
This application implements the requirements in FP-39, FP-105 and FP-197.
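The selectors in the example response above carry their character offsets in the URI fragment (`#char=begin,end`). The Python sketch below shows how such a fragment can be mapped back to the matched text. The helper names are hypothetical, and it assumes the offsets count characters from zero with an exclusive end, which is consistent with the two mentions in the example.

```python
import re


def selector_span(selector_uri):
    """Extract (begin, end) from a selector URI ending in '#char=begin,end'."""
    match = re.search(r"#char=(\d+),(\d+)$", selector_uri)
    if match is None:
        raise ValueError("no char selector in " + selector_uri)
    return int(match.group(1)), int(match.group(2))


def mention(text, selector_uri):
    """Return the substring of text that the selector points at."""
    begin, end = selector_span(selector_uri)
    return text[begin:end]


text = "Frauds and Swindlings cause significant concerns with regards to Ethics."
base = "http://localhost:8301/bad1b7a2-431a-4861-acd5-f01515a6d897"
print(mention(text, base + "#char=0,21"))   # Frauds and Swindlings
print(mention(text, base + "#char=65,71"))  # Ethics
```

The sliced strings correspond to the `fam#entity-mention` values in the response, while `fam#entity-label` holds the taxonomy label that was matched.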