petermr edited this page Aug 15, 2020 · 14 revisions

Dictionary creation using amidict

Overview

purpose

To create an AMI dictionary.

description

An AMI dictionary is fundamentally a list of terms (words and phrases), with variants (synonyms, abbreviations) and annotations (links to other resources, especially Wikidata). Dictionary authors can select and annotate terms and variants in many ways, ranging from typing them on the commandline to searching Wikidata.

picocli help

2020-08-01

amidict create --help
Usage: amidict create [-hV] [--query[=query]] [--directory=<directory>]
                      [-i=FILE] [--informat=input format] [--linkcol=<linkCol>]
                      [--sparqlquery=<sparqlQueryFile>] [--termcol=<termCol>]
                      [--termfile=<termfile>] [--testString=<testString>]
                      [--wptype=<wptype>] [--sparqlmap=<String=String>[,
                      <String=String>...]]... [--synonyms=<synonymList>]...
                      [--wikilinks[=<wikiLinks>[,<wikiLinks>...]...]]...
                      [--namecol=<nameCol>...] [-d=<dictionaryList>[,
                      <dictionaryList>...]...]... [--datacols=datacol[,
                      datacol...]...]... [--hrefcols=hrefcol[,
                      hrefcol...]...]... [-L=PATH...]... [--outformats=output 
                      format[,output format...]...]...
                      [--template=<templateNames>...]... [--terms=<terms>[,
                      <terms>...]...]...
creates dictionaries from text, Wikimedia, etc..
TBD
  -d, --dictionary=<dictionaryList>[,<dictionaryList>...]...
                            input or output dictionary name/s. for 'create'
                              must be singular; when 'display' or 'translate',
                              any number. Names should be lowercase, unique.
                              [a-z][a-z0-9._]. Dots can be used to structure
                              dictionaries intodirectories. Dictionary names
                              are relative to 'directory'. If <directory> is
                              absent then dictionary names are absolute.
      --datacols=datacol[,datacol...]...
                            use these columns (by name) as additional data
                              fields in dictionary. datacols='foo,bar' creates
                              foo='fooval1' bar='barval1' if present. No
                              controlled use or vocabulary and no hyperlinks.
      --directory=<directory>
                            top directory containing dictionary/s.
                              Subdirectories will use structured names (NYI).
                              Thus dictionary 'animals' is found in
                              '<directory>/animals.xml', while 'plants.parts'
                              is found in <directory>/plants/parts.xml.
                              Required for relative dictionary names.
  -h, --help                Show this help message and exit.
      --hrefcols=hrefcol[,hrefcol...]...
                            external hyperlink column from table; might be
                              Wikidata or remote site(s)
  -i, --input=FILE          Input filename (no defaults)
      --informat=input format
                            input format (csv, list, mediawikitemplate,
                              wikisparqlcsv, wikisparqlxml, wikicategory,
                              wikipage, wikitable, wikitemplate)
  -L, --inputnamelist=PATH...
                            List of inputnames; will iterate over them,
                              essentially compressing multiple commands into
                              one. Experimental.
      --linkcol=<linkCol>   column to extract link to internal pages. main use
                              Wikipedia. Defaults to the 'name' column
      --namecol=<nameCol>...
                            column(s) to extract name; use exact case (e.g.
                              Common name)
      --outformats=output format[,output format...]...
                            output format (xml, html, json); default XML
      --query[=query]       generate query for cut and paste into EPMC or
                              similar. value sets size of chunks (too large
                              crashes EPMC). If missing, no query generated.Not
                              very useful.
      --sparqlmap=<String=String>[,<String=String>...]
                            maps wikidata/SPARQL name onto AMIDict names.
                              builtin names = term, name, wikidata, wikipedia,
                              description, wikidata names are p_* (properties)
                              and q_* (items), other names are _* , everything
                              else is an error.Mandatory for wikisparql inputs
      --sparqlquery=<sparqlQueryFile>
                            File with wikidata query
      --synonyms=<synonymList>
                            synonyms retrived from source. Syntax depends on
                              source type.for `sparql` `AltLabels` this is a
                              single String with comma-separated synonyms (and
                              maybe extraneous commas)
      --template=<templateNames>...
                            names of Wikipedia Templates, e.g.
                              Viral_systemic_diseases (note underscores not
                              spaces). Dictionaries will be created with
                              lowercasenames and all punctuation removed).
      --termcol=<termCol>   column(s) to extract term; use exact case (e.g.
                              Term). Could be same as namecol
      --termfile=<termfile> list of terms in file, line-separated. <basename>
                              will become dictionary name, i.e. terpenes.txt
                              creates basename=terpenes
      --terms=<terms>[,<terms>...]...
                            list of terms (entries), space-separated. Requires
                              `inputname` or `dictionary`
      --testString=<testString>
                            String input for debugging; semantics depend on task
  -V, --version             Print version information and exit.
      --wikilinks[=<wikiLinks>[,<wikiLinks>...]...]
                            try to add link to Wikidata and/or Wikipedia page
                              of same name.
      --wptype=<wptype>     type of input (HTML , mediawiki)

context

see amidict for relation to amidict and general commands.

Also developed in parallel with Java Tests in org.contentmine.ami.dictionary , see https://github.com/petermr/ami3/blob/master/src/test/java/org/contentmine/ami/tools/AMIDictionaryTest.java

help

amidict create --help

Test data

see https://github.com/petermr/ami3/tree/master/src/test/resources/org/contentmine/ami/dictionary for examples of files used as input and expected output.

Test code

Most of the examples are in: https://github.com/petermr/ami3/blob/master/src/test/java/org/contentmine/ami/dictionary/AMIDictCreateTest.java The command is usually spread over several lines and would need re-joining manually if you copy it for a test. It also needs amidict at the start.
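The re-joining can be sketched in a few lines of Java. This is only an illustration: the class and method names (`AmidictCommandJoiner`, `toCommandLine`) are hypothetical and are not part of ami3. It collapses the whitespace in a test-style `cmd` string and puts `amidict` at the start.

```java
// Hypothetical helper, not part of ami3: turns a test-style command string
// into a single command line that starts with "amidict".
class AmidictCommandJoiner {

    /** Collapses every run of whitespace to one space and prefixes the program name. */
    static String toCommandLine(String cmd) {
        return "amidict " + cmd.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Fragments copied in the style of the Java tests.
        String cmd = " "
                + " -vvvv"
                + " --dictionary myterpenes"
                + " --directory=target/dictionary/create"
                + " create"
                + " --terms thymol menthol borneol junkolol "
                + " --informat list"
                + " --outformats xml";
        System.out.println(toCommandLine(cmd));
    }
}
```

The result is one line that can be pasted into a shell. Quoting of arguments containing spaces is not handled.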

command

amidict <input and output> create <options>

input and output options

The most important are:

--dictionary

This is the name of the dictionary and occurs in its title attribute and filename.

e.g.

--dictionary drug

creates or reads a dictionary file drug.xml which starts:

<?xml version="1.0" encoding="UTF-8"?> 
  <dictionary title="drug" ...

If the file is renamed so that the filename no longer matches the title, an error will be thrown. Note that names must start with a letter and must not contain spaces or other non-alphanumeric characters; only [a-z0-9_] are allowed.
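A minimal sketch of this naming rule in Java, assuming the rule means "a lowercase letter followed by lowercase letters, digits or underscores". The class `DictionaryNameCheck` and its methods are hypothetical names for illustration, not the check that `amidict` itself performs. The help text also mentions dots for structuring dictionaries into directories; that is not handled here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the naming rule; not the ami3 implementation.
class DictionaryNameCheck {

    // Assumed reading of the rule: first a lowercase letter, then [a-z0-9_].
    private static final Pattern NAME = Pattern.compile("[a-z][a-z0-9_]*");

    /** Returns true if the name satisfies the assumed rule. */
    static boolean isValidName(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    /** Maps a dictionary name to its filename, e.g. drug -> drug.xml. */
    static String fileNameFor(String name) {
        if (!isValidName(name)) {
            throw new IllegalArgumentException("invalid dictionary name: " + name);
        }
        return name + ".xml";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("drug"));
        System.out.println(isValidName("my drug"));
    }
}
```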

create from terms

in Java test

		String cmd = " "
				+ " -vvvv"                                     // debug output
				+ " --dictionary myterpenes"                   // name of dictionary
				+ " --directory=target/dictionary/create"      // where to put dictionary
				+ " --inputname miniterpenes"                  // ??? is this needed?
				+ " create"                                    // subcommand
				+ " --wikilinks wikidata wikipedia"            // create linked to both Wikipedia and Wikidata (IDs)
				+ " --terms thymol menthol borneol junkolol "  // list of terms to include in dictionary
				+ " --informat list"                           // format of input
				+ " --outformats xml"		               // format of dictionary
				;
		AbstractAMIDictTool dictionaryTool = AMIDict.execute(DictionaryCreationTool.class, cmd);

on commandline

amidict  --dictionary myterpenes  --directory=target/dictionary/create  --inputname miniterpenes \
    create \
      --informat list --terms thymol menthol borneol junkolol --wikilinks wikidata wikipedia --outformats xml

Java test

org.contentmine.ami.tools.AMIDictionaryTest.testCreateWikipedia()

This adds links to Wikipedia and Wikidata, but we are probably doing it better with SPARQL.

	public void testCreateWikipedia() {
		String cmd = " "
				+ " -vvvv"
				+ " --dictionary myterpenes"
				+ " --directory=target/dictionary/create"
				+ " --inputname miniterpenes"
				+ " create"
				+ " --wikilinks wikidata wikipedia"
				+ " --terms thymol "
				+ " menthol borneol"
				+ " junkolol "
				+ " --informat list"
				+ " --outformats xml"		
				;
		AbstractAMIDictTool dictionaryTool = AMIDict.execute(DictionaryCreationTool.class, cmd);
	}

*This section describes current dictionary building* 2020-08-05

create from Wikidata SPARQL queries:

This is increasingly the preferred method for items likely to have Wikidata entries.

  • Create your query interactively on https://query.wikidata.org until you are happy with it. It will normally start with SELECT and have balanced {...}. The SPARQL variable names do not have to match amidict dictionary attribute names. Your query will normally have at least 5 variable names (see below). Always test with English ("en") first.
  • cut and paste into a local file (e.g. disease.sparql)
  • create a mapping (--sparqlmap) of the amidict keys to the wikidata-sparql variable names. (see below)
  • add --sparqlquery and --sparqlmap to the amidict create options
  • run. Note that this will run the query remotely and download the results into <directory>/<dictionary>.xml. If you have a poor connection this may be slow. The query itself should be as quick as in the Wikidata query service GUI.
  • report problems on your wiki. Try to capture all relevant messages.

### Typical query:


SELECT ?wikidata ?wikidataLabel ?wikidataAltLabel ?wikidataDescription ?wikipedia WHERE {
    ?wikidata wdt:P31 wd:Q12136 .
	SERVICE wikibase:label {
		bd:serviceParam wikibase:language "en"
	}
	OPTIONAL {
		?wikipedia schema:about ?wikidata .
		?wikipedia schema:inLanguage "en" . 
		?wikipedia schema:isPartOf <https://en.wikipedia.org/> .
	} 
}
LIMIT 4

This submits the triple:

    ?wikidata wdt:P31 wd:Q12136 .

read as "all items with property P31 ('instance of') whose value is Q12136 ('disease')"

(You can replay this interactively with https://query.wikidata.org/#%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FitemAltLabel%20%3FitemDescription%20%3Fwikipedia%20%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ12136%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%20%20%20%20%20%20OPTIONAL%20%7B%0A%20%20%20%20%20%20%3Fwikipedia%20schema%3Aabout%20%3Fitem%20.%0A%20%20%20%20%20%20%3Fwikipedia%20schema%3AinLanguage%20%22en%22%20.%0A%20%20%20%20%20%20%3Fwikipedia%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E%20.%0A%0A%20%20%20%20%7D%0A%0A%7D%0ALIMIT%204

which brings up the GUI with the query).

SELECT

This prints or downloads the values of the selected variables, as a table. You can select any free variable (?foo) and also its Label, Description, or AltLabel if you use SERVICE (see below).

SERVICE

The labels (human-readable text) are NOT properties or items and must be added with the magic:

	SERVICE wikibase:label {
		bd:serviceParam wikibase:language "en"
	}

You can add other languages to this, and we are doing so. SERVICE creates ?wikidataLabel, ?wikidataDescription and ?wikidataAltLabel for ?wikidata, etc.

OPTIONAL

This is used when an item MIGHT be present (e.g. not all Wikidata items have corresponding Wikipedia pages).

	OPTIONAL {
		?wikipedia schema:about ?wikidata .
		?wikipedia schema:inLanguage "en" . 
		?wikipedia schema:isPartOf <https://en.wikipedia.org/> .
	} 

This rather convoluted query is because Wikipedia pages do not have QNumbers (probably because they are language-dependent). Read as: Find items which are "about" the item ?wikidata (probably by containment), in English, and which are part of (English) Wikipedia. (Yes, I copied it).

LIMIT

Only output a subsection of the results (useful for debugging). May not reduce query processing time.


*Earlier sparql method. Only use if you cannot run sparql queries from `amidict`*

SPARQL results

The SPARQL endpoint creates XML output (in a file named sparql, unfortunately without a suffix). A typical result is a table-like file. The variable names are those submitted by the query-er and are uncontrolled. They are needed for translation to amiNames. The "rows" contain named "cells" which have Wikidata dataTypes (`literal` (a string), `uri`, dates, numbers and maybe more). Cells can sometimes be missing. These results are for a "country" query.

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
		<variable name='wikidata'/>
		<variable name='wikidataLabel'/>
		<variable name='_iso3166'/>
		<variable name='wikipedia'/>
		<variable name='alt'/>
		<variable name='synonym'/>
		<variable name='wikidataDescription'/>
		<variable name='term'/>
	</head>
	<results>
		<result>
			<binding name='wikidata'>
				<uri>http://www.wikidata.org/entity/Q889</uri>
			</binding>
			<binding name='wikipedia'>
				<uri>https://en.wikipedia.org/wiki/Afghanistan</uri>
			</binding>
			<binding name='wikidataLabel'>
				<literal xml:lang='en'>Afghanistan</literal>
			</binding>
			<binding name='wikidataDescription'>
				<literal xml:lang='en'>sovereign state situated at the confluence of Western, Central, and South Asia</literal>
			</binding>
			<binding name='synonym'>
				<literal xml:lang='en'>AFG, af, Islamic Republic of Afghanistan</literal>
			</binding>
			<binding name='_iso3166'>
				<literal>AF</literal>
			</binding>
			<binding name='alt'>
				<literal xml:lang='en'>AFG, af, 🇦🇫, Islamic Republic of Afghanistan</literal>
			</binding>
			<binding name='term'>
				<literal xml:lang='en'>Afghanistan</literal>
			</binding>
		</result>
...
        </results>
</sparql>
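As an illustration of the "rows" and named "cells", here is a minimal Java sketch that reads a results file of this shape with the standard DOM parser. The class `SparqlResultsReader` is a hypothetical name and this is not the ami3 reader. It treats every binding as plain text and simply omits cells that are missing.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for SPARQL XML results; not the ami3 implementation.
class SparqlResultsReader {

    static final String NS = "http://www.w3.org/2005/sparql-results#";

    /** Returns one map per <result>, keyed by binding name; missing cells are absent. */
    static List<Map<String, String>> readRows(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Element result = (Element) results.item(i);
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = result.getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element binding = (Element) bindings.item(j);
                    // The <uri> or <literal> child holds the value; ignore its datatype here.
                    row.put(binding.getAttribute("name"), binding.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalStateException("cannot read SPARQL results", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sparql xmlns='http://www.w3.org/2005/sparql-results#'><results><result>"
                + "<binding name='wikidata'><uri>http://www.wikidata.org/entity/Q889</uri></binding>"
                + "<binding name='term'><literal xml:lang='en'>Afghanistan</literal></binding>"
                + "</result></results></sparql>";
        System.out.println(readRows(xml));
    }
}
```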

queries

https://www.wikidata.org/w/index.php?title=Wikidata:SPARQL_query_service/queries/examples/maintenance&action=history may be useful

processing sparql results

This will change sporadically.

The variable names in the results are user-defined and arbitrary. They need to be mapped to allowed amidict names, e.g. in picocli syntax:

	wikidataURL=wikidata,
	wikipediaURL=wikipedia,
	description=wikidataDescription,
	wikidataAltLabel=wikidataAltLabel,
	term=wikidataLabel,
	name=wikidataLabel

Thus the value of

    <binding name='wikidata'>
        <uri>http://www.wikidata.org/entity/Q889</uri>
    </binding>

is transferred to wikidataURL or:

wikidataURL="http://www.wikidata.org/entity/Q889"

in the dictionary.
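The transfer can be sketched as follows, assuming each mapping pair reads `amidictName=sparqlVariableName` as in the example above. `SparqlMapSketch` and its methods are hypothetical names for illustration only. The real `amidict` code may treat missing bindings or unknown names differently.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of applying a --sparqlmap value; not the ami3 code.
class SparqlMapSketch {

    /** Parses "amiName=sparqlName,amiName=sparqlName" into an ordered map. */
    static Map<String, String> parseSparqlMap(String spec) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing comma or blank line
            }
            int eq = trimmed.indexOf('=');
            if (eq <= 0 || eq == trimmed.length() - 1) {
                throw new IllegalArgumentException("expected name=name but found: " + trimmed);
            }
            map.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return map;
    }

    /** Copies each bound SPARQL value to the dictionary attribute it is mapped to. */
    static Map<String, String> toEntryAttributes(Map<String, String> sparqlMap,
                                                 Map<String, String> row) {
        Map<String, String> attributes = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : sparqlMap.entrySet()) {
            String value = row.get(e.getValue());
            if (value != null) { // cells can be missing
                attributes.put(e.getKey(), value);
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> sparqlMap =
                parseSparqlMap("wikidataURL=wikidata, term=wikidataLabel, name=wikidataLabel");
        Map<String, String> row = new HashMap<>();
        row.put("wikidata", "http://www.wikidata.org/entity/Q889");
        row.put("wikidataLabel", "Afghanistan");
        System.out.println(toEntryAttributes(sparqlMap, row));
    }
}
```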

transformations

We may need to do processing on some of the strings. We cannot retrieve the Wikidata QID directly.

--transformName wikidataID:wikidataURL.EXTRACT(.*/(.*)) 

will extract just the last field of the URL as the QID.

NOT YET Implemented 2020-08-05. Likely to change.
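Until this is implemented, the intended effect of the `.*/(.*)` expression can be checked with plain Java regular expressions. This is a sketch only; the class `QidExtractSketch` is a hypothetical name and says nothing about how `--transformName` will eventually be implemented.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the EXTRACT expression; not part of ami3.
class QidExtractSketch {

    // Greedy ".*" consumes everything up to the last '/', the group captures the rest.
    private static final Pattern LAST_FIELD = Pattern.compile(".*/(.*)");

    /** Returns the text after the last '/' or null if the URL has no '/'. */
    static String extractLastField(String url) {
        Matcher matcher = LAST_FIELD.matcher(url);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractLastField("http://www.wikidata.org/entity/Q889"));
    }
}
```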


*This is an older messier method*
### `amidict create ` options.

download XML manually from Wikidata endpoint

The SPARQL can be run manually in the Wikidata query service GUI and the "Link" button (Right) gives "SPARQL endpoint" which can download the results (to file sparql in your download folder). A bit messy - practice it. This can be re-input into amidict to create the dictionary, --sparqlmap will be required.
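If clicking through the GUI is too messy, the same download can be scripted. The sketch below assumes the public endpoint `https://query.wikidata.org/sparql`, a `query` URL parameter and an `Accept` header requesting SPARQL XML results. The class name, the User-Agent string and the `download` helper are hypothetical and are not part of `amidict`. It needs Java 11 or later.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical downloader for SPARQL XML results; not part of ami3.
class SparqlDownloadSketch {

    // Assumed endpoint of the Wikidata query service.
    static final String ENDPOINT = "https://query.wikidata.org/sparql";

    /** Builds the GET URI with the query URL-encoded into the "query" parameter. */
    static URI requestUri(String query) {
        return URI.create(ENDPOINT + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8));
    }

    /** Runs the query remotely and writes the XML results to the target file. */
    static void download(String query, Path target) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(requestUri(query))
                .header("Accept", "application/sparql-results+xml")
                .header("User-Agent", "amidict-wiki-example/0.1") // placeholder identifier
                .GET()
                .build();
        HttpResponse<Path> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() != 200) {
            throw new IllegalStateException("HTTP status " + response.statusCode());
        }
    }

    public static void main(String[] args) {
        // Only prints the URI; call download(...) to fetch the results.
        System.out.println(requestUri("SELECT ?x WHERE { ?x wdt:P31 wd:Q12136 } LIMIT 4"));
    }
}
```

The downloaded file can then be given to `amidict` with `--input` and `--informat wikisparqlxml`, together with `--sparqlmap`.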

BUGS

 Dictionary name and filename confused

(Source: Priya)

I have created a disease dictionary with 10 entries at https://github.com/petermr/openVirus/blob/master/dictionaries/diseases/issue/disease_icd.xml from the sparql input disease_icd using the syntax

(//Comments by PMR)

amidict -vv
 --dictionary disease         // the NAME of the dictionary
 --directory dic              // the directory for output
 --input disease_icd          // the name of the input file (sparql results)
create 
 --informat wikisparqlxml     // the input format
 --sparqlmap                  // mapping from SPARQL query to `amidict` syntax
wikidataURL=wikidata,
wikipediaURL=wikipedia,
wikidataAltNames=wikidataAltLabel,
name=wikidataLabel,
term=wikidataLabel,
Description=wikidataDescription,
ICD-10_codes=ICD_10
 --transformName wikidataID=EXTRACT(wikidataURL,.*/(.*)) // creation of new attribute in dictionary
 --synonyms=wikidataAltLabel       // creation of synonyms

Here, though I used --dictionary disease, the output file in dictionary has dictionary title disease_icd which was the input file's name.

PMR: agree this is a problem

Intended output would be disease.xml containing

<dictionary title="disease" ...>
...
</dictionary>

#### analysis
BUG: The dictionaryName was later overwritten by the `--input` value. This was corrected. Please retry.

#### Test output
This ran to completion.

Version:

Generic values (DictionaryCreationTool)

    --testString : d null
    --wikilinks : d [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@f1da57d
    --datacols : d null
    --hrefcols : d null
    --informat : m wikisparqlxml
    --linkcol : d null
    --namecol : d null
    --outformats : d [Lorg.contentmine.ami.tools.AbstractAMIDictTool$DictionaryFileFormat;@4007f65e
    --query : d null
    --sparqlmap : m {wikidataURL=wikidata, wikipediaURL=wikipedia, wikidataAltNames=wikidataAltLabel, name=wikidataLabel, term=wikidataLabel, Description=wikidataDescription, ICD-10_codes=ICD_10}
    --sparqlquery : d null
    --synonyms : m [wikidataAltLabel]
    --template : d null
    --termcol : d null
    --termfile : d null
    --terms : d null
    --transformName : m {wikidataID=EXTRACT(wikidataURL,./(.))}
    --wptype : d null
    --inputnamelist : d null
    --input : d null
    --help : d false
    --version : d false
    --dictionary : d [disease] // PMR this should determine output names and the dictionary title
    --directory : d target/dictionary/create

Specific values (DictionaryCreationTool)

    dictionaryName: disease
    {wikidataLabel=[name, term], wikidataDescription=[Description], ICD_10=[ICD-10_codes], wikipedia=[wikipediaURL], wikidata=[wikidataURL], wikidataAltLabel=[wikidataAltNames]}
    WS>[Description, name, term, wikidataURL, ICD-10_codes, wikidataAltNames, wikipediaURL] // PMR input names need changing
    sparqlMap SHOULD contain key: description
    sparqlMap SHOULD contain key: wikidata // obsolete
    sparqlMap SHOULD contain key: wikipedia // obsolete
    unknown ami name: Description // should be lowercase
    unknown ami name: ICD-10_codes // should start with underscore
    unknown ami name: wikidataAltNames // ? use labels?
    unknown ami name: wikidataURL
    unknown ami name: wikipediaURL
    sparql names [wikidata, wikidataLabel, wikipedia, wikidataAltLabel, wikidataDescription, ICD_10]
    results 10
    dicc> cannot find binding: ICD_10
    dicc> cannot find binding: ICD_10
    dicc> cannot find binding: ICD_10
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: wikidataAltLabel
    dicc> cannot find binding: ICD_10
    FIX OUTPUT.
    // PMR ignore
    Crohn diseaseCrohnCrohn's disease of colonCrohn's disease of colon (disorder)Crohn's disease of large bowelGranulomatous ColitisPediatric Crohn's diseaseregional colitisregional enteritisregional enteritis of small intestine with large intestineregional enteritis of the large bowelregional Ileitisregional ileocolitis Ulcerative colitisUlcerative ColitisColitis Ulcerativehemorrhagic colitisLeft-sided ulcerative colitis SLESystemic lupus erythematosuslupusdisseminated lupus erythematosusLupus ErythematosussystemicSLE - Lupus Erythematosussystemic anaemia Hepatitis Bhepatitis B infectionhepatitis type Bserum hepatitisViral Hepatitis B Hepatitis Bhepatitis B infectionhepatitis type Bserum hepatitisViral Hepatitis B
    writing dictionary to /Users/pm286/workspace/cmdev/ami3/target/dictionary/create/disease.xml

// PMR the dictionary now has correct title and filename
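The "unknown ami name" messages in the output follow from the naming rule in the `--sparqlmap` help text: builtin names are term, name, wikidata, wikipedia and description, Wikidata names are p_* and q_*, other names are _*, and everything else is an error. A minimal Java sketch of that rule is below. `AmiNameCheck` is a hypothetical name and the prefix tests are an assumed reading of the help text, not the actual `amidict` check.

```java
import java.util.Set;

// Hypothetical illustration of the --sparqlmap naming rule; not the ami3 code.
class AmiNameCheck {

    // Builtin names as listed in the --sparqlmap help text.
    private static final Set<String> BUILTIN =
            Set.of("term", "name", "wikidata", "wikipedia", "description");

    /** Returns true for builtin names and for names starting with p_, q_ or _. */
    static boolean isAllowed(String name) {
        return BUILTIN.contains(name)
                || name.startsWith("p_")   // assumed form of Wikidata property names
                || name.startsWith("q_")   // assumed form of Wikidata item names
                || name.startsWith("_");   // any other user-defined name
    }

    public static void main(String[] args) {
        String[] names = {"Description", "ICD-10_codes", "wikidataURL", "description", "_icd10"};
        for (String name : names) {
            System.out.println(name + " : " + (isAllowed(name) ? "ok" : "unknown ami name"));
        }
    }
}
```

Under this reading, `Description` fails only because of its capital letter and `ICD-10_codes` would pass if renamed with a leading underscore, which agrees with the PMR comments in the output.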