provide human-readable sense IDs in `*tab` mapping files for Princeton WN #14

chiarcos · 2022-05-02T16:22:20Z

Traditionally, Princeton WordNet used two concurrent types of sense identification:

- numerical, e.g., 00182630-n
- human-readable, e.g., election%1:04:01::

In the mapping files, only the former are covered.

Request:

- Provide human-readable labels for all older-wn-mappings (and, potentially, all others).
- If stored as an additional column in the current tab files, this should have no side effects on existing software that uses the current mapping files.

Objective:

It would be nice to process older WordNet-annotated data with conventional RDF technology, without resorting to legacy software.

Use case:

- I am trying to build an RDF-native processing workflow for the SemCor corpus.
- SemCor is manually annotated against PWN 1.6, but provides only human-readable IDs.
- At the moment, the only viable way to retrieve a mapping from older human-readable to numerical IDs is to use platform-specific legacy software. (There seems to be neither an RDF nor a plain TAB edition of PWN 1.6.)
- TAB files, XML files and modern WordNets can be processed with conventional RDF technology (either natively or by means of wrapper technologies such as TARQL or Fintan), but there is no mapping from (ILI-unmapped) human-readable IDs to (ILI-mapped) numerical IDs.


chiarcos · 2022-05-02T16:26:13Z

Remark: I think the numerical IDs are synset IDs, the human-readable IDs are sense IDs. If this is right, the question can be rephrased as "extend the ILI concept mapping to sense IDs".
In that case, there may be more than one sense ID per synset ID. If these are concatenated with a specific separator, say `|`, this can still be represented in a 4-column TAB format.
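A minimal sketch of that idea: several sense IDs for one synset are joined with `|` and appended as one extra column, so existing columns stay untouched. The function names are hypothetical, the layout of the existing columns is not specified in this thread (the two-column row below is only an illustration), and the second sense key is a made-up placeholder to show the concatenation.

```python
SEPARATOR = "|"  # assumed separator; the code rejects keys that contain it


def append_sense_keys(columns, sense_keys, sep=SEPARATOR):
    """Return one TAB line: the existing columns plus one column of joined sense keys."""
    for key in sense_keys:
        if sep in key or "\t" in key:
            raise ValueError(f"sense key {key!r} collides with a delimiter")
    return "\t".join(list(columns) + [sep.join(sense_keys)])


def split_sense_keys(line, sep=SEPARATOR):
    """Inverse operation: return (other columns, list of sense keys) for one line."""
    *columns, joined = line.rstrip("\n").split("\t")
    return columns, (joined.split(sep) if joined else [])


# Illustrative row only; "example%1:04:00::" is a placeholder, not a real key.
line = append_sense_keys(
    ["i36368", "00181781-n"],
    ["election%1:04:01::", "example%1:04:00::"],
)
columns, keys = split_sense_keys(line)
```

Software that reads only the leading columns would ignore the extra column, which is the "no side effects" property requested above.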

chiarcos · 2022-05-02T23:16:09Z

Note: The pull request provides a linking between sense IDs and original synset IDs. This can be used in conjunction with ILI mappings, but is not directly integrated into ILI mapping files.

jmccrae · 2022-05-03T10:49:58Z

These mappings are already released as part of the existing Princeton WordNet releases (in the sense.index files). We didn't include sense mappings in this repository, because we want to avoid language-specific identifiers. I am not against accepting this change but perhaps @fcbond would like to comment as well.
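For readers who want to use the sense index directly, here is a rough sketch of reading it into a sense-key-to-synset-ID mapping. It assumes the layout documented for recent Princeton releases, where the file is named `index.sense` and each line starts with the sense key followed by the synset offset; older releases such as 1.6 should be checked against their own documentation. The function names, and the choice of `s` as the suffix for adjective satellites, are assumptions for illustration.

```python
# ss_type digit inside the sense key -> part-of-speech letter for the synset ID.
# Mapping "5" (adjective satellite) to "s" is an assumption; some resources use "a".
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}


def parse_index_sense_line(line):
    """Parse one sense index line into (sense_key, synset_id)."""
    fields = line.split()
    sense_key, offset = fields[0], fields[1]  # later fields are ignored
    ss_type = sense_key.rpartition("%")[2].split(":", 1)[0]
    return sense_key, f"{offset}-{SS_TYPE_TO_POS[ss_type]}"


def read_sense_index(lines):
    """Build {sense_key: synset_id} from an iterable of sense index lines."""
    mapping = {}
    for line in lines:
        if line.strip():  # skip blank lines
            key, synset_id = parse_index_sense_line(line)
            mapping[key] = synset_id
    return mapping


# Offset taken from the WordNet 3.0 example in this thread; the two trailing
# numbers are placeholder values, not real sense-number or tag-count data.
sample = ["election%1:04:01:: 00181781 1 0"]
mapping = read_sense_index(sample)
```

With a real file, the same function could be called as `read_sense_index(open(path, encoding="utf-8"))`, where `path` points at the sense index of the release in question.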

goodmami · 2022-05-04T15:16:45Z

Thanks, @chiarcos for your work on this issue. My opinion is the same as @jmccrae's except that I would be against including the changes here. CILI is meant to be an interlingual resource and not tied to any one wordnet (even though the descriptions are in English and there are mappings to WordNet synset IDs so that they may be used with wordnets produced via the "expand" methodology). So linking the ILIs to words in one of the English wordnets seems misplaced.

As @jmccrae mentioned, this data is encoded in the sense.index files, and as of the OMW 1.4 release the sense keys are included in omw-en and omw-en31 lexicons which are near-direct conversions of the Princeton WordNet 3.0 and 3.1 to the WN-LMF format. You can then build such a mapping using Wn:

>>> import wn
>>> en = wn.Wordnet('omw-en')  # wn.download('omw-en') if you don't have it
>>> s = en.senses()[0]  # just get the first sense as an example
>>> s  # sense ids are not sense keys
Sense('omw-en--apos-hood-08641944-n')
>>> s.metadata()  # but for omw-en and omw-en31 they are stored in the metadata
{'identifier': "'hood%1:15:00::"}
>>> sense_key_map = {  # build the mapping
...     s.metadata()['identifier']: s
...     for s in en.senses()
...     if 'identifier' in s.metadata()  # in case some senses do not have keys
... } 
>>> sense_key_map['election%1:04:01::']
Sense('omw-en-election-00181781-n')

This provides a mapping from the sense keys to the Sense objects, but you can then get to the synsets for other kinds of mappings:

>>> sense_key_map['election%1:04:01::'].synset()  # synset objects
Synset('omw-en-00181781-n')
>>> sense_key_map['election%1:04:01::'].synset().metadata()  # NLTK-style identifiers
{'identifier': 'election.n.01'}
>>> sense_key_map['election%1:04:01::'].synset().ili  # ILI
ILI('i36368')

Building this mapping may be a bit more manual of a process than it should be. I'm not sure Wn needs a custom function to build this mapping, but it could be useful to include it as a recipe in the documentation.
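As a sketch of what such a recipe might look like, the steps above can be wrapped in one function that goes straight from sense keys to ILI identifiers. The function name is hypothetical and not part of Wn. It only relies on the calls shown above (`metadata()`, `synset()`, `.ili`) and tolerates `.ili` being either an object with an `id` attribute or a plain string, since that detail may differ between Wn versions.

```python
def sense_key_to_ili(senses):
    """Map sense keys to ILI identifiers for an iterable of Wn-style senses.

    Senses without an 'identifier' in their metadata, or whose synset has
    no ILI, are skipped.
    """
    mapping = {}
    for sense in senses:
        key = sense.metadata().get("identifier")
        if key is None:
            continue  # this sense has no sense key
        ili = sense.synset().ili
        if ili is None:
            continue  # this synset is not linked to the ILI
        # Accept both an ILI object (with .id) and a bare string.
        mapping[key] = getattr(ili, "id", ili)
    return mapping
```

It would be called as `sense_key_to_ili(wn.Wordnet('omw-en').senses())`, assuming the lexicon has been downloaded.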

Hope this helps! I'm also interested to hear @fcbond's opinion.

chiarcos · 2022-05-04T21:11:40Z

Sure, decide as you see fit. I'm also not sure whether CILI is the best place to provide that information, but the sad truth is that a declarative mapping for older WordNets in a conventional format seems to be completely missing (the information is contained in the data.* or *.DAT files, but extracting it takes quite some processing). If the pull request is accepted, I will delete my fork; otherwise I will keep it and rename it to make clear that it includes additional, non-CILI information.

Also, the problem is not so much PWN 3.0 or newer resources as these are easily accessible. For the data at hand (SemCor), I need that for PWN 1.6, so that's why I created these mappings. (Or, have these been more stable than synset IDs so that I can just use [P]WN3.0 sense ids with PWN 1.6?)

PS: One part that I couldn't figure out was how to create correct %5-type sense IDs. These contain a lexical complement for which I didn't immediately find where to retrieve it from, so I produced only the left substring prior to that complement.
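For context, a sense key has the shape `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, and the last two fields are filled only for adjective satellites (`ss_type` 5), where they refer to the head of the satellite's cluster. The sketch below splits a key into these fields and flags `%5` keys whose head fields are empty. The names `SenseKey`, `parse_sense_key` and `lacks_head` are hypothetical, and the `%5` key is a made-up placeholder that only illustrates the shape.

```python
from typing import NamedTuple, Optional


class SenseKey(NamedTuple):
    lemma: str
    ss_type: int            # 1=noun, 2=verb, 3=adjective, 4=adverb, 5=satellite
    lex_filenum: int
    lex_id: int
    head_word: str          # empty unless ss_type == 5
    head_id: Optional[int]  # None unless ss_type == 5


def parse_sense_key(key: str) -> SenseKey:
    """Split a sense key into its five lex_sense fields plus the lemma."""
    lemma, sep, lex_sense = key.rpartition("%")
    if not sep:
        raise ValueError(f"no '%' in sense key: {key!r}")
    parts = lex_sense.split(":")
    if len(parts) != 5:
        raise ValueError(f"expected 5 ':'-separated fields in {key!r}")
    ss_type, lex_filenum, lex_id, head_word, head_id = parts
    return SenseKey(
        lemma, int(ss_type), int(lex_filenum), int(lex_id),
        head_word, int(head_id) if head_id else None,
    )


def lacks_head(sense_key: SenseKey) -> bool:
    """True for a satellite key whose head_word field is still empty."""
    return sense_key.ss_type == 5 and not sense_key.head_word


noun = parse_sense_key("election%1:04:01::")
satellite = parse_sense_key("lemma%5:00:00:headword:00")  # placeholder key
```

A check like `lacks_head` could be used to list the keys in a generated mapping that still need their head fields filled in.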

jmccrae · 2022-05-05T09:03:00Z

I could see the value of having mappings between sense keys and ILI IDs included in this repository, as many resources use these instead of the offset identifier.

@chiarcos yes, the %5 identifiers are very tricky to calculate :)

chiarcos linked a pull request May 2, 2022 that will close this issue

Provide complementary sense mapping #15

Open
