provide human-readable sense IDs in *tab mapping files for Princeton WN #14

Open
chiarcos opened this issue May 2, 2022 · 6 comments · May be fixed by #15

Comments

@chiarcos

chiarcos commented May 2, 2022

Traditionally, Princeton WordNet used two concurrent types of sense identification:

  • numerical, e.g., 00182630-n
  • human-readable, e.g., election%1:04:01::

In the mapping files, only the former are covered.
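To make the difference between the two identifier types concrete, here is a minimal sketch that splits both forms into their parts. It assumes the sense key layout documented for Princeton WordNet (`lemma%ss_type:lex_filenum:lex_id:head_word:head_id`) and the `offset-pos` shape of the numerical IDs; the names `SenseKey`, `parse_sense_key` and `parse_synset_id` are made up for this illustration and do not belong to any library.

```python
from typing import NamedTuple, Tuple


class SenseKey(NamedTuple):
    """Parts of a human-readable sense key (field names are illustrative)."""
    lemma: str
    ss_type: int      # 1=noun, 2=verb, 3=adjective, 4=adverb, 5=adjective satellite
    lex_filenum: int
    lex_id: int
    head_word: str    # empty unless ss_type is 5
    head_id: str      # kept as text because it is empty unless ss_type is 5


def parse_sense_key(key: str) -> SenseKey:
    # Split at the last '%' so the lemma part is kept intact.
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                    head_word, head_id)


def parse_synset_id(synset_id: str) -> Tuple[int, str]:
    # '00182630-n' -> numeric offset and part-of-speech letter
    offset, _, pos = synset_id.partition("-")
    return int(offset), pos


print(parse_sense_key("election%1:04:01::"))
print(parse_synset_id("00182630-n"))
```

Note that the sense key carries a lemma while the numerical ID does not, so one of these cannot be derived from the other without a lookup table.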

Request:

  • provide human-readable labels for all older-wn-mappings (and, potentially, all others)
  • if stored as an additional column to the current tab files, this should have no side-effects on existing software using the current mapping files.

Objective:

  • It would be nice to process older WordNet-annotated data with conventional RDF technology, without resorting to legacy software

Use case:

  • I am trying to build an RDF-native processing workflow for the SemCor corpus.
  • SemCor is manually annotated against PWN 1.6, but provides only human-readable IDs.
  • At the moment, the only viable way to retrieve a mapping from older human-readable to numerical IDs is to use platform-specific legacy software. (There seems to be neither an RDF nor a plain TAB edition of PWN 1.6.)
  • TAB files, XML files and modern WordNets can be processed with conventional RDF technology (either natively or by means of wrapper technologies such as TARQL or Fintan), but there is no mapping from (ILI-unmapped) human-readable IDs to (ILI-mapped) numerical IDs.
@chiarcos
Author

chiarcos commented May 2, 2022

Remark: I think the numerical IDs are synset IDs and the human-readable IDs are sense IDs. If this is right, the question can be rephrased as "extend the ILI concept mapping to sense IDs".
In that case, there may be more than one sense ID per synset ID. If these are concatenated with a specific separator, say |, this can still be represented in a 4-column TAB format.
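The concatenation idea could be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, the two-column input row is a simplification (the real mapping files may have a different column layout, which is why the synset column index is a parameter), and the second sense key is a made-up placeholder, not a real WordNet entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_sense_keys(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect sense keys per synset ID from (sense_key, synset_id) pairs."""
    grouped = defaultdict(list)
    for sense_key, synset_id in pairs:
        grouped[synset_id].append(sense_key)
    return dict(grouped)


def append_sense_column(line: str, grouped: Dict[str, List[str]],
                        synset_col: int, sep: str = "|") -> str:
    """Append one extra TAB column holding all sense keys of the row's synset."""
    fields = line.rstrip("\n").split("\t")
    keys = grouped.get(fields[synset_col], [])
    # Rows without known sense keys get an empty extra column.
    return "\t".join(fields + [sep.join(keys)])


pairs = [
    ("election%1:04:01::", "00181781-n"),
    ("placeholder%1:04:00::", "00181781-n"),  # placeholder, not a real sense key
]
grouped = group_sense_keys(pairs)
print(append_sense_column("i36368\t00181781-n\n", grouped, synset_col=1))
```

Because the new column is appended at the end, software that reads only the leading columns should keep working, which matches the "no side-effects" goal stated in the request.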

@chiarcos chiarcos linked a pull request May 2, 2022 that will close this issue
@chiarcos
Author

chiarcos commented May 2, 2022

Note: The pull request links sense IDs to the original synset IDs. This linking can be used in conjunction with the ILI mappings, but it is not directly integrated into the ILI mapping files.

@jmccrae
Member

jmccrae commented May 3, 2022

These mappings are already released as part of the existing Princeton WordNet releases (in the index.sense files). We didn't include sense mappings in this repository, because we want to avoid language-specific identifiers. I am not against accepting this change but perhaps @fcbond would like to comment as well.
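For readers who want to use the sense index directly, a minimal reading sketch might look like the one below. It assumes the line layout documented for the Princeton sense index (`sense_key synset_offset sense_number tag_cnt`, whitespace-separated) and derives the part-of-speech letter from the `ss_type` digit in the key. Whether adjective satellites should be written with `s` or `a` in a given mapping file is an assumption to check against that file. `read_sense_index` is a hypothetical helper name.

```python
import io
from typing import Dict, Iterable

# ss_type digit in the sense key -> part-of-speech letter used in synset IDs
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}


def read_sense_index(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sense key to an 'offset-pos' synset ID."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        sense_key, offset = parts[0], parts[1]
        ss_type = sense_key.rpartition("%")[2].split(":")[0]
        mapping[sense_key] = f"{offset}-{SS_TYPE_TO_POS[ss_type]}"
    return mapping


# One sample line; the last two fields (sense number, tag count) are placeholders.
sample = io.StringIO("election%1:04:01:: 00181781 1 0\n")
print(read_sense_index(sample))
```

With a real release, the `io.StringIO` sample would be replaced by an open file handle on that release's sense index.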

@goodmami
Member

goodmami commented May 4, 2022

Thanks, @chiarcos for your work on this issue. My opinion is the same as @jmccrae's except that I would be against including the changes here. CILI is meant to be an interlingual resource and not tied to any one wordnet (even though the descriptions are in English and there are mappings to WordNet synset IDs so that they may be used with wordnets produced via the "expand" methodology). So linking the ILIs to words in one of the English wordnets seems misplaced.

As @jmccrae mentioned, this data is encoded in the index.sense files, and as of the OMW 1.4 release the sense keys are included in the omw-en and omw-en31 lexicons, which are near-direct conversions of the Princeton WordNet 3.0 and 3.1 to the WN-LMF format. You can then build such a mapping using Wn:

>>> import wn
>>> en = wn.Wordnet('omw-en')  # wn.download('omw-en') if you don't have it
>>> s = en.senses()[0]  # just get the first sense as an example
>>> s  # sense ids are not sense keys
Sense('omw-en--apos-hood-08641944-n')
>>> s.metadata()  # but for omw-en and omw-en31 they are stored in the metadata
{'identifier': "'hood%1:15:00::"}
>>> sense_key_map = {  # build the mapping
...     s.metadata()['identifier']: s
...     for s in en.senses()
...     if 'identifier' in s.metadata()  # in case some senses do not have keys
... } 
>>> sense_key_map['election%1:04:01::']
Sense('omw-en-election-00181781-n')

This provides a mapping from the sense keys to the Sense objects, but you can then get to the synsets for other kinds of mappings:

>>> sense_key_map['election%1:04:01::'].synset()  # synset objects
Synset('omw-en-00181781-n')
>>> sense_key_map['election%1:04:01::'].synset().metadata()  # NLTK-style identifiers
{'identifier': 'election.n.01'}
>>> sense_key_map['election%1:04:01::'].synset().ili  # ILI
ILI('i36368')

Building this mapping may be a more manual process than it should be. I'm not sure Wn needs a custom function for it, but it could be useful to include it as a recipe in the documentation.
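Such a recipe might look roughly like the sketch below, which wraps the steps shown above into one function that maps sense keys straight to ILI identifiers. The function name `sense_key_to_ili` is hypothetical and not part of Wn. It relies only on the `metadata()`, `synset()` and `.ili` calls used in the session above, and it assumes the ILI is either a plain string or an object with an `id` attribute, since that may differ between Wn versions.

```python
from typing import Any, Dict, Iterable, Optional


def sense_key_to_ili(senses: Iterable[Any]) -> Dict[str, Optional[str]]:
    """Map sense keys (stored under metadata 'identifier') to ILI IDs."""
    mapping: Dict[str, Optional[str]] = {}
    for sense in senses:
        key = sense.metadata().get("identifier")
        if key is None:
            continue  # this sense has no sense key
        ili = sense.synset().ili
        # Accept either an ILI object with an .id attribute or a plain string.
        mapping[key] = None if ili is None else str(getattr(ili, "id", ili))
    return mapping


# Intended usage (requires Wn and the omw-en lexicon, so left as a comment):
#   import wn
#   table = sense_key_to_ili(wn.Wordnet('omw-en').senses())
#   table['election%1:04:01::']
```

Synsets without an ILI are kept with a value of `None` so that missing links stay visible instead of being silently dropped.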

Hope this helps! I'm also interested to hear @fcbond's opinion.

@chiarcos
Author

chiarcos commented May 4, 2022

Sure, decide as you see fit. I'm also not sure whether CILI is the best place to provide that information, but the sad truth is that a declarative mapping for older WordNets in a conventional format seems to be completely missing. (The information is included in the data.* or *.DAT files, respectively, but extracting it takes quite some processing.) If the pull request is accepted, I will delete my fork; otherwise I will keep it alive and rename it to make clear that it includes additional, non-CILI information.

Also, the problem is not so much PWN 3.0 or newer resources, as these are easily accessible. For the data at hand (SemCor), I need the mapping for PWN 1.6, which is why I created these mappings. (Or have sense IDs been more stable than synset IDs, so that I can just use [P]WN 3.0 sense IDs with PWN 1.6?)

PS: One part that I couldn't figure out was how to create correct %5-type sense IDs. These contain a lexical complement for which I didn't immediately find where to retrieve it from, so I produced only the left substring prior to that complement.
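For reference, the overall shape of a sense key, including the complement of %5 keys, can be sketched as below. As far as I understand the WordNet sense key documentation, the complement consists of a `head_word` and a two-digit `head_id` taken from the head synset of the satellite's adjective cluster; the sketch only shows how the parts are assembled and does not show how to look them up in the data files. `build_sense_key` and the placeholder values are hypothetical.

```python
from typing import Optional


def build_sense_key(lemma: str, ss_type: int, lex_filenum: int, lex_id: int,
                    head_word: str = "", head_id: Optional[int] = None) -> str:
    """Assemble lemma%ss_type:lex_filenum:lex_id:head_word:head_id."""
    if ss_type == 5:
        if not head_word or head_id is None:
            raise ValueError("satellite keys need head_word and head_id")
        head = f"{head_word}:{head_id:02d}"
    else:
        head = ":"  # both head fields stay empty for non-satellite senses
    return f"{lemma}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head}"


print(build_sense_key("election", 1, 4, 1))
# Placeholder values only, to show where the complement goes:
print(build_sense_key("lemma", 5, 0, 0, head_word="headword", head_id=0))
```

A key truncated before the complement therefore lacks exactly the two trailing fields that this function fills in for satellite senses.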

@jmccrae
Member

jmccrae commented May 5, 2022

I could see the value of having mappings between sense keys and ILI IDs included in this repository, as many resources use these instead of the offset identifier.

@chiarcos yes, the %5 identifiers are very tricky to calculate :)
