This repository has been archived by the owner on Apr 12, 2019. It is now read-only.

Alias merging #82

Open
pfent opened this issue Apr 17, 2017 · 8 comments

Comments

@pfent

pfent commented Apr 17, 2017

What's the status on merging different aliases for entities? (e.g. "Mozart", "Wolfgang Amadeus Mozart", "Joannes Chrysostomus Wolfgangus Theophilus Mozart")

Do we have any logic in place for this? What's the strategy here?

kordianbruck added this to the 20.04 - Pre Hackathon milestone Apr 17, 2017
@sacdallago
Member

sacdallago commented Apr 18, 2017

What about reading in a dictionary of synonyms? This should be pretty straightforward and we can curate the list later?

So, maybe: accept an optional -s option giving the location of a CSV/TSV file of synonyms, one entity per line, something like:

consensus_name, synonym1, synonym2, synonym3,...

With real data:

Wolfgang Amadeus Mozart, Mozart, Joannes Chrysostomus Wolfgangus Theophilus Mozart
Johannes Bach, Johann Bach

It technically counts as a feature, but it's a nice improvement for the data, so I would rebrand it as a bug fix? 💃
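A minimal sketch of reading such a file, assuming the comma-separated layout shown above (consensus name first, synonyms after). The function names `load_synonyms` and `canonical_name` are hypothetical, and the case-insensitive lookup is an assumption rather than something agreed on in this thread:

```python
import csv
import io


def load_synonyms(lines):
    """Map every alias (and the consensus name itself) to its consensus name."""
    alias_to_name = {}
    # skipinitialspace drops the blank that follows each comma in the file.
    for row in csv.reader(lines, skipinitialspace=True):
        names = [cell.strip() for cell in row if cell.strip()]
        if not names:
            continue  # skip blank lines
        consensus = names[0]
        for name in names:
            # Assumption: aliases are matched case-insensitively.
            alias_to_name[name.casefold()] = consensus
    return alias_to_name


def canonical_name(alias_to_name, mention):
    """Return the consensus name for a mention, or the mention itself if unknown."""
    return alias_to_name.get(mention.strip().casefold(), mention)


# In practice the lines would come from the file passed via -s;
# here the two example rows from above stand in for it.
sample = io.StringIO(
    "Wolfgang Amadeus Mozart, Mozart, Joannes Chrysostomus Wolfgangus Theophilus Mozart\n"
    "Johannes Bach, Johann Bach\n"
)
synonyms = load_synonyms(sample)
print(canonical_name(synonyms, "Mozart"))
```

Unknown mentions fall through unchanged, so an incomplete dictionary would not break the pipeline.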

@pfent
Author

pfent commented Apr 18, 2017

We probably can't curate a list for all entities, but creating such a list manually once seems like an easy start

@sacdallago
Member

The idea would then be to have the file crowdsourced, ideally by opening a repo dedicated to it :D You know, leverage the power of GitHub :P :P

@goldbergtatyana

This is a great idea! We, the mentors, could definitely lend a hand with creating the list manually. When can we start?

@sacdallago
Member

@goldbergtatyana right now. https://github.com/MusicConnectionMachine/dictionaries/blob/master/artist_synonyms

you can edit directly in the browser, and then open a PR, which we can later merge :)

@gyachdav

gyachdav commented Apr 20, 2017

We will need a more authoritative source than a manually curated dictionary.

A list of pseudonyms, synonyms and canonical names is available from https://portal.dnb.de.

To obtain a list of synonyms for an artist we will need to:

  1. query the dnb.de portal with free text

  2. get the DNB ID for the artist

  3. use the ID to fetch the entry for the artist

  4. parse the RDF-formatted output file and extract the values for the following fields: gndo:forename, gndo:surname, list of gndo:variantNameForThePerson.

  5. NOTE: the entry's gndo:professionOrOccupation has to include at least http://d-nb.info/gnd/4032009-1

The tasks are:

  • research the workflow that starts at free text search and ends with getting an entity record in RDF format

  • implement the workflow

  • parse the RDF format

  • extract the values for gndo:forename, gndo:surname, list of gndo:variantNameForThePerson

  • populate the CSV or a db table that we can use later on as our artists name dictionary
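The parsing and extraction tasks could be sketched roughly as follows, using only the Python standard library. This assumes the record arrives as RDF/XML and that the `gndo:` prefix resolves to the namespace URI in the code; both should be checked against a real record. `SAMPLE_RDF` is a hand-written, hypothetical record for illustration, not actual DNB output, and the search and fetch steps are left out because the portal workflow still has to be researched:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against a real record.
GNDO = "http://d-nb.info/standards/elementset/gnd#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Occupation URI required by the NOTE in the workflow above.
REQUIRED_OCCUPATION = "http://d-nb.info/gnd/4032009-1"

# Hypothetical sample record, written by hand for this sketch.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:gndo="http://d-nb.info/standards/elementset/gnd#">
  <rdf:Description rdf:about="http://d-nb.info/gnd/EXAMPLE">
    <gndo:forename>Wolfgang Amadeus</gndo:forename>
    <gndo:surname>Mozart</gndo:surname>
    <gndo:variantNameForThePerson>Mozart, Joannes Chrysostomus Wolfgangus Theophilus</gndo:variantNameForThePerson>
    <gndo:variantNameForThePerson>Mozart</gndo:variantNameForThePerson>
    <gndo:professionOrOccupation rdf:resource="http://d-nb.info/gnd/4032009-1"/>
  </rdf:Description>
</rdf:RDF>
"""


def extract_names(rdf_xml):
    """Return [preferred name, variant1, ...], or None if the occupation does not match."""
    root = ET.fromstring(rdf_xml)
    occupations = {
        el.get(f"{{{RDF}}}resource")
        for el in root.iter(f"{{{GNDO}}}professionOrOccupation")
    }
    if REQUIRED_OCCUPATION not in occupations:
        return None
    forename = root.findtext(f".//{{{GNDO}}}forename", default="")
    surname = root.findtext(f".//{{{GNDO}}}surname", default="")
    variants = [
        el.text.strip()
        for el in root.iter(f"{{{GNDO}}}variantNameForThePerson")
        if el.text
    ]
    return [f"{forename} {surname}".strip()] + variants


def to_csv_line(names):
    """Serialize one dictionary row; csv quotes names that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(names)
    return buffer.getvalue().strip()


names = extract_names(SAMPLE_RDF)
if names is not None:
    print(to_csv_line(names))
```

Variant names that contain a comma get quoted by the `csv` writer, so a reader of the dictionary file would need to parse it with a CSV parser rather than a plain split on commas.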

@sacdallago
Member

For now, just implement a way to read in the dictionary as previously described. @gyachdav's suggested workflow will hopefully be implemented in some spare time at the hackathon; it will output a file that conforms to the dictionary file format (i.e. preferred name, synonym1, synonym2, ...).

@felixschorer

felixschorer commented Apr 20, 2017

Aliases don't have to be merged into a single DB entry for our use case (@MusicConnectionMachine/group-2). It would suffice if all aliases had the same entityId.
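One way to picture this, as a sketch only: each group of aliases gets one identifier, and every alias in the group points to it. The function name `assign_entity_ids` and the integer ID scheme are hypothetical; the actual database schema is not described in this thread:

```python
from itertools import count


def assign_entity_ids(synonym_groups):
    """Give every alias in a group the same entityId (hypothetical integer IDs)."""
    ids = count(1)
    alias_to_id = {}
    for group in synonym_groups:
        entity_id = next(ids)
        for alias in group:
            alias_to_id[alias] = entity_id
    return alias_to_id


groups = [
    ["Wolfgang Amadeus Mozart", "Mozart", "Joannes Chrysostomus Wolfgangus Theophilus Mozart"],
    ["Johannes Bach", "Johann Bach"],
]
entity_ids = assign_entity_ids(groups)
print(entity_ids["Mozart"] == entity_ids["Wolfgang Amadeus Mozart"])
```

The alias rows stay separate, and consumers group them by comparing identifiers.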
