[dataset] WSDM2012 wrapper #46

Closed
RicardoUsbeck opened this issue Nov 4, 2014 · 13 comments

Comments

@RicardoUsbeck
Collaborator

Write a wrapper for the WSDM2012 dataset.
Annotate the license, experiment type and language.
Give provenance.
Update https://github.com/AKSW/gerbil/wiki/Licences-for-datasets

@RicardoUsbeck RicardoUsbeck added this to the Version 2 - new core and better logging milestone Nov 4, 2014
@TortugaAttack
Contributor

The dataset is no longer available at the site:
http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/
Using archive.org (https://web.archive.org/web/20120331023708/http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/) I found out that the dataset used to be at http://ilps.science.uva.nl/sites/default/files/wsdm2012-adding-semantics-microblog-posts-annotations.zip
But the dataset itself is not available.
Does anyone have the dataset? And if so, and it is not publicly available, should we put it in?

@MichaelRoeder
Member

@TortugaAttack
Contributor

Yup, they refer to the link I stated above. :/

@MichaelRoeder
Member

@RicardoUsbeck in your role as project leader, you might want to write an email to http://edgar.meij.pro/contact/ asking for the dataset.

@RicardoUsbeck
Collaborator Author

@TortugaAttack
Contributor

This dataset is just a huuuuge pain in the a#!
Am I missing something, or do they link against some KB but provide neither URIs nor the markings in the tweet, only the already linked markings?

For example: "Arab countries" is annotated as "Arab World", etc.
This makes it very difficult to match where the annotation starts and where it ends. (This is an example which I can handle, but there are worse ones!)

Does anybody have an idea how to properly match the annotation with start and length in the actual tweet?
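
To make the problem concrete, here is a minimal sketch. It is not GERBIL code: the class name, the method and the tweet text are made up for illustration. A literal, case-insensitive search for the annotated label only yields a (start, length) pair when the tweet happens to contain that label verbatim.

```java
public class LabelSpanSketch {

    /**
     * Hypothetical helper: returns {start, length} of the first literal,
     * case-insensitive occurrence of the label in the text, or null if the
     * label does not occur verbatim.
     */
    public static int[] findSpan(String text, String label) {
        for (int i = 0; i + label.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, label, 0, label.length())) {
                return new int[] { i, label.length() };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up tweet text, not taken from the dataset.
        String tweet = "Protests spread across Arab countries";

        // The surface form in the tweet can be located directly.
        int[] span = findSpan(tweet, "Arab countries");
        System.out.println("surface form found: " + (span != null));

        // The linked concept label is worded differently, so no span is found.
        System.out.println("concept label found: " + (findSpan(tweet, "Arab World") != null));
    }
}
```

Since the dataset only gives the linked concept and not the surface form, a search like this has nothing reliable to look for.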

@RicardoUsbeck
Collaborator Author

@TortugaAttack
Contributor

Okay, so I can get the wiki URI. That's cool.
But the problem with the markings remains.
I cannot get "correct" markings (start, length) out of the tweets to create NamedEntities :/

@RicardoUsbeck
Collaborator Author

Then it is only suitable for the C2KB task, if it still exists, @MichaelRoeder?

@MichaelRoeder
Member

Yes, looks like C2KB to me.

@TortugaAttack
Contributor

I have never done C2KB only before. Do I use NamedEntity as well?
If so, the dataset will be finished in no time ;)

@MichaelRoeder
Member

No, please use org.aksw.gerbil.transfer.nif.data.Annotation objects for that. They don't have a position. A description of the markings that are available can be found in the wiki article "Document Markings in gerbil.nif.transfer".
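
The difference can be sketched roughly as follows. The classes below are simplified, hypothetical stand-ins written for this comment, not the actual GERBIL classes, and the URI is only an illustrative value: a position-less marking carries just the concept URI, while a positional one additionally needs a start and a length in the text.

```java
import java.util.ArrayList;
import java.util.List;

public class C2kbSketch {

    /** Hypothetical stand-in for a position-less marking: only the concept URI. */
    public static class UriAnnotation {
        public final String uri;

        public UriAnnotation(String uri) {
            this.uri = uri;
        }
    }

    /** Hypothetical stand-in for a positional marking: URI plus span in the text. */
    public static class PositionedEntity extends UriAnnotation {
        public final int start;
        public final int length;

        public PositionedEntity(int start, int length, String uri) {
            super(uri);
            this.start = start;
            this.length = length;
        }
    }

    /** Builds document-level markings from concept URIs alone; no span is required. */
    public static List<UriAnnotation> annotate(List<String> conceptUris) {
        List<UriAnnotation> markings = new ArrayList<>();
        for (String uri : conceptUris) {
            markings.add(new UriAnnotation(uri));
        }
        return markings;
    }

    public static void main(String[] args) {
        // Illustrative URI only; the real dataset values may look different.
        List<String> uris = new ArrayList<>();
        uris.add("http://en.wikipedia.org/wiki/Arab_world");
        for (UriAnnotation marking : annotate(uris)) {
            System.out.println(marking.uri);
        }
    }
}
```

For a dataset that only states which concepts a tweet is about, the position-less form is all that can be filled in honestly.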

@TortugaAttack
Contributor

done
