The Linked Dataset Profiling tool is an implementation of the approach proposed in [1]. Its main purpose is to generate structured profiles of Linked Datasets. A profile in this case is a graph consisting of linked datasets, resource instances and topics. The topics are DBpedia categories, which are extracted through a Named Entity Disambiguation (NED) process by analysing textual literals from resources. The main steps executed by the profiling tool are:
- Dataset metadata extraction from DataHub
- Resource instance extraction
- Entity and topic extraction through NED from extracted resources using NED tools like Spotlight or TagMe!
- Profile graph construction and topic ranking
- Export of profiles in JSON format.
The individual steps are explained in detail in [1]; here we provide a brief overview of the output of each step.

In step (1) the tool requires as input a dataset id from DataHub, e.g. lak-dataset (http://datahub.io/dataset/lak-dataset), or a group id of datasets, e.g. lodcloud (http://datahub.io/organization/lodcloud). As output, the LDP tool extracts the metadata of the datasets, such as the SPARQL endpoint, name, maintainer etc., and stores it in a user-given directory.

In step (2) LDP extracts resource instances from the datasets obtained in (1). It has the option to sample the extracted resources, based on three sampling strategies: random, weighted and centrality (see [1]). Furthermore, the user can define what percentage of resources to extract, i.e. 5, 10, ..., 95% of resources.

In step (3) the tool performs the NED process on the extracted resources by analysing their textual literals. The datatype properties of interest for the NED process can be defined and fed into the tool. In this step, LDP extracts DBpedia entities, and derives the topics of the extracted entities through the property dcterms:subject.

In step (4), from the extracted datasets, their sampled resources, and the entities and topics extracted in step (3), the tool builds the dataset-topic graph that constitutes the profile. The topics are ranked for their relevance to the respective datasets by different graph-based ranking models, which the user can choose via the values prank, kstep and hits, for PageRank with Priors, K-Step Markov and HITS, respectively.

Finally, in step (5), the LDP tool exports the ranked profiles into JSON format, such that they can be further analysed or converted into RDF or other formats. For RDF we provide a tool which exposes the profiles as RDF using the VoID and VoL schemas.
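As an illustration of the topic extraction in step (3), the sketch below retrieves the DBpedia categories of an already disambiguated entity through dcterms:subject. This is a minimal sketch, assuming Apache Jena (3.x) as the SPARQL client and a hypothetical entity URI; it is not the tool's internal code.

    import org.apache.jena.query.*;

    public class TopicLookup {
        public static void main(String[] args) {
            // Hypothetical entity as returned by the NED step (Spotlight or TagMe!).
            String entity = "http://dbpedia.org/resource/Learning_analytics";
            String query =
                "PREFIX dcterms: <http://purl.org/dc/terms/> " +
                "SELECT ?topic WHERE { <" + entity + "> dcterms:subject ?topic }";
            // Query the public DBpedia endpoint (cf. dbpedia_url in the config below).
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                    "http://dbpedia.org/sparql", query)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    // Each binding is a DBpedia category, used as a profile topic.
                    System.out.println(results.next().getResource("topic"));
                }
            }
        }
    }

In the tool, the endpoint for this lookup is controlled by the dbpedia_url variable described below.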
In order to run the LDP tool, a few variables need to be set in its config file. We show here the possible input values for the different variables (where "|" separates the values accepted by the tool), whereas for others we provide a short textual description. See below for the sample config file. The defined variables and values should be stored in a separate file, which is given as a command-line argument to the LDP tool, e.g.
java -jar ldp.jar config.ini
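Step (1), i.e. loadcase=0, resolves the given dataset or group id against DataHub. For orientation, the sketch below fetches the metadata of the lak-dataset example via DataHub's CKAN API; the endpoint path and the use of java.net.http (Java 11+) are illustrative assumptions, not necessarily the calls the tool itself makes.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DataHubMetadata {
        public static void main(String[] args) throws Exception {
            // CKAN "package_show" call for the lak-dataset example from above.
            String url = "http://datahub.io/api/3/action/package_show?id=lak-dataset";
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON answer carries the name, maintainer, SPARQL endpoint etc.,
            // i.e. the metadata stored by the tool in step (1).
            System.out.println(response.body());
        }
    }

The sample config file with all recognised variables follows.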
Only one value at a time: 0 is for step (1), 1 for step (2), and so on
loadcase=0|1|2|3|4
An existing directory where the extracted datasets and resources will be stored
datasetpath=directory location
Path and name of the file which will hold the computed values for the normalised topic relevance score, computed as in [1]
normalised_topic_score=file location
Path and name of the file which will hold the entities and topics extracted from DBpedia
annotationindex=file location
Sample size, which defines the percentage of resources extracted for a dataset. Note that step (3) takes a long time for large sample sizes; as shown in [1], a sample size of 10% is representative
sample_size=1|2|...|95
Sampling strategy used to extract the resources; 'centrality' performs best in terms of profiling accuracy
sampling_type=random|weighted|centrality
Path to an existing directory used as the output location
outdir=directory location
Path to an existing directory for the output generated by the different topic ranking approaches
topic_ranking_objects=directory location
Dataset id or group id from DataHub for which you want to perform the profiling
query_str=datahub_dataset_id|datahub_group_id
If query_str holds a dataset id, the value here should be false; if it holds a group id, it should be true
is_dataset_group_search=true|false
Topic ranking approaches in step (4), which determine the relevance of topics for a dataset
topic_ranking_strategies=prank|kstep|hits
Path to a file containing datatype properties of interest for the NED process. Here the datatype properties should be one per line and their object values should be textual literals
property_lookup=file location
Location where to store the generated dataset topic graphs
raw_graph_dir=directory location
Define which NED process to use. Spotlight doesn't require any changes, while for TagMe! one needs to obtain API credentials (contact at http://tagme.di.unipi.it/) and provide them under tagme_api_key
ned_operation=tagme|spotlight
In case the NED process is carried out by the TagMe! tool, you have to request an API key at http://tagme.di.unipi.it/ and provide it as the value of this variable.
tagme_api_key=TagMe! API KEY
DBpedia SPARQL endpoints in different languages, given as comma-separated "language endpoint" pairs
dbpedia_endpoint=en http://dbpedia.org/sparql,de http://de.dbpedia.org/live/sparql
This has to be set to true, as it checks whether the corresponding topics (categories) of the extracted entities have been extracted
load_entity_categories=true
URL of the English DBpedia endpoint used for the extraction of entity categories
dbpedia_url=http://dbpedia.org/sparql
Timeout (in seconds) used when extracting resources from the datasets
timeout=10000
Define whether the entities should be included in the profiles or left out of the ranking process
includeEntities=false
File location where the dataset_topic_graph is stored
dataset_topic_graph=raw_graph/dataset_topic_graph.obj
Value used to initialise the K-Step Markov and PageRank with Priors models
alpha=0.1
K value for K-Step Markov
k_steps=3
Number of iterations used for the ranking of topics with K-Step Markov and PageRank with Priors
ranking_iterations=10
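To make the role of alpha and ranking_iterations concrete, below is a minimal sketch of a PageRank with Priors run on a toy dataset-topic graph, where (following the standard formulation, an assumption on our side) alpha acts as the back-probability of jumping to the prior distribution and ranking_iterations bounds the number of update rounds. The graph representation and the choice of priors are illustrative; the tool's own implementation may differ.

    import java.util.*;

    public class PageRankWithPriors {
        // alpha: back-probability to the priors; iterations: cf. ranking_iterations.
        static double[] rank(List<int[]> edges, int n, double[] prior,
                             double alpha, int iterations) {
            // Build adjacency lists from the edge list.
            List<List<Integer>> out = new ArrayList<>();
            for (int i = 0; i < n; i++) out.add(new ArrayList<>());
            for (int[] e : edges) out.get(e[0]).add(e[1]);

            double[] score = prior.clone();
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                for (int u = 0; u < n; u++) {
                    int deg = out.get(u).size();
                    if (deg == 0) continue; // dangling node: only the prior jump applies
                    for (int v : out.get(u)) {
                        // Propagate (1 - alpha) of u's score uniformly along its edges.
                        next[v] += (1 - alpha) * score[u] / deg;
                    }
                }
                // With probability alpha, jump back to the prior distribution.
                for (int v = 0; v < n; v++) next[v] += alpha * prior[v];
                score = next;
            }
            return score;
        }

        public static void main(String[] args) {
            // Toy graph: node 0 is a dataset, nodes 1 and 2 are topics.
            List<int[]> edges = List.of(new int[]{0, 1}, new int[]{0, 2}, new int[]{2, 1});
            double[] prior = {1.0, 0.0, 0.0}; // restart at the dataset node
            System.out.println(Arrays.toString(rank(edges, 3, prior, 0.1, 10)));
        }
    }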
The code and the tool are provided under the Creative Commons (CC) licence. When using the LDP tool, please cite the paper in [1]. For additional information, refer to the website: http://data-observatory.org/lod-profiles/about.html.
[1] Besnik Fetahu, Stefan Dietze, Bernardo Pereira Nunes, Marco Antonio Casanova, Davide Taibi, Wolfgang Nejdl: A Scalable Approach for Efficiently Generating Structured Dataset Topic Profiles. ESWC 2014: 519-534