Streaming, transforming CSV to RDF converter
Reads CSV/TSV data as generic CSV/RDF, transforms each row using SPARQL CONSTRUCT
or DESCRIBE
, and streams the output triples.
The generic CSV/RDF format is based on the minimal mode of Generating RDF from Tabular Data on the Web.
Such transformation-based approach enables:
- building resource URIs on the fly
- fixing/remapping datatypes
- mapping different groups of values to different RDF structures
CSV2RDF differs from tarql in the way how mapping queries use graph patterns in the WHERE
clause. tarql queries operate on a table of bindings
(provided as an implicit VALUES
block) in which CSV column names become variable names. CSV2RDF generates an intermediary RDF graph for each CSV row (using column names as relative-URI properties)
that the WHERE
patterns explicitly match against.
mvn clean install
That should produce an executable JAR file target/csv2rdf-2.0.0-jar-with-dependencies.jar
in which dependency libraries will be included.
The CSV data is read from stdin
, the resulting RDF data is written to stdout
.
CSV2RDF is available as a .jar
as well as a Docker image atomgraph/csv2rdf (recommended).
Parameters:
query-file
- a text file with SPARQL 1.1CONSTRUCT
query stringbase
- the base URI for the data (also becomes theBASE
URI of the SPARQL query). Property namespace is constructed by adding#
to the base URI.
Options:
-d
,--delimiter
- value delimiter character, by default,
.--max-chars-per-column
- max characters per column value, by default 4096--input-charset
- CSV input encoding, by default UTF-8--output-charset
- RDF output encoding, by default UTF-8
Note that delimiters might have a special meaning in shell. Therefore, always enclose them in single quotes, e.g. ';'
when executing CSV2RDF from shell.
If you want to retrieve the raw CSV/RDF output, use the identity transform query CONSTRUCT WHERE { ?s ?p ?o }
.
CSV data in parking-facilities.csv
:
postDistrict,roadCode,houseNumber,name,FID,long,lat,address,postcode,parkingSpace,owner,parkingType,information
1304 København K,24,5,Adelgade 5 p_hus.0,p_hus.0,12.58228733,55.68268042,Adelgade 5,1304,92,Privat,P-Kælder,"Adelgade 5-7, Q-park."
CONSTRUCT
query in parking-facilities.rq
:
PREFIX schema: <https://schema.org/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT
{
?parking a schema:ParkingFacility ;
geo:lat ?lat ;
geo:long ?long ;
schema:name ?name ;
schema:streetAddress ?address ;
schema:postalCode ?postcode ;
schema:maximumAttendeeCapacity ?spaces ;
schema:additionalProperty ?parkingType ;
schema:comment ?information ;
schema:identifier ?id .
}
WHERE
{
?parkingRow <#FID> ?id ;
<#name> ?name ;
<#address> ?address ;
<#lat> ?lat_string ;
<#postcode> ?postcode ;
<#parkingSpace> ?spaces_string ;
<#parkingType> ?parkingType ;
<#information> ?information ;
<#long> ?long_string .
BIND(URI(CONCAT(STR(<>), ?id)) AS ?parking) # building URI from base URI and ID
BIND(xsd:integer(?spaces_string) AS ?spaces)
BIND(xsd:float(?lat_string) AS ?lat)
BIND(xsd:float(?long_string) AS ?long)
}
Java execution from shell:
cat parking-facilities.csv | java -jar csv2rdf-2.0.0-jar-with-dependencies.jar parking-facilities.rq https://localhost/ > parking-facilities.ttl
Alternatively, Docker execution from shell:
cat parking-facilities.csv | docker run --rm -i -a stdin -a stdout -a stderr -v "$(pwd)/parking-facilities.rq":/tmp/parking-facilities.rq atomgraph/csv2rdf /tmp/parking-facilities.rq https://localhost/ > parking-facilities.ttl
Note that using Docker you need to:
- bind
stdin
/stdout
/stderr
streams - mount the query file to the container, and use the filepath from within the container as
query-file
Output in parking-facilities.ttl
:
<https://localhost/p_hus.0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://schema.org/ParkingFacility> .
<https://localhost/p_hus.0> <http://www.w3.org/2003/01/geo/wgs84_pos#long> "12.58228733"^^<http://www.w3.org/2001/XMLSchema#float> .
<https://localhost/p_hus.0> <https://schema.org/identifier> "p_hus.0" .
<https://localhost/p_hus.0> <https://schema.org/additionalProperty> "P-Kælder" .
<https://localhost/p_hus.0> <https://schema.org/comment> "Adelgade 5-7, Q-park." .
<https://localhost/p_hus.0> <https://schema.org/postalCode> "1304" .
<https://localhost/p_hus.0> <http://www.w3.org/2003/01/geo/wgs84_pos#lat> "55.68268042"^^<http://www.w3.org/2001/XMLSchema#float> .
<https://localhost/p_hus.0> <https://schema.org/streetAddress> "Adelgade 5" .
<https://localhost/p_hus.0> <https://schema.org/name> "Adelgade 5 p_hus.0" .
<https://localhost/p_hus.0> <https://schema.org/maximumAttendeeCapacity> "92"^^<http://www.w3.org/2001/XMLSchema#integer> .
More mapping query examples can be found under LinkedDataHub's northwind-traders
demo app.
Largest dataset tested so far: 2.8 GB / 3709725 rows of CSV to 21.7 GB / 151348939 triples in under 27 minutes. Hardware: x64 Windows 10 PC with Intel Core i5-7200U 2.5 GHz CPU and 16 GB RAM.