Reconciling with package.jsonld
#110
Hi! What we love about JSON-LD is that it can be seen as one serialization of RDF and can therefore be converted to RDFa and inserted directly into HTML documents. That opens up some cool possibilities: you are reading a New York Times article, for instance, and you can `ldpm install` it and start hacking on the data. Everything your data package manager needs to know is embedded directly in the HTML! I would encourage anyone interested in that to go read the JSON-LD spec and the RDFa Lite spec. Both are super well written; the RDFa Lite spec in particular is remarkably short. That being said, we are still experimenting a lot with this approach and 100% agree that soon enough we should work on merging all of that (and happy to contribute to the work)... Another thing to follow closely: CSV-LD.
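As a sketch of what that embedding could carry (the dataset, names, and URLs here are invented for illustration; schema.org's `Dataset` vocabulary is real), the JSON-LD might look like:

```json
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "name": "nyt-article-data",
  "description": "The data behind the article, ready for a package manager to install.",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "http://example.com/article-data.csv",
    "encodingFormat": "text/csv"
  }
}
```

The same triples can be expressed as RDFa Lite attributes on the article's HTML, which is what would let a tool like `ldpm` discover them in place.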
Forgot to mention: for datatypes and the like, http://www.w3.org/TR/xmlschema-2/#built-in-datatypes is here to help (and can prevent reinventing a spec for datatypes).
@jbenet great to hear from you and good questions. Obviously my recommendation here would be that we converge on datapackage.json - I should also say that @sballesteros has been a major contributor to the datapackage.json spec as it stands :-) I note there are plans to introduce a few json-ld-isms (see #89) into datapackage.json, but the basic aim is to keep this as simple as possible and pretty close to the CommonJS package spec. Whilst I appreciate RDF's benefits (I've been a heavy RDF user in times past), I think we need to keep things super-simple if we are going to generate adoption - most data producers and users are closer to the Excel end of the spectrum than the RDF end. (I note one can always choose to enhance a given datapackage.json to be json-ld-like - I just think we don't want to require that.) That said, the differences seem pretty minor in the basic outline, so with a small bit of tweaking we could have compatibility. @sballesteros I note the main differences seem, at the moment, to be few and small.
If we could resolve these, and perhaps define a natural enhancement path from a datapackage.json to becoming "json-ld" compliant, we could have a common base - those who wanted full json-ld could 'enhance' the base datapackage.json in their desired ways, but we'd keep the simplicity (and CommonJS compatibility) for non-RDF folks. wdyt?
@jbenet more broadly - great to see what you are up to. Have you seen https://github.com/okfn/dpm - the data package manager? That seems to have quite a bit in common with what you're building. There's also a specific issue for the registry at frictionlessdata/dpm-js#5 - the current suggestion had been piggy-backing on GitHub, but I know we also have options in terms of CKAN, and @sballesteros has worked on a CouchDB-based registry.
I would say that, given that using the npm registry is no longer really an option, alignment with schema.org is more interesting than CommonJS compatibility, but I am obviously biased ;) A counter argument to that would be the existing CommonJS tooling, I suppose. To me, alignment with schema.org means we can generate a package.jsonld from any webpage with RDFa markup (or microdata). And you can treat JSON-LD as almost plain JSON (just an extra `@context` property).
Hey all, another argument in favour of a spec supporting JSON-LD and aligned with schema.org is explorability. Being able to communicate unambiguously that a given dataset/resource deals with http://en.wikipedia.org/wiki/Crime and http://en.wikipedia.org/wiki/Sentence_(law), for example, goes a lot further than keywords and a description. It makes the data query-ready.
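To illustrate (a hypothetical dataset; `about` is a real schema.org property, and the node references point at the concepts themselves rather than bare strings):

```json
{
  "@context": "http://schema.org",
  "@type": "Dataset",
  "name": "uk-sentencing-statistics",
  "keywords": ["crime", "sentencing"],
  "about": [
    { "@id": "http://en.wikipedia.org/wiki/Crime" },
    { "@id": "http://en.wikipedia.org/wiki/Sentence_(law)" }
  ]
}
```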
@rgrp thanks for checking this out! On your point about keeping things super-simple:
Strong +1 for simplicity and ease of use for end users. My target user is the average scientist. Friction (in terms of having to learn how to use tools, or ambiguity in the process) is deadly. I don't think that making the format JSON-LD compliant will hurt that ease of use: JSON-LD was designed specifically for the smallest overhead that still provides data-linking power. I found the blog posts from Manu (the primary creator) quite informative.
If I were building any package manager today, I would aim for JSON-LD as the format, seeking the easiest (and readable-ish) integration with other tools. I think JSON is already "difficult to read" for non-developers (hence people's use of YAML and TOML, which are inadequate for other reasons), and the JSON-LD `@context` additions don't seem to make matters significantly worse; even the super simple npm package.json takes some learning.

@rgrp I found your dpm after I had already built mine. We should definitely discuss converging there. I'm working with @maxogden and we're building dat + datadex to be interoperable. Also, one of the use cases I care a lot about is large datasets (100GB+) in Machine Learning and Bioinformatics; I'm not sure how much you've dug into handling those.

@sballesteros what do you think of the differences @rgrp pointed out? And do you see other differences? What else do you remember being explicitly different? Let's try to get convergence on these :)
@jbenet before diving into the small differences and trying to converge somewhere, I think we should really ask why we should move away from vocabularies promoted by the W3C (like DCAT). To me, schema.org has already done a huge amount of work to bring as much pragmatism as possible into that space; see http://www.w3.org/wiki/WebSchemas/Datasets for instance. Why don't we join the W3C mailing lists and take action there so that new properties are added if we need them for our different data package managers? The way I see it, unlike npm and software package managers, for open data one of the key challenges is to make data more accessible to search engines (there are so many decentralized data publishers out there...). Schema.org is a great step in that direction, so in my opinion it is worth the small amount of clunkiness in the property names that it imposes. Just wanted to make that clear, but all that being said, super happy to go into convergence mode.
Let's separate out the several concerns here.
@jbenet the current spec allows you to store data anywhere; a tradition from CKAN, continued in datapackage.json, is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in S3, a specific DB, a dat instance, etc.).

@sballesteros (aside) I'm not sure accessibility to search engines is the major concern here - the concern is integration with tooling. We already get reasonable discovery from search engines (sure, it's far from perfect, but it's no worse than for code). Key for me is that data is more like code than it is like "content". As such, what we most want is better toolchains and processing pipelines for data, and the test of our spec is not how it integrates with HTML page markup but how it supports use in data toolchains. As a basic test: can we do dependencies and automated installation (into a DB!)?
I think yes. I understand the implications of MUST semantics and the unfortunate upgrade overhead it imposes. But without requiring this, applications cannot rely on a package definition being proper linked data; they would need out-of-band knowledge of the format instead. To better understand the costs of converting existing things, it would be useful to get a clear picture of the current usage of datapackage.json.
I believe there are relevant mappings between DCAT and Schema.org. I'm new to DCAT, so I can't comment on its vocabulary beyond echoing "let's try not to break compatibility unless we must." @sballesteros?
Sounds great! I care strongly about backing up everything, in case individuals stop maintaining what they published. IMO, what npm does is exactly right: back up published versions, AND link to the github repo. Data is obviously much more complicated, given licensing, storage, and bandwidth concerns. I came up with a solution-- more on this later :).
I don't particularly care much about this either. Search engines already do really well (and links tend to be the problem, not the format). IMO a JSON-LD format that uses either an existing vocabulary or one with good mappings will work well. @sballesteros what are your concerns here?
@jbenet thanks for the responses, which are very useful. On point A - what MUST be done to support JSON-LD - the fact that json-ld support would mean a MUST for @id and @context is a concern IMO. This is significant conceptual complexity for most people (e.g. doesn't the id need to be a valid RDF class?). This is where I'd like to allow, but not require, that additional complexity.
I think these could be filled in automatically by tooling. For instance, say I have a dataset with a plain datapackage.json; the publish step could add `@context` and `@id` for me, as in the sketch below.
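A minimal sketch of that flow, assuming a hypothetical publish step (the registry URL is invented; the context URL is the one discussed in this thread). Starting from a plain file:

```json
{
  "name": "gdp",
  "title": "Country GDP figures",
  "version": "1.0.0"
}
```

the tool could emit:

```json
{
  "@context": "http://okfn.org/datapackage-context.jsonld",
  "@id": "http://datadex.io/jbenet/gdp",
  "name": "gdp",
  "title": "Country GDP figures",
  "version": "1.0.0"
}
```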
@jbenet I must confess I still think this is an unnecessarily burdensome addition as a requirement for all users. As I said, there's no reason users or even a given group cannot add these to their datapackage.json, but this adds quite a bit of "cognitive complexity" for those users who are unfamiliar with RDF and linked data. There are very few required fields at the moment in datapackage.json, and anything that goes in has to show a very strong benefit over cost (remember, each time we add stuff we make it more likely people either won't use it or won't actually produce valid datapackage.json). Whilst I acknowledge that quite a lot (perhaps most) datapackage.json files will be created by tools, I think some people will want to edit by hand (and want to understand the files they look at). (I'm an example of a by-hand editor ;-) ...)
Entirely agreed. Perhaps the benefits of ensuring every package is JSON-LD compliant aren't clear: any program that understands JSON-LD would then be able to understand datapackage.json files. This video is aimed at a very general audience, but still highlights the core principles: https://www.youtube.com/watch?v=vioCbTo3C-4. Many people have been harping on the benefits of linking data for over a decade, so I won't repeat all that here. The JSON-LD website and posts by @msporny highlight some of the more pragmatic (yay!) reasoning. I will note that it only works for the entire data web if the context is there (as the video explains). That's what enables programs that know nothing at all about this particular format to completely understand and process the file. Think of it as a link to a machine-understandable RFC spec that teaches the program how to read the rest of the data (without humans having to program that knowledge in manually).
Absolutely, me too. But imagine it's your first time looking at a datapackage.json. Does this:

```json
{
"name": "a-unique-human-readable-and-url-usable-identifier",
"datapackage_version": "1.0-beta",
"title": "A nice title",
"description": "...",
"version": "2.0",
"keywords": ["name", "My new keyword"],
"licenses": [{
"url": "http://opendatacommons.org/licenses/pddl/",
"name": "Open Data Commons Public Domain",
"version": "1.0",
"id": "odc-pddl"
  }],
"sources": [{
"name": "World Bank and OECD",
"web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}],
"contributors":[ {
"name": "Joe Bloggs",
"email": "joe@bloggs.com",
"web": "http://www.bloggs.com"
}],
"maintainers": [{
# like contributors
}],
"publishers": [{
# like contributors
}],
"dependencies": {
"data-package-name": ">=1.0"
  },
"resources": [
{
}
]
}
```

look much better than this?

```json
{
"@context": "http://okfn.org/datapackage-context.jsonld",
"@id": "a-unique-human-readable-and-url-usable-identifier",
"datapackage_version": "1.0-beta",
"title": "A nice title",
"description": "...",
"version": "2.0",
"keywords": ["name", "My new keyword"],
"licenses": [{
"url": "http://opendatacommons.org/licenses/pddl/",
"name": "Open Data Commons Public Domain",
"version": "1.0",
"id": "odc-pddl"
  }],
"sources": [{
"name": "World Bank and OECD",
"web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
}],
"contributors":[ {
"name": "Joe Bloggs",
"email": "joe@bloggs.com",
"web": "http://www.bloggs.com"
}],
"maintainers": [{
# like contributors
}],
"publishers": [{
# like contributors
}],
"dependencies": {
"data-package-name": ">=1.0"
  },
"resources": [
{
}
]
}
```

I would imagine new users thinking things like: "what is `@context`?" and "why `@id` instead of a plain `name`?"
The latter adds `@context` and swaps `name` for `@id`.
IMO, answering these would involve looking up the spec and understanding how the format works either way. I care a lot about readability (I originally picked YAML for datadex), but I claim readability for new users is not significantly affected here. :)
@rgrp wrote:

> On point A - what MUST be done to support JSON-LD - the fact that json-ld support would mean a MUST for @id and @context is a concern IMO. This is significant conceptual complexity for most people (e.g. doesn't the id need to be a valid RDF class?). This is where I'd like to allow but not require that additional complexity.

`@id` is not required for a valid JSON-LD document. Also note that you can alias `@id` to something less strange looking, like `id` or `url`, for instance. The ID doesn't need to be a valid RDF class. The only thing that's truly required to transform a JSON document into a JSON-LD document is one line: `@context`. None of your users need to be burdened w/ RDF or Linked Data concepts unless they want to be. Just my $0.02. :)
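Concretely, aliasing is just a context entry mapping a friendly name onto the keyword. A small sketch (the Dublin Core term is real; the dataset URL is illustrative):

```json
{
  "@context": {
    "id": "@id",
    "title": "http://purl.org/dc/terms/title"
  },
  "id": "http://example.com/my-dataset",
  "title": "A nice title"
}
```

Here `id` behaves exactly like `@id`, so users never see the `@` syntax.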
Weighing in briefly after being directed to this thread by @maxogden. I am not currently developing any tools, but rather looking for forward-thinking best practices around metadata for datasets I'm helping a city publish, so in that sense I am your end user. From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data, often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward. Pertinent to this thread, given what @msporny said about being able to alias `@id`:

```json
{
"@context": "http://okfn.org/datapackage-context.jsonld#0.1.1",
"id": "http://dathub.org/my-dataset",
"title": "my dataset",
"version": "1b76fa0893628af6c72d7fa7a6c10f8e7101c31c"
}
```

In my example, I'm also using a hash as the datapackage version rather than a semver, since for frequently changing, properly versioned data it is unfeasible to maintain a manually incremented version number.
👍 Thank you. I will quote this in the future. :)
Yeah, absolutely. It's any URL, so you can embed a version number in the URL and thus identify a different `@context` version.
I believe JSON-LD can extract the right `@context` from a `#` fragment, though not 100% sure. @msporny will know better. If not, embed it in the path.
👍 hash versions ftw. What are you building? And I encourage you to allow tagging of versions. The rightest thing I've seen is to have hashes (content-addressing) identify versions, and to allow human-readable tags/symlinks (yay git).
@rgrp thoughts on all this? Can we move fwd with `@context`? @sballesteros, are you good with this if we have convergence?
@jden great input. @jden @jbenet re datapackage_version: I actually think we should deprecate this further (it's not strictly required, but frankly I think we should remove it completely). People rarely add it, and I'm doubtful it would be reliably maintained, in which case its value to consumers rapidly falls towards zero. (I was somewhat doubtful when it was first added, but there were strong arguments in favour by others at the time.) Re the general version field: I note that semver allows using hashes as a sort of extra, e.g. 1.0.0-beta+5114f85. However, I do wonder about using a version field at all if you are using full version control for the data. I imagined the version field being more like the version field for software packages, where its increment means something substantial (but where you can get individual revisions from the version control system if you want - cf. the node.js package.json, where dependencies can refer either to package versions or to specific revisions of git repos).
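To make that concrete, here is a sketch of how semver build metadata and VCS references might look in a datapackage.json (the package names are invented, and git URLs in `dependencies` are an assumption borrowed from node's package.json, not something the current spec defines):

```json
{
  "version": "1.0.0-beta+5114f85",
  "dependencies": {
    "country-codes": ">=1.0",
    "gdp-raw": "git://github.com/example/gdp-raw.git#5114f85"
  }
}
```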
I agree, let's remove `datapackage_version`.
Having versions in package-managers/registries is really useful, though. Let's not remove `version` itself. Package manager websites want to show meaningful descriptions of important version changes (semver), and users can understand the difference between two semantic versions at a glance.
By the way, I'm not sure if you came up with something similar, but I put a tiny bit of thought into making a data-semver (jbenet/data#13) which might be useful. Clearly expressing what constitutes a MAJOR.MINOR.PATCH version change in a data context will help avoid confusion for people working with data who don't understand the subtleties of code semver. @rgrp can we go fwd with `@context`?
On the @context question: let me reiterate that I'm a strong +1 on allowing this in (even encouraging it), but I am still concerned about making it a MUST. Strictly, we only have one required field at the moment (`name`). Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF, you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources). Again, I think this is great if you can do it, since you get much richer info - and Data Package has been designed so you can do this progressive enhancement really easily (just add the @type to your resource schema) - but I don't think it should be required for everyone.
@jbenet to be clear, I wasn't suggesting removing `version`, only deprecating `datapackage_version`. Also note I wrote my previous comment before I'd read your response. My suggested approach at present is that we add `@context` as an optional (MAY) field.
> I believe JSON-LD can extract the right @context from a #, though not 100% sure. @msporny will know better. If not, embed it in the path.

No, JSON-LD will not extract the "right" context from a #fragment :). We considered that option and felt that it adds unnecessary complexity (when a simpler alternative solves the same problem). Just do this if you want to version the context: `"@context": "http://okfn.org/datapackage-context/v1.jsonld"`. You are probably going to want to use a URL redirecting service so that your developers don't see breaking changes if okfn.org ever goes away. For example, use https://w3id.org/ and make this your context URL: https://w3id.org/datapackage/v1. This decouples the context from any single host, lets the redirect be updated if the file moves, and keeps the URL your developers see stable.
I can add it to the site in less than a minute if you want (or you can submit a pull request). w3id.org is backed by multiple companies and is designed to be around for 50+ years. You can learn more about it by going here: https://w3id.org/ (edit: fixed UTF-8 BOM - no idea how that got in there)
Somehow the w3id.org homepage link at the end of #110 (comment) is broken for me due to a UTF-8 BOM that's crept in? The source shows it as https://w3id.org/%EF%BB%BF. Strange. https://w3id.org/ works.
@jbenet @rgrp Here's an example dataset I'm building: https://github.com/jden/data-bike-chattanooga-docks. Some thoughts from the experience (albeit tangential to this thread): one thing I ran into is publishing the same data in two representations, e.g.:

```json
"resources": [
{
"name": "data",
"mediatype": "text/csv",
"path": "data.csv"
},
{
"name": "data",
"mediatype": "application/json",
"path": "data.geojson"
}
]
```
Cool!
If you give me a couple of weeks, Transformer (repo) will help you do this really easily.
This will be the case as long as different registries do not agree on their structures. If you published this to rubygems too, you'd also have a `.gemspec`.
We could open up a discussion about getting to this. Frankly, now that JSON-LD exists, there's no reason we can't have the same package.jsonld spec for every package manager out there, with different `@context` files per ecosystem. But don't expect this to happen for years. :) Actually... we might even be able to get part of the way there sooner.
In my biased world view, I'd include only one and use transformer to generate the second representation with a makefile (or have both, but still have the transform). Something like:

```make
data.geojson: data.csv
	cat data.csv | transform csv my-data-schema geojson > data.geojson
```

Note: this doesn't work yet. It will soon :) As for indicating this in the package file, your same resources listing could work.
I think this works (ISC in the licenses field). For code, it's common to add a LICENSE file in packages. We could establish a convention of putting the various licenses into the same file, or perhaps having two (one for the code, one for the data).
Does this change in light of the comments I made above, re being directly linked-data compatible?
Not necessarily? As I understand, we can remap `@id` to a friendlier alias. I want to get this finished soon, so let's settle our thoughts on `@context`.
I think it's really important to make the move to JSON-LD. And this IMO makes `@context` a MUST (I actually think it's more important than most other fields). I will definitely require it in any registries I write for data packages. Again, tools can fill it in automatically. I'm happy to help in upgrading all existing packages (scripts to upgrade, plus crawling CKAN and adding `@context` where it's missing). If you're set on not making it a MUST, then I propose we keep it as a strongly encouraged MAY.
How's this as a first draft?
The last line is super awkward. Does it even make sense? @rgrp @msporny please correct any nonsense I might have spewed! :)
Hi @rgrp @jbenet, what is the current status of this "MAY" item in the next spec? I also wonder if you guys are lining up with the DCAT & PROV initiatives from the W3C. DCAT and PROV address a similar use case: a spec for metadata to describe datasets in RDF, which can be encoded as JSON-LD easily.
@pvgenuchten not sure -- @rgrp?
@jbenet @pvgenuchten no-one commented on the proposal, so nothing happened :-) Generally I want to see a fair number of comments on a proposal before putting a change in.
@pvgenuchten @jbenet if someone could give me some sample language or submit a PR, this can go in.
@rgrp sample language for the `@context` field?
@jbenet yes, plus language for the actual spec proposal.
@rgrp can you give me a precise example of what you want, say for another, existing field?
@jbenet I'd be looking for relevant language to add to the spec, specifying what property (or properties) to add (e.g. `@context`) and what they should contain.
@rgrp the directions are too vague. Do you want a patch to http://dataprotocols.org/data-packages/? How much of a connection to JSON-LD do you want the spec itself to make? Also, as mentioned before, tools can fill `@context` in automatically. It's also easy to treat all data-packages without an `@context` as having an implied default one. Note this also needs a proper JSON-LD context file representing the machine-readable version of this spec. Hmmm, I don't have enough time to take this whole thing on right now -- do you have anyone else on the team that cares about linked data to work with me on this?
@jbenet I think it would come under the "MAY" style fields. I'm not sure I understand enough here to gauge the complexity. No problem if you don't have time for this right now; we can wait to see if someone else volunteers to get this in.
I note that the link to package.jsonld in the issue description now leads to a 404 page - is package.jsonld still a thing? @jbenet would you be able to write out the data contained in a datapackage.json as package.jsonld, for comparison?
Yep still a thing. We haven't had time to give it a new home yet.
In that case, this issue should probably be closed in favour of a new "Use the W3C Metadata Vocabulary for Tabular Data" issue. @rgrp - is the plan to transition to the CSV-WG's JSON data package description when it becomes a Recommendation?
@hubgit no, no intention to transition to that spec, as it isn't Data Package. Whilst directly inspired by Data Package and Tabular Data Package (and I'm an editor), I think it has diverged a lot by now. So it's still useful to get JSON-LD compatibility in here, and this issue should stay open.
Ok, in that case we just need the mapping from Data Package property names to URLs, and a stable place to host a JSON-LD context file.
@hubgit great - would you be up for having a stab? Also, is there anything we need to add to datapackage.json itself?
I'll have a look, yes. I'm not sure what the best URL for each property would be, though: maybe something like http://dataprotocols.org/data-packages/#name, anchored to the spec itself. All that needs to be added to datapackage.json itself is an optional `@context` property.
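A hedged sketch of what such a context file might contain (the fragment-style property URLs are placeholders until the spec settles on real ones; the Dublin Core term shows how existing vocabularies could be reused):

```json
{
  "@context": {
    "id": "@id",
    "name": "http://dataprotocols.org/data-packages/#name",
    "title": "http://dataprotocols.org/data-packages/#title",
    "description": "http://purl.org/dc/terms/description",
    "keywords": "http://dataprotocols.org/data-packages/#keywords",
    "licenses": "http://dataprotocols.org/data-packages/#licenses"
  }
}
```

A data package would then opt in with a single extra line, e.g. `"@context": "https://w3id.org/datapackage/v1"`, per @msporny's suggestion above.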
Hey @hubgit, @jbenet: just a quick note on best practices wrt. JSON-LD context files: version the context URL, and host it somewhere that will outlive any single domain.
That's just the stuff I can think of off the top of my head. I'd be happy to look at the JSON-LD context, URL mappings, and other stuff as you make more progress.
Thanks @msporny, that's helpful. It looks like Chrome's not happy about w3id.org's HTTPS encryption?
Thanks @hubgit, looks like the new versions of Chrome mark certs that use RSA w/ SHA1 as invalid - we'll get a new cert that uses SHA256 from our CA... there may also be a problem w/ the fact that our CA doesn't publish their public audit records. Working to fix it now.
@hubgit fixed - w3id.org now uses RSA w/ SHA256, which'll get rid of the warning in the newer versions of Chrome.
INVALID / DUPLICATE in favour of #218. This issue has moved quite far from the original discussion and is quite lengthy. I'm therefore closing in favour of a new, specific issue on providing a JSON-LD context file for Data Package and Tabular Data Package.
Hey guys!

I'm the author of datadex, and now working with @maxogden on dat. As a package manager for datasets, datadex uses a package file to describe its datasets. Choosing between `data-package.json` and `package.jsonld` is hard: `data-package.json` has been around longer, has a well defined spec, and many packages use it; `package.jsonld` takes into account JSON-LD (which came out recently) and plugs into schema.org's schemas, for linked-data goodness. And at first glance, it seems most of what's in `data-package.json` is in `package.jsonld`.

It's confusing for adopters to have two different specs. I think we should reconcile these two standards and push forward with one. Thoughts? What work would it entail?

To ease transition costs, I'm happy to take on the convergence work if others are too busy. Also, I can write a tool to convert between the current `data-package.json` and `package.jsonld` and whatever else.

Cheers!

cc @rgrp, @maxogden, @sballesteros