How to specify the number of records in a dataset #1571
I know that this is not the same, but there is …
Just out of curiosity, @nichtich, could you provide a use case where users depend on an exact value for the notion of size? For me there are some reasons that size is not included. With the introduction of APIs, the need for size becomes very limited. For APIs, size becomes time-dependent, and since most data portals assume that metadata changes slowly (once a week is a quick pace ;-) ), the property loses its value. (If the data is only harvested once a week, then the importance of accuracy is reduced.) I see it featuring more in file downloads, but even then I am not so sure there is a need to be exact. Although the need for expressing size feels very natural, in practice I seldom see publishers providing it, because of the high effort to keep track of sizes (both human and technical investment). Therefore I am curious about the use case that would motivate publishers to provide size information.
To my mind, the number of records is a human-facing bit of info that gives a person an idea of the scope of the information prior to downloading. Number of bytes is reminiscent of those large software downloads in times past, when you needed to know that the download had completed. However, for very large files it is useful to know that they ARE very large, which today means multiple gigabytes. For smaller files I doubt byte size matters. Therefore, both measures are needed, but each is useful under specific circumstances.
The number of records gives information about content. It is useful for judging and comparing both different datasets of the same type (with the same method of counting records) and changes of a dataset over time. See http://nomisma.org/datasets for an example of a list of datasets with the number of records of each. This example happens to use void:entities:

<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:hasPart [ rdf:type dcmitype:Collection ;
        dcterms:type nmo:TypeSeriesItem ;
        void:entities 3650
    ] ;
    dcterms:hasPart [ rdf:type dcmitype:Collection ;
        dcterms:type nmo:Monogram ;
        void:entities 309
    ] .

I am not sure whether this is best practice and applicable to other kinds of datasets, for instance the number of files. By the way, DataCite has a free text property that maps to dcterms:extent:

<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent [
        rdf:type dcterms:SizeOrDuration ;
        rdf:value "3650 type series items"
    ] ;
    dcterms:extent [
        rdf:type dcterms:SizeOrDuration ;
        rdf:value "309 monograms"
    ] .

or (what I would prefer):

<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent "3650 type series items" ;
    dcterms:extent "309 monograms" .

I also found the Ontology of units of Measure (OM) to support this:

<http://numismatics.org/pco/>
    rdf:type void:Dataset ;
    dcterms:extent [
        rdf:type om:Measure ;
        om:hasNumericalValue 3650
    ] ;
    dcterms:extent [
        rdf:type om:Measure ;
        om:hasNumericalValue 309
    ] .

Last but not least, Wikidata uses P4876 to specify the number of records; see this list of databases with their number of records.
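For reference, a rough sketch of how such a count appears in Wikidata's RDF export; the item ID below is a made-up placeholder, and only the property P4876 ("number of records") is taken from the comment above:

@prefix wd:  <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical item standing in for the database being described.
wd:Q00000000 wdt:P4876 "3650"^^xsd:decimal .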
@bertvannuffelen interestingly, on a related topic, while a little work was put into adding DatasetSeries in DCAT 3, no one suggested the need for a count of items in a series.
I agree that both file size and dataset size are useful. In the world of high-performance computing, unfortunately, the times of needing to know whether a download completed haven't yet receded into the past. A few gigabytes are not large in this realm. File movements at the terabyte to hundreds-of-terabytes level are common, so special tools are needed, and care must be taken to maximize throughput without causing trouble for others on the network. I often field queries from users about how to go about moving a dataset from one storage tier to another or from one site to another. So, size definitely can matter and should be expressible. Another potentially important piece of the puzzle is the number of inodes (files or directories) involved when the dataset is unpacked, since some storage can be finicky about storing or reading from many small files. The number of rows in a data table can also matter for whether it can fit into a certain type of database or be manipulated with certain analysis tools. Often the number of rows maps in a general way to the usefulness of a scientific dataset, though depending on the dataset, its size may be better expressed in more domain-specific terms, like degrees of the sky for astronomical data, or spatial resolution for climate data.
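To make these different facets concrete, here is a minimal sketch under stated assumptions: the dataset and distribution URIs are made up, dcat:byteSize is an actual DCAT property on distributions, while the row-count and file-count statements simply reuse the generic dcterms:extent, since no standard properties exist for them.

@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical URIs, for illustration only.
<https://example.org/dataset/climate-sim> a dcat:Dataset ;
    dcterms:extent "1.2 million rows" ;
    dcterms:extent "48000 files when unpacked" ;
    dcat:distribution <https://example.org/dataset/climate-sim/full.tar> .

<https://example.org/dataset/climate-sim/full.tar> a dcat:Distribution ;
    dcat:byteSize "137438953472"^^xsd:decimal .    # 128 GiB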
Regarding series, I think the expectation is that a series will grow, so expressing a count of items in the series becomes meaningless very quickly.
@ALL, somewhat as expected, there are very different, yet specific, expectations of size. I observe the following:
To get a harmonised view, the size will have to be a complex datatype, having properties:
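Purely to illustrate what such a complex datatype could look like, here is a sketch in which every ex: property is hypothetical (nothing below is defined by DCAT); the facets echo the points raised earlier in the thread: the number itself, the unit being counted, the counting method, and when the count was taken.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <https://example.org/size#> .

<https://example.org/dataset/coins> dcterms:extent [
    ex:numericValue   3650 ;                       # the cardinal number
    ex:unit           "type series items" ;        # what is being counted
    ex:countingMethod "distinct records" ;         # how the count was obtained
    ex:measuredAt     "2021-11-01"^^xsd:date       # when the count was taken
] .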
I see the following challenges:
From this I see size featuring more in a specific profile of DCAT for a specific use case. If the semantics of size is left to the ecosystem to define in a profile, then my opinion is to not include the property in DCAT, but immediately push it to the profile. Introducing "abstract properties" that should not be used is not very helpful. @agreiner introduces an interesting notion, "usefulness of a dataset". That is, I think, the key of the story here. Size could play a role in such an assessment, but that is very user and use case specific. I might be biased, but I think size is overrated in this assessment. Other properties will probably play a more important role (as size is not provided often, cf. the challenges I listed). I think it would be good to provide evidence from existing data portals and communities where size is a critical and well-maintained property, before introducing such a property.
Thanks @bertvannuffelen for the summary. Size indeed depends a lot on context.
This goes beyond the original request. Just a cardinal number and the unit of what is being counted is enough. There are several ways to express it in RDF:
There already are unit-specific properties (for instance void:entities, used above, counts entities, and dcat:byteSize counts bytes).
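As a sketch of the two patterns (dataset URI made up; neither pattern is mandated by DCAT):

@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Unit-specific: the property itself fixes what is counted (entities).
<https://example.org/dataset/coins> void:entities 3650 .

# Generic: cardinal number and unit packed into one free-text value.
<https://example.org/dataset/coins> dcterms:extent "3650 records" .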
Numbers beyond the number of bytes are common, just browse around in any data catalog. I just looked at the first topic I thought of (astronomy) and found two examples within a minute: https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq (number of landmarks) and https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz (number of rows and columns).
Additional examples are listed in my recent comment.
@nichtich the two examples are interesting.

Example a) https://data.nasa.gov/Space-Science/Mars-orbital-image-HiRISE-labeled-data-set-version/egmv-36wq (number of landmarks, a very domain-specific unit). The "size information" is actually part of the description and not an independent number. Also, from the description I am not sure whether the dataset publisher would want to share a single number. But I believe the publisher wanted to explain the nature of the data, and by accident the numbers fit into the textual description. Observe that this also ties the description of a dataset to its size; that means the intent is for this dataset to remain very static.

Example b) https://data.nasa.gov/Aerospace/NASA-TechPort/bq5k-hbdz (number of rows and columns, a very generic unit). The portal allows exporting it as CSV, RDF or XML. So the size here is not a metadata value but a service offering of the portal, in case it can offer the data directly. It is calculated dynamically, I assume (or on upload by the publisher). The latter are important questions, because in the end publishers should be instructed to provide a common metadata quality for all entities they share. In general, it should be clear what the following statements mean, without additional explanation.
PS: I randomly clicked around in data.europa.eu and could not find any examples. Maybe bad luck, but it also indicates that size is not often provided. That is the reason I asked for example portals where size is an important and critical feature for the functioning of that data community. In the NASA data portal, size provisioning is ad hoc and probably depends on the dataset owner. I would like to see, for instance, data portals that offer different access patterns or payment requirements based on size, etc. At this moment the examples are only those cases where either a) a publisher did some editorial work or b) the data is available in a data warehouse that calculates some number. I really would like to discuss more inspiring cases than these, because those use cases will drive publishers to provide more precise and higher-quality metadata. But I see where you are heading: your request is to "officially" adopt dcterms:extent. The challenge is the request for harmonising the value space in some way.
I have added the label "future work", as the DXWG group voted for the CR publication, and the process does not allow including new features at this stage. Future DCAT standardization processes can consider this issue. |
I could not find any information on how to express the number of records in a dataset (also known as its size). There was a deprecated property dcat:size, a sub-property of dcterms:extent, so my guess would be to just use dcterms:extent (with any kind of value: number, string, blank node, URL...) or a more specific property from another vocabulary (e.g. statistics properties from the VoID vocabulary). The general size of a dataset in terms of conceptual entities (records, concepts, resources, objects...) is fundamental information, so DCAT should at least mention the topic, explain why there is no strictly defined property, and refer to dcterms:extent.
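A minimal sketch of that guess, with a made-up dataset URI; the three statements show alternative value styles for the same count, and it is exactly this openness of the value space that remains unspecified:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical dataset URI; alternative ways to state "120000 records".
<https://example.org/dataset/authority-file>
    dcterms:extent 120000 ;                    # bare number, unit left implicit
    dcterms:extent "120000 records" ;          # free-text number plus unit
    dcterms:extent [                           # structured value
        rdf:value 120000 ;
        dcterms:description "number of records"
    ] .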