From 4cc253528fdab39e2b4e534fda15099f61636b99 Mon Sep 17 00:00:00 2001 From: jmoore Date: Fri, 20 Nov 2020 09:24:02 +0100 Subject: [PATCH] Update introduction and references --- index.bs | 122 +++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 79 insertions(+), 43 deletions(-) diff --git a/index.bs b/index.bs index 3baf307d..d0773adf 100644 --- a/index.bs +++ b/index.bs @@ -10,7 +10,9 @@ Repository: https://github.com/joshmoore/ngff Issue Tracking: Forums https://forum.image.sc/tag/ome-ngff Logo: http://www.openmicroscopy.org/img/logos/ome-logomark.svg Local Boilerplate: header yes +Local Boilerplate: copyright yes Boilerplate: style-darkmode off +Markup Shorthands: markdown yes Editor: Josh Moore, Open Microscopy Environment (OME) https://www.openmicroscopy.org Abstract: This document contains next-generation file format (NGFF) Abstract: specifications for storing bioimaging data in the cloud. @@ -28,11 +30,11 @@ larger, preciser spatial measurements is unfortunately at odds with our ability to structure and share those measurements with others. During a global pandemic more than ever, we believe fervently that global, collaborative discovery as opposed to the post-publication, "data-on-request" mode of operation is the -path forward. Bioimages should be shareable via open and commercial cloud +path forward. Bioimaging data should be shareable via open and commercial cloud resources without the need to download entire datasets. At the moment, that is not the norm. The plethora of data formats produced by -imaging systems are ill-suited to the remote sharing. Individual scientists +imaging systems are ill-suited to remote sharing. Individual scientists typically lack the infrastructure they need to host these data themselves. When they acquire images from elsewhere, time-consuming translations and data cleaning are needed to interpret findings. Those same costs are multiplied when @@ -41,53 +43,63 @@ factor before publication is possible. 
Without a common effort, each lab or resource is left building the tools they need and maintaining that infrastructure often without dedicated funding. -This document assumes that there are three keys to a workable solution: +This document defines a specification for bioimaging data that makes it possible +to convert proprietary formats into a common, cloud-ready one. +Such next-generation file formats lay out data so that individual portions, or +"chunks", of large data are reference-able, eliminating the need to download +entire datasets. -1. Converting all data out of proprietary formats rather than trying to translate data on every access. -2. Chunking the data so that manageable areas of large data are reference-able online rather than downloading them entirely. -3. Collaborating on a small number of container formats and conventions for metadata rather than developing new versions to meet each individual requirement. -This document specifies one layout for images within Zarr files. The APIs and -scripts provided by this repository will support one or more versions of this -file, but they should all be considered internal investigations, not intended -for public re-use. - -Why "next generation"? {#ngff} ------------------------------- +Why "NGFF"? {#why-ngff} +------------------------------------------------------------------------------------------------- A short description of what is needed for an imaging format is "a hierarchy of n-dimensional (dense) arrays with metadata". This combination of features -is certainly provided by HDF5 +is certainly provided by HDF5 from the HDF Group, which a number of bioimaging formats do use. HDF5 and other larger binary structures, however, -are ill-suited for storage in the cloud where accessing individual segments, -or "chunks", of data by name rather than seeking through a large file is at -the heart of parallelization. 
+are ill-suited for storage in the cloud where accessing individual chunks +of data by name rather than seeking through a large file is at the heart of +parallelization. As a result, a number of formats have been developed more recently which provide the basic data structure of an HDF5 file, but do so in a more cloud-friendly way. - - +In the [PyData](https://pydata.org/) community, the Zarr [[zarr]] format was developed +for easily storing collections of [NumPy](https://numpy.org/) arrays. In the +[ImageJ](https://imagej.net/) community, N5 [[n5]] was developed to work around +the limitations of HDF5 ("N5" was originally short for "Not-HDF5"). +Both of these formats permit storing individual chunks of data either locally in +separate files or in cloud-based object stores as separate keys. + +A [current effort](https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html) +is underway to unify the two similar specifications to provide a single binary +specification. The editor's draft will soon be entering a [request for comments (RFC)](https://github.com/zarr-developers/zarr-specs/issues/101) phase with the goal of having a first version early in 2021. As that +process comes to an end, this document will be updated. + +OME-NGFF {#ome-ngff} +-------------------- + +The conventions and specifications defined in this document are designed to +enable next-generation file formats to represent the same bioimaging data +that can be represented in \[OME-TIFF](http://www.openmicroscopy.org/ome-files/) +and beyond. However, the conventions will also be usable by HDF5 and other sufficiently advanced +binary containers. Eventually, we hope, the moniker "next-generation" will no longer be +applicable, and this will simply be the most efficient, common, and useful representation +of bioimaging data, whether during acquisition or sharing in the cloud. 
+ +Note: The following text makes use of OME-Zarr [[ome-zarr-py]], the current prototype implementation, +for all examples. On-disk (or in-cloud) layout {#on-disk} ======================================= -``` +An overview of the layout of an OME-Zarr fileset should make +understanding the following metadata sections easier. The hierarchy +is represented here as it would appear locally but could equally +be stored on a web server to be accessed via HTTP or in object storage +like S3 or GCS. +``` . # Root folder, potentially in S3, │ # with a flat list of images by image ID. │ @@ -130,8 +142,6 @@ On-disk (or in-cloud) layout {#on-disk} ├── 0 # Each multiscale level is stored as a separate Zarr array, as above, but only integer values │ ... # are supported. └── n - - ``` Metadata {#metadata} @@ -312,6 +322,17 @@ above).
 {
+  "blogNov2020": {
+    "href": "https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/",
+    "title": "Public OME-Zarr data (Nov. 2020)",
+    "authors": [
+      "OME Team"
+    ],
+    "status": "Informational",
+    "publisher": "OME",
+    "id": "blogNov2020",
+    "date": "04 November 2020"
+  },
   "imagesc26952": {
     "href": "https://forum.image.sc/t/ome-s-position-regarding-file-formats/26952",
     "title": "OME’s position regarding file formats",
@@ -323,16 +344,31 @@ above).
     "id": "imagesc26952",
     "date": "19 June 2020"
   },
-  "blogNov2020": {
-    "href": "https://blog.openmicroscopy.org/file-formats/community/2020/11/04/zarr-data/",
-    "title": "Public OME-Zarr data (Nov. 2020)",
+  "n5": {
+    "id": "n5",
+    "href": "https://github.com/saalfeldlab/n5/issues/62",
+    "title": "N5: a scalable Java API for hierarchies of chunked n-dimensional tensors and structured meta-data",
+    "status": "Informational",
     "authors": [
-      "OME Team"
+      "John A. Bogovic",
+      "Igor Pisarev",
+      "Philipp Hanslovsky",
+      "Neil Thistlethwaite",
+      "Stephan Saalfeld"
     ],
+    "date": "2020"
+  },
+  "ome-zarr-py": {
+    "id": "ome-zarr-py",
+    "href": "https://doi.org/10.5281/zenodo.4113931",
+    "title": "ome-zarr-py: Experimental implementation of next-generation file format (NGFF) specifications for storing bioimaging data in the cloud.",
     "status": "Informational",
-    "publisher": "OME",
-    "id": "blogNov2020",
-    "date": "04 November 2020"
+    "publisher": "Zenodo",
+    "authors": [
+      "OME",
+      "et al."
+    ],
+    "date": "06 October 2020"
   },
   "zarr": {
     "id": "zarr",