Open Pathology Dataset (OPaD) #3410

drbeh · 2021-11-28T22:08:44Z

drbeh
Nov 28, 2021
Collaborator

Motivation

There are many digital pathology datasets publicly available and they have been widely used in numerous research projects and challenges. However, there is no universal and easy-to-use API for these datasets to enable users to start their AI project without doing additional work for manual downloading and data preparation for the problem at hand.

So far in MONAI, we have focused on providing users with the capabilities to build, train, and test their models, but the starting point, which is accessing datasets through a simple API in a normalized data format, has not been a focal point.
Although we have touched on this topic for a couple of datasets, like MedNISTDataset and DecathlonDataset, there is no pathology-specific dataset support apart from some generic classes, like PatchWSIDataset and MaskedInferenceWSIDataset.

MONAI Pathology has the potential to become the go-to place to start any AI-in-pathology challenge. This requires reliable data hosting, flexible downloading, special data preparation, and, more importantly, a normalized data format for different pathology tasks. We can achieve these goals through self-sufficient, special-purpose open pathology datasets (OPaD) for any publicly available and well-grounded histopathology data. TorchVision datasets are a good example of how a simple and nearly uniform API can be used for such datasets.

Overall, OPaD aims to lower the barrier for users to experiment with AI models in the pathology domain, to get their hands on real histopathology datasets, and to create a common platform for development in this area with direct access to a variety of prepared histopathology datasets.

Requirements

Priority	Description
P0	MUST HAVE
P1	SHOULD HAVE
P2	COULD HAVE

Category	Title	User story	Priority
Download	Downloadable data	As a user, I want to download a pathology dataset.	P1
"	Specific set download	As a user, I want to download only one of the cross-validation datasets: the training, validation, or test dataset.	P1
"	Partial download	As a user, I want to download an arbitrarily small portion of each cross-validation dataset.	P2
"	Smallest representative datasets	As a user, I want to download the smallest portion of all cross validation datasets that I can use for testing the whole system.	P2
Normalized Output	Digestible data	As a user, I want to use the data readily to train a pathology machine learning model.	P0
Cross Validation	Cross validation sets	As a user, I want to get separate training, validation and test sets.	P0
"	Intact test set	As a user, I want to have a test set isolated from my training/validation set.	P0
"	Smaller test set	As a user, I want to set the number of samples that I want to use from the test set.	P2
"	Flexible cross validation split	As a user, I want to set the ratio of training and validation sets.	P2
"	Shuffle samples	As a user, I want to get a shuffled training set.	P1
Data Preparation	Arbitrary patch extraction	As a user, I want the patches to be extracted based on location, size, and resolution level from whole slide images (very large images that usually do not fit into memory).	P0
"	Sliding window patch extraction	As a user, I want the patches to be extracted in a sliding window manner from whole slide images (covering the whole image, with overlaps).	P0
"	SmartCache	As a user, I want to take advantage of SmartCache in my training data.	P0
"	Offline transformation	As a user, I want to prepare my data [offline] by applying a combination of MONAI and PyTorch transformations.	P2
"	Augmentation	As a user, I want to augment my data [on the fly] with a combination of MONAI and torchvision transformations.	P0
"	Random region extraction	As a user, I want to extract regions and their associated labels randomly on the fly.	P2
Open Dataset Support	Camelyon16 challenge dataset	As a user, I want to be able to use the Camelyon16 challenge dataset.	P0
"	Camelyon17 challenge dataset	As a user, I want to be able to use the Camelyon17 challenge dataset.	P1
"	PANDA challenge dataset	As a user, I want to be able to use the Prostate cANcer graDe Assessment (PANDA) challenge dataset.	P1
Online Training	Stream Datasets	As a user, I want to train a model with the stream of data instead of downloading them.	P2
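
To make the "Sliding window patch extraction" story above a bit more concrete, here is a minimal sketch of how patch locations could be enumerated so that they cover a whole slide. The function name `sliding_window_locations` and its parameters are assumptions for illustration only; they are not an existing MONAI API, and reading the pixels at each location is left to whichever whole slide image reader is used.

```python
from typing import Iterator, List, Tuple


def sliding_window_locations(
    image_size: Tuple[int, int],
    patch_size: Tuple[int, int],
    overlap: float = 0.0,
) -> Iterator[Tuple[int, int]]:
    """Yield top-left (row, col) corners of patches that cover the whole image.

    The last patch along each axis is shifted back so it ends exactly at the
    image border, so border patches may overlap more than requested.
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")

    def starts(length: int, patch: int) -> List[int]:
        if patch >= length:
            return [0]  # a single patch already spans this axis
        step = max(1, int(patch * (1.0 - overlap)))
        positions = list(range(0, length - patch, step))
        positions.append(length - patch)  # make sure the border is covered
        return positions

    for row in starts(image_size[0], patch_size[0]):
        for col in starts(image_size[1], patch_size[1]):
            yield row, col
```

Only coordinates are generated here, so nothing large is held in memory; each location would then be read at the desired resolution level on demand.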

Questions

What are other useful open datasets for pathology?
What are common tasks for AI in digital pathology?
- image-level classification
- patch-level classification
- image segmentation
- object detection
- ...?

javier-alvarez · 2021-11-29T09:49:38Z

javier-alvarez
Nov 29, 2021

TCGA please

0 replies

JHancox · 2021-11-29T11:24:19Z

JHancox
Nov 29, 2021
Collaborator

We would need to consider how to handle the size of some of these datasets. Whilst it's feasible to download some datasets within a notebook session, this is unlikely for something like Camelyon. One option, for example, is streaming during the first training epoch and then using the locally cached images thereafter.
We might want to provide pre-trained unsupervised features for each dataset too - to speed up the training process of tasks with standard network architectures.
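
As a rough illustration of the "stream on the first epoch, read from the local cache afterwards" idea, here is a minimal sketch. `CachedFetcher` and its `fetch` callable are hypothetical names, not part of MONAI; `fetch` stands in for whatever function actually retrieves the bytes of one image or patch from a remote source.

```python
import hashlib
from pathlib import Path
from typing import Callable


class CachedFetcher:
    """Fetch a remote item once, then serve it from a local cache directory."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # hypothetical: maps a key or URL to raw bytes

    def _path(self, key: str) -> Path:
        # Hash the key so arbitrary URLs map to safe file names.
        return self.cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def get(self, key: str) -> bytes:
        path = self._path(key)
        if path.exists():
            return path.read_bytes()  # later epochs: local read only
        data = self.fetch(key)  # first epoch: pull from the remote source
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename last so a partial write is never served
        return data
```

A dataset's item loader could call `get` for every sample, so the first pass over the data fills the cache and later passes never touch the network.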

0 replies

cooperlab · 2021-11-29T16:06:02Z

cooperlab
Nov 29, 2021

What are you considering for hosting? Even ROI-based datasets can be several GB.

I think the top priority should be standardizing formats for things like label or instance images, JSON or other for class names etc. If you provided the standards we would happily transform our data to comply and also use these standards in the future.

3 replies

drbeh Nov 30, 2021
Collaborator Author

Thanks for your feedback @cooperlab!
I totally agree with you that data standardization and normalization would be the most important task here. This is something that we hope we can leverage the Pathology WG for.

Regarding the data hosting, we can get access to cloud services to host huge datasets, so I am not worried about that, but managing data access and use agreements would be a challenge since this is healthcare data.

cooperlab Nov 30, 2021

Those agreements have to be between users and data providers, although there is a lot that MONAI can do to make users' lives easier once they can access the data.

In the past, we looked at publishing large datasets on https://aws.amazon.com/opendata. They are probably not interested in small datasets but if you have something really significant linked to a scientific pub that would be possible.

kirbyju Jun 27, 2022
Collaborator

The Cancer Imaging Archive has published ~35 open-access pathology imaging datasets: https://www.cancerimagingarchive.net/histopathology-imaging-on-tcia/.

The Imaging Data Commons has begun taking those datasets and converting them to dual personality TIFF/DICOM and posting them in Google Cloud's Public Dataset program: https://portal.imaging.datacommons.cancer.gov/explore/filters/?access=Public&Modality=SM .

If investigators have additional data they would like to share via this pipeline you can propose new datasets to TCIA at https://www.cancerimagingarchive.net/primary-data/. Note that we also accept "Analysis Result" datasets which seek to publish image labels/annotations of the images in TCIA.

aylward · 2021-11-29T23:29:26Z

aylward
Nov 29, 2021
Maintainer

There is a parallel initiative to have MONAI become a "portal" to medical image datasets. The goal is to make it easy to access data from multiple sources, optionally with pre-defined training, validation, and testing splits to promote reproducibility. As a portal, we would provide a collection of "access files". A user would browse and then download a specific access file when they wanted to access a particular set of data from the web. That access file would then be passed to a special MONAI reader, which would automatically handle the downloading, caching, and train/validation/test split of the data for deep learning research.

In this way, we don't have to try to host / curate the data, we aren't responsible for anonymization or other risky acts, we don't have to fund the potentially massive storage and download costs, and we can instead focus our time/resources on making existing and future data repos easily integrated into MONAI.
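
To sketch what reading such an access file might look like, here is a minimal example. The JSON layout (`items`, `url`, `split`, `validation_fraction`) and the function `split_from_access_file` are assumptions made up for illustration; no access-file format has been agreed on. Items without a pre-defined split are assigned by hashing their URL, so the same file always yields the same partition.

```python
import hashlib
import json
from typing import Dict, List


def split_from_access_file(text: str) -> Dict[str, List[str]]:
    """Assign each listed item to training, validation, or test.

    Items with an explicit "split" keep it. The rest are assigned by hashing
    their URL, so the result does not depend on item order or a random seed.
    Test items are never created implicitly; they must be marked in the file.
    """
    spec = json.loads(text)
    val_fraction = spec.get("validation_fraction", 0.2)
    splits: Dict[str, List[str]] = {"training": [], "validation": [], "test": []}
    for item in spec["items"]:
        split = item.get("split")
        if split is None:
            digest = hashlib.sha256(item["url"].encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
            split = "validation" if bucket < val_fraction else "training"
        splits[split].append(item["url"])
    return splits
```

Downloading and caching would then operate on the URLs in each list, and the data itself stays with the original repository.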

A very rough/incomplete/preliminary example was demonstrated in this github issue/PR that focused on getting data from NCI's "The Cancer Imaging Archive": #2212

Would such a "portal" work for pathology data?

1 reply

cooperlab Nov 30, 2021

GDC provides a similar API and client to access TCGA pathology images. MONAI could create curated manifests for different datasets / experiments that automate the download similar to what you describe for TCGA. Some of the meaningful image sets are several TB though. Lung (LUSC, LUAD) and brain (LGG, GBM) are frequent benchmarks and each has over 1000 whole-slide images. Some datasets only deal with a portion of these slides, but the client can't stream just the necessary tiles (though it is an interesting thought).

A lot of datasets exist on challenge websites that lack an API and that may require registration for access. MONAI could still make it easier to work with these by helping the user load and partition their local copy and providing a standardized interface to work with the data.

shaneahmed · 2021-12-14T17:28:10Z

shaneahmed
Dec 14, 2021

I would suggest data streaming with object-store support, for example the zarr format, so that users do not have to download whole datasets and can instead download only the part of the data they want to process. TCGA is one example: sometimes you do not need to download the whole dataset.
Additionally, support for Colab would be useful. As Colab has limited resources, users need to be able to run the algorithms without downloading huge datasets.
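
The reason a chunked format helps with partial downloads is that the array is stored as separate chunks, so a reader only needs the chunks that overlap the requested region. A minimal sketch of that index calculation follows; `chunks_for_region` is a hypothetical helper written for illustration, not a zarr or MONAI function.

```python
from typing import List, Tuple


def chunks_for_region(
    origin: Tuple[int, int],
    size: Tuple[int, int],
    chunk_shape: Tuple[int, int],
) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the chunks that overlap a requested region."""
    if min(size) <= 0:
        return []  # an empty region touches no chunks
    first_row = origin[0] // chunk_shape[0]
    last_row = (origin[0] + size[0] - 1) // chunk_shape[0]
    first_col = origin[1] // chunk_shape[1]
    last_col = (origin[1] + size[1] - 1) // chunk_shape[1]
    return [
        (r, c)
        for r in range(first_row, last_row + 1)
        for c in range(first_col, last_col + 1)
    ]
```

For a 256 x 256 patch in a slide stored with 512 x 512 chunks, at most four chunks are fetched, regardless of how large the slide is.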

1 reply

JHancox Dec 14, 2021
Collaborator

I agree Shan. There are a few ways we could add value too. We can provide some 'recipes' (e.g. X random images from TCGA with Y properties). We can also provide smart caching and streaming (e.g. lazily/progressively downloading and caching sets of patches/images up to a required training set size).

ant0nsc · 2021-12-14T18:03:41Z

ant0nsc
Dec 14, 2021

Just wanted to add here that we should think about licenses. If we are offering easy-to-use download tools, for example, are we under some legal or at least moral obligation to remind people of the licenses of the dataset? It's possible that it's a no-op, but we should at least have thought about it.
If we go for "hosting" rather than just download helpers, dealing with licensing becomes a must.
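
One lightweight way a download helper could surface this is to carry the license and citation alongside each dataset entry and refuse to proceed until the user acknowledges them. The sketch below is only an illustration; `DatasetLicense`, `license_notice`, and `confirm_license` are hypothetical names, and whether such a step is legally required is a separate question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetLicense:
    """License metadata a download helper could store per dataset."""

    dataset: str
    license_name: str
    license_url: str
    citation: str


def license_notice(info: DatasetLicense) -> str:
    """Build the text shown to the user before any data is fetched."""
    return (
        f"{info.dataset} is distributed under {info.license_name} "
        f"({info.license_url}).\n"
        f"Please cite: {info.citation}"
    )


def confirm_license(info: DatasetLicense, accepted: bool) -> None:
    """Stop the download unless the caller explicitly accepted the license."""
    if not accepted:
        raise PermissionError(license_notice(info))
```

A downloader would call `confirm_license` first, so the notice appears either as a printed reminder or as the error message when the license has not been accepted.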

2 replies

JHancox Dec 14, 2021
Collaborator

Yes, I think this point was mentioned previously, but it is an important one, so good to register it here.

kirbyju Jun 27, 2022
Collaborator

This is a great point. As a data provider The Cancer Imaging Archive has many open-access datasets, but each dataset still has a license. In most cases we're using https://creativecommons.org/licenses/by/4.0/ and the important thing is that people are made aware how they can properly provide attribution to the dataset authors. This is critical for creating a citation incentive for researchers to share their data, and also helps us keep track of which datasets are actually being used frequently since we don't require user registration/login for most of our datasets.

cooperlab · 2021-12-14T18:24:33Z

cooperlab
Dec 14, 2021

I think the licensing and streaming are related. Currently, a single slide is the most granular element provided by the data hosting sites. If you want finer granularity like patch/ROI, then I think this implies re-hosting the data. It could become messy to deal with licensing, but I think there is a lot of value in providing finer-grained access.

1 reply

JHancox Dec 14, 2021
Collaborator

Good point. If we did rehost then we'd also have the ability to store the images in the best possible format for streaming of the foreground patches, which would be a nice value-add.

andrehuisman · 2022-05-31T07:22:08Z

andrehuisman
May 31, 2022

Are you also considering seeking collaboration with existing standardisation organisations, which in some ways are working on the same subjects? In some areas, healthcare IT struggles with a lack of standardised ways of interfacing between applications, and digital pathology is one of those areas. Since most of these datasets arise from clinical data sources, and DICOM, for example, is now gaining traction in clinical implementations, it would make sense to stay close to DICOMweb (JSON) APIs where possible. Also, IHE (Integrating the Healthcare Enterprise) is working on describing AI use cases and translating those to existing (DICOM and HL7) standards. Although this work is in the radiology domain, many of the concepts are of course reusable for pathology imaging.

0 replies

dsdanielpark · 2024-03-25T04:38:50Z

dsdanielpark
Mar 25, 2024

I've been following the MONAI project since its early days, attracted by its user-friendly design that simplifies understanding of various medical domain processes.

I'm curious about any updates related to the project, particularly progress in streaming using zarr similar to TCGA, and any advancements regarding formats that encompass medical imaging data such as DICOM, HL7, and NIfTI.

It would be exciting to have a comprehensive format for medical imaging data.

This future-oriented work is crucial as lowering barriers to access and entry in the medical domain will enable more forward-thinking development.

Can you share if there has been any progress in this area and where I can find more information?

3 replies

JHancox Mar 25, 2024
Collaborator

Hi Daniel - it's probably fair to say that progress has been somewhat slow for the last few months - for various reasons. However, at the most recent Working Group meeting (12th March 24), the issue of standardised formats for Digital Pathology was briefly discussed. Perhaps @drbeh can invite you to the next meeting and include this as an agenda item?

kirbyju Mar 25, 2024
Collaborator

Not sure if this is helpful or not, but I recently attended an NCI workshop on digitized pathology, and it sounds like an increasing number of vendors now support DICOM natively. I also learned that there are numerous tools, projects, and reference datasets to help support legacy conversion:

dsdanielpark Mar 26, 2024

JHancox, kirbyju

Thank you both for your thoughtful responses.

I found sections 5, 6, and 8 of the seminar particularly important and am glad to know such agendas are being discussed continuously.

Thank you also for sharing various resources. It seems that the Zenodo and Grand Challenge platforms are hosting many of the latest works, more so than the TCGA database nowadays.

The interplay of different stakeholders, the prevalence of proprietary software, and the use of diverse programming languages have hindered progress in the medical domain and accessibility for AI developers. Often, the effort to find or understand interfaces and scripts involves too much trial and error, and even successful attempts may become obsolete due to lack of updates.

Developing solutions in this context is akin to starting from scratch, collecting new tools on a blank canvas – a challenge shared by many.

The lack of sustained support or continuation for key open-source projects is a significant issue. The proliferation of versions, updates, and communication barriers poses major constraints on open-source activities.

The medical domain has seen too much redundant, time-consuming, and wasteful effort in rewrapping the same algorithms for different purposes and adjusting them to fit varying interests.

I believe prioritizing the rewrapping of these formats and contents and continuously developing the open-source community should take precedence over other tasks.

Projects like MONAI are vital because their user-friendly design and Python implementation greatly contribute to addressing these issues. Gathering algorithms and logic used in 3D Slicer or specific areas into one place, making them easily scriptable in Python, and providing simple tutorials for integration are crucial.

Underlying all these is the sensitivity of the data we handle. Ensuring privacy and adherence to ethical codes is crucial. Utilizing public samples or generative AI-generated samples to facilitate access to this diverse formatted data and ease the initiation of research pipelines is essential.

I believe that the recent significant advancements in large language models (LLMs) offer the potential to expedite labor-intensive tasks considerably. LLMs can reinterpret algorithms dependent on specific frameworks and assist in the rapid creation of tutorials. This process could be further accelerated through collaborative efforts to correct errors, involving the original authors or developers.

This is particularly pertinent as many open-source projects face severe staffing challenges. The disparate time zones of contributors and the depth of specialized knowledge required make it challenging to contribute to the advancement of open-source initiatives effectively.

The medical domain, being exceedingly diverse and complex, necessitates a vast amount of explanation and understanding, setting it apart from general computer science. This complexity underlies the challenges faced in integrating and advancing open-source projects within this field.

Curating this knowledge with LLMs and other tools, gathering it in one place, and providing it to medical professionals, AI specialists, and developers in the medical or AI fields is a forward-looking task of immense importance for humanity's future.

I've always thought the abundance of funding in the medical and healthcare field is both an advantage and a disadvantage of the medical domain. Companies may not wish for open-source projects to flourish yet actively utilize them. Without addressing this issue, progress in the medical and healthcare domain may be slow, seemingly lagging behind other fields, especially in CV technology development.

Though it may seem minor, I once again believe that integrating data formats, simplifying access, and curating development in Python for developers to easily create pipelines and experiment is vital for the advancement of open-source and the future of humanity.

I express deep admiration and gratitude to contributors in the revolutionary work of MONAI. I believe that without the MONAI project, AI advancement in the medical field would have lagged significantly in recent years, leading to global resource wastage and losses. I am also pleased that the MONAI project continues to actively communicate and progress. However, I anticipate challenges as MONAI encounters access changes to some data, potentially causing disruptions in tutorials or updating them. Integrating a data pipeline to prevent such issues from recurring is crucial.

Once again, I wish for the continued longevity and progress of the MONAI project.
Thank you once again for your responses.


Replies: 9 comments 11 replies
