Example problems and datasets for image processing #2

Open
mrocklin opened this issue Apr 28, 2018 · 28 comments

Comments

@mrocklin
Member

mrocklin commented Apr 28, 2018

In order to make the most of our time at the scaling scikit-image sprint it might be helpful to prepare some challenge problems and datasets that we want to focus on before we arrive. Ideally these datasets and problems have the following qualities:

  1. They represent classes of real problems faced by many researchers today
  2. They are challenging for scikit-image today but might be made more comfortable by improving scalability
  3. They are based on datasets that are publicly and easily accessible
  4. They are as simple as possible, given the constraints above

An equivalent issue for machine learning datasets is posed at #1

@mrocklin
Member Author

mrocklin commented Apr 28, 2018

A microscopy dataset is available from Janelia Farm here: https://www.janelia.org/project-team/flyem/data-and-software-release , though I don't know what to do with it :)

@mrocklin
Member Author

cc @jakirkham , @ebo, @freeman-lab, @jakevdp, @simone-codeluppi, @jhamman @datametrician who might know people trying to use scikit-image (or similar frameworks) in scalable contexts

@ebo

ebo commented Apr 29, 2018

Thank you for the CC. I will take a look at the janelia dataset.

@mrocklin
Member Author

@ebo to be clear I'm not asking you to look at the existing dataset listed above. Instead I'm suggesting that you might have some impact in this community-organized event if you happen to get a scalable workflow up and running before the end of May, especially if that workflow engages Scikit-Image-style computations.

I know that you're working with other Anaconda Inc folks; this might be a way to engage the broader community if that work goes as planned.

@simone-codeluppi

Hi
Thanks for the CC!
I have been working on an image-analysis-intensive project that began as a combination of HPC+MPI and then happily transitioned to dask.distributed to handle all computation. The code base (still evolving) is in the project called pysmFISH with docs at pysmFISH-docs and an overall description of the project at http://linnarssonlab.org/osmFISH/. Through collaborative work I am also involved in the spatial transcriptomics community and the starfish project.
I will be very happy to get involved in this community effort. If needed, I also have some datasets that I will be happy to provide as test cases.

@mrocklin
Member Author

mrocklin commented Apr 29, 2018 via email

@kmader

kmader commented Apr 30, 2018

For larger 3D datasets, we have a few on Kaggle: https://www.kaggle.com/kmader/battery-3d-images that would be good examples, along with some of the standard analysis notebooks:
https://www.kaggle.com/kmader/nmc-battery-3d-overview
https://www.kaggle.com/kmader/battery-watershed-overview

The analyses are slow, cumbersome, and sometimes not fully implemented in 3D (like regionprops).

@jni
Contributor

jni commented Apr 30, 2018

Another big(ish) dataset in the EM ecosystem:

https://cremi.org/data/

One useful thing to do here is to lazily produce a bunch of filtered versions of the data, and concatenate these for machine learning:

(nplanes, nrows, ncolumns) -> (nplanes, nrows, ncolumns, nfilters) -> (nvoxels, nfilters) -> (da)sklearn pipeline. ;)
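The shape bookkeeping above can be sketched with plain NumPy. This is only an illustration: the three "filters" (raw intensity, gradient magnitude, squared intensity) are placeholder choices, `feature_stack` is a hypothetical helper, and the arrays here are eager. A lazy version would build the same stack from chunked arrays (for example with dask.array) instead of in-memory ones.

```python
import numpy as np

def feature_stack(volume):
    """Stack a few per-voxel filter responses along a new last axis."""
    volume = np.asarray(volume, dtype=float)
    # Placeholder filters, chosen only to illustrate the shapes.
    gz, gy, gx = np.gradient(volume)
    grad_mag = np.sqrt(gz**2 + gy**2 + gx**2)
    return np.stack([volume, grad_mag, volume**2], axis=-1)

# (nplanes, nrows, ncolumns)
volume = np.random.default_rng(0).random((4, 16, 16))
# (nplanes, nrows, ncolumns, nfilters)
features = feature_stack(volume)
# (nvoxels, nfilters): one row per voxel, ready for an estimator's fit/predict
X = features.reshape(-1, features.shape[-1])
```

Each row of `X` is one voxel and each column one filter response, which is the layout scikit-learn style estimators expect.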

Of course much of this workflow has been supplanted by DL applications, but actually DL is mostly useless for live painting as done in e.g. Ilastik.

As a completely different guiding workflow for this, here's a different kind of dataset that I'm working on these days:
https://data.broadinstitute.org/bbbc/BBBC017/
The ~200K images are grouped into multiple fields (tiles) and channels that need to be accumulated (this is essentially computed as 20 means of groups of ~500 images), then in a second pass need to be corrected (each image is divided by its corresponding mean illumination image), and then the results are montaged into single images. Currently I've run this using toolz streaming, which works wonderfully but takes a long time. My early attempts to daskify this pipeline blew up the memory. (@mrocklin incidentally you might remember from SciPy 2016 that my wishlist included a toolz-like interface to dask. This pipeline is why. =)
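A minimal sketch of the two-pass structure described above, assuming images arrive as `(group_key, array)` pairs. The helper names (`group_means`, `correct`, `montage`) and the toy data are hypothetical, and this is not the toolz pipeline mentioned in the comment. The first pass keeps only one running sum per group in memory, which is the property a streaming approach relies on.

```python
import numpy as np
from collections import defaultdict

def group_means(images):
    """First pass: per-group mean image from an iterable of (key, image)."""
    sums = {}
    counts = defaultdict(int)
    for key, image in images:
        image = np.asarray(image, dtype=float)
        if key in sums:
            sums[key] += image
        else:
            sums[key] = image.copy()  # copy so the caller's array is untouched
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def correct(images, means, eps=1e-12):
    """Second pass: divide each image by its group's mean illumination image."""
    for key, image in images:
        yield key, np.asarray(image, dtype=float) / np.maximum(means[key], eps)

def montage(tiles, ncols):
    """Arrange equally sized 2D tiles row by row (len(tiles) % ncols == 0)."""
    rows = [np.hstack(tiles[i:i + ncols]) for i in range(0, len(tiles), ncols)]
    return np.vstack(rows)

# Toy data: two groups with different illumination levels.
illum = {"ch0": np.full((2, 2), 2.0), "ch1": np.full((2, 2), 4.0)}
data = [(key, illum[key] * scale) for key in illum for scale in (0.5, 1.0, 1.5)]

means = group_means(data)
corrected = [image for _, image in correct(data, means)]
mosaic = montage(corrected[:4], ncols=2)
```

With real data the two passes would each re-read the files from disk rather than iterate over an in-memory list.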

Looking forward to the sprint!!!

@jni
Contributor

jni commented Apr 30, 2018

@kmader "try_all_threshold is one whiny function" 😂 Please feel free to raise an issue with us, though!

Very very cool notebooks!

@simone-codeluppi

I have a large 5 TB dataset of single-molecule fluorescence images that I can split into smaller chunks. The smallest meaningful chunk to play with (~90 GB) consists of multiple fields of view (FOVs) covering a large tissue region (220 FOVs, raw images 40x2048x2048).

This will be a case study for processing high resolution fluorescence single molecule images covering a large area of thin tissue.

The images are usually low signal and low SNR, and the goal is to identify single molecules, which appear as dots in the raw images. I usually run a series of filtering and peak-selection steps to identify 'dots', followed by stitching, registration of multiple chunks, and segmentation of cells using a watershed-based approach. Besides the filtering, everything is run on flattened images. pipeline overview
We are currently trying to implement instance segmentation with R-CNN for cell segmentation, and to see whether a similar approach can identify the dots directly from raw images without filtering.
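To make the peak-selection step concrete, here is a rough NumPy-only sketch. It is not the pysmFISH implementation: `find_dots` is a hypothetical helper, "flattening" is assumed to mean a maximum projection along the first axis, and the filtering that would normally precede peak selection is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_dots(image, threshold, size=3):
    """Return (row, col) coordinates of local maxima brighter than threshold."""
    image = np.asarray(image, dtype=float)
    pad = size // 2
    padded = np.pad(image, pad, mode="constant", constant_values=-np.inf)
    # Maximum over each size x size neighbourhood, same shape as the image.
    local_max = sliding_window_view(padded, (size, size)).max(axis=(-2, -1))
    peaks = (image == local_max) & (image > threshold)
    return np.argwhere(peaks)

# Toy stack: two bright dots in different planes plus one dim spot.
stack = np.zeros((3, 32, 32))
stack[0, 5, 7] = 10.0
stack[2, 20, 12] = 8.0
stack[1, 9, 9] = 2.0

flat = stack.max(axis=0)              # assumed meaning of "flattened"
dots = find_dots(flat, threshold=5.0)
```

Because each field of view is processed independently here, this step parallelizes naturally across FOVs; stitching and registration are the parts that need information from neighbouring chunks.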

I have a script that can process a single chunk of the dataset in 'one go' from raw data to counts (no stitching, no alignment, and no segmentation). The biggest constraint is the available RAM.

@rcjackson

We have around 2.5 TB of 3D grids derived from a research radar in Darwin, Australia, which will be made publicly accessible within the next month (40x200x200, around 300,000 files). One thing I am hoping to do with this data is to use recurrent CNNs to try and see if the CNN can learn how the spatial distribution of composite reflectivity (which gives us an idea of how much precipitation is falling) varies with time over the historical data to try and forecast the development and direction of new storms over the next hour. Right now this is done with tracking software, which cannot account for new development of storms within the hour, so I would like to see if this can be done without tracking software and ideally just from the raw images of composite reflectivity.

@DocSavage

DocSavage commented Apr 30, 2018

@mrocklin You might prefer to use the FlyEM data here: http://emdata.janelia.org
While our DVID (Distributed, Versioned, Image-Oriented Dataservice) system works through an HTTP API and holds a variety of data types, the simpler DICED Python interface is primarily for 3D image/segmentation access.

@mrocklin
Member Author

mrocklin commented Apr 30, 2018 via email

@westurner

Diagnosing heart disease from DICOM MRI images
https://www.kaggle.com/c/second-annual-data-science-bowl/data

In this dataset, you are given hundreds of cardiac MRI images in DICOM format. These are 2D cine images that contain approximately 30 images across the cardiac cycle. Each slice is acquired on a separate breath hold. This is important since the registration from slice to slice is expected to be imperfect.

The competition task is to create an automated method capable of determining the left ventricle volume at two points in time: after systole, when the heart is contracted and the ventricles are at their minimum volume, and after diastole, when the heart is at its largest volume.

@westurner

Diagnosing lung cancer from DICOM CT images
https://www.kaggle.com/c/data-science-bowl-2017/data

In this dataset, you are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each image contains a series with multiple axial slices of the chest cavity. Each image has a variable number of 2D slices, which can vary based on the machine taking the scan and patient.

The DICOM files have a header that contains the necessary information about the patient id, as well as scan parameters such as the slice thickness.

The competition task is to create an automated method capable of determining whether or not the patient will be diagnosed with lung cancer within one year of the date the scan was taken. The ground truth labels were confirmed by pathology diagnosis.

@westurner

westurner commented May 1, 2018

@kmader

kmader commented May 1, 2018

Perhaps this is out of scope for the sprint, but it is currently significantly easier to build a neural network in keras for image classification or segmentation than to do the same in sklearn. I have a few examples here of how to use sklearn pipelines with some manual transformers and fit functions, but it would be great if it were as easy in sklearn with decision trees as it is in keras.

Classification of Images with knearestneighbors:
http://nbviewer.jupyter.org/github/kmader/Quantitative-Big-Imaging-2018/blob/master/Lectures/05-SupervisedApproaches.ipynb#Classification

Segmentation with decision trees
http://nbviewer.jupyter.org/github/kmader/Quantitative-Big-Imaging-2018/blob/master/Lectures/05-SupervisedApproaches.ipynb#Include-Position-Information

@rmsare

rmsare commented May 1, 2018

Thanks for this. I'm interested in using dask for scalable analysis/modeling of topographic data and satellite imagery for Earth science applications. There might be enough cross-over with other application areas in remote sensing to make this worth pursuing.

This usually involves distributable image processing operations for computing gradient, curvature, or other derivative quantities, or differencing for change detection between acquisitions.
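As a rough illustration of these operations on a single elevation grid, here is a NumPy sketch. The function names are hypothetical, and the Laplacian is used only as a simple stand-in for curvature; geomorphology packages define several more specific curvature measures.

```python
import numpy as np

def slope_and_curvature(dem, spacing=1.0):
    """Gradient magnitude and Laplacian of an elevation grid."""
    dem = np.asarray(dem, dtype=float)
    dz_dy, dz_dx = np.gradient(dem, spacing)
    slope = np.hypot(dz_dx, dz_dy)
    # Second derivatives along each axis; their sum is the Laplacian.
    d2z_dy2 = np.gradient(dz_dy, spacing, axis=0)
    d2z_dx2 = np.gradient(dz_dx, spacing, axis=1)
    return slope, d2z_dx2 + d2z_dy2

def elevation_change(dem_before, dem_after):
    """Difference two co-registered grids for change detection."""
    return np.asarray(dem_after, dtype=float) - np.asarray(dem_before, dtype=float)

# Toy example: a tilted plane, then the same plane raised by one unit.
y, x = np.mgrid[0:20, 0:20].astype(float)
dem = 2.0 * x + 3.0 * y
slope, curvature = slope_and_curvature(dem)
change = elevation_change(dem, dem + 1.0)
```

All three are local stencil operations, so each output pixel depends only on a small neighbourhood. That is what makes them candidates for chunked, distributed execution.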

Existing projects like landlab implement a lot of this functionality with numpy only.

Examples of more complex tasks are segmenting/tracking landscape features like river channels or routing flow over elevation grids.

There aren't a lot of benchmarks or challenges directly related to topographic data, but there are many public data sources. e.g.:

  1. AWS Terrain Tiles: SRTM tiles, variable resolution, global coverage (also AWS Landsat PDS)
  2. OpenTopography: high-resolution elevation data (< 2m) from airborne lidar

Elevation data has the advantage of often being served as tiled rasters which makes distributing operations that might require neighboring tiles a little easier. Same for computations that might be better performed at a certain resolution/zoom level, or sequence of zoom levels. This could make it an interesting dask use case compared to workflows that operate on individual, independent images from a large set.
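The neighbouring-tile requirement can be sketched as follows: each tile is read together with a halo of surrounding pixels, processed, and trimmed back. This is a NumPy-only illustration with hypothetical helper names (`apply_tiled`, `laplacian`); dask.array provides the same idea through `map_overlap`, which is not used here.

```python
import numpy as np

def apply_tiled(image, func, tile, halo):
    """Apply a shape-preserving func tile by tile, with a halo of neighbours."""
    image = np.asarray(image, dtype=float)
    padded = np.pad(image, halo, mode="edge")
    out = np.empty_like(image)
    for r in range(0, image.shape[0], tile):
        for c in range(0, image.shape[1], tile):
            r1 = min(r + tile, image.shape[0])
            c1 = min(c + tile, image.shape[1])
            # In padded coordinates the tile plus halo spans r .. r1 + 2 * halo.
            block = padded[r:r1 + 2 * halo, c:c1 + 2 * halo]
            result = func(block)
            # Trim the halo so only the tile's own pixels are written back.
            out[r:r1, c:c1] = result[halo:halo + (r1 - r), halo:halo + (c1 - c)]
    return out

def laplacian(a):
    """Five-point stencil; np.roll wraps, so the outermost ring is unreliable."""
    return (np.roll(a, 1, 0) + np.roll(a, -1, 0)
            + np.roll(a, 1, 1) + np.roll(a, -1, 1) - 4 * a)

image = np.random.default_rng(0).random((10, 12))
tiled = apply_tiled(image, laplacian, tile=4, halo=1)
# Reference: the same stencil on the whole edge-padded image, then trimmed.
reference = laplacian(np.pad(image, 1, mode="edge"))[1:-1, 1:-1]
```

The halo must be at least as wide as the stencil's reach (one pixel here). With that condition met, the tiled result matches the whole-image result, and each loop iteration is an independent task.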

Deep learning was mentioned above, and interest in DL applications for satellite imagery has spawned quite a few challenges like:

  1. SpaceNet building detection
  2. DSTL feature detection
  3. Planet Labs deforestation challenge

Maybe some of these data could be adapted for a dask-ified segmentation or feature detection task?

@rabernat

rabernat commented May 3, 2018

@tjcrone has recently created an amazing image dataset based on the OOI CamHD video. He might have some examples to suggest here.

@jakirkham
Contributor

Another option for imaging data is Neurofinder, which has some curated calcium imaging datasets with ground truth. This project was set up as part of a competition, so different algorithms are benchmarked and ranked on the leaderboard. Some of the entries include references describing how a particular algorithm was run.

@jakirkham
Contributor

jakirkham commented May 8, 2018

Also just a side note (a little off topic), there has been some discussion in issue ( dask/dask#3111 ) about pulling together different pieces of existing work using Dask for image processing into a project called dask-image. Mentioning it here in case this is of interest to anyone.

Edit: Broke this out as issue ( #11 ).

@mrocklin
Member Author

mrocklin commented May 8, 2018

Thanks for the examples all!

The Neurofinder project looks especially nice to me. It has a clear dataset that is easy to access and well explained. There is a clear problem to solve that is accessible to non-experts. And there are several implementations to compare to. Nice.

@emmanuelle

The tutorial I wrote on tomography image segmentation http://emmanuelle.github.io/segmentation-of-3-d-tomography-images-with-python-and-scikit-image.html is a bit outdated (the link to the data is broken but I will update it) but it's a good example of a typical workflow for materials science tomography images.

Any advice on where to put an open data set?

@mrocklin
Member Author

mrocklin commented May 19, 2018 via email

@emmanuelle

@mrocklin thanks! I can put the image (~200 MB) on my server, but I wanted to know whether there was something more sustainable.

Regarding user problems: as a user, what I'm mostly interested in is accelerating some functions (especially bottleneck ones) by benefiting from a multicore implementation. Can we gain a x10 speed factor on a machine by using 10 cores? (even a x5 factor would be good!). At the moment I'm doing it "by hand" like in this gist https://gist.github.com/emmanuelle/91db4a366496ecb13693c8b513235c55
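For illustration, one generic "by hand" pattern looks like the sketch below. It is not the code from the gist above. `parallel_apply` is a hypothetical helper, and it is only valid when `func` treats each plane along axis 0 independently (no neighbourhood spanning the split). Threads are used here for simplicity; they help only when the function releases the GIL, as many NumPy operations do, and a process pool is the usual alternative otherwise.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_apply(volume, func, n_workers=4):
    """Split a volume along axis 0, process the pieces concurrently, rejoin."""
    pieces = np.array_split(volume, n_workers, axis=0)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order, so the pieces concatenate correctly.
        results = list(pool.map(func, pieces))
    return np.concatenate(results, axis=0)

volume = np.random.default_rng(0).random((16, 64, 64))
parallel = parallel_apply(volume, np.sqrt, n_workers=4)
serial = np.sqrt(volume)
```

Whether this approaches a factor equal to the core count depends on how much of the function's time is spent outside the GIL and on memory bandwidth, so it is worth timing on the actual bottleneck function.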

@jakirkham
Contributor

One of the participants at the ImageXD conference offered to share their data. It's 3-D X-ray tomography data (+time) of fiber bundles. Interest in this data includes identification of crack formation, tracking fiber movement, image registration, etc. Some of it lives in the Google Drive linked below. I expect we can get more if there's interest.

Ref: https://drive.google.com/drive/folders/1vLhv4iFleESxue3Ca3DYHYjbIQsShYCj?usp=sharing

@jni
Contributor

jni commented May 19, 2018

@emmanuelle, re persistent data sharing, I recently used https://osf.io/ for my PeerJ skan paper. It’s pretty great for archival (DOI).

@westurner

westurner commented May 19, 2018 via email
