-
Note that such a design is also highly relevant for multi-scale networks, multiple instance learning designs, graph neural networks, and vision transformers, as these networks tend to divide larger patches into smaller tiles, store these either as a grid or a bag, and then use spatial relationships as part of the classification pipeline. Today, it is possible to generate bags by using the `PatchGenerator`.
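For example, with pyFAST one could collect the generated patches into a bag along these lines (a minimal sketch; the file path and patch parameters are illustrative, and exact keyword names may differ between pyFAST versions):

```python
import fast        # pyFAST
import numpy as np

importer = fast.WholeSlideImageImporter.create('WSI/A05.svs')  # illustrative path

# Generate small patches at 20x magnification from the WSI
generator = fast.PatchGenerator.create(256, 256, magnification=20).connect(importer)

# Collect the patch stream into a "bag" (a plain Python list) for MIL-style use
bag = [np.asarray(patch) for patch in fast.DataStream(generator)]
print(f'Bag contains {len(bag)} patches')
```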
-
A good idea, but we need to run some benchmarks to see if your theory holds before starting to implement something like this in the `PatchGenerator`.
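A minimal timing sketch of such a benchmark (assuming pyFAST's Python API; path, sizes, and magnification are illustrative) could compare the two tilings directly:

```python
import time
import fast

def time_patch_reading(width, height, path='WSI/A05.svs', magnification=20):
    """Read every patch of the given size once and report the wall-clock time."""
    importer = fast.WholeSlideImageImporter.create(path)
    generator = fast.PatchGenerator.create(
        width, height, magnification=magnification).connect(importer)
    start = time.perf_counter()
    count = sum(1 for _ in fast.DataStream(generator))  # pull all patches
    elapsed = time.perf_counter() - start
    print(f'{width}x{height}: {count} patches in {elapsed:.1f} s')

# Same tissue, two tilings: many small reads vs. 16x fewer large reads
time_patch_reading(132, 132)
time_patch_reading(528, 528)
```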
-
Patch reading speed is a massive bottleneck which affects most deployment pipelines. It becomes especially prominent when reading small image patches (e.g., `132 x 132` at 20x magnification). It does not really matter how large the network is; reading individual small patches takes A LOT of time and greatly impacts the overall deployment runtime.

A way to get around this problem could be to read larger patches from disk in the `PatchGenerator` PO, perform a second patch generation from each individual large patch, run inference on each small patch, stitch the results back into a large patch, and finally stitch those into a prediction for the full WSI.
If done correctly, and without making the patches too large, the memory overhead should not be much greater, and definitely feasible for low-end devices. In the application above, where I am running a model on `132 x 132` patches, one could instead read a large patch of size `528 x 528` (4x larger in each dimension => 4x4=16 small patches), which is similar to what we normally use for other networks anyway.

To better illustrate the idea, I have made a simple FPL that demonstrates what the pipeline could look like:
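Roughly along these lines (a minimal sketch in FAST's text pipeline format; the model path and exact attribute values are placeholders):

```
PipelineName "Two-level patch generation"
PipelineDescription "Read large patches, regenerate small patches, infer, stitch twice"

ProcessObject importer WholeSlideImageImporter
Attribute filename @@filename@@

ProcessObject largeGenerator PatchGenerator
Attribute patch-size 528 528
Attribute patch-magnification 20
Attribute patch-overlap 0.0
Input 0 importer 0

ProcessObject smallGenerator PatchGenerator
Attribute patch-size 132 132
Attribute patch-overlap 0.0
Input 0 largeGenerator 0

ProcessObject network NeuralNetwork
Attribute model "/path/to/model.onnx"
Input 0 smallGenerator 0

ProcessObject smallStitcher PatchStitcher
Input 0 network 0

ProcessObject largeStitcher PatchStitcher
Input 0 smallStitcher 0
```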
Running it results in an error.
Some notes:

- It did not work through the `runPipeline` CLI (nor was I expecting it to), as I believe the second `PatchStitcher` PO will be unable to stitch the already stitched results. It could also be the second `PatchGenerator` where the problem lies.
- Perhaps there should be dedicated `LargePatchGenerator` and `LargePatchStitcher` POs that handle this logic? Or of course, the original `PatchGenerator` and `PatchStitcher` POs could be made to handle this scenario. Perhaps that is easier?
- I set `patch-overlap=0.0` in the FPL, but of course nothing is stopping you from trying to get this working in an overlapping scenario as well.
- One could also gather the `4x4=16` patches into a batch (see the sketch after this list) - but if I remember correctly `TensorRT` was not compatible with batch inference? Or has that been resolved a while ago maybe (it used to be the case with UFF at least)?
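For reference, turning one `528 x 528` large patch into a `(16, 132, 132, 3)` batch is a cheap reshape in numpy (a minimal sketch, independent of FAST):

```python
import numpy as np

# One 528 x 528 RGB large patch, e.g. as read from disk (dummy data here)
large = np.zeros((528, 528, 3), dtype=np.uint8)

# Split into a 4x4 grid of 132 x 132 tiles and stack them as one batch
tiles = (large.reshape(4, 132, 4, 132, 3)
              .swapaxes(1, 2)
              .reshape(16, 132, 132, 3))
print(tiles.shape)  # (16, 132, 132, 3)
```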