Compute metric per image and handler mutex in DistributedDataParallel #1563

Open · Nic-Ma opened this issue Jan 20, 2021 · 16 comments

Nic-Ma commented Jan 20, 2021

❓ Questions/Help/Support

Hi @vfdev-5 ,

I am writing an ignite handler to write the segmentation metrics of every image into a single CSV file as a summary, for example:

metrics.csv:
/data/spleen/image_1    0.85
/data/spleen/image_2    0.87
/data/spleen/image_3    0.91
... ...

The problems are:

  1. I tried to add logic to metrics.update() to cache every record and write to the CSV in metrics.compute(), but ignite.metrics only accepts an output_transform, so I can't extract the filenames from engine.state.batch (see the sketch after this list).
  2. I then changed to a separate handler for this feature, but ignite metrics only save the final average into engine.state.metrics, so it is not easy for a handler to get the metric value corresponding to every image.
  3. Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?
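
To illustrate point 1, here is a minimal sketch (the step function, output keys and filenames are made up): output_transform only ever receives engine.state.output, so filenames kept in engine.state.batch are out of reach from a metric.

import torch
from ignite.engine import Engine
from ignite.metrics import Accuracy


# Dummy inference step: the batch carries filenames, but the metric never sees them
def step(engine, batch):
    return {"y_pred": torch.rand(2, 10), "y": torch.randint(0, 10, (2,))}


evaluator = Engine(step)

# output_transform is called with engine.state.output only; there is
# no hook here to reach engine.state.batch["filenames"]
acc = Accuracy(output_transform=lambda output: (output["y_pred"], output["y"]))
acc.attach(evaluator, "accuracy")

evaluator.run([{"x": torch.rand(2, 3), "filenames": ["a.nii", "b.nii"]}])
print(evaluator.state.metrics["accuracy"])  # one averaged value, not per-file scores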

Thanks.


sdesrozis commented Jan 20, 2021

Hi @Nic-Ma!

  1. You are right, metrics work on (transformed) outputs from the process function, but the batch is embedded in engine.state. I suppose the filenames are provided by the dataset and packed into every batch when __getitem__ is called (sketched below). I could try on my side to see if I can help.
  2. Again, if we define the dataflow from dataset to metrics, we could go further together. What do you think?
  3. No idea at the moment.
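
A minimal sketch of the dataflow I have in mind for point 1 (the dataset and field names are hypothetical): the dataset packs the filename into each sample, and the default collate function keeps the strings as a list per batch.

import torch
from torch.utils.data import Dataset, DataLoader


class SegmentationDataset(Dataset):
    # Hypothetical dataset returning the filename alongside the data

    def __init__(self, filenames):
        self.filenames = filenames

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        # Real code would load the image/label from `filename`
        x = torch.rand(3, 32, 32)
        y = torch.randint(0, 2, (32, 32))
        return {"x": x, "y": y, "filename": filename}


loader = DataLoader(
    SegmentationDataset(["/data/spleen/image_1", "/data/spleen/image_2"]), batch_size=2
)
batch = next(iter(loader))
print(batch["filename"])  # default collate keeps strings as a list per batch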

vfdev-5 changed the title from "Handler mutex in DistributedDataParallel" to "Compute metric per image and handler mutex in DistributedDataParallel" on Jan 20, 2021

vfdev-5 commented Jan 20, 2021

Hi @Nic-Ma ,

As far as I understand your question, it is about evaluating a pretrained model on images and outputting the metric per image.

For a single process, IMO, this can be done like this:

import torch

from ignite.engine import Engine, Events
from ignite.metrics import Accuracy


num_classes = 10
batch_size = 4


data = [
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 0 * batch_size}" for i in range(batch_size)]
    },
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 1 * batch_size}" for i in range(batch_size)]
    },
]


def infer_step(engine, batch):
    x = batch["x"]  # a real step would run the model on x
    y = batch["y"]
    return {
        "y_pred": torch.rand(batch_size, num_classes, 32, 32),  # mock predictions
        "y": y,
    }


evaluator = Engine(infer_step)

acc_metric = Accuracy()


@evaluator.on(Events.ITERATION_COMPLETED)
def compute_metric_per_image(engine):
    print("compute_metric_per_image")

    output = engine.state.output
    batch = engine.state.batch
    assert len(output["y_pred"]) == len(output["y"])
    assert len(batch["filenames"]) == len(output["y"])

    # Replace state.metrics with a per-image list for this iteration
    engine.state.metrics = []
    for y_pred, y, filename in zip(output["y_pred"], output["y"], batch["filenames"]):
        acc_metric.reset()
        acc_metric.update((y_pred.unsqueeze(0), y.unsqueeze(0)))
        o = {
            "filename": filename,
            "accuracy": acc_metric.compute(),
        }
        engine.state.metrics.append(o)


@evaluator.on(Events.ITERATION_COMPLETED)
def save_files():
    print("append data to CSV:", evaluator.state.metrics)


evaluator.run(data)

Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?

Maybe the idea would be to gather all data from all participating processes onto a single one, and then use a barrier (idist.barrier()) so that a single process writes to the file.
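
A rough sketch of that idea (the helper name and row format are made up; here idist.all_gather collects a 1-D tensor of scores, and gathering the filename strings would need similar treatment):

import torch
import ignite.distributed as idist


def write_metrics_csv(local_scores: torch.Tensor, path: str = "metrics.csv") -> None:
    # Collect the per-image scores from every participating process;
    # all_gather concatenates the 1-D tensors along dim 0
    all_scores = idist.all_gather(local_scores)
    if idist.get_rank() == 0:
        with open(path, "w") as f:
            for i, score in enumerate(all_scores.tolist()):
                f.write(f"image_{i}\t{score:.4f}\n")
    # Make sure no process moves on before the file is written
    idist.barrier()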

vfdev-5 added the label "module: metrics" on Jan 20, 2021

Nic-Ma commented Jan 20, 2021

Hi @vfdev-5 ,

Thanks very much for your detailed example!
Your program works for the metric-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values on evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine.
Maybe we can keep the existing metrics logic, just save the metric of every image into engine.state in update(), then develop a MetricsSaver handler that reads the values from engine.state and writes them to the CSV file?
To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.

Thanks.


vfdev-5 commented Jan 20, 2021

Your program works for the metric-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values on evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine.

Yes, that's true. Since we need to compute the metric per image of the batch, we currently cannot simply attach a metric to an engine and get batch_size metric values on each iteration (that is, without averaging).
A hacky solution could be to set batch_size = 1 and use the BatchWise metric usage.
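
A rough sketch of that workaround (the data is synthetic and the names arbitrary): with batch_size = 1 and BatchWise usage, engine.state.metrics holds the value for the current image on every iteration.

import torch
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy, BatchWise


num_classes = 10
# batch_size = 1 -> one image per iteration
data = [
    {"y_pred": torch.rand(1, num_classes, 32, 32), "y": torch.randint(0, num_classes, (1, 32, 32))}
    for _ in range(4)
]

evaluator = Engine(lambda engine, batch: batch)

acc = Accuracy(output_transform=lambda out: (out["y_pred"], out["y"]))
acc.attach(evaluator, "acc_per_image", usage=BatchWise())


@evaluator.on(Events.ITERATION_COMPLETED)
def log_per_image(engine):
    # With BatchWise usage, this is the metric of the single image in the batch
    print(engine.state.iteration, engine.state.metrics["acc_per_image"])


evaluator.run(data)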

develop a MetricsSaver handler that reads the values from engine.state and writes them to the CSV file?

Yes, this makes sense, and we could put such a handler into the contrib module!

To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.

Yes, this could also be a possible solution 👍
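
For example, a minimal sketch of the per-rank variant (the helper and file naming are made up): each process appends to its own file, so no cross-process lock is needed on Unix or Windows.

import csv
import ignite.distributed as idist


def append_rows(rows):
    # One CSV per process removes the need for any cross-process lock
    path = f"metrics_rank_{idist.get_rank()}.csv"
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


# e.g. append_rows([("/data/spleen/image_1", 0.85), ("/data/spleen/image_2", 0.87)])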


vfdev-5 commented Jan 20, 2021

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?


Nic-Ma commented Jan 20, 2021

Thanks, let me do more experiments tomorrow.


sdesrozis commented Jan 20, 2021

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?

Let's think about it. From my side, I have a similar need, so I'm very interested in such a feature.


Nic-Ma commented Jan 22, 2021

Hi @vfdev-5 and @sdesrozis ,

I think it would at least be valuable to add engine as an argument to Metric.update() and Metric.compute(), so we can put some context in engine.state to share with other components, just the same as regular handlers.

Thanks.


vfdev-5 commented Jan 22, 2021

@Nic-Ma thanks for the idea. Which use case do you have in mind for using engine inside Metric.update/compute? Thanks


Nic-Ma commented Jan 22, 2021

Hi @vfdev-5 ,

I think context information is widely used in most software architectures, like a data bus for every component to produce and consume. Your engine is a good context object: all handlers can communicate through the engine, but metrics can't access it directly.
For example, if I want to save some temporary values (mean Dice of every image) during metric computation and print them out in another handler later, I could save them into engine.state.metrics_report after every iteration.
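
A tiny sketch of this pattern (engine.state.metrics_report is a hypothetical attribute, and the values are dummies): one handler produces values on the shared state, another consumes them later.

from ignite.engine import Engine, Events

evaluator = Engine(lambda engine, batch: batch)


@evaluator.on(Events.ITERATION_COMPLETED)
def produce(engine):
    # Stash per-image values on the shared state (hypothetical attribute)
    report = getattr(engine.state, "metrics_report", [])
    report.append({"iteration": engine.state.iteration, "mean_dice": 0.9})  # dummy value
    engine.state.metrics_report = report


@evaluator.on(Events.EPOCH_COMPLETED)
def consume(engine):
    print(engine.state.metrics_report)


evaluator.run([0, 1, 2])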

Thanks.


vfdev-5 commented Jan 22, 2021

I see, thanks for explaining!

add engine as an arg for Metric.update() and Metric.compute()

In my mind, the update/compute methods are sort of independent of Engine etc.
Maybe it would make sense to store engine once it is passed to attach:

def attach(self, engine: Engine, name: str, usage: Union[str, MetricUsage] = EpochWise()) -> None:
    self.engine = engine  # keep a reference so it is accessible from every method
    ...

Thus, it would be accessible in all methods. We discussed that previously and it seems it was not retained... @sdesrozis what do you think?

On the other hand, out-of-the-box metrics do not use engine (except for writing to state.metrics), so a custom metric could also take the engine as an argument at construction time, as sketched below.
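
For example, a rough sketch of such a custom metric (the class, attribute names and dummy scores are made up):

import torch
from ignite.engine import Engine
from ignite.metrics import Metric


class PerImageScore(Metric):
    # Hypothetical metric keeping an engine reference from construction

    def __init__(self, engine: Engine, output_transform=lambda x: x):
        self._engine = engine  # set before super().__init__, which calls reset()
        super().__init__(output_transform=output_transform)

    def reset(self):
        self._engine.state.metrics_report = []

    def update(self, output):
        y_pred, y = output
        # Dummy per-image scores; real code would compute e.g. Dice here
        for score in torch.rand(len(y)).tolist():
            self._engine.state.metrics_report.append(score)

    def compute(self):
        report = self._engine.state.metrics_report
        return sum(report) / max(len(report), 1)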


sdesrozis commented Jan 22, 2021

I agree that having engine as a hub for handlers is very convenient. It enables very strong customisation possibilities.

In the case of Metric (which is a handler like any other), engine is passed as an argument to each call of the attached methods (iteration_completed, etc.). The point is that Metric also offers a user API that is completely agnostic of engine. I think it is a good idea to keep this separation.

I even think we could go further in the separation and adopt composition rather than inheritance. But that's not relevant here.

Keeping a reference to engine when calling attach, as suggested by @vfdev-5, is a smart and simple option. If it solves @Nic-Ma's issue, I think we should do it.


vfdev-5 commented Jan 23, 2021

@sdesrozis the problem here is that we already have an existing API, the detach and is_attached methods, which always require engine; adding Metric.engine would require updating them too, I think.

sdesrozis commented

Yes, there could be side effects to control...


Nic-Ma commented Jan 25, 2021

Hi @vfdev-5 and @sdesrozis ,

Thanks for your discussion.
I added self.engine to the metrics in this draft MONAI PR: Project-MONAI/MONAI#1497.
I will delete it in the future once you add engine to the Metric base class.

Thanks.


vfdev-5 commented Jan 25, 2021

Sounds good! Thanks for the update, @Nic-Ma
