Compute metric per image and handler mutex in DistributedDataParallel #1563

Open · Nic-Ma opened this issue Jan 20, 2021 · 16 comments

Nic-Ma commented Jan 20, 2021

❓ Questions/Help/Support

Hi @vfdev-5 ,

I am writing an ignite handler to write the segmentation metrics of every image into a single CSV file as a summary, for example:

metrics.csv:
/data/spleen/image_1    0.85
/data/spleen/image_2    0.87
/data/spleen/image_3    0.91
... ...

The problems are:

  1. I tried to add logic to metrics.update() to cache every record and write to the CSV in metrics.compute(), but ignite.metrics only accepts an output_transform, so I can't extract the filenames from engine.state.batch (see the sketch after this list).
  2. I then changed to a separate handler for this feature, but ignite metrics only save the final average into engine.state.metrics, so it is not easy for a handler to get the metric value corresponding to every image.
  3. Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?
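
To illustrate point 1, here is a minimal sketch (the step function, output keys and filenames are made up): output_transform only ever receives engine.state.output, so filenames kept in engine.state.batch are out of reach from a metric.

import torch
from ignite.engine import Engine
from ignite.metrics import Accuracy


# Dummy inference step: the batch carries filenames, but the metric never sees them
def step(engine, batch):
    return {"y_pred": torch.rand(2, 10), "y": torch.randint(0, 10, (2,))}


evaluator = Engine(step)

# output_transform is called with engine.state.output only; there is
# no hook here to reach engine.state.batch["filenames"]
acc = Accuracy(output_transform=lambda output: (output["y_pred"], output["y"]))
acc.attach(evaluator, "accuracy")

evaluator.run([{"x": torch.rand(2, 3), "filenames": ["a.nii", "b.nii"]}])
print(evaluator.state.metrics["accuracy"])  # one averaged value, not per-file scores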

Thanks.


sdesrozis commented Jan 20, 2021

Hi @Nic-Ma!

  1. You are right, metrics work on (transformed) outputs from the process function, but the batch is embedded in engine.state. I suppose the filenames are provided by the dataset and packed into every batch when __getitem__ is called (sketched below). I could try on my side to see if I can help.
  2. Again, if we define the dataflow from dataset to metrics, we could go further together. What do you think?
  3. No idea at the moment.
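
A minimal sketch of the dataflow I have in mind for point 1 (the dataset and field names are hypothetical): the dataset packs the filename into each sample, and the default collate function keeps the strings as a list per batch.

import torch
from torch.utils.data import Dataset, DataLoader


class SegmentationDataset(Dataset):
    # Hypothetical dataset returning the filename alongside the data

    def __init__(self, filenames):
        self.filenames = filenames

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        filename = self.filenames[index]
        # Real code would load the image/label from `filename`
        x = torch.rand(3, 32, 32)
        y = torch.randint(0, 2, (32, 32))
        return {"x": x, "y": y, "filename": filename}


loader = DataLoader(
    SegmentationDataset(["/data/spleen/image_1", "/data/spleen/image_2"]), batch_size=2
)
batch = next(iter(loader))
print(batch["filename"])  # default collate keeps strings as a list per batch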

vfdev-5 changed the title from "Handler mutex in DistributedDataParallel" to "Compute metric per image and handler mutex in DistributedDataParallel" on Jan 20, 2021

vfdev-5 commented Jan 20, 2021

Hi @Nic-Ma ,

As far as I understand your question, it is about evaluating a pretrained model on images and outputting the metric per image.

For a single process, IMO, this can be done like this:

import torch

from ignite.engine import Engine, Events
from ignite.metrics import Accuracy


num_classes = 10
batch_size = 4


data = [
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 0 * batch_size}" for i in range(batch_size)]
    },
    {
        "x": torch.rand(batch_size, 3, 32, 32),
        "y": torch.randint(0, num_classes, size=(batch_size, 32, 32)),
        "filenames": [f"file_{i + 1 * batch_size}" for i in range(batch_size)]
    },
]


def infer_step(engine, batch):
    x = batch["x"]  # a real step would run the model on x
    y = batch["y"]
    return {
        "y_pred": torch.rand(batch_size, num_classes, 32, 32),  # mock predictions
        "y": y,
    }


evaluator = Engine(infer_step)

acc_metric = Accuracy()


@evaluator.on(Events.ITERATION_COMPLETED)
def compute_metric_per_image(engine):
    print("compute_metric_per_image")

    output = engine.state.output
    batch = engine.state.batch
    assert len(output["y_pred"]) == len(output["y"])
    assert len(batch["filenames"]) == len(output["y"])

    # Replace state.metrics with a per-image list for this iteration
    engine.state.metrics = []
    for y_pred, y, filename in zip(output["y_pred"], output["y"], batch["filenames"]):
        acc_metric.reset()
        acc_metric.update((y_pred.unsqueeze(0), y.unsqueeze(0)))
        o = {
            "filename": filename,
            "accuracy": acc_metric.compute(),
        }
        engine.state.metrics.append(o)


@evaluator.on(Events.ITERATION_COMPLETED)
def save_files():
    print("append data to CSV:", evaluator.state.metrics)


evaluator.run(data)

Another problem is DistributedDataParallel: when I run the handler with multiple processes, how do you usually use a multi-processing lock to save content into one CSV file on both Unix and Windows?

Maybe the idea would be to gather all data from all participating processes onto a single one, and then use a barrier (idist.barrier()) so that a single process writes to the file.
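
A rough sketch of that idea (the helper name and row format are made up; here idist.all_gather collects a 1-D tensor of scores, and gathering the filename strings would need similar treatment):

import torch
import ignite.distributed as idist


def write_metrics_csv(local_scores: torch.Tensor, path: str = "metrics.csv") -> None:
    # Collect the per-image scores from every participating process;
    # all_gather concatenates the 1-D tensors along dim 0
    all_scores = idist.all_gather(local_scores)
    if idist.get_rank() == 0:
        with open(path, "w") as f:
            for i, score in enumerate(all_scores.tolist()):
                f.write(f"image_{i}\t{score:.4f}\n")
    # Make sure no process moves on before the file is written
    idist.barrier()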

vfdev-5 added the label "module: metrics" on Jan 20, 2021

Nic-Ma commented Jan 20, 2021

Hi @vfdev-5 ,

Thanks very much for your detailed example!
Your program works for the metric-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values on evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine.
Maybe we can keep the existing metrics logic, just save the metric of every image into engine.state in update(), then develop a MetricsSaver handler that reads the values from engine.state and writes them to the CSV file?
To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.

Thanks.


vfdev-5 commented Jan 20, 2021

Your program works for the metric-saving case, but it requires manually calling metrics.reset() and metrics.update() and setting values on evaluator.state.metrics, which should be done by the ignite.Metric base class when we attach metrics to the engine.

Yes, that's true. Since we need to compute the metric per image of the batch, we currently cannot simply attach a metric to an engine and get batch_size metric values on each iteration (that is, without averaging).
A hacky solution could be to set batch_size = 1 and use the BatchWise metric usage.
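
A rough sketch of that workaround (the data is synthetic and the names arbitrary): with batch_size = 1 and BatchWise usage, engine.state.metrics holds the value for the current image on every iteration.

import torch
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy, BatchWise


num_classes = 10
# batch_size = 1 -> one image per iteration
data = [
    {"y_pred": torch.rand(1, num_classes, 32, 32), "y": torch.randint(0, num_classes, (1, 32, 32))}
    for _ in range(4)
]

evaluator = Engine(lambda engine, batch: batch)

acc = Accuracy(output_transform=lambda out: (out["y_pred"], out["y"]))
acc.attach(evaluator, "acc_per_image", usage=BatchWise())


@evaluator.on(Events.ITERATION_COMPLETED)
def log_per_image(engine):
    # With BatchWise usage, this is the metric of the single image in the batch
    print(engine.state.iteration, engine.state.metrics["acc_per_image"])


evaluator.run(data)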

develop a MetricsSaver handler that reads the values from engine.state and writes them to the CSV file?

Yes, this makes sense, and we could put such a handler into the contrib module!

To simplify the DistributedDataParallel problem, I am thinking maybe every GPU process could just write a separate CSV file.

Yes, this could also be a possible solution 👍
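
For example, a minimal sketch of the per-rank variant (the helper and file naming are made up): each process appends to its own file, so no cross-process lock is needed on Unix or Windows.

import csv
import ignite.distributed as idist


def append_rows(rows):
    # One CSV per process removes the need for any cross-process lock
    path = f"metrics_rank_{idist.get_rank()}.csv"
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


# e.g. append_rows([("/data/spleen/image_1", 0.85), ("/data/spleen/image_2", 0.87)])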


vfdev-5 commented Jan 20, 2021

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?


Nic-Ma commented Jan 20, 2021

Thanks, let me do more experiments tomorrow.


sdesrozis commented Jan 20, 2021

which should be done by the ignite.Metric base class when we attach metrics to the engine.

@sdesrozis maybe we can think about another MetricUsage like InstanceWise?

Let's think about it. From my side, I have a similar need, so I'm very interested in such a feature.


Nic-Ma commented Jan 22, 2021

Hi @vfdev-5 and @sdesrozis ,

I think it would at least be valuable to add engine as an argument to Metric.update() and Metric.compute(), so we can put some context in engine.state to share with other components, just the same as regular handlers.

Thanks.


vfdev-5 commented Jan 22, 2021

@Nic-Ma thanks for the idea. Which use case do you have in mind for using engine inside Metric.update/compute? Thanks


Nic-Ma commented Jan 22, 2021

Hi @vfdev-5 ,

I think context information is widely used in most software architectures, like a data bus for every component to produce and consume. Your engine is a good context object: all handlers can communicate through the engine, but metrics can't access it directly.
For example, if I want to save some temporary values (mean Dice of every image) during metric computation and print them out in another handler later, I could save them into engine.state.metrics_report after every iteration.
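
A tiny sketch of this pattern (engine.state.metrics_report is a hypothetical attribute, and the values are dummies): one handler produces values on the shared state, another consumes them later.

from ignite.engine import Engine, Events

evaluator = Engine(lambda engine, batch: batch)


@evaluator.on(Events.ITERATION_COMPLETED)
def produce(engine):
    # Stash per-image values on the shared state (hypothetical attribute)
    report = getattr(engine.state, "metrics_report", [])
    report.append({"iteration": engine.state.iteration, "mean_dice": 0.9})  # dummy value
    engine.state.metrics_report = report


@evaluator.on(Events.EPOCH_COMPLETED)
def consume(engine):
    print(engine.state.metrics_report)


evaluator.run([0, 1, 2])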

Thanks.


vfdev-5 commented Jan 22, 2021

I see, thanks for explaining!

add engine as an arg for Metric.update() and Metric.compute()

In my mind, the update/compute methods are sort of independent of Engine etc.
Maybe it would make sense to store engine once it is passed to attach:

def attach(self, engine: Engine, name: str, usage: Union[str, MetricUsage] = EpochWise()) -> None:
    self.engine = engine  # keep a reference so it is accessible from every method
    ...

Thus, it would be accessible in all methods. We discussed that previously and it seems it was not retained... @sdesrozis what do you think?

On the other hand, out-of-the-box metrics do not use engine (except for writing to state.metrics), so a custom metric could also take the engine as an argument at construction time, as sketched below.
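
For example, a rough sketch of such a custom metric (the class, attribute names and dummy scores are made up):

import torch
from ignite.engine import Engine
from ignite.metrics import Metric


class PerImageScore(Metric):
    # Hypothetical metric keeping an engine reference from construction

    def __init__(self, engine: Engine, output_transform=lambda x: x):
        self._engine = engine  # set before super().__init__, which calls reset()
        super().__init__(output_transform=output_transform)

    def reset(self):
        self._engine.state.metrics_report = []

    def update(self, output):
        y_pred, y = output
        # Dummy per-image scores; real code would compute e.g. Dice here
        for score in torch.rand(len(y)).tolist():
            self._engine.state.metrics_report.append(score)

    def compute(self):
        report = self._engine.state.metrics_report
        return sum(report) / max(len(report), 1)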


sdesrozis commented Jan 22, 2021

I agree that having engine as a hub for handlers is very convenient. It enables very strong customisation possibilities.

In the case of Metric (which is a handler like any other), engine is passed as an argument to each call of the attached methods (iteration_completed, etc.). The point is that Metric also offers a user API that is completely agnostic of engine. I think it is a good idea to keep this separation.

I even think we could go further in the separation and adopt composition rather than inheritance. But that's not relevant here.

Keeping a reference to engine when calling attach, as suggested by @vfdev-5, is a smart and simple option. If it solves @Nic-Ma's issue, I think we should do it.


vfdev-5 commented Jan 23, 2021

@sdesrozis the problem here is that we already have an existing API, the detach and is_attached methods, which always require engine; adding Metric.engine would require updating them too, I think.

sdesrozis commented

Yes, there could be side effects to control...


Nic-Ma commented Jan 25, 2021

Hi @vfdev-5 and @sdesrozis ,

Thanks for your discussion.
I added self.engine to the metrics in this draft MONAI PR: Project-MONAI/MONAI#1497.
I will delete it in the future once you add engine to the Metric base class.

Thanks.


vfdev-5 commented Jan 25, 2021

Sounds good! Thanks for the update, @Nic-Ma
