Fix step number used by parallel io and OutputWriter::writeTimeStep #769

blattms · 2016-07-21T13:38:04Z

When running in parallel, a well state object with the well information of the whole grid needs to be constructed to gather the information from all processes. Previously, this was done with the report step exported by the timer. This was wrong for the following reason:
The output occurs after solving the time step and the timer is already incremented. This means that we constructed the well state for gathering the data for the next report step, already. Unfortunately, at that step some wells that we have computed results for might have been shut. In that case an exception with message "global state does not contain well ..." was thrown.

This problem occurred for Model number 2 and might have been due to shut wells because of banned cross flow. With this PR Model number 2 runs through again with 2 processes.

With this PR we use the last report step if this is not an initial write and not a substep. In addition we pass the report step (instead of the substep number) to OutputWriter::writeTimeStep as that is needed according to the documentation.
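The step selection described above can be sketched as follows. This is a minimal illustration and not the code from the PR: the function name, its signature, and the flag `isSubstep` are assumptions, and the initial write is assumed to be recognizable by report step 0.

```cpp
#include <cassert>

// Hypothetical helper: picks the report step whose schedule is used to build
// the global well state when gathering output in parallel.
// reportStepNum: report step exported by the timer (assumed to be already
//                incremented once a full report step has been solved).
// isSubstep:     true when writing after an adaptive substep.
inline int wellStateStepNumber(int reportStepNum, bool isSubstep)
{
    assert(reportStepNum >= 0);
    const bool isInitialWrite = (reportStepNum == 0);
    if (!isSubstep && !isInitialWrite) {
        // The timer already points at the next report step; go back one.
        return reportStepNum - 1;
    }
    return reportStepNum;
}
```

With these assumptions, a write after finishing the first report step (timer at 1) uses the wells of step 0, while a substep inside report step 1 keeps using step 1.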

blattms · 2016-07-21T13:41:15Z

I am really uneasy about ca5164e. It seems weird that this has worked before, given that the documentation says the parameter has to be a report step. Please double check that.

dr-robertk · 2016-07-26T16:59:02Z

@blattms: I think reportStep is wrong, but I need to double check. It may be that the documentation was not updated or the output code has changed in the meantime. I have no idea about the current state of the output.

blattms · 2016-07-27T09:56:25Z

The number is used to get wells from the schedule in EclipseWriter::writeTimeStep. I assume that this results in undefined behaviour if it is not the report_step. In my debugging run in the second adaptive time step for the first report step (=0) schedule.wells(1) gets called (this is SPE9).
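The concern about an unchecked index can be illustrated with a toy schedule. Everything here is hypothetical (`ToySchedule` and its `wells` member are not the opm-parser API); it only shows what a sanity check on the step number could look like.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a schedule: wellsPerStep[i] lists the wells defined at
// report step i. Not the real Schedule class.
struct ToySchedule {
    std::vector<std::vector<std::string>> wellsPerStep;

    // Checked lookup: rejects indices that are not valid report steps.
    const std::vector<std::string>& wells(std::size_t reportStep) const
    {
        if (reportStep >= wellsPerStep.size()) {
            throw std::out_of_range("report step " + std::to_string(reportStep)
                                    + " is not in the schedule");
        }
        return wellsPerStep[reportStep];
    }
};
```

Such a check only catches indices beyond the last report step. A substep counter that happens to lie within range would still silently return the wells of an unrelated report step.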

blattms · 2016-07-27T09:57:46Z

opm/autodiff/SimulatorFullyImplicitBlackoilOutput.cpp

@@ -344,7 +344,7 @@ namespace Opm
            if (initConfig->restartRequested() && ((initConfig->getRestartStep()) == (timer.currentStepNum()))) {


Consistently this might need to be timer.reportStepNum(), too.

blattms · 2016-09-01T15:26:11Z

Maybe @joakim-hove or @bska could shed some light on the right step number to use for output?
This PR is really critical for running model 2 in parallel.

My current impression is that at least one of the following holds when writing output:

- There is no sanity check for the step number used in the writer.
- It just always happened that the number of report steps is always bigger than the maximum substep number used.
- Calling writeTimeStep multiple times with the same time step number either overwrites the previous values (unlikely), or ResInsight, other viz tools, and the reader only take the latest written values into account.

At least that is what is documented.


blattms · 2016-09-02T14:55:51Z

Rebased to resolve conflicts with current master.

joakim-hove · 2016-09-02T15:12:57Z

Sorry - saw the comment just now. Will try to see if I can give any input here.

atgeirr · 2016-09-05T09:35:52Z

I am on this, reviewing.

blattms · 2016-09-05T10:14:03Z

Great!
Please note that I just did a parallel run with this merged into the current master. Unfortunately, I am running into convergence problems for model 2 when using two processes. But this should be unrelated to the changes of this PR and is probably due to some already merged PR.

atgeirr · 2016-09-05T15:20:06Z

opm/autodiff/SimulatorFullyImplicitBlackoilOutput.cpp

@@ -370,7 +377,7 @@ namespace Opm

                */

-                eclWriter_->writeTimeStep(timer.currentStepNum(),
+                eclWriter_->writeTimeStep(timer.reportStepNum(),


I am on board with this change. I agree, it seems weird that it worked! Reading the code in opm-output indicates to me this must be correct.

joakim-hove · 2016-09-05T16:00:40Z

This might not answer anything; but anyway:

1. The output writer takes three time-related arguments: report_step, "true time" (i.e. a posix_time instance), and the duration of the simulation in days.
2. The time-related arguments are used as is, i.e. the output writer does not look forward or backward in time based on the report step, and certainly does not manipulate it.
3. The report step is used for two conceptually different things:
   i. For restart and summary files[*] the filename contains the report step, formatted as %04d.
   ii. When the output layer needs to look up dynamic properties at the time of writing, e.g. which wells are present, the report step is used as a dynamic index when looking up in the various Schedule classes.
4. Which results are stored at which report step follows half-open intervals like this:

 ]<-------]<--------]<--------]<--------]<--------]
 0        1         2         3         4         5

The simulation illustrated consists of five report steps; assuming they are all 1 day long, we will get:

CASE.X0000 : Initial state
CASE.X0001 : State after 1 day of simulation
CASE.X0002 : State after 2 days of simulation
...
CASE.X0005 : Final state - after 5 days of simulation.

[*] Non-unified, that is. To better understand these matters it may help to remove the UNIFIN and UNIFOUT keywords from the deck; the report step <-> filename mapping then becomes obvious.
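The report step to filename mapping for non-unified restart files can be sketched like this. The `.X` suffix and the `%04d` format are taken from the comment above; the function name `restartFileName` is made up for illustration and is not part of opm-output.

```cpp
#include <cstdio>
#include <string>

// Builds a non-unified restart file name such as "CASE.X0005".
// Illustrative helper only, not an opm-output function.
inline std::string restartFileName(const std::string& baseName, int reportStep)
{
    char suffix[16];
    // Zero-pad the report step to at least four digits.
    std::snprintf(suffix, sizeof(suffix), "%04d", reportStep);
    return baseName + ".X" + suffix;
}
```

For example, `restartFileName("CASE", 0)` yields `CASE.X0000` and `restartFileName("CASE", 5)` yields `CASE.X0005`, matching the list above.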

joakim-hove · 2016-09-05T16:06:31Z

@atgeirr: To accomplish this, we increment the timer before calling the output, so that the output after step n will have the number (n+1). As you have discovered this goes wrong because the output facility then tries to access the schedule of the next step.

What the cleanest way to approach this is is not clear to me, but I guess one could think of step zero as the step from (-∞, START], i.e. the equilibration, and then step 1 goes from <START, DATES(1)].

A full solution must involve opm-output, and while I have not read the code there fully, I assume that the "report_step" passed is also used as sequence number? If so, that must be amended to increment it by one.

This might be less than perfect in opm-output at present, but my goal is that the output layer just takes the report step as an id and uses it unmodified as such.

blattms · 2016-09-06T08:13:31Z

Thanks @atgeirr and @joakim-hove. That clarifies things.

So to sum this up for @joakim-hove's example:

Assume that we are simulating from day 1 0:00 until day 1 24:00 (the second interval in the sketch). For substeps we will be using report step 1 and the current time. At the end we will use report step 2 and the current time. In both cases the well information (shut/open) is defined by report step 1. To preserve this information in the parallel case I needed to introduce the wellStateStepNumber.

That means my patch is meaningful and does the right thing.

Now the point 3.ii by @joakim-hove seems to lead to vanishing information of the wells. Data of wells that are open in this and shut in the next report step might not get written (as the step passed is 2 and the state is read from the schedule by opm-output). Is that correct? I do not think that this is a problem but we should keep it in mind.

blattms · 2016-09-06T08:17:37Z

BTW I might have messed up the dataset of Model 2. That could be the problem with my runs. Master with this PR merged works fine for @GitPaean.

atgeirr · 2016-09-07T08:22:49Z

After discussion with @joakim-hove we have come to the conclusion that the code in opm-output is the one that should be changed. The underlying problem is, as described in @joakim-hove's third point above, that "The report step is used for two conceptually different things", both to get the well schedule and to identify the restart point via file name or sequence number.

Those two numbers should be different for all cases but the initial state. Continuing with the 5-step example, the current master would pass 0, 1, ..., 5 to the output-writer, which is correct for the sequence numbers but fails when it gets the schedule (off by one, except for the initial call). This patch does the opposite error: it passes 0, 0, 1, ..., 4 to the output-writer, which gives the right schedule but the sequence number will be wrong in the output file(s).

The proper solution is therefore to change opm-output so that it will subtract one from the sequence number before getting the schedule, unless already zero. With that change, the current master branch will give the correct result and this PR should be closed. We should keep it open until such a change has been implemented and tested to work though, since it provides a workaround (if you can live with wrong filenames/sequence numbers).
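The proposed mapping from sequence number to schedule index can be sketched as below. This is only an illustration of the rule "subtract one unless already zero"; the function names are hypothetical and do not come from opm-output.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical mapping: the schedule is queried at (sequence number - 1),
// except for the initial state, which stays at 0.
inline int scheduleStepFor(int sequenceNumber)
{
    return std::max(0, sequenceNumber - 1);
}

// Schedule indices for the sequence numbers 0..lastSequenceNumber.
inline std::vector<int> scheduleSteps(int lastSequenceNumber)
{
    std::vector<int> steps;
    for (int seq = 0; seq <= lastSequenceNumber; ++seq) {
        steps.push_back(scheduleStepFor(seq));
    }
    return steps;
}
```

For the 5-step example, the sequence numbers 0, 1, ..., 5 map to the schedule indices 0, 0, 1, ..., 4.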

@joakim-hove has taken on the task of drafting the fix in opm-output.

blattms · 2016-09-07T09:54:23Z

> This patch does the opposite error: it passes 0, 0, 1, ..., 4 to the output-writer, which gives the right schedule but the sequence number will be wrong in the output file(s).

IMHO this statement is not true. The reportstep number is not changed by my PR. I just correct the number used in ParallelDebugOutput to query the wellstate from the schedule (this is where your 0, 0, ... applies). I cannot see how changes in opm-output will fix the problem in ParallelDebugOutput. What am I missing?

blattms · 2016-09-07T10:09:39Z

> We should keep it open until such a change has been implemented and tested to work though, since it provides a workaround (if you can live with wrong filenames/sequence numbers).

I do not think that this PR is a workaround for the problem that @joakim-hove wants to solve. These seem unconnected:
My problem is confined to the parallel output in ParallelDebugOutput. There we need to construct the global well state (i.e. all the wells active currently in the grid). This is done by querying the schedule using the reportstep passed to the output writer which is 1 if we write after simulating the first day. Ergo we actually get the global well state of the second day but the local problems still use the results/well state of the first day.
If there is a similar problem in opm-output writing the well state of day 1 (that comes from the simulator) to the well state of day 2 in the schedule then this seems unrelated.
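The failure mode in the parallel gather can be illustrated with a toy model. `ToyGlobalWellState` is hypothetical and not the ParallelDebugOutput implementation; it assumes that a well shut at a given step is simply absent from the set of wells used to construct the global state.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Toy model of the gather step: the global well state only has slots for the
// wells listed by the schedule at the step used to construct it.
struct ToyGlobalWellState {
    std::map<std::string, double> rate;  // one value per well, for illustration

    explicit ToyGlobalWellState(const std::set<std::string>& wellsAtStep)
    {
        for (const auto& name : wellsAtStep) {
            rate[name] = 0.0;
        }
    }

    // Copies a locally computed value into the global state.
    void insertLocal(const std::string& well, double value)
    {
        auto it = rate.find(well);
        if (it == rate.end()) {
            throw std::logic_error("global state does not contain well " + well);
        }
        it->second = value;
    }
};
```

If the global state is built from the wells of day 2 while the local results still contain a well that was only open on day 1, `insertLocal` throws. Building it from the wells of day 1 avoids the mismatch.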

atgeirr · 2016-09-07T11:30:26Z

After off-line communication I have realized that this fix is needed in addition to the changes proposed for opm-output. I therefore intend to merge this unless someone disagrees strongly (and soon...).

blattms reviewed Jul 27, 2016

blattms added 2 commits September 2, 2016 14:41

OutputWriter::writeTimeStep needs the report step and not the sub step.

5ecead8

At least that is what is documented.

blattms force-pushed the fix-step-used-by-parallel-io branch from 6511222 to 4a6be3d Compare September 2, 2016 14:54

atgeirr reviewed Sep 5, 2016

GitPaean mentioned this pull request Sep 7, 2016

Do not assume ordering in an unordered_map when gathering data. #773

Merged

atgeirr merged commit 4d2d004 into OPM:master Sep 7, 2016

blattms deleted the fix-step-used-by-parallel-io branch September 8, 2016 12:52
