Fix step number used by parallel io and OutputWriter::writeTimeStep #769

Merged
atgeirr merged 2 commits into OPM:master from the fix-step-used-by-parallel-io branch on Sep 7, 2016

Conversation

blattms
Member

@blattms blattms commented Jul 21, 2016

When running in parallel, a well state object with the well information of the whole grid needs to be constructed to gather the information from all processes. Previously, this was done with the report step exported by the timer. This was wrong for the following reason:
The output occurs after solving the time step, when the timer has already been incremented. This means that we constructed the well state for gathering the data for the next report step. Unfortunately, at that step some wells that we have computed results for might have been shut. In that case an exception with the message "global state does not contain well ..." was thrown.

This problem occurred for Model number 2 and might have been due to wells being shut because of banned cross flow. With this PR, Model number 2 runs through again with 2 processes.

With this PR we use the last (i.e. previous) report step for constructing this well state if this is not an initial write and not a substep. In addition, we pass the report step (instead of the substep number) to OutputWriter::writeTimeStep, as that is what the documentation requires.
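The rule described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the function name `wellStateStepNumber` and its parameters are hypothetical, and the sketch assumes the timer has already been incremented when output is written after a completed report step, but not after a substep.

```cpp
#include <cassert>

// Hypothetical helper (not the actual PR code): choose the report step whose
// schedule describes the wells that were actually simulated.
//   reportStepNum: report step from the timer at output time; assumed to be
//                  already incremented after a completed report step.
//   initialWrite:  true when the initial state is written before any step.
//   substep:       true when writing after an adaptive substep.
inline int wellStateStepNumber(int reportStepNum, bool initialWrite, bool substep)
{
    assert(reportStepNum >= 0);
    if (initialWrite || substep || reportStepNum == 0) {
        // The timer still points at the step that was simulated.
        return reportStepNum;
    }
    // The timer already points at the next report step; go back by one.
    return reportStepNum - 1;
}
```

Under these assumptions, output at the end of the first simulated report step (timer at 1) would query the schedule at step 0, where the wells that produced the results are still open.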

@blattms
Member Author

blattms commented Jul 21, 2016

I am really uneasy about ca5164e. It seems weird that this has worked before, given that the documentation says the parameter has to be a report step. Please double check that.

@dr-robertk
Member

@blattms: I think reportStep is wrong, but I need to double check. It may be that the documentation was not updated or the output code has changed in the meantime. I have no idea about the current state of the output.

@blattms
Member Author

blattms commented Jul 27, 2016

The number is used to get wells from the schedule in EclipseWriter::writeTimeStep. I assume that this results in undefined behaviour if it is not the report_step. In my debugging run, in the second adaptive time step for the first report step (=0), schedule.wells(1) gets called (this is SPE9).

@@ -344,7 +344,7 @@ namespace Opm
if (initConfig->restartRequested() && ((initConfig->getRestartStep()) == (timer.currentStepNum()))) {
Member Author

For consistency, this might need to be timer.reportStepNum(), too.

@blattms
Member Author

blattms commented Sep 1, 2016

Maybe @joakim-hove or @bska could shed some light on the right step number to use for output?
This PR is really critical for running model 2 in parallel.

My current impression is that at least one of the following holds when writing output:

  • there is no sanity check for the step number used in the writer
  • it just always happened that the number of report steps is larger than the maximum substep number used.
  • calling writeTimeStep multiple times with the same time step number either overwrites the previous values (unlikely), or ResInsight, other viz tools and the reader only take the latest written values into account.

@blattms blattms force-pushed the fix-step-used-by-parallel-io branch from 6511222 to 4a6be3d Compare September 2, 2016 14:54
@blattms
Member Author

blattms commented Sep 2, 2016

rebased to resolve conflicts with current master

@joakim-hove
Member

Sorry - saw the comment just now. Will try to see if I can give any input here.

@atgeirr
Member

atgeirr commented Sep 5, 2016

I am on this, reviewing.

@blattms
Member Author

blattms commented Sep 5, 2016

Great!
Please note that I just did a parallel run with this merged into the current master. Unfortunately, I am running into convergence problems for model 2 when using two processes. But this should be unrelated to the changes in this PR and is probably due to some already merged PR.

@@ -370,7 +377,7 @@ namespace Opm

*/

- eclWriter_->writeTimeStep(timer.currentStepNum(),
+ eclWriter_->writeTimeStep(timer.reportStepNum(),
Member

I am on board with this change. I agree, it seems weird that it worked! Reading the code in opm-output indicates to me this must be correct.

@joakim-hove
Copy link
Member

This might not answer anything; but anyway:

  1. The output writer takes three time related arguments: report_step, "true time" (i.e. a posix_time instance), and the duration of the simulation in days.
  2. The time related arguments are used as is, i.e. the output writer does not look forward or backwards in time based on the report step, and certainly does not manipulate it.
  3. The report step is used for two conceptually different things:
    i. For restart and summary files[*] the filename contains the report step, formatted as %04d.
    ii. When the output layer needs to look up dynamic properties at the time of writing, e.g. which wells are present, the report step is used as a dynamic index when looking up in the various Schedule classes.
  4. Which results are stored at which report step is determined by half-open intervals like this:
 ]<-------]<--------]<--------]<--------]<--------]
 0        1         2         3         4         5

The simulation illustrated consists of five report steps. Assuming they are all 1 day long, we will get:

CASE.X0000 : Initial state
CASE.X0001 : State after 1 day of simulation
CASE.X0002 : State after 2 days of simulation
...
CASE.X0005 : Final state - after 5 days of simulation.

[*] Non-unified, that is. For a better understanding of these matters it might be beneficial to remove the UNIFIN and UNIFOUT keywords from the deck; then the report step <-> filename mapping will be plainly obvious.
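To make the report step <-> filename mapping from point 3.i concrete, here is a minimal sketch of the %04d formatting for non-unified restart files. The helper `restartFileName` is made up for this illustration and is not the opm-output API; it only mirrors the CASE.X0000 ... CASE.X0005 pattern listed above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration only (not part of opm-output):
// build the non-unified restart file name for a given report step.
inline std::string restartFileName(const std::string& baseName, int reportStep)
{
    // Four digits are available in the suffix.
    assert(reportStep >= 0 && reportStep <= 9999);
    char suffix[8];
    // "X" followed by the report step, zero padded to four digits (%04d).
    std::snprintf(suffix, sizeof(suffix), "X%04d", reportStep);
    return baseName + "." + suffix;
}
```

For example, `restartFileName("CASE", 5)` yields `CASE.X0005`, the final state of the five-step example.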

@joakim-hove
Member

@atgeirr: To accomplish this, we increment the timer before calling the output, so that the output after step n will have the number (n+1). As you have discovered this goes wrong because the output facility then tries to access the schedule of the next step.

The cleanest way to approach this is not clear to me, but I guess one could think of step zero as the step from (-∞, START], i.e. the equilibration, and then step 1 goes from <START, DATES(1)].

A full solution must involve opm-output, and while I have not read the code there fully, I assume that the "report_step" passed is also used as sequence number? If so, that must be amended to increment it by one.

This might be less than perfect in opm-output at present, but my goal is that the output layer just takes the report step as an id and uses it unmodified as such.

@blattms
Member Author

blattms commented Sep 6, 2016

Thanks @atgeirr and @joakim-hove. That clarifies things.

So to sum this up for @joakim-hove's example:

Assume that we are simulating from day 1 0:00 until day 1 24:00 (the second interval in the sketch). For substeps we will be using report step 1 and the current time. At the end we will use report step 2 and the current time. In both cases the well information (shut/open) is defined by report step 1. To preserve this information in the parallel case I needed to introduce the wellStateStepNumber.

That means my patch is meaningful and does the right thing.

Now the point 3.ii by @joakim-hove seems to lead to vanishing information of the wells. Data of wells that are open in this and shut in the next report step might not get written (as the step passed is 2 and the state is read from the schedule by opm-output). Is that correct? I do not think that this is a problem but we should keep it in mind.

@blattms
Member Author

blattms commented Sep 6, 2016

BTW I might have messed up the dataset of Model 2. That could be the problem with my runs. master with this PR merged works fine for @GitPaean.

@atgeirr
Member

atgeirr commented Sep 7, 2016

After discussion with @joakim-hove we have come to the conclusion that the code in opm-output is the one that should be changed. The underlying problem is, as described in @joakim-hove's third point above, that "The report step is used for two conceptually different things", both to get the well schedule and to identify the restart point via file name or sequence number.

Those two numbers should be different for all cases but the initial state. Continuing with the 5-step example, the current master would pass 0, 1, ..., 5 to the output-writer, which is correct for the sequence numbers but fails when it gets the schedule (off by one, except for the initial call). This patch does the opposite error: it passes 0, 0, 1, ..., 4 to the output-writer, which gives the right schedule but the sequence number will be wrong in the output file(s).

The proper solution is therefore to change opm-output so that it will subtract one from the sequence number before getting the schedule, unless already zero. With that change, the current master branch will give the correct result and this PR should be closed. We should keep it open until such a change has been implemented and tested to work though, since it provides a workaround (if you can live with wrong filenames/sequence numbers).
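The proposed mapping can be sketched as below. This is only an illustration of the arithmetic, not the eventual opm-output change; the names `scheduleIndexFor` and `scheduleIndices` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: derive the schedule index from the sequence number
// by subtracting one, unless the sequence number is already zero.
inline int scheduleIndexFor(int sequenceNumber)
{
    assert(sequenceNumber >= 0);
    return std::max(0, sequenceNumber - 1);
}

// Schedule indices for the sequence numbers 0, 1, ..., numReportSteps.
inline std::vector<int> scheduleIndices(int numReportSteps)
{
    std::vector<int> result;
    for (int seq = 0; seq <= numReportSteps; ++seq) {
        result.push_back(scheduleIndexFor(seq));
    }
    return result;
}
```

For the 5-step example, the sequence numbers 0, 1, ..., 5 would map to the schedule indices 0, 0, 1, ..., 4, so the file names keep the sequence number while the well lookup uses the step that was actually simulated.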

@joakim-hove has taken on the task of drafting the fix in opm-output.

@blattms
Member Author

blattms commented Sep 7, 2016

> This patch does the opposite error: it passes 0, 0, 1, ..., 4 to the output-writer, which gives the right schedule but the sequence number will be wrong in the output file(s).

IMHO this statement is not true. The reportstep number is not changed by my PR. I just correct the number used in ParallelDebugOutput to query the wellstate from the schedule (this is where your 0, 0, ... applies). I cannot see how changes in opm-output will fix the problem in ParallelDebugOutput. What am I missing?

@blattms
Member Author

blattms commented Sep 7, 2016

> We should keep it open until such a change has been implemented and tested to work though, since it provides a workaround (if you can live with wrong filenames/sequence numbers).

I do not think that this PR is a workaround for the problem that @joakim-hove wants to solve. These seem unconnected:
My problem is confined to the parallel output in ParallelDebugOutput. There we need to construct the global well state (i.e. all the wells active currently in the grid). This is done by querying the schedule using the reportstep passed to the output writer which is 1 if we write after simulating the first day. Ergo we actually get the global well state of the second day but the local problems still use the results/well state of the first day.
If there is a similar problem in opm-output writing the well state of day 1 (that comes from the simulator) to the well state of day 2 in the schedule then this seems unrelated.

@atgeirr
Member

atgeirr commented Sep 7, 2016

After off-line communication I have realized that this fix is needed in addition to the changes proposed for opm-output. I therefore intend to merge this unless someone disagrees strongly (and soon...).

@atgeirr atgeirr merged commit 4d2d004 into OPM:master Sep 7, 2016
@blattms blattms deleted the fix-step-used-by-parallel-io branch September 8, 2016 12:52