Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I/O module: where should I/O read calls leave the fileReader on failure? #21345

Closed
mppf opened this issue Jan 9, 2023 · 22 comments
Closed

I/O module: where should I/O read calls leave the fileReader on failure? #21345

mppf opened this issue Jan 9, 2023 · 22 comments

Comments

@mppf
Copy link
Member

mppf commented Jan 9, 2023

If there was an error reading, such as a formatting error, a I/O read call will generally throw. Supposing that the error is caught, what happens to the channel's position? Some possibilities:

  • it could go back to where the read call started
  • it could have read some data and discarded it
  • it could have read some data and returned it

Note that the answer to this question has interplay with quite a few other elements of the I/O function under consideration.

  • In order to go back to where the read call started, the fileReader has to do buffering, which can have a performance impact. In particular, any read data would have to end up in the fileReader buffer for potentially being read again in a different way. (Maybe, for something like a binary read into an array, it could do the OS read into the array and then copy it to the buffer if there was a failure? Not sure how useful that is).
  • In order to return partially read data, the read call has to work with a ref intent formal that it is updating, rather than an out intent formal or something that is being returned.
  • When returning partially read data, how can the routine communicate how much was read? It can't do so through a return value or out formal since these are never initialized if it throws. Perhaps such calls need to not throw?

See also #19496 (comment) which points out that readBinary(array) is throwing if the entire array is not read, but it also stores the data directly in the ref formal array, without indicating how many elements were read. This takes key information away if the error is caught and error recovery is attempted.

Perhaps we want use different strategies on different calls. "Go back to where the read started" seems like a lot for readBinary(array) but it seems more reasonable for read(myInt).

@mppf mppf changed the title where should I/O read calls leave the fileReader on failure? I/O module: where should I/O read calls leave the fileReader on failure? Jan 9, 2023
@bradcray
Copy link
Member

bradcray commented Jan 9, 2023

Perhaps we want use different strategies on different calls. "Go back to where the read started" seems like a lot for readBinary(array) but it seems more reasonable for read(myInt).

This seems right to me, specifically for the cases you mention here. Could the guiding philosophy be "Routines that can read a variable/unknown amount of data (and then return that data) leave it where the failure occurs and those that are more "All or nothing" will revert it? (that seems like what you may be proposing, though it's not obvious to me).

Other cases to think about would be ones like read(myInt, myBool, myReal) or read(int, bool, real) when the input contains an int and a bool, but not a real. The second seems like it would need to throw (and therefore revert the file position, ideally?) since it can't return all three things and can't tell you it hasn't. But the first could either do this same thing for consistency, or it could read 2 (of 3) arguments and return 2 to let you know it had done so, leaving the file position after the bool (I suppose that under the hood, it is probably still reverting the file position after the failure to read myReal if it had to read something to determine that it was not in fact a real?)

@bradcray
Copy link
Member

bradcray commented Jan 9, 2023

readf("My int is %i, my bool is %b, and my real is %r", myInt, myBool, myReal); is a third variation on the previous example (and one where I sometimes wished it would return how many things it was able to read during AoC 2022, though I think there are some tough questions about how to treat the string literal matches).

@mppf
Copy link
Member Author

mppf commented Jan 9, 2023

Right, another tricky case is read(myInt, myArray, myBool). If we can't find a valid bool at the end, do we have to pretend we didn't read the array?

@lydia-duncan
Copy link
Member

readf("My int is %i, my bool is %b, and my real is %r", myInt, myBool, myReal); is a third variation on the previous example (and one where I sometimes wished it would return how many things it was able to read during AoC 2022, though I think there are some tough questions about how to treat the string literal matches).

That case is interesting - my first attempt to support it would be with returning the position inside the format string that caused problems but the values that could go into each of the different format strings can vary in length and the problem could be with the string it is trying to match (e.g. it could say "My int is 15, my bool is false, but my real is 3.5"). So maybe refer to each of the arguments including the format string?

@bradcray
Copy link
Member

bradcray commented Jan 9, 2023

Right, another tricky case is read(myInt, myArray, myBool). If we can't find a valid bool at the end, do we have to pretend we didn't read the array?

For any of the read(my*) variations, I think my intuition would be to treat it the same as:

read(my*);  // arg 1
read(my*);  // arg 2
read(my*);  // arg 3

and then to return how many of them were successful; so we'd only need to rewind the cursor to the beginning of the last unsuccessful one.

For the readf() case, my intuition would also be to return an integer count, where a perfect read would return 1 per % argument + 1 per string literal between % arguments and the cursor would be left at whichever of those didn't match properly. However, I don't think this is what C scanf() does (? I think it only counts the % arguments matched or not).

[edit: so, for example, readf("%i%b%r", ...) could only return 0–3, but readf(%i %b %r", ...) could return 0–5 (1=%i, 2 = whitespace, 3 = %b, etc.) and the original example could return 0–6 (1=first string literal, 2=bool, 3=next string literal, etc.)]

@mppf
Copy link
Member Author

mppf commented Jan 9, 2023

I feel like having read(a, b, c) return 1 2 or 3 for the number of elements read only makes sense on an early EOF. What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important. (Another issue is that people might forget to check the returned value).

@bradcray
Copy link
Member

bradcray commented Jan 9, 2023

What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important.

That seems very different and new relative to what we've traditionally supported, isn't it? If so, it feels "maybe nice?" but not necessary to me, at least for 2.0.

Another issue is that people might forget to check the returned value

We've discussed having the compiler warn about dropping return values on the floor, which would help with this.

@mppf
Copy link
Member Author

mppf commented Jan 10, 2023

What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important.

That seems very different and new relative to what we've traditionally supported, isn't it? If so, it feels "maybe nice?" but not necessary to me, at least for 2.0.

I'm very confident that you can write such patterns today. You could have your own readThis that checks for a particular format. A similar example that might be more obvious is reading a tuple but never finding an opening (. As far as I know, that throws today, whether the tuple is read first or last in a multi-variable-read call.

@bradcray
Copy link
Member

Sorry, what I was trying to say is that our read() interfaces don't seem to currently say things like "You said to read a real, but I found a letter instead", do they? I think you're saying that a motivated user could do this for their own types by throwing new errors, but given that read(a, b, c) already throws + returns a bool, I don't see anything about making it throw + returning an int to be different than that... What am I missing?

@mppf
Copy link
Member Author

mppf commented Jan 10, 2023

What am I missing?

I'm not sure, but let's move to a code example to make it more concrete.

use IO;

var f = openTempFile();

f.writer().write("1 hello");

var reader = f.reader();
writeln("channel position is ", reader.offset());

try {
  var x: int;
  var y: 2*int;
  var z: string;
  reader.read(x, y, z); // input is "1 hello", so what does this do?
} catch e {
  writeln("caught error ", e);
  writeln("channel position is ", reader.offset());
}

What happens when we actually run this?

  • the reader starts out at position 0
  • it successfully reads an integer into x and advances the channel position to 1
  • it tries to read ( to indicate the start of a tuple and fails, so throws
  • the error is caught, and the channel position is reported at 1

This issue is asking what the channel position should be in this case (and similar cases). Where the 2 patterns I am expecting we will use are:

  • throw if everything requested wasn't read & move the channel position back to the start of the read so some other way of reading can be attempted; or
  • return something to indicate how much was read & leave it to the user to figure out what to do next. If they need to be able to retry somehow, they should use mark themselves.

Now, I interpreted your comments as potentially suggesting that, in this case, read(x, y, z) should return 1 ("I read one of the things") and not throw. That would seem surprising to me in this case (but not if EOF was reached after reading one integer). Note that, if it throws, we don't get to also have the return value indicate how many things were read, because the return value doesn't exist if it throws.

@bradcray
Copy link
Member

Note that, if it throws, we don't get to also have the return value indicate how many things were read, because the return value doesn't exist if it throws.

Oh shoot, you're right that this is what I was missing (I was thinking it could both return 1 and throw. So we could potentially collapse down the responses after #21345 (comment) which I think is where I took us off-course.

@jeremiah-corrado
Copy link
Contributor

In a design discussion on this topic, we decided that reading methods on the fileReader should revert back to their initial position whenever a read fails. This is particularly important for cases where multiple values are read, and the user may want to attempt to read something else if a read fails.

With this change, the following example (from above):

use IO;

var f = openTempFile();

f.writer().write("1 hello");

var reader = f.reader();
writeln("channel position is ", reader.offset());

try {
  var x: int;
  var y: 2*int;
  var z: string;
  reader.read(x, y, z);
} catch e {
  writeln("caught error ", e);
  writeln("channel position is ", reader.offset());
}

would report a file offset of 0.

We also discussed the possibility of including the offset and the number of entries read in the error type, so that a user could query those values when handling the error. This would be a non-breaking change, so we decided that it could be designed and added post-2.0.

@mppf
Copy link
Member Author

mppf commented Jun 29, 2023

@jeremiah-corrado - in the discussion, I was saying that people could use a single-argument read if this became a performance issue. But if we have read(myArray), I would think it would need to mark/revert on failure for similar reasons. Does it already behave this way? If not, are we concerned that this is not tractable due to increased memory demands / performance overhead?

@jeremiah-corrado
Copy link
Contributor

It doesn't look like read does a mark/commit when reading a single argument (array or otherwise). I agree that the proposed change should include that case though.

I'm not sure what should be done about the memory demands or potential performance drawbacks though. A couple of ideas:

  1. make the mark/commit behavior param-configurable per fileReader. Example:
var nonRevertingReader = myFile.reader(revertOnErrors=false);

(We'd need to decide which mode stdin uses)

  1. add a lower level readArray method to the fileReader, that either doesn't buffer the whole read, or allows the user to control that behavior with a param argument. Example:
reader.readArray(myLargeArray, revertOnError=false);

@mppf
Copy link
Member Author

mppf commented Jun 29, 2023

In principle, I think we could alternatively add revertOnErrors=false to individual read calls (but for the varargs + default argument situation we might need a compiler adjustment; I am always forgetting what works here)

@mppf
Copy link
Member Author

mppf commented Jun 29, 2023

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

@jeremiah-corrado
Copy link
Contributor

jeremiah-corrado commented Jun 29, 2023

(but for the varargs + default argument situation we might need a compiler adjustment; I am always forgetting what works here)

Me too, @benharsh, do you remember? I believe you ran into this question with the serializer design recently.

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

I agree, I think the default should be "do mark" with some way to opt-out.

@benharsh
Copy link
Member

Me too, @benharsh, do you remember? I believe you ran into this question with the serializer design recently.

The problem with adding a default argument alongside varargs is that the compiler will try binding the last actual to the default argument rather than the varargs. So calls like read(x,y,z) will try passing z to your revertOnErrors formal. Even without this problem, things would still be ambiguous if the last argument was a bool.

@jeremiah-corrado
Copy link
Contributor

@mppf, given that we could add a readArray method or a param revertOnErrors: bool = false to the fileReader type in the future to address performance/memory concerns, are you still okay with moving forward with the proposed direction?

@mppf
Copy link
Member Author

mppf commented Jul 3, 2023

I'm concerned we'll lose the performance in key benchmarks. It would be OK with me to try to implement it, but I think we should discuss it again.

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

I agree, I think the default should be "do mark" with some way to opt-out.

Actually I think the current situation is not so bad if we document it. I wasn't trying to say that the default has to be "do the marking for me" but rather that what we have today is close enough to a default of "don't mark" that it might just be better to ask people to mark() etc themselves instead of adding a new formal to the various read functions.

Note that we can add a new formal to the non-varargs read calls to indicate if marking is desired. That might be good enough even if it can't be used on the varargs one.

It seems like a tricky choice:

  • One the one hand, reducing I/O performance in the common cases doesn't seem like a good plan (TBD how much it matters, but the memory overhead implications could be arbitrarily bad, when reading big arrays)
  • On the other hand, the ability to respond to failure is very limited if the marking is not enabled by default, and that makes the current design seem flawed.

@jeremiah-corrado
Copy link
Contributor

jeremiah-corrado commented Jul 5, 2023

After some further offline discussion, we've decided to keep the current behavior of leaving the fileReader position at the location of an error, whenever an error occurs. This applies to all overloads of fileReader.read (and fileReader.readf).

Our reasoning is that the current behavior prioritizes performance over error-recovery, which we believe to be a better default behavior. The original proposal of automatically marking at the beginning of each read call, would require all read operations to be buffered in their entirety, which would have negative performance implications. For example:

  • To read a large array, the IO runtime would need to allocate buffer space large enough to accomodate the entire array (in addition to the array allocation itself), meaning that twice the array's memory footprint would be required for the operation.
  • Additionally, the overhead associated with manipulating the fileReader's mark-stack with each read operation could have negative performance affects, particularly for programs that do many small reads in a loop.

To address the concerns in this issue about how users can recover from errors that occur part-way through multiple read operations, the recommendation is to manually mark before a call to read and then revert on an error. See the following example:

proc readItems(fr: fileReader): (int, bool 3*real) throws {
  var x: int, y: bool, z: 3*real;

  fr.mark();  // mark the fileReader position before attempting to read
  try {
    fr.read(x, y, z);
  } catch e {
    fr.revert();  // there was an error; revert to the starting position
    throw new Error("unable to read values...");
  }
  fr.commit(); // there was no error; pop from the "mark stack" 

  return (x, y, z);
}

As a post-2.0 effort, we'd also like to investigate the possibility of providing some mechanism for opting into automatic marking, rather than writing the above mark-revert-commit logic manually (see: #22676)

@lydia-duncan
Copy link
Member

I don't believe we have any additional action items here (that aren't tracked in other places), closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants