I/O module: where should I/O read calls leave the fileReader on failure? #21345

mppf · 2023-01-09T20:39:09Z

If there was an error reading, such as a formatting error, a I/O read call will generally throw. Supposing that the error is caught, what happens to the channel's position? Some possibilities:

it could go back to where the read call started
it could have read some data and discarded it
it could have read some data and returned it

Note that the answer to this question has interplay with quite a few other elements of the I/O function under consideration.

In order to go back to where the read call started, the fileReader has to do buffering, which can have a performance impact. In particular, any read data would have to end up in the fileReader buffer for potentially being read again in a different way. (Maybe, for something like a binary read into an array, it could do the OS read into the array and then copy it to the buffer if there was a failure? Not sure how useful that is).
In order to return partially read data, the read call has to work with a ref intent formal that it is updating, rather than an out intent formal or something that is being returned.
When returning partially read data, how can the routine communicate how much was read? It can't do so through a return value or out formal since these are never initialized if it throws. Perhaps such calls need to not throw?

See also #19496 (comment) which points out that readBinary(array) is throwing if the entire array is not read, but it also stores the data directly in the ref formal array, without indicating how many elements were read. This takes key information away if the error is caught and error recovery is attempted.

Perhaps we want use different strategies on different calls. "Go back to where the read started" seems like a lot for readBinary(array) but it seems more reasonable for read(myInt).

The text was updated successfully, but these errors were encountered:

bradcray · 2023-01-09T21:18:45Z

Perhaps we want use different strategies on different calls. "Go back to where the read started" seems like a lot for readBinary(array) but it seems more reasonable for read(myInt).

This seems right to me, specifically for the cases you mention here. Could the guiding philosophy be "Routines that can read a variable/unknown amount of data (and then return that data) leave it where the failure occurs and those that are more "All or nothing" will revert it? (that seems like what you may be proposing, though it's not obvious to me).

Other cases to think about would be ones like read(myInt, myBool, myReal) or read(int, bool, real) when the input contains an int and a bool, but not a real. The second seems like it would need to throw (and therefore revert the file position, ideally?) since it can't return all three things and can't tell you it hasn't. But the first could either do this same thing for consistency, or it could read 2 (of 3) arguments and return 2 to let you know it had done so, leaving the file position after the bool (I suppose that under the hood, it is probably still reverting the file position after the failure to read myReal if it had to read something to determine that it was not in fact a real?)

bradcray · 2023-01-09T21:21:52Z

readf("My int is %i, my bool is %b, and my real is %r", myInt, myBool, myReal); is a third variation on the previous example (and one where I sometimes wished it would return how many things it was able to read during AoC 2022, though I think there are some tough questions about how to treat the string literal matches).

mppf · 2023-01-09T21:26:29Z

Right, another tricky case is read(myInt, myArray, myBool). If we can't find a valid bool at the end, do we have to pretend we didn't read the array?

lydia-duncan · 2023-01-09T21:26:58Z

readf("My int is %i, my bool is %b, and my real is %r", myInt, myBool, myReal); is a third variation on the previous example (and one where I sometimes wished it would return how many things it was able to read during AoC 2022, though I think there are some tough questions about how to treat the string literal matches).

That case is interesting - my first attempt to support it would be with returning the position inside the format string that caused problems but the values that could go into each of the different format strings can vary in length and the problem could be with the string it is trying to match (e.g. it could say "My int is 15, my bool is false, but my real is 3.5"). So maybe refer to each of the arguments including the format string?

bradcray · 2023-01-09T21:33:00Z

Right, another tricky case is read(myInt, myArray, myBool). If we can't find a valid bool at the end, do we have to pretend we didn't read the array?

For any of the read(my*) variations, I think my intuition would be to treat it the same as:

read(my*);  // arg 1
read(my*);  // arg 2
read(my*);  // arg 3

and then to return how many of them were successful; so we'd only need to rewind the cursor to the beginning of the last unsuccessful one.

For the readf() case, my intuition would also be to return an integer count, where a perfect read would return 1 per % argument + 1 per string literal between % arguments and the cursor would be left at whichever of those didn't match properly. However, I don't think this is what C scanf() does (? I think it only counts the % arguments matched or not).

[edit: so, for example, readf("%i%b%r", ...) could only return 0–3, but readf(%i %b %r", ...) could return 0–5 (1=%i, 2 = whitespace, 3 = %b, etc.) and the original example could return 0–6 (1=first string literal, 2=bool, 3=next string literal, etc.)]

mppf · 2023-01-09T22:21:22Z

I feel like having read(a, b, c) return 1 2 or 3 for the number of elements read only makes sense on an early EOF. What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important. (Another issue is that people might forget to check the returned value).

bradcray · 2023-01-09T22:52:10Z

What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important.

That seems very different and new relative to what we've traditionally supported, isn't it? If so, it feels "maybe nice?" but not necessary to me, at least for 2.0.

Another issue is that people might forget to check the returned value

We've discussed having the compiler warn about dropping return values on the floor, which would help with this.

mppf · 2023-01-10T17:03:00Z

What if, when trying to read b, it runs into a missing quote or other format error? Just returning "didn't read" seems to forget some of the information about the nature of the error that might be important.

That seems very different and new relative to what we've traditionally supported, isn't it? If so, it feels "maybe nice?" but not necessary to me, at least for 2.0.

I'm very confident that you can write such patterns today. You could have your own readThis that checks for a particular format. A similar example that might be more obvious is reading a tuple but never finding an opening (. As far as I know, that throws today, whether the tuple is read first or last in a multi-variable-read call.

bradcray · 2023-01-10T18:01:36Z

Sorry, what I was trying to say is that our read() interfaces don't seem to currently say things like "You said to read a real, but I found a letter instead", do they? I think you're saying that a motivated user could do this for their own types by throwing new errors, but given that read(a, b, c) already throws + returns a bool, I don't see anything about making it throw + returning an int to be different than that... What am I missing?

mppf · 2023-01-10T20:28:12Z

What am I missing?

I'm not sure, but let's move to a code example to make it more concrete.

use IO;

var f = openTempFile();

f.writer().write("1 hello");

var reader = f.reader();
writeln("channel position is ", reader.offset());

try {
  var x: int;
  var y: 2*int;
  var z: string;
  reader.read(x, y, z); // input is "1 hello", so what does this do?
} catch e {
  writeln("caught error ", e);
  writeln("channel position is ", reader.offset());
}

What happens when we actually run this?

the reader starts out at position 0
it successfully reads an integer into x and advances the channel position to 1
it tries to read ( to indicate the start of a tuple and fails, so throws
the error is caught, and the channel position is reported at 1

This issue is asking what the channel position should be in this case (and similar cases). Where the 2 patterns I am expecting we will use are:

throw if everything requested wasn't read & move the channel position back to the start of the read so some other way of reading can be attempted; or
return something to indicate how much was read & leave it to the user to figure out what to do next. If they need to be able to retry somehow, they should use mark themselves.

Now, I interpreted your comments as potentially suggesting that, in this case, read(x, y, z) should return 1 ("I read one of the things") and not throw. That would seem surprising to me in this case (but not if EOF was reached after reading one integer). Note that, if it throws, we don't get to also have the return value indicate how many things were read, because the return value doesn't exist if it throws.

bradcray · 2023-01-11T00:39:34Z

Note that, if it throws, we don't get to also have the return value indicate how many things were read, because the return value doesn't exist if it throws.

Oh shoot, you're right that this is what I was missing (I was thinking it could both return 1 and throw. So we could potentially collapse down the responses after #21345 (comment) which I think is where I took us off-course.

jeremiah-corrado · 2023-06-28T21:03:00Z

In a design discussion on this topic, we decided that reading methods on the fileReader should revert back to their initial position whenever a read fails. This is particularly important for cases where multiple values are read, and the user may want to attempt to read something else if a read fails.

With this change, the following example (from above):

use IO;

var f = openTempFile();

f.writer().write("1 hello");

var reader = f.reader();
writeln("channel position is ", reader.offset());

try {
  var x: int;
  var y: 2*int;
  var z: string;
  reader.read(x, y, z);
} catch e {
  writeln("caught error ", e);
  writeln("channel position is ", reader.offset());
}

would report a file offset of 0.

We also discussed the possibility of including the offset and the number of entries read in the error type, so that a user could query those values when handling the error. This would be a non-breaking change, so we decided that it could be designed and added post-2.0.

mppf · 2023-06-29T12:04:41Z

@jeremiah-corrado - in the discussion, I was saying that people could use a single-argument read if this became a performance issue. But if we have read(myArray), I would think it would need to mark/revert on failure for similar reasons. Does it already behave this way? If not, are we concerned that this is not tractable due to increased memory demands / performance overhead?

jeremiah-corrado · 2023-06-29T15:24:45Z

It doesn't look like read does a mark/commit when reading a single argument (array or otherwise). I agree that the proposed change should include that case though.

I'm not sure what should be done about the memory demands or potential performance drawbacks though. A couple of ideas:

make the mark/commit behavior param-configurable per fileReader. Example:

var nonRevertingReader = myFile.reader(revertOnErrors=false);

(We'd need to decide which mode stdin uses)

add a lower level readArray method to the fileReader, that either doesn't buffer the whole read, or allows the user to control that behavior with a param argument. Example:

reader.readArray(myLargeArray, revertOnError=false);

mppf · 2023-06-29T15:36:04Z

In principle, I think we could alternatively add revertOnErrors=false to individual read calls (but for the varargs + default argument situation we might need a compiler adjustment; I am always forgetting what works here)

mppf · 2023-06-29T15:37:13Z

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

jeremiah-corrado · 2023-06-29T16:03:11Z

(but for the varargs + default argument situation we might need a compiler adjustment; I am always forgetting what works here)

Me too, @benharsh, do you remember? I believe you ran into this question with the serializer design recently.

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

I agree, I think the default should be "do mark" with some way to opt-out.

benharsh · 2023-06-29T16:41:44Z

Me too, @benharsh, do you remember? I believe you ran into this question with the serializer design recently.

The problem with adding a default argument alongside varargs is that the compiler will try binding the last actual to the default argument rather than the varargs. So calls like read(x,y,z) will try passing z to your revertOnErrors formal. Even without this problem, things would still be ambiguous if the last argument was a bool.

jeremiah-corrado · 2023-06-30T16:10:52Z

@mppf, given that we could add a readArray method or a param revertOnErrors: bool = false to the fileReader type in the future to address performance/memory concerns, are you still okay with moving forward with the proposed direction?

mppf · 2023-07-03T13:35:30Z

I'm concerned we'll lose the performance in key benchmarks. It would be OK with me to try to implement it, but I think we should discuss it again.

Also, IMO having a default that is "don't mark" would basically be equivalent to what we have today; where to retry you would have to opt in to marking. But you can do that today just by calling mark() and commit() / revert() yourself.

I agree, I think the default should be "do mark" with some way to opt-out.

Actually I think the current situation is not so bad if we document it. I wasn't trying to say that the default has to be "do the marking for me" but rather that what we have today is close enough to a default of "don't mark" that it might just be better to ask people to mark() etc themselves instead of adding a new formal to the various read functions.

Note that we can add a new formal to the non-varargs read calls to indicate if marking is desired. That might be good enough even if it can't be used on the varargs one.

It seems like a tricky choice:

One the one hand, reducing I/O performance in the common cases doesn't seem like a good plan (TBD how much it matters, but the memory overhead implications could be arbitrarily bad, when reading big arrays)
On the other hand, the ability to respond to failure is very limited if the marking is not enabled by default, and that makes the current design seem flawed.

jeremiah-corrado · 2023-07-05T22:22:08Z

After some further offline discussion, we've decided to keep the current behavior of leaving the fileReader position at the location of an error, whenever an error occurs. This applies to all overloads of fileReader.read (and fileReader.readf).

Our reasoning is that the current behavior prioritizes performance over error-recovery, which we believe to be a better default behavior. The original proposal of automatically marking at the beginning of each read call, would require all read operations to be buffered in their entirety, which would have negative performance implications. For example:

To read a large array, the IO runtime would need to allocate buffer space large enough to accomodate the entire array (in addition to the array allocation itself), meaning that twice the array's memory footprint would be required for the operation.
Additionally, the overhead associated with manipulating the fileReader's mark-stack with each read operation could have negative performance affects, particularly for programs that do many small reads in a loop.

To address the concerns in this issue about how users can recover from errors that occur part-way through multiple read operations, the recommendation is to manually mark before a call to read and then revert on an error. See the following example:

proc readItems(fr: fileReader): (int, bool 3*real) throws {
  var x: int, y: bool, z: 3*real;

  fr.mark();  // mark the fileReader position before attempting to read
  try {
    fr.read(x, y, z);
  } catch e {
    fr.revert();  // there was an error; revert to the starting position
    throw new Error("unable to read values...");
  }
  fr.commit(); // there was no error; pop from the "mark stack" 

  return (x, y, z);
}

As a post-2.0 effort, we'd also like to investigate the possibility of providing some mechanism for opting into automatic marking, rather than writing the above mark-revert-commit logic manually (see: #22676)

lydia-duncan · 2023-08-22T22:22:32Z

I don't believe we have any additional action items here (that aren't tracked in other places), closing

mppf added type: Design area: Libraries / Modules type: Stabilization labels Jan 9, 2023

mppf changed the title ~~where should I/O read calls leave the fileReader on failure?~~ I/O module: where should I/O read calls leave the fileReader on failure? Jan 9, 2023

vasslitvinov mentioned this issue Jan 12, 2023

Bugs+semantics of regex and channel matching #20864

Closed

lydia-duncan mentioned this issue Jun 7, 2023

I/O module: express channel mark/revert/commit via context manager #19611

Open

lydia-duncan closed this as completed Aug 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I/O module: where should I/O read calls leave the fileReader on failure? #21345

I/O module: where should I/O read calls leave the fileReader on failure? #21345

mppf commented Jan 9, 2023 •

edited

Loading

bradcray commented Jan 9, 2023

bradcray commented Jan 9, 2023

mppf commented Jan 9, 2023

lydia-duncan commented Jan 9, 2023

bradcray commented Jan 9, 2023 •

edited

Loading

mppf commented Jan 9, 2023

bradcray commented Jan 9, 2023

mppf commented Jan 10, 2023

bradcray commented Jan 10, 2023

mppf commented Jan 10, 2023 •

edited

Loading

bradcray commented Jan 11, 2023

jeremiah-corrado commented Jun 28, 2023

mppf commented Jun 29, 2023

jeremiah-corrado commented Jun 29, 2023

mppf commented Jun 29, 2023 •

edited

Loading

mppf commented Jun 29, 2023 •

edited

Loading

jeremiah-corrado commented Jun 29, 2023 •

edited

Loading

benharsh commented Jun 29, 2023

jeremiah-corrado commented Jun 30, 2023

mppf commented Jul 3, 2023

jeremiah-corrado commented Jul 5, 2023 •

edited

Loading

lydia-duncan commented Aug 22, 2023

I/O module: where should I/O read calls leave the fileReader on failure? #21345

I/O module: where should I/O read calls leave the fileReader on failure? #21345

Comments

mppf commented Jan 9, 2023 • edited Loading

bradcray commented Jan 9, 2023

bradcray commented Jan 9, 2023

mppf commented Jan 9, 2023

lydia-duncan commented Jan 9, 2023

bradcray commented Jan 9, 2023 • edited Loading

mppf commented Jan 9, 2023

bradcray commented Jan 9, 2023

mppf commented Jan 10, 2023

bradcray commented Jan 10, 2023

mppf commented Jan 10, 2023 • edited Loading

bradcray commented Jan 11, 2023

jeremiah-corrado commented Jun 28, 2023

mppf commented Jun 29, 2023

jeremiah-corrado commented Jun 29, 2023

mppf commented Jun 29, 2023 • edited Loading

mppf commented Jun 29, 2023 • edited Loading

jeremiah-corrado commented Jun 29, 2023 • edited Loading

benharsh commented Jun 29, 2023

jeremiah-corrado commented Jun 30, 2023

mppf commented Jul 3, 2023

jeremiah-corrado commented Jul 5, 2023 • edited Loading

lydia-duncan commented Aug 22, 2023

mppf commented Jan 9, 2023 •

edited

Loading

bradcray commented Jan 9, 2023 •

edited

Loading

mppf commented Jan 10, 2023 •

edited

Loading

mppf commented Jun 29, 2023 •

edited

Loading

mppf commented Jun 29, 2023 •

edited

Loading

jeremiah-corrado commented Jun 29, 2023 •

edited

Loading

jeremiah-corrado commented Jul 5, 2023 •

edited

Loading