Add a "prefill" method to IO::Buffered with read buffering #15075

Open
spuun opened this issue Oct 11, 2024 · 15 comments

Comments

@spuun (Contributor) commented Oct 11, 2024

Feature Request

I haven't found a way to "prefill" an IO::Buffered with read_buffering? == true. I want to fill its buffer with at least n bytes (n <= buffer_size).

A use case is that I'd like to peek into the buffer to see some bytes to autodetect what protocol a socket client is using. A peek call will trigger a fill, but if not enough data is read I have no way of telling the IO to read more data.

I'd like to do something like

    io.prefill_buffer(8) # I must have at least 8 bytes in the buffer
    if io.peek[0, 8] == "protocol".to_slice
      handle_protocol1_client(io)
    elsif io.peek[0, 5] == "proto".to_slice
      handle_protocol2_client(io)
    else
      handle_default_client(io)
    end
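For context, a sketch of why the current #peek is not enough here (the byte counts are illustrative assumptions):

```crystal
data = io.peek
# #peek fills the buffer only when it is empty, with at most one read:
# if the client has so far sent only 3 bytes, data.size == 3, and calling
# #peek again just returns the same 3 bytes. There is no way to ask the
# IO to read more without consuming what is already buffered.
```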
@ysbaddaden (Contributor)

Maybe a #peek(n) : Bytes overload? It would make sure the buffer has at least n bytes and could return a slice to the first n bytes instead of the current buffer.
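A sketch of the proposed contract (hypothetical; #peek(n) does not exist in the stdlib today):

```crystal
# Proposed semantics: read until the internal buffer holds at least n bytes
# (or EOF), then return a slice of the first n buffered bytes without
# consuming them.
header = io.peek(8) # => Bytes of size 8 in normal operation
```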

I like the generic prefill method, but looking at IO and IO::Buffered I'm not noticing other methods that could benefit from filling the buffer 🤔

@spuun (Contributor, Author) commented Oct 11, 2024

Maybe a #peek(n) : Bytes overload? It would make sure the buffer has at least n bytes and could return a slice to the first n bytes instead of the current buffer.

Agree!

@straight-shoota (Member)

This seems like it could be useful. However I'm wondering about the behaviour outside the happy path.

For example, what happens on EOF? I suppose the result is trimmed down to the size of available data? But then we must make sure to not assume peek(n).size == n.
Also, a significant difference from the base #peek is that the original performs at most one read operation and returns as soon as some data has arrived, but in order to fill the peek buffer to n we may need to read multiple times. This deviates from the expectations you would have of #peek. Not sure about the impact and relevance, but it's something to note.
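Under those proposed semantics, a caller would still need to handle a short result at EOF (hypothetical API; handle_short_stream is a placeholder):

```crystal
data = io.peek(8)
if data.size < 8
  # The stream ended before 8 bytes arrived; never assume data.size == 8.
  handle_short_stream(io)
end
```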

This filled peek also needs a memmove on the peek buffer. I don't see any immediate downside of that except that it can have a performance effect. But that should be negligible because the peek size is usually small.

Another positive effect I can think of is that this could perhaps allow us to simplify some algorithms. IO::Delimited for example has three different implementations for delimiter detection: #read has two different ones, depending on whether peek from upstream is available, and #peek has a third. Prefilling the peek buffer would also allow persisting the remainder of a peek read while filling up the rest to read more. I haven't thought this through entirely, but it might be an opportunity.

@spuun (Contributor, Author) commented Oct 11, 2024

Currently I've done https://github.com/cloudamqp/lavinmq/blob/ba4adbddaa76dfadfe8310df0b42e9d1c0d52e7c/src/stdlib/io_buffered.cr in LavinMQ. Maybe I'm missing a case or two, and I know it's only used for sockets.

@straight-shoota (Member) commented Oct 11, 2024

Yeah, I think there's a bug in your code: to_read = @buffer_size - @in_buffer_rem.size should be to_read = size - @in_buffer_rem.size.

And it does not handle the case where @in_buffer_rem + to_read goes out of bounds on @in_buffer. E.g. this use case:

    # This should fill the buffer and consume all but the last byte
    io.read(Bytes.new(io.buffer_size - 1))
    # Now we want to read one more byte than the buffer's capacity
    io.peek(2)

In order to fix this, we need to always move the contents of @in_buffer_rem to the start of @in_buffer. And then we fill the remainder up to at least size.
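That compaction step can be illustrated on a plain byte buffer (standalone sketch with made-up sizes, not the actual IO::Buffered internals):

```crystal
buffer = Bytes.new(8) { |i| i.to_u8 }
# Suppose the last 2 bytes are unconsumed leftover data
leftover = buffer[6, 2]
# Move the leftover to the front so the rest of the buffer is free capacity
leftover.copy_to(buffer.to_unsafe, leftover.size)
rem = buffer[0, leftover.size] # rem now holds the former bytes 6 and 7
# buffer[rem.size..] can now be filled with fresh reads up to `size`
```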

@spuun (Contributor, Author) commented Oct 11, 2024

Yeah, I think there's a bug in your code: to_read = @buffer_size - @in_buffer_rem.size should be to_read = size - @in_buffer_rem.size.

Hm. My intention was to try to fill the entire buffer, but at least size. Isn't @buffer_size - @in_buffer_rem.size what's left in in_buffer?

And it does not handle the case where @in_buffer_rem + to_read goes out of bounds on @in_buffer. E.g. this use case:

    # This should fill the buffer and consume all but the last byte
    io.read(Bytes.new(io.buffer_size - 1))
    # Now we want to read one more byte than the buffer's capacity
    io.peek(2)

I'm only reading while @in_buffer_rem.size < size, so in this case no read would happen.

What I want #peek(n) to do is to ensure there are n bytes in the buffer/returned slice, not to always read n more bytes.

@spuun (Contributor, Author) commented Oct 11, 2024

I have these specs that may show what I want the method to do.

@straight-shoota (Member) commented Oct 11, 2024

Hm, seems like I misread the code, sorry. So to_read determines the remaining capacity.
But in the case @in_buffer_rem.size + to_read < size, we cannot fill the buffer to fit the requested amount. That would mean returning a slice that's smaller than size, despite having potentially more data to read. That's unergonomic because you couldn't rely on peek(n) returning n bytes even in normal operation. The only case where a short result should be possible is when the IO closes and there is no more to read.

So I still think we need to move @in_buffer_rem to the start of @in_buffer.

@spuun (Contributor, Author) commented Oct 11, 2024

Yeah, I recreate @in_buffer_rem after data has been added to in_buffer. Maybe this must be done in some other way?

@spuun (Contributor, Author) commented Oct 11, 2024

I think I understand what you mean now, yes.

@straight-shoota (Member) commented Oct 11, 2024

I was thinking the algorithm should look something like this (untested):

    if @in_buffer_rem.size < size
      # We need to read more into the buffer

      in_buffer = in_buffer()

      # 1. Move leftover data to the front of the buffer so we have maximum capacity to fill
      @in_buffer_rem.copy_to(in_buffer, @in_buffer_rem.size)

      # 2. Fill the buffer up to at least size
      while @in_buffer_rem.size < size
        free_space = Slice.new(in_buffer + @in_buffer_rem.size, @buffer_size - @in_buffer_rem.size)
        bytes_read = unbuffered_read(free_space).to_i
        break if bytes_read.zero?
        @in_buffer_rem = Slice.new(in_buffer, @in_buffer_rem.size + bytes_read)
      end
    end

    # May be shorter than size if the IO hit EOF first
    @in_buffer_rem[0, Math.min(size, @in_buffer_rem.size)]

@spuun (Contributor, Author) commented Oct 11, 2024

I was thinking the algorithm should look something like this (untested)

Exactly. I just did a spec to verify that.

@spuun (Contributor, Author) commented Oct 11, 2024

Currently I have https://github.com/cloudamqp/lavinmq/pull/803/files

I think it makes sense to raise something if we can't read size bytes. I also added a check to avoid always moving data in in_buffer, which I think makes sense.

@straight-shoota (Member) commented Oct 11, 2024

I'm not too happy about raising. It's costly and not always necessary.
There could be a variant that raises, but I think the default should just return a slice with a size smaller than requested. The caller can figure out what to make of it and raise only if appropriate for the use case.
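Caller-side, that default would look something like this (hypothetical #peek(n); the message text is illustrative):

```crystal
header = io.peek(8)
# Raise only where the use case actually demands the full amount
raise IO::EOFError.new("stream ended before 8-byte header") if header.size < 8
```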

@spuun (Contributor, Author) commented Oct 11, 2024

I hear ya!
