Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api: add record-oriented IO functions #41

Closed
wants to merge 1 commit into from

Conversation

Freaky
Copy link
Contributor

@Freaky Freaky commented Feb 28, 2020

This adds:

  • byte_records() iterator
  • for_byte_terminated_record_with_terminator()
  • for_byte_terminated_record()

Which are basically parametrised versions of the existing functions for reading lines.

@Freaky
Copy link
Contributor Author

Freaky commented Feb 28, 2020

I blame any mistakes on Japanese whisky.

Copy link
Owner

@BurntSushi BurntSushi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I left a few comments that we should address before merging. But overall this looks great.

In terms of whether we want these APIs though... Could you say more about your use case for them? I grant that they are a nice convenience, but still, always good to hear about use cases.

/// assert_eq!(records[2], "dolor".as_bytes());
/// # Ok(()) }; example().unwrap()
/// ```
fn for_byte_terminated_record<F>(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the inconsistency in naming? Seems like this should be for_byte_record given that you have byte_records above. It's also shorter. :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually the last bit I added - I was leaning towards being more explicit, but I guess I was a bit sick of it by the time I got to the iterator :)

@@ -208,6 +321,20 @@ pub struct ByteLines<B> {
buf: B,
}

/// An iterator over records from an instance of
/// [`std::io::BufRead`](https://doc.rust-lang.org/std/io/trait.BufRead.html).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this add a sentence explaining what a byte record is? e.g., "A byte record is any sequence of bytes terminated by a particular byte chosen by the caller. For example, NUL separate byte strings are said to be NUL-terminated byte records."

if chunk.last() == Some(&terminator) {
for_each_record(&chunk[0..chunk.len() - 1])
} else {
for_each_record(&chunk)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep things consistent, maybe rename trim_slice to trim_line_slice and add trim_record_slice so that it can be used here. Or at least, it would be nice to call for_each_record in only one place in the source.

@Freaky
Copy link
Contributor Author

Freaky commented Feb 28, 2020

Regarding use-case, I mainly see it being used for apps wanting to support -print0-style output. I'm sure there are the odd weird data files using other delimiters too - someone has to have used these, right?

This article is what prompted me - she wrote her own custom implementation, lamenting the effort it took. Which is bit of a shame when I've had a previous iteration of this lying on my disk forgotten for the past 6 months.

@thomcc
Copy link
Contributor

thomcc commented Feb 28, 2020

I also mentioned similar use cases in #12 (comment) -- I've worked with files that delimited entries with \xff before (as well as \x00). That said, for my case it would have been better on ByteSlice rather than on io (sorry, I missed where this was in my initial comment).

@Freaky
Copy link
Contributor Author

Freaky commented Feb 28, 2020

Note there are iterators included here - they're just thin wrappers around read_until(), and each iteration involves allocating a vec and filling it from BufRead's internal buffer.

The callback functions simply lend out slices to the buffer, only copying when the buffer does not contain a full record. You can't do this with an iterator, and this can have a substantial impact on performance.

@thomcc
Copy link
Contributor

thomcc commented Feb 28, 2020

Yeah, I missed that this was in the io module, and thus the iterator approach involved a good amount of overhead, and edited my comment to remove that bit.

BurntSushi pushed a commit that referenced this pull request May 10, 2020
This mirrors the line iterating methods, but permits the caller to
specify an arbitrary terminator.

Closes #41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants