-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SequenceReader<T>: Round 2 #69
Conversation
|
||
## BufferExtensions | ||
|
||
* Can't we make this an instance method? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why force a defensive copy span allocation if its not required?
// returns false if data is more than one span
public bool TryAsSpan<T>(out ReadOnlySpan<T> span);
// copyBuffer should be >= .Length
public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should there be a return flag to indicate if the copyBuffer
was used?
Otherwise there is ambiguity. Is there is a writable version (or copy) in the copyBuffer? Can the copyBuffer be reused prior to consuming the ReadOnlySpan
(i.e. is it part of the ReadOnlySpan or was it unused).
Does it throw an exception if the copyBuffer
is too small (assume so); but what if the copy buffer is not needed because the data is a single span, but the copy buffer is too small; does that still throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To resolve the ambiguity; especially around exceptions, maybe it should be:
// returns false if data is more than one span
public bool TryAsSpan<T>(out ReadOnlySpan<T> span);
// destination should be >= .Length or exception
public void CopyTo(Span<T> destination);
Then the pattern would be something like:
T[] scratch = null;
if (!seq.TryAsSpan(out ReadOnlySpan<T> roSpan)
{
scratch = ArrayPool<T>.Shared.Rent(seq.Length)
seq.CopyTo(scratch);
roSpan = scratch;
}
// Do something with roSpan
if (scratch != null) ArrayPool<T>.Shared.Return(scratch);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should there be a return flag to indicate if the copyBuffer was used? Otherwise there is ambiguity
It seems really rare that you'd care, though, right? Presumably you'd just assume you can't use the supplied scratch buffer again until you're done with the returned span. And in the rare case where you actually cared, you could use Span.Overlaps to determine whether the returned span is part of the scratch buffer or not.
That said, I agree in general that this pattern is confusing, and it'd be better not to force the caller to allocate until they know they have to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem is if you want to modify the returned data (mask/unmask etc) then with
public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);
You need one scratch buffer for the call (which may or may not be used); then a second scratch buffer to copy the returned span into; whereas it would be easier just to jump straight to .CopyTo
For just reading; then the scratch buffer is unnecessary for the fast-path; but you need a .Length
sized buffer allocated just in case and if the whole ReadOnlySequence
is being read then the scratch buffer is unlikely to be used for more than a single read (e.g. returned to pool prior to awaiting for more data).
Also question on what happens if .Length
is greater than 2Bn elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Branchless made it worse:
BEFORE
TwoStage_Array | 17.87 ns | 0.3825 ns | 0.6994 ns | 18.21 ns |
AFTER
TwoStage_Array | 18.73 ns | 0.3999 ns | 0.7894 ns | 19.18 ns |
Not terribly surprising- ripping the span out being costly is the only reason SequenceReader
is ref after all. :) Flipping to return a span made it slightly better, but still slower than TryAsSpan
and is pretty awkard imho:
TwoStage_Array | 17.03 ns | 0.3595 ns | 0.5597 ns | 17.24 ns |
ReadOnlySpan<byte> span = _arraySequence.TryAsSpan(out bool success);
if (!success)
{
why would we even need SliceOrCopy
I'm not sure what you're asking here? Having the SliceOrCopy
methods period is due to the performance improvement in the normal single segment case. Having one without a delegate is again about performance- as crossing segments is so unlikely, new T[]
is a good-enough solution for many. Given that it is expected to be common and is faster it makes sense to have an "even-faster" overload for "default" behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the results I get copying and pasting the benchmarks you shared:
Method | Mean | Error | StdDev | Median |
------------------ |---------:|----------:|----------:|---------:|
SliceOrCopy_Array | 18.25 ns | 0.3942 ns | 0.6695 ns | 18.04 ns |
TwoStage_Array | 17.58 ns | 0.1277 ns | 0.1132 ns | 17.57 ns |
I ran it multiple times, the results always come out relatively the same, with the TwoStage_Array being faster than SliceOrCopy_Array. Note that that's with:
private static ReadOnlySequence<byte> _arraySequence = new ReadOnlySequence<byte>(new byte[10_000]);
since it wasn't clear to me from what you wrote what _arraySequence was. If instead change it to be an instance, the versions end up being almost the same perf-wise for me:
Method | Mean | Error | StdDev |
------------------ |---------:|----------:|----------:|
SliceOrCopy_Array | 17.47 ns | 0.3707 ns | 0.3468 ns |
TwoStage_Array | 17.31 ns | 0.3631 ns | 0.4036 ns |
We're talking nanosecond differences here, in either direction. Maybe it's just minor differences on the machine. Maybe it's the processor involved. Maybe it's something else. But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you're asking here?
I mean, why have these helpers at all? They're one-liners, using public APIs that anyone can use. When I run the same test with:
[Benchmark]
public ReadOnlySpan<byte> OpenCoded()
{
ref ReadOnlySequence<byte> r = ref _arraySequence;
return r.IsSingleSegment ? r.First.Span : r.ToArray();
}
for example, I get:
Method | Mean | Error | StdDev |
------------------ |---------:|----------:|----------:|
SliceOrCopy_Array | 17.25 ns | 0.3703 ns | 0.6770 ns |
OpenCoded | 16.41 ns | 0.1735 ns | 0.1538 ns |
So, if we're going for utmost in throughput, worried about developers concerned about this level of microoptimization, and if these numbers are to be believed, then why have the method at all? Let the developer use the ReadOnlySequence APIs however they need, rather than trying to come up with the various "helper" APIs on top of them that are all effectively one-liners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the complete code:
https://gist.github.com/JeremyKuhne/2eb803b80058fa8b108c81352bc31ba0
I ran it multiple times, the results always come out relatively the same,
Ditto:
Method | Mean | Error | StdDev | Median |
--------------------------- |---------:|----------:|----------:|---------:|
SliceOrCopy_Array | 14.87 ns | 0.3224 ns | 0.7343 ns | 15.21 ns |
SliceOrCopy_AllocatorArray | 16.30 ns | 0.3458 ns | 0.4247 ns | 16.41 ns |
TwoStage_Array | 17.82 ns | 0.3756 ns | 0.5736 ns | 18.05 ns |
But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.
Well, sure, let's not be blind. But lets also not throw this out because some machines don't get a boost. For the record, my run is on a 6 core i8700K with a 2.1 .NET Core console app and the current BenchmarkDotNet (CTRL-F5ed of course) on Windows.
They're one-liners, using public APIs that anyone can use.
They are simple because these are meant to validate the overhead involved. The whole point of the Allocator is to allow plugging in whatever you want- which may be significantly more complicated. The ArrayPool tracker I mentioned above is one such example. I plan to move onto
The simple SliceOrCopy_Array
with no arguments as it makes writing the typical case easy. It also helps discoverability of the fact that .ToArray()
is potentially inefficient. I wouldn't toss it simply because someone can easily rewrite the code.
For the record, here is what I get plugging in your OpenCoded
:
Method | Mean | Error | StdDev | Median |
--------------------------- |---------:|----------:|----------:|---------:|
OpenCoded | 14.52 ns | 0.3108 ns | 0.5018 ns | 14.72 ns |
SliceOrCopy_Array | 14.95 ns | 0.3185 ns | 0.5904 ns | 15.21 ns |
Not as much as an improvement for me.
Having these methods does not preclude you from writing your own logic. Having them, however, allows easy piping of the Allocator approach (e.g. I can take one as input to SequenceReader
), facilitates reusability of solutions, and makes things significantly faster on some machines. All of that makes it pretty compelling to me.
If you could try to run again with the code in the gist and share your configuration perhaps we might find out what it is that makes your run slower than mine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having them, however
And is slower on some machines apparently, and clutters the surface area (which can actually make it harder for a consumer of the API to do the "right thing'"), and adds surface area that may encourage developers to allocate when they don't actually need to, and a developer who's really concerned about eeking out the best possible throughout arguably won't use them, anyway, and for everyone else a nanosecond difference doesn't matter, and they're barely shorter than just writing out the ternary manually. As for "piping the allocator", that could still be used by ToArray if it's desirable, so that doesn't seem to actually change anything.
I will try the shared code again this weekend on a few machines. But at the moment I'm not seeing the benefit here (for ToArray, yes, for SliceOrCopy, no). I would rather we teach developers about the few core APIs that are there and how to best use thm efficiently, saving additional APIs we add for when they're truly meaningful (like the general reader APIs you've been working on).
No description provided.