Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make enums containing ByteBuffer fit in three words #349

Merged
merged 1 commit into from
Apr 24, 2018

Conversation

weissi
Copy link
Member

@weissi weissi commented Apr 23, 2018

CC @tanner0101 & @helje5

only really makes sense together with #351

Motivation:

NIO's dynamic pipeline comes with one performance caveat: If your type
isn't specialised in NIOAny (only types in the NIO module can be) and
it's over 3 words (24 bytes on 64-bit platforms), you'll suffer an
allocation for every time you box it in NIOAny. Very unfortunately, both
ByteBuffer and FileRegion were exactly 3 words wide which means that any
enum containing those would be wider than 3 words. Worse, both
HTTP{Request,Response}Head contained types that were wider than 3 words.
The best solution to this problem is to shrink ByteBuffer and FileRegion
to just under 3 words and that's exactly what this PR is doing. That is
slightly tricky as ByteBuffer was already bit packed fairly well
(reader/writer indices & slice begin/end were stored as a UInt32). The
trick we employ now is to store the slice beginning in a UInt24 and the
file region reader index in a UInt56. That saves one byte for both
ByteBuffer and FileRegion with very moderate tradeoffs: The reader index
in a file now needs to be within 64 PiB (peta bytes) and a byte buffer
slice beginning must start within 16 MiB (mega bytes). Note: The
reader/writer indices as well as slice ends are not affected and can
still be within 4 GiB. Clearly no one would care about the restrictions
for FileRegions in the real world but we might hit the ByteBuffer slice
beginning limit in which case the slice would be copied out. But
given that slices are mostly used to slice off headers in network
protocols, 16 MiB should be plenty.

Norman was kind enough to measure the perf differences:

master (before this PR):

$ wrk -c 256 -d 10s -t 4 -s ~/Downloads/pipeline-many.lua -H "X-Host: SomeValue" -H "Host: swiftnio.io" -H "ThereAreEvenMoreHeaders: AndMoreValues" http://localhost:8888
Running 10s test @ http://localhost:8888
  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    93.02ms  158.13ms   1.97s    93.07%
    Req/Sec   112.79k    26.05k  258.75k    84.50%
  4501098 requests in 10.04s, 214.63MB read
  Socket errors: connect 0, read 111, write 0, timeout 89
Requests/sec: 448485.61
Transfer/sec:     21.39MB

after this PR:

$ wrk -c 256 -d 10s -t 4 -s ~/Downloads/pipeline-many.lua -H "X-Host: SomeValue" -H "Host: swiftnio.io" -H "ThereAreEvenMoreHeaders: AndMoreValues" http://localhost:8888
Running 10s test @ http://localhost:8888
  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   107.53ms  206.56ms   1.99s    90.55%
    Req/Sec   124.15k    26.56k  290.41k    89.25%
  4952904 requests in 10.03s, 236.17MB read
  Socket errors: connect 0, read 161, write 0, timeout 22
Requests/sec: 493852.65
Transfer/sec:     23.55MB

so we see a nice 10% improvement and allocations go down by 2, too:

23:21:44 info: 1000_reqs_1_conn: allocations not freed: 0
23:21:44 info: 1000_reqs_1_conn: total number of mallocs: 67832

from

19:08:57 info: 1000_reqs_1_conn: allocations not freed: 0
19:08:57 info: 1000_reqs_1_conn: total number of mallocs: 69834

Modifications:

  • shrank ByteBuffer by 1 byte, making it 23 bytes in total
  • shrank FileRegion by 1 byte, making it 23 bytes in total
  • CoW boxed parts of HTTP{Request,Response}Head to make them fit within 2 words
  • added @_inlineable to NIOAny where appropriate to not suffer boxed existentials in different places
  • added tests for the new edge cases

Result:

  • more speed
  • fewer allocations

@@ -29,6 +29,86 @@ let sysFree: @convention(c) (UnsafeMutableRawPointer?) -> Void = free
}
#endif

struct SadUInt24: ExpressibleByIntegerLiteral {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha... Sad... ;)

}

init?(_ value: UInt32) {
guard value & 0xff_00_00_00 == 0 else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL that you can use _ to make it more readable

Copy link
Member

@normanmaurer normanmaurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great so far... Just some random comments

@@ -19,11 +19,52 @@ let headerSeparator: StaticString = ": "

/// A representation of the request line and header fields of a HTTP request.
public struct HTTPRequestHead: Equatable {
private class _Storage {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it not really makes a difference but use final (for all internal classes) ?

}

func copy() -> _Storage {
return _Storage(method: self.method, rawURI: self.rawURI, version: self.version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you either always use .init or always _Storage just to make it consistent ?

@weissi weissi force-pushed the jw-bananas-2 branch 3 times, most recently from b35aa27 to 25fae62 Compare April 24, 2018 14:24
@weissi weissi changed the title [EXPERIMENT] [DO-NOT-MERGE] shrink ByteBuffer and reduce allocations make enums containing ByteBuffer fit in three words Apr 24, 2018

guard sliceBeginIndex <= ByteBuffer.Slice.maxSupportedLowerBound else {
// the slice's begin is past the maximum supported slice begin value (16 MiB) so the only option we have
// is copy the slice into a fresh buffer and the slice begin will then be at index 0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/and the/. The/

Copy link
Contributor

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull the changes to the HTTP types to a separate PR? It'll make reviewing the changes quite a bit easier. 😄

init(_ value: UInt32) {
precondition(value & 0xff_00_00_00 == 0, "value \(value) too large for _UInt24")
self.b12 = UInt16(truncatingIfNeeded: (value & 0xff_ff) >> 0 )
self.b3 = UInt8(truncatingIfNeeded: (value & 0xff_ff_00) >> 16)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can save yourself a masking operation here. You already precondition that the top byte is zero, so with that knowledge in place you can just shift right by 16 bits and store into UInt8.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also as noted in the chat, the mask isn't really right anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops! thanks

@@ -29,6 +29,97 @@ let sysFree: @convention(c) (UnsafeMutableRawPointer?) -> Void = free
}
#endif

@_versioned
struct _UInt24: ExpressibleByIntegerLiteral {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a doc comment here please?

}

@_versioned
struct _ByteBufferSlice {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, can we get a doc comment here please?

@@ -486,8 +577,19 @@ public struct ByteBuffer {
}
let index = _toIndex(index)
let length = _toCapacity(length)
let sliceBeginIndex = self._slice.lowerBound + index

guard sliceBeginIndex <= ByteBuffer.Slice.maxSupportedLowerBound else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the word "begin" here is odd. Start might be better.

@@ -12,6 +12,68 @@
//
//===----------------------------------------------------------------------===//

struct _UInt56: ExpressibleByIntegerLiteral {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to consolidate our helper integers into a single source file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, will do that

precondition(value & 0xff_00_00_00_00_00_00_00 == 0)
self.init(b1234: UInt32(truncatingIfNeeded: (value & 0xff_ff_ff_ff) >> 0 ),
b56: UInt16(truncatingIfNeeded: (value & 0xff_ff_00_00_00_00) >> 32),
b7: UInt8(truncatingIfNeeded: (value & 0xff_00_00_00_00_00_00) >> 48))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given your earlier precondition, this mask is unnecessary.

@weissi weissi force-pushed the jw-bananas-2 branch 2 times, most recently from d7f91d4 to 1f36f2d Compare April 24, 2018 16:08

extension _UInt56 {
init(_ value: UInt64) {
precondition(value & 0xff_00_00_00_00_00_00_00 == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should all of these precondition be assert as this is not really something the user can trigger but only ourselves if our code has a bug.


extension _UInt24 {
init(_ value: UInt32) {
precondition(value & 0xff_00_00_00 == 0, "value \(value) too large for _UInt24")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weissi should this be an assert as there is no way for the user to actual construct this ? I think we usually use assert for things that would only happen because of a bug in NIO itself


extension _UInt56 {
init(_ value: UInt64) {
precondition(value & 0xff_00_00_00_00_00_00_00 == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weissi should this be an assert as there is no way for the user to actual construct this ? I think we usually use assert for things that would only happen because of a bug in NIO itself

Also consider adding a message as well

Copy link
Contributor

@Lukasa Lukasa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor nit: feel free to either address it or not depending on how you feel about it.


let offset = Int(_UInt24.max) + 1
let expectedReadablebytes = totalBufferSize - offset
let slice = buf.getSlice(at: offset, length: expectedReadablebytes)!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: it might be better if we sliced the same size as the test above, to isolate the test differences to only being the slice start index, rather than the slice sizes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed and improved test (by marking some 'important' positions with different byte values)

Copy link
Member

@normanmaurer normanmaurer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after @Lukasa's nit is addressed... Great work!

Motivation:

NIO's dynamic pipeline comes with one performance caveat: If your type
isn't specialised in NIOAny (only types in the NIO module can be) and
it's over 3 words (24 bytes on 64-bit platforms), you'll suffer an
allocation for every time you box it in NIOAny. Very unfortunately, both
ByteBuffer and FileRegion were exactly 3 words wide which means that any
enum containing those would be wider than 3 words. Worse, both
HTTP{Request,Response}Head contained types that were wider than 3 words.
The best solution to this problem is to shrink ByteBuffer and FileRegion
to just under 3 words and that's exactly what this PR is doing. That is
slightly tricky as ByteBuffer was already bit packed fairly well
(reader/writer indices & slice begin/end were stored as a UInt32). The
trick we employ now is to store the slice beginning in a UInt24 and the
file region reader index in a UInt56. That saves one byte for both
ByteBuffer and FileRegion with very moderate tradeoffs: The reader index
in a file now needs to be within 64 PiB (peta bytes) and a byte buffer
slice beginning must start within 16 MiB (mega bytes). Note: The
reader/writer indices as well as slice ends are _not_ affected and can
still be within 4 GiB. Clearly no one would care about the restrictions
for FileRegions in the real world but we might hit the ByteBuffer slice
beginning limit in which case the slice would be copied out. But
given that slices are mostly used to slice off headers in network
protocols, 16 MiB should be _plenty_.

Norman was kind enough to measure the perf differences:

master (before this PR):
```
$ wrk -c 256 -d 10s -t 4 -s ~/Downloads/pipeline-many.lua -H "X-Host: SomeValue" -H "Host: swiftnio.io" -H "ThereAreEvenMoreHeaders: AndMoreValues" http://localhost:8888
Running 10s test @ http://localhost:8888
  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    93.02ms  158.13ms   1.97s    93.07%
    Req/Sec   112.79k    26.05k  258.75k    84.50%
  4501098 requests in 10.04s, 214.63MB read
  Socket errors: connect 0, read 111, write 0, timeout 89
Requests/sec: 448485.61
Transfer/sec:     21.39MB
```

after this PR:
```
$ wrk -c 256 -d 10s -t 4 -s ~/Downloads/pipeline-many.lua -H "X-Host: SomeValue" -H "Host: swiftnio.io" -H "ThereAreEvenMoreHeaders: AndMoreValues" http://localhost:8888
Running 10s test @ http://localhost:8888
  4 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   107.53ms  206.56ms   1.99s    90.55%
    Req/Sec   124.15k    26.56k  290.41k    89.25%
  4952904 requests in 10.03s, 236.17MB read
  Socket errors: connect 0, read 161, write 0, timeout 22
Requests/sec: 493852.65
Transfer/sec:     23.55MB
```

so we see a nice 10% improvement

Modifications:

- shrank ByteBuffer by 1 byte, making it 23 bytes in total
- shrank FileRegion by 1 byte, making it 23 bytes in total
- added @_inlineable to NIOAny where appropriate to not suffer boxed existentials in different places
- added tests for the new edge cases

Result:

- more speed
- fewer allocations
@normanmaurer
Copy link
Member

@weissi please merge once CI completes

@weissi weissi merged commit db61cfe into apple:master Apr 24, 2018
@weissi weissi deleted the jw-bananas-2 branch April 24, 2018 17:51
@weissi weissi added the semver/patch No public API change. label Apr 24, 2018
@weissi weissi added this to the 1.6.0 milestone Apr 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver/patch No public API change.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants