DotNetty.Codecs.CorruptedFrameException: negative pre-adjustment length field #3879

Closed
leonelcm opened this issue Aug 7, 2019 · 7 comments · Fixed by #4252

Comments

@leonelcm

leonelcm commented Aug 7, 2019

- Which Akka.NET version you are using
Akka.NET: 1.3.13
DotNetty: 0.6.0

- On which platform you are using Akka.NET
Windows

- A list of steps to reproduce the issue, or a gist or GitHub repo that can easily be used to reproduce your case.
The following exceptions were generated last night while we were processing messages:

  • Error caught channel "[::ffff:165.33.24.232]:8910"->"[::ffff:165.34.44.22]:35690"
    {
      "Depth": 0,
      "ClassName": "DotNetty.Codecs.CorruptedFrameException",
      "Message": "negative pre-adjustment length field: -160",
      "Source": "DotNetty.Codecs",
      "StackTraceString": " at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input)\r\n at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input, List`1 output)\r\n at DotNetty.Codecs.ByteToMessageDecoder.CallDecode(IChannelHandlerContext context, IByteBuffer input, List`1 output)\r\n at DotNetty.Codecs.ByteToMessageDecoder.ChannelRead(IChannelHandlerContext context, Object message)\r\n at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)",
      "RemoteStackTraceString": null,
      "RemoteStackIndex": 0,
      "ExceptionMethod": {
        "Name": "Decode",
        "AssemblyName": "DotNetty.Codecs",
        "AssemblyVersion": "0.6.0.0",
        "AssemblyCulture": "",
        "ClassName": "DotNetty.Codecs.LengthFieldBasedFrameDecoder",
        "Signature": "System.Object Decode(DotNetty.Transport.Channels.IChannelHandlerContext, DotNetty.Buffers.IByteBuffer)",
        "MemberType": 8
      },
      "HResult": -2146233088,
      "HelpURL": null
    }

  • No response from remote for inbound association. Handshake timed out after [15000] ms

  • Error caught channel "[::ffff:165.33.24.232]:8910"->"[::ffff:165.34.44.22]:53650"
    {
      "Depth": 0,
      "ClassName": "DotNetty.Codecs.TooLongFrameException",
      "Message": "Adjusted frame length exceeds 50000: 721008 - discarded",
      "Source": "DotNetty.Codecs",
      "StackTraceString": " at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Fail(Int64 frameLength)\r\n at DotNetty.Codecs.LengthFieldBasedFrameDecoder.FailIfNecessary(Boolean firstDetectionOfTooLongFrame)\r\n at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input)\r\n at DotNetty.Codecs.LengthFieldBasedFrameDecoder.Decode(IChannelHandlerContext context, IByteBuffer input, List`1 output)\r\n at DotNetty.Codecs.ByteToMessageDecoder.CallDecode(IChannelHandlerContext context, IByteBuffer input, List`1 output)\r\n at DotNetty.Codecs.ByteToMessageDecoder.ChannelRead(IChannelHandlerContext context, Object message)\r\n at DotNetty.Transport.Channels.AbstractChannelHandlerContext.InvokeChannelRead(Object msg)",
      "RemoteStackTraceString": null,
      "RemoteStackIndex": 0,
      "ExceptionMethod": {
        "Name": "Fail",
        "AssemblyName": "DotNetty.Codecs",
        "AssemblyVersion": "0.6.0.0",
        "AssemblyCulture": "",
        "ClassName": "DotNetty.Codecs.LengthFieldBasedFrameDecoder",
        "Signature": "Void Fail(Int64)",
        "MemberType": 8
      },
      "HResult": -2146233088,
      "HelpURL": null
    }

Unfortunately I can't reproduce the issue, as this happened in PROD and we process millions of messages a day. Any thoughts on what the cause could be and what the impact might be?

Thank you,
Leonel

@Aaronontheweb
Member

Hi @leonelcm

We've had some other reports similar to this one, and they've been difficult to reproduce; the biggest reason is that we don't even know what the message type is at this stage in the decoding pipeline.

This is happening because of the 4-byte length header we prepend to each message (a standard socket message framing technique); in this instance, that header appears to contain bad data.
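
For readers unfamiliar with the technique, here is a minimal, hedged sketch of length-prefix framing in general - illustrative only, not DotNetty's actual implementation (the class and method names below are made up for the example):

using System;
using System.IO;

// Minimal sketch of length-prefix framing: each message is written as a 4-byte big-endian
// length followed by the payload; the reader first reads the 4-byte header, then reads
// exactly that many payload bytes. A corrupted header is exactly what produces errors like
// "negative pre-adjustment length field".
static class LengthPrefixFraming
{
    public static void WriteFrame(Stream stream, byte[] payload)
    {
        var header = new byte[4];
        header[0] = (byte)(payload.Length >> 24);
        header[1] = (byte)(payload.Length >> 16);
        header[2] = (byte)(payload.Length >> 8);
        header[3] = (byte)payload.Length;
        stream.Write(header, 0, 4);
        stream.Write(payload, 0, payload.Length);
    }

    public static byte[] ReadFrame(Stream stream)
    {
        var header = ReadExact(stream, 4);
        int length = (header[0] << 24) | (header[1] << 16) | (header[2] << 8) | header[3];
        if (length < 0)
            throw new InvalidDataException("negative pre-adjustment length field: " + length);
        return ReadExact(stream, length);
    }

    private static byte[] ReadExact(Stream stream, int count)
    {
        var buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = stream.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return buffer;
    }
}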

Here's the line that threw:

https://github.com/Azure/DotNetty/blob/13f009426687afc7c8d60ea653f6bc59d520eb2b/src/DotNetty.Codecs/LengthFieldBasedFrameDecoder.cs#L329-L333

This could have happened because the piece of arithmetic that runs just before it:

int actualLengthFieldOffset = input.ReaderIndex + this.lengthFieldOffset;
long frameLength = this.GetUnadjustedFrameLength(input, actualLengthFieldOffset, this.lengthFieldLength, this.byteOrder);

produced a wrong value. I think what is likely happening in this case is that input.ReaderIndex is wrong at the start of the read - we pool those byte buffers and are supposed to release them and return them to the pool. I suspect there's a race condition of some kind where a buffer can be returned to the pool dirty: when decoding multiple messages, the reader index isn't set correctly at the end of one message and the beginning of the next.
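
To make the suspected failure mode concrete, here is a toy sketch (assumed names, not DotNetty code) of what happens when a pooled buffer's read index isn't reset before reuse: the decoder ends up interpreting arbitrary payload bytes as the 4-byte length header, which can easily come out negative or absurdly large.

using System;

// Toy stand-in for a pooled buffer; the real DotNetty IByteBuffer is far richer.
class PooledBuffer
{
    public byte[] Bytes = new byte[1024];
    public int ReaderIndex; // should be reset to 0 when the buffer is returned to the pool
}

static class DirtyBufferDemo
{
    // Reads a 4-byte big-endian length starting at the current reader index.
    static int ReadLengthHeader(PooledBuffer buf) =>
        (buf.Bytes[buf.ReaderIndex] << 24) |
        (buf.Bytes[buf.ReaderIndex + 1] << 16) |
        (buf.Bytes[buf.ReaderIndex + 2] << 8) |
         buf.Bytes[buf.ReaderIndex + 3];

    static void Main()
    {
        var buf = new PooledBuffer();

        // First message: header says 8 bytes of payload.
        buf.Bytes[3] = 8;
        Console.WriteLine(ReadLengthHeader(buf)); // 8 - correct

        // Buffer is "returned to the pool" dirty: ReaderIndex left pointing into old data.
        buf.ReaderIndex = 2;
        buf.Bytes[2] = 0xFF; // stale payload byte
        Console.WriteLine(ReadLengthHeader(buf)); // garbage - in this case a negative length
    }
}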

@Aaronontheweb
Member

Going forward, I can address this in one of two ways:

  1. Run some FsCheck tests over DotNetty's decoders using a combination of dirty and clean buffers and a small, in-memory message-processing pipeline that pretty closely simulates what we do with Akka.Remote (see the round-trip property sketch after this list). That might expose an edge case that isn't being covered correctly by this code today. I'd have to patch DotNetty and get a new release out, which might be hard considering that the Microsoft guys who run that project have gone really quiet lately. Worst case, though, I can write my own LengthFrameDecoder and add it to the remoting pipeline without doing any code updates.
  2. Resume my work on getting rid of DotNetty altogether and move towards a pure Akka.Streams TCP transport, something I started but haven't picked back up yet: https://github.com/Aaronontheweb/Akka.Streams.RemoteTransport - the primary goal of that transport is performance, and it's a totally experimental design that I haven't finished yet. But if we got that working, it would also solve this problem.
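
As a rough shape for option 1, here is a hedged sketch of an FsCheck round-trip property - not actual Akka.NET or DotNetty test code. It reuses the hypothetical LengthPrefixFraming sketch from earlier in this thread; a real test would drive DotNetty's LengthFieldBasedFrameDecoder with pooled buffers instead.

using System.IO;
using System.Linq;
using FsCheck;

static class FramingProperties
{
    // Property: any payload survives an encode/decode round trip through the framing codec.
    public static void RoundTripHoldsForAllPayloads()
    {
        Prop.ForAll<byte[]>(payload =>
        {
            payload = payload ?? new byte[0];

            var stream = new MemoryStream();
            LengthPrefixFraming.WriteFrame(stream, payload);
            stream.Position = 0;

            var decoded = LengthPrefixFraming.ReadFrame(stream);
            return decoded.SequenceEqual(payload);
        }).QuickCheckThrowOnFailure();
    }
}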

Aaronontheweb added this to the 1.4.0 milestone Aug 7, 2019
@Aaronontheweb
Member

Wonder if these are related: #3273

@leonelcm
Author

@Aaronontheweb, thank you very much for your feedback. Would the messages be dropped when this issue occurs? What is the impact of these issues?

@Aaronontheweb
Member

I think this is going to be difficult to locate, specifically, without a really good reproduction spec, but I ran into this myself yesterday while working on some tests with Akka.Remote.

I think the issue in this case is that DotNetty, for whatever reason, can occasionally corrupt frames in its decoding scheme (length framing). I also think it has to do with how the input buffers are being reused inside DotNetty.

We're about to embark on a number of issues for Akka.NET v1.4.0 that might render this moot - namely moving off of DotNetty and implementing Artery for Akka.NET TCP.

@Aaronontheweb
Member

After running some detailed tests as part of #4106 I've determined that this is ultimately an issue with the LengthFrameDecoder / LengthFrameEncoder and the DotNetty reference-counting system for pooled memory. It's infrequent, but it does happen somewhat racily. I believe the issue in this case is a dirty write during the encoding process which encodes the frame-length header incorrectly.

Workaround

There is a temporary workaround for this problem which should have a fairly minimal performance impact (according to my numbers):

akka.remote.dot-netty.tcp.enable-pooling = false

This will disable buffer pooling altogether and thus work around the source of this issue.
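
For anyone applying the workaround from code rather than an existing HOCON file, a minimal sketch (the system name and fallback configuration below are placeholders) could look like this:

using Akka.Actor;
using Akka.Configuration;

class Program
{
    static void Main()
    {
        // Hedged sketch: layer the pooling workaround on top of whatever configuration
        // the application already loads. "MySystem" is a placeholder name.
        var workaround = ConfigurationFactory.ParseString(
            "akka.remote.dot-netty.tcp.enable-pooling = false");

        var config = workaround.WithFallback(ConfigurationFactory.Load());

        var system = ActorSystem.Create("MySystem", config);

        // ... run the application, then shut down:
        system.Terminate().Wait();
    }
}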

Solution

The solution is either to fix our use of reference counting (.Retain() and .Release() calls) inside our DotNetty pipeline programmatically, or to simply disable the use of pooling inside the remote transport entirely.
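
For illustration, here is a hedged sketch of the reference-counting discipline in question - not the actual Akka.Remote pipeline code. The rule it demonstrates: whichever handler consumes a pooled IByteBuffer (instead of passing it downstream) must release it exactly once, and must not touch the buffer after the release.

using DotNetty.Buffers;
using DotNetty.Common.Utilities;
using DotNetty.Transport.Channels;

sealed class ConsumingHandler : ChannelHandlerAdapter
{
    public override void ChannelRead(IChannelHandlerContext context, object message)
    {
        if (message is IByteBuffer buffer)
        {
            try
            {
                // Copy out what we need while we still own the pooled buffer.
                var bytes = new byte[buffer.ReadableBytes];
                buffer.ReadBytes(bytes);
                context.FireChannelRead(bytes);
            }
            finally
            {
                // Return the pooled buffer exactly once; double-releasing or forgetting to
                // release is the kind of mistake that leads to dirty reuse from the pool.
                ReferenceCountUtil.Release(buffer);
            }
        }
        else
        {
            context.FireChannelRead(message);
        }
    }
}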

@Aaronontheweb
Member

Included this in Akka.NET v1.3.18: https://github.com/akkadotnet/akka.net/releases/tag/1.3.18
