Terrible SQS SendMessage performance #1602

Open
bill-poole opened this issue May 6, 2020 · 23 comments
Labels: bug, module/sdk-custom, p1, queued

Comments

@bill-poole

Expected Behavior

An application should be able to send significantly more than 10MB/s to SQS per vCPU. Ideally, an application should be able to send well over 100MB/s per vCPU.

Current Behavior

Sending about 130MB/s of messages to SQS consumes about 15 vCPUs. Much of that CPU time is spent in the GC, because the client makes over 5GB of allocations for every 100MB of messages sent.

We also find that this issue is proportional to total/aggregate payload size, not number of messages. That is, if we send much less data spread over many more messages, the CPU load is significantly less.

Possible Solution

Simplify/streamline the .NET SQS client so it is performance-optimised. Minimise allocations, reducing GC pressure.

Steps to Reproduce (for bugs)

Just create a simple application that uses the SQS client to concurrently send a large number of large messages, such that over 100MB/s of messages are being sent. It uses about 15 vCPUs on .NET Core 3.1 on Linux. The performance is even worse on Windows.

Context

Our application produces a massive amount of data that needs to be sent through SQS, and at 15 vCPUs per 100MB/s, a lot of our compute costs come from the .NET SQS client.

Your Environment

  • AWSSDK.SQS version used: 3.3.102.104
  • Operating System and version: Ubuntu 18.04
  • Visual Studio version: Visual Studio 2019 (16.5)
  • Targeted .NET platform: .NET Core 3.1

.NET Core Info

  • .NET Core version used for development: .NET Core 3.1
  • .NET Core version installed in the environment where application runs: .NET Core 3.1
@bill-poole
Author

bill-poole commented May 6, 2020

I just did a quick calculation, and with the above reported performance, it seems like there is currently a 25% cost overhead to using the SQS client provided by the AWS .NET SDK; where that cost is for the EC2 CPU-seconds needed by the SQS client to send each million messages, assuming each message is 64kB.

@philasmar added the bug, needs-reproduction, needs-triage, and module/sdk-custom labels on Jul 23, 2020
@NGL321 added the A label and removed the needs-triage label on Sep 9, 2020
@hunanniu added the queued label on Oct 7, 2020
@bill-poole
Author

Hi, just wondering if there’s an update on this? Does AWS recognise there is a problem? If so, does AWS plan on a fix/update? If so, is the fix scheduled? If so, what’s the time horizon?

@bill-poole
Author

Is the SQS .NET client still being maintained by AWS?

@normj
Member

normj commented Jul 7, 2021

@billpoole-mi You mentioned a message size of 64K. Is that the maximum size, or is some percentage of messages significantly larger? What I'm wondering is whether your messages are going past 85,000 bytes. The .NET garbage collector automatically puts objects of that size on the large object heap (LOH), and putting a lot of objects on the LOH could cause a lot of extra work for the GC.
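
A quick way to see the 85,000-byte threshold in action is sketched below. It assumes the default runtime configuration (the threshold is configurable) and relies on the fact that freshly allocated LOH objects are reported as generation 2. The class and method names are hypothetical.

```csharp
using System;

public static class LohCheck
{
    // Assumption: default GC settings, where objects of roughly 85,000 bytes
    // or more are allocated directly on the large object heap. The runtime
    // reports such objects as generation 2 right after allocation.
    public static bool IsOnLargeObjectHeap(object freshlyAllocated)
        => GC.GetGeneration(freshlyAllocated) == 2;
}
```

For example, `LohCheck.IsOnLargeObjectHeap(new byte[90_000])` should report an LOH allocation, while a 1,000-byte array should not. Note that a 64K-character .NET string occupies about 128 KB in memory because strings are UTF-16, so message bodies held as strings can cross the threshold at roughly half the character count one might expect.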

@bill-poole
Author

In our test, we prebuilt a pool of messages that were randomly sized up to 256 kB. We then sent those messages in a loop using SQS. So we had no allocations on our side. The allocations were all in the AWS .NET SQS client.

@bill-poole
Author

Moreover, as stated above, we saw 5 GB of allocations for every 100 MB sent. So even if we were allocating for each new message sent, there were 50x more allocations in the SQS client than what would have been allocated by the sending code.

@normj
Member

normj commented Jul 7, 2021

I agree the allocations are very concerning. I am still curious what effect the LOH is having. Can you run your test harness with message sizes just a little above 85,000 bytes and then a little below it? I'm curious how dramatic the difference will be. Is your test harness shareable?

@bill-poole
Author

bill-poole commented Jul 7, 2021

I’ll have to go dig the test harness up. The last time I used it was May last year when I first raised this issue. I’ll see if I can find it.

Have you tried to reproduce the issue yourself? ie what send throughput do you see sending messages? Are you seeing more than 10 MB/s per CPU thread?

@bill-poole
Author

It’s worth noting that although highly frequent LOH allocations are bad for performance, they’re not so bad as to cause a 10 MB/s per CPU thread send rate.

Just my 2c, but I think the majority of the problem is unnecessary buffer allocations and copying between buffers.

A sender should be able to write into a stream and at most, the whole message is buffered in memory once, and that buffer should be drawn from a memory pool to prevent LOH compaction overheads.

We can write well over 100 MB/s per CPU thread over raw HTTP. ie, if we do the same test just writing the messages as basic REST posts, we get way over 100 MB/s per CPU thread - even if the messages are all over 85 kB.
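
To illustrate the "buffered once, from a pool" idea, here is a minimal sketch using `ArrayPool<byte>.Shared`. It is not how the SDK works today; the class name, method name, and the `send` callback are hypothetical stand-ins for whatever would write the bytes to the HTTP request.

```csharp
using System;
using System.Buffers;
using System.Text;

public static class PooledEncodingSketch
{
    // Encodes `body` as UTF-8 into a rented buffer, passes the encoded slice
    // to `send`, and returns the buffer to the pool afterwards. The callback
    // must not keep a reference to the memory after it returns.
    public static int EncodeAndSend(string body, Action<ReadOnlyMemory<byte>> send)
    {
        int byteCount = Encoding.UTF8.GetByteCount(body);

        // Rent may hand back an array larger than requested.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(byteCount);
        try
        {
            int written = Encoding.UTF8.GetBytes(body, 0, body.Length, buffer, 0);
            send(new ReadOnlyMemory<byte>(buffer, 0, written));
            return written;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```

Because the buffer is rented and returned rather than allocated per message, repeated sends of messages over 85 kB would not create a new LOH allocation each time.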

@ashishdhingra removed the needs-reproduction label on Aug 17, 2021
@github-actions

We have noticed this issue has not received attention in 1 year. We will close this issue for now. If you think this is in error, please feel free to comment and reopen the issue.

@github-actions (bot) added the closing-soon label (this issue will automatically close in 4 days unless further comments are made) on Aug 18, 2022
@bill-poole
Author

Can AWS please provide an update on this? Is it planned to be fixed?

@ashovlin removed the closing-soon label on Aug 18, 2022
@ashishdhingra added the p2 label and removed the A label on Nov 1, 2022
@bill-poole
Author

It's been nearly 4 years since this issue was first opened. Can AWS please provide an update?

Note that I suspect the performance problem here extends well beyond the SQS SendMessage API. For example, I have found similar performance problems with the .NET DynamoDB client library. This is why the EfficientDynamoDb project was created. The EfficientDynamoDb project boasts up to 21x better performance than the AWS-provided client library!!!

Sorry to be so direct, but the fact that the AWS .NET client library is over 20x slower than the community-built EfficientDynamoDb library and the fact that AWS has allowed this situation to persist this long is a crushing indictment of AWS's support for the .NET ecosystem. Should the .NET community perhaps follow the lead of the EfficientDynamoDb project and start an open source project to replace the AWS-provided .NET SQS client?

Does AWS have no interest in providing support for high performance applications built using .NET? Perhaps .NET developers should just use Azure instead?

@bhoradc added the p1 label and removed the p2 label on Sep 23, 2024
@peterrsongg
Contributor

@bill-poole Hello, and sorry it took us so long to respond. This wasn't prioritized highly enough, but it was recently re-prioritized. I ran some benchmarks using the version you specified (3.3.102.104) and the latest version (3.7.400.22).

One thing to note is that as of Nov 9, 2023, AWS SQS migrated from the AWS Query protocol to the AWS JSON protocol. Compared to the AWS JSON protocol, the Query protocol creates many more string allocations. Either way, here are the performance benchmark results:

code : https://github.com/aws/aws-sdk-net/blob/main/sdk/test/Performance/EC2PerformanceBenchmarks/SQSBenchmarks.cs
Message size (tweaked above code for message size) : 100 KB

Latest: 3.7.400.22

| Method | Mean | Error | StdDev | P50 | P90 | P95 | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| SQSSendMessageAsync | 7.184 ms | 0.1558 ms | 0.4569 ms | 7.159 ms | 7.791 ms | 7.928 ms | 7.8125 | 7.8125 | 7.8125 | 475454 B |

Version: 3.3.102.104

| Method | Mean | Error | StdDev | P50 | P90 | P95 | Gen0 | Gen1 | Gen2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| SQSSendMessageAsync | 8.404 ms | 0.2146 ms | 0.6260 ms | 8.322 ms | 9.215 ms | 9.526 ms | 375.0000 | 375.0000 | 375.0000 | 1698929 B |

As you can see, the latest version allocates only about 27.98% as much memory (roughly 3.6x less), which is a huge improvement. The latest version of the SDK targets .NET 8.0, so I'm assuming there are some efficiency gains from that as well, but the majority of the allocation improvements come from SQS migrating from AWSQuery to AWSJson. Is there a reason you cannot upgrade to the latest version, or at least to the version where SQS moved to AWSJson?

Since the version you listed is from 4 years ago, unfortunately there isn't much we can do to improve it, because the protocol behind the service has completely changed. However, if you switch to the latest version, you should see pretty massive improvements in memory allocation.

@bill-poole
Author

bill-poole commented Sep 26, 2024

Thanks @peterrsongg for providing an update to this issue; it's very much appreciated. I stopped using SQS in my .NET solutions because of this issue, so I'm not in a position to test the new version. That being said, I would start using SQS again if this issue were to be resolved. I use the latest version of .NET in my solutions, so I would have no problem using whatever latest version of the .NET SDK that AWS releases.

It's really great to see AWS benchmarking the .NET SDK and I hope it is going to become part of your regular build/test pipeline so that performance is continuously improved, but also any change that hurts performance is identified and fixed before release.

Back in 2020 when I first raised this issue, we were seeing the .NET SDK able to send 10 MB/s per vCPU. Furthermore, the performance seemed independent of message size - i.e., performance depended only on total volume of data sent. 10 MB/s with 100 kB messages is 100 messages per second, which is 10 ms per message.

The benchmark results above have the version of the SDK we tested (3.3.102.104) taking 8.404 ms per message, which is about 16% faster than what we recorded in 2020, which makes sense because we had slightly slower machines in 2020 than today. So on today's hardware, the benchmark says we're getting 11.9 MB/s per vCPU on version 3.3.102.104.

However, according to the above benchmark, the latest version of the SDK (3.7.400.22) only increases that throughput to 13.9 MB/s per vCPU (based on 7.184 ms per 100 kB message), which is only a 17% improvement over the 11.9 MB/s per vCPU on version 3.3.102.104. While that improvement is very much appreciated, I still think much more improvement is needed.
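
The arithmetic behind these figures can be checked with a small sketch. It assumes, as the comparison above does, that one vCPU sends one message at a time with no overlap, and that 1 MB = 1000 kB. The class and method names are hypothetical.

```csharp
using System;

public static class ThroughputMath
{
    // Throughput in MB/s when one message of the given size is sent
    // every `millisecondsPerMessage` milliseconds.
    public static double MegabytesPerSecond(double messageKilobytes, double millisecondsPerMessage)
    {
        double megabytes = messageKilobytes / 1000.0;
        double seconds = millisecondsPerMessage / 1000.0;
        return megabytes / seconds;
    }

    // Fractional throughput gain when the per-message time drops from oldMs to newMs.
    public static double RelativeGain(double oldMs, double newMs) => oldMs / newMs - 1.0;
}
```

With 100 kB messages, `MegabytesPerSecond(100, 8.404)` comes to about 11.9 and `MegabytesPerSecond(100, 7.184)` to about 13.9, and `RelativeGain(8.404, 7.184)` is about 0.17, matching the 17% quoted above.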

The .NET System.Text.Json serializer is able to serialize JSON at a rate of 300 MB/s when using a custom-written JsonConverter that writes JSON using a Utf8JsonWriter. I've been able to serialize and send JSON over an HTTPS connection at rates well over 100 MB/s per vCPU. Therefore, I think a 10x improvement in performance is possible.
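
As a rough illustration of the kind of low-allocation serialization I mean, here is a sketch that writes a request body straight into an `IBufferWriter<byte>` with `Utf8JsonWriter`. The class and method names are hypothetical and the two-property shape is only illustrative; it is not the SDK's actual marshalling code.

```csharp
using System.Buffers;
using System.Text.Json;

public static class JsonBodySketch
{
    // Writes a small JSON object directly as UTF-8 into `output`, with no
    // intermediate string, and returns the number of bytes written.
    public static int WriteSendMessageBody(IBufferWriter<byte> output, string queueUrl, string messageBody)
    {
        using var writer = new Utf8JsonWriter(output);
        writer.WriteStartObject();
        writer.WriteString("QueueUrl", queueUrl);
        writer.WriteString("MessageBody", messageBody);
        writer.WriteEndObject();
        writer.Flush();
        return (int)writer.BytesCommitted;
    }
}
```

The caller chooses the buffer. `ArrayBufferWriter<byte>` from System.Buffers works for a quick test, and a pooled `IBufferWriter<byte>` implementation could be substituted to avoid per-message buffer allocations.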

I would need to see a performance improvement of about 10x before I'd be in a position to use SQS in my .NET solutions. Otherwise, a significant portion of my EC2 costs would be spent on vCPUs executing code in the AWS SDK client. That's why I use the awesome EfficientDynamoDb library to access DynamoDB instead of the AWS .NET SDK. If it weren't for the EfficientDynamoDb library, I wouldn't be able to use DynamoDB either.

Note that the benchmark results above have the latest version of the SDK (3.7.400.22) still allocating 475 kB for every 100 kB message sent. That is still very high. With use of buffer pooling, it should be possible to reduce the heap allocations to nearly zero.

Again, I very much appreciate your response and the progress on this issue and I very much hope there will be further investment in improving the performance of the SQS client (as well as the broader .NET SDK).

Please let me know if there's anything I can do to help.

@peterrsongg
Contributor

@bill-poole Thanks for the detailed analysis. I agree that there is much more we can do performance-wise, and I'm not sure if you're aware, but we are working on a new major version of the SDK which gives us the platform to modernize. We've pulled in dependencies such as System.Text.Json, System.Memory, and System.Buffers so we can take advantage of spans and pooling like you are suggesting, which would increase performance. We've actually had plenty of community PRs making small improvements like this, and if you wanted to help you could definitely issue some PRs targeting the v4-development branch. Here is the issue tracking our progress on V4. However, the switch to System.Text.Json is something we want to do internally since it is a large effort.

With regards to throughput, I think a proper load test is required here. Since here we are just testing 1 operation where garbage collection isn't happening, it's difficult to say what the true MB/s would be. I'd be curious to see what happens when we send a high number of messages and GC starts kicking in. This is something I can test myself.

Anyways, just to see how much better V4 is in its current state, I ran the same benchmark. The allocations are much better, for a 100KB message size we allocate just around 100KB, but the performance is still not drastically better (only 8.2% faster). But I'm optimistic that this will help in throughput b/c it will decrease the pressure on the GC. Will follow up in a later comment on some load testing numbers.

Benchmarking numbers:
Version: v4-development (branch)

| Method | Mean | Error | StdDev | P50 | P90 | P95 | Allocated |
|---|---|---|---|---|---|---|---|
| SQSSendMessageAsync | 6.595 ms | 0.2591 ms | 0.7433 ms | 6.470 ms | 7.704 ms | 7.945 ms | 100428 B |

@bill-poole
Author

bill-poole commented Sep 29, 2024

> we are working on a new major version of the SDK which gives us the platform to modernize. We've pulled in dependencies such as System.Text.Json, System.Memory, and System.Buffers so we can take advantage of spans, and pooling like you are suggesting, which would increase performance.

Yes I was aware of the new version of the SDK and am very enthusiastic about the potential for such performance improvements; however, I'm unclear on the timeframe/priority for this work. As I understand it, such improvements are not in scope for the initial V4 release. It would be great to have much greater clarity on the timeframe/priority for these performance improvements.

Note that there are additional libraries that make buffer pooling easier that are not included in the list of new dependencies you mentioned, such as the Microsoft.Toolkit.HighPerformance library, which provides the ArrayPoolBufferWriter<T> class, which allows high performance writing of data to an auto-resizing contiguous buffer, where the buffers are drawn from the shared array pool (ArrayPool<T>.Shared). Another library in this space is Microsoft.IO.RecyclableMemoryStream, which is used by the EfficientDynamoDb library.

Will it be possible to bring such libraries into the V4 distribution at a later time (i.e., after its initial GA release)? Or do such decisions need to be made now before initial release?

> With regards to throughput, I think a proper load test is required here. Since here we are just testing 1 operation where garbage collection isn't happening, it's difficult to say what the true MB/s would be.

Sorry, I should have looked at the benchmark code. So, the network and SQS service latency is being included in the benchmarked time per send operation, correct? I agree a proper load test would therefore provide a much more meaningful result. It would allow an apples-to-apples comparison with the result we got in 2020 - i.e., we got our result by doing a load test from a local machine to the SQS service, sending hundreds of messages concurrently.

Note that we were in Perth, Western Australia and we were using the ap-southeast-2 (Sydney) region. However, we also tested locally using "localstack" and got the same result. i.e., the latency was absorbed by sending messages concurrently such that the CPU was always busy sending a message while waiting for a response from the SQS service.

However, now that I've looked at the benchmark code, it seems that the SQSBenchmarks.SetupForSendMessage method is creating a 10 kB UTF-16 encoded message, which is actually a 5 kB UTF-8 encoded message, which is what is actually sent over the wire to the SQS service, not a 100 kB message:

_messageBody = Utils.CreateMessage(Constants.KiloSize * 10);

The Utils.CreateStringOfSize(long sizeInBytes) method creates the returned string with a loop that invokes the StringBuilder.Append(char) method sizeInBytes / 2 times:

private static string CreateStringOfSize(long sizeInBytes)
{
    //2 bytes are needed for each characterse, since .net strings are UTF-16
    int numCharacters = (int)sizeInBytes / 2;
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0; i < numCharacters; i++)
    {
        stringBuilder.Append('A');
    }
    return stringBuilder.ToString();
}

The above method is correct that .NET strings are UTF-16 encoded, so the above implementation correctly creates a string with the given length in bytes, but SQS messages are encoded as UTF-8, which means that a string of 10,240 bytes of 'A' characters will result in an SQS message that has a payload of 5,120 bytes over the wire. So there is the potential for confusion as to whether the sizeInBytes parameter means the size of the message payload in memory or on the wire.
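
The difference between the two sizes can be checked directly. This is a small sketch with hypothetical names; the 2:1 ratio only holds for ASCII characters such as 'A', since non-ASCII characters take two or more bytes in UTF-8.

```csharp
using System.Text;

public static class EncodingSizes
{
    // Returns the size of `s` in bytes as UTF-16 (its in-memory encoding in
    // .NET) and as UTF-8 (the encoding used on the wire).
    public static (int Utf16Bytes, int Utf8Bytes) Measure(string s)
        => (Encoding.Unicode.GetByteCount(s), Encoding.UTF8.GetByteCount(s));
}
```

For a string of 5,120 'A' characters, `Measure` returns 10,240 UTF-16 bytes and 5,120 UTF-8 bytes.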

Note that a simpler and much more performant implementation of the Utils.CreateStringOfSize(long) method would be:

private static string CreateStringOfSize(long sizeInBytes)
{
    // 2 bytes are needed for each character, since .NET strings are UTF-16
    return string.Create(length: (int)sizeInBytes / 2, state: false, (span, state) => span.Fill('A'));
}

I recognize that this method isn't being used in any hot path anywhere, but I think it's worth making these kinds of changes, if not for performance, then for simplicity. Note there is also a StringBuilder.Append(char value, int repeatCount) method, which would have been appropriate to use prior to the availability of Span<T>.Fill(T).
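
For completeness, a sketch of the StringBuilder.Append(char, int) alternative, alongside the `string(char, int)` constructor, which is another option for a single repeated character. The class and method names are hypothetical; both keep the sizeInBytes / 2 convention of the original helper.

```csharp
using System.Text;

public static class RepeatedCharStrings
{
    // One Append call instead of a per-character loop.
    public static string ViaStringBuilder(long sizeInBytes)
        => new StringBuilder().Append('A', (int)(sizeInBytes / 2)).ToString();

    // The string constructor builds the repeated-character string directly.
    public static string ViaConstructor(long sizeInBytes)
        => new string('A', (int)(sizeInBytes / 2));
}
```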

Also note the performance result we got in 2020 was stated in terms of UTF-8 encoded payload bytes sent, not UTF-16 encoded bytes. So, to get an apples-with-apples comparison, the message size needs to be doubled.

So it seems that the benchmark result for the V4 preview is showing 100,428 bytes being allocated for each send operation, but seems to be sending only 5 kB (recognizing that is actually 10 kB of UTF-16 encoded text) for each send operation, assuming my above assertion is correct. Am I correct? Or have I got something wrong?

> Will follow up in a later comment on some load testing numbers.

I'm looking forward to seeing the results!

Have you considered adding the ability to mock out the HTTPS transport, which would allow benchmarking the client code in isolation of network latency and SQS service performance?

@peterrsongg
Contributor

> Will it be possible to bring such libraries into the V4 distribution at a later time (i.e., after its initial GA release)? Or do such decisions need to be made now before initial release?

If we wanted to bring additional libraries we would need to do that before GA, so Microsoft.Toolkit.HighPerformance is something I could bring up to the team along with Microsoft.IO.RecyclableMemoryStream.

> However, now that I've looked at the benchmark code, it seems that the SQSBenchmarks.SetupForSendMessage method is creating a 10 kB UTF-16 encoded message, which is actually a 5 kB UTF-8 encoded message, which is what is actually sent over the wire to the SQS service, not a 100 kB message:

Though the code in main looks like that, I ran the code on my own branch where I updated it to Constants.KiloSize * 100, so it was a 100 KB message. But that's a good callout: a 100 KB UTF-16 encoded message would actually be a 50 KB UTF-8 encoded message. The details!

> Note that a simpler and much more performant implementation of the Utils.CreateStringOfSize(long) method would be:
>
> private static string CreateStringOfSize(long sizeInBytes)
> {
>     // 2 bytes are needed for each character, since .NET strings are UTF-16
>     return string.Create(length: (int)sizeInBytes / 2, state: false, (span, state) => span.Fill('A'));
> }
>
> I recognize that this method isn't being used in any hot path anywhere, but I think it's worth making these kinds of changes, if not for performance, then for simplicity. Note there is also a StringBuilder.Append(char value, int repeatCount) method, which would have been appropriate to use prior to the availability of Span<T>.Fill(T).
>
> Also note the performance result we got in 2020 was stated in terms of UTF-8 encoded payload bytes sent, not UTF-16 encoded bytes. So, to get an apples-with-apples comparison, the message size needs to be doubled.
>
> So it seems that the benchmark result for the V4 preview is showing 100,428 bytes being allocated for each send operation, but seems to be sending only 5 kB (recognizing that is actually 10 kB of UTF-16 encoded text) for each send operation, assuming my above assertion is correct. Am I correct? Or have I got something wrong?

Thanks for the suggestions on the code, I'll look to update the code both to simplify it and create an overloaded method that accepts an additional parameter which doubles the size of the message sent over the wire if the service expects a UTF-8 encoded message. Appreciate you looking at the performance benchmarking code!

> Have you considered adding the ability to mock out the HTTPS transport, which would allow benchmarking the client code in isolation of network latency and SQS service performance?

It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet. The decision to not mock the https transport in v3 of the SDK came down to differences in netframework35 vs netframework 45 and netcoreapp31 but that decision was made before my time so I don't know exactly the details. Now that we are dropping netframework35 support, I believe we could start improving that area of testing in the sdk.

@bill-poole
Author

> If we wanted to bring additional libraries we would need to do that before GA, so Microsoft.Toolkit.HighPerformance is something I could bring up to the team along with Microsoft.IO.RecyclableMemoryStream.

I assume therefore a design and possible prototype of reading/writing pooled buffers would be needed prior to selecting which of these libraries is needed/appropriate?

> It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet.

If the SQS client can be configured to use a custom HttpMessageHandler, then the HttpMessageHandler.SendAsync method can be mocked (because it is declared abstract) to possibly meet this requirement.
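
A minimal sketch of such a handler is below. The class name and the canned `{}` response body are hypothetical; a real SQS stub would presumably need to return a response the client accepts (for example, one containing a message ID and body checksum), and how the handler gets wired into the SDK is not shown here.

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class CannedResponseHandler : HttpMessageHandler
{
    private int _requestCount;

    // Number of requests this handler has answered.
    public int RequestCount => _requestCount;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (request.Content != null)
        {
            // Drain the body so request serialization still runs, without any network I/O.
            await request.Content.CopyToAsync(Stream.Null).ConfigureAwait(false);
        }

        Interlocked.Increment(ref _requestCount);

        // Placeholder response body; an assumption, not a valid SQS response.
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            RequestMessage = request,
            Content = new StringContent("{}")
        };
    }
}
```

An `HttpClient` constructed with this handler (`new HttpClient(new CannedResponseHandler())`) never touches the network, so a benchmark built on it would measure only client-side CPU and allocations.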

@peterrsongg
Contributor

> I assume therefore a design and possible prototype of reading/writing pooled buffers would be needed prior to selecting which of these libraries is needed/appropriate?

Some sort of justification as to why we should include these new dependencies would be presented internally to the team. This could include a prototype and some performance improvement numbers or something like that.

> It's on our radar and would definitely simplify a lot of our testing code for other services as well, but we just haven't gotten around to it yet.
>
> If the SQS client can be configured to use a custom HttpMessageHandler, then the HttpMessageHandler.SendAsync method can be mocked (because it is declared abstract) to possibly meet this requirement.

Will keep that in mind when designing a mocked client👍

@bill-poole
Author

> Some sort of justification as to why we should include these new dependencies would be presented internally to the team. This could include a prototype and some performance improvement numbers or something like that.

I was just thinking that you'd want to be very sure you chose the correct library/libraries before being locked into them, so I'd imagine a prototype would be needed. I'd be very interested to see and provide feedback on the prototype.

@normj
Member

normj commented Sep 30, 2024

Adding some more clarity on dependencies. Post GA I believe we can still add new dependencies, but the value has to be significant, not just a minor performance improvement in non-hot-spot areas. We would need to do some significant version bump and possibly write a blog post to make sure users of the SDK are not surprised. We do have to support users that are acquiring the SDK outside of NuGet, and dependencies get harder in those cases.

@normj
Member

normj commented Sep 30, 2024

The HttpClientFactory property can be used to configure the SDK to use a mocked HttpClient for testing.

@peterrsongg
Contributor

@bill-poole we're starting the work of switching to STJ marshalling. I put out the first PR here, though it's still a WIP: #3528, if you're interested.
