Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments #52031

Closed
VincentBu opened this issue Apr 29, 2021 · 40 comments · Fixed by #52052
Assignees
Labels
arch-arm32 area-System.Net os-linux Linux OS (any supported distro) test-run-core Test failures in .NET Core test runs
Milestone

Comments

@VincentBu
Copy link
Contributor

Run: runtime 20210428.85

Failed test:

net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.313.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.13-helix-arm32v7-20210414141857-1ea6b0a
 -System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(windowBits: 15)

Error message:

System.Threading.Tasks.TaskCanceledException : A task was canceled.


Stack trace
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken) in /_/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs:line 564
   at System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(Int32 windowBits) in /_/src/libraries/System.Net.WebSockets/tests/WebSocketDeflateTests.cs:line 444
--- End of stack trace from previous location ---
@VincentBu VincentBu added arch-arm32 area-System.Net os-linux Linux OS (any supported distro) labels Apr 29, 2021
@ghost
Copy link

ghost commented Apr 29, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Run: runtime 20210428.85

Failed test:

net6.0-Linux-Release-arm-CoreCLR_checked-(Alpine.313.Arm32.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:alpine-3.13-helix-arm32v7-20210414141857-1ea6b0a
 -System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(windowBits: 15)

Error message:

System.Threading.Tasks.TaskCanceledException : A task was canceled.


Stack trace
   at System.Net.WebSockets.ManagedWebSocket.SendFrameFallbackAsync(MessageOpcode opcode, Boolean endOfMessage, Boolean disableCompression, ReadOnlyMemory`1 payloadBuffer, CancellationToken cancellationToken) in /_/src/libraries/System.Net.WebSockets/src/System/Net/WebSockets/ManagedWebSocket.cs:line 564
   at System.Net.WebSockets.Tests.WebSocketDeflateTests.PayloadShouldHaveSimilarSizeWhenSplitIntoSegments(Int32 windowBits) in /_/src/libraries/System.Net.WebSockets/tests/WebSocketDeflateTests.cs:line 444
--- End of stack trace from previous location ---
Author: VincentBu
Assignees: -
Labels:

arch-arm32, area-System.Net, os-linux

Milestone: -

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 29, 2021
@zlatanov
Copy link
Contributor

zlatanov commented Apr 29, 2021

To me this issue seems unrelated to the reverted "fix ber scanf/printf". For some reason, unknown to me, on ARM architecture, this test takes more than 5 seconds to complete.

I have created this test to check that deflate context takeover works as expected, but it is a bit speculative on the compression algorithm. We can either increase the timeout (or even remove it entirely and depend on xunit cancelling the test if something hangs), or delete the test, because if in the future the underlying compression ratio changes for some reason, this test will be outdated and can start to fail.

//cc @CarnaViire

@CarnaViire
Copy link
Member

Timeouts like that tend to be unreliable in our CI, if the test is executed on a heavily loaded machine, it may take more than expected... That being said, we should first gather statistics, how common is this test failure, or it was a one-off. I'll search for the stats.

Btw, how could the algorithm change that much so the aggregation of compressed parts of the message will be much bigger/smaller than the compressed whole message?.. wouldn't it be a different/new algo then, not DEFLATE as it was described in RFC?

@zlatanov
Copy link
Contributor

Btw, how could the algorithm change that much so the aggregation of compressed parts of the message will be much bigger/smaller than the compressed whole message?.. wouldn't it be a different/new algo then, not DEFLATE as it was described in RFC?

The deflate only describes the structure/format of the data. The algorithm is implementation detail that might change (for example zlib-intel vs classic zlib), memory constraints, different performance optimizations that might trade compression ratio for speed.

@CarnaViire
Copy link
Member

CarnaViire commented Apr 29, 2021

I see that it is failing quite regularly (53 times in 2 days on main) for 14 and 15 window bits for

alpine.313.arm32
alpine.312.arm64
ubuntu.1804.arm32
ubuntu.1804.arm64

I'm ok with removing the test as we cannot depend on it from the algorithm's point of view, but... Interesting that it is not failing, but hanging, and it's only on arm. I'm interested in what will happen if we just remove our timeout, whether it will complete after some more time, or will continue hanging until killed as LongRunningTest 🤔

@zlatanov
Copy link
Contributor

I'm interested in what will happen if we just remove our timeout, whether it will complete after some more time, or will continue hanging until killed as LongRunningTest

There is only one way to find out 😎

However I have the nagging suspicion that there is some problem with Random on ARM.
There is nothing else in this test that is different from the other deflate tests. Some of them should actually be more expensive than this one.

Is there any way I could run a test on ARM, or the only way would be to create e DRAFT PR that I would delete later?

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 29, 2021
@zlatanov
Copy link
Contributor

@CarnaViire I created a PR, we wait now to see the run times for the test. If I am correct, the fix should be successful. If however you feel that we should remove the test, let me know.

@CarnaViire
Copy link
Member

Would it defeat the purpose of the test to make the message smaller, e.g. frame x5, not x10?
I also wanted to ask about NextBytes, but you've done that already 🙂

And as for running a test on ARM, I'd say CI is the best way I know of.

@zlatanov
Copy link
Contributor

Would it defeat the purpose of the test to make the message smaller, e.g. frame x5, not x10?

It would not. 5 is as good a number as 10 in this case 😊 Let's see how the CI build turns out, I think the issue here was the usage of Random.

@karelz karelz added this to the 6.0.0 milestone Apr 29, 2021
@karelz karelz removed the untriaged New issue has not been triaged by the area owner label Apr 29, 2021
@BruceForstall
Copy link
Member

@CarnaViire

And as for running a test on ARM, I'd say CI is the best way I know of.

@karelz should be able to arrange for you to get access to an ARM machine if needed.

@zlatanov
Copy link
Contributor

@CarnaViire can we run this code in one of the failing environments (alpine.313.arm32, alpine.312.arm64, ubuntu.1804.arm32, ubuntu.1804.arm64):

using System;
using System.Diagnostics;

var random = new Random(0);
var stopwatch = Stopwatch.StartNew();

for (int i = 0; i < 1_000_000; ++i)
{
    random.Next(maxValue: 10);
}

stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed);

It shouldn't take more than a few hundred milliseconds on the slowest silicon. On my i7 machine it takes 00:00:00.0074053.

//cc @danmoseley @BruceForstall @karelz

@danmoseley
Copy link
Member

@BruceForstall this was done in #52086

I agree (this is a reminder to us all 🙂) that when a test fails regularly in CI we should disable it immediately if we can't fix it immediately. We're good engineers and prefer to investigate and fix rather than disable anything, but that can happen while it's disabled. 🙂

@MattGal
Copy link
Member

MattGal commented May 3, 2021

@BruceForstall I have no idea how to get hands on ARM machines -- perhaps infra folks can help here? @ViktorHofer any idea?

I assume Viktor CC'ed me for this. We have a limited # of ARM machines that can be accessed via SSH / RDP from appropriate source IPs, you can reach out to @ilyas1974 or myself if he's unavailable to borrow a machine.

@CarnaViire
Copy link
Member

CarnaViire commented May 5, 2021

@zlatanov I was finally able to get access to and setup an ARM machine (failed to setup alpine in docker, so using ubuntu.1804.arm64) to run the code with Random you've posted here #52031 (comment). No breakthrough though... it shows around 00:00:00.0308034 on multiple runs.
I'll try to run and measure the whole test. If you have any experiments for this test you'd like to check, just send it and I'll run it for you.

@CarnaViire
Copy link
Member

For the whole test too, I see at max 00:00:00.3995667 for 15 bits, nowhere near 5 seconds...

@danmoseley
Copy link
Member

That's wacky -- we know from Kusto that it was always failing (for 14/15). Not even flaky. That includes Ubuntu.1804.arm64.

It's our own zlib, right? It's not an issue with which zlib is on the machine?

@karelz
Copy link
Member

karelz commented May 6, 2021

Looks like yet another one of the cases where chasing down test failures in CI is difficult without having access to the exact CI machine setup :( ... is there a way to get temporary access to one of the CI machines to confirm we can reproduce it there manually at least?

@CarnaViire
Copy link
Member

As I understand it, I actually did use one of the CI machines that was excluded from the rotation for me: https://github.com/dotnet/core-eng/issues/13006. But because it's excluded, it's not getting the same load it did while it was in CI, so maybe that's the reason... I also ended up creating my own docker image for Ubuntu 18.04, but I doubt it could influence it that much? Or could it?
@danmoseley what makes you say it was always failing? We wouldn't have green CI when merging the original PR if that was the case, but we did...

@zlatanov
Copy link
Contributor

zlatanov commented May 7, 2021

Thanks @CarnaViire!

I still don't think this is related to zlib. @danmoseley we don't use machine wide installation of zlib. In case of ARM, we use the classic zlib instead of zlib-intel.

I will look at the build/tests outputs to see if I can find something interesting. As @CarnaViire suggested it might be related to how busy is the machine, but from 0.01 seconds to 7 seconds seem a bit too much for me. I don't expect the xunit runtime to launch more concurrent tests than there are CPUs. Also if this was the case we should see the total runtime for all the tests combined to be in order of magnitude slower than the other environments.

@zlatanov
Copy link
Contributor

zlatanov commented May 7, 2021

After looking again in the build logs, I don't think it's load related either. The run times are too consistent - almost 4 seconds for 13 bits, 7 seconds for 14 bits and 14 seconds for 15 bits.

If it was load related, I expect we would have deviations, but we don't.

@danmoseley
Copy link
Member

@danmoseley what makes you say it was always failing?

I spoke imprecisely, I meant when it failed, it was always 14 or 15, try this query:

Execute: Web | Desktop | Web (Lens) | Desktop (SAW)

https://engsrvprod.kusto.windows.net/engineeringdata

TestResults 
| join kind=inner WorkItems on WorkItemId 
| join kind=inner Jobs on JobId
| where Finished >= datetime(2021-3-1 0:00:00)
and Type == "System.Net.WebSockets.Tests.WebSocketDeflateTests"
and Branch == "refs/heads/main"
| summarize count() by Method, Result, QueueAlias, Arguments, Message

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 7, 2021
@danmoseley
Copy link
Member

After looking again in the build logs, I don't think it's load related either. The run times are too consistent - almost 4 seconds for 13 bits, 7 seconds for 14 bits and 14 seconds for 15 bits.

Which logs are these? I didn't think we retained data for individual tests unless they failed, and I thought you couldn't repro locally.

Something else I see from querying the failures: it never fails on 13, and it fails on 15 about twice as much as 14. When it fails for 14, it always fails for 15. (This is consistent with the "machine mysteriously slow" theory)

One other idea -- roll back #52086 and add a Stopwatch around each part of the test, including the Random loop, to measure how long each part takes. Put a try/catch for TaskCanceledException around the whole test, and if it's hit, then add those sub-times into the message. That will tell us where the time was spent. Ideally we'd do this locally on @CarnaViire 's ARM machine and just run the test over and over that way ?

@danmoseley
Copy link
Member

This probably can be done with a simple console app, no xunit involved.

@CarnaViire
Copy link
Member

The simple console app doesn't help - it consistently shows the small numbers I've posted before, like 0.03 for random and 0.3 for the rest... now I'm trying to build and run the actual full test suit, possibly xunit parallel test run influences it somehow.

@CarnaViire
Copy link
Member

So, even when running all WebSockets tests with parallel execution, I am getting 0.34-0.35s for 15 window bits for this test.

I guess we have to try executing it in actual CI pipeline... which I am not sure how to do because CI seems to skip test run on ARM e.g. here #52417 (it says it's because isFullMatrix==false apparently) and I don't know how to trigger it 😢

cc @safern @ViktorHofer

@zlatanov
Copy link
Contributor

Which logs are these? I didn't think we retained data for individual tests unless they failed, and I thought you couldn't repro locally.

The build logs seem to be retained for 60 days. If you go to the tests and filter them you can find the websocket tests for each of the platform. The run times are too consistent. You can try and find older builds (PRs that mentioned this issue) and you will see similar run times.

And because I originally used 5 seconds as cancellation timeout, having the test throw TaskCanceledException after 14 seconds is dubious.

image

@safern
Copy link
Member

safern commented May 10, 2021

and I don't know how to trigger it

You can run it manually from your branch. Just click on "Run Pipeline" here: https://dev.azure.com/dnceng/public/_build?definitionId=686

Then, on the source branch choose your branch... you can push a branch to the dotnet fork of dotnet/runtime and then it will show up as available to run the pipeline from; or if you have a PR you can use refs/pull/<PRid>.

@CarnaViire
Copy link
Member

I've found the difference between what I was running on my machine and what was failing in CI. I was running on Release runtime, but the test actually ran overtime on Checked runtime.
Here on Release runtime, the test takes the same 0.3s I was seeing
image
And here on Checked runtime, it takes more than 5s
image
The pipeline with these results is here.
I will build Checked runtime on my machine to confirm I have the repro and to see what exactly takes the time.

@danmoseley
Copy link
Member

I was racking my brain to think what we were missing, and you figured it out. 👏

@CarnaViire
Copy link
Member

I've run the measurements multiple times on Checked runtime and it was indeed Random taking all the time there: random part takes 4.8s and deflate part takes 0.3s. When I apply fix from #52052, random part reduces to 0.02s 😲.
@danmoseley should we pass a word about it to people owning Random?
I will reopen @zlatanov's PR as the fix is working, and I will double-check that in CI 😊

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 11, 2021
@danmoseley
Copy link
Member

@jkotas does it surprise you that Random is orders of magnitude slower on a checked runtime on ARM?

@jkotas
Copy link
Member

jkotas commented May 11, 2021

I do not see an obvious reason why it should be that much slower compared to release build.

CarnaViire pushed a commit that referenced this issue May 11, 2021
Reducing the number of times Random.Next is called to improve runtime performance of test on ARM. 
Fixes #52031
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label May 11, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jun 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm32 area-System.Net os-linux Linux OS (any supported distro) test-run-core Test failures in .NET Core test runs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants