Bug in Azurite table service #19

Open
anna-mazhar opened this issue Nov 9, 2022 · 0 comments

Running Orleans tests on emulator and cloud service

After running the Orleans tests against both Azurite and the Azure cloud service, we found some discrepancies in the test outcomes. We have yet to find the root cause of one of these discrepancies, which we believe is due to a potential bug in Azurite.


Priority: Medium

Environment

-- Azurite: 3.19.0
-- Orleans: 3.6.5
-- Node: 16.18.0
-- Windows VM specs: 16 GB RAM; Microsoft Windows 10 Pro
-- Used the rainmaker proxy to select tests that invoke the Azure REST API.

Discrepancy details

We found that 5 tests fail on the emulator but pass with the cloud service.

Four of these five tests (stress tests) share the same cause of failure on the emulator:

These tests make upsert calls to an Azure table.
The total number of upsert calls ranges from 1000 to 2000 across these 4 tests.
Many of these calls fail with a 404 (Not Found) response, while others succeed with a 204.
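To compare runs, it helps to record how many calls end in each status code. Below is a minimal sketch of such a tally, not the Orleans test code. The names `tally_upserts`, `fake_upsert`, and `run_tally` are hypothetical, and `fake_upsert` is only a stand-in that always returns 204; a real version would send the upsert and return the HTTP status it gets back.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def tally_upserts(upsert: Callable[[int], Awaitable[int]], total: int) -> Counter:
    """Issue `total` upserts concurrently and count the status codes returned."""
    statuses = await asyncio.gather(*(upsert(i) for i in range(total)))
    return Counter(statuses)


async def fake_upsert(i: int) -> int:
    # Hypothetical stand-in for the real upsert call; it always reports 204.
    # A real implementation would send the request and return the HTTP
    # status code, mapping a failed request to its status (for example 404).
    await asyncio.sleep(0)
    return 204


def run_tally(total: int = 1000) -> Counter:
    # Convenience wrapper so the tally can be run from synchronous code.
    return asyncio.run(tally_upserts(fake_upsert, total))
```

Calling `run_tally(1000)` returns a `Counter` keyed by status code, so the 204 and 404 counts from different runs can be compared directly.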

The following is an example exception raised for a single failed call:

---- Azure.RequestFailedException : Service request failed.
Status: 404 (Not Found)

Headers:
Connection: keep-alive
Content-Length: 0

More interestingly, the ratio of successful to failed calls varies from run to run.
In one run ~700 calls succeeded, and in other runs ~400.
Even more surprising, these tests never failed on Ze’s machine when run individually, but they fail consistently on my VM.

Hypothesis: Limited memory in the VM caps the size of the table maintained there, which in turn limits the number of successful upserts.
Memory usage was monitored and found to be far below the limit: usage was only 10-15 MB. Also, the number of successful upsert calls differs across runs, so it is hard to argue that the failure correlates with an in-memory table size limit.

Hypothesis: The rainmaker proxy is interfering with the requests.
After removing the proxy, the tests passed consistently. We then suspected that the proxy introduces a delay that makes the requests fail.

Order of the upsert calls
I noticed that requests reach the Azurite table service badly out of order when the proxy is in use. The calls are asynchronous, so we do not expect them to arrive in the order they were made, but the reordering is much more pronounced with the proxy. This led to the following hypothesis.

Hypothesis: The upsert calls are asynchronous, and the proxy delay scrambles their order further. If the order is less scrambled, the tests may pass.
I added a delay of 500 ms after each call so that the calls are received in the same order as they are issued. Surprisingly, the tests passed consistently in 3 runs.
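The ordering effect can be reproduced in isolation. The sketch below is a simulation only: `through_proxy` is a hypothetical stand-in that holds each request for a random time, and it involves neither Azurite nor the rainmaker proxy. The gap and jitter values are illustrative and much smaller than the 500 ms used in the tests. It shows that requests sent back to back arrive reordered, while a gap larger than the maximum hold time preserves the order. It does not explain why Azurite returns 404 for reordered requests.

```python
import asyncio
import random


async def through_proxy(seq: int, arrivals: list, rng: random.Random, max_jitter: float) -> None:
    # Hypothetical proxy: hold the request for a random time, then forward it.
    await asyncio.sleep(rng.uniform(0, max_jitter))
    arrivals.append(seq)  # order in which the "server" sees the requests


async def send_all(n: int, gap: float, max_jitter: float, seed: int = 0) -> list:
    """Send n requests, waiting `gap` seconds after each; return the arrival order."""
    rng = random.Random(seed)
    arrivals: list = []
    tasks = []
    for seq in range(n):
        tasks.append(asyncio.create_task(through_proxy(seq, arrivals, rng, max_jitter)))
        if gap > 0:
            await asyncio.sleep(gap)  # the delay added after each call
    await asyncio.gather(*tasks)
    return arrivals


def arrival_order(n: int, gap: float, max_jitter: float) -> list:
    # Convenience wrapper so the simulation can be run from synchronous code.
    return asyncio.run(send_all(n, gap, max_jitter))
```

For example, `arrival_order(20, gap=0, max_jitter=0.05)` sends everything at once, while `arrival_order(10, gap=0.02, max_jitter=0.005)` spaces the calls by more than the longest hold time.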

Hypothesis: The proxy is not the culprit, and the service should tolerate the delay. The Azurite table service should be able to handle these out-of-order requests.
We ran the same tests in the same environment with the rainmaker proxy, but replaced Azurite with Azure Storage Emulator (the legacy predecessor of Azurite). The tests passed consistently in this environment.

Conclusion: Azurite and its legacy predecessor behave inconsistently. The legacy emulator handles these stress tests correctly, which points to a bug in Azurite.

What's next?
Find the root cause for the 4 Orleans tests that fail on the emulator in the VM.

  • Why do they pass with a delay between consecutive calls?
  • Why do they pass on Azure Storage Emulator but fail on Azurite?