Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducible race condition in creating Mercurial project and uploading .zip file #1025

Closed
rmunn opened this issue Aug 13, 2024 · 7 comments · Fixed by #1026
Closed

Reproducible race condition in creating Mercurial project and uploading .zip file #1025

rmunn opened this issue Aug 13, 2024 · 7 comments · Fixed by #1026
Assignees
Labels
bug Something isn't working

Comments

@rmunn
Copy link
Contributor

rmunn commented Aug 13, 2024

Describe the bug
While working on sillsdev/LfMerge#342, I found a consistent reproduction for the race condition that #765 was addressing. I believe that the fix for #765 did manage to fix the NFS issue for most cases, but sillsdev/LfMerge#342 has a consistently reproducible test failure that shows that the NFS issue may not be completely gone.

To Reproduce
Steps to reproduce the behavior:

  1. Clone https://github.com/sillsdev/LfMerge/ and check out commit 238049a25025aceb74e78138927477fe4a2d5e08.
  2. Make sure LexBox is running on localhost.
  3. Run dotnet test -l "console;verbosity=normal" --filter=Foo_
  4. TusException : The remote server returned an error: (500) Internal Server Error.
  5. Check the LexBox logs and you'll see that ProjectService.FinishReset is calling UpdateProjectMeta which in turn calls HgService.GetLastCommitTimeFromHg, and Mercurial is returning a 404 for /hg/s/{projectCode}/log?style=json-lex&rev=tip.

Expected behavior
FinishReset would get a correct last commit time and would not return a 404 for a project that exists and has just had a .zip file uploaded to it.

@rmunn rmunn added the bug Something isn't working label Aug 13, 2024
@rmunn rmunn self-assigned this Aug 13, 2024
@rmunn
Copy link
Contributor Author

rmunn commented Aug 13, 2024

Further test results: if, instead of getting the last commit date from hg/s/{projectCode}/log?style=json-lex&rev=tip, we instead get it from the command runner running hg log -r tip -T {date|hgdate} then the race condition does not happen. (Consistently, just as it consistently does happen if the last commit is obtained via HTTP).

@rmunn
Copy link
Contributor Author

rmunn commented Aug 14, 2024

It's definitely the refreshinterval causing this. When I set refreshinterval to 0 ("always refresh") in hgweb.hgrc and make no other changes, the repro stops reproing. Set refreshinterval back to 5 and it fails. (In fact, even a refresh interval of 1 second fails, because of how the tests are arranged: one test looks up hg/s/{projectCode}/log, which causes Mercurial to refresh its repo list. Then the next test created a project followed by looking up hg/f/{otherCode}/log, but Mercurial doesn't refresh the repo list so it doesn't find the project).

The reason why using the command runner to get the last commit date fixes the issue is because the command runner is using the hg command line, and the refresh interval is specific to hgweb.

@rmunn
Copy link
Contributor Author

rmunn commented Aug 14, 2024

Some logs showing failure with a refresh interval of 1 second:

[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:05 +0000] "GET /hg/s/sena-3/file/tip?style=json HTTP/1.1" 200 929
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /command/sena-3/invalidatedircache HTTP/1.1" 200 -
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /command/sena-3/tip HTTP/1.1" 200 40
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /command/sena-3/invalidatedircache HTTP/1.1" 200 -
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /command/sena-3/tip HTTP/1.1" 200 40
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /command/sena-3/lexentrycount HTTP/1.1" 200 5
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:07 +0000] "GET /hg/s/sena-3/log?style=json-lex&rev=tip HTTP/1.1" 200 440
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/invalidatedircache HTTP/1.1" 200 -
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/tip HTTP/1.1" 200 40
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/invalidatedircache HTTP/1.1" 200 -
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/tip HTTP/1.1" 200 40
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/invalidatedircache HTTP/1.1" 200 -
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/tip HTTP/1.1" 200 40
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /command/fw-2442044c-fdd3-4f20-b234-54c284673851/lexentrycount HTTP/1.1" 200 5
[hgweb] 10.1.12.113 - - [14/Aug/2024:01:03:08 +0000] "GET /hg/f/fw-2442044c-fdd3-4f20-b234-54c284673851/log?style=json-lex&rev=tip HTTP/1.1" 404 22

Note how the command/fw-(guid) lines all succeeded with a 200 status, but the /hg/f/fw-(guid)/log failed with a 404 because hgweb couldn't see the fw-(guid) project, even though the /command/fw-(guid) lines immediately beforehand prove that it was definitely there on the filesystem.

@hahn-kev
Copy link
Collaborator

I would be in favor of switching how we get the latest date to use the command runner. We need to be carful and make sure it returns the same date that our current method gets (maybe do both and compare, throw an error if they are off by more than 1 second, then test it against a bunch of repos.

Also, that log just shows that the command returned a 200, does it actually have data? it won't necessarily error if there's no data in the folder.

@rmunn
Copy link
Contributor Author

rmunn commented Aug 14, 2024

The lexentrycount returned 5 bytes, 4 bytes for "1462" and one byte for the newline. It would have returned "0\n" (2 bytes) if the repo had been empty. (Edit: Confirmed by printing the lexentrycount result to the log).

I'll open a PR changing GetLastCommitDate to use the command runner.

@rmunn
Copy link
Contributor Author

rmunn commented Aug 14, 2024

Okay, #1026 opened with a fix and a unit test. The dates in the test were taken from an actual repo and have positive, negative, and zero timezone offsets. I've also verified that the code doesn't fall over when run on a repo with no commits in it.

@rmunn
Copy link
Contributor Author

rmunn commented Aug 19, 2024

Bumping size estimate up from Small to Medium because this issue has expanded to include figuring out why the GHA integration tests have failures that aren't showing up when I run them locally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants