Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remote targets: tarballs and source repos #5351

Merged
merged 35 commits into from
Jul 14, 2018
Merged

Remote targets: tarballs and source repos #5351

merged 35 commits into from
Jul 14, 2018

Conversation

dcoutts
Copy link
Contributor

@dcoutts dcoutts commented May 29, 2018

This is for review, I'll be adding more commits as I tidy things up from the wip/remote-targets branch.

So far it's just got the VCS and Get change, a preview of which was in PR #5333. The version here should address the review comments from that PR.

  • VCS abstraction and use in Get module
  • Unit tests for the above
  • Support for local tarballs
  • Support for remote tarballs
  • Support for remote source repos
  • Unit and/or integration tests for the above

And the standard checklist:

  • Patches conform to the coding conventions.
  • Any changes that could be relevant to users have been recorded in the changelog.
  • The documentation has been updated, if necessary.

The VCS and Get modules have quite good tests. The remaining integration has only been tested manually so far...

@dcoutts dcoutts requested review from 23Skidoo, hvr and typedrat May 29, 2018 13:32
@typedrat
Copy link
Collaborator

This obsoletes #5333, so I closed that PR. Obv, can be reopened if anyone has an issue with that.

@hvr
Copy link
Member

hvr commented May 29, 2018

The win32 failures were btw of the following variety:

  UnitTests.Distribution.Client.VCS
    check VCS test framework
      git:                                                                   FAIL (0.40s)
        *** Failed! Exception: 'removeDirectoryRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:DeleteFile "C:\\Users\\appveyor\\AppData\\Local\\Temp\\vcstest-2564\\src\\.git\\objects\\8d\\4d25610950af005ce3a17d2596da20d669085b": permission denied (Access is denied.)' (after 1 test):
        BranchingRepoRecipe [Left (TaggedCommits "tag_E" [Commit [FileUpdate "file/E" "pBDG}FKDY_"]])]
Use --quickcheck-replay=982686 to reproduce.

@dcoutts
Copy link
Contributor Author

dcoutts commented May 30, 2018

First guess about the Windows CI failure: after each test, once the git program has terminated, we delete the repo; however Windows is lazy with releasing file locks so the files git had open are still locked for a few miliseconds after it terminates.

So a solution might be to do a threadDelay mtimeChange at the end of the test before cleaning up the test dir. For example, stick it into testSetup at the end after running theTest.

@Mistuke
Copy link
Collaborator

Mistuke commented May 30, 2018

It's only lazy if the process exits without releasing the locks. In that the interrupt to release the lock IRP_MN_UNLOCK_ALL has to be sent by the I/O manager. But the cleanup interrupt to release all locks are only sent when all handles to the lock are closed (IRP_MJ_CLEANUP). http://download.microsoft.com/download/4/3/8/43889780-8d45-4b2e-9d3a-c696a890309f/File%20System%20Behavior%20Overview.pdf

I think it it's likely that there is just an existing open handle to the file/folder preventing the cleanup from happening. The I/O manager should know to process interrupts in order so the delete should never be sent before the unlock if they are both queued.

It seems that Appveyor may have windows defender still enabled https://help.appveyor.com/discussions/questions/2898-spuriously-failing-test-due-to-seemingly-stray-open-handles they say it's seemingly disabled in the Visual Studio 2017 image type. Which image is used for cabal?

A common workaround is to have a retry loop (what postgres etc do, and what we do in the ghc testsuite) or to use DELETE_ON_CLOSE on the handle, so that when the last handle is eventually closed the system removes the file. This prevents any polling on our side.

@dcoutts
Copy link
Contributor Author

dcoutts commented May 30, 2018

Windows is so bonkers:

https://superuser.com/questions/918476/in-windows-do-file-locks-stay-after-a-process-was-terminated-via-taskkill
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365202.aspx

I mean, seriously, the only thing a parent process can sanely wait on to make sure it doesn't trample on shared resources is the child process terminating, and then Win32 goes and removes that option.

@Mistuke
Copy link
Collaborator

Mistuke commented May 30, 2018 via email

@hvr
Copy link
Member

hvr commented May 30, 2018

@Mistuke

Which image is used for cabal?

the VS 2015 image-type was used; I've switched it to VS 2017 and triggered a rebuild of this PR; let's see if this changes anything.

@dcoutts
Copy link
Contributor Author

dcoutts commented May 30, 2018

If it doesn't we can turn up the verbosity in that test and track down at what point the failure occurs. I'm just guessing currently (though I think it's a pretty reasonable guess).

@hvr
Copy link
Member

hvr commented Jun 1, 2018

@dcoutts I'm afraid we need to go for that verbosity increase...

@hvr
Copy link
Member

hvr commented Jun 1, 2018

@dcoutts with increased verbosity, we now get (the git config error is orthogonal):

Running 1 test suites...
Test suite unit-tests: RUNNING...
File modification time resolution calibration completed, maximum delay
observed: 125.003 ms. Will be using delay of 250.006 for test runs.
"C:\Program Files\Git\mingw64\bin\git.exe" "--version"
Warning: cannot determine version of C:\Program Files\Git\mingw64\bin\git.exe
:
"git version 2.16.2.windows.1\n"
"C:\Program Files\Git\mingw64\bin\git.exe" "init"
Initialized empty Git repository in C:/Users/appveyor/AppData/Local/Temp/vcstest-1344/src/.git/
"C:\Program Files\Git\mingw64\bin\git.exe" "add" "file/C"
"C:\Program Files\Git\mingw64\bin\git.exe" "commit" "--all" "--message=a patch" "--author=Author <author@example.com>"
*** Please tell me who you are.
Run
  git config --global user.email "you@example.com"
  git config --global user.name "Your Name"
to set your account's default identity.
Omit --global to set the identity only in this repository.
fatal: unable to auto-detect email address (got 'appveyor@APPVYR-WIN.(none)')
Unit Tests
"C:\Program Files\Git\mingw64\bin\git.exe" "--version"
  UnitTests.Distribution.Client.VCS
    check VCS test framework
      git:                                                                   FAIL (0.18s)
        *** Failed! Exception: 'removeDirectoryRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removePathRecursive:DeleteFile "C:\\Users\\appveyor\\AppData\\Local\\Temp\\vcstest-1344\\src\\.git\\objects\\90\\781888239b7a5c4e9011438395ed68f2425f4f": permission denied (Access is denied.)' (after 1 test):
        BranchingRepoRecipe [Left (TaggedCommits "tag_A" [Commit [FileUpdate "file/C" "^Yf{'@JVRr"]])]
        Use --quickcheck-replay=696078 to reproduce.

@typedrat
Copy link
Collaborator

typedrat commented Jun 1, 2018 via email

@23Skidoo
Copy link
Member

23Skidoo commented Jun 1, 2018

I have Windows installed, can run tests for you guys.

@hvr
Copy link
Member

hvr commented Jun 1, 2018

@dcoutts I incremented the sleep-time, and it didn't change anything; in fact I don't think the threadDelay is in the code-path we care about, as I noticed FAIL (0.18s) which implies that the test-case took less than a 1s to execute and fail.

@dcoutts
Copy link
Contributor Author

dcoutts commented Jun 1, 2018

Ah, ok, so git fails (due to the author thing) and we propagate an exception, the bracket handler executes the cleanup and so we end up with the same problem, that we didn't wait between the end of the (failing) test and removing the dir contents.

So we should 1. adjust the git flags to provide the user name and email or whatever git is complaining about and 2. move the thread delay inside the remove dir cleanup handler so that it delays in both the success and failing cases (otherwise any future failures will also be masked by this confusing and annoying failure). Or switch from simple delay to the try-multiple-times dir cleanup.

@hvr
Copy link
Member

hvr commented Jun 5, 2018

@Mistuke, so even the generous 10s delay seems to leave this issue unaffected; any ideas/suggestions what we can/should do? I still don't fully understand what's going on here.

@dcoutts
Copy link
Contributor Author

dcoutts commented Jun 6, 2018

This is seriously messed up.

Warning: Windows file locking hack: hit the retry limit 29 while trying to
remove C:\Users\appveyor\AppData\Local\Temp\vcstest-2981

This means it tried 30 times, waiting a second between each go. This is clearly more than just waiting for file locks from a dead process to be released by the kernel.

Two plausible scenarios I can think of:

  • we're running these tests in parallel and we're somehow messing up the locking so two tests run on the same dir, so we have another process we're running that's holding locks
  • there really is a 3rd party process or AV tool that is holding open these files

@dcoutts
Copy link
Contributor Author

dcoutts commented Jun 6, 2018

So apparently the theory is that on windows git's emulated exec impl causes the git process to terminate while a git child process is still running, executing the command.

But this doesn't seem to fit the facts.

  1. We don't generally have a problem waiting for git commands to terminate. These tests run lots and lots of git commands in sequence and then check the state of the filesystem afterwards. If the git process was terminating before the work was actually complete we would see test failures much earlier.
  2. even if the child process took a moment to complete its work an terminate at the end of the sequence of calls, are we really saying it'd take 30 seconds?

@typedrat
Copy link
Collaborator

typedrat commented Jun 6, 2018 via email

@dcoutts
Copy link
Contributor Author

dcoutts commented Jun 6, 2018

Also, git uses a lock file to detect concurrent execution on the same repo and complain, That situation is not being reported by git in this test. So again this suggests that git really is not running concurrently and has really terminated, but something else is holding files open.

@Mistuke
Copy link
Collaborator

Mistuke commented Jun 8, 2018

It occurred to me that this may not be a file locking issue at all. It may simply be, as the error states a permissions one.

I forgot that another thing we account for in the GHC testsuite is that these psuedo posix tools of msys2 sometimes set the read-only attribute inexplicably to files. On Windows you can't delete files with the read-only attribute set. The delete would fail with the exact same error as a file lock. https://msdn.microsoft.com/en-us/library/windows/desktop/aa363915(v=vs.85).aspx

In the GHC testsuite we work around it by if the error is a access denied we check the read-only attribute on the file and clear it before trying again.

Googling around a bit some people doing git-automation have run into the same issue loot/loot@01d60ce and solved it same way. e.g. lib2git inexplicably sets the read-only bit preventing cleanups. This would explain why 30 seconds later the file is still un-deletable.

this can be verified by setting https://www.appveyor.com/docs/how-to/rdp-to-build-worker/, logging into the machine and trying to delete the folder manually (from the commandline, not using explorer, explorer clears the flag first so it works.) or checking the file permissions or just implement a call to setAccessPermissions before retrying the delete. This is the only scenario left that makes any sense to me...

@Mistuke
Copy link
Collaborator

Mistuke commented Jun 8, 2018

Ok, looking into it some more, this makes a lot more sense now. git's internal object database is immutable, because once a object is created, it can never be changed https://git.wiki.kernel.org/index.php/Git#Implementation e.g. amending a commit creates a new hash.

so replaying the actions you'll find

> attrib .\.git\objects\4b\*
A    R               R:\tmp\.git\objects\4b\825dc642cb6eb9a060e54bf8d69288fbee4904

and a subsequent rm will fail

> rm -r .\.git\
rm : Cannot remove item R:\tmp\.git\objects\4b\825dc642cb6eb9a060e54bf8d69288fbee4904: You do not have sufficient access rights to perform this operation.
At line:1 char:1
+ rm -r .\.git\
+ ~~~~~~~~~~~~~
    + CategoryInfo          : PermissionDenied: (825dc642cb6eb9a060e54bf8d69288fbee4904:FileInfo) [Remove-Item], IOException
    + FullyQualifiedErrorId : RemoveFileSystemItemUnAuthorizedAccess,Microsoft.PowerShell.Commands.RemoveItemCommand
rm : Directory R:\tmp\.git\objects\4b cannot be removed because it is not empty.
At line:1 char:1
+ rm -r .\.git\
+ ~~~~~~~~~~~~~
    + CategoryInfo          : WriteError: (4b:DirectoryInfo) [Remove-Item], IOException
    + FullyQualifiedErrorId : DirectoryNotEmpty,Microsoft.PowerShell.Commands.RemoveItemCommand

So indeed, it's not a file locking issue (though for robustness in the presence of AVs we should still leave that code that retries in) it is simply a permissions one and a difference with the unix APIs in that Win32 won't allow the removal of read-only files.

@hvr
Copy link
Member

hvr commented Jun 8, 2018

@Mistuke @dcoutts so all we need is to use removePathForcibly (and add a .Compat version to cabal-install, since it's only available with rather recent directorys?)

@Mistuke
Copy link
Collaborator

Mistuke commented Jun 8, 2018

@hvr yeah, that looks like it'll do the job!

@Mistuke
Copy link
Collaborator

Mistuke commented Jun 8, 2018

Seems to be a different issue now. PackageTests\CustomDep\cabal.test.hs seems to be failing.

@hvr
Copy link
Member

hvr commented Jun 8, 2018

@Mistuke I've seen that cabal.test.hs failure already on master some PRs, such as #5364, so it's orthogonal to this PR

@23Skidoo do you know what's up with that cabal.test.hs failure on appveyor?

@quasicomputational
Copy link
Contributor

It's definitely something about the new tests that's making it hang.

@quasicomputational
Copy link
Contributor

Looks very much like UnitTests.Distribution.Client.VCS is the source of the hang (or one of the sources).

@quasicomputational
Copy link
Contributor

I might have jumped the gun earlier on pointing the finger at the VCS tests and assuming it had hung, and killed the test run too quickly.

@quasicomputational
Copy link
Contributor

OK, I think we've got a handle on the problem here: the VCS tests are catastrophically slow on AppVeyor.

  UnitTests.Distribution.Client.VCS
    check VCS test framework
      git:                                                                   Updated tag 'tag_G' (was b800725)
Updated tag 'tag_Q' (was 12b9f4c)
Updated tag 'tag_I' (was c4bd76a)
Updated tag 'tag_W' (was 383db4d)
OK (700.25s)
        +++ OK, passed 100 tests.

I initially misdiagnosed this as a hang because new-test doesn't stream its output (and doesn't even have an option to: #4766; +1 for the suggestion to make it the default/only behaviour); switching to new-run has at least allowed us some visibility into the problem.

I'm calling it quits for tonight. I hope that this is enough for you to go on in chasing the real problem down, @typedrat.

@Mistuke
Copy link
Collaborator

Mistuke commented Jul 12, 2018

OK, I think we've got a handle on the problem here: the VCS tests are catastrophically slow on AppVeyor.

In the appveyor file you have:

 +  - set PATH=%APPDATA%\cabal\bin;C:\msys64\usr\bin;C:\Program Files\Git\mingw64\bin;%PATH%

notice the order, you're putting msys before the git. My guess is if you do which git you'll find it's using msys2's git (which is an emulated posix version of git and will be much slower). instead of a native git.

I assume the git in C:\Program Files\Git is the git-for-windows one which should be significantly faster. so flip the order of these paths and it should be a lot quicker.

If not try actually installing git-for-windows using chocolatey

 choco install git

But I think that's what they pre-installed.

@Mistuke
Copy link
Collaborator

Mistuke commented Jul 12, 2018

Btw if it's still slow after this enable git tracing by exporting GIT_TRACE=1 using setx and that should give a better idea of why along with the timestamps for each command which I assume you can get with appveyor?

@Mistuke
Copy link
Collaborator

Mistuke commented Jul 12, 2018

Now it's a different error 😂 I think the final directory is too long because of the fully qualified package name. Add a store-dir see e.g. https://github.com/haskell/network/blob/master/appveyor.yml#L6

@hvr
Copy link
Member

hvr commented Jul 12, 2018

@Mistuke @typedrat @quasicomputational

Now we're at...

  UnitTests.Distribution.Client.Get
    forkPackages
      no repos:                                                              OK
      no repos of requested kind:                                            OK
      no repo type specified:                                                OK
      unsupported repo type:                                                 OK
      no repo location specified:                                            OK
      correct repo kind selection:                                           OK
      repo destination exists:                                               OK
      git fetch failure:                                                     fatal: 'C:\Users\appveyor\AppData\Local\Temp\repos-2844\src' does not appear to be a git repository
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
OK (0.09s)
    forkPackages, network tests
      git clone:                                                             OK (5.55s)
1 out of 292 tests failed (2002.48s)

@typedrat
Copy link
Collaborator

Still have one failing test.

@quasicomputational
Copy link
Contributor

I wish I knew why it was failing sometimes on the manpage thing. That's either a bug in new-run or a bug in Setup.hs and either one is our problem.

@typedrat
Copy link
Collaborator

I don't know if that is caused by this patch or not.

appveyor.yml Outdated
- appveyor-retry cabal new-build exe:cabal exe:cabal-tests --only-dependencies
- cabal %CABOPTS% new-build exe:cabal
- cabal %CABOPTS% new-run cabal-tests -- -j3 --with-cabal=dist-newstyle\build\x86_64-windows\ghc-8.0.2\cabal-install-2.3.0.0\build\cabal\cabal.exe
- appveyor-retry cabal new-build cabal-install:tests --only-dependencies
Copy link
Collaborator

@Mistuke Mistuke Jul 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires %CABOPTS% otherwise it won't try to use the cache for components we just built. the other retry commands too I think.

build_script:
- cd Cabal
- ghc --make -threaded -i -i. Setup.hs -Wall -Werror -XRank2Types -XFlexibleContexts
environment:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're missing APPVEYOR_SAVE_CACHE_ON_ERROR: True, the issue is, right now it's not using the cache at all due to the build not completing, so it's in a bit of a circular dependency. we need the cache in order to finish on time, but it needs to finish to create the cache. You can see that it still marks all dependencies as needs download and build currently.

I think we can always save the cache so that option should be safe. See https://www.appveyor.com/docs/build-cache/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be fixed, but it wouldn't matter. The owner of the Appveyor account (@23Skidoo, I believe) needs to enable setting the cache from PR builds for it to have any effect before this is merged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, indeed. probably should be enabled for a bit so we get the initial cache.

Copy link
Collaborator

@Mistuke Mistuke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not currently using the cache, which is why we're still running out of time. With the cache the build should only take about 10-15mins,

@typedrat
Copy link
Collaborator

Tests work. 🙏

@alexbiehl
Copy link
Member

🚢

@Mistuke
Copy link
Collaborator

Mistuke commented Jul 14, 2018

Good job! Only 8 mins to spare though on the build time 😂😂

@typedrat
Copy link
Collaborator

Should get better as soon as this gets merged. What's that waiting on, now?

@alexbiehl
Copy link
Member

Do it!

VCS tests are *very* slow; it took a substantial amount of work (this is a squash of 35 commits by two authors with debugging help from @Mistuke as well) to get the test suite and build process fast enough that they can run along with the rest of it in under an hour (the Appveyor limit).

Co-authored-by: quasicomputational <quasicomputational@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants