LAUNCHER ERROR: Cannot calculate relative path from absolute path under different drives. #5125

Closed
philwo opened this issue May 1, 2018 · 31 comments
Labels: breakage; category: misc > misc; P1 I'll work on this now. (Assignee required); platform: windows

Comments

@philwo
Member

philwo commented May 1, 2018

I just upgraded our CI machines to Bazel 0.13.0 and now Bazel is failing on most tests:

https://buildkite.com/bazel/bazel-bazel/builds/1828#ee4e74ed-79f0-470b-abb7-283f5eaed5dd

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
-----------------------------------------------------------------------------
LAUNCHER ERROR: Cannot calculate relative path from absolute path under different drives.
path = d:\build\buildkite-worker-windows-java8-9wk1-1\bazel\bazel-bazel\third_party\guava\guava-24.1-jre.jar
base = c:\users\buildkite\_bazel_buildkite\kkh8ke_d\execroot\io_bazel\bazel-out\x64_windows-fastbuild\bin\src\test\java\com\google\devtools\build\lib
LAUNCHER ERROR: CreateClasspathJar failed

Bazel 0.12.0 still had:

output_base: D:/temp/_bazel_buildkite/eou8oqvz

Bazel 0.13.0 has:

output_base: C:/users/buildkite/_bazel_buildkite/kkh8ke_d 

@laszlocsomor Does this ring a bell?

@philwo philwo added P1 I'll work on this now. (Assignee required) platform: windows category: misc > misc labels May 1, 2018
philwo added a commit to bazelbuild/continuous-integration that referenced this issue May 1, 2018
@philwo
Member Author

philwo commented May 1, 2018

I've changed our Windows images to not use a local SSD, thus now everything is on drive C. However, there's one test that still fails:

Error: Current working directory has a path longer than allowed for a
Win32 working directory.
Can't start native Windows application from here.

external/bazel_tools/tools/test/test-setup.sh: line 232: C:/users/buildkite/_bazel_buildkite/ihg0nqgr/execroot/io_bazel/_tmp/2713b7a92d8298dcf32c3b44de75c8b1/root/abwzq2tc/execroot/__main__/bazel-out/x64_windows_msvc-fastbuild/bin/examples/java-native/src/test/java/com/example/myproject/custom.exe: File name too long

https://buildkite.com/bazel/bazel-bazel/builds/1828#947a6adc-df29-45be-b9a1-2ab832430636

https://buildkite.com/organizations/bazel/pipelines/bazel-bazel/builds/1828/jobs/947a6adc-df29-45be-b9a1-2ab832430636/artifacts/03b9b283-33ee-45eb-95c6-8302625001c4

philwo added a commit to bazelbuild/continuous-integration that referenced this issue May 1, 2018
Hopefully this makes the path to the output_base short enough to
temporarily workaround bazelbuild/bazel#5125.
@philwo
Member Author

philwo commented May 1, 2018

The workaround of renaming the Buildkite user to "b" has worked; Bazel's CI job is green again.
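For readers wondering why a shorter user name helps: the default output base shown above embeds the user name twice (`C:/users/<user>/_bazel_<user>/<hash>`), so every path under it shrinks when the name does. A minimal Python sketch of that arithmetic follows. The helper name `default_output_base` is hypothetical, and the hash is held constant purely for comparison; in practice it would differ.

```python
def default_output_base(user, workspace_hash):
    """Build a path shaped like the output bases quoted in this issue.

    Hypothetical helper for illustration only; Bazel computes this itself.
    """
    return f"C:/users/{user}/_bazel_{user}/{workspace_hash}"


# Same (assumed) hash for both, so only the user name differs.
long_base = default_output_base("buildkite", "ihg0nqgr")
short_base = default_output_base("b", "ihg0nqgr")

# The user name appears twice, so the saving is twice the difference in length.
saved = len(long_base) - len(short_base)
print(long_base, len(long_base))
print(short_base, len(short_base))
print("characters saved per path:", saved)
```

The saving applies to every file under the output base, which is why it was enough to get the failing test back under the limit, but only until a test with a longer name comes along.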

@laszlocsomor
Contributor

Thanks for reporting and finding a workaround!

The error comes from the java_binary launcher (//src/tools/launcher) where it creates the classpath jar. The classpath jar is a jar file with a manifest that contains a classpath, and we use it to work around command line length limitations. The downside is that the classpath in the manifest must use relative paths, relative to the main jar, meaning all jar files must be on the same drive.
A workaround could be that the launcher creates junctions upon startup and references jar files through those.
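The constraint described above can be reproduced outside the launcher. The sketch below is not the launcher's code (the launcher is a native binary under //src/tools/launcher); it only uses Python's `ntpath` module to show that a relative path between two Windows drives cannot be computed. The function name and the example paths are made up for illustration.

```python
import ntpath  # Windows path semantics, importable on any OS


def manifest_relative_path(jar_path, base_dir):
    """Return jar_path relative to base_dir, or None if no such path exists."""
    try:
        return ntpath.relpath(jar_path, base_dir)
    except ValueError:
        # ntpath.relpath raises ValueError when the paths are on different drives.
        return None


# Hypothetical paths: one jar on the same drive as the base, one on another drive.
same_drive = manifest_relative_path(r"c:\out\bin\lib\guava.jar", r"c:\out\bin\test")
cross_drive = manifest_relative_path(r"d:\build\third_party\guava.jar", r"c:\out\bin\test")

print(same_drive)
print(cross_drive)
```

The first call yields a usable manifest entry; the second has no answer at all, which matches the situation in the report where the jar was on d: and the output base on c:.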

@laszlocsomor
Contributor

@philwo , may I close this bug?

@philwo
Member Author

philwo commented May 2, 2018

Hi, thanks! No, IMHO this is a remaining P1 breakage that may even warrant a patch release - the Windows CI runs on hacky workarounds at the moment and stuff might break any moment if someone checks in a test that has a longer name than the one that broke. I would like to revert both workarounds (disabling local SSDs and the shortened user name) in a few days.

@laszlocsomor
Contributor

Thanks, that means #5135 is P1 too.

@laszlocsomor
Contributor

Bugfix for #5135 is merged as f96f037.

@philwo
Member Author

philwo commented May 15, 2018

@buchgr Please rollback bazelbuild/continuous-integration@fb39222 and bazelbuild/continuous-integration@44188d9 as soon as possible (i.e. when a Bazel release is out that has this bug fix). This will restore normal functionality of the Windows VMs. :)

This was a bad enough breakage (unfortunately not detected pre-release) that it would have been warranted to rollback the 0.13.0 release, but we don't have any process for that, so I didn't and instead cooked up this semi-working workaround to at least get our presubmits working again.

@buchgr
Contributor

buchgr commented May 15, 2018

@philwo will do. thanks for the pointers!

@meteorcloudy
Member

I believe f96f037 only fixed LAUNCHER ERROR: Cannot calculate relative path from absolute path under different drives.

The long path error occurs because on Windows we no longer respect the TMP and TEMP environment variables when creating the output user root (f083e76). Be aware that the same approach as in #4149 won't fix it, because it happened with a Java binary in a test.

To fix it, I think we can enable --output_user_root=%TMP% on Windows when rolling back bazelbuild/continuous-integration@fb39222 and bazelbuild/continuous-integration@44188d9. This makes sure our CI is using a short output root.

@laszlocsomor
Contributor

--output_user_root=%TMP% would trigger #5038. Let's use --output_user_root=d:/ instead, calling subst d: c:\d_drive to create the drive.

@meteorcloudy
Member

@laszlocsomor Sounds good, but the D drive might already exist on our CI as an SSD drive? @buchgr can confirm that.

@laszlocsomor
Contributor

Sure, any other drive will do.

@philwo
Member Author

philwo commented May 15, 2018

Some thoughts:

  • CI runs Bazel as an unprivileged user (not sure if you have to be admin to use "subst"?).
  • Using subst sounds a bit dangerous to me, because we don't (can't?) clean up these mappings after a CI job finishes from our monitor script.
  • The only places you can write to on CI are the user's home directory, D:\build and D:\temp. Everything else is not allowed for unprivileged users on the machines, in order to prevent the system from accidentally keeping state and being modified by the CI jobs.

@laszlocsomor
Contributor

  • CI runs Bazel as an unprivileged user (not sure if you have to be admin to use "subst"?).

You can use subst as a normal user, no Admin privileges needed.

  • Using subst sounds a bit dangerous to me, because we don't (can't?) clean up these mappings after a CI job finishes from our monitor script.

Do you have any particular concerns about that? The mapping goes away after machine reboot.

  • The only places you can write to on CI are the user's home directory, D:\build and D:\temp. Everything else is not allowed for unprivileged users on the machines, in order to prevent the system from accidentally keeping state and being modified by the CI jobs.

We can use one of those directories then! I recommended subst as one way to have short paths for the output root. We can certainly use slightly longer paths too.

@philwo
Member Author

philwo commented May 15, 2018

Do you have any particular concerns about that? The mapping goes away after machine reboot.

We don't usually reboot the machines, so my worry was this: the first time we run a job, Bazel successfully runs "subst" to map some folder to a drive letter. However, it's not guaranteed that the mapping is removed after the CI job finishes, so the next time the same job runs, "subst" might fail if a mapping from an earlier run still exists on the machine, mightn't it?

Or did you mean to do the "subst" in the machine startup script? That would be no problem then :)

@laszlocsomor
Contributor

Either of your ideas work. We can subst at startup or at Bazel job start. If subsequent jobs try to subst to an existing directory, the command fails, but we can handle that.
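One way "we can handle that" might look is to list the existing mappings first and only call subst when the drive letter is free. The Python sketch below is an illustration, not CI code. It assumes that `subst` without arguments prints one line per mapping in the form `D:\: => C:\d_drive`; the function names and the injectable `run` parameter are hypothetical.

```python
import subprocess


def parse_subst_output(text):
    # Assumed listing format, one mapping per line:  D:\: => C:\d_drive
    mappings = {}
    for line in text.splitlines():
        drive, sep, target = line.partition(": => ")
        if sep:
            mappings[drive.strip().rstrip("\\").upper()] = target.strip()
    return mappings


def ensure_subst(drive, target, run=subprocess.run):
    """Create the mapping unless the drive already points at the same target."""
    listing = run(["subst"], capture_output=True, text=True).stdout
    current = parse_subst_output(listing).get(drive.upper())
    if current is not None:
        if current.lower() == target.lower():
            return "already mapped"
        raise RuntimeError(f"{drive} is already mapped to {current}")
    run(["subst", drive, target], check=True)
    return "created"


# Parsing a sample listing; no subst call is made here.
print(parse_subst_output("D:\\: => C:\\d_drive\n"))
```

Calling `ensure_subst("d:", r"c:\d_drive")` at job start would then be safe to repeat across jobs on a machine that is never rebooted.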

But we don't even have to use subst at all. Really I just brought it up as one of many possible solutions to have a short value for --output_user_root. We can just as well use d:\temp without the worry for cleaning up the virtual drive that subst would create.

@philwo
Member Author

philwo commented May 15, 2018

Sounds good - in that case I'd recommend going with D:\temp on CI.

Is the --output_user_root passed down to the Bazel inside integration tests? Or will that try to use the new path inside the user's home directory again?

@laszlocsomor
Contributor

Sounds good - in that case I'd recommend going with D:\temp on CI.

SGTM.

Is the --output_user_root passed down to the Bazel inside integration tests? Or will that try to use the new path inside the user's home directory again?

No, inner Bazels do not inherit the outer Bazel's --output_user_root. They either use the default value (c:\users\USERNAME) or explicitly set --output_user_root.
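To make the difference concrete, here is a small Python sketch of how an outer and an inner invocation would be assembled. The only Bazel-specific fact it relies on is that `--output_user_root` is a startup option and therefore goes before the command. The helper `bazel_argv` and the target patterns are hypothetical.

```python
def bazel_argv(command, targets, output_user_root=None, bazel="bazel"):
    """Assemble a Bazel command line; startup options must precede the command."""
    argv = [bazel]
    if output_user_root is not None:
        argv.append(f"--output_user_root={output_user_root}")
    argv.append(command)
    argv.extend(targets)
    return argv


# Outer Bazel on CI: the root is passed explicitly.
outer = bazel_argv("test", ["//src/test/..."], output_user_root="D:/temp")

# Inner Bazel started by an integration test: nothing is inherited, so unless
# the test passes the flag itself, the default under the home directory applies.
inner = bazel_argv("build", ["//examples/..."])

print(outer)
print(inner)
```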

@meteorcloudy
Member

meteorcloudy commented May 15, 2018

@philwo The output user root in test is set here:

startup --output_user_root=${bazel_root}

which is:
# OS X has a limit in the pipe length, so force the root to a shorter one
bazel_root="${TEST_TMPDIR}/root"
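The two shell lines above amount to the following, sketched in Python for illustration. The function name and the fallback directory are assumptions; the real setup lives in the shell test harness. Note that the inner root sits under `TEST_TMPDIR`, which is itself inside the outer output base, so the inner paths nest as in the `_tmp/.../root/...` path from the failing test.

```python
import os
import posixpath


def bazelrc_for_test(test_tmpdir):
    """Return the bazelrc line that points the inner Bazel at <TEST_TMPDIR>/root."""
    bazel_root = posixpath.join(test_tmpdir, "root")
    return f"startup --output_user_root={bazel_root}\n"


# TEST_TMPDIR is set by the test runner; the fallback is only for running this sketch.
test_tmpdir = os.environ.get("TEST_TMPDIR", "/tmp/test_tmpdir")
print(bazelrc_for_test(test_tmpdir), end="")
```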

@laszlocsomor laszlocsomor removed their assignment May 16, 2018
@laszlocsomor
Contributor

Assigning to @buchgr because:

@philwo

@buchgr Please rollback bazelbuild/continuous-integration@fb39222 and bazelbuild/continuous-integration@44188d9 as soon as possible (i.e. when a Bazel release is out that has this bug fix). This will restore normal functionality of the Windows VMs. :)

@buchgr

@philwo will do. thanks for the pointers!

@laszlocsomor
Contributor

laszlocsomor commented May 16, 2018

@dslomov, @meteorcloudy, @buchgr and I just discussed and summarized the current situation:

What's unclear to me, and what I'd ask @buchgr to help answer, is:

  1. where does the "d:" come from in e.g. rules_sass on Windows in https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds/235#cd44ec8a-130f-4beb-8a6c-42ea38bee84f ?
  2. how and where could we run subst d: c:\something to create this virtual drive?

@buchgr
Contributor

buchgr commented May 16, 2018

What's unclear to me, and what I'd ask @buchgr to help answer, is:

  1. where does the "d:" come from in e.g. rules_sass on Windows in https://buildkite.com/bazel/bazel-with-downstream-projects-bazel/builds/235#cd44ec8a-130f-4beb-8a6c-42ea38bee84f ?
  2. how and where could we run subst d: c:\something to create this virtual drive?

I found that this was a bug in Bazel's CI script. It seems to have been a leftover from when we used to have a d: drive. I have fixed it in bazelbuild/continuous-integration@036052e and rules_nodejs works again on Windows.

@laszlocsomor
Contributor

Cool, thank you @buchgr.
@dslomov: AFAIK you are the Bazel sheriff this week. Is there a tracking bug for the rules_sass failures I mentioned earlier, and if so, can we close it now that @buchgr's fix is in?

@philwo
Member Author

philwo commented May 19, 2018

Here's what it should look like, and what it did look like until 0.13.0 was released and I had to disable local SSDs on our Windows VMs due to the new behavior ("Bazel now puts output user root underneath the user's home directory") and this bug ("Java launcher doesn't work when files are on two different drives"):

  • C:\ is the slow, persistent SSD used only to store the Windows OS, some Buildkite Agent files and the user's home directories (which should be unused).
  • D:\ is the fast, local SSD which should be used as the output_user_root for all Bazel invocations and temporary files. Most notably, there's D:\build, which is where the Buildkite Agent checks out Git repos and invokes Bazel, and D:\temp, which is %TEMP% / %TMP% and should also be used by Bazel for output bases.

This all worked fine until 0.13.0 was released and Bazel no longer respected %TEMP% for the output_user_root.

Notably:

  • "Bazel used to hit long path limitations on CI": I don't think this ever happened on our new CI.
  • The fact that Remote Desktop follows junctions and deleted wrong stuff is unfortunate, but I don't think it warrants changing Bazel's behavior to no longer use %TEMP% as the output base, especially considering that this broke our CI.

Here's what we should do:

Here's what we shouldn't do:

  • Switch stuff to C:\build (which is the slow persistent SSD).
  • Call things like "renaming the buildkite user to b" or disabling local SSDs a "fix" for these problems. They're horrible workarounds that I invented under pressure because our CI was broken after 0.13.0 got released and all presubmits failed. I had to make up something to get this unblocked, but this should all have been reverted by now. If there are users out there who have a setup similar to the one we had on CI, e.g. a separate fast scratch drive for their output base, then 0.13.0 broke them, too, and they might not be able to rename their users to single letters or move their stuff to their primary drive C:\, which might just be too small.
  • Use subst for anything.

@laszlocsomor
Contributor

@philwo, thanks for the details.

As for using %TEMP% for output user root: can we modify the CI scripts to always pass --output_user_root=%TEMP% to Bazel?
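A CI wrapper doing this could look roughly like the Python sketch below. It is an illustration under assumptions, not the actual CI script: the function names are hypothetical, and the drive check is an extra sanity step added here because the launcher bug in this issue only appears when the checkout and the output root are on different drives.

```python
import ntpath
import os


def output_user_root_from_env(env=os.environ):
    """Pick %TEMP%, then %TMP%, as the output user root; None if neither is set."""
    return env.get("TEMP") or env.get("TMP")


def same_drive(path_a, path_b):
    """Compare the drive letters of two Windows paths, ignoring case."""
    drive_a = ntpath.splitdrive(path_a)[0].lower()
    drive_b = ntpath.splitdrive(path_b)[0].lower()
    return drive_a == drive_b


# Example values modeled on the CI layout described in this thread.
env = {"TEMP": r"D:\temp"}
checkout = r"D:\build"

root = output_user_root_from_env(env)
print(f"--output_user_root={root}")
print("same drive as checkout:", same_drive(root, checkout))
```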

@meteorcloudy
Member

@laszlocsomor I believe it's fine to do that. We were always using %TEMP% as the output root, and it didn't cause any unexpected file deletions.
@buchgr Can you re-enable the SSD (D:/) as Phillip suggested?

@laszlocsomor
Contributor

@laszlocsomor I believe it's fine to do that. We were always using %TEMP% as the output root, and it didn't cause any unexpected file deletions.

It will if we RDP to the CI machines and manually run Bazel on them, because then we'll create a junction pointing to the JDK and trigger bazelbuild/continuous-integration#252 (comment).

As long as we do not RDP into CI machines and run builds on them, we'll be fine.

@laszlocsomor
Contributor

@buchgr : do you have any updates on this bug? Which ongoing release should it be blocking?

@buchgr buchgr assigned philwo and unassigned buchgr Jul 9, 2018
@laszlocsomor
Contributor

@meteorcloudy : can we close this bug?

@meteorcloudy
Member

Yes, all issues mentioned here are fixed.

buchgr pushed a commit to bazelbuild/continuous-integration that referenced this issue Sep 5, 2018
buchgr pushed a commit to bazelbuild/continuous-integration that referenced this issue Sep 5, 2018
Hopefully this makes the path to the output_base short enough to
temporarily workaround bazelbuild/bazel#5125.
joeleba pushed a commit to joeleba/continuous-integration that referenced this issue Jun 17, 2019
joeleba pushed a commit to joeleba/continuous-integration that referenced this issue Jun 17, 2019
Hopefully this makes the path to the output_base short enough to
temporarily workaround bazelbuild/bazel#5125.