Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add build logic for zig cc #19535

Closed
howardjohn opened this issue Jan 13, 2022 · 61 comments
Closed

Add build logic for zig cc #19535

howardjohn opened this issue Jan 13, 2022 · 61 comments
Labels
area/build enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently

Comments

@howardjohn
Copy link
Contributor

howardjohn commented Jan 13, 2022

Title: Add build logic for zig cc

Description:
https://andrewkelley.me/post/zig-cc-powerful-drop-in-replacement-gcc-clang.html can act as a drop in replacement for clang. This offers a few benefits over Envoy's current build:

  • Simplified cross-compiling - this would be pretty useful for us in particular, as our CI only has access to amd64 machines. Basic tests can be done under emulation but
  • Compiling against different GLIBC versions. We try to support running our builds of Envoy on old OS's, but because of GLIBC dependency that means we also need to build on older platforms. This can be challenging - for example Ubuntu 16 is now EOL which can make it harder to run builds on this version, but we still want to support users on this version. Keeping dependencies like llvm up to date becomes harder as well the older the OS is.
  • Faster ? Absolutely no proof here, but its plausible.

There is some third-party bazel integration: https://git.sr.ht/~motiejus/bazel-zig-cc

Overall as a project, this seems like this may be problematic - maintaining multiple build systems can cause a lot of burden and I assume we would not fully migrate everything to zig. However, I thought I would float the idea out.

I did experiment with this a few months back and was able to get a binary compiled. However, it crashed immediately for unclear reasons. I am not an expert on C++, zig, or bazel though so likely someone else could get much further

@howardjohn howardjohn added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Jan 13, 2022
@alyssawilk alyssawilk added area/build and removed triage Issue requires triage labels Jan 14, 2022
@alyssawilk
Copy link
Contributor

I'm inclined to think the costs are higher than the benefits but cc @lizan @htuch @phlax for opinions

@lizan
Copy link
Member

lizan commented Jan 14, 2022

If zig cc can be a complete drop-in replacement, I would experiment by replacing clang binary with a shell script invoke zig cc. That doesn't involve bazel integration.

For the benefits, the biggest concern is the C++ standard library (libc++ and libstdc++), it seems zig cc provides libc for the cross compiling and support older glibcs, take C++ standard library and C++ features there could cost high. If anyone have a solution I would be happy to look into though.

@htuch
Copy link
Member

htuch commented Jan 17, 2022

Yeah, I'm wondering if we can use Bazel's automatic toolchain discoevry with the right env vars to make this work. Then it's just a question of whether we want a CI job, which would be a lower bar.

In general, I think we would need some owners and contribution to help make this a reality if there is non-trivial work.

@andrewrk
Copy link

For the benefits, the biggest concern is the C++ standard library (libc++ and libstdc++), it seems zig cc provides libc for the cross compiling and support older glibcs, take C++ standard library and C++ features there could cost high. If anyone have a solution I would be happy to look into though.

If I understand correctly you are concerned with how to provide libc++. In case it is not clear, zig provides libc++ in addition to libc. It will be compiled lazily from source according to the target and command line flags and then cached for reuse.

If you decide to go this route in envoy and run into trouble, don't hesitate to reach out! I'd be happy to help.

@howardjohn
Copy link
Contributor Author

Played around with this a bit.

Command:

bazel build --config=libc++ --action_env=CC=$HOME/bin/zig-cc --action_env=CXX=$HOME/bin/zig-cxx //source/exe:envoy-static

Got up to this point:

$ cat ~/bin/zig-cc
#!/bin/bash
HOME=/tmp/home /usr/local/google/home/howardjohn/bin/zig/zig cc -lc++ -v "$@"
$ cat ~/bin/zig-cxx
#!/bin/bash
HOME=/tmp/home /usr/local/google/home/howardjohn/bin/zig/zig c++ -v "$@"

Fails with:

ERROR: /usr/local/google/home/howardjohn/.cache/bazel/_bazel_howardjohn/78cef1fda54a78a84bcafef4dbf40b75/external/envoy_api/envoy/annotations/BUILD:5:18 Middleman _middlemen/@envoy_Uapi_S_Senvoy_Sannotations_Cpkg-BazelCppSemantics_build_arch_k8-fastbuild failed: undeclared inclusion(s) in rule '@com_google_protobuf//:protobuf_lite':
this rule is missing dependency declarations for the following files included by 'src/google/protobuf/io/strtod.cc':
  '/usr/local/google/home/howardjohn/bin/zig/lib/libcxx/include/cstdio'
  '/usr/local/google/home/howardjohn/bin/zig/lib/libcxx/include/__config'
  '/usr/local/google/home/howardjohn/bin/zig/lib/libc/include/generic-glibc/features.h'
  '/usr/local/google/home/howardjohn/bin/zig/lib/libc/include/generic-glibc/features-time64.h'
  '/usr/local/google/home/howardjohn/bin/zig/lib/libc/include/x86_64-linux-gnu/bits/wordsize.h'
  '/usr/local/google/home/howardjohn/bin/zig/lib/libc/include/x86_64-linux-gnu/bits/timesize.h'

If I remove -lc++ from zig-cc, I get a different issue:

#include "..." search starts here:
 external/com_google_absl
 bazel-out/k8-fastbuild/bin/external/com_google_absl
#include <...> search starts here:
 /usr/local/google/home/howardjohn/bin/zig/lib/include
 /usr/local/google/home/howardjohn/bin/zig/lib/libc/include/x86_64-linux-gnu
 /usr/local/google/home/howardjohn/bin/zig/lib/libc/include/generic-glibc
 /usr/local/google/home/howardjohn/bin/zig/lib/libc/include/x86-linux-any
 /usr/local/google/home/howardjohn/bin/zig/lib/libc/include/any-linux-any
 /usr/local/include
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.
In file included from external/com_google_absl/absl/debugging/internal/address_is_readable.cc:18:
In file included from external/com_google_absl/absl/debugging/internal/address_is_readable.h:18:
external/com_google_absl/absl/base/config.h:56:10: fatal error: 'cstddef' file not found
#include <cstddef>
         ^~~~~~~~~
1 error generated.

I assume the first issue is bazel not recognizing zig's libc+ somehow and the latter is not using libc++ at all...?

@howardjohn
Copy link
Contributor Author

I also played around with https://sr.ht/~motiejus/bazel-zig-cc/ in https://github.com/envoyproxy/envoy/compare/main...howardjohn:zig2?expand=1, but struggled to actually get the toolchain used. It was registered but zig never actually executed.

@motiejus
Copy link

motiejus commented Jan 25, 2022

I also played around with https://sr.ht/~motiejus/bazel-zig-cc/ in https://github.com/envoyproxy/envoy/compare/main...howardjohn:zig2?expand=1, but struggled to actually get the toolchain used. It was registered but zig never actually executed.

You seem to be missing

diff --git a/.bazelrc b/.bazelrc
index 89110fc7cd..d2f6940c3a 100644
--- a/.bazelrc
+++ b/.bazelrc
@@ -15,6 +15,7 @@ run --color=yes
 build --color=yes
 build --workspace_status_command="bash bazel/get_workspace_status"
 build --incompatible_strict_action_env
+build --incompatible_enable_cc_toolchain_resolution
 build --host_force_python=PY3
 build --host_javabase=@bazel_tools//tools/jdk:remote_jdk11
 build --javabase=@bazel_tools//tools/jdk:remote_jdk11

I added that, and it seemed to be doing something (not an expert in envoy, not sure if the target is the right way to go at all):

zigcc

I've added a note about it to the top-level README. Thanks for trying.

@howardjohn
Copy link
Contributor Author

Nice, thanks @motiejus . That got me to a complete build.

Initial I had a full build and had issues on v8 failing. I noticed while zig was used for most things, cc1plus was used for v8 for some reason, which I assume is what led to the failure. I disabled all the wasm extensions, and was able to complete a build.

The binary is failing with "illegal hardware instruction" though. Still looking into why. Perhaps some dependencies are still using other toolchains and they get mixed up. There were also warnings for unsupported linker arg: -whole-archive and similar which could be important

@motiejus
Copy link

motiejus commented Jan 25, 2022

The binary is failing with "illegal hardware instruction" though. Still looking into why. Perhaps some dependencies are still using other toolchains and they get mixed up. There were also warnings for unsupported linker arg: -whole-archive and similar which could be important

Illegal instruction is most likely UBSAN: zig cc enables it by default: ziglang/zig#4830 (comment)

You may confirm the hypothesis by adding -fno-sanitize=undefined to CFLAGS. If it starts working, it is UB. If the hypothesis is confirmed, it is reasonable to fix the undefined behavior and add -fsanitize=undefined to the CFLAGS. Then you will test for it on all C/C++ compilers, regardless of which one you decide to use.

@andrewrk
Copy link

andrewrk commented Jan 25, 2022

Here's an issue that would ease discovery of the fact that "illegal hardware instruction" is in fact UBSAN tripping: ziglang/zig#5163

May I see the full list of linker warnings?

@howardjohn
Copy link
Contributor Author

May I see the full list of linker warnings?

INFO: From Linking source/exe/envoy-static:
warning: unsupported linker arg: -whole-archive
warning: unsupported linker arg: -no-whole-archive
... repeat above 1000 times ...
warning: unsupported linker arg: --hash-style
warning: unsupported linker arg: gnu
warning: unsupported linker arg: @bazel-out/k8-fastbuild/bin/bazel/gnu_build_id.ldscript
warning: unsupported linker arg: -S

-fno-sanitize=undefined

Awesome, we have a running binary now! Currently failing on

RELEASE_ASSERT(mprotect(altstack_ + guard_size_ + altstack_size_, guard_size_, PROT_NONE) == 0,
but we can run it

@andrewrk
Copy link

Thanks, I'll look into these linker warnings.

@lizan
Copy link
Member

lizan commented Jan 25, 2022

You may confirm the hypothesis by adding -fno-sanitize=undefined to CFLAGS. If it starts working, it is UB. If the hypothesis is confirmed, it is reasonable to fix the undefined behavior and add -fsanitize=undefined to the CFLAGS. Then you will test for it on all C/C++ compilers, regardless of which one you decide to use.

We do run all tests with Clang UBSAN so I think it is unlikely unless zig enables something extra.

@moderation
Copy link
Contributor

Not sure if the same but others finding UBSAN error's when using zig cc - https://twitter.com/FlohOfWoe/status/1483778821912514560

@motiejus
Copy link

Thanks, I'll look into these linker warnings.

Thank you @andrewrk for ziglang/zig@50905d8. Once the CI passes, I will cut a new bazel-zig-cc with this new version.

@andrewrk
Copy link

No problem. I was planning on solving those other two as well - will make a note here when those land.

@andrewrk
Copy link

--hash-style is implemented in ziglang/zig@40c9ce2.

May I see the contents of bazel-out/k8-fastbuild/bin/bazel/gnu_build_id.ldscript?

@howardjohn
Copy link
Contributor Author

howardjohn commented Jan 26, 2022

$ cat bazel-out/k8-fastbuild/bin/bazel/gnu_build_id.ldscript
--build-id=0xba7583d2fd45d21d22b7c523314c1908f0d17888

The usage of ^ can be disabled in the envoy build with:

diff --git a/source/exe/BUILD b/source/exe/BUILD
index 0eb4078c16..905d1f9afc 100644
--- a/source/exe/BUILD
+++ b/source/exe/BUILD
@@ -26,7 +26,7 @@ envoy_cc_binary(
         "//bazel:windows_opt_build": ["generate_pdb_file"],
         "//conditions:default": [],
     }),
-    stamped = True,
+    stamped = False,
     deps = [":envoy_main_entry_lib"],
 )

already though

@andrewrk
Copy link

Thanks. The issue to track for that one is ziglang/zig#3047, however it's not a high priority at the moment.

Do I understand correctly that this use case is unblocked?

@howardjohn
Copy link
Contributor Author

I haven't had a chance to try with ziglang/zig@40c9ce2 yet. I did, however, disable the --hash-style flag from being set in the Envoy build config. This makes the build succeed without any errors.

However, running it immediately fails on

RELEASE_ASSERT(mprotect(altstack_ + guard_size_ + altstack_size_, guard_size_, PROT_NONE) == 0,
failing assertions.

If I remove that assertion,

RELEASE_ASSERT(sigaltstack(&previous_altstack_, nullptr) == 0, "");
fails.

If I remove that one as well, I can run envoy --help, but even an empty config fails with Didn't find a registered implementation for name: 'envoy.network.dns_resolver.cares'. I am assuming that this is the first extension that is attempted to be loaded and not specific to cares. At this point things are looking more in Envoy territory than zig`


I also had to disable tcp_stats stats extension or it fails with:

undeclared inclusion(s) in rule '//source/extensions/transport_sockets/tcp_stats:tcp_stats_lib':
this rule is missing dependency declarations for the following files included by 'source/extensions/transport_sockets/tcp_stats/tcp_stats.cc':
  '/usr/include/linux/tcp.h'*

I am guessing this is somehow implicitly included in the clang bazel logic

@andrewrk
Copy link

Neither --hash-style nor --build-id are related to those mprotect/sigaltstack calls. I'm not sure why you would get different behavior with zig cc vs clang. I would advise against removing those asserts. They are ensuring that your stack has a guard page which is a security hardening technique. It would be useful to know the errno value that mprotect is returning when it fails.

@howardjohn
Copy link
Contributor Author

I would advise against removing those asserts.

Certainly - was mostly removing them temporarily to see if it would unblock the rest of the process.

It would be useful to know the errno value that mprotect is returning when it fails.

$ strace bazel-bin/source/exe/envoy-static --config-yaml "admin: {address: {socket_address: {address: '127.0.0.1', port_value: 9901}}}" |& rg mprotect -C 2
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1868664, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fd97d75a000
mprotect(0x7fd97d780000, 1654784, PROT_NONE) = 0
mmap(0x7fd97d780000, 1343488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7fd97d780000
mmap(0x7fd97d8c8000, 307200, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16e000) = 0x7fd97d8c8000
--
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=40032, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 44032, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fd97d748000
mprotect(0x7fd97d74b000, 24576, PROT_NONE) = 0
mmap(0x7fd97d74b000, 16384, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7fd97d74b000
mmap(0x7fd97d74f000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7fd97d74f000
--
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd97d742000
arch_prctl(ARCH_SET_FS, 0x7fd97d743a40) = 0
mprotect(0x7fd97d914000, 12288, PROT_READ) = 0
mprotect(0x7fd97d93e000, 4096, PROT_READ) = 0
mprotect(0x7fd97d751000, 4096, PROT_READ) = 0
mprotect(0x7fd97d758000, 4096, PROT_READ) = 0
mprotect(0x7fd97da85000, 4096, PROT_READ) = 0
mprotect(0x56208512b000, 806912, PROT_READ) = 0
mprotect(0x7fd97dad5000, 8192, PROT_READ) = 0
munmap(0x7fd97da87000, 115343)          = 0
set_tid_address(0x7fd97d743d10)         = 1070849
--
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd97da94000
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd97d732000
mprotect(0x5716ffe00000, 2097152, PROT_READ|PROT_WRITE) = 0
gettid()                                = 1070849
openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/tsc_freq_khz", O_RDONLY) = -1 ENOENT (No such file or directory)
--
madvise(0x562085400000, 6291456, MADV_NOHUGEPAGE) = 0
mmap(0x9a9c4000000, 2105344, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x9a9c4000000
mprotect(0x5716ffc00000, 2097152, PROT_READ|PROT_WRITE) = 0
gettid()                                = 1070849
gettid()                                = 1070849
--
close(3)                                = 0
rseq(0x7fd97d7427e0, 0x20, 0, 0x53053053) = 0
mprotect(0x5716ffa00000, 2097152, PROT_READ|PROT_WRITE) = 0
gettid()                                = 1070849
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd97d722000
--
gettid()                                = 1070849
mmap(NULL, 8191, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fd97da92000
mprotect(0x7fd97da92000, 4096, PROT_NONE) = 0
mprotect(0x7fd97da92fff, 4096, PROT_NONE) = -1 EINVAL (Invalid argument)
gettid()                                = 1070849
gettid()                                = 1070849
--
read(3, "TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\5\0\0\0\5\0\0\0\0"..., 4096) = 1802
close(3)                                = 0
write(2, "[2022-01-26 15:48:50.935][107084"..., 194[2022-01-26 15:48:50.935][1070849][critical][assert] [source/common/signal/signal_action.cc:109] assert failure: mprotect(altstack_ + guard_size_ + altstack_size_, guard_size_, PROT_NONE) == 0.
) = 194

@andrewrk
Copy link

       EINVAL addr is not a valid pointer, or not a  multiple  of  the  system
              page size.

this address is not properly aligned: 0x7fd97da92fff

from the mmap call with 8191 I can deduced that mapSizeWithGuards() is using -1 for the value of altstack_size_.

Not sure why that would happen but that's clearly the wrong value for altstack_size_ to have.

@motiejus
Copy link

motiejus commented Jan 27, 2022

I am guessing this is somehow implicitly included in the clang bazel logic

When Bazel is configured to use a C/C++ toolchain, it defaults to a "more hermetic" build (for the lack of better word, or lack of my understanding how it actually works). I.e. it will refuse to include anything that's not explicitly allowlisted, and /usr/include/linux/* isn't. However, zig bundles the linux headers and should be able to find them. What happens if you #include <linux/tcp.h>?

As a consequence of a hermetic build, you will be able to cross-compile envoy on anything that bazel and zig run. I.e. compile linux_arm64 envoy binaries on osx or bsd or similar.

@motiejus
Copy link

--hash-style is implemented in ziglang/zig@40c9ce2.

bazel-zig-cc 0.4.2 is now released, which includes this change.

@howardjohn
Copy link
Contributor Author

from the mmap call with 8191 I can deduced that mapSizeWithGuards() is using -1 for the value of altstack_size_.

Thanks for the pointer @andrewrk ! It turns out

altstack_size_(std::max(guard_size_ * 4, static_cast<size_t>(MINSIGSTKSZ))) {

evaluates to -1 as MINSIGSTKSZ is -1. This can be reproduced with the following program:

$ cat main.cpp

#include <csignal>
#include <iostream>

int main() {
      std::cout << "Hello World!" << MINSIGSTKSZ;
          return 0;
}
$ zig-cxx -target x86_64-linux-musl *.cpp && ./a.out
Hello World!2048
$ zig-cxx -target x86_64-linux-gnu.2.33 *.cpp && ./a.out
Hello World!-1
$ clang++ *.cpp && ./a.out
Hello World!2048

For now, I have just replaced it with guard_size_ * 4 which gets things back to the Didn't find a registered implementation for name from previously. Still trying to figure out what that is about. Somehow we have some extensions, but are missing a ton:

statically linked extensions:
envoy.request_id: envoy.request_id.uuid
envoy.filters.network: envoy.filters.network.http_connection_manager, envoy.http_connection_manager
envoy.transport_sockets.downstream: envoy.transport_sockets.quic
envoy.bootstrap: envoy.extensions.network.socket_interface.default_socket_interface
envoy.resolvers: envoy.ip
envoy.matching.action: skip
envoy.transport_sockets.upstream: envoy.transport_sockets.quic
envoy.upstream_options: envoy.extensions.upstreams.http.v3.HttpProtocolOptions, envoy.upstreams.http.http_protocol_options

@PiotrSikora
Copy link
Contributor

There are a few remaining issues when cross-compiling, mostly around Bazel rules.

  1. BoringSSL fails to build, because it tries to compile assembly for the host and not target architecture. This is because it uses Bazel's old --cpu based platform detection (same as Envoy, so perhaps that's also something to look at).
    Fixed in https://boringssl-review.googlesource.com/c/boringssl/+/51265.

  2. LuaJIT fails to build, because host tools are linked with -target x.
    Fixed in lua: clear LDFLAGS for host tools. #19818.

  3. LuaJIT fails to build, because the strip tool doesn't know how to strip cross-compiled binaries.
    Fixed in lua: don't strip the binary. #19820.

  4. V8 (Wasm) - builds fine (once patched with -fno-sanitize=undefined), but crashes at runtime. The V8 build process is quite complex, because it uses mksnapshot tool (built for the host) that starts V8 as part of the build process to create and serialize V8 isolate, which is then embedded in the final library. I suspect that the serialized V8 isolate is for the host architecture, which is why it crashes. V8 supports building for different target CPUs, but I believe it's for the V8 simulator (see: https://v8.dev/docs/debug-arm) and not cross-compling, so this one might be tricky.

  5. WAMR (Wasm) - as an alternative to V8, I tired to build WAMR, but it also builds for the host and not target architecture. It uses CMake build system, and they correcty use ${CMAKE_SYSTEM_PROCESSOR} to detect the target CPU, but it looks that rules_foreign_cc doesn't detect/propagate the information about the target architecture. We could add a select() to force target architecture in cmake()'s envs as a workaround, but I worry that the lack of target platform awareness in rules_foreign_cc might be affecting other dependencies, so it would be best to fix it there.

@motiejus
Copy link

motiejus commented Feb 4, 2022

LuaJIT fails to build, because the strip tool doesn't know how to strip cross-compiled binaries.

Perhaps this can be provided as a linker flag instead? -s/-S

but it looks that rules_foreign_cc doesn't detect/propagate the information about the target architecture.

From a quick grep I can see we (my company) are cross-compiling at least libjpeg-turbo, libpng, libwebp and zlib with rules_foreign_cc and cmake. Either it works as expected, or I am not aware of any issues (because both build and runtime hosts are using amd64). I just spot-checked a couple by building for arm64, and they indeed produce arm64 archives (.a files).

@PiotrSikora
Copy link
Contributor

Perhaps this can be provided as a linker flag instead? -s/-S

The issue is not "how to strip the binary", but "how to prevent LuaJIT's Makefile from running strip". See the linked fix.

From a quick grep I can see we (my company) are cross-compiling at least libjpeg-turbo, libpng, libwebp and zlib with rules_foreign_cc and cmake. Either it works as expected, or I am not aware of any issues (because both build and runtime hosts are using amd64). I just spot-checked a couple by building for arm64, and they indeed produce arm64 archives (.a files).

There can be a many reason why they work, e.g.:

  1. They don't rely on assmebly.
  2. They don't rely on CMake to provide information about the target CPU, but they detect it some other way.

For most projects, it doesn't matter, since if you're using the same sources for all platforms, then -target x is going to force the correct architecture anyway, even if the build system doesn't know the correct target.

Having said that, I know that some of those libraries (definitely zlib) use CPU-specific files for optimizations, so it would be worth verifying that the correct files are picked up when cross-compiling.

@motiejus
Copy link

motiejus commented Feb 4, 2022

Having said that, I know that some of those libraries (definitely zlib) use CPU-specific files for optimizations, so it would be worth verifying that the correct files are picked up when cross-compiling.

Do you mean I should if it works at all? Can you elaborate?

I'd need to setup a main.c and link it to zlib.a and write a test program that is (de-)compressing something. More than 5 minutes though. :)

@PiotrSikora
Copy link
Contributor

Do you mean I should if it works at all? Can you elaborate?

I'd need to setup a main.c and link it to zlib.a and write a test program that is (de-)compressing something. More than 5 minutes though. :)

No, run bazel build -s ... and see if the expected files are compiled. I'm sure it works if it builds, but you might be missing some optimizations if you include wrong file that's then empty because of a #ifdef matching different arch, etc.

@PiotrSikora
Copy link
Contributor

  1. V8 (Wasm) - builds fine (once patched with -fno-sanitize=undefined), but crashes at runtime. The V8 build process is quite complex, because it uses mksnapshot tool (built for the host) that starts V8 as part of the build process to create and serialize V8 isolate, which is then embedded in the final library. I suspect that the serialized V8 isolate is for the host architecture, which is why it crashes. V8 supports building for different target CPUs, but I believe it's for the V8 simulator (see: https://v8.dev/docs/debug-arm) and not cross-compling, so this one might be tricky.

OK, I have V8 working and executing Proxy-Wasm plugins, but it might take a while to fix its Bazel rules.

@motiejus
Copy link

motiejus commented Feb 4, 2022

Do you mean I should if it works at all? Can you elaborate?
I'd need to setup a main.c and link it to zlib.a and write a test program that is (de-)compressing something. More than 5 minutes though. :)

No, run bazel build -s ... and see if the expected files are compiled. I'm sure it works if it builds, but you might be missing some optimizations if you include wrong file that's then empty because of a #ifdef matching different arch, etc.

i can't get it to print more verbose logs than this:

$ bazel build -s --platforms @io_bazel_rules_go//go/toolchain:linux_arm64_cgo --extra_toolchains @zig_sdk//:linux_arm64_gnu.2.28_toolchain @zlib-local//:zlib
INFO: Invocation ID: b56fbbc9-0095-443f-9101-cedb54c6aad2
INFO: Analyzed target @zlib-local//:zlib (42 packages loaded, 7718 targets configured).
INFO: Found 1 target...
SUBCOMMAND: # @rules_foreign_cc//toolchains:make_tool [action 'BootstrapGNUMake external/rules_foreign_cc/toolchains/make', configuration: 604ca03f5d00bfeab7fa9eb11e74a6e3cb4466ac99348d81217b065e9fb6bda9, execution platform: @local_config_platform//:host]
(cd /home/motiejus/.cache/bazel/_bazel_motiejus/80f026c00534678eecd7f80fa20fddc4/execroot/__main__ && \
  exec env - \
    PATH=/bin:/usr/bin:/usr/local/bin \
  /bin/bash -c bazel-out/k8-opt-exec-2B5CBBC6/bin/external/rules_foreign_cc/toolchains/make_tool_foreign_cc/wrapper_build_script.sh)
SUBCOMMAND: # @zlib-local//:zlib [action 'Foreign Cc - CMake: Building zlib', configuration: 5c17d8e6dbfc95ee1f21ebaf6c90f23d28f759370fd5613274ca9e7dd0051f5d, execution platform: @local_config_platform//:host]
(cd /home/motiejus/.cache/bazel/_bazel_motiejus/80f026c00534678eecd7f80fa20fddc4/execroot/__main__ && \
  exec env - \
    PATH=/bin:/usr/bin:/usr/local/bin \
  /bin/bash -c bazel-out/k8-fastbuild/bin/external/zlib-local/zlib_foreign_cc/wrapper_build_script.sh)
Target @zlib-local//:zlib up-to-date:
  bazel-bin/external/zlib-local/zlib/include
  bazel-bin/external/zlib-local/zlib/lib/libz.a
  bazel-bin/external/zlib-local/copy_zlib/zlib
INFO: Elapsed time: 5.851s, Critical Path: 0.15s
INFO: 7 processes: 2 remote cache hit, 5 internal.
INFO: Build completed successfully, 7 total actions

@PiotrSikora
Copy link
Contributor

@motiejus CMake logs should be in bazel-out/k8-fastbuild/bin/external/zlib-local/zlib_foreign_cc/CMake.log, but you need to add build_args = ["-v"] to cmake() to see the exact commands instead of the CMake's "pretty" view.

However, it looks that vanilla zlib doesn't automatically detect platform anyway, and you need to manually select the build option to use the assembly optimizations (see: madler/zlib's CMakeLists.txt#L9).

@PiotrSikora
Copy link
Contributor

PiotrSikora commented Feb 4, 2022

...but yeah, I verified that rules_foreign_cc doesn't propagate any cross-compiling information:

$ bazel build --incompatible_enable_cc_toolchain_resolution --platforms @zig_sdk//:linux_aarch64_platform --extra_toolchains @zig_sdk//:aarch64-linux-gnu.2.33_toolchain ...
[...]
-- DEBUG: CMAKE_CROSSCOMPILING='FALSE'
-- DEBUG: CMAKE_C_COMPILER_TARGET=''
-- DEBUG: CMAKE_SYSTEM_PROCESSOR='x86_64'

Filed on issue with upstream: bazel-contrib/rules_foreign_cc#869.

PiotrSikora added a commit to PiotrSikora/envoy that referenced this issue Feb 9, 2022
Fixes envoyproxy#19535.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
@PiotrSikora
Copy link
Contributor

I opened #19885, which adds --config=zig-cc and --config=zig-cc-linux-aarch64 that "mostly work".

The binaries work fine, but there are a few test failures that might be worth investigating (and some more when cross-compiling to Linux/aarch64).

The biggest holdbacks are the conflicts between bazel-zig-cc and:

  • emsdk - zig c++ takes priority and tries to build Proxy-Wasm C++ plugins (instead of emcc),
  • rules_rust - crashes Bazel when calculating target configuration for building Proxy-Wasm Rust plugins (it might be either because bazel-zig-cc is missing CcToolchanInfo, or because of the transition to a different --platforms in wasm_rust_binary).

which prevent building Proxy-Wasm plugins in tests.

@motiejus
Copy link

Thank you @howardjohn, that buys a lot of coffee. A nice surprise.

joshperry pushed a commit to joshperry/envoy that referenced this issue Feb 13, 2022
See
envoyproxy#19535 (comment)

This allows using othe compiler toolchains that are not literally in
/usr/include but are still present. I don't think this should impact
standard clang builds at all.

Signed-off-by: John Howard <howardjohn@google.com>
Signed-off-by: Josh Perry <josh.perry@mx.com>
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 12, 2022
@robejfarr
Copy link

Hi all, looks like there's been some great steps on the zig front on this issue and we've been having a look at zig to see if it can build Envoy for CentOS 7. Caveat we're still learning about C++ build chains and the tooling around it so we're beginners in this space.

We've installed zig on the latest Envoy ubuntu build container 18.04 and installed @PiotrSikora pull request #19885.

The only configuration we've changed is to remove the aarch toolchain build:zig-cc --extra_toolchains @zig_sdk//:aarch64-linux-gnu.2.28_toolchain and have used a few different options for the X86 toolchain and we're just looking at compilation for now.

We've also disabled the tcp_stats extension to remove the header compatibility issues on CentOS 7 (https://github.com/envoyproxy/envoy-build-tools/blob/main/build_container/CENTOS7_BUILD_STATUS.md). We are aware that there is a third party ELRepo kernel version that could work https://centos.pkgs.org/7/elrepo-kernel-x86_64/kernel-ml-headers-5.16.15-1.el7.elrepo.x86_64.rpm.html, but most CentOS 7 installs we have do not use it.

Running bazel build --config=zig-cc -c opt //source/exe:envoy-static and cross compiling

@zig_sdk//:x86_64-linux-gnu.2.28_toolchain compiles fine
@zig_sdk//:x86_64-linux-gnu.2.18_toolchain compiles fine
@zig_sdk//:x86_64-linux-gnu.2.17_toolchain does not compile and throws the error below.

1 warning generated.
ERROR: /envoy/source/exe/BUILD:23:16: Linking source/exe/envoy-static failed: (Exit 1): c++ failed: error executing command /root/.cache/bazel/_bazel_root/37e3aec351bcd85a6ea8b58e3592ef6e/external/zig_sdk/tools/c++ -o bazel-out/k8-opt/bin/source/exe/envoy-static ... (remaining 3991 arguments skipped)

Use --sandbox_debug to see verbose messages from the sandbox
warning: unsupported linker arg: @bazel-out/k8-opt/bin/bazel/gnu_build_id.ldscript
ld.lld: error: undefined symbol: __cxa_thread_atexit_impl
>>> referenced by cxa_thread_atexit.cpp
>>>               /tmp/bazel-zig-cc/o/a3f0d4f3c69cdfff9abe6bb2e87b4c32/cxa_thread_atexit.o:(__cxa_thread_atexit) in archive /tmp/bazel-zig-cc/o/4692f8c4f45428a48864c01ec35e7820/libc++abi.a
Target //source/exe:envoy-static failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2643.189s, Critical Path: 117.61s
INFO: 10716 processes: 2124 internal, 1 local, 8590 processwrapper-sandbox, 1 worker.
FAILED: Build did NOT complete successfully
root@917c7a6f20bf:/envoy# 

Is this a problem linking against a later version of glibc (https://groups.google.com/g/envoy-users/c/XK9M-RhWrxg/m/XEGMU6hVBwAJ) or something else as cxa_thread_atexit_impl.c is not in glibc 2.17 https://elixir.bootlin.com/glibc/glibc-2.17/source/stdlib/cxa_thread_atexit_impl.c and only glibc 2.18 onwards?

Is there an easy way to explain what the problem is and anything we can configure on a zig or bazel front to make it compile for @zig_sdk//:x86_64-linux-gnu.2.17_toolchain @andrewrk / @motiejus? We're happy to investigate this further with some steers.

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 18, 2022
@motiejus
Copy link

motiejus commented Mar 18, 2022

Is there an easy way to explain what the problem is and anything we can configure on a zig or bazel front to make it compile for @zig_sdk//:x86_64-linux-gnu.2.17_toolchain @andrewrk / @motiejus? We're happy to investigate this further with some steers.

The symbol doesn't seem to exist in glibc 2.17:

motiejus ~/code/glibc $ git checkout glibc-2.18
HEAD is now at eefa3be8e4 Update version.h and include/features.h for 2.18 release.
motiejus ~/code/glibc $ git grep __cxa_thread_atexit | wc -l
40
motiejus ~/code/glibc $ git checkout glibc-2.17
Updating files: 100% (8839/8839), done.
Previous HEAD position was eefa3be8e4 Update version.h and include/features.h for 2.18 release.
HEAD is now at c758a68615 Update version.h and include/features.h for 2.17 release.
motiejus ~/code/glibc $ git grep __cxa_thread_atexit | wc -l
0
motiejus ~/code/glibc $ 

May I ask why you are linking against glibc that is almost 10 years old?

Edit: looks like this is what CentOS uses. Unfortunately, I don't have a better answer than "upgrade your OS" or "build statically instead".

@motiejus
Copy link

motiejus commented Mar 24, 2022

Also, keep in mind that the glibc headers in zig are always latest (of glibc), but the symbol definitions are pointing to the right glibc version. So if you are compiling for gnu.2.17, even thought the symbol does not exist in that version, the function prototype will be included in the headers anyway:

~zig/lib/libc/glibc/include $ git grep __cxa_thread_atexit
stdlib.h:extern int __cxa_thread_atexit_impl (void (*func) (void *), void *arg,
~zig/lib/libc/glibc/include $ 

However, linking phase will fail, like it does now.

The fact that the function prototype is included, but linking fails, is a known discrepancy (similar issue in ziglang/zig#9485). If envoy can compile on glibc 2.17 at all, maybe it detects that the symbol does not exist, and uses a different one instead. So inconsistent headers may or may not be a problem and a solution. :)

cc @marler8997 who has done some work on tackling this.

@robejfarr
Copy link

Thanks @motiejus for continuing to look into this.

Yes glibc 2.17 is the CentOS 7 default and why we're trying to compile using it until the maintenance window expires (probably a bit before).

Atm we can get Envoy to build using glibc 2.17 just on CentOS 7 disabling the tcp_stats extension (see comment above). We use the Envoy CentOS 7 build container script located here (https://github.com/envoyproxy/envoy-build-tools/tree/main/build_container/Dockerfile-centos) and per your note above we presume that's the reason why it works currently.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 24, 2022
@github-actions
Copy link

github-actions bot commented May 1, 2022

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@github-actions github-actions bot closed this as completed May 1, 2022
@motiejus
Copy link

Thanks @motiejus for continuing to look into this.

Atm we can get Envoy to build using glibc 2.17 just on CentOS 7 disabling the tcp_stats extension (see comment above). We use the Envoy CentOS 7 build container script located here (https://github.com/envoyproxy/envoy-build-tools/tree/main/build_container/Dockerfile-centos) and per your note above we presume that's the reason why it works currently.

Stumbled upon this by accident. Quick update: @andrewrk started working on universal headers for zig, which, when implemented, will this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants