-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDBF-793 - Retire MSAN clang-15 builder, upgrade to clang-20 #562
base: dev
Are you sure you want to change the base?
MDBF-793 - Retire MSAN clang-15 builder, upgrade to clang-20 #562
Conversation
@grooverdan |
89ab27f
to
852a451
Compare
0c93f34
to
92566d9
Compare
d7eb898
to
4be5d7c
Compare
fe47a5e
to
54b43e1
Compare
3775559
to
bf026a0
Compare
ci_build_images/msan.Dockerfile
Outdated
ARG CLANG_VERSION=15 | ||
ARG CLANG_VERSION=19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be 20, right? Or do we intentionally want to retain some support for clang-19? Later in the script we’re comparing CLANG_VERSION
to even smaller version numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's is bit in the .github/workflows that overrides this to 20. Yes, could be changed to 20 here, I was just making sure there wasn't an aversion to using it.
I was aiming on preserving compat with pre-19, but I think that's dropped in an attempt to get 19/20 working and bits could be cleaned out. There's a small amount of compat with the bleeding edge 21.
ci_build_images/msan.Dockerfile
Outdated
|
||
RUN apt-get source gnutls28 \ | ||
RUN . /etc/os-release \ | ||
&& export CFLAGS="-fno-omit-frame-pointer -O2 -g -fsanitize=memory" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d add -march=native
to compensate for some overhead. If we wanted deterministic results between Docker images deployed in different environments, that might be a bad idea. I’m not sure if we really want that; developers can log into a Buildbot machine if really needed for debugging some issue. Besides, we already have some CPU detection code for things like CRC calculation that wouldn’t be guarded by -march
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer local reproducibility, we may also find this limitation hurts us between various builders down the road.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we've got a build environment of github actions where the Dockerfile builds an image, and a run environment on where the buildbot worker is deployed.
Note these CFLAGS/CXXFLAGS/LDFLAGS at this point are just for creating the instrumented libraries, they aren't used to compile MariaDB. In an effort to avoid debugging instrumented libraries in different environments, I think I'd omit the -arch=native here and hope all libraries provide a reliable ABI.
The built container image should be able to run anywhere where any CFLAGS/CXXFLAGS can be set at runtime. This is encouraged (and TODO follow up documentation) and why rr was put into the container.
The actual flags used in the MariaDB msan build are default as listed in the master.cfg change. Generally trying to keep flags used in buildbot as the ones needed for the environment in which is run, then developers can change these in the server code like MDEV-34996 if needed.
@RazvanLiviuVarzaru, given the amount of bash scripting in the Dockerfile, I am wondering if (for maintainability) we should not instead create that script, copy it into the container an use it for the build. |
I am on board with this, following our approach to building bintars. Also, all the build steps require comments as to why certain flags are used! @grooverdan Those should be added. |
73cd2d3
to
6703791
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I correctly understand, this is a docker file that can be built alone, right?
It took me a while to figure it out that you are actually using COPY --from=rr
to transfer the binary to the MSAN container image.
Maybe it's useful that in the header of this Dockerfile to say:
- that this is behaving like a stage in a multi-stage Dockerfile after the contents of the base dockerfile (debian.Dockerfile) are added on top of it . (order is important)
- can be built alone if needed
- in the multi-stage setup it's acting like a .fragment. Dockerfile and sharing the RR binary is achieved using
COPY --from=stage
syntax - in the same multi-stage setup, to highlight the importance of
FROM $BASE_IMAGE
, I guess that it's important that Building RR and Building MSAN should happen on the same OS/VERSION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it can be built alone.
Multi-stage - doesn't explictly add on top. It just provides the arctifacts for COPY --from=stage
.
The requirements for the later stage are a bit less. Hopefully covered all these in updated header of the file.
util.BuilderConfig( | ||
name="amd64-debug-msan-clang", | ||
workernames=workers["x64-bbw-docker-debug-msan-clang"], | ||
tags=["Debian", "clang", "msan", "debug"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the builder name doesn't have the version of clang in the name and if the tags do not give the user a clue, then the user's last resort is to study the logs.
It's useful to have an easy to spot place for telling the clang-version even if that means we will update the configuration once in a while.
In Grid view, builder name is the easiest indicator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion:
amd64-msan-clang-20
amd64-msan-clang-20-debug
What are the benefits?
- debug word at the end as in other builders, also draws attention
- builder history wise we keep apples and pears in separate baskets e.g. clang-20 full run history in
amd64-msan-clang-20
, future clang-21 history inamd64-msan-clang-21
- transitioning . Future
amd64-msan-clang-21
can run in parallel withamd64-msan-clang-20
until is stable and we can discontinue the latter smoothly (just a protected branches switch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention build type/compiler/extra tests wasn't fully clear.
Ok, I accept these reasons.
reminder both Marko and I are almost of the opinion that the msan-debug might be providing more timeouts for a minimal addition of coverage (on an non-released work). And the issue noted on zulip means it currently doesn't even bootstrap on debug, so look at dropping this if too much trouble, and the debug obviousn't won't be protected branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given updates to clang-21 if it still occurs within the support cycle of debian-12 I've updated the image tag to include -20
.
canStartBuild=canStartBuild, | ||
properties={"c_compiler": "clang", "cxx_compiler": "clang++", "build_type": "RelWithDebInfo"}, | ||
locks=getLocks, | ||
factory=f_msan_build, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take into account that now f_msan_build
factory in master-docker-nonstandard-2
is not equal anymore with f_msan_build
factory in master-docker-nonstandard
(where clang-16 builder is).
We should have extra care at transitioning and mention the plan. Things to note:
- a protected branches builder should have backup workers. It's OK that now these 2 builders only have
apexis
, they are not protected.MSAN Clang-16
for example hashz-bbw1 / 4 / 5
as worker nodes. - at transitioning,
f_msan_build
factory that clang-16 uses is no longer valid (right ?) and we can movef_msan_build
from your patch incommon_factories.py
and of course, adjust the worker pool of these 2 builders when they become protected.
Maybe @cvicentiu has better ideas here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable, however a transition plan is probably better in in the MDBF task.
image: ${{ matrix.image }} | ||
platforms: ${{ matrix.platforms }} | ||
tag: ${{ matrix.tag }} | ||
branch: ${{ matrix.branch }} | ||
clang_version: ${{ matrix.clang_version }} | ||
nogalera: ${{ matrix.nogalera }} | ||
files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think files:
is needed, adds extra complexity.
In bintars, for example, we just do COPY ci_build_images/scripts/* /scripts/
without any files:
specification in the workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And as @cvicentiu mentioned, for future us, as many detailed comments as you can for the library instrumentation script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is build with uses: bbw_build_container_template.yml
, so the additional file complexity is needed to fit in with what the template expects.
I did include as much information as I know in the library instrumentation script.
# Marker to make it possible to build a dev msan builder | ||
# from the nightly clang versions as they are in a differently | ||
# name repo | ||
ENV CLANG_DEV_VERSION=21 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for local testing?
So when we move to clang 21 we just set CLANG_VERSION=21
and CLANG_DEV_VERSION
to what is currently in the nightly builds repository?
If yes, it's worth adding a comment on how do we handle upgrades.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local testing or if ever every want to use a dev version for BB testing.
CLANG_VERSION
is the version you want. CLANG_DEV_VERSION
https://apt.llvm.org/ under "development" branch (which is in a state of flux as clang-20 is at the end of its rc cycle.
file comment updated.
Part of MDBF-993. As per Vicențiu Ciorbaru request, last-N-failed is only reporting its status (not in branch protection) and Vicențiu Ciorbaru reported often sporadic failures for it. debian-11-msan It will be replaced by a clang-20 builder in MariaDB#562 so no need to update GITHUB_STATUS_BUILDERS for it in this Pull Request.
pkg-config \ | ||
python3-pexpect \ | ||
unzip \ | ||
zlib1g-dev \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list of packages if exactly the same as the one installed could be in a variable, this would ease maintenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is RR dependencies, its not exactly the same.
Part of MDBF-993. As per Vicențiu Ciorbaru request, last-N-failed is only reporting its status (not in branch protection) and Vicențiu Ciorbaru reported often sporadic failures for it. debian-11-msan It will be replaced by a clang-20 builder in #562 so no need to update GITHUB_STATUS_BUILDERS for it in this Pull Request.
Here we are updating to Debian 12 and clang-20 There where major changes form the clang-16 version meaning creating a compatible script to build both was proving difficult (or messy when it was paritally acheived). As such the github workflow on build-debian.msan-based removes the clang-16 script to ensure this is stable while a new Debian-12 /Clang-20 image/worker are created. Consistent with other changes in ci_images, msan is a fragment an not a self contained container so it has been renamed. The build of clang to instrument sanitizer=memory has been adjusted so it can optionally build with a development version of clang, defined in the build script as CLANG_DEV_VERSION. libunwind is part of the clang now. It needed to be built with msan, however the libraries it generates currently contain a bug with pthread_conditions and multiple threads causing event a basic mariadb server start to corrupt its stack (or similar) during the exit of server. A build without this library failed in some way (that I can't exactly remember now). The instrumentation of libraries has been expanded, but moved to its own script in the next commit. Update alternatives have make the new clang in the container image the default without for its base name the need to suffix the compiler with the version.
maintaining large instrumentation in a Dockerfile creating scripting difficulties dealing with Dockerfile env parsing and a posix shell. moving the instrumentation to the instrumentation script resolved these difficulties and conforms to some best practices guides for containers. here we also expanded the instrumented builds notably to include: * openssl (the parsec plugin of MariaDB uses this and is more consistent with the builds that we do in releases) The start of expanding the instrumented builds to get the instrumented builds for a MSAN galera where started though its completion has been deferred.
d1a53ba
to
1436198
Compare
…tion (factory) An additional primary purpose for this is to upgrade to the run in default RelWithDebInfo CMAKE_BUILD_TYPE. This was requested by Marko to make this a more realistic test of the codebase delivered to users. Coding constructs to prevent memory sanitation problems can be inadvertently be left as Debug only fixes when there are real impacts that need to be addressed. The end goal is for the RelWithDebInfo to be the protected branch builder. During this work, server work was introduces to the codebase under MDEV-34996 and MDEV-36156 to eliminate the proliferation of things required in the build compile process to create a MSAN build. The concrete examples here are C{,XX}FLAG inclusion have been removed. Of the remaining flags. * WITH_EMBEDDED_SERVER=OFF - mainly a steadup * WITH_INNODB_{BZIP2,LZ2,LZMA,LZO,SNAPPY}=OFF - these needed instrumented compression libraries that has not been done. * WITH_ZLIB=bundled - likewise, but is a good candidate to reduce repeated compulation. * LIB_AIO and LIBURING have been disabled, these aren't instrumented and its currently uncertian if the MSAN upstream has the required shims to correctly track the MSAN initialization that these libraries provide. Also LIBURING is blocked in containers via seccomp rules (see MDEV-36234). Relaxing this might in buildbot might expose a way for a pull request to escalate to something more serious like a container escape with arbitary code which we'd like to avoid. * WITH_NUMA/SYSTEMD=no - not instrumented libraries - wasn't significant server code around these code paths. * HAVE_CXX_NEW=1 is needed otherwise there will be msan errors within the C++ library. * CMAKE_{EXE,MODULE}_LINKER_FLAGS - compiles and links generated artifacts with a rpath to find the MSAN libraries first. If we didn't do this we'd have to run the MTR tests with LD_LIBRARY_PATH=$MSAN_LIBDIR and this would result on many things in the system from perl and tools that aren't compiled against MSAN, and as such running these things in MTR would fail on the script. because of this unit tests are now enabled in * CMAKE_CXX_FLAGS=-fsanitize=memory - the needed for this is removed by server commit in MDEV-36156 which hasn't been applied to 10.5. * WITH_DBUG_TRACE=OFF adds effort of MSAN for no real gain. Only applicable to Debug builds and should be disabled in server (under MSAN at least). The disabled PLUGIN list has been reduced. CONNECT has been observed as passing successfully. SPIDER has at least one outstanding MSAN MDEV-35009, and RockDB also has MSAN errors around basic index usage. OQGRAPH needs instrumented boost (as would columnstore but this fails on compile check) but also fails to compile on the late clang version due to using deprecated C++ syntax. ARCHIVE has been enabled as it passes and has a few tests. Mroonga is disables as there aren't any MTR tests to exercise any functionality so there was no point building it. make has been replaced with cmake --build like the intro to cmake docs says to do and package has been removed from the default target. This just consumed more build time for something that wasn't saved or usable if it was saved. On MTR: * MSAN_OPTIONS is defined in the container by default * LD_LIBRARY_PATH removed as before * the list of failing tests isn't failing any more, and we shouldn't be maintaining these lists in BB. If there's failure, the server code can correct or mask these. There is a debug builder this there.
Part of the point of simplifying MSAN is to make the Buildbot worker images into something that is usable by developers to repoduce failures in buildbot (without buildbot doing the work) and to test in different environments seen as important. As generating the MSAN is specialized task in itself this is an important goal in itself. Understanding MSAN bugs needs to do debug examination of when things are initalizied, changed etc that is very hard without RR. As such rr has been added to the container image to support this debugging by developers.
1436198
to
e981a07
Compare
Replace MSAN CLANG-15 builder with CLANG 19 on Debian 12.