Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDBF-793 - Retire MSAN clang-15 builder, upgrade to clang-20 #562

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

RazvanLiviuVarzaru
Copy link
Collaborator

@RazvanLiviuVarzaru RazvanLiviuVarzaru commented Sep 12, 2024

Replace MSAN CLANG-15 builder with CLANG 19 on Debian 12.

@RazvanLiviuVarzaru RazvanLiviuVarzaru marked this pull request as ready for review September 12, 2024 12:46
@RazvanLiviuVarzaru
Copy link
Collaborator Author

@grooverdan
I think this closes #482.

@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from 89ab27f to 852a451 Compare September 25, 2024 01:37
@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from 0c93f34 to 92566d9 Compare November 15, 2024 07:22
@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from fe47a5e to 54b43e1 Compare November 28, 2024 04:32
@grooverdan grooverdan changed the title MDBF-793 - Retire MSAN clang-15 builder, upgrade to clang-19 MDBF-793 - Retire MSAN clang-15 builder, upgrade to clang-20 Feb 25, 2025
@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from 3775559 to bf026a0 Compare February 25, 2025 07:28
ARG CLANG_VERSION=15
ARG CLANG_VERSION=19
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be 20, right? Or do we intentionally want to retain some support for clang-19? Later in the script we’re comparing CLANG_VERSION to even smaller version numbers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's is bit in the .github/workflows that overrides this to 20. Yes, could be changed to 20 here, I was just making sure there wasn't an aversion to using it.

I was aiming on preserving compat with pre-19, but I think that's dropped in an attempt to get 19/20 working and bits could be cleaned out. There's a small amount of compat with the bleeding edge 21.


RUN apt-get source gnutls28 \
RUN . /etc/os-release \
&& export CFLAGS="-fno-omit-frame-pointer -O2 -g -fsanitize=memory" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d add -march=native to compensate for some overhead. If we wanted deterministic results between Docker images deployed in different environments, that might be a bad idea. I’m not sure if we really want that; developers can log into a Buildbot machine if really needed for debugging some issue. Besides, we already have some CPU detection code for things like CRC calculation that wouldn’t be guarded by -march.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer local reproducibility, we may also find this limitation hurts us between various builders down the road.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we've got a build environment of github actions where the Dockerfile builds an image, and a run environment on where the buildbot worker is deployed.

Note these CFLAGS/CXXFLAGS/LDFLAGS at this point are just for creating the instrumented libraries, they aren't used to compile MariaDB. In an effort to avoid debugging instrumented libraries in different environments, I think I'd omit the -arch=native here and hope all libraries provide a reliable ABI.

The built container image should be able to run anywhere where any CFLAGS/CXXFLAGS can be set at runtime. This is encouraged (and TODO follow up documentation) and why rr was put into the container.

The actual flags used in the MariaDB msan build are default as listed in the master.cfg change. Generally trying to keep flags used in buildbot as the ones needed for the environment in which is run, then developers can change these in the server code like MDEV-34996 if needed.

@fauust
Copy link
Collaborator

fauust commented Feb 25, 2025

@RazvanLiviuVarzaru, given the amount of bash scripting in the Dockerfile, I am wondering if (for maintainability) we should not instead create that script, copy it into the container an use it for the build.

@cvicentiu
Copy link
Member

@RazvanLiviuVarzaru, given the amount of bash scripting in the Dockerfile, I am wondering if (for maintainability) we should not instead create that script, copy it into the container an use it for the build.

I am on board with this, following our approach to building bintars.

Also, all the build steps require comments as to why certain flags are used! @grooverdan Those should be added.

@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch 2 times, most recently from 73cd2d3 to 6703791 Compare February 26, 2025 05:01
Copy link
Collaborator Author

@RazvanLiviuVarzaru RazvanLiviuVarzaru Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I correctly understand, this is a docker file that can be built alone, right?
It took me a while to figure it out that you are actually using COPY --from=rr to transfer the binary to the MSAN container image.

Maybe it's useful that in the header of this Dockerfile to say:

  • that this is behaving like a stage in a multi-stage Dockerfile after the contents of the base dockerfile (debian.Dockerfile) are added on top of it . (order is important)
  • can be built alone if needed
  • in the multi-stage setup it's acting like a .fragment. Dockerfile and sharing the RR binary is achieved using COPY --from=stage syntax
  • in the same multi-stage setup, to highlight the importance of FROM $BASE_IMAGE, I guess that it's important that Building RR and Building MSAN should happen on the same OS/VERSION

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it can be built alone.

Multi-stage - doesn't explictly add on top. It just provides the arctifacts for COPY --from=stage.

The requirements for the later stage are a bit less. Hopefully covered all these in updated header of the file.

util.BuilderConfig(
name="amd64-debug-msan-clang",
workernames=workers["x64-bbw-docker-debug-msan-clang"],
tags=["Debian", "clang", "msan", "debug"],
Copy link
Collaborator Author

@RazvanLiviuVarzaru RazvanLiviuVarzaru Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the builder name doesn't have the version of clang in the name and if the tags do not give the user a clue, then the user's last resort is to study the logs.

It's useful to have an easy to spot place for telling the clang-version even if that means we will update the configuration once in a while.

In Grid view, builder name is the easiest indicator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion:

  • amd64-msan-clang-20
  • amd64-msan-clang-20-debug

What are the benefits?

  • debug word at the end as in other builders, also draws attention
  • builder history wise we keep apples and pears in separate baskets e.g. clang-20 full run history in amd64-msan-clang-20, future clang-21 history in amd64-msan-clang-21
  • transitioning . Future amd64-msan-clang-21 can run in parallel with amd64-msan-clang-20 until is stable and we can discontinue the latter smoothly (just a protected branches switch)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention build type/compiler/extra tests wasn't fully clear.

Ok, I accept these reasons.

reminder both Marko and I are almost of the opinion that the msan-debug might be providing more timeouts for a minimal addition of coverage (on an non-released work). And the issue noted on zulip means it currently doesn't even bootstrap on debug, so look at dropping this if too much trouble, and the debug obviousn't won't be protected branches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given updates to clang-21 if it still occurs within the support cycle of debian-12 I've updated the image tag to include -20.

canStartBuild=canStartBuild,
properties={"c_compiler": "clang", "cxx_compiler": "clang++", "build_type": "RelWithDebInfo"},
locks=getLocks,
factory=f_msan_build,
Copy link
Collaborator Author

@RazvanLiviuVarzaru RazvanLiviuVarzaru Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take into account that now f_msan_build factory in master-docker-nonstandard-2 is not equal anymore with f_msan_build factory in master-docker-nonstandard (where clang-16 builder is).

We should have extra care at transitioning and mention the plan. Things to note:

  • a protected branches builder should have backup workers. It's OK that now these 2 builders only have apexis, they are not protected. MSAN Clang-16 for example has hz-bbw1 / 4 / 5 as worker nodes.
  • at transitioning, f_msan_build factory that clang-16 uses is no longer valid (right ?) and we can move f_msan_build from your patch in common_factories.py and of course, adjust the worker pool of these 2 builders when they become protected.

Maybe @cvicentiu has better ideas here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, however a transition plan is probably better in in the MDBF task.

image: ${{ matrix.image }}
platforms: ${{ matrix.platforms }}
tag: ${{ matrix.tag }}
branch: ${{ matrix.branch }}
clang_version: ${{ matrix.clang_version }}
nogalera: ${{ matrix.nogalera }}
files:
Copy link
Collaborator Author

@RazvanLiviuVarzaru RazvanLiviuVarzaru Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think files: is needed, adds extra complexity.
In bintars, for example, we just do COPY ci_build_images/scripts/* /scripts/ without any files: specification in the workflow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as @cvicentiu mentioned, for future us, as many detailed comments as you can for the library instrumentation script.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is build with uses: bbw_build_container_template.yml , so the additional file complexity is needed to fit in with what the template expects.

I did include as much information as I know in the library instrumentation script.

# Marker to make it possible to build a dev msan builder
# from the nightly clang versions as they are in a differently
# name repo
ENV CLANG_DEV_VERSION=21
Copy link
Collaborator Author

@RazvanLiviuVarzaru RazvanLiviuVarzaru Feb 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for local testing?
So when we move to clang 21 we just set CLANG_VERSION=21 and CLANG_DEV_VERSION to what is currently in the nightly builds repository?

If yes, it's worth adding a comment on how do we handle upgrades.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local testing or if ever every want to use a dev version for BB testing.

CLANG_VERSION is the version you want. CLANG_DEV_VERSION https://apt.llvm.org/ under "development" branch (which is in a state of flux as clang-20 is at the end of its rc cycle.

file comment updated.

RazvanLiviuVarzaru added a commit to RazvanLiviuVarzaru/buildbot-r that referenced this pull request Mar 3, 2025
Part of MDBF-993.
As per Vicențiu Ciorbaru request,

last-N-failed is only reporting its status (not in branch protection)
and Vicențiu Ciorbaru reported often sporadic failures for it.

debian-11-msan It will be replaced by a clang-20 builder in MariaDB#562
so no need to update GITHUB_STATUS_BUILDERS for it in this Pull Request.
pkg-config \
python3-pexpect \
unzip \
zlib1g-dev \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list of packages if exactly the same as the one installed could be in a variable, this would ease maintenance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is RR dependencies, its not exactly the same.

RazvanLiviuVarzaru added a commit that referenced this pull request Mar 10, 2025
Part of MDBF-993.
As per Vicențiu Ciorbaru request,

last-N-failed is only reporting its status (not in branch protection)
and Vicențiu Ciorbaru reported often sporadic failures for it.

debian-11-msan It will be replaced by a clang-20 builder in #562
so no need to update GITHUB_STATUS_BUILDERS for it in this Pull Request.
Here we are updating to Debian 12 and clang-20

There where major changes form the clang-16 version meaning
creating a compatible script to build both was proving difficult
(or messy when it was paritally acheived).

As such the github workflow on build-debian.msan-based removes
the clang-16 script to ensure this is stable while a new Debian-12
/Clang-20 image/worker are created.

Consistent with other changes in ci_images, msan is a fragment
an not a self contained container so it has been renamed.

The build of clang to instrument sanitizer=memory has been
adjusted so it can optionally build with a development
version of clang, defined in the build script as CLANG_DEV_VERSION.

libunwind is part of the clang now. It needed to be built
with msan, however the libraries it generates currently contain
a bug with pthread_conditions and multiple threads causing
event a basic mariadb server start to corrupt its stack (or
similar) during the exit of server. A build without this
library failed in some way (that I can't exactly remember now).

The instrumentation of libraries has been expanded, but moved to
its own script in the next commit.

Update alternatives have make the new clang in the container image
the default without for its base name the need to suffix the compiler
with the version.
maintaining large instrumentation in a Dockerfile creating scripting
difficulties dealing with Dockerfile env parsing and a posix shell.

moving the instrumentation to the instrumentation script resolved
these difficulties and conforms to some best practices guides for
containers.

here we also expanded the instrumented builds notably to include:
* openssl (the parsec plugin of MariaDB uses this and is more consistent
with the builds that we do in releases)

The start of expanding the instrumented builds to get the instrumented
builds for a MSAN galera where started though its completion has been
deferred.
@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from d1a53ba to 1436198 Compare March 10, 2025 23:41
…tion (factory)

An additional primary purpose for this is to upgrade to the
run in default RelWithDebInfo CMAKE_BUILD_TYPE.

This was requested by Marko to make this a more realistic test
of the codebase delivered to users. Coding constructs to prevent
memory sanitation problems can be inadvertently be left as Debug only
fixes when there are real impacts that need to be addressed.

The end goal is for the RelWithDebInfo to be the protected branch builder.

During this work, server work was introduces to the codebase under MDEV-34996
and MDEV-36156 to eliminate the proliferation of things required in the build
compile process to create a MSAN build. The concrete examples here are C{,XX}FLAG
inclusion have been removed.

Of the remaining flags.
* WITH_EMBEDDED_SERVER=OFF - mainly a steadup
* WITH_INNODB_{BZIP2,LZ2,LZMA,LZO,SNAPPY}=OFF - these needed instrumented
  compression libraries that has not been done.
* WITH_ZLIB=bundled - likewise, but is a good candidate to reduce repeated compulation.
* LIB_AIO and LIBURING have been disabled, these aren't instrumented and its
  currently uncertian if the MSAN upstream has the required shims to correctly
  track the MSAN initialization that these libraries provide. Also LIBURING is
  blocked in containers via seccomp rules (see MDEV-36234). Relaxing this might
  in buildbot might expose a way for a pull request to escalate to something
  more serious like a container escape with arbitary code which we'd like to avoid.
* WITH_NUMA/SYSTEMD=no - not instrumented libraries - wasn't significant server
  code around these code paths.
* HAVE_CXX_NEW=1 is needed otherwise there will be msan errors within
  the C++ library.
* CMAKE_{EXE,MODULE}_LINKER_FLAGS - compiles and links generated artifacts
  with a rpath to find the MSAN libraries first. If we didn't do this we'd
  have to run the MTR tests with LD_LIBRARY_PATH=$MSAN_LIBDIR and this would
  result on many things in the system from perl and tools that aren't compiled
  against MSAN, and as such running these things in MTR would fail on the script.
  because of this unit tests are now enabled in
* CMAKE_CXX_FLAGS=-fsanitize=memory - the needed for this is removed by server
  commit in MDEV-36156 which hasn't been applied to 10.5.
* WITH_DBUG_TRACE=OFF adds effort of MSAN for no real gain. Only
  applicable to Debug builds and should be disabled in server (under
  MSAN at least).

The disabled PLUGIN list has been reduced. CONNECT has been observed as passing
successfully. SPIDER has at least one outstanding MSAN MDEV-35009, and
RockDB also has MSAN errors around basic index usage. OQGRAPH needs instrumented
boost (as would columnstore but this fails on compile check) but also
fails to compile on the late clang version due to using deprecated C++
syntax. ARCHIVE has been enabled as it passes and has a few tests.
Mroonga is disables as there aren't any MTR tests to exercise any
functionality so there was no point building it.

make has been replaced with cmake --build like the intro to cmake docs says to do
and package has been removed from the default target. This just consumed more build
time for something that wasn't saved or usable if it was saved.

On MTR:
* MSAN_OPTIONS is defined in the container by default
* LD_LIBRARY_PATH removed as before
* the list of failing tests isn't failing any more, and we shouldn't be maintaining these
  lists in BB. If there's failure, the server code can correct or mask these.

There is a debug builder this there.
Part of the point of simplifying MSAN is to make the
Buildbot worker images into something that is usable
by developers to repoduce failures in buildbot (without
buildbot doing the work) and to test in different
environments seen as important.

As generating the MSAN is specialized task in itself
this is an important goal in itself.

Understanding MSAN bugs needs to do debug examination
of when things are initalizied, changed etc that is
very hard without RR. As such rr has been added to the
container image to support this debugging by developers.
@grooverdan grooverdan force-pushed the feature/msan-clang-19 branch from 1436198 to e981a07 Compare March 11, 2025 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants