Skip to content

[llvm][aarch64] Fix Arm64EC name mangling algorithm #115567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 13, 2024

Conversation

dpaoliello
Copy link
Contributor

@dpaoliello dpaoliello commented Nov 8, 2024

Arm64EC uses a special name mangling mode that adds $$h between the symbol name and its type. In MSVC's name mangling @ is used to separate the name and type BUT it is also used for other purposes, such as the separator between paths in a fully qualified name.

The original algorithm was quite fragile and made assumptions that didn't hold true for all MSVC mangled symbols, so instead of trying to improve this algorithm we are now using the demangler to indicate where the insertion point should be (i.e., to parse the fully-qualified name and return the current string offset).

Also fixed isArm64ECMangledFunctionName to search for @$$h since the $$h must always be after a @.

Fixes #115231

@llvmbot
Copy link
Member

llvmbot commented Nov 8, 2024

@llvm/pr-subscribers-platform-windows

@llvm/pr-subscribers-llvm-ir

Author: Daniel Paoliello (dpaoliello)

Changes

Arm64EC uses a special name mangling mode that adds $$h between the symbol name and its type. In MSVC's name mangling @ is used to separate the name and type BUT it is also used for other purposes, such as the separator between paths in a fully qualified name.

This change fixes the algorithm to:

  • Ignore the last 4 characters of the mangled name - this may contain a @@ if the symbol is a function returning a type with a qualified name.
  • Search backwards for @@ - this will find the @ at the end of the symbol's name if it is qualified (which includes those in the global namespace) + the @ used as the separator between the name and the type.
  • If @@ is not found, search for just @ - this finds the @ used as the separator between the name and the type for global symbols (such as operator new()) BUT assumes that their return types are not types with a qualified name.

Also fixed isArm64ECMangledFunctionName to search for @$$h since the $$h must always be after a @.

Fixes #115231


Full diff: https://github.com/llvm/llvm-project/pull/115567.diff

3 Files Affected:

  • (modified) llvm/include/llvm/IR/Mangler.h (+1-1)
  • (modified) llvm/lib/IR/Mangler.cpp (+15-6)
  • (modified) llvm/unittests/IR/ManglerTest.cpp (+62)
diff --git a/llvm/include/llvm/IR/Mangler.h b/llvm/include/llvm/IR/Mangler.h
index 3c3f0c6dce80fa..6c8ebf5f072f28 100644
--- a/llvm/include/llvm/IR/Mangler.h
+++ b/llvm/include/llvm/IR/Mangler.h
@@ -64,7 +64,7 @@ std::optional<std::string> getArm64ECDemangledFunctionName(StringRef Name);
 /// Check if an ARM64EC function name is mangled.
 bool inline isArm64ECMangledFunctionName(StringRef Name) {
   return Name[0] == '#' ||
-         (Name[0] == '?' && Name.find("$$h") != StringRef::npos);
+         (Name[0] == '?' && Name.find("@$$h") != StringRef::npos);
 }
 
 } // End llvm namespace
diff --git a/llvm/lib/IR/Mangler.cpp b/llvm/lib/IR/Mangler.cpp
index 15a4debf191a5b..12be14156f6656 100644
--- a/llvm/lib/IR/Mangler.cpp
+++ b/llvm/lib/IR/Mangler.cpp
@@ -302,14 +302,23 @@ std::optional<std::string> llvm::getArm64ECMangledFunctionName(StringRef Name) {
   // Insert the ARM64EC "$$h" tag after the mangled function name.
   if (Name.contains("$$h"))
     return std::nullopt;
-  size_t InsertIdx = Name.find("@@");
-  size_t ThreeAtSignsIdx = Name.find("@@@");
-  if (InsertIdx != std::string::npos && InsertIdx != ThreeAtSignsIdx) {
+
+  // The last 4 characters of the symbol type may contain a `@@` if the symbol
+  // is returning a qualified type. We don't want to insert `$$h` at that point.
+  auto TrimmedName = Name.drop_back(4);
+
+  // The last `@@` is the separation between the qualified name of the symbol
+  // and its type, which is where we want to insert `$$h`.
+  auto InsertIdx = TrimmedName.rfind("@@");
+  if (InsertIdx != StringRef::npos) {
     InsertIdx += 2;
   } else {
-    InsertIdx = Name.find("@");
-    if (InsertIdx != std::string::npos)
-      InsertIdx++;
+    // If there is no `@@`, then this is a global symbol (e.g., `operator new`)
+    // so look for a `@` instead (since we assume that it will not return a
+    // qualified type).
+    InsertIdx = TrimmedName.find_last_of('@');
+    assert(InsertIdx != StringRef::npos && "Invalid mangled name");
+    InsertIdx += 1;
   }
 
   return std::optional<std::string>(
diff --git a/llvm/unittests/IR/ManglerTest.cpp b/llvm/unittests/IR/ManglerTest.cpp
index 5ac784b7e89ac6..017d8303551244 100644
--- a/llvm/unittests/IR/ManglerTest.cpp
+++ b/llvm/unittests/IR/ManglerTest.cpp
@@ -172,4 +172,66 @@ TEST(ManglerTest, GOFF) {
             "L#foo");
 }
 
+TEST(ManglerTest, Arm64EC) {
+  constexpr std::string_view Arm64ECNames[] = {
+      // Basic C name.
+      "#Foo",
+
+      // Basic C++ name.
+      "?foo@@$$hYAHXZ",
+
+      // Regression test: https://github.com/llvm/llvm-project/issues/115231
+      "?GetValue@?$Wrapper@UA@@@@$$hQEBAHXZ",
+
+      // Symbols from:
+      // ```
+      // namespace A::B::C::D {
+      // struct Base {
+      //   virtual int f() { return 0; }
+      // };
+      // }
+      // struct Derived : public A::B::C::D::Base {
+      //   virtual int f() override { return 1; }
+      // };
+      // A::B::C::D::Base* MakeObj() { return new Derived(); }
+      // ```
+      // void * __cdecl operator new(unsigned __int64)
+      "??2@$$hYAPEAX_K@Z",
+      // public: virtual int __cdecl A::B::C::D::Base::f(void)
+      "?f@Base@D@C@B@A@@$$hUEAAHXZ",
+      // public: __cdecl A::B::C::D::Base::Base(void)
+      "??0Base@D@C@B@A@@$$hQEAA@XZ",
+      // public: virtual int __cdecl Derived::f(void)
+      "?f@Derived@@$$hUEAAHXZ",
+      // public: __cdecl Derived::Derived(void)
+      "??0Derived@@$$hQEAA@XZ",
+      // struct A::B::C::D::Base * __cdecl MakeObj(void)
+      "?MakeObj@@$$hYAPEAUBase@D@C@B@A@@XZ",
+  };
+
+  for (const auto &Arm64ECName : Arm64ECNames) {
+    // Check that this is a mangled name.
+    EXPECT_TRUE(isArm64ECMangledFunctionName(Arm64ECName))
+        << "Test case: " << Arm64ECName;
+    // Refuse to mangle it again.
+    EXPECT_FALSE(getArm64ECMangledFunctionName(Arm64ECName).has_value())
+        << "Test case: " << Arm64ECName;
+
+    // Demangle.
+    auto Arm64Name = getArm64ECDemangledFunctionName(Arm64ECName);
+    EXPECT_TRUE(Arm64Name.has_value()) << "Test case: " << Arm64ECName;
+    // Check that it is not mangled.
+    EXPECT_FALSE(isArm64ECMangledFunctionName(Arm64Name.value()))
+        << "Test case: " << Arm64ECName;
+    // Refuse to demangle it again.
+    EXPECT_FALSE(getArm64ECDemangledFunctionName(Arm64Name.value()).has_value())
+        << "Test case: " << Arm64ECName;
+
+    // Round-trip.
+    auto RoundTripArm64ECName =
+        getArm64ECMangledFunctionName(Arm64Name.value());
+    EXPECT_EQ(RoundTripArm64ECName, Arm64ECName);
+  }
+}
+
 } // end anonymous namespace

@dpaoliello dpaoliello changed the title Fix Arm64EC name mangling algorithm [llvm][aarch64] Fix Arm64EC name mangling algorithm Nov 8, 2024
// so look for a `@` instead (since we assume that it will not return a
// qualified type).
InsertIdx = TrimmedName.find_last_of('@');
assert(InsertIdx != StringRef::npos && "Invalid mangled name");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this assertion is likely fine for IR callers, in LLD we may encounter arbitrary user-provided input. Instead, we could simply return nullopt here. This would break the assumption that exactly one of the mangled or demangled functions always returns a value, so some adjustments will be needed in the callers. I’ve drafted the LLD change here: cjacek@ccb9503.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed all insertions

auto TrimmedName = Name.drop_back(4);

// The last `@@` is the separation between the qualified name of the symbol
// and its type, which is where we want to insert `$$h`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not confident this actually works. Consider, for example:

template<typename T> struct WW {   struct Z{}; };
template <typename X> struct Wrapper {
  int GetValue(typename WW<X>::Z) const;
};
struct A {
};
template<typename X> int Wrapper<X>::GetValue(typename WW<X>::Z) const { return 3; }
template class Wrapper<A>;

If we can't figure out a simple way to figure out the correct insertion point, it might end up being simpler to just compute the name in clang, using the real name mangler, and attach it using metadata, or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSVC generates the mangled name in the frontend, and that's certainly the most robust choice.

Unfortunately, LLD and import library generation have both taken a dependency on this function so we'd have to detangle that first.

The other option is to use a name demangler to parse the mangled name instead of this adhoc from-the-back parsing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

I'd guess parsing forward from the front is easier than parsing backwards, because that's the way mangling naturally works. And I'm not sure there are any good shortcuts here. Not sure how much code a "mini-demangler" that knows enough to parse the name of the function without actually printing the demangled name would be, vs. just pulling in the real demangler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out using the real demangler was fairly easy. Also added your new test case.

Copy link
Collaborator

@efriedma-quic efriedma-quic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dpaoliello dpaoliello merged commit fa0cf3d into llvm:main Nov 13, 2024
8 checks passed
@dpaoliello dpaoliello deleted the ecname branch November 13, 2024 23:35
@efriedma-quic
Copy link
Collaborator

Do you think it makes sense to cherry-pick this to 19.x?

@dpaoliello
Copy link
Contributor Author

Do you think it makes sense to cherry-pick this to 19.x?

Yeah, that's probably a good idea otherwise we'll be generating different names than MSVC

cjacek added a commit to cjacek/llvm-project that referenced this pull request Nov 14, 2024
This is a follow-up to llvm#115567. Emit an error for invalid function names, similar to MSVC's `lib.exe` behavior.

Returning an error from `writeImportLibrary` exposed bugs in error handling by its callers, which have been addressed in this patch.
cjacek added a commit to cjacek/llvm-project that referenced this pull request Nov 14, 2024
dpaoliello added a commit to dpaoliello/llvm-project that referenced this pull request Nov 14, 2024
Arm64EC uses a special name mangling mode that adds `$$h` between the
symbol name and its type. In MSVC's name mangling `@` is used to
separate the name and type BUT it is also used for other purposes, such
as the separator between paths in a fully qualified name.

The original algorithm was quite fragile and made assumptions that
didn't hold true for all MSVC mangled symbols, so instead of trying to
improve this algorithm we are now using the demangler to indicate where
the insertion point should be (i.e., to parse the fully-qualified name
and return the current string offset).

Also fixed `isArm64ECMangledFunctionName` to search for `@$$h` since the
`$$h` must always be after a `@`.

Fixes llvm#115231
tru pushed a commit to dpaoliello/llvm-project that referenced this pull request Nov 15, 2024
Arm64EC uses a special name mangling mode that adds `$$h` between the
symbol name and its type. In MSVC's name mangling `@` is used to
separate the name and type BUT it is also used for other purposes, such
as the separator between paths in a fully qualified name.

The original algorithm was quite fragile and made assumptions that
didn't hold true for all MSVC mangled symbols, so instead of trying to
improve this algorithm we are now using the demangler to indicate where
the insertion point should be (i.e., to parse the fully-qualified name
and return the current string offset).

Also fixed `isArm64ECMangledFunctionName` to search for `@$$h` since the
`$$h` must always be after a `@`.

Fixes llvm#115231
cjacek added a commit that referenced this pull request Nov 15, 2024
…es (#116250)

This is a follow-up to #115567. Emit an error for invalid function
names, similar to MSVC's `lib.exe` behavior.

Returning an error from `writeImportLibrary` exposed bugs in error
handling by its callers, which have been addressed in this patch.
cjacek added a commit to cjacek/llvm-project that referenced this pull request Nov 15, 2024
Since these symbols cannot be mangled or demangled, there is no symbol to check for conflicts
in checkLazyECPair, nor is there an alias to create in addUndefined. Attempting to create an
import library with such symbols results in an error; the patch includes a test to ensure the
error is handled correctly.

This is a follow-up to llvm#115567.
cjacek added a commit that referenced this pull request Nov 15, 2024
Since these symbols cannot be mangled or demangled, there is no symbol
to check for conflicts in `checkLazyECPair`, nor is there an alias to
create in `addUndefined`. Attempting to create an import library with
such symbols results in an error; the patch includes a test to ensure
the error is handled correctly.

This is a follow-up to #115567.
nikic pushed a commit to rust-lang/llvm-project that referenced this pull request Nov 20, 2024
Arm64EC uses a special name mangling mode that adds `$$h` between the
symbol name and its type. In MSVC's name mangling `@` is used to
separate the name and type BUT it is also used for other purposes, such
as the separator between paths in a fully qualified name.

The original algorithm was quite fragile and made assumptions that
didn't hold true for all MSVC mangled symbols, so instead of trying to
improve this algorithm we are now using the demangler to indicate where
the insertion point should be (i.e., to parse the fully-qualified name
and return the current string offset).

Also fixed `isArm64ECMangledFunctionName` to search for `@$$h` since the
`$$h` must always be after a `@`.

Fixes llvm#115231
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

arm64ec incorrect mangling instantiating C++ template
4 participants