[KernelInfo] Implement new LLVM IR pass for GPU code analysis #102944


Merged: 57 commits, Jan 29, 2025

Conversation

jdenny-ornl
Collaborator

@jdenny-ornl commented Aug 12, 2024

This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass runs at the end of LTO, and options like -Rpass=kernel-info enable its remarks. Example opt and clang command lines appear in llvm/docs/KernelInfo.rst. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in llvm/test/Analysis/KernelInfo.

@llvmbot
Member

llvmbot commented Aug 12, 2024

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-analysis

Author: Joel E. Denny (jdenny-ornl)

Changes

This patch implements an LLVM IR pass, named kernel-info, that reports various statistics for codes compiled for GPUs. The ultimate goal of these statistics is to help identify bad code patterns and ways to mitigate them. The pass operates at the LLVM IR level so that it can, in theory, support any LLVM-based compiler for programming languages supporting GPUs. It has been tested so far with LLVM IR generated by Clang for OpenMP offload codes targeting NVIDIA GPUs and AMD GPUs.

By default, the pass is disabled. For convenience, -kernel-info-end-lto inserts it at the end of LTO, and options like -Rpass=kernel-info enable its remarks. Example opt and clang command lines appear in comments in
llvm/include/llvm/Analysis/KernelInfo.h. Remarks include summary statistics (e.g., total size of static allocas) and individual occurrences (e.g., source location of each alloca). Examples of its output appear in tests in llvm/test/Analysis/KernelInfo.


Patch is 129.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/102944.diff

20 Files Affected:

  • (added) llvm/include/llvm/Analysis/KernelInfo.h (+148)
  • (modified) llvm/include/llvm/Target/TargetMachine.h (+3)
  • (modified) llvm/lib/Analysis/CMakeLists.txt (+1)
  • (added) llvm/lib/Analysis/KernelInfo.cpp (+350)
  • (modified) llvm/lib/Passes/PassBuilder.cpp (+1)
  • (modified) llvm/lib/Passes/PassRegistry.def (+2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp (+10)
  • (modified) llvm/lib/Target/TargetMachine.cpp (+5)
  • (added) llvm/test/Analysis/KernelInfo/addrspace0.ll (+152)
  • (added) llvm/test/Analysis/KernelInfo/allocas.ll (+78)
  • (added) llvm/test/Analysis/KernelInfo/calls.ll (+112)
  • (added) llvm/test/Analysis/KernelInfo/kernel-info-after-lto/amdgpu.ll (+47)
  • (added) llvm/test/Analysis/KernelInfo/kernel-info-after-lto/nvptx.ll (+47)
  • (added) llvm/test/Analysis/KernelInfo/launch-bounds/amdgpu.ll (+40)
  • (added) llvm/test/Analysis/KernelInfo/launch-bounds/nvptx.ll (+36)
  • (added) llvm/test/Analysis/KernelInfo/linkage.ll (+51)
  • (added) llvm/test/Analysis/KernelInfo/openmp/README.md (+40)
  • (added) llvm/test/Analysis/KernelInfo/openmp/amdgpu.ll (+217)
  • (added) llvm/test/Analysis/KernelInfo/openmp/nvptx.ll (+811)
diff --git a/llvm/include/llvm/Analysis/KernelInfo.h b/llvm/include/llvm/Analysis/KernelInfo.h
new file mode 100644
index 00000000000000..5495bb2fd4d925
--- /dev/null
+++ b/llvm/include/llvm/Analysis/KernelInfo.h
@@ -0,0 +1,148 @@
+//=- KernelInfo.h - Kernel Analysis -------------------------------*- C++ -*-=//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the KernelInfo, KernelInfoAnalysis, and KernelInfoPrinter
+// classes used to extract function properties from a GPU kernel.
+//
+// To analyze a C program as it appears to an LLVM GPU backend at the end of
+// LTO:
+//
+//   $ clang -O2 -g -fopenmp --offload-arch=native test.c -foffload-lto \
+//       -Rpass=kernel-info -mllvm -kernel-info-end-lto
+//
+// To analyze specified LLVM IR, perhaps previously generated by something like
+// 'clang -save-temps -g -fopenmp --offload-arch=native test.c':
+//
+//   $ opt -disable-output test-openmp-nvptx64-nvidia-cuda-sm_70.bc \
+//       -pass-remarks=kernel-info -passes=kernel-info
+//
+// kernel-info can also be inserted into a specified LLVM pass pipeline using
+// -kernel-info-end-lto, or it can be positioned explicitly in that pipeline:
+//
+//   $ clang -O2 -g -fopenmp --offload-arch=native test.c -foffload-lto \
+//       -Rpass=kernel-info -mllvm -kernel-info-end-lto \
+//       -Xoffload-linker --lto-newpm-passes='lto<O2>'
+//
+//   $ clang -O2 -g -fopenmp --offload-arch=native test.c -foffload-lto \
+//       -Rpass=kernel-info \
+//       -Xoffload-linker --lto-newpm-passes='lto<O2>,module(kernel-info)'
+//
+//   $ opt -disable-output test-openmp-nvptx64-nvidia-cuda-sm_70.bc \
+//       -pass-remarks=kernel-info -kernel-info-end-lto -passes='lto<O2>'
+//
+//   $ opt -disable-output test-openmp-nvptx64-nvidia-cuda-sm_70.bc \
+//       -pass-remarks=kernel-info -passes='lto<O2>,module(kernel-info)'
+// ===---------------------------------------------------------------------===//
+
+#ifndef LLVM_ANALYSIS_KERNELINFO_H
+#define LLVM_ANALYSIS_KERNELINFO_H
+
+#include "llvm/Analysis/OptimizationRemarkEmitter.h"
+
+namespace llvm {
+class DominatorTree;
+class Function;
+
+/// Data structure holding function info for kernels.
+class KernelInfo {
+  void updateForBB(const BasicBlock &BB, int64_t Direction,
+                   OptimizationRemarkEmitter &ORE);
+
+public:
+  static KernelInfo getKernelInfo(Function &F, FunctionAnalysisManager &FAM);
+
+  bool operator==(const KernelInfo &FPI) const {
+    return std::memcmp(this, &FPI, sizeof(KernelInfo)) == 0;
+  }
+
+  bool operator!=(const KernelInfo &FPI) const { return !(*this == FPI); }
+
+  /// If false, nothing was recorded here because the supplied function didn't
+  /// appear in a module compiled for a GPU.
+  bool IsValid = false;
+
+  /// Whether the function has external linkage and is not a kernel function.
+  bool ExternalNotKernel = false;
+
+  /// OpenMP Launch bounds.
+  ///@{
+  std::optional<int64_t> OmpTargetNumTeams;
+  std::optional<int64_t> OmpTargetThreadLimit;
+  ///@}
+
+  /// AMDGPU launch bounds.
+  ///@{
+  std::optional<int64_t> AmdgpuMaxNumWorkgroupsX;
+  std::optional<int64_t> AmdgpuMaxNumWorkgroupsY;
+  std::optional<int64_t> AmdgpuMaxNumWorkgroupsZ;
+  std::optional<int64_t> AmdgpuFlatWorkGroupSizeMin;
+  std::optional<int64_t> AmdgpuFlatWorkGroupSizeMax;
+  std::optional<int64_t> AmdgpuWavesPerEuMin;
+  std::optional<int64_t> AmdgpuWavesPerEuMax;
+  ///@}
+
+  /// NVPTX launch bounds.
+  ///@{
+  std::optional<int64_t> Maxclusterrank;
+  std::optional<int64_t> Maxntidx;
+  ///@}
+
+  /// The number of alloca instructions inside the function, the number of those
+  /// with allocation sizes that cannot be determined at compile time, and the
+  /// sum of the sizes that can be.
+  ///
+  /// With the current implementation for at least some GPU archs,
+  /// AllocasDyn > 0 might not be possible, but we report AllocasDyn anyway in
+  /// case the implementation changes.
+  int64_t Allocas = 0;
+  int64_t AllocasDyn = 0;
+  int64_t AllocasStaticSizeSum = 0;
+
+  /// Number of direct/indirect calls (anything derived from CallBase).
+  int64_t DirectCalls = 0;
+  int64_t IndirectCalls = 0;
+
+  /// Number of direct calls made from this function to other functions
+  /// defined in this module.
+  int64_t DirectCallsToDefinedFunctions = 0;
+
+  /// Number of calls of type InvokeInst.
+  int64_t Invokes = 0;
+
+  /// Number of addrspace(0) memory accesses (via load, store, etc.).
+  int64_t AddrspaceZeroAccesses = 0;
+};
+
+/// Analysis class for KernelInfo.
+class KernelInfoAnalysis : public AnalysisInfoMixin<KernelInfoAnalysis> {
+public:
+  static AnalysisKey Key;
+
+  using Result = const KernelInfo;
+
+  KernelInfo run(Function &F, FunctionAnalysisManager &FAM) {
+    return KernelInfo::getKernelInfo(F, FAM);
+  }
+};
+
+/// Printer pass for KernelInfoAnalysis.
+///
+/// It just calls KernelInfoAnalysis, which prints remarks if they are enabled.
+class KernelInfoPrinter : public PassInfoMixin<KernelInfoPrinter> {
+public:
+  explicit KernelInfoPrinter() {}
+
+  PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM) {
+    AM.getResult<KernelInfoAnalysis>(F);
+    return PreservedAnalyses::all();
+  }
+
+  static bool isRequired() { return true; }
+};
+} // namespace llvm
+#endif // LLVM_ANALYSIS_KERNELINFO_H
diff --git a/llvm/include/llvm/Target/TargetMachine.h b/llvm/include/llvm/Target/TargetMachine.h
index c3e9d41315f617..5c338a8fcd0cfb 100644
--- a/llvm/include/llvm/Target/TargetMachine.h
+++ b/llvm/include/llvm/Target/TargetMachine.h
@@ -18,6 +18,7 @@
 #include "llvm/IR/PassManager.h"
 #include "llvm/Support/Allocator.h"
 #include "llvm/Support/CodeGen.h"
+#include "llvm/Support/CommandLine.h"
 #include "llvm/Support/Error.h"
 #include "llvm/Support/PGOOptions.h"
 #include "llvm/Target/CGPassBuilderOption.h"
@@ -27,6 +28,8 @@
 #include <string>
 #include <utility>
 
+extern llvm::cl::opt<bool> KernelInfoEndLTO;
+
 namespace llvm {
 
 class AAManager;
diff --git a/llvm/lib/Analysis/CMakeLists.txt b/llvm/lib/Analysis/CMakeLists.txt
index 2cb3547ec40473..02e76af8d903de 100644
--- a/llvm/lib/Analysis/CMakeLists.txt
+++ b/llvm/lib/Analysis/CMakeLists.txt
@@ -78,6 +78,7 @@ add_llvm_component_library(LLVMAnalysis
   InstructionPrecedenceTracking.cpp
   InstructionSimplify.cpp
   InteractiveModelRunner.cpp
+  KernelInfo.cpp
   LazyBranchProbabilityInfo.cpp
   LazyBlockFrequencyInfo.cpp
   LazyCallGraph.cpp
diff --git a/llvm/lib/Analysis/KernelInfo.cpp b/llvm/lib/Analysis/KernelInfo.cpp
new file mode 100644
index 00000000000000..9df3b5b32afcb4
--- /dev/null
+++ b/llvm/lib/Analysis/KernelInfo.cpp
@@ -0,0 +1,350 @@
+//===- KernelInfo.cpp - Kernel Analysis -----------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines the KernelInfo, KernelInfoAnalysis, and KernelInfoPrinter
+// classes used to extract function properties from a kernel.
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/KernelInfo.h"
+#include "llvm/ADT/StringExtras.h"
+#include "llvm/Analysis/OptimizationRemarkEmitter.h"
+#include "llvm/IR/DebugInfo.h"
+#include "llvm/IR/Dominators.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Metadata.h"
+#include "llvm/IR/Module.h"
+#include "llvm/IR/PassManager.h"
+#include "llvm/Passes/PassBuilder.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "kernel-info"
+
+static bool isKernelFunction(Function &F) {
+  // TODO: Is this general enough?  Consider languages beyond OpenMP.
+  return F.hasFnAttribute("kernel");
+}
+
+static void identifyFunction(OptimizationRemark &R, const Function &F) {
+  if (auto *SubProgram = F.getSubprogram()) {
+    if (SubProgram->isArtificial())
+      R << "artificial ";
+  }
+  R << "function '" << F.getName() << "'";
+}
+
+static void remarkAlloca(OptimizationRemarkEmitter &ORE, const Function &Caller,
+                         const AllocaInst &Alloca,
+                         TypeSize::ScalarTy StaticSize) {
+  ORE.emit([&] {
+    StringRef Name;
+    DebugLoc Loc;
+    bool Artificial = false;
+    auto DVRs = findDVRDeclares(&const_cast<AllocaInst &>(Alloca));
+    if (!DVRs.empty()) {
+      const DbgVariableRecord &DVR = **DVRs.begin();
+      Name = DVR.getVariable()->getName();
+      Loc = DVR.getDebugLoc();
+    Artificial = DVR.getVariable()->isArtificial();
+    }
+    OptimizationRemark R(DEBUG_TYPE, "Alloca", DiagnosticLocation(Loc),
+                         Alloca.getParent());
+    R << "in ";
+    identifyFunction(R, Caller);
+    R << ", ";
+    if (Artificial)
+      R << "artificial ";
+    if (Name.empty()) {
+      R << "unnamed alloca ";
+      if (DVRs.empty())
+        R << "(missing debug metadata) ";
+    } else {
+      R << "alloca '" << Name << "' ";
+    }
+    R << "with ";
+    if (StaticSize)
+      R << "static size of " << itostr(StaticSize) << " bytes";
+    else
+      R << "dynamic size";
+    return R;
+  });
+}
+
+static void remarkCall(OptimizationRemarkEmitter &ORE, const Function &Caller,
+                       const CallBase &Call, StringRef CallKind,
+                       StringRef RemarkKind) {
+  ORE.emit([&] {
+    OptimizationRemark R(DEBUG_TYPE, RemarkKind, &Call);
+    R << "in ";
+    identifyFunction(R, Caller);
+    R << ", " << CallKind;
+    if (const Function *Callee =
+            dyn_cast_or_null<Function>(Call.getCalledOperand())) {
+      R << ", callee is";
+      StringRef Name = Callee->getName();
+      if (auto *SubProgram = Callee->getSubprogram()) {
+        if (SubProgram->isArtificial())
+          R << " artificial";
+      }
+      if (!Name.empty())
+        R << " '" << Name << "'";
+      else
+        R << " with unknown name";
+    }
+    return R;
+  });
+}
+
+static void remarkAddrspaceZeroAccess(OptimizationRemarkEmitter &ORE,
+                                      const Function &Caller,
+                                      const Instruction &Inst) {
+  ORE.emit([&] {
+    OptimizationRemark R(DEBUG_TYPE, "AddrspaceZeroAccess", &Inst);
+    R << "in ";
+    identifyFunction(R, Caller);
+    if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(&Inst)) {
+      R << ", '" << II->getCalledFunction()->getName() << "' call";
+    } else {
+      R << ", '" << Inst.getOpcodeName() << "' instruction";
+    }
+    if (Inst.hasName())
+      R << " ('%" << Inst.getName() << "')";
+    R << " accesses memory in addrspace(0)";
+    return R;
+  });
+}
+
+void KernelInfo::updateForBB(const BasicBlock &BB, int64_t Direction,
+                             OptimizationRemarkEmitter &ORE) {
+  assert(Direction == 1 || Direction == -1);
+  const Function &F = *BB.getParent();
+  const Module &M = *F.getParent();
+  const DataLayout &DL = M.getDataLayout();
+  for (const Instruction &I : BB.instructionsWithoutDebug()) {
+    if (const AllocaInst *Alloca = dyn_cast<AllocaInst>(&I)) {
+      Allocas += Direction;
+      TypeSize::ScalarTy StaticSize = 0;
+      if (std::optional<TypeSize> Size = Alloca->getAllocationSize(DL)) {
+        StaticSize = Size->getFixedValue();
+        assert(StaticSize <= std::numeric_limits<int64_t>::max());
+        AllocasStaticSizeSum += Direction * StaticSize;
+      } else {
+        AllocasDyn += Direction;
+      }
+      remarkAlloca(ORE, F, *Alloca, StaticSize);
+    } else if (const CallBase *Call = dyn_cast<CallBase>(&I)) {
+      std::string CallKind;
+      std::string RemarkKind;
+      if (Call->isIndirectCall()) {
+        IndirectCalls += Direction;
+        CallKind += "indirect";
+        RemarkKind += "Indirect";
+      } else {
+        DirectCalls += Direction;
+        CallKind += "direct";
+        RemarkKind += "Direct";
+      }
+      if (isa<InvokeInst>(Call)) {
+        Invokes += Direction;
+        CallKind += " invoke";
+        RemarkKind += "Invoke";
+      } else {
+        CallKind += " call";
+        RemarkKind += "Call";
+      }
+      if (!Call->isIndirectCall()) {
+        if (const Function *Callee = Call->getCalledFunction()) {
+          if (Callee && !Callee->isIntrinsic() && !Callee->isDeclaration()) {
+            DirectCallsToDefinedFunctions += Direction;
+            CallKind += " to defined function";
+            RemarkKind += "ToDefinedFunction";
+          }
+        }
+      }
+      remarkCall(ORE, F, *Call, CallKind, RemarkKind);
+      if (const AnyMemIntrinsic *MI = dyn_cast<AnyMemIntrinsic>(Call)) {
+        if (MI->getDestAddressSpace() == 0) {
+          AddrspaceZeroAccesses += Direction;
+          remarkAddrspaceZeroAccess(ORE, F, I);
+        } else if (const AnyMemTransferInst *MT =
+                       dyn_cast<AnyMemTransferInst>(MI)) {
+          if (MT->getSourceAddressSpace() == 0) {
+            AddrspaceZeroAccesses += Direction;
+            remarkAddrspaceZeroAccess(ORE, F, I);
+          }
+        }
+      }
+    } else if (const LoadInst *Load = dyn_cast<LoadInst>(&I)) {
+      if (Load->getPointerAddressSpace() == 0) {
+        AddrspaceZeroAccesses += Direction;
+        remarkAddrspaceZeroAccess(ORE, F, I);
+      }
+    } else if (const StoreInst *Store = dyn_cast<StoreInst>(&I)) {
+      if (Store->getPointerAddressSpace() == 0) {
+        AddrspaceZeroAccesses += Direction;
+        remarkAddrspaceZeroAccess(ORE, F, I);
+      }
+    } else if (const AtomicRMWInst *At = dyn_cast<AtomicRMWInst>(&I)) {
+      if (At->getPointerAddressSpace() == 0) {
+        AddrspaceZeroAccesses += Direction;
+        remarkAddrspaceZeroAccess(ORE, F, I);
+      }
+    } else if (const AtomicCmpXchgInst *At = dyn_cast<AtomicCmpXchgInst>(&I)) {
+      if (At->getPointerAddressSpace() == 0) {
+        AddrspaceZeroAccesses += Direction;
+        remarkAddrspaceZeroAccess(ORE, F, I);
+      }
+    }
+  }
+}
+
+static void remarkProperty(OptimizationRemarkEmitter &ORE, const Function &F,
+                           StringRef Name, int64_t Value) {
+  ORE.emit([&] {
+    OptimizationRemark R(DEBUG_TYPE, Name, &F);
+    R << "in ";
+    identifyFunction(R, F);
+    R << ", " << Name << " = " << itostr(Value);
+    return R;
+  });
+}
+
+static void remarkProperty(OptimizationRemarkEmitter &ORE, const Function &F,
+                           StringRef Name, std::optional<int64_t> Value) {
+  if (!Value)
+    return;
+  remarkProperty(ORE, F, Name, Value.value());
+}
+
+static std::vector<std::optional<int64_t>>
+parseFnAttrAsIntegerFields(Function &F, StringRef Name, unsigned NumFields) {
+  std::vector<std::optional<int64_t>> Result(NumFields);
+  Attribute A = F.getFnAttribute(Name);
+  if (!A.isStringAttribute())
+    return Result;
+  StringRef Rest = A.getValueAsString();
+  for (unsigned I = 0; I < NumFields; ++I) {
+    StringRef Field;
+    std::tie(Field, Rest) = Rest.split(',');
+    if (Field.empty())
+      break;
+    int64_t Val;
+    if (Field.getAsInteger(0, Val)) {
+      F.getContext().emitError("cannot parse integer in attribute '" + Name +
+                               "': " + Field);
+      break;
+    }
+    Result[I] = Val;
+  }
+  if (!Rest.empty())
+    F.getContext().emitError("too many fields in attribute " + Name);
+  return Result;
+}
+
+static std::optional<int64_t> parseFnAttrAsInteger(Function &F,
+                                                   StringRef Name) {
+  return parseFnAttrAsIntegerFields(F, Name, 1)[0];
+}
+
+// TODO: This nearly duplicates the same function in OMPIRBuilder.cpp.  Can we
+// share?
+static MDNode *getNVPTXMDNode(Function &F, StringRef Name) {
+  Module &M = *F.getParent();
+  NamedMDNode *MD = M.getNamedMetadata("nvvm.annotations");
+  if (!MD)
+    return nullptr;
+  for (auto *Op : MD->operands()) {
+    if (Op->getNumOperands() != 3)
+      continue;
+    auto *KernelOp = dyn_cast<ConstantAsMetadata>(Op->getOperand(0));
+    if (!KernelOp || KernelOp->getValue() != &F)
+      continue;
+    auto *Prop = dyn_cast<MDString>(Op->getOperand(1));
+    if (!Prop || Prop->getString() != Name)
+      continue;
+    return Op;
+  }
+  return nullptr;
+}
+
+static std::optional<int64_t> parseNVPTXMDNodeAsInteger(Function &F,
+                                                        StringRef Name) {
+  std::optional<int64_t> Result;
+  if (MDNode *ExistingOp = getNVPTXMDNode(F, Name)) {
+    auto *Op = cast<ConstantAsMetadata>(ExistingOp->getOperand(2));
+    Result = cast<ConstantInt>(Op->getValue())->getZExtValue();
+  }
+  return Result;
+}
+
+KernelInfo KernelInfo::getKernelInfo(Function &F,
+                                     FunctionAnalysisManager &FAM) {
+  KernelInfo KI;
+  // Only analyze modules for GPUs.
+  // TODO: This would be more maintainable if there were an isGPU.
+  const std::string &TT = F.getParent()->getTargetTriple();
+  llvm::Triple T(TT);
+  if (!T.isAMDGPU() && !T.isNVPTX())
+    return KI;
+  KI.IsValid = true;
+
+  // Record function properties.
+  KI.ExternalNotKernel = F.hasExternalLinkage() && !isKernelFunction(F);
+  KI.OmpTargetNumTeams = parseFnAttrAsInteger(F, "omp_target_num_teams");
+  KI.OmpTargetThreadLimit = parseFnAttrAsInteger(F, "omp_target_thread_limit");
+  auto AmdgpuMaxNumWorkgroups =
+      parseFnAttrAsIntegerFields(F, "amdgpu-max-num-workgroups", 3);
+  KI.AmdgpuMaxNumWorkgroupsX = AmdgpuMaxNumWorkgroups[0];
+  KI.AmdgpuMaxNumWorkgroupsY = AmdgpuMaxNumWorkgroups[1];
+  KI.AmdgpuMaxNumWorkgroupsZ = AmdgpuMaxNumWorkgroups[2];
+  auto AmdgpuFlatWorkGroupSize =
+      parseFnAttrAsIntegerFields(F, "amdgpu-flat-work-group-size", 2);
+  KI.AmdgpuFlatWorkGroupSizeMin = AmdgpuFlatWorkGroupSize[0];
+  KI.AmdgpuFlatWorkGroupSizeMax = AmdgpuFlatWorkGroupSize[1];
+  auto AmdgpuWavesPerEu =
+      parseFnAttrAsIntegerFields(F, "amdgpu-waves-per-eu", 2);
+  KI.AmdgpuWavesPerEuMin = AmdgpuWavesPerEu[0];
+  KI.AmdgpuWavesPerEuMax = AmdgpuWavesPerEu[1];
+  KI.Maxclusterrank = parseNVPTXMDNodeAsInteger(F, "maxclusterrank");
+  KI.Maxntidx = parseNVPTXMDNodeAsInteger(F, "maxntidx");
+
+  const DominatorTree &DT = FAM.getResult<DominatorTreeAnalysis>(F);
+  auto &ORE = FAM.getResult<OptimizationRemarkEmitterAnalysis>(F);
+  for (const auto &BB : F)
+    if (DT.isReachableFromEntry(&BB))
+      KI.updateForBB(BB, +1, ORE);
+
+#define REMARK_PROPERTY(PROP_NAME)                                             \
+  remarkProperty(ORE, F, #PROP_NAME, KI.PROP_NAME)
+  REMARK_PROPERTY(ExternalNotKernel);
+  REMARK_PROPERTY(OmpTargetNumTeams);
+  REMARK_PROPERTY(OmpTargetThreadLimit);
+  REMARK_PROPERTY(AmdgpuMaxNumWorkgroupsX);
+  REMARK_PROPERTY(AmdgpuMaxNumWorkgroupsY);
+  REMARK_PROPERTY(AmdgpuMaxNumWorkgroupsZ);
+  REMARK_PROPERTY(AmdgpuFlatWorkGroupSizeMin);
+  REMARK_PROPERTY(AmdgpuFlatWorkGroupSizeMax);
+  REMARK_PROPERTY(AmdgpuWavesPerEuMin);
+  REMARK_PROPERTY(AmdgpuWavesPerEuMax);
+  REMARK_PROPERTY(Maxclusterrank);
+  REMARK_PROPERTY(Maxntidx);
+  REMARK_PROPERTY(Allocas);
+  REMARK_PROPERTY(AllocasStaticSizeSum);
+  REMARK_PROPERTY(AllocasDyn);
+  REMARK_PROPERTY(DirectCalls);
+  REMARK_PROPERTY(IndirectCalls);
+  REMARK_PROPERTY(DirectCallsToDefinedFunctions);
+  REMARK_PROPERTY(Invokes);
+  REMARK_PROPERTY(AddrspaceZeroAccesses);
+#undef REMARK_PROPERTY
+
+  return KI;
+}
+
+AnalysisKey KernelInfoAnalysis::Key;
diff --git a/llvm/lib/Passes/PassBuilder.cpp b/llvm/lib/Passes/PassBuilder.cpp
index 46f43f3de4705c..61677f02783cc9 100644
--- a/llvm/lib/Passes/PassBuilder.cpp
+++ b/llvm/lib/Passes/PassBuilder.cpp
@@ -44,6 +44,7 @@
 #include "llvm/Analysis/InlineAdvisor.h"
 #include "llvm/Analysis/InlineSizeEstimatorAnalysis.h"
 #include "llvm/Analysis/InstCount.h"
+#include "llvm/Analysis/KernelInfo.h"
 #include "llvm/Analysis/LazyCallGraph.h"
 #include "llvm/Analysis/LazyValueInfo.h"
 #include "llvm/Analysis/Lint.h"
diff --git a/llvm/lib/Passes/PassRegistry.def b/llvm/lib/Passes/PassRegistry.def
index 0cec9fbd7cd05e..dcfa732f410b38 100644
--- a/llvm/lib/Passes/PassRegistry.def
+++ b/llvm/lib/Passes/PassRegistry.def
@@ -278,6 +278,7 @@ FUNCTION_ANALYSIS(
     MachineFunctionAnalysis(static_ca...
[truncated]

@shiltian requested a review from arsenm, August 12, 2024 17:59
@tschuett

Can you put an rst file in https://github.com/llvm/llvm-project/tree/main/llvm/docs instead of hiding everything in a header?

@shiltian
Contributor

My general question is that the kernel info looks highly "target" dependent. I'm not sure it is a good idea to have it all combined together in a target "independent" manner.

@jdoerfert
Member

My general question is that the kernel info looks highly "target" dependent. I'm not sure it is a good idea to have it all combined together in a target "independent" manner.

What's the alternative? 2-3 passes that do basically the same thing (e.g., info about allocas, calls, etc.)?

@shiltian
Contributor

An alternative is to have a base class with the common info and then have the target-dependent parts in subclasses.

Comment on lines 310 to 311
auto AmdgpuWavesPerEu =
parseFnAttrAsIntegerFields(F, "amdgpu-waves-per-eu", 2);
Contributor

Don't we already report information on this in the pass remarks in the AsmPrinter?

Member

The AMDGPU remarks emit 0 for values of non-kernel functions, IIRC. Having a single location for all relevant remarks is also helpful. Only post-instruction selection & reg alloc stuff should be printed per-target (=for AMDGPU) late.

Contributor

That's just a bug though

@jdoerfert
Member

An alternative is to have a base class with the common info and then have the target-dependent parts in subclasses.

I don't see how that buys us much and I feel there is a cost to it (more files, more places, more boilerplate). What do we gain by splitting it up? In the existing code we can/should just print target dependent features (like waves-per-eu) only for AMDGPU and call it a day, no?

We have to be more careful about targets in the test suite now because
`getFlatAddressSpace` returns garbage for unsupported targets.

Should we change the remarks to say flat addrspace instead of
addrspace(0)?
@jdenny-ornl
Collaborator Author

Can you put an rst file in https://github.com/llvm/llvm-project/tree/main/llvm/docs instead of hiding everything in a header?

It's in llvm/docs/KernelInfo.rst. Any suggestions on what other docs should link to it?

-kernel-info-end-lto doesn't insert kernel-info for cpu modules.  If
the user explicitly specifies the pass for a cpu module, then it will
run now.

define void @h() !dbg !3 {
entry:
; CHECK: remark: test.c:0:0: in artificial function 'h', artificial alloca 'dyn_ptr' with static size of 8 bytes
Contributor

Not sure what's artificial about it

Collaborator Author

This test makes sure that functions marked as artificial in metadata are reported that way. It's been a while, but my recollection is I started with IR generated by clang, ran it through llvm-reduce, and then further reduced it by hand.

What are you objecting to? Do you want the test function to look more like an artificial function in some way? Do you want artificial functions not to be called out by kernel-info?

; DEFINE: %{fcheck-on} = FileCheck -match-full-lines %S/Inputs/test.ll
; DEFINE: %{fcheck-off} = FileCheck -allow-empty -check-prefixes=NONE \
; DEFINE: %S/Inputs/test.ll

Contributor

Test the bounds attributes?

Collaborator Author

I don't know what you're asking. This test checks whether kernel-info is enabled by looking for its output, and it arbitrarily chooses omp_target_num_teams as the output to look for. Do you want more bounds checking? Do you want something else instead?


github-actions bot commented Dec 27, 2024

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 7b1becd940cb93f8b63c9872e1af7431dea353d1 1f1ca6cfc17def5595adb71ea00282154d803fc7 llvm/include/llvm/Analysis/KernelInfo.h llvm/lib/Analysis/KernelInfo.cpp llvm/test/Analysis/KernelInfo/allocas.ll llvm/test/Analysis/KernelInfo/calls.ll llvm/test/Analysis/KernelInfo/enable-kernel-info/Inputs/test.ll llvm/test/Analysis/KernelInfo/flat-addrspace/Inputs/test.ll llvm/test/Analysis/KernelInfo/launch-bounds/amdgpu.ll llvm/test/Analysis/KernelInfo/launch-bounds/nvptx.ll llvm/test/Analysis/KernelInfo/linkage.ll llvm/test/Analysis/KernelInfo/openmp/amdgpu.ll llvm/test/Analysis/KernelInfo/openmp/nvptx.ll llvm/include/llvm/Analysis/TargetTransformInfo.h llvm/include/llvm/Analysis/TargetTransformInfoImpl.h llvm/include/llvm/IR/Function.h llvm/include/llvm/Target/TargetMachine.h llvm/lib/Analysis/TargetTransformInfo.cpp llvm/lib/Passes/PassBuilder.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.cpp llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h llvm/lib/Target/TargetMachine.cpp llvm/lib/Transforms/IPO/OpenMPOpt.cpp

The following files introduce new uses of undef:

  • llvm/test/Analysis/KernelInfo/openmp/nvptx.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

define void @fn() {
  ...
  br i1 undef, ...
}

Please use the following instead:

define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}

Please refer to the Undefined Behavior Manual for more information.

See llvm/test/Analysis/KernelInfo/openmp/README.md.
@jdenny-ornl (Collaborator Author) commented Dec 27, 2024

The following files introduce new uses of undef:

* llvm/test/Analysis/KernelInfo/openmp/nvptx.ll

When clang stops generating those uses, we can update the tests.

@jdoerfert (Member) left a comment:

Two minor issues and then we can merge this.

@@ -322,6 +322,32 @@ void Module::eraseNamedMetadata(NamedMDNode *NMD) {
eraseNamedMDNode(NMD);
}

SetVector<Function *> Module::getDeviceKernels() {
// TODO: Create a more cross-platform way of determining device kernels.
NamedMDNode *MD = getNamedMetadata("nvvm.annotations");
Member:

I'm working on deprecating this method of specifying a kernel in favor of the ptx_kernel calling convention (see #120806). I think it would be good to check the calling convention of all functions in the module and add them to the set if they have ptx_kernel or amdgpu_kernel (maybe others too?).

Collaborator Author:

Thanks for letting me know. Now that your PR #122320 has landed, I've updated this PR to use it.

I see that llvm::omp::getDeviceKernels checks for the kernel calling convention but also still kernel in nvvm.annotations. Should KernelInfo check for the latter as well? My update only checks for the former.

Member:

#119261 will auto-upgrade nvvm.annotations allowing you to only check the CC. Currently you'll still need to check both but hopefully soon only CC will be needed.

Collaborator Author:

When is that needed? Just for previously generated LLVM IR? Clang at today's main seems to generate the calling conventions already.

Member:

Yeah, to maintain backwards compatibility with older IR or out-of-tree front-ends.

Collaborator Author:

Makes sense. But for KernelInfo, I'm not sure if it's worth it. We could probably live with that small shortcoming until PR #119261 lands. @jdoerfert?

Member:

Backwards compatibility with old GPU IR is not much of a thing. This should be fine.

This reverts commit bb9d5c2.

This will facilitate merging main due to 07ed818 (PR llvm#122320), which
changes llvm::omp::getDeviceKernels.  Will rewrite and reapply after
merging main.
Also, regenerate OpenMP tests from current clang so they see the new
kernel calling conventions.
@jdoerfert (Member) left a comment:

LG, I think.

@jdenny-ornl jdenny-ornl merged commit 18f8106 into llvm:main Jan 29, 2025
8 of 9 checks passed
@llvm-ci (Collaborator) commented Jan 29, 2025

LLVM Buildbot has detected a new failure on builder flang-aarch64-dylib running on linaro-flang-aarch64-dylib while building llvm at step 5 "build-unified-tree".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/50/builds/9588

Here is the relevant piece of the build log for reference:
Step 5 (build-unified-tree) failure: build (failure)
...
1495.210 [995/1/5830] Linking CXX static library lib/libMLIRSparseTensorPipelines.a
1496.198 [994/1/5831] Linking CXX shared library lib/libmlir_runner_utils.so.21.0git
1496.218 [993/1/5832] Creating library symlink lib/libmlir_runner_utils.so
1496.363 [992/1/5833] Building CXX object tools/mlir/lib/CAPI/Dialect/CMakeFiles/obj.MLIRCAPIAMDGPU.dir/AMDGPU.cpp.o
1496.510 [991/1/5834] Linking CXX static library lib/libMLIRCAPIDebug.a
1496.620 [990/1/5835] Building CXX object tools/mlir/lib/CAPI/Dialect/CMakeFiles/obj.MLIRCAPIMath.dir/Math.cpp.o
1496.748 [989/1/5836] Building CXX object tools/mlir/lib/CAPI/Dialect/CMakeFiles/obj.MLIRCAPIEmitC.dir/EmitC.cpp.o
1496.895 [988/1/5837] Building CXX object tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestMatchers.cpp.o
1497.445 [987/1/5838] Building CXX object tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestLazyLoading.cpp.o
1520.288 [986/1/5839] Building CXX object tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestBytecodeRoundtrip.cpp.o
FAILED: tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestBytecodeRoundtrip.cpp.o 
/usr/local/bin/c++ -DGTEST_HAS_RTTI=0 -DMLIR_INCLUDE_TESTS -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/mlir/test/lib/IR -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/test/lib/IR -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/mlir/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/llvm/include -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/test/lib/IR/../Dialect/Test -I/home/tcwg-buildbot/worker/flang-aarch64-dylib/build/tools/mlir/test/lib/IR/../Dialect/Test -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -Wundef -Werror=mismatched-tags -O3 -DNDEBUG -std=c++17  -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -MD -MT tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestBytecodeRoundtrip.cpp.o -MF tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestBytecodeRoundtrip.cpp.o.d -o tools/mlir/test/lib/IR/CMakeFiles/MLIRTestIR.dir/TestBytecodeRoundtrip.cpp.o -c /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/test/lib/IR/TestBytecodeRoundtrip.cpp
In file included from /home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/test/lib/IR/TestBytecodeRoundtrip.cpp:10:
/home/tcwg-buildbot/worker/flang-aarch64-dylib/llvm-project/mlir/test/lib/IR/../Dialect/Test/TestOps.h:148:10: fatal error: 'TestOps.h.inc' file not found
  148 | #include "TestOps.h.inc"
      |          ^~~~~~~~~~~~~~~
1 error generated.
ninja: build stopped: subcommand failed.

@slackito (Collaborator):

I was trying to fix the Bazel build after this change, and it seems it introduces a couple of circular dependencies. Analysis/KernelInfo.cpp includes Passes/PassBuilder.h and Target/TargetMachine.h, but both Passes and Target already depend on Analysis.

I think the circular dependency can be broken by removing the PassBuilder.h include, which seems to be unused, and replacing the TargetMachine include with a forward decl. Something like this:

diff --git a/llvm/lib/Analysis/KernelInfo.cpp b/llvm/lib/Analysis/KernelInfo.cpp
index be9b1136321f..5175614645bf 100644
--- a/llvm/lib/Analysis/KernelInfo.cpp
+++ b/llvm/lib/Analysis/KernelInfo.cpp
@@ -15,17 +15,20 @@
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
+#include "llvm/Analysis/TargetTransformInfo.h"
 #include "llvm/IR/DebugInfo.h"
 #include "llvm/IR/Dominators.h"
 #include "llvm/IR/Instructions.h"
 #include "llvm/IR/Metadata.h"
 #include "llvm/IR/Module.h"
 #include "llvm/IR/PassManager.h"
-#include "llvm/Passes/PassBuilder.h"
-#include "llvm/Target/TargetMachine.h"
 
 using namespace llvm;
 
+namespace llvm {
+class TargetMachine;
+}
+
 #define DEBUG_TYPE "kernel-info"
 
 namespace {

@slackito (Collaborator):

Never mind, it was already fixed by 953354c and 57f1731.

@jdenny-ornl (Collaborator Author):

> Never mind, it was already fixed by 953354c and 57f1731.

Thanks for reporting that back here. Sorry for the trouble.
