- 
                Notifications
    You must be signed in to change notification settings 
- Fork 15k
[AMDGPU] Graph-based Module Splitting Rewrite #104763
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Major rewrite of the AMDGPUSplitModule pass. Highlights: - Removal of the "SML" logging system in favor of just using CL options, LLVM_DEBUG, etc. - Graph-based module representation with DOTGraph printing support - No more defaulting to "P0" for external calls. - Graph-search algorithm that can explore multiple branches/assignments for a cluster of functions, up to a maximum depth.
| @llvm/pr-subscribers-backend-amdgpu Author: Pierre van Houtryve (Pierre-vh) ChangesMajor rewrite of the AMDGPUSplitModule pass in order to better support it long-term. Highlights: 
 All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other type of relations between nodes, etc. Patch is 106.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/104763.diff 17 Files Affected: 
 diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
index df084cf41c4783..6d755a7c61d4f6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSplitModule.cpp
@@ -7,33 +7,36 @@
 //===----------------------------------------------------------------------===//
 //
 /// \file Implements a module splitting algorithm designed to support the
-/// FullLTO --lto-partitions option for parallel codegen. This is completely
-/// different from the common SplitModule pass, as this system is designed with
-/// AMDGPU in mind.
+/// FullLTO --lto-partitions option for parallel codegen.
 ///
-/// The basic idea of this module splitting implementation is the same as
-/// SplitModule: load-balance the module's functions across a set of N
-/// partitions to allow parallel codegen. However, it does it very
-/// differently than the target-agnostic variant:
-///   - The module has "split roots", which are kernels in the vast
-//      majority of cases.
-///   - Each root has a set of dependencies, and when a root and its
-///     dependencies is considered "big", we try to put it in a partition where
-///     most dependencies are already imported, to avoid duplicating large
-///     amounts of code.
-///   - There's special care for indirect calls in order to ensure
-///     AMDGPUResourceUsageAnalysis can work correctly.
+/// The role of this module splitting pass is the same as
+/// lib/Transforms/Utils/SplitModule.cpp: load-balance the module's functions
+/// across a set of N partitions to allow for parallel codegen.
 ///
-/// This file also includes a more elaborate logging system to enable
-/// users to easily generate logs that (if desired) do not include any value
-/// names, in order to not leak information about the source file.
-/// Such logs are very helpful to understand and fix potential issues with
-/// module splitting.
+/// The similarities mostly end here, as this pass achieves load-balancing in a
+/// more elaborate fashion which is targeted towards AMDGPU modules. It can take
+/// advantage of the structure of AMDGPU modules (which are mostly
+/// self-contained) to allow for more efficient splitting without affecting
+/// codegen negatively, or causing innaccurate resource usage analysis.
+///
+/// High-level pass overview:
+///   - SplitGraph & associated classes
+///      - Graph representation of the module and of the dependencies that
+///      matter for splitting.
+///   - RecursiveSearchSplitting
+///     - Core splitting algorithm.
+///   - SplitProposal
+///     - Represents a suggested solution for splitting the input module. These
+///     solutions can be scored to determine the best one when multiple
+///     solutions are available.
+///   - Driver/pass "run" function glues everything together.
 
 #include "AMDGPUSplitModule.h"
 #include "AMDGPUTargetMachine.h"
 #include "Utils/AMDGPUBaseInfo.h"
 #include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/EquivalenceClasses.h"
+#include "llvm/ADT/GraphTraits.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringRef.h"
@@ -44,44 +47,56 @@
 #include "llvm/IR/Module.h"
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
+#include "llvm/Support/Allocator.h"
 #include "llvm/Support/Casting.h"
+#include "llvm/Support/DOTGraphTraits.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Support/FileSystem.h"
+#include "llvm/Support/GraphWriter.h"
 #include "llvm/Support/Path.h"
-#include "llvm/Support/Process.h"
-#include "llvm/Support/SHA256.h"
-#include "llvm/Support/Threading.h"
+#include "llvm/Support/Timer.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include <algorithm>
 #include <cassert>
+#include <cmath>
 #include <iterator>
 #include <memory>
 #include <utility>
 #include <vector>
 
-using namespace llvm;
+#ifndef NDEBUG
+#include "llvm/Support/LockFileManager.h"
+#endif
 
 #define DEBUG_TYPE "amdgpu-split-module"
 
+namespace llvm {
 namespace {
 
+static cl::opt<unsigned> MaxDepth(
+    "amdgpu-module-splitting-max-depth",
+    cl::desc(
+        "maximum search depth. 0 forces a greedy approach. "
+        "warning: the algorithm is up to O(2^N), where N is the max depth."),
+    cl::init(8));
+
 static cl::opt<float> LargeFnFactor(
-    "amdgpu-module-splitting-large-function-threshold", cl::init(2.0f),
-    cl::Hidden,
+    "amdgpu-module-splitting-large-threshold", cl::init(2.0f), cl::Hidden,
     cl::desc(
-        "consider a function as large and needing special treatment when the "
-        "cost of importing it into a partition"
-        "exceeds the average cost of a partition by this factor; e;g. 2.0 "
-        "means if the function and its dependencies is 2 times bigger than "
-        "an average partition; 0 disables large functions handling entirely"));
+        "when max depth is reached and we can no longer branch out, this "
+        "value determines if a function is worth merging into an already "
+        "existing partition to reduce code duplication. This is a factor "
+        "of the ideal partition size, e.g. 2.0 means we consider the "
+        "function for merging if its cost (including its callees) is 2x the "
+        "size of an ideal partition."));
 
 static cl::opt<float> LargeFnOverlapForMerge(
-    "amdgpu-module-splitting-large-function-merge-overlap", cl::init(0.8f),
-    cl::Hidden,
-    cl::desc(
-        "defines how much overlap between two large function's dependencies "
-        "is needed to put them in the same partition"));
+    "amdgpu-module-splitting-merge-threshold", cl::init(0.7f), cl::Hidden,
+    cl::desc("when a function is considered for merging into a partition that "
+             "already contains some of its callees, do the merge if at least "
+             "n% of the code it can reach is already present inside the "
+             "partition; e.g. 0.7 means only merge >70%"));
 
 static cl::opt<bool> NoExternalizeGlobals(
     "amdgpu-module-splitting-no-externalize-globals", cl::Hidden,
@@ -89,142 +104,95 @@ static cl::opt<bool> NoExternalizeGlobals(
              "may cause globals to be duplicated which increases binary size"));
 
 static cl::opt<std::string>
-    LogDirOpt("amdgpu-module-splitting-log-dir", cl::Hidden,
-              cl::desc("output directory for AMDGPU module splitting logs"));
+    ModuleDotCfgOutput("amdgpu-module-splitting-print-module-dotcfg",
+                       cl::Hidden,
+                       cl::desc("output file to write out the dotgraph "
+                                "representation of the input module"));
+
+static cl::opt<std::string> PartitionSummariesOutput(
+    "amdgpu-module-splitting-print-partition-summaries", cl::Hidden,
+    cl::desc("output file to write out a summary of "
+             "the partitions created for each module"));
+
+#ifndef NDEBUG
+static cl::opt<bool> TimeBuild("amdgpu-module-splitting-time-trace", cl::Hidden,
+                               cl::desc("enable and print timers"));
+
+static cl::opt<bool>
+    UseLockFile("amdgpu-module-splitting-serial-execution", cl::Hidden,
+                cl::desc("use a lock file so only one process in the system "
+                         "can run this pass at once. useful to avoid mangled "
+                         "debug output in multithreaded environments."));
 
 static cl::opt<bool>
-    LogPrivate("amdgpu-module-splitting-log-private", cl::Hidden,
-               cl::desc("hash value names before printing them in the AMDGPU "
-                        "module splitting logs"));
+    DebugProposalSearch("amdgpu-module-splitting-debug-proposal-search",
+                        cl::Hidden,
+                        cl::desc("print all proposals received and whether "
+                                 "they were rejected or accepted"));
+
+struct SplitModuleTimer : NamedRegionTimer {
+  SplitModuleTimer(StringRef Name, StringRef Desc)
+      : NamedRegionTimer(Name, Desc, DEBUG_TYPE, "AMDGPU Module Splitting",
+                         TimeBuild) {}
+};
+#else
+struct SplitModuleTimer {
+  SplitModuleTimer(StringRef Name, StringRef Desc) {}
+};
+#endif
+
+//===----------------------------------------------------------------------===//
+// Utils
+//===----------------------------------------------------------------------===//
 
 using CostType = InstructionCost::CostType;
-using PartitionID = unsigned;
+using FunctionsCostMap = DenseMap<const Function *, CostType>;
 using GetTTIFn = function_ref<const TargetTransformInfo &(Function &)>;
+static constexpr unsigned InvalidPID = -1;
+
+/// \param Num numerator
+/// \param Dem denominator
+/// \param FmtString printf-like format string
+/// \returns a printable object to print (Num/Dem) using FmtString.
+static auto formatRatioOf(CostType Num, CostType Dem,
+                          const char *FmtString = "%0.2f") {
+  return format(FmtString, (double(Num) / Dem) * 100);
+}
 
-static bool isEntryPoint(const Function *F) {
+static bool isKernel(const Function *F) {
   return AMDGPU::isEntryFunctionCC(F->getCallingConv());
 }
 
-static std::string getName(const Value &V) {
-  static bool HideNames;
-
-  static llvm::once_flag HideNameInitFlag;
-  llvm::call_once(HideNameInitFlag, [&]() {
-    if (LogPrivate.getNumOccurrences())
-      HideNames = LogPrivate;
-    else {
-      const auto EV = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_PRIVATE");
-      HideNames = (EV.value_or("0") != "0");
-    }
-  });
-
-  if (!HideNames)
-    return V.getName().str();
-  return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())),
-               /*LowerCase=*/true);
+static bool isNonCopyable(const Function &F) {
+  return isKernel(&F) || F.hasExternalLinkage() || !F.isDefinitionExact();
 }
 
-/// Main logging helper.
-///
-/// Logging can be configured by the following environment variable.
-///   AMD_SPLIT_MODULE_LOG_DIR=<filepath>
-///     If set, uses <filepath> as the directory to write logfiles to
-///     each time module splitting is used.
-///   AMD_SPLIT_MODULE_LOG_PRIVATE
-///     If set to anything other than zero, all names are hidden.
-///
-/// Both environment variables have corresponding CL options which
-/// takes priority over them.
-///
-/// Any output printed to the log files is also printed to dbgs() when -debug is
-/// used and LLVM_DEBUG is defined.
-///
-/// This approach has a small disadvantage over LLVM_DEBUG though: logging logic
-/// cannot be removed from the code (by building without debug). This probably
-/// has a small performance cost because if some computation/formatting is
-/// needed for logging purpose, it may be done everytime only to be ignored
-/// by the logger.
-///
-/// As this pass only runs once and is not doing anything computationally
-/// expensive, this is likely a reasonable trade-off.
-///
-/// If some computation should really be avoided when unused, users of the class
-/// can check whether any logging will occur by using the bool operator.
-///
-/// \code
-///   if (SML) {
-///     // Executes only if logging to a file or if -debug is available and
-///     used.
-///   }
-/// \endcode
-class SplitModuleLogger {
-public:
-  SplitModuleLogger(const Module &M) {
-    std::string LogDir = LogDirOpt;
-    if (LogDir.empty())
-      LogDir = sys::Process::GetEnv("AMD_SPLIT_MODULE_LOG_DIR").value_or("");
-
-    // No log dir specified means we don't need to log to a file.
-    // We may still log to dbgs(), though.
-    if (LogDir.empty())
-      return;
-
-    // If a log directory is specified, create a new file with a unique name in
-    // that directory.
-    int Fd;
-    SmallString<0> PathTemplate;
-    SmallString<0> RealPath;
-    sys::path::append(PathTemplate, LogDir, "Module-%%-%%-%%-%%-%%-%%-%%.txt");
-    if (auto Err =
-            sys::fs::createUniqueFile(PathTemplate.str(), Fd, RealPath)) {
-      report_fatal_error("Failed to create log file at '" + Twine(LogDir) +
-                             "': " + Err.message(),
-                         /*CrashDiag=*/false);
-    }
-
-    FileOS = std::make_unique<raw_fd_ostream>(Fd, /*shouldClose=*/true);
-  }
-
-  bool hasLogFile() const { return FileOS != nullptr; }
-
-  raw_ostream &logfile() {
-    assert(FileOS && "no logfile!");
-    return *FileOS;
-  }
-
-  /// \returns true if this SML will log anything either to a file or dbgs().
-  /// Can be used to avoid expensive computations that are ignored when logging
-  /// is disabled.
-  operator bool() const {
-    return hasLogFile() || (DebugFlag && isCurrentDebugType(DEBUG_TYPE));
+/// If \p GV has local linkage, make it external + hidden.
+static void externalize(GlobalValue &GV) {
+  if (GV.hasLocalLinkage()) {
+    GV.setLinkage(GlobalValue::ExternalLinkage);
+    GV.setVisibility(GlobalValue::HiddenVisibility);
   }
 
-private:
-  std::unique_ptr<raw_fd_ostream> FileOS;
-};
-
-template <typename Ty>
-static SplitModuleLogger &operator<<(SplitModuleLogger &SML, const Ty &Val) {
-  static_assert(
-      !std::is_same_v<Ty, Value>,
-      "do not print values to logs directly, use handleName instead!");
-  LLVM_DEBUG(dbgs() << Val);
-  if (SML.hasLogFile())
-    SML.logfile() << Val;
-  return SML;
+  // Unnamed entities must be named consistently between modules. setName will
+  // give a distinct name to each such entity.
+  if (!GV.hasName())
+    GV.setName("__llvmsplit_unnamed");
 }
 
-/// Calculate the cost of each function in \p M
-/// \param SML Log Helper
+/// Cost analysis function. Calculates the cost of each function in \p M
+///
 /// \param GetTTI Abstract getter for TargetTransformInfo.
 /// \param M Module to analyze.
 /// \param CostMap[out] Resulting Function -> Cost map.
 /// \return The module's total cost.
-static CostType
-calculateFunctionCosts(SplitModuleLogger &SML, GetTTIFn GetTTI, Module &M,
-                       DenseMap<const Function *, CostType> &CostMap) {
+static CostType calculateFunctionCosts(GetTTIFn GetTTI, Module &M,
+                                       FunctionsCostMap &CostMap) {
+  SplitModuleTimer SMT("calculateFunctionCosts", "cost analysis");
+
+  LLVM_DEBUG(dbgs() << "[cost analysis] calculating function costs\n");
   CostType ModuleCost = 0;
-  CostType KernelCost = 0;
+  [[maybe_unused]] CostType KernelCost = 0;
 
   for (auto &Fn : M) {
     if (Fn.isDeclaration())
@@ -251,23 +219,30 @@ calculateFunctionCosts(SplitModuleLogger &SML, GetTTIFn GetTTI, Module &M,
     assert((ModuleCost + FnCost) >= ModuleCost && "Overflow!");
     ModuleCost += FnCost;
 
-    if (isEntryPoint(&Fn))
+    if (isKernel(&Fn))
       KernelCost += FnCost;
   }
 
-  CostType FnCost = (ModuleCost - KernelCost);
-  CostType ModuleCostOr1 = ModuleCost ? ModuleCost : 1;
-  SML << "=> Total Module Cost: " << ModuleCost << '\n'
-      << "  => KernelCost: " << KernelCost << " ("
-      << format("%0.2f", (float(KernelCost) / ModuleCostOr1) * 100) << "%)\n"
-      << "  => FnsCost: " << FnCost << " ("
-      << format("%0.2f", (float(FnCost) / ModuleCostOr1) * 100) << "%)\n";
+  if (CostMap.empty())
+    return 0;
+
+  assert(ModuleCost);
+  LLVM_DEBUG({
+    const CostType FnCost = ModuleCost - KernelCost;
+    dbgs() << " - total module cost is " << ModuleCost << ". kernels cost "
+           << "" << KernelCost << " ("
+           << format("%0.2f", (float(KernelCost) / ModuleCost) * 100)
+           << "% of the module), functions cost " << FnCost << " ("
+           << format("%0.2f", (float(FnCost) / ModuleCost) * 100)
+           << "% of the module)\n";
+  });
 
   return ModuleCost;
 }
 
+/// \return true if \p F can be indirectly called
 static bool canBeIndirectlyCalled(const Function &F) {
-  if (F.isDeclaration() || isEntryPoint(&F))
+  if (F.isDeclaration() || isKernel(&F))
     return false;
   return !F.hasLocalLinkage() ||
          F.hasAddressTaken(/*PutOffender=*/nullptr,
@@ -278,351 +253,1042 @@ static bool canBeIndirectlyCalled(const Function &F) {
                            /*IgnoreCastedDirectCall=*/true);
 }
 
-/// When a function or any of its callees performs an indirect call, this
-/// takes over \ref addAllDependencies and adds all potentially callable
-/// functions to \p Fns so they can be counted as dependencies of the function.
+//===----------------------------------------------------------------------===//
+// Graph-based Module Representation
+//===----------------------------------------------------------------------===//
+
+/// AMDGPUSplitModule's view of the source Module, as a graph of all components
+/// that can be split into different modules.
 ///
-/// This is needed due to how AMDGPUResourceUsageAnalysis operates: in the
-/// presence of an indirect call, the function's resource usage is the same as
-/// the most expensive function in the module.
-/// \param M    The module.
-/// \param Fns[out] Resulting list of functions.
-static void addAllIndirectCallDependencies(const Module &M,
-                                           DenseSet<const Function *> &Fns) {
-  for (const auto &Fn : M) {
-    if (canBeIndirectlyCalled(Fn))
-      Fns.insert(&Fn);
+/// The most trivial instance of this graph is just the CallGraph of the module,
+/// but it is not guaranteed that the graph is strictly equal to the CFG. It
+/// currently always is but it's designed in a way that would eventually allow
+/// us to create abstract nodes, or nodes for different entities such as global
+/// variables or any other meaningful constraint we must consider.
+///
+/// The graph is only mutable by this class, and is generally not modified
+/// after \ref SplitGraph::buildGraph runs. No consumers of the graph can
+/// mutate it.
+class SplitGraph {
+public:
+  class Node;
+
+  enum class EdgeKind : uint8_t {
+    /// The nodes are related through a direct call. This is a "strong" edge as
+    /// it means the Src will directly reference the Dst.
+    DirectCall,
+    /// The nodes are related through an indirect call.
+    /// This is a "weaker" edge and is only considered when traversing the graph
+    /// starting from a kernel. We need this edge for resource usage analysis.
+    ///
+    /// The reason why we have this edge in the first place is due to how
+    /// AMDGPUResourceUsageAnalysis works. In the presence of an indirect call,
+    /// the resource usage of the kernel containing the indirect call is the
+    /// max resource usage of all functions that can be indirectly called.
+    IndirectCall,
+  };
+
+  /// An edge between two nodes. Edges are directional, and tagged with a
+  /// "kind".
+  struct Edge {
+    Edge(Node *Src, Node *Dst, EdgeKind Kind)
+        : Src(Src), Dst(Dst), Kind(Kind) {}
+
+    Node *Src; ///< Source
+    Node *Dst; ///< Destination
+    EdgeKind Kind;
+  };
+
+  using EdgesVec = SmallVector<const Edge *, 0>;
+  using edges_iterator = EdgesVec::const_iterator;
+  using nodes_iterator = const Node *const *;
+
+  SplitGraph(const Module &M, const FunctionsCostMap &CostMap,
+             CostType ModuleCost)
+      : M(M), CostMap(CostMap), ModuleCost(ModuleCost) {}
+
+  void buildGraph(CallGraph &CG);
+
+#ifndef NDEBUG
+  void verifyGraph() const;
+#endif
+
+  bool empty() const { return Nodes.empty(); }
+  const iterator_range<nodes_iterator> nodes() const {
+    return {Nodes.begin(), Nodes.end()};
   }
-}
+  const Node &getNode(unsigned ID) const { return *Nodes[ID]; }
 
-/// Adds the functions that \p Fn may call to \p Fns, then recurses into each
-/// callee until all reachable functions have been gathered.
+  unsigned getNumNodes() const { return Nodes.size(); }
+  BitVector createNodesBitVector() const { return BitVector(Nodes.size()); }
+
+  const Module &getModule() const { return M; }
+
+  CostType getModuleCost() const { return ModuleCost; }
+  CostType getCost(const Function &F) const { return CostMap.at(&F); }
+
+  /// \returns the aggregated cost of all nodes in \p BV (bits set to 1 = node
+  /// IDs).
+  CostType calculateCost(const BitVector &BV) const;
+
+private:
+  /// Retrieves the node for \p GV in \p Cache, or creates a new node for it and
+  /// updates \p Cache.
+  Node &getNode(DenseMap<const GlobalValue *, Node *> &Cache,
+                const GlobalValue &GV);
+
+  // Create a new edge between two nodes and add it to both nodes.
+  const Edge &createEdge(Node &Src, Node &Dst, EdgeKind EK);
+
+  const Module &M;
+  const FunctionsCostMap &CostMap;
+  CostType ModuleCost;
+...
[truncated]
 | 
| // arbitrary max value as a safeguard. Anything above 10 will already be | ||
| // slow, this is just a max value to prevent extreme resource exhaustion or | ||
| // unbounded run time. | ||
| if (MaxDepth > 16) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this as a safeguard because the maximum complexity is 2^N where N is the max depth, so things can get out of hand quickly. Not sure if it's really necessary though. I'm wondering if we should just print a warning when the value is increased above the default, and let the user deal with the consequences?
cc @gandhi56
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, you could set MaxDepth = min(MaxDepth, 16); and add a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean just replace the error with a warning and continue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
| // Unnamed entities must be named consistently between modules. setName will | ||
| // give a distinct name to each such entity. | ||
| if (!GV.hasName()) | ||
| GV.setName("__llvmsplit_unnamed"); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use period in the name, since they don't conflict with C identifiers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the same name the default SplitModule pass uses, it hasn't changed
| errs() << "[amdgpu-split-module] unable to acquire lockfile, debug " | ||
| "output may be mangled by other processes\n"; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't spam errors to console, keep it under LLVM_DEBUG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error will only print once per process, as the loop will break after
However I agree it makes sense to make this a debug output because it's only relevant if debug is enabled anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about the number of times, it's the compiler as a library should never spam to console ever
| "the partitions created for each module")); | ||
|  | ||
| #ifndef NDEBUG | ||
| static cl::opt<bool> TimeBuild("amdgpu-module-splitting-time-trace", cl::Hidden, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also just use TimePassesIsEnabled instead of using a new option. Also other timers are just always built?
| return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())), | ||
| /*LowerCase=*/true); | ||
| static bool isNonCopyable(const Function &F) { | ||
| return AMDGPU::isEntryFunctionCC(F.getCallingConv()) || | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC check last? Is it really necessary? In practice those would always be external anyway?
| return V.getName().str(); | ||
| return toHex(SHA256::hash(arrayRefFromStringRef(V.getName())), | ||
| /*LowerCase=*/true); | ||
| static bool isNonCopyable(const Function &F) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document what non-copyable is?
| else | ||
| IsEntryFnCC = false; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just have the default false set in the member initializer
| HasNonDuplicatableDependecy |= | ||
| (Dep->hasExternalLinkage() || !Dep->isDefinitionExact()); | ||
| #ifndef NDEBUG | ||
| void SplitGraph::verifyGraph() const { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verify functions are more useable when they return a bool for valid or not, the assert can then be on the use of the verify function. That way it's less disruptive to use in a debugger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it also be enabled in release builds or is it fine if it's guarded by NDEBUG?
| if (F.isDeclaration() || isEntryPoint(&F)) | ||
| if (F.isDeclaration() || AMDGPU::isEntryFunctionCC(F.getCallingConv())) | ||
| return false; | ||
| return !F.hasLocalLinkage() || | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is an AMDGPU module and that the module should be standalone/closed-world, we should have visibility of all the callers. The only externally reachable functions are the kernel functions so we can probably assume non-kernel functions without address taken can't be indirectly called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the normal case this check should do it, no? If we're really closed world, we'd internalize everything, and we check above for kernels. I'm not sure i understand your comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the closed world, do you agree that we don't expect call to an extern device function in this module from another module? If that's the case, then is there a reason we want to report device functions without local linkage as potentially indirectly called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, then is there a reason we want to report device functions without local linkage as potentially indirectly called?
Because we cannot see all possible callers of externally visible functions. The closed world assumption should be restricted to the few places that need it, like the internalization. There is no need to add more special casing of the whole world assumption here
Test didn't have a FileCheck line and is obsolete after #104763
| LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/174/builds/4136 Here is the relevant piece of the build log for the reference | 
| LLVM Buildbot has detected a new failure on builder  Full details are available at: https://lab.llvm.org/buildbot/#/builders/153/builds/7305 Here is the relevant piece of the build log for the reference | 
| Hi @Pierre-vh there are few more failed tests on the expensive checks builder (ubuntu):  | 
This reverts commit c9b6e01.
* Revert "Fix MSVC "not all control paths return a value" warning. NFC." Dep to revert c9b6e01 * Revert "[AMDGPU] Graph-based Module Splitting Rewrite (llvm#104763)" This reverts commit c9b6e01.
Relands llvm#104763 with - Fixes for EXPENSIVE_CHECKS test failure (due to sorting operator failing if the input is shuffled first) - Fix for broken proposal selection - c3cb273 included Original commit description below --- Major rewrite of the AMDGPUSplitModule pass in order to better support it long-term. Highlights: - Removal of the "SML" logging system in favor of just using CL options and LLVM_DEBUG, like any other pass in LLVM. - The SML system started from good intentions, but it was too flawed and messy to be of any real use. It was also a real pain to use and made the code more annoying to maintain. - Graph-based module representation with DOTGraph printing support - The graph represents the module accurately, with bidirectional, typed edges between nodes (a node usually represents one function). - Nodes are assigned IDs starting from 0, which allows us to represent a set of nodes as a BitVector. This makes comparing 2 sets of nodes to find common dependencies a trivial task. Merging two clusters of nodes together is also really trivial. - No more defaulting to "P0" for external calls - Roots that can reach non-copyable dependencies (such as external calls) are now grouped together in a single "cluster" that can go into any partition. - No more defaulting to "P0" for indirect calls - New representation for module splitting proposals that can be graded and compared. - Graph-search algorithm that can explore multiple branches/assignments for a cluster of functions, up to a maximum depth. - With the default max depth of 8, we can create up to 256 propositions to try and find the best one. - We can still fall back to a greedy approach upon reaching max depth. That greedy approach uses almost identical heuristics to the previous version of the pass. All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other type of relations between nodes, etc. I also designed the graph representation & the splitting strategies to be as fast as possible, and it seems to have paid off. Some quick tests showed that we spend pretty much all of our time in the CloneModule function, with the actual splitting logic being >1% of the runtime.
…7076) Relands #104763 with - Fixes for EXPENSIVE_CHECKS test failure (due to sorting operator failing if the input is shuffled first) - Fix for broken proposal selection - c3cb273 included Original commit description below --- Major rewrite of the AMDGPUSplitModule pass in order to better support it long-term. Highlights: - Removal of the "SML" logging system in favor of just using CL options and LLVM_DEBUG, like any other pass in LLVM. - The SML system started from good intentions, but it was too flawed and messy to be of any real use. It was also a real pain to use and made the code more annoying to maintain. - Graph-based module representation with DOTGraph printing support - The graph represents the module accurately, with bidirectional, typed edges between nodes (a node usually represents one function). - Nodes are assigned IDs starting from 0, which allows us to represent a set of nodes as a BitVector. This makes comparing 2 sets of nodes to find common dependencies a trivial task. Merging two clusters of nodes together is also really trivial. - No more defaulting to "P0" for external calls - Roots that can reach non-copyable dependencies (such as external calls) are now grouped together in a single "cluster" that can go into any partition. - No more defaulting to "P0" for indirect calls - New representation for module splitting proposals that can be graded and compared. - Graph-search algorithm that can explore multiple branches/assignments for a cluster of functions, up to a maximum depth. - With the default max depth of 8, we can create up to 256 propositions to try and find the best one. - We can still fall back to a greedy approach upon reaching max depth. That greedy approach uses almost identical heuristics to the previous version of the pass. All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other type of relations between nodes, etc. I also designed the graph representation & the splitting strategies to be as fast as possible, and it seems to have paid off. Some quick tests showed that we spend pretty much all of our time in the CloneModule function, with the actual splitting logic being >1% of the runtime.
…lvm#107076) Relands llvm#104763 with - Fixes for EXPENSIVE_CHECKS test failure (due to sorting operator failing if the input is shuffled first) - Fix for broken proposal selection - c3cb273 included Original commit description below Major rewrite of the AMDGPUSplitModule pass in order to better support it long-term. Highlights: - Removal of the "SML" logging system in favor of just using CL options and LLVM_DEBUG, like any other pass in LLVM. - The SML system started from good intentions, but it was too flawed and messy to be of any real use. It was also a real pain to use and made the code more annoying to maintain. - Graph-based module representation with DOTGraph printing support - The graph represents the module accurately, with bidirectional, typed edges between nodes (a node usually represents one function). - Nodes are assigned IDs starting from 0, which allows us to represent a set of nodes as a BitVector. This makes comparing 2 sets of nodes to find common dependencies a trivial task. Merging two clusters of nodes together is also really trivial. - No more defaulting to "P0" for external calls - Roots that can reach non-copyable dependencies (such as external calls) are now grouped together in a single "cluster" that can go into any partition. - No more defaulting to "P0" for indirect calls - New representation for module splitting proposals that can be graded and compared. - Graph-search algorithm that can explore multiple branches/assignments for a cluster of functions, up to a maximum depth. - With the default max depth of 8, we can create up to 256 propositions to try and find the best one. - We can still fall back to a greedy approach upon reaching max depth. That greedy approach uses almost identical heuristics to the previous version of the pass. All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other type of relations between nodes, etc. I also designed the graph representation & the splitting strategies to be as fast as possible, and it seems to have paid off. Some quick tests showed that we spend pretty much all of our time in the CloneModule function, with the actual splitting logic being >1% of the runtime. Change-Id: Ib48a8b0c7cc16edb5d593f740d81b658849876c9
Major rewrite of the AMDGPUSplitModule pass in order to better support it long-term.
Highlights:
All of this gives us a lot of room to experiment with new heuristics or even entirely different splitting strategies if we need to. For instance, the graph representation has room for abstract nodes, e.g. if we need to represent some global variables or external constraints. We could also introduce more edge types to model other type of relations between nodes, etc.
I also designed the graph representation & the splitting strategies to be as fast as possible, and it seems to have paid off. Some quick tests showed that we spend pretty much all of our time in the CloneModule function, with the actual splitting logic being >1% of the runtime.