[PGO] Sampled instrumentation in PGO to speed up instrumentation binary #69535
Conversation
@llvm/pr-subscribers-pgo @llvm/pr-subscribers-llvm-transforms

Author: None (xur-llvm)

Changes: A PGO instrumentation binary can be very slow compared to the non-instrumented binary. It is not uncommon to see a 10x slowdown for highly threaded programs, due to data races and false sharing on the counters. This patch uses sampling in PGO instrumentation to speed up the instrumentation binary. The basic idea is the same as the one here: https://reviews.llvm.org/D63949. This patch makes some improvements so that we only use one condition: we fix the whole sampling period at 65536 and use the wraparound of an unsigned short. With this sampled instrumentation, the binary runs much faster: we measure a 5x speedup using the default duration, and now see only about a 20% to 30% slowdown (compared to an 8x to 10x slowdown without sampling). The profile quality is quite good with sampling: the edge counts usually report >90% overlap. For apps whose program behavior changes with binary speed, sampled instrumentation can improve performance; we have observed some apps getting up to a ~2% improvement in PGO. One potential issue of this patch is the increased binary size and compilation time.

Patch is 26.63 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/69535.diff

11 Files Affected:
diff --git a/llvm/include/llvm/ProfileData/InstrProfData.inc b/llvm/include/llvm/ProfileData/InstrProfData.inc
index 13be2753e514efe..6294505ac396856 100644
--- a/llvm/include/llvm/ProfileData/InstrProfData.inc
+++ b/llvm/include/llvm/ProfileData/InstrProfData.inc
@@ -676,6 +676,7 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
#define INSTR_PROF_PROFILE_RUNTIME_VAR __llvm_profile_runtime
#define INSTR_PROF_PROFILE_COUNTER_BIAS_VAR __llvm_profile_counter_bias
#define INSTR_PROF_PROFILE_SET_TIMESTAMP __llvm_profile_set_timestamp
+#define INSTR_PROF_PROFILE_SAMPLING_VAR __llvm_profile_sampling
/* The variable that holds the name of the profile data
* specified via command line. */
diff --git a/llvm/include/llvm/Transforms/Instrumentation.h b/llvm/include/llvm/Transforms/Instrumentation.h
index 392983a19844451..76d4e1de75154ff 100644
--- a/llvm/include/llvm/Transforms/Instrumentation.h
+++ b/llvm/include/llvm/Transforms/Instrumentation.h
@@ -116,12 +116,18 @@ struct InstrProfOptions {
// Use BFI to guide register promotion
bool UseBFIInPromotion = false;
+ // Use sampling to reduce the profile instrumentation runtime overhead.
+ bool Sampling = false;
+
// Name of the profile file to use as output
std::string InstrProfileOutput;
InstrProfOptions() = default;
};
+// Create the variable for profile sampling.
+void createProfileSamplingVar(Module &M);
+
// Options for sanitizer coverage instrumentation.
struct SanitizerCoverageOptions {
enum Type {
diff --git a/llvm/include/llvm/Transforms/Instrumentation/InstrProfiling.h b/llvm/include/llvm/Transforms/Instrumentation/InstrProfiling.h
index cb0c055dcb74ae8..d0581ff72a15864 100644
--- a/llvm/include/llvm/Transforms/Instrumentation/InstrProfiling.h
+++ b/llvm/include/llvm/Transforms/Instrumentation/InstrProfiling.h
@@ -86,6 +86,9 @@ class InstrProfiling : public PassInfoMixin<InstrProfiling> {
/// Returns true if profile counter update register promotion is enabled.
bool isCounterPromotionEnabled() const;
+ /// Return true if profile sampling is enabled.
+ bool isSamplingEnabled() const;
+
/// Count the number of instrumented value sites for the function.
void computeNumValueSiteCounts(InstrProfValueProfileInst *Ins);
@@ -109,6 +112,9 @@ class InstrProfiling : public PassInfoMixin<InstrProfiling> {
/// acts on.
Value *getCounterAddress(InstrProfInstBase *I);
+ /// Lower the incremental instructions under profile sampling predicates.
+ void doSampling(Instruction *I);
+
/// Get the region counters for an increment, creating them if necessary.
///
/// If the counter array doesn't yet exist, the profile data variables
diff --git a/llvm/include/llvm/Transforms/Instrumentation/PGOInstrumentation.h b/llvm/include/llvm/Transforms/Instrumentation/PGOInstrumentation.h
index 5b1977b7de9a2ae..7199f27dbc991a8 100644
--- a/llvm/include/llvm/Transforms/Instrumentation/PGOInstrumentation.h
+++ b/llvm/include/llvm/Transforms/Instrumentation/PGOInstrumentation.h
@@ -43,12 +43,14 @@ class FileSystem;
class PGOInstrumentationGenCreateVar
: public PassInfoMixin<PGOInstrumentationGenCreateVar> {
public:
- PGOInstrumentationGenCreateVar(std::string CSInstrName = "")
- : CSInstrName(CSInstrName) {}
+ PGOInstrumentationGenCreateVar(std::string CSInstrName = "",
+ bool Sampling = false)
+ : CSInstrName(CSInstrName), ProfileSampling(Sampling) {}
PreservedAnalyses run(Module &M, ModuleAnalysisManager &MAM);
private:
std::string CSInstrName;
+ bool ProfileSampling;
};
/// The instrumentation (profile-instr-gen) pass for IR based PGO.
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 600f8d43caaf216..5595f92e24aa861 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -273,6 +273,9 @@ static cl::opt<AttributorRunOption> AttributorRun(
clEnumValN(AttributorRunOption::NONE, "none",
"disable attributor runs")));
+static cl::opt<bool> EnableSampledInstr(
+ "enable-sampled-instr", cl::init(false), cl::Hidden,
+ cl::desc("Enable profile instrumentation sampling (default = off)"));
static cl::opt<bool> UseLoopVersioningLICM(
"enable-loop-versioning-licm", cl::init(false), cl::Hidden,
cl::desc("Enable the experimental Loop Versioning LICM pass"));
@@ -805,6 +808,10 @@ void PassBuilder::addPGOInstrPasses(ModulePassManager &MPM,
// Do counter promotion at Level greater than O0.
Options.DoCounterPromotion = true;
Options.UseBFIInPromotion = IsCS;
+ if (EnableSampledInstr) {
+ Options.Sampling = true;
+ Options.DoCounterPromotion = false;
+ }
Options.Atomic = AtomicCounterUpdate;
MPM.addPass(InstrProfiling(Options, IsCS));
}
@@ -1117,7 +1124,8 @@ PassBuilder::buildModuleSimplificationPipeline(OptimizationLevel Level,
}
if (PGOOpt && Phase != ThinOrFullLTOPhase::ThinLTOPostLink &&
PGOOpt->CSAction == PGOOptions::CSIRInstr)
- MPM.addPass(PGOInstrumentationGenCreateVar(PGOOpt->CSProfileGenFile));
+ MPM.addPass(PGOInstrumentationGenCreateVar(PGOOpt->CSProfileGenFile,
+ EnableSampledInstr));
if (PGOOpt && Phase != ThinOrFullLTOPhase::ThinLTOPostLink &&
!PGOOpt->MemoryProfile.empty())
diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index 57fcfd53836911b..89e8e152fcee7e4 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -36,6 +36,7 @@
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"
#include "llvm/InitializePasses.h"
@@ -48,6 +49,7 @@
#include "llvm/Support/ErrorHandling.h"
#include "llvm/TargetParser/Triple.h"
#include "llvm/Transforms/Instrumentation/PGOInstrumentation.h"
+#include "llvm/Transforms/Utils/BasicBlockUtils.h"
#include "llvm/Transforms/Utils/ModuleUtils.h"
#include "llvm/Transforms/Utils/SSAUpdater.h"
#include <algorithm>
@@ -148,6 +150,16 @@ cl::opt<bool> SkipRetExitBlock(
"skip-ret-exit-block", cl::init(true),
cl::desc("Suppress counter promotion if exit blocks contain ret."));
+static cl::opt<bool>
+ SampledInstrument("sampled-instr", cl::ZeroOrMore, cl::init(false),
+ cl::desc("Do PGO instrumentation sampling"));
+
+static cl::opt<unsigned> SampledInstrumentDuration(
+ "sampled-instr-duration",
+ cl::desc("Set the sample rate for profile instrumentation, with a value "
+ "range 0 to 65535. We will record this number of samples for "
+ "every 65536 count updates"),
+ cl::init(200));
///
/// A helper class to promote one counter RMW operation in the loop
/// into register update.
@@ -412,30 +424,91 @@ PreservedAnalyses InstrProfiling::run(Module &M, ModuleAnalysisManager &AM) {
return PreservedAnalyses::none();
}
+// Perform instrumentation sampling.
+// We transform:
+// Increment_Instruction;
+// to:
+// if (__llvm_profile_sampling__ <= SampleDuration) {
+// Increment_Instruction;
+// }
+// __llvm_profile_sampling__ += 1;
+//
+// "__llvm_profile_sampling__" is a thread-local global shared by all PGO
+// instrumentation variables (value-instrumentation and edge instrumentation).
+// It has a unsigned short type and will wrapper around when overflow.
+//
+// Note that, the code snippet after the transformation can still be
+// counter promoted. But I don't see a reason for that because the
+// counter updated should be sparse. That's the reason we disable
+// counter promotion by default when sampling is enabled.
+// This can be overwritten by the internal option.
+//
+void InstrProfiling::doSampling(Instruction *I) {
+ if (!isSamplingEnabled())
+ return;
+ int SampleDuration = SampledInstrumentDuration.getValue();
+ unsigned WrapToZeroValue = USHRT_MAX + 1;
+ assert(SampleDuration < USHRT_MAX);
+ auto *Int16Ty = Type::getInt16Ty(M->getContext());
+ auto *CountVar =
+ M->getGlobalVariable(INSTR_PROF_QUOTE(INSTR_PROF_PROFILE_SAMPLING_VAR));
+ assert(CountVar && "CountVar not set properly");
+ IRBuilder<> CondBuilder(I);
+ auto *LoadCountVar = CondBuilder.CreateLoad(Int16Ty, CountVar);
+ auto *DurationCond = CondBuilder.CreateICmpULE(
+ LoadCountVar, CondBuilder.getInt16(SampleDuration));
+ MDBuilder MDB(I->getContext());
+ MDNode *BranchWeight =
+ MDB.createBranchWeights(SampleDuration, WrapToZeroValue - SampleDuration);
+ Instruction *ThenTerm = SplitBlockAndInsertIfThen(
+ DurationCond, I, /* Unreacheable */ false, BranchWeight);
+ IRBuilder<> IncBuilder(I);
+ auto *NewVal = IncBuilder.CreateAdd(LoadCountVar, IncBuilder.getInt16(1));
+ IncBuilder.CreateStore(NewVal, CountVar);
+ I->moveBefore(ThenTerm);
+}
+
bool InstrProfiling::lowerIntrinsics(Function *F) {
bool MadeChange = false;
PromotionCandidates.clear();
+ SmallVector<InstrProfInstBase *, 8> InstrProfInsts;
+
for (BasicBlock &BB : *F) {
for (Instruction &Instr : llvm::make_early_inc_range(BB)) {
- if (auto *IPIS = dyn_cast<InstrProfIncrementInstStep>(&Instr)) {
- lowerIncrement(IPIS);
- MadeChange = true;
- } else if (auto *IPI = dyn_cast<InstrProfIncrementInst>(&Instr)) {
- lowerIncrement(IPI);
- MadeChange = true;
- } else if (auto *IPC = dyn_cast<InstrProfTimestampInst>(&Instr)) {
- lowerTimestamp(IPC);
- MadeChange = true;
- } else if (auto *IPC = dyn_cast<InstrProfCoverInst>(&Instr)) {
- lowerCover(IPC);
- MadeChange = true;
- } else if (auto *IPVP = dyn_cast<InstrProfValueProfileInst>(&Instr)) {
- lowerValueProfileInst(IPVP);
- MadeChange = true;
+ if (auto *IP = dyn_cast<InstrProfInstBase>(&Instr)) {
+ InstrProfInsts.push_back(IP);
}
}
}
+ for (auto *IP : InstrProfInsts) {
+ if (auto *IPIS = dyn_cast<InstrProfIncrementInstStep>(IP)) {
+ doSampling(IP);
+ lowerIncrement(IPIS);
+ MadeChange = true;
+ } else if (auto *IPI = dyn_cast<InstrProfIncrementInst>(IP)) {
+ doSampling(IP);
+ lowerIncrement(IPI);
+ MadeChange = true;
+ } else if (auto *IPC = dyn_cast<InstrProfTimestampInst>(IP)) {
+ doSampling(IP);
+ lowerTimestamp(IPC);
+ MadeChange = true;
+ } else if (auto *IPC = dyn_cast<InstrProfCoverInst>(IP)) {
+ doSampling(IP);
+ lowerCover(IPC);
+ MadeChange = true;
+ } else if (auto *IPVP = dyn_cast<InstrProfValueProfileInst>(IP)) {
+ doSampling(IP);
+ lowerValueProfileInst(IPVP);
+ MadeChange = true;
+ } else {
+ LLVM_DEBUG(dbgs() << "Invalid InstroProf intrinsic: " << *IP << "\n");
+ // ?? Seeing "call void @llvm.memcpy.p0.p0.i64..." here ??
+ // llvm_unreachable("Invalid InstroProf intrinsic");
+ }
+ }
+
if (!MadeChange)
return false;
@@ -455,6 +528,12 @@ bool InstrProfiling::isRuntimeCounterRelocationEnabled() const {
return TT.isOSFuchsia();
}
+bool InstrProfiling::isSamplingEnabled() const {
+ if (SampledInstrument.getNumOccurrences() > 0)
+ return SampledInstrument;
+ return Options.Sampling;
+}
+
bool InstrProfiling::isCounterPromotionEnabled() const {
if (DoCounterPromotion.getNumOccurrences() > 0)
return DoCounterPromotion;
@@ -535,6 +614,9 @@ bool InstrProfiling::run(
if (NeedsRuntimeHook)
MadeChange = emitRuntimeHook();
+ if (!IsCS && isSamplingEnabled())
+ createProfileSamplingVar(M);
+
bool ContainsProfiling = containsProfilingIntrinsics(M);
GlobalVariable *CoverageNamesVar =
M.getNamedGlobal(getCoverageUnusedNamesVarName());
@@ -1372,3 +1454,22 @@ void InstrProfiling::emitInitialization() {
appendToGlobalCtors(*M, F, 0);
}
+
+namespace llvm {
+// Create the variable for profile sampling.
+void createProfileSamplingVar(Module &M) {
+ const StringRef VarName(INSTR_PROF_QUOTE(INSTR_PROF_PROFILE_SAMPLING_VAR));
+ Type *IntTy16 = Type::getInt16Ty(M.getContext());
+ auto SamplingVar = new GlobalVariable(
+ M, IntTy16, false, GlobalValue::WeakAnyLinkage,
+ Constant::getIntegerValue(IntTy16, APInt(16, 0)), VarName);
+ SamplingVar->setVisibility(GlobalValue::DefaultVisibility);
+ SamplingVar->setThreadLocal(true);
+ Triple TT(M.getTargetTriple());
+ if (TT.supportsCOMDAT()) {
+ SamplingVar->setLinkage(GlobalValue::ExternalLinkage);
+ SamplingVar->setComdat(M.getOrInsertComdat(VarName));
+ }
+ appendToCompilerUsed(M, SamplingVar);
+}
+} // namespace llvm
diff --git a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
index 7ad1c9bc54f3780..0ea6398fbdedc1f 100644
--- a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
+++ b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
@@ -1820,6 +1820,8 @@ PGOInstrumentationGenCreateVar::run(Module &M, ModuleAnalysisManager &MAM) {
// The variable in a comdat may be discarded by LTO. Ensure the declaration
// will be retained.
appendToCompilerUsed(M, createIRLevelProfileFlagVar(M, /*IsCS=*/true));
+ if (ProfileSampling)
+ createProfileSamplingVar(M);
PreservedAnalyses PA;
PA.preserve<FunctionAnalysisManagerModuleProxy>();
PA.preserveSet<AllAnalysesOn<Function>>();
diff --git a/llvm/test/Transforms/PGOProfile/Inputs/cspgo_bar_sample.ll b/llvm/test/Transforms/PGOProfile/Inputs/cspgo_bar_sample.ll
new file mode 100644
index 000000000000000..1c8be82715f2531
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/Inputs/cspgo_bar_sample.ll
@@ -0,0 +1,82 @@
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+$__llvm_profile_filename = comdat any
+$__llvm_profile_raw_version = comdat any
+$__llvm_profile_sampling = comdat any
+
+@odd = common dso_local local_unnamed_addr global i32 0, align 4
+@even = common dso_local local_unnamed_addr global i32 0, align 4
+@__llvm_profile_filename = local_unnamed_addr constant [25 x i8] c"pass2/default_%m.profraw\00", comdat
+@__llvm_profile_raw_version = local_unnamed_addr constant i64 216172782113783812, comdat
+@__llvm_profile_sampling = thread_local global i16 0, comdat
+@llvm.used = appending global [1 x i8*] [i8* bitcast (i64* @__llvm_profile_sampling to i8*)], section "llvm.metadata"
+
+define dso_local void @bar(i32 %n) !prof !30 {
+entry:
+ %call = tail call fastcc i32 @cond(i32 %n)
+ %tobool = icmp eq i32 %call, 0
+ br i1 %tobool, label %if.else, label %if.then, !prof !31
+
+if.then:
+ %0 = load i32, i32* @odd, align 4, !tbaa !32
+ %inc = add i32 %0, 1
+ store i32 %inc, i32* @odd, align 4, !tbaa !32
+ br label %if.end
+
+if.else:
+ %1 = load i32, i32* @even, align 4, !tbaa !32
+ %inc1 = add i32 %1, 1
+ store i32 %inc1, i32* @even, align 4, !tbaa !32
+ br label %if.end
+
+if.end:
+ ret void
+}
+
+define internal fastcc i32 @cond(i32 %i) #1 !prof !30 !PGOFuncName !36 {
+entry:
+ %rem = srem i32 %i, 2
+ ret i32 %rem
+}
+
+attributes #1 = { inlinehint noinline }
+
+!llvm.module.flags = !{!0, !1, !2}
+
+!0 = !{i32 1, !"wchar_size", i32 4}
+!1 = !{i32 1, !"EnableSplitLTOUnit", i32 0}
+!2 = !{i32 1, !"ProfileSummary", !3}
+!3 = !{!4, !5, !6, !7, !8, !9, !10, !11}
+!4 = !{!"ProfileFormat", !"InstrProf"}
+!5 = !{!"TotalCount", i64 500002}
+!6 = !{!"MaxCount", i64 200000}
+!7 = !{!"MaxInternalCount", i64 100000}
+!8 = !{!"MaxFunctionCount", i64 200000}
+!9 = !{!"NumCounts", i64 6}
+!10 = !{!"NumFunctions", i64 4}
+!11 = !{!"DetailedSummary", !12}
+!12 = !{!13, !14, !15, !16, !17, !18, !19, !20, !21, !22, !23, !24, !25, !26, !27, !28}
+!13 = !{i32 10000, i64 200000, i32 1}
+!14 = !{i32 100000, i64 200000, i32 1}
+!15 = !{i32 200000, i64 200000, i32 1}
+!16 = !{i32 300000, i64 200000, i32 1}
+!17 = !{i32 400000, i64 200000, i32 1}
+!18 = !{i32 500000, i64 100000, i32 4}
+!19 = !{i32 600000, i64 100000, i32 4}
+!20 = !{i32 700000, i64 100000, i32 4}
+!21 = !{i32 800000, i64 100000, i32 4}
+!22 = !{i32 900000, i64 100000, i32 4}
+!23 = !{i32 950000, i64 100000, i32 4}
+!24 = !{i32 990000, i64 100000, i32 4}
+!25 = !{i32 999000, i64 100000, i32 4}
+!26 = !{i32 999900, i64 100000, i32 4}
+!27 = !{i32 999990, i64 100000, i32 4}
+!28 = !{i32 999999, i64 1, i32 6}
+!30 = !{!"function_entry_count", i64 200000}
+!31 = !{!"branch_weights", i32 100000, i32 100000}
+!32 = !{!33, !33, i64 0}
+!33 = !{!"int", !34, i64 0}
+!34 = !{!"omnipotent char", !35, i64 0}
+!35 = !{!"Simple C/C++ TBAA"}
+!36 = !{!"cspgo_bar.c:cond"}
diff --git a/llvm/test/Transforms/PGOProfile/counter_promo_sampling.ll b/llvm/test/Transforms/PGOProfile/counter_promo_sampling.ll
new file mode 100644
index 000000000000000..6f13196a724994e
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/counter_promo_sampling.ll
@@ -0,0 +1,78 @@
+; RUN: opt < %s --passes=pgo-instr-gen,instrprof -do-counter-promotion=true -sampled-instr=true -skip-ret-exit-block=0 -S | FileCheck --check-prefixes=SAMPLING,PROMO %s
+
+; SAMPLING: $__llvm_profile_sampling = comdat any
+; SAMPLING: @__llvm_profile_sampling = thread_local global i16 0, comdat
+
+define void @foo(i32 %n, i32 %N) {
+; SAMPLING-LABEL: @foo
+; SAMPLING: %[[VV0:[0-9]+]] = load i16, ptr @__llvm_profile_sampling, align 2
+; SAMPLING: %[[VV1:[0-9]+]] = icmp ule i16 %[[VV0]], 200
+; SAMPLING: br i1 %[[VV1]], label {{.*}}, label {{.*}}, !prof !0
+; SAMPLING: {{.*}} = load {{.*}} @__profc_foo{{.*}} 3)
+; SAMPLING-NEXT: add
+; SAMPLING-NEXT: store {{.*}}@__profc_foo{{.*}}3)
+bb:
+ %tmp = add nsw i32 %n, 1
+ %tmp1 = add nsw i32 %n, -1
+ br label %bb2
+
+bb2:
+; PROMO: phi {{.*}}
+; PROMO-NEXT: phi {{.*}}
+; PROMO-NEXT: phi {{.*}}
+; PROMO-NEXT: phi {{.*}}
+ %i.0 = phi i32 [ 0, %bb ], [ %tmp10, %bb9 ]
+ %tmp3 = icmp slt i32 %i.0, %tmp
+ br i1 %tmp3, label %bb4, label %bb5
+
+bb4:
+ tail call void @bar(i32 1)
+ br label %bb9
+
+bb5:
+ %tmp6 = icmp slt i32 %i.0, %tmp1
+ br i1 %tmp6, label %bb7, label %bb8
+
+bb7:
+ tail call void @bar(i32 2)
+ br label %bb9
+
+bb8:
+ tail call void @bar(i32 3)
+ br label %bb9
+
+bb9:
+; SAMPLING: phi {{.*}}
+; SAMPLING-NEXT: %[[V1:[0-9]+]] = add i16 {{.*}}, 1
+; SAMPLING-NEXT: store i16 %[[V1]], ptr @__llvm_profile_sampling, align 2
+; SAMPLING: phi {{.*}}
+; SAMPLING-NEXT: %[[V2:[0-9]+]] = add i16 {{.*}}, 1
+; SAMPLING-NEXT: store i16 %[[V2]], ptr @__llvm_profile_sampling, align 2
+; SAMPLING: phi {{.*}}
+; SAMPLING-NEXT: %[[V3:[0-9]+]] = add i16 {{.*}}, 1
+; SAMPLING-NEXT: store i16 %[[V3]], ptr @__llvm_profile_sampling, align 2
+; PROMO: %[[LIVEOUT3:[a-z0-9]+]] = phi {{.*}}
+; PROMO-NEXT: %[[LIVEOUT2:[a-z0-9]+]] = phi {{.*}}
+; PROMO-NEXT: %[[LIVEOUT1:[a-z0-9]+]] = phi {{.*}}
+ %tmp10 = add nsw i32 %i.0, 1
+ %tmp11 = icmp slt i32 %tmp10, %N
+ br i1 %tmp11, label %bb2, label %bb12
+
+bb12:
+ ret void
+; PROMO: %[[CHECK1:[a-z0-9.]+]] = load {{.*}} @__profc_foo{{.*}}
+; PROMO-NEXT: add {{.*}} %[[CHECK1]], %[[LIVEOUT1]]
+; PROMO-NEXT: store {{.*}}@__profc_foo{{.*}}
+; PROMO-NEXT: %[[CHECK2:[a-z0-9.]+]] = load {{.*}} @__profc_foo{{.*}} 1)
+; PROMO-NEXT: add {{.*}} %[[CHECK2]], %[[LIVEOUT2]]
+; PROMO-NEXT: store {{.*}}@__profc_foo{{.*}}1)
+; PROMO-NEXT: %[[CHECK3:[a-z0-9.]+]] = load {{.*}} @__profc_foo{{.*}} 2)
+; PROMO-NEXT: add {{.*}} %[[CHECK3]], %[[LIVEOUT3]]
+; PROMO-NEXT: store {{.*}}@__profc_foo{{.*}}2)
+; PROMO-NOT: @__profc_foo{{.*}})
+
+}
+
+declare void @bar(i32)
+
+; SAMPLING: !0 = !{!"branch_weights", i32 200, i32 65336}
diff --git a/llvm/test/Transforms/PGOProfile/cspgo_sample.ll b/llvm/test/Transforms/PGOProfile/cspgo_sample.ll
new file mode 100644
index 000000000000000..6683cae4e64c10d
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/cspgo_sample.ll
@@ -0,0 +1,112 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 2
+; REQUIRES: x86-registered-target
+
+; RUN: opt -module-summary %s -o %t1.bc
+; RUN: opt -module-summary %S/Inputs/cspgo_bar_sample.ll -o %t2.bc
+; RUN: llvm-lto2 run -lto-cspgo-profile-file=alloc -enable-sampled-instr -lto-cspgo-gen -save-temps -o %t %t1.bc %t2.bc \
+; RUN: -r=%t1.bc,foo,pl \
+; RUN: -r=%t1.bc,bar,l \
+; RUN: -r=%t1.bc,main,plx \
+; RUN: -r=%t1.bc,__llvm_profile_filename,plx \
+; RUN: -r=%t1.bc,__llvm_profile_raw_version,p...
[truncated]
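For orientation before the review threads, here is a minimal C++ sketch of what the patch does to each counter update, following the pseudocode in the patch's own comment; the standalone framing and helper names are ours, not part of the patch:

```cpp
#include <cstdint>

// Thread-local 16-bit sampling clock shared by all counters in a thread
// (models __llvm_profile_sampling). It wraps at 65536, which supplies the
// fixed sampling period without an explicit compare-and-reset.
thread_local uint16_t sampling_clock = 0;

// Models one lowered counter update under sampling; 'duration' plays the
// role of the -sampled-instr-duration value (default 200 in this revision).
inline void sampled_increment(uint64_t &counter, uint16_t duration) {
  if (sampling_clock <= duration) // inside the burst window of the period
    ++counter;                    // the real counter update
  ++sampling_clock;               // unsigned wraparound closes the period
}
```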
@@ -805,6 +808,10 @@ void PassBuilder::addPGOInstrPasses(ModulePassManager &MPM,
    // Do counter promotion at Level greater than O0.
    Options.DoCounterPromotion = true;
    Options.UseBFIInPromotion = IsCS;
    if (EnableSampledInstr) {
      Options.Sampling = true;
      Options.DoCounterPromotion = false;
Can you add a comment on why counter promotion is turned off?
The reason is mentioned in InstrProfiling.cpp:400.
// if (__llvm_profile_sampling__ <= SampleDuration) {
//   Increment_Instruction;
// }
// __llvm_profile_sampling__ += 1;
This seems more like firstN of 64K rather than random sampling. I have a couple of concerns.
- This seems like it could easily introduce bias.
- We introduce yet another knob for tuning, sampled-instr-duration. If it was randomized then we wouldn't have to worry about this knob. A fast random implementation may not be too much of an increase in overhead compared to this approach.
Agree with what @snehasish said. Why not record once for every N hits? firstN of 64K is more prone to synchronization issues (with the program).
xur@'s sampling scheme is bursty style, which seems less likely to introduce bias compared with an every-N-hits scheme (which can lead to shadowing effects).
For this style of sampling, there might be some optimization that can be done to coalesce the sample count update and check.
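A sketch of that coalescing idea (our illustration; this patch does not implement it): several counter updates in one region share a single sampling check and a single clock tick:

```cpp
#include <cstdint>

extern thread_local uint16_t sampling_clock; // as in the sketch above

void coalesced_updates(uint64_t &counter_a, uint64_t &counter_b,
                       uint16_t duration) {
  // One check and one tick guard both updates, instead of a separate
  // check/tick pair per counter.
  if (sampling_clock <= duration) {
    ++counter_a;
    ++counter_b;
  }
  ++sampling_clock;
}
```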
Since __llvm_profile_sampling__ is shared by all PGO counters, I'm wondering if biasing could happen to code always shadowed in the range [__llvm_profile_sampling__, 65536]. E.g., assume two loops in the program: they count up to 65536 iterations in all, and the second loop will unlikely be counted?
It can be a problem for static counts, but might be OK with runtime counts. Data changes at runtime introduce some randomness -- for instance, loop trip counts. It is unlikely that all loops in a program have a fixed trip count throughout the training run.
__llvm_profile_sampling is not shared by all PGO counters; it's a thread-local variable, so all the counters in one thread share the value.
This is burst sampling. We used to have two parameters (i.e., changing 64K to another user-specified value). Using the 64K value does increase the chance of bias, but I think the chances are low in real programs.
Of course, one can write a test case to show that the method results in a biased profile. This is sampling, after all; there will always be corner cases that get biased results.
xur@'s sampling scheme is bursty style, which seems less likely to introduce bias compared with an every-N-hits scheme

Why? In general, sampling needs to be evenly distributed to avoid bias; HW sampling doesn't do bursts either. Do you have data or an example to illustrate the benefit of bursty sampling?
The only reason I can see to choose bursty sampling is to open up the opportunity for coalescing increments, but that is not done here.
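To make the shadowing concern concrete, here is a toy single-threaded model (entirely ours, with arbitrary numbers) in which one region is never counted even though it executes millions of times:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint16_t clock = 0;            // models __llvm_profile_sampling
  const uint16_t duration = 200; // models the burst duration
  uint64_t a = 0, b = 0;         // modeled counters for two code regions

  // Pathological phase-lock: region A always executes during the first
  // 40000 ticks of each 65536-tick period, region B during the rest.
  for (int period = 0; period < 100; ++period) {
    for (int i = 0; i < 40000; ++i) { if (clock <= duration) ++a; ++clock; }
    for (int i = 0; i < 25536; ++i) { if (clock <= duration) ++b; ++clock; }
  }
  // A executed 4,000,000 times and B 2,553,600 times, yet a == 20100 and
  // b == 0: region B is fully shadowed by the burst window.
  std::printf("a=%llu b=%llu\n", (unsigned long long)a,
              (unsigned long long)b);
}
```

As the replies note, real programs rarely stay phase-locked like this, but it illustrates why the reviewers probe the bias question.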
// ?? Seeing "call void @llvm.memcpy.p0.p0.i64..." here ??
// llvm_unreachable("Invalid InstroProf intrinsic");
I think the llvm_unreachable should be uncommented?
I think there is a bug: it seems I get an llvm.memcpy intrinsic here. That's the reason I commented it out and sank doSampling(IP) into the loop. I've had this patch for quite a while; I'll check whether the bug is still there.
One potential issue of this patch is the increased binary size and compilation time.
Typically how much bigger is the .text size?
                  cl::desc("Do PGO instrumentation sampling"));

static cl::opt<unsigned> SampledInstrumentDuration(
    "sampled-instr-duration",
nit: this is more like a sample rate rather than duration?
or sampling period.
I don't think this is the sampling rate. This is burst sampling: within the burst period, the sample rate is 1. This parameter specifies the duration of the burst period.
If you end up with bursty sampling, maybe sampled-instr-burst-duration to be clear. There also needs to be an explanation in a comment somewhere to call out that burst mode is chosen and explain why.
It can be useful to have a separate sampled-instr-period, which defaults to 64K, so that part is also tunable when needed. With that flag, people can set sampled-instr-burst-duration to 1 and tune sampled-instr-period instead to achieve non-burst sampling. We may give it a try.
I like the idea of two flags, allowing users to choose settings that enable either bursty or conventional sampling. One suggestion would be to make sampled-instr-period a prime number (standard practice in hardware sampling).
The current sampling period is implicit, to take advantage of the wrapping behavior and reduce runtime overhead (and code size) -- we don't want to lose that. However, I think the implementation can check: if the burst duration is not one, assert that the sampling period is not set; otherwise, generate the non-bursty style of sampling.
Instruction *ThenTerm = SplitBlockAndInsertIfThen(
    DurationCond, I, /* Unreacheable */ false, BranchWeight);
IRBuilder<> IncBuilder(I);
auto *NewVal = IncBuilder.CreateAdd(LoadCountVar, IncBuilder.getInt16(1));
So the wrap around is implicitly dependent on HW implementation of integer add?
Not sure I get the question: this depends on the wrap-around of unsigned add. I thought this is guaranteed by the C/C++ standard.
You're right, signed add overflow is UB, but overflow of unsigned add should always wrap around per standard.
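A self-contained check of the behavior under discussion, using only standard C++:

```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint16_t s = 65535;
  s += 1; // operands promote to int; the result converts back to
          // uint16_t modulo 2^16, so this is well-defined and s == 0
  assert(s == 0);
  // Overflowing a signed type (e.g. INT_MAX + 1) would instead be
  // undefined behavior, which is why the sampling counter is unsigned.
}
```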
      }
    }
  }

  for (auto *IP : InstrProfInsts) {
    if (auto *IPIS = dyn_cast<InstrProfIncrementInstStep>(IP)) {
      doSampling(IP);
nit: hoist doSampling(IP) out of the if-else construct?
This is to work around the bug at line 507.
// __llvm_profile_sampling__ += 1;
//
// "__llvm_profile_sampling__" is a thread-local global shared by all PGO
// instrumentation variables (value-instrumentation and edge instrumentation).
instrumentation variables --> counters
The variable is shared by all counters. I will change the comments.
// It has a unsigned short type and will wrapper around when overflow.
//
// Note that, the code snippet after the transformation can still be
// counter promoted. But I don't see a reason for that because the
I don't see a reason --> there is no reason ..
ack.
// Note that, the code snippet after the transformation can still be
// counter promoted. But I don't see a reason for that because the
// counter updated should be sparse. That's the reason we disable
// counter promotion by default when sampling is enabled.
It is still unclear why counter promotion should be disabled with this comment.
OK, I can expand a bit more there. The downside of counter promotion is that we can get an incomplete profile if we dump the counters in the middle of the loop. The benefit is improved instrumentation speed. With this patch, the benefit is very small and won't outweigh the potential downside.
As I said in the comment, these two techniques can work together without any issue. I actually added a test case for them working together.
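For context, a rough illustration of what counter promotion does in general (our sketch of the technique, not the pass's actual output):

```cpp
#include <cstdint>

extern uint64_t profc_foo[2]; // stand-in for a __profc_* counter array
void body(int);

void foo_instrumented(int n) {
  // Unpromoted instrumentation bumps the memory counter every iteration,
  // which is the source of the data races / false sharing. Promotion
  // accumulates the update in a local and flushes it once at loop exit:
  uint64_t promoted = 0;
  for (int i = 0; i < n; ++i) {
    body(i);
    ++promoted; // register update inside the loop
  }
  profc_foo[0] += promoted; // single memory update after the loop
}
```

This also shows the downside mentioned above: a profile dumped while the loop is still running misses the not-yet-flushed, in-register part of the count.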
@@ -148,6 +150,16 @@ cl::opt<bool> SkipRetExitBlock(
     "skip-ret-exit-block", cl::init(true),
     cl::desc("Suppress counter promotion if exit blocks contain ret."));

static cl::opt<bool>
    SampledInstrument("sampled-instr", cl::ZeroOrMore, cl::init(false),
naming nit: SampledInstr to be consistent with command name. Similar change in other places.
OK. Will change.
MDNode *BranchWeight =
    MDB.createBranchWeights(SampleDuration, WrapToZeroValue - SampleDuration);
Instruction *ThenTerm = SplitBlockAndInsertIfThen(
    DurationCond, I, /* Unreacheable */ false, BranchWeight);
nit: typo Unreachable
Any follow-ups on this patch? Once it's in, we'd like to give it a try as well.
Hi Wenlei,
I just updated the patch to sync with LLVM head. You can try it with the option "--mllvm -enable-sampled-instr=true".
I resumed work on this over the last few weeks and I'm testing it with some internal benchmarks.
One thing we noticed is that this patch can increase the text size quite a bit and sometimes hits the 2GB relocation limit for large programs. I am thinking of doing function-level sampling so that the text size increase would be minimal.
Please let me know if you have any problems with the patch, and keep us updated about the performance.
-Rong
Thanks. Will do!
@@ -770,7 +770,7 @@ BasicBlock *FuncPGOInstrumentation<Edge, BBInfo>::getInstrBB(Edge *E) {
   auto canInstrument = [](BasicBlock *BB) -> BasicBlock * {
     // There are basic blocks (such as catchswitch) cannot be instrumented.
     // If the returned first insertion point is the end of BB, skip this BB.
-    if (BB->getFirstInsertionPt() == BB->end())
+    if (BB->getFirstNonPHIOrDbgOrAlloca() == BB->end())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the reason to instrument after alloca?
We're seeing ~0.4% rps improvements on HHVM (PHP JIT) with this change on top of regular IRPGO. The smaller improvement is perhaps expected as JIT doesn't have a lot of parallelism/contention. Any plan for landing this after addressing remaining code review comments? We'd happily take it down after it's landed here -- thanks for the work.
@xur-llvm is this abandoned?
No, this is still active. I was busy with other projects in the past few months, but now I think I have time for this. I will address the review comments and land the change.
One issue I found with this patch is that it increases the code size non-trivially, which is problematic for certain applications. We have some ideas to reduce the size; this would be in a follow-up patch.
Any update on this @xur-llvm?
reopen this pull-request.
Reopen.
The last commit was to rebase to the more recent sources.
Integrated the reviews from Wenlei, David, Snehasish, and Hongtao. The patch now has 3 modes of sampling: (1) full burst sampling, (2) fast burst sampling, and (3) simple sampling. Also updated the tests.
Commit 2 integrated the suggestions from reviewers.
I updated the patch yesterday. The new version addressed the review comments. Could you take a look?
Fix the test for the Windows build, as llvm-nm reports different sizes there. Also rename the tests to correct a typo.
assert(SampledBurstDuration < SampledPeriod);
bool UseShort = (SampledPeriod <= USHRT_MAX);
bool IsSimpleSampling = (SampledBurstDuration == 1);
bool IsFastSampling = (!IsSimpleSampling && SampledPeriod == 65535);
The condition for SampledBurstDuration is generated for non-simple sampling, regardless of whether it's fast sampling. My question was why we need to include !IsSimpleSampling in IsFastSampling.
Add some comments and a test per Wenlei's suggestion.
Update the patch based on Wenlei's suggestion.
lgtm, thanks.
I'm seeing some buildbot failures after this commit: https://lab.llvm.org/buildbot/#/builders/66/builds/1956 @xur-llvm Could you take a look?
Another buildbot showing a failure here: https://lab.llvm.org/buildbot/#/builders/174/builds/2103
Looking at the failures now.
-Rong
If some investigation is required, we would appreciate a revert.
I think this was due to compiler-rt/include/profile/InstrProfData.inc not being synced with llvm/include/llvm/ProfileData/InstrProfData.inc.
Created a PR to fix the issue: #99930
Sync InstrProfData.inc from llvm to compiler-rt. The difference was introduced from #69535.
… function Per the discussion in llvm#102542, it is safe to insert BBs under lowerIntrinsics() since llvm#69535 made it tolerant of modifying BBs. So I can get rid of using the inlined function rmw_or.
…vm#110792) Per the discussion in llvm#102542, it is safe to insert BBs under lowerIntrinsics() since llvm#69535 made it tolerant of modifying BBs. So I can get rid of using the inlined function rmw_or, introduced in llvm#96040.
With sampled instrumentation (llvm#69535), profile counts can appear corrupt. In particular, a function can have a 0 block count for its entry while later blocks are non-zero. This is only likely to happen for colder functions, so it is reasonable to take any action that does not crash.
With sampled instrumentation (llvm#69535), profile counts can appear corrupt. In particular, a function can have 0 block counts for all its blocks while having some non-zero counters for select instrumentation. This is only possible for colder functions, and a reasonable modification to ensure the entry is non-zero (required by `fixFuncEntryCounts`) is to set the counter to one. This is only likely to happen for colder functions, so it is reasonable to take any action that does not crash.
With sampled instrumentation (#69535), profile counts may appear corrupt and `fixFuncEntryCount` may assert. In particular, a function can have a 0 block count for its entry while later blocks are non-zero. This is only likely to happen for colder functions, so it is reasonable to take any action that does not crash. Here we simply bail from fixing the entry count.
In comparison to non-instrumented binaries, PGO instrumentation binaries
can be significantly slower. For highly threaded programs, this slowdown can
reach 10x due to data races or false sharing within counters.
This patch incorporates sampling into the PGO instrumentation process to
enhance the speed of instrumentation binaries. The fundamental concept
is similar to the one proposed in https://reviews.llvm.org/D63949.
Three sampling modes are introduced:
1. Simple Sampling: when '-sampled-instr-burst-duration' is set to 1.
2. Fast Burst Sampling: when not using simple sampling and '-sampled-instr-period' is set to 65535. This is the default mode of sampling.
3. Full Burst Sampling: when neither simple nor fast burst sampling is used.
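Read together with the option names above, the three modes plausibly lower to checks of roughly these shapes (our reading of the review thread, not the pass's actual IR output):

```cpp
#include <cstdint>

thread_local uint16_t sampling_clock = 0; // __llvm_profile_sampling stand-in

// 1. Simple sampling (burst duration == 1): record one hit per period.
void simple(uint64_t &counter, uint16_t period) {
  if (sampling_clock == 0) ++counter;
  if (++sampling_clock >= period) sampling_clock = 0;
}

// 2. Fast burst sampling (period == 65535, the default): the 16-bit clock
//    wraps by itself, so no explicit compare-and-reset is needed.
void fast_burst(uint64_t &counter, uint16_t burst_duration) {
  if (sampling_clock <= burst_duration) ++counter;
  ++sampling_clock;
}

// 3. Full burst sampling: burst window plus an explicit reset; per the
//    UseShort check quoted earlier, the real pass may widen the clock
//    when the period exceeds USHRT_MAX.
void full_burst(uint64_t &counter, uint16_t burst_duration, uint16_t period) {
  if (sampling_clock <= burst_duration) ++counter;
  if (++sampling_clock >= period) sampling_clock = 0;
}
```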
Utilizing this sampled instrumentation significantly improves the binary's
execution speed. Measurements show up to 5x speedup with default
settings. Fast burst sampling now results in only around 20% to 30%
slowdown (compared to 8 to 10x slowdown without sampling).
Our tests show that profile quality remains good with sampling,
with edge counts typically showing more than 90% overlap.
For applications whose behavior changes due to binary speed,
sampling instrumentation can enhance performance.
Observations have shown some apps experiencing up to
a ~2% improvement in PGO.
A potential drawback of this patch is the increased binary size
and compilation time. The sampling method in this patch does not improve instrumentation binary speed for single-threaded programs.