[FMV][GlobalOpt] Statically resolve calls to versioned functions. #87939

labrinea · 2024-04-07T20:34:02Z

To deduce whether the optimization is legal we need to compare the target
features between caller and callee versions. The criteria for bypassing
the resolver are the following:

If the callee's feature set is a subset of the caller's feature set,
then the callee is a candidate for direct call.
Among such candidates the one of highest priority is the best match
and it shall be picked, unless there is a version of the callee with
higher priority than the best match which cannot be picked from a
higher priority caller (directly or through the resolver).
For every higher priority callee version than the best match, there
is a higher priority caller version whose feature set availability
is implied by the callee's feature set.

llvmbot · 2024-04-07T20:35:13Z

@llvm/pr-subscribers-tablegen
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-llvm-analysis

Author: Alexandros Lamprineas (labrinea)

Changes

To deduce whether the optimization is legal we need to compare the target
features between caller and callee versions. The criteria for bypassing
the resolver are the following:

If the callee's feature set is a subset of the caller's feature set,
then the callee is a candidate for direct call.
Among such candidates the one of highest priority is the best match
and it shall be picked, unless there is a version of the callee with
higher priority than the best match which cannot be picked because
there is no corresponding caller for whom it would have been the best
match.

Implementation details:

First we collect all the callee versions in feature priority order. We do
the same for all the callsites. Then we try to constant fold the resolver
for every callsite starting from higher priority callers. This guarantees
that as soon as we find a callee whose priority is lower than the expected
best match then there is no point in continuing further.

The constant folding works for single basic block resolvers as well as
for resolvers consisting of multiple basic blocks. The set of instructions
we attempt to fold are a handful give or take (return, binop, compare,
select, branch, phi) and we only follow single user use-def chains. For
callsites residing in the same caller we cache the folded result to avoid
redundant computation.

Patch is 24.13 KiB, truncated to 20.00 KiB below, full version: /~https://github.com/llvm/llvm-project/pull/87939.diff

8 Files Affected:

(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+23)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+6)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+10)
(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+12)
(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+6)
(modified) llvm/lib/TargetParser/AArch64TargetParser.cpp (+7-2)
(modified) llvm/lib/Transforms/IPO/GlobalOpt.cpp (+225-1)
(added) llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll (+211)

diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index fa9392b86c15b9..530935fd63d326 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1762,6 +1762,16 @@ class TargetTransformInfo {
   /// false, but it shouldn't matter what it returns anyway.
   bool hasArmWideBranch(bool Thumb) const;
 
+  /// Returns true if the target supports Function MultiVersioning.
+  bool hasFMV() const;
+
+  /// Returns the MultiVersion priority of a given function.
+  uint64_t getFMVPriority(Function &F) const;
+
+  /// Returns the symbol which contains the cpu feature mask used by
+  /// the Function MultiVersioning resolver.
+  GlobalVariable *getCPUFeatures(Module &M) const;
+
   /// \return The maximum number of function arguments the target supports.
   unsigned getMaxNumArgs() const;
 
@@ -2152,6 +2162,9 @@ class TargetTransformInfo::Concept {
   virtual VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
+  virtual bool hasFMV() const = 0;
+  virtual uint64_t getFMVPriority(Function &F) const = 0;
+  virtual GlobalVariable *getCPUFeatures(Module &M) const = 0;
   virtual unsigned getMaxNumArgs() const = 0;
 };
 
@@ -2904,6 +2917,16 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.hasArmWideBranch(Thumb);
   }
 
+  bool hasFMV() const override { return Impl.hasFMV(); }
+
+  uint64_t getFMVPriority(Function &F) const override {
+    return Impl.getFMVPriority(F);
+  }
+
+  GlobalVariable *getCPUFeatures(Module &M) const override {
+    return Impl.getCPUFeatures(M);
+  }
+
   unsigned getMaxNumArgs() const override {
     return Impl.getMaxNumArgs();
   }
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 63c2ef8912b29c..746c09f0d50370 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -941,6 +941,12 @@ class TargetTransformInfoImplBase {
 
   bool hasArmWideBranch(bool) const { return false; }
 
+  bool hasFMV() const { return false; }
+
+  uint64_t getFMVPriority(Function &F) const { return 0; }
+
+  GlobalVariable *getCPUFeatures(Module &M) const { return nullptr; }
+
   unsigned getMaxNumArgs() const { return UINT_MAX; }
 
 protected:
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 5f933b4587843c..39da6cc4445759 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1296,6 +1296,16 @@ bool TargetTransformInfo::hasArmWideBranch(bool Thumb) const {
   return TTIImpl->hasArmWideBranch(Thumb);
 }
 
+bool TargetTransformInfo::hasFMV() const { return TTIImpl->hasFMV(); }
+
+uint64_t TargetTransformInfo::getFMVPriority(Function &F) const {
+  return TTIImpl->getFMVPriority(F);
+}
+
+GlobalVariable *TargetTransformInfo::getCPUFeatures(Module &M) const {
+  return TTIImpl->getCPUFeatures(M);
+}
+
 unsigned TargetTransformInfo::getMaxNumArgs() const {
   return TTIImpl->getMaxNumArgs();
 }
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index ee7137b92445bb..a92f859b59a3de 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -21,6 +21,7 @@
 #include "llvm/IR/IntrinsicsAArch64.h"
 #include "llvm/IR/PatternMatch.h"
 #include "llvm/Support/Debug.h"
+#include "llvm/TargetParser/AArch64TargetParser.h"
 #include "llvm/Transforms/InstCombine/InstCombiner.h"
 #include "llvm/Transforms/Vectorize/LoopVectorizationLegality.h"
 #include <algorithm>
@@ -231,6 +232,17 @@ static bool hasPossibleIncompatibleOps(const Function *F) {
   return false;
 }
 
+uint64_t AArch64TTIImpl::getFMVPriority(Function &F) const {
+  StringRef FeatureStr = F.getFnAttribute("target-features").getValueAsString();
+  SmallVector<StringRef, 8> Features;
+  FeatureStr.split(Features, ",");
+  return AArch64::getCpuSupportsMask(Features);
+}
+
+GlobalVariable *AArch64TTIImpl::getCPUFeatures(Module &M) const {
+  return M.getGlobalVariable("__aarch64_cpu_features");
+}
+
 bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,
                                          const Function *Callee) const {
   SMEAttrs CallerAttrs(*Caller), CalleeAttrs(*Callee);
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index de39dea2be43e1..51ad79690679f5 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -83,6 +83,12 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   unsigned getInlineCallPenalty(const Function *F, const CallBase &Call,
                                 unsigned DefaultCallPenalty) const;
 
+  bool hasFMV() const { return ST->hasFMV(); }
+
+  uint64_t getFMVPriority(Function &F) const;
+
+  GlobalVariable *getCPUFeatures(Module &M) const;
+
   /// \name Scalar TTI Implementations
   /// @{
 
diff --git a/llvm/lib/TargetParser/AArch64TargetParser.cpp b/llvm/lib/TargetParser/AArch64TargetParser.cpp
index 71099462d5ecff..7a3d2fc5f0c9db 100644
--- a/llvm/lib/TargetParser/AArch64TargetParser.cpp
+++ b/llvm/lib/TargetParser/AArch64TargetParser.cpp
@@ -50,8 +50,13 @@ std::optional<AArch64::ArchInfo> AArch64::ArchInfo::findBySubArch(StringRef SubA
 uint64_t AArch64::getCpuSupportsMask(ArrayRef<StringRef> FeatureStrs) {
   uint64_t FeaturesMask = 0;
   for (const StringRef &FeatureStr : FeatureStrs) {
-    if (auto Ext = parseArchExtension(FeatureStr))
-      FeaturesMask |= (1ULL << Ext->CPUFeature);
+    StringRef Feat = resolveExtAlias(FeatureStr);
+    for (const auto &E : Extensions) {
+      if (Feat == E.Name || Feat == E.Feature) {
+        FeaturesMask |= (1ULL << E.CPUFeature);
+        break;
+      }
+    }
   }
   return FeaturesMask;
 }
diff --git a/llvm/lib/Transforms/IPO/GlobalOpt.cpp b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
index da714c9a75701b..75ff270bb09b90 100644
--- a/llvm/lib/Transforms/IPO/GlobalOpt.cpp
+++ b/llvm/lib/Transforms/IPO/GlobalOpt.cpp
@@ -89,7 +89,7 @@ STATISTIC(NumAliasesRemoved, "Number of global aliases eliminated");
 STATISTIC(NumCXXDtorsRemoved, "Number of global C++ destructors removed");
 STATISTIC(NumInternalFunc, "Number of internal functions");
 STATISTIC(NumColdCC, "Number of functions marked coldcc");
-STATISTIC(NumIFuncsResolved, "Number of statically resolved IFuncs");
+STATISTIC(NumIFuncsResolved, "Number of resolved IFuncs");
 STATISTIC(NumIFuncsDeleted, "Number of IFuncs removed");
 
 static cl::opt<bool>
@@ -2462,6 +2462,227 @@ DeleteDeadIFuncs(Module &M,
   return Changed;
 }
 
+static Function *foldResolverForCallSite(CallBase *CS, uint64_t Priority,
+                                         TargetTransformInfo &TTI) {
+  // Look for the instruction which feeds the feature mask to the users.
+  auto findRoot = [&TTI](Function *F) -> Instruction * {
+    for (Instruction &I : F->getEntryBlock())
+      if (auto *Load = dyn_cast<LoadInst>(&I))
+        if (Load->getPointerOperand() == TTI.getCPUFeatures(*F->getParent()))
+          return Load;
+    return nullptr;
+  };
+
+  auto *IF = cast<GlobalIFunc>(CS->getCalledOperand());
+  Instruction *Root = findRoot(IF->getResolverFunction());
+  // There is no such instruction. Bail.
+  if (!Root)
+    return nullptr;
+
+  // Create a constant mask to use as seed for the constant propagation.
+  Constant *Seed = Constant::getIntegerValue(
+      Root->getType(), APInt(Root->getType()->getIntegerBitWidth(), Priority));
+
+  auto DL = CS->getModule()->getDataLayout();
+
+  // Recursively propagate on single use chains.
+  std::function<Constant *(Instruction *, Instruction *, Constant *,
+                           BasicBlock *)>
+      constFoldInst = [&](Instruction *I, Instruction *Use, Constant *C,
+                          BasicBlock *Pred) -> Constant * {
+    // Base case.
+    if (auto *Ret = dyn_cast<ReturnInst>(I))
+      if (Ret->getReturnValue() == Use)
+        return C;
+
+    // Minimal set of instruction types to handle.
+    if (auto *BinOp = dyn_cast<BinaryOperator>(I)) {
+      bool Swap = BinOp->getOperand(1) == Use;
+      if (auto *Other = dyn_cast<Constant>(BinOp->getOperand(Swap ? 0 : 1)))
+        C = Swap ? ConstantFoldBinaryInstruction(BinOp->getOpcode(), Other, C)
+                 : ConstantFoldBinaryInstruction(BinOp->getOpcode(), C, Other);
+    } else if (auto *Cmp = dyn_cast<CmpInst>(I)) {
+      bool Swap = Cmp->getOperand(1) == Use;
+      if (auto *Other = dyn_cast<Constant>(Cmp->getOperand(Swap ? 0 : 1)))
+        C = Swap ? ConstantFoldCompareInstOperands(Cmp->getPredicate(), Other,
+                                                   C, DL)
+                 : ConstantFoldCompareInstOperands(Cmp->getPredicate(), C,
+                                                   Other, DL);
+    } else if (auto *Sel = dyn_cast<SelectInst>(I)) {
+      if (Sel->getCondition() == Use)
+        C = dyn_cast<Constant>(C->isZeroValue() ? Sel->getFalseValue()
+                                                : Sel->getTrueValue());
+    } else if (auto *Phi = dyn_cast<PHINode>(I)) {
+      if (Pred)
+        C = dyn_cast<Constant>(Phi->getIncomingValueForBlock(Pred));
+    } else if (auto *Br = dyn_cast<BranchInst>(I)) {
+      if (Br->getCondition() == Use) {
+        BasicBlock *BB = Br->getSuccessor(C->isZeroValue());
+        return constFoldInst(&BB->front(), Root, Seed, Br->getParent());
+      }
+    } else {
+      // Don't know how to handle. Bail.
+      return nullptr;
+    }
+
+    // Folding succeeded. Continue.
+    if (C && I->hasOneUse())
+      if (auto *UI = dyn_cast<Instruction>(I->user_back()))
+        return constFoldInst(UI, I, C, nullptr);
+
+    return nullptr;
+  };
+
+  // Collect all users in the entry block ordered by proximity. The rest of
+  // them can be discovered later. Unfortunately we cannot simply traverse
+  // the Root's 'users()' as their order is not the same as execution order.
+  unsigned NUsersLeft = std::distance(Root->user_begin(), Root->user_end());
+  SmallVector<Instruction *> Users;
+  for (Instruction &I : *Root->getParent()) {
+    if (any_of(I.operands(), [Root](auto &Op) { return Op == Root; })) {
+      Users.push_back(&I);
+      if (--NUsersLeft == 0)
+        break;
+    }
+  }
+
+  // Return as soon as we find a foldable user. It has the highest priority.
+  for (Instruction *I : Users) {
+    Constant *C = constFoldInst(I, Root, Seed, nullptr);
+    if (C)
+      return cast<Function>(C);
+  }
+
+  return nullptr;
+}
+
+// Bypass the IFunc Resolver of MultiVersioned functions when possible. To
+// deduce whether the optimization is legal we need to compare the target
+// features between caller and callee versions. The criteria for bypassing
+// the resolver are the following:
+//
+// * If the callee's feature set is a subset of the caller's feature set,
+//   then the callee is a candidate for direct call.
+//
+// * Among such candidates the one of highest priority is the best match
+//   and it shall be picked, unless there is a version of the callee with
+//   higher priority than the best match which cannot be picked because
+//   there is no corresponding caller for whom it would have been the best
+//   match.
+//
+static bool OptimizeNonTrivialIFuncs(
+    Module &M, function_ref<TargetTransformInfo &(Function &)> GetTTI) {
+  bool Changed = false;
+
+  std::function<void(Value *, SmallVectorImpl<Function *> &)> visitValue =
+      [&](Value *V, SmallVectorImpl<Function *> &FuncVersions) {
+        if (auto *Func = dyn_cast<Function>(V)) {
+          FuncVersions.push_back(Func);
+        } else if (auto *Sel = dyn_cast<SelectInst>(V)) {
+          visitValue(Sel->getTrueValue(), FuncVersions);
+          visitValue(Sel->getFalseValue(), FuncVersions);
+        } else if (auto *Phi = dyn_cast<PHINode>(V))
+          for (unsigned I = 0, E = Phi->getNumIncomingValues(); I != E; ++I)
+            visitValue(Phi->getIncomingValue(I), FuncVersions);
+      };
+
+  for (GlobalIFunc &IF : M.ifuncs()) {
+    if (IF.isInterposable())
+      continue;
+
+    Function *Resolver = IF.getResolverFunction();
+    if (!Resolver)
+      continue;
+
+    if (Resolver->isInterposable())
+      continue;
+
+    TargetTransformInfo &TTI = GetTTI(*Resolver);
+    if (!TTI.hasFMV())
+      return false;
+
+    // Discover the callee versions.
+    SmallVector<Function *> Callees;
+    for (BasicBlock &BB : *Resolver)
+      if (auto *Ret = dyn_cast_or_null<ReturnInst>(BB.getTerminator()))
+        visitValue(Ret->getReturnValue(), Callees);
+
+    if (Callees.empty())
+      continue;
+
+    // Cache the feature mask for each callee version.
+    DenseMap<Function *, uint64_t> CalleePriorityMap;
+    for (Function *Callee : Callees) {
+      auto [It, Inserted] = CalleePriorityMap.try_emplace(Callee);
+      if (Inserted)
+        It->second = TTI.getFMVPriority(*Callee);
+    }
+
+    // Sort the callee versions in increasing feature priority order.
+    // Every time we find a caller that matches the highest priority
+    // callee we pop_back() one from this ordered list.
+    llvm::stable_sort(Callees, [&](auto *LHS, auto *RHS) {
+      return CalleePriorityMap[LHS] < CalleePriorityMap[RHS];
+    });
+
+    // Find the callsites and cache the feature mask for each caller.
+    DenseMap<Function *, uint64_t> CallerPriorityMap;
+    SmallVector<CallBase *> CallSites;
+    for (User *U : IF.users()) {
+      if (auto *CB = dyn_cast<CallBase>(U)) {
+        if (CB->getCalledOperand() == &IF) {
+          Function *Caller = CB->getFunction();
+          auto [It, Inserted] = CallerPriorityMap.try_emplace(Caller);
+          if (Inserted)
+            It->second = TTI.getFMVPriority(*Caller);
+          CallSites.push_back(CB);
+        }
+      }
+    }
+
+    // Sort the callsites in decreasing feature priority order.
+    llvm::stable_sort(CallSites, [&](auto *LHS, auto *RHS) {
+      return CallerPriorityMap[LHS->getFunction()] >
+             CallerPriorityMap[RHS->getFunction()];
+    });
+
+    // Now try to constant fold the resolver for every callsite starting
+    // from higher priority callers. This guarantees that as soon as we
+    // find a callee whose priority is lower than the expected best match
+    // then there is no point in continuing further.
+    DenseMap<uint64_t, Function *> foldedResolverCache;
+    for (CallBase *CS : CallSites) {
+      uint64_t CallerPriority = CallerPriorityMap[CS->getFunction()];
+      auto [It, Inserted] = foldedResolverCache.try_emplace(CallerPriority);
+      Function *&Callee = It->second;
+      if (Inserted)
+        Callee = foldResolverForCallSite(CS, CallerPriority, TTI);
+      if (Callee) {
+        if (!Callees.empty()) {
+          // If the priority of the candidate is greater or equal to
+          // the expected best match then it shall be picked. Otherwise
+          // there is a higher priority callee without a corresponding
+          // caller, in which case abort.
+          uint64_t CalleePriority = CalleePriorityMap[Callee];
+          if (CalleePriority == CalleePriorityMap[Callees.back()])
+            Callees.pop_back();
+          else if (CalleePriority < CalleePriorityMap[Callees.back()])
+            break;
+        }
+        CS->setCalledOperand(Callee);
+        Changed = true;
+      } else {
+        // Oops, something went wrong. We couldn't fold. Abort.
+        break;
+      }
+    }
+    if (IF.use_empty() ||
+        all_of(IF.users(), [](User *U) { return isa<GlobalAlias>(U); }))
+      NumIFuncsResolved++;
+  }
+  return Changed;
+}
+
 static bool
 optimizeGlobalsInModule(Module &M, const DataLayout &DL,
                         function_ref<TargetLibraryInfo &(Function &)> GetTLI,
@@ -2525,6 +2746,9 @@ optimizeGlobalsInModule(Module &M, const DataLayout &DL,
     // Optimize IFuncs whose callee's are statically known.
     LocalChange |= OptimizeStaticIFuncs(M);
 
+    // Optimize IFuncs based on the target features of the caller.
+    LocalChange |= OptimizeNonTrivialIFuncs(M, GetTTI);
+
     // Remove any IFuncs that are now dead.
     LocalChange |= DeleteDeadIFuncs(M, NotDiscardableComdats);
 
diff --git a/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll b/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
new file mode 100644
index 00000000000000..bcc73c8e44970f
--- /dev/null
+++ b/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
@@ -0,0 +1,211 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "call i32 @(test_single_bb_resolver|test_multi_bb_resolver)" --version 4
+; RUN: opt --passes=globalopt -o - -S < %s | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+$test_single_bb_resolver.resolver = comdat any
+$test_multi_bb_resolver.resolver = comdat any
+$foo.resolver = comdat any
+$bar.resolver = comdat any
+
+@__aarch64_cpu_features = external local_unnamed_addr global { i64 }
+
+@test_single_bb_resolver.ifunc = weak_odr alias i32 (), ptr @test_single_bb_resolver
+@test_multi_bb_resolver.ifunc = weak_odr dso_local alias i32 (), ptr @test_multi_bb_resolver
+@foo.ifunc = weak_odr alias i32 (), ptr @foo
+@bar.ifunc = weak_odr dso_local alias i32 (), ptr @bar
+
+@test_single_bb_resolver = weak_odr ifunc i32 (), ptr @test_single_bb_resolver.resolver
+@test_multi_bb_resolver = weak_odr dso_local ifunc i32 (), ptr @test_multi_bb_resolver.resolver
+@foo = weak_odr ifunc i32 (), ptr @foo.resolver
+@bar = weak_odr dso_local ifunc i32 (), ptr @bar.resolver
+
+declare void @__init_cpu_features_resolver() local_unnamed_addr
+
+declare i32 @test_single_bb_resolver._Msve() #2
+
+declare i32 @test_single_bb_resolver._Msve2() #3
+
+define i32 @test_single_bb_resolver.default() #1 {
+; CHECK-LABEL: define i32 @test_single_bb_resolver.default(
+; CHECK-SAME: ) #[[ATTR2:[0-9]+]] {
+entry:
+  ret i32 0
+}
+
+define weak_odr ptr @test_single_bb_resolver.resolver() #0 comdat {
+; CHECK-LABEL: define weak_odr ptr @test_single_bb_resolver.resolver(
+; CHECK-SAME: ) #[[ATTR3:[0-9]+]] comdat {
+resolver_entry:
+  tail call void @__init_cpu_features_resolver()
+  %0 = load i64, ptr @__aarch64_cpu_features, align 8
+  %1 = and i64 %0, 68719476736
+  %.not = icmp eq i64 %1, 0
+  %2 = and i64 %0, 1073741824
+  %.not3 = icmp eq i64 %2, 0
+  %test_single_bb_resolver._Msve.test_single_bb_resolver.default = select i1 %.not3, ptr @test_single_bb_resolver.default, ptr @test_single_bb_resolver._Msve
+  %common.ret.op = select i1 %.not, ptr %test_single_bb_resolver._Msve.test_single_bb_resolver.default, ptr @test_single_bb_resolver._Msve2
+  ret ptr %common.ret.op
+}
+
+define i32 @foo._Msve() #2 {
+; CHECK-LABEL: define i32 @foo._Msve(
+; CHECK-SAME: ) #[[ATTR0:[0-9]+]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve()
+;
+entry:
+  %call = tail call i32 @test_single_bb_resolver()
+  %add = add nsw i32 %call, 30
+  ret i32 %add
+}
+
+define i32 @foo._Msve2() #3 {
+; CHECK-LABEL: define i32 @foo._Msve2(
+; CHECK-SAME: ) #[[ATTR1:[0-9]+]] {
+; CHECK:    [[CALL1:%.*]] = tail call i32 @test_single_bb_resolver._Msve2()
+; CHECK:    [[CALL2:%.*]] = tail call i32 @test_single_bb_resolver._Msve2()
+;
+entry:
+  %call1 = tail call i32 @test_single_bb_resolver()
+  %call2 = tail call i32 @test_single_bb_resolver()
+  %added = add nsw i32 %call1, %call2
+  %add = add nsw i32 %added, 20
+  ret i32 %add
+}
+
+define i32 @foo.default() #1 {
+; CHECK-LABEL: define i32 @foo.default(
+; CHECK-SAME: ) #[[ATTR2:[0-9]+]] {
+; CHECK:    [[CALL:%.*]] = tail call i32 @test_single_bb_resolver.default()
+;
+entry:
+  %call = tail call i32 @test_single_bb_resolver()
+  %add = add nsw i32 %call, 10
+  ret i32 %add
+}
+
+define weak_odr ptr @foo.resolver() #0 comdat {
+; CHECK-LABEL: define weak_odr ptr @foo.resolver(
+; CHECK-SAME: ) #[[ATTR3:[0-9]+]] comdat {
+resolver_entry:
+  tail call void @__init_cpu_features_resolver()
+  %0 = load i64, ptr @__aarch64_cpu_features, align 8
+  %1 = and i64 %0, 68719476736
+  %.not = icmp eq i64 %1, 0
+  %2 = ...
[truncated]

This will allow the backend to enable the corresponding subtarget feature (FeatureFMV), which in turn can be queried for llvm codegen decisions. See llvm#87939 for example.

llvm/lib/Transforms/IPO/GlobalOpt.cpp

andrewcarlotti · 2024-04-08T08:51:20Z

The testcase and description look valid, although the description seems a little more conservative than necessary.

If you were to add a bar._Mmops version, would it's call be optimised to test_multi_bb_resolver._Mmops? And would that also enable the call from bar.default to be optimised to point to test_multi_bb_resolver.default?

labrinea · 2024-04-08T12:50:36Z

The testcase and description look valid, although the description seems a little more conservative than necessary.

"Subset" is perhaps not the right terminology. I meant to say the feature set is "less or equal to". In other words "implied".

If you were to add a bar._Mmops version, would it's call be optimised to test_multi_bb_resolver._Mmops?

Yes

And would that also enable the call from bar.default to be optimised to point to test_multi_bb_resolver.default?

No, because there is no corresponding caller to pick test_multi_bb_resolver._Msve2.

Ah, now I see your point. Routing bar.default through the resolver could not have picked test_multi_bb_resolver._Msve2, because if the caller doesn't have sve it can't have sve2!

labrinea · 2024-04-08T22:44:58Z

@jroelofs, aside from semantics which I'll clarify with Andrew, I am thinking that the constant folding part might be redundant ~~and whether FeatureBitset is an alternative to the TargetParser~~ (ignore the second part, bad idea).

To deduce whether the optimization is legal we need to compare the target features between caller and callee versions. The criteria for bypassing the resolver are the following: * If the callee's feature set is a subset of the caller's feature set, then the callee is a candidate for direct call. * Among such candidates the one of highest priority is the best match and it shall be picked, unless there is a version of the callee with higher priority than the best match which cannot be picked from a higher priority caller (directly or through the resolver). * For every higher priority callee version than the best match, there is a higher priority caller version whose feature set availability is implied by the callee's feature set. Example: Callers and Callees are ordered in decreasing priority. The arrows indicate successful call redirections. Caller Callee Explanation ========================================================================= mops+sve2 --+--> mops all the callee versions are subsets of the | caller but mops has the highest priority | mops --+ sve2 between mops and default callees, mops wins sve sve between sve and default callees, sve wins but sve2 does not have a high priority caller default -----> default sve (callee) implies sve (caller), sve2(callee) implies sve (caller), mops(callee) implies mops(caller)

labrinea · 2024-04-10T14:15:34Z

I've removed the constant folding since it was indeed unnecessary computation. Also I found a testcase which exposes a potential issue: a function version (for example AES) having the same target features as the default. In TargetParser we have ExtensionInfo->DependentFeatures to model this information (I believe @ilinpv introduced this change) and it seems that for some entries we don't add the corresponding backend feature (perhaps intentionally?). I will try to address this in a separate ticket.

labrinea · 2024-04-11T13:08:47Z

I've removed the constant folding since it was indeed unnecessary computation. Also I found a testcase which exposes a potential issue: a function version (for example AES) having the same target features as the default. In TargetParser we have ExtensionInfo->DependentFeatures to model this information (I believe @ilinpv introduced this change) and it seems that for some entries we don't add the corresponding backend feature (perhaps intentionally?). I will try to address this in a separate ticket.

@andrewcarlotti already has a patch for ACLE ARM-software/acle#315 which removes pmull since it is implied by aes, therefore we can add the missing +aes in ExtensionInfo->DependentFeatures for aes.

This patch is sorting out inconsistencies in TargetParser regarding: * features without corresponding ArchExtKind * features without (Neg)Feature string * features with incorrect DependentFeatures string Also fixes "fp" which incorrectly implied "neon". This leaves us with "rpres" being the only remaining FMV feature with an incomplete entry. We are are currently reviewing it in ARM-software/acle#315. Having accurate ExtensionInfo entries in TargetParser is the only way of propagating the FMV information from clang to LLVM. If we don't want to rely on that we have to come up with another plan of encoding this information in LLVM-IR. For now we are relying on target-features. The PR llvm#87939 is an example of why we need this.

labrinea · 2024-05-21T08:58:50Z

Hi, @jroelofs. This patch has been quiet in a while and the reason is I have been collaborating with @tmatheson-arm in refactoring the TargetParser as a preliminary step. However the landscape doesn't seem to be changing much, it's mostly NFC. What I mean is that for this patch to work we need to be able to express every FMV feature in the LLVM IR attribute "target-features". This isn't the case for a few FMV features and I believe it won't be easy to change. For example the FMV feature AES in ExtensionInfo has no DependentFeatures because the SubtargetFeature for AES is fused with PMULL, meaning we can't link them (the fmv feature and the subtarget counterpart) together as they mean different things. I do not want to rely on the cleanup in ARM-software/acle#315 as features may diverge again between FMV and the AAch64 compiler backend. Spliting up backend features will break backwards compatibility in LLVM IR (because they will also have to be renamed), so maybe we need some other way to propagate FMV information from clang to LLVM (not via the target-features attribute). Is this worth an RFC, like introducing a new attribute for FMV in IR? Is this optimization still valuable to you? Shall we pursue this further or just give up?

jroelofs · 2024-05-21T16:54:44Z

Yes, this optimization is still interesting to me, but I’m not in a rush. We can take our time on it. I have a non-FMV project I need to focus on over the summer, but I expect to pick up work on things in this area in the fall.

When generating the body of the ifunc resolver, clang skips runtime checks for features that are implied from the command line. We bend this rule for certain features (memtag, bti, dgh), but this happens quite arbitrarily in my opinion. The reasoning is that some features are in the HINT instruction space, meaning they operate as NOPs if the hardware does not support them. Still the user wants to detect their presence with runtime checks. See llvm#90928 for details. I think we should always perform runtime checks regardless of the feature and then try to statically resolve calls whenever a function is compiled with a sufficiently high set of architecture features (so including target/target_version/target_clones attributes, and command line options). This is what GCC does. We have an open PR in LLVM GlobalOpt since it was suggested not to perform such codegen optimizations in clang anyway. See llvm#87939.

labrinea · 2025-01-10T21:30:38Z

llvm/lib/Transforms/IPO/GlobalOpt.cpp

+      Changed = true;
+    }
+    if (IF.use_empty() ||
+        all_of(IF.users(), [](User *U) { return isa<GlobalAlias>(U); }))


That is probably a leftover from the time we had ifunc aliases. Subject to removal.

labrinea · 2025-01-13T11:03:41Z

llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll

@@ -0,0 +1,365 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "call i32 @(test_single_bb_resolver|test_multi_bb_resolver|test_caller_feats_not_implied|test_non_fmv_caller|test_priority|test_alternative_names)" --version 4
+; RUN: opt --passes=globalopt -o - -S < %s | FileCheck %s


Idea: perhaps it would be useful to have an integration test in clang at -O3 to show that this optimization is working as expected on C/C++ source?

labrinea · 2025-01-16T17:17:07Z

Ping

jroelofs

LGTM

llvm/include/llvm/Analysis/TargetTransformInfo.h

…nstant reference.

llvm-ci · 2025-01-17T10:57:40Z

LLVM Buildbot has detected a new failure on builder llvm-nvptx64-nvidia-ubuntu running on as-builder-7 while building llvm at step 6 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/160/builds/11463

Here is the relevant piece of the build log for the reference

Step 6 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: Transforms/GlobalOpt/resolve-fmv-ifunc.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/build/bin/opt --passes=globalopt -o - -S < /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll | /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/build/bin/FileCheck /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
+ /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/build/bin/opt --passes=globalopt -o - -S
+ /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/build/bin/FileCheck /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/build/bin/opt: WARNING: failed to create target machine for 'aarch64-unknown-linux-gnu': unable to get target for 'aarch64-unknown-linux-gnu', see --version and --triple.
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:46:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve()
         ^
<stdin>:48:52: note: scanning from here
define i32 @caller1._Msve() local_unnamed_addr #1 {
                                                   ^
<stdin>:50:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:56:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve2()
         ^
<stdin>:54:53: note: scanning from here
define i32 @caller1._Msve2() local_unnamed_addr #2 {
                                                    ^
<stdin>:56:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:66:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver.default()
         ^
<stdin>:60:54: note: scanning from here
define i32 @caller1.default() local_unnamed_addr #0 {
                                                     ^
<stdin>:62:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:106:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
         ^
<stdin>:98:58: note: scanning from here
define i32 @caller2._MmopsMsve2() local_unnamed_addr #4 {
                                                         ^
<stdin>:100:2: note: possible intended match here
 %call = tail call i32 @test_multi_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx64-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:116:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
         ^
<stdin>:104:53: note: scanning from here
define i32 @caller2._Mmops() local_unnamed_addr #3 {
...

llvm-ci · 2025-01-17T10:58:44Z

LLVM Buildbot has detected a new failure on builder llvm-clang-x86_64-sie-ubuntu-fast running on sie-linux-worker while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/144/builds/15892

Here is the relevant piece of the build log for the reference

Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: Transforms/GlobalOpt/resolve-fmv-ifunc.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --passes=globalopt -o - -S < /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll | /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt --passes=globalopt -o - -S
+ /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/FileCheck /home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/build/bin/opt: WARNING: failed to create target machine for 'aarch64-unknown-linux-gnu': unable to get target for 'aarch64-unknown-linux-gnu', see --version and --triple.
�[1m/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:46:10: �[0m�[0;1;31merror: �[0m�[1mCHECK: expected string not found in input
�[0m; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve()
�[0;1;32m         ^
�[0m�[1m<stdin>:48:52: �[0m�[0;1;30mnote: �[0m�[1mscanning from here
�[0mdefine i32 @caller1._Msve() local_unnamed_addr #1 {
�[0;1;32m                                                   ^
�[0m�[1m<stdin>:50:2: �[0m�[0;1;30mnote: �[0m�[1mpossible intended match here
�[0m %call = tail call i32 @test_single_bb_resolver()
�[0;1;32m ^
�[0m�[1m/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:56:10: �[0m�[0;1;31merror: �[0m�[1mCHECK: expected string not found in input
�[0m; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve2()
�[0;1;32m         ^
�[0m�[1m<stdin>:54:53: �[0m�[0;1;30mnote: �[0m�[1mscanning from here
�[0mdefine i32 @caller1._Msve2() local_unnamed_addr #2 {
�[0;1;32m                                                    ^
�[0m�[1m<stdin>:56:2: �[0m�[0;1;30mnote: �[0m�[1mpossible intended match here
�[0m %call = tail call i32 @test_single_bb_resolver()
�[0;1;32m ^
�[0m�[1m/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:66:10: �[0m�[0;1;31merror: �[0m�[1mCHECK: expected string not found in input
�[0m; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver.default()
�[0;1;32m         ^
�[0m�[1m<stdin>:60:54: �[0m�[0;1;30mnote: �[0m�[1mscanning from here
�[0mdefine i32 @caller1.default() local_unnamed_addr #0 {
�[0;1;32m                                                     ^
�[0m�[1m<stdin>:62:2: �[0m�[0;1;30mnote: �[0m�[1mpossible intended match here
�[0m %call = tail call i32 @test_single_bb_resolver()
�[0;1;32m ^
�[0m�[1m/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:106:10: �[0m�[0;1;31merror: �[0m�[1mCHECK: expected string not found in input
�[0m; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
�[0;1;32m         ^
�[0m�[1m<stdin>:98:58: �[0m�[0;1;30mnote: �[0m�[1mscanning from here
�[0mdefine i32 @caller2._MmopsMsve2() local_unnamed_addr #4 {
�[0;1;32m                                                         ^
�[0m�[1m<stdin>:100:2: �[0m�[0;1;30mnote: �[0m�[1mpossible intended match here
�[0m %call = tail call i32 @test_multi_bb_resolver()
�[0;1;32m ^
�[0m�[1m/home/buildbot/buildbot-root/llvm-clang-x86_64-sie-ubuntu-fast/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:116:10: �[0m�[0;1;31merror: �[0m�[1mCHECK: expected string not found in input
�[0m; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
�[0;1;32m         ^
�[0m�[1m<stdin>:104:53: �[0m�[0;1;30mnote: �[0m�[1mscanning from here
�[0mdefine i32 @caller2._Mmops() local_unnamed_addr #3 {
...

jplehr · 2025-01-17T11:07:10Z

Hi, there are two more bots red:
https://lab.llvm.org/buildbot/#/builders/140/builds/14935
https://lab.llvm.org/staging/#/builders/130/builds/12489

Let me know if you need help from my side.

llvm-ci · 2025-01-17T11:07:48Z

LLVM Buildbot has detected a new failure on builder llvm-nvptx-nvidia-ubuntu running on as-builder-7 while building llvm at step 6 "test-build-unified-tree-check-llvm".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/180/builds/11461

Here is the relevant piece of the build log for the reference

Step 6 (test-build-unified-tree-check-llvm) failure: test (failure)
******************** TEST 'LLVM :: Transforms/GlobalOpt/resolve-fmv-ifunc.ll' FAILED ********************
Exit Code: 1

Command Output (stderr):
--
RUN: at line 2: /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/build/bin/opt --passes=globalopt -o - -S < /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll | /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/build/bin/FileCheck /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
+ /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/build/bin/opt --passes=globalopt -o - -S
+ /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/build/bin/FileCheck /home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/build/bin/opt: WARNING: failed to create target machine for 'aarch64-unknown-linux-gnu': unable to get target for 'aarch64-unknown-linux-gnu', see --version and --triple.
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:46:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve()
         ^
<stdin>:48:52: note: scanning from here
define i32 @caller1._Msve() local_unnamed_addr #1 {
                                                   ^
<stdin>:50:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:56:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver._Msve2()
         ^
<stdin>:54:53: note: scanning from here
define i32 @caller1._Msve2() local_unnamed_addr #2 {
                                                    ^
<stdin>:56:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:66:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_single_bb_resolver.default()
         ^
<stdin>:60:54: note: scanning from here
define i32 @caller1.default() local_unnamed_addr #0 {
                                                     ^
<stdin>:62:2: note: possible intended match here
 %call = tail call i32 @test_single_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:106:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
         ^
<stdin>:98:58: note: scanning from here
define i32 @caller2._MmopsMsve2() local_unnamed_addr #4 {
                                                         ^
<stdin>:100:2: note: possible intended match here
 %call = tail call i32 @test_multi_bb_resolver()
 ^
/home/buildbot/worker/as-builder-7/ramdisk/llvm-nvptx-nvidia-ubuntu/llvm-project/llvm/test/Transforms/GlobalOpt/resolve-fmv-ifunc.ll:116:10: error: CHECK: expected string not found in input
; CHECK: [[CALL:%.*]] = tail call i32 @test_multi_bb_resolver._Mmops()
         ^
<stdin>:104:53: note: scanning from here
define i32 @caller2._Mmops() local_unnamed_addr #3 {
...

labrinea · 2025-01-17T11:17:35Z

Please review the fix here #123322

labrinea · 2025-01-17T14:31:35Z

The llvm test suite builder https://lab.llvm.org/buildbot/#/builders/199/builds/1002 is broken too and I am investigating if it relates to this change.

…vm#87939) To deduce whether the optimization is legal we need to compare the target features between caller and callee versions. The criteria for bypassing the resolver are the following: * If the callee's feature set is a subset of the caller's feature set, then the callee is a candidate for direct call. * Among such candidates the one of highest priority is the best match and it shall be picked, unless there is a version of the callee with higher priority than the best match which cannot be picked from a higher priority caller (directly or through the resolver). * For every higher priority callee version than the best match, there is a higher priority caller version whose feature set availability is implied by the callee's feature set. Example: Callers and Callees are ordered in decreasing priority. The arrows indicate successful call redirections. Caller Callee Explanation ========================================================================= mops+sve2 --+--> mops all the callee versions are subsets of the | caller but mops has the highest priority | mops --+ sve2 between mops and default callees, mops wins sve sve between sve and default callees, sve wins but sve2 does not have a high priority caller default -----> default sve (callee) implies sve (caller), sve2(callee) implies sve (caller), mops(callee) implies mops(caller)

labrinea · 2025-01-17T17:26:50Z

Verdict for the llvm-test-suite regression: What I am thinking is that the compiler assumes that we have predres because of mcpu=neoverse-v2, but in reality we don't have it on the target, and since the optimization is skipping the runtime check, the test fails. Perhaps it's only safe to perform the optimization when the caller is a versioned function. In this example the caller is main, which is a regular function.

labrinea · 2025-01-17T18:19:23Z

Fix -> #123383

jroelofs · 2025-01-17T18:20:04Z

compiler assumes that we have predres because of mcpu=neoverse-v2, but in reality we don't have it on the target

That feels like user error / broken test to me. If a user specifies an mcpu, the compiler should be free to assume that it can freely emit code for that target.

labrinea · 2025-01-17T18:24:59Z

Yeah, maybe. How shall we change the test though?

jroelofs · 2025-01-17T18:26:17Z

Where is it getting that mpcu from?

labrinea · 2025-01-17T18:31:59Z

https://lab.llvm.org/buildbot/#/builders/198/builds/1232/steps/5/logs/stdio

cmake -G Ninja ../llvm/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=True '-DLLVM_LIT_ARGS='"'"'-v'"'"'' -DCMAKE_INSTALL_PREFIX=../stage1.install '-DCMAKE_C_FLAGS='"'"'-mcpu=neoverse-v2'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-mcpu=neoverse-v2'"'"'' -DLLVM_ENABLE_LLD=True -DMLIR_INCLUDE_INTEGRATION_TESTS=True -DMLIR_RUN_ARM_SVE_TESTS=True '-DLLVM_ENABLE_PROJECTS=clang-tools-extra;lld;mlir;llvm;clang;flang' -DLLVM_ENABLE_RUNTIMES=compiler-rt

I am wondering if it would be better to build with mcpu=native

jroelofs · 2025-01-17T18:40:30Z

Looks like that builder is:

Fujitsu FX700 48x A64FX; RAM 185GB OS: Ubuntu 22.04.5 LTS Kernel: 6.8.0-1019-aws #21-Ubuntu SMP Wed Nov 6 21:53:02 UTC 2024 Compiler: clang version 18.1.8 (http://git.linaro.org/toolchain/jenkins-scripts.git e5def089cd9f5aa71524f82fef301ca66eaa38d2) System Linker: GNU ld (GNU Binutils for Ubuntu) 2.38 C Library: ldd (Ubuntu GLIBC 2.35-0ubuntu3.8) 2.35 CMake: cmake version 3.28.6

https://lab.llvm.org/buildbot/#/workers/158

so -mcpu=a64fx would be much more appropriate. ~~@rengolin do you know who runs the linaro bots?~~ edit: just realized there's an admin list just to the left of the thing I just quoted. cc: @DavidSpickett

labrinea · 2025-01-17T18:47:55Z

Since it's late Friday and the buildbots are still red, shall we temporarily land #123383 and revisit this early next week?

jroelofs · 2025-01-17T18:49:36Z

SGTM

labrinea · 2025-01-20T20:42:50Z

Update: @DavidSpickett has fixed the host info on the linaro-g4 builders and it is now correctly showing Neoverse-V2. From my end I tried the following test on Graviton4 (behaves the same with -mcpu=native):

$ cat specres.c 
#include <stdio.h>

__attribute__((target_version("predres"))) void f(void) { printf("Has FEAT_SPECRES\n"); }
__attribute__((target_version("default"))) void f(void) { printf("Does not have FEAT_SPECRES\n"); }

int main() {
    f();
    asm volatile (
      "cfp rctx, x0" "\n"
      "dvp rctx, x1" "\n"
      "cpp rctx, x2" "\n"
    );
    return 0;
}
$ ./release/bin/clang --target=aarch64-linux-gnu -mcpu=neoverse-v2 specres.c -O0 -rtlib=compiler-rt
$ ./a.out 
Does not have FEAT_SPECRES
Illegal instruction (core dumped)

The runtime detection suggests predres isn't available even though ArmARM says

FEAT_SPECRES is mandatory from Armv8.5.

I am now trying to understand whether this has to do with reading ID_AA64ISAR1_EL1 from EL0 and whether the CFP/DVP/CPP RCTX , Xt instructions are valid at EL0. ArmARM says yes, but

In AArch64 state, EL0 access to the System instructions is controlled by:
• When HCR_EL2.{E2H, TGE} is not {1, 1}, SCTLR_EL1.EnRCTX.
• When HCR_EL2.{E2H, TGE} == {1, 1}, SCTLR_EL2.EnRCTX.

Also

Accessing ID_AA64ISAR1_EL1 with MRS <Xt>, ID_AA64ISAR1_EL1

if PSTATE.EL == EL0 then
  if IsFeatureImplemented(FEAT_IDST) then
    if EL2Enabled() && HCR_EL2.TGE == '1' then
      AArch64.SystemAccessTrap(EL2, 0x18);
    else
      AArch64.SystemAccessTrap(EL1, 0x18);
  else
    UNDEFINED;

I am suspecting we may have been misdetecting FEAT_SPECRES and FEAT_LS64 on linux targets all along, since they are the only two FMV features without a corresponding HWCAP.

labrinea · 2025-01-21T16:07:22Z

Reading more on https://www.kernel.org/doc/html/latest/arch/arm64/cpu-feature-registers.html my understanding is that some feature bits from ID_AA64ISAR1_EL1 may be visible, but specres doesn't seem to be one of them. Also my naive interpretation of /~https://github.com/torvalds/linux/blob/master/arch/arm64/kernel/cpufeature.c#L236 is that SPECRES is hidden.

labrinea · 2025-01-21T17:43:33Z

Accessing a restricted system register from an application generates an exception and ends up in SIGILL being delivered to the process. The infrastructure hooks into the exception handler and emulates the operation if the source belongs to the supported system register space.

@jroelofs I have confirmed the above with the linux kernel folks. SPECRES and LS64 are hidden features so reading those with MRS always yields zero. This shouldn't affect your platform though.

jroelofs · 2025-01-21T18:42:15Z

So should we disable/XFAIL or mark that test UNSUPPORTED on Linux platforms?

labrinea · 2025-01-22T17:56:04Z

I am thinking that the SPECRES instructions would trap when executed from EL0 on Darwin too (where they can be correctly detected), just like this example. We just don't have a bot to catch it.

My reading of ArmARM makes me think that LS64 is also unusable from EL0, but I may be wrong:

SCTLR_EL1.EnALS, bit [56] : When FEAT_LS64 is implemented:
When the Effective value of HCR_EL2.{E2H, TGE} is not {1, 1}, traps execution of an LD64B or ST64B instruction at EL0 to EL1.

SCTLR_EL1.EnAS0, bit [55] : When FEAT_LS64_ACCDATA is implemented:
When the Effective value of HCR_EL2.{E2H, TGE} is not {1, 1}, traps execution of an ST64BV0 instruction at EL0 to EL1.

SCTLR_EL1.EnASR, bit [54] : When FEAT_LS64_V is implemented:
When the Effective value of HCR_EL2.{E2H, TGE} is not {1, 1}, traps execution of an ST64BV instruction at EL0 to EL1.

The pseudocode in CheckLDST64BEnabled(), CheckST64BV0Enabled() and CheckST64BVEnabled() suggests the same.

So perhaps we should remove those features. I am trying to clarify internally.

labrinea requested review from jroelofs, ilinpv and DanielKristofKiss April 7, 2024 20:34

llvmbot added backend:AArch64 llvm:analysis llvm:transforms labels Apr 7, 2024

labrinea requested a review from nikic April 7, 2024 20:35

labrinea force-pushed the fmv-resolve-ifunc branch from 3ce5520 to 54ffdf4 Compare April 7, 2024 20:43

This was referenced Apr 7, 2024

[FMV] Multi Versioned Inlining #71714

Open

[llvm][GlobalOpt] Optimize statically resolvable IFuncs #80606

Merged

labrinea mentioned this pull request Apr 7, 2024

[clang][FMV] Pass the '+fmv' target-feature when FMV is enabled. #87942

Closed

jroelofs reviewed Apr 8, 2024

View reviewed changes

llvm/lib/Transforms/IPO/GlobalOpt.cpp Outdated Show resolved Hide resolved

labrinea force-pushed the fmv-resolve-ifunc branch 2 times, most recently from 9007fb7 to fe66c7d Compare April 9, 2024 21:38

labrinea force-pushed the fmv-resolve-ifunc branch from fe66c7d to 02bd5a7 Compare April 9, 2024 23:53

labrinea mentioned this pull request Apr 17, 2024

[FMV] Rectify incomplete ExtensionInfo entries in TargetParser. #89106

Closed

labrinea mentioned this pull request Jul 16, 2024

[FMV] Remove features without compiler support and consolidate others. ARM-software/acle#329

Closed

8 tasks

labrinea mentioned this pull request Jul 18, 2024

[FMV][AArch64] Do not optimize away runtime checks for implied features #99522

Merged

labrinea commented Jan 10, 2025

View reviewed changes

labrinea commented Jan 13, 2025

View reviewed changes

jroelofs approved these changes Jan 16, 2025

View reviewed changes

llvm/include/llvm/Analysis/TargetTransformInfo.h Outdated Show resolved Hide resolved

labrinea added 2 commits January 16, 2025 21:21

Merge branch 'main' into fmv-resolve-ifunc

881cb14

Make the argument to getFeatureMask and isMultiversionedFunction a co…

52a03fb

…nstant reference.

labrinea merged commit 831527a into llvm:main Jan 17, 2025
8 checks passed

labrinea deleted the fmv-resolve-ifunc branch January 17, 2025 10:49

jroelofs mentioned this pull request Jan 17, 2025

[FMV][GlobalOpt] Do not statically resolve non-FMV callers. #123383

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FMV][GlobalOpt] Statically resolve calls to versioned functions. #87939

[FMV][GlobalOpt] Statically resolve calls to versioned functions. #87939

labrinea commented Apr 7, 2024 •

edited

Loading

llvmbot commented Apr 7, 2024 •

edited

Loading

andrewcarlotti commented Apr 8, 2024

labrinea commented Apr 8, 2024 •

edited

Loading

labrinea commented Apr 8, 2024 •

edited

Loading

labrinea commented Apr 10, 2024

labrinea commented Apr 11, 2024

labrinea commented May 21, 2024

jroelofs commented May 21, 2024

labrinea Jan 10, 2025

labrinea Jan 13, 2025

labrinea commented Jan 16, 2025

jroelofs left a comment

llvm-ci commented Jan 17, 2025

llvm-ci commented Jan 17, 2025

jplehr commented Jan 17, 2025

llvm-ci commented Jan 17, 2025

labrinea commented Jan 17, 2025 •

edited

Loading

labrinea commented Jan 17, 2025

labrinea commented Jan 17, 2025

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 17, 2025 •

edited

Loading

jroelofs commented Jan 17, 2025 •

edited

Loading

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 20, 2025 •

edited

Loading

labrinea commented Jan 21, 2025

labrinea commented Jan 21, 2025

jroelofs commented Jan 21, 2025

labrinea commented Jan 22, 2025 •

edited

Loading

		@@ -0,0 +1,365 @@
		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter "call i32 @(test_single_bb_resolver\|test_multi_bb_resolver\|test_caller_feats_not_implied\|test_non_fmv_caller\|test_priority\|test_alternative_names)" --version 4
		; RUN: opt --passes=globalopt -o - -S < %s \| FileCheck %s

[FMV][GlobalOpt] Statically resolve calls to versioned functions. #87939

[FMV][GlobalOpt] Statically resolve calls to versioned functions. #87939

Conversation

labrinea commented Apr 7, 2024 • edited Loading

llvmbot commented Apr 7, 2024 • edited Loading

andrewcarlotti commented Apr 8, 2024

labrinea commented Apr 8, 2024 • edited Loading

labrinea commented Apr 8, 2024 • edited Loading

labrinea commented Apr 10, 2024

labrinea commented Apr 11, 2024

labrinea commented May 21, 2024

jroelofs commented May 21, 2024

labrinea Jan 10, 2025

Choose a reason for hiding this comment

labrinea Jan 13, 2025

Choose a reason for hiding this comment

labrinea commented Jan 16, 2025

jroelofs left a comment

Choose a reason for hiding this comment

llvm-ci commented Jan 17, 2025

llvm-ci commented Jan 17, 2025

jplehr commented Jan 17, 2025

llvm-ci commented Jan 17, 2025

labrinea commented Jan 17, 2025 • edited Loading

labrinea commented Jan 17, 2025

labrinea commented Jan 17, 2025

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 17, 2025 • edited Loading

jroelofs commented Jan 17, 2025 • edited Loading

labrinea commented Jan 17, 2025

jroelofs commented Jan 17, 2025

labrinea commented Jan 20, 2025 • edited Loading

labrinea commented Jan 21, 2025

labrinea commented Jan 21, 2025

jroelofs commented Jan 21, 2025

labrinea commented Jan 22, 2025 • edited Loading

labrinea commented Apr 7, 2024 •

edited

Loading

llvmbot commented Apr 7, 2024 •

edited

Loading

labrinea commented Apr 8, 2024 •

edited

Loading

labrinea commented Apr 8, 2024 •

edited

Loading

labrinea commented Jan 17, 2025 •

edited

Loading

labrinea commented Jan 17, 2025 •

edited

Loading

jroelofs commented Jan 17, 2025 •

edited

Loading

labrinea commented Jan 20, 2025 •

edited

Loading

labrinea commented Jan 22, 2025 •

edited

Loading