Reland "[LV]: Teach LV to recursively (de)interleave." #125094
Conversation
This commit relands the changes from "[LV]: Teach LV to recursively (de)interleave." (llvm#122989). Reason for the original revert: the patch exposed an assert in the vectorizer caused by a VF difference between the legacy cost model and the VPlan-based cost model, due to an uncomputed cost for a VPInstruction that VPlanTransforms creates as a replacement for an 'or disjoint' instruction. The fix is to skip the legacy-versus-VPlan cost comparison for interleave factors greater than 2.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers

Author: Hassnaa Hamdi (hassnaaHamdi)

Changes: This patch relands the changes from "[LV]: Teach LV to recursively (de)interleave." (llvm#122989).

Patch is 207.41 KiB, truncated to 20.00 KiB below; full version: /~https://github.com/llvm/llvm-project/pull/125094.diff

7 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 5375d2be9c8751..1b05e09a95ab6d 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3400,10 +3400,10 @@ bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(
if (hasIrregularType(ScalarTy, DL))
return false;
- // We currently only know how to emit interleave/deinterleave with
- // Factor=2 for scalable vectors. This is purely an implementation
- // limit.
- if (VF.isScalable() && InterleaveFactor != 2)
+ // For scalable vectors, the only interleave factor currently supported
+ // must be power of 2 since we require the (de)interleave2 intrinsics
+ // instead of shufflevectors.
+ if (VF.isScalable() && !isPowerOf2_32(InterleaveFactor))
return false;
// If the group involves a non-integral pointer, we may not be able to
@@ -7434,6 +7434,12 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
for (VPRecipeBase &R : *VPBB) {
if (auto *IR = dyn_cast<VPInterleaveRecipe>(&R)) {
auto *IG = IR->getInterleaveGroup();
+ // The legacy-based cost model is more accurate for interleaving and
+ // comparing against the VPlan-based cost isn't desirable.
+ // At least skip interleaving with factor > 2 as higher factors
+ // cause higher cost difference.
+ if (IG->getFactor() > 2)
+ return true;
unsigned NumMembers = IG->getNumMembers();
for (unsigned I = 0; I != NumMembers; ++I) {
if (Instruction *M = IG->getMember(I))
@@ -9279,9 +9285,9 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
CM.getWideningDecision(IG->getInsertPos(), VF) ==
LoopVectorizationCostModel::CM_Interleave);
// For scalable vectors, the only interleave factor currently supported
- // is 2 since we require the (de)interleave2 intrinsics instead of
- // shufflevectors.
- assert((!Result || !VF.isScalable() || IG->getFactor() == 2) &&
+ // must be power of 2 since we require the (de)interleave2 intrinsics
+ // instead of shufflevectors.
+ assert((!Result || !VF.isScalable() || isPowerOf2_32(IG->getFactor())) &&
"Unsupported interleave factor for scalable vectors");
return Result;
};
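The hunks above relax the scalable-vector gate from "factor exactly 2" to "factor is a power of two", since a factor-F (de)interleave can be assembled from log2(F) layers of the factor-2 intrinsics. As a minimal sketch of the predicate's semantics, here is a stand-in for llvm::isPowerOf2_32 (not the LLVM implementation itself):

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for llvm::isPowerOf2_32: true exactly for 1, 2, 4, 8, ...
// Note that 0 is rejected; interleave groups in practice have factor >= 2.
static bool isPowerOf2_32(uint32_t V) { return V != 0 && (V & (V - 1)) == 0; }
```

So factors 2, 4, and 8 now pass the check for scalable VFs, while non-power-of-two factors such as 3 or 6 are still rejected.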
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 2679ed6b26b5d1..7b2fab146945bf 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -2868,10 +2868,21 @@ static Value *interleaveVectors(IRBuilderBase &Builder, ArrayRef<Value *> Vals,
// Scalable vectors cannot use arbitrary shufflevectors (only splats), so
// must use intrinsics to interleave.
if (VecTy->isScalableTy()) {
- VectorType *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
- return Builder.CreateIntrinsic(WideVecTy, Intrinsic::vector_interleave2,
- Vals,
- /*FMFSource=*/nullptr, Name);
+ assert(isPowerOf2_32(Factor) && "Unsupported interleave factor for "
+ "scalable vectors, must be power of 2");
+ SmallVector<Value *> InterleavingValues(Vals);
+ // When interleaving, the number of values will be shrunk until we have the
+ // single final interleaved value.
+ auto *InterleaveTy = cast<VectorType>(InterleavingValues[0]->getType());
+ for (unsigned Midpoint = Factor / 2; Midpoint > 0; Midpoint /= 2) {
+ InterleaveTy = VectorType::getDoubleElementsVectorType(InterleaveTy);
+ for (unsigned I = 0; I < Midpoint; ++I)
+ InterleavingValues[I] = Builder.CreateIntrinsic(
+ InterleaveTy, Intrinsic::vector_interleave2,
+ {InterleavingValues[I], InterleavingValues[Midpoint + I]},
+ /*FMFSource=*/nullptr, Name);
+ }
+ return InterleavingValues[0];
}
// Fixed length. Start by concatenating all vectors into a wide vector.
@@ -2957,15 +2968,11 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
&InterleaveFactor](Value *MaskForGaps) -> Value * {
if (State.VF.isScalable()) {
assert(!MaskForGaps && "Interleaved groups with gaps are not supported.");
- assert(InterleaveFactor == 2 &&
+ assert(isPowerOf2_32(InterleaveFactor) &&
"Unsupported deinterleave factor for scalable vectors");
auto *ResBlockInMask = State.get(BlockInMask);
- SmallVector<Value *, 2> Ops = {ResBlockInMask, ResBlockInMask};
- auto *MaskTy = VectorType::get(State.Builder.getInt1Ty(),
- State.VF.getKnownMinValue() * 2, true);
- return State.Builder.CreateIntrinsic(
- MaskTy, Intrinsic::vector_interleave2, Ops,
- /*FMFSource=*/nullptr, "interleaved.mask");
+ SmallVector<Value *> Ops(InterleaveFactor, ResBlockInMask);
+ return interleaveVectors(State.Builder, Ops, "interleaved.mask");
}
if (!BlockInMask)
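In the mask hunk above, the group mask for a scalable masked access is now built by passing InterleaveFactor copies of the block mask through the same recursive interleave helper. Because the copies are identical, the net effect is each mask lane repeated Factor times. The following sketch models that on plain std::vector<bool> (illustration only, not the LLVM API):

```cpp
#include <vector>

// Interleaving Factor identical copies of a mask round-robin is equivalent
// to repeating each lane Factor times: m0 m0 .. m0 m1 m1 .. m1 ...
static std::vector<bool> interleavedMask(const std::vector<bool> &BlockMask,
                                         unsigned Factor) {
  std::vector<bool> Wide;
  Wide.reserve(BlockMask.size() * Factor);
  for (bool Lane : BlockMask)
    for (unsigned Copy = 0; Copy < Factor; ++Copy)
      Wide.push_back(Lane);
  return Wide;
}
```

This matches the nested interleave2 calls on [[TMP22]] visible in the CHECK lines of the updated test: the same <vscale x 16 x i1> mask is widened to <vscale x 64 x i1> for a factor-4 group.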
@@ -3005,22 +3012,48 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
ArrayRef<VPValue *> VPDefs = definedValues();
const DataLayout &DL = State.CFG.PrevBB->getDataLayout();
if (VecTy->isScalableTy()) {
- assert(InterleaveFactor == 2 &&
+ assert(isPowerOf2_32(InterleaveFactor) &&
"Unsupported deinterleave factor for scalable vectors");
- // Scalable vectors cannot use arbitrary shufflevectors (only splats),
- // so must use intrinsics to deinterleave.
- Value *DI = State.Builder.CreateIntrinsic(
- Intrinsic::vector_deinterleave2, VecTy, NewLoad,
- /*FMFSource=*/nullptr, "strided.vec");
- unsigned J = 0;
- for (unsigned I = 0; I < InterleaveFactor; ++I) {
- Instruction *Member = Group->getMember(I);
+ // Scalable vectors cannot use arbitrary shufflevectors (only splats),
+ // so must use intrinsics to deinterleave.
+ SmallVector<Value *> DeinterleavedValues(InterleaveFactor);
+ DeinterleavedValues[0] = NewLoad;
+ // For the case of InterleaveFactor > 2, we will have to do recursive
+ // deinterleaving, because the current available deinterleave intrinsic
+ // supports only Factor of 2, otherwise it will bailout after first
+ // iteration.
+ // When deinterleaving, the number of values will double until we
+ // have "InterleaveFactor".
+ for (unsigned NumVectors = 1; NumVectors < InterleaveFactor;
+ NumVectors *= 2) {
+ // Deinterleave the elements within the vector
+ SmallVector<Value *> TempDeinterleavedValues(NumVectors);
+ for (unsigned I = 0; I < NumVectors; ++I) {
+ auto *DiTy = DeinterleavedValues[I]->getType();
+ TempDeinterleavedValues[I] = State.Builder.CreateIntrinsic(
+ Intrinsic::vector_deinterleave2, DiTy, DeinterleavedValues[I],
+ /*FMFSource=*/nullptr, "strided.vec");
+ }
+ // Extract the deinterleaved values:
+ for (unsigned I = 0; I < 2; ++I)
+ for (unsigned J = 0; J < NumVectors; ++J)
+ DeinterleavedValues[NumVectors * I + J] =
+ State.Builder.CreateExtractValue(TempDeinterleavedValues[J], I);
+ }
- if (!Member)
+#ifndef NDEBUG
+ for (Value *Val : DeinterleavedValues)
+ assert(Val && "NULL Deinterleaved Value");
+#endif
+ for (unsigned I = 0, J = 0; I < InterleaveFactor; ++I) {
+ Instruction *Member = Group->getMember(I);
+ Value *StridedVec = DeinterleavedValues[I];
+ if (!Member) {
+ // This value is not needed as it's not used
+ cast<Instruction>(StridedVec)->eraseFromParent();
continue;
-
- Value *StridedVec = State.Builder.CreateExtractValue(DI, I);
+ }
// If this member has different type, cast the result type.
if (Member->getType() != ScalarTy) {
VectorType *OtherVTy = VectorType::get(Member->getType(), State.VF);
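The interleave loop above halves the number of live values each round (pairing value I with value Midpoint + I), while the deinterleave loop doubles them (placing the two results of deinterleave2 at indices NumVectors * 0 + J and NumVectors * 1 + J). The sketch below models both recursions on ordinary int vectors purely to illustrate the index scheme; the helper names are hypothetical and this is not the LLVM code. For factor 4, interleaving the stride-4 members yields the round-robin wide vector, and deinterleaving inverts it exactly:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<int>;

// Model of llvm.vector.interleave2: zip two equal-length vectors.
static Vec interleave2(const Vec &A, const Vec &B) {
  Vec R;
  R.reserve(A.size() * 2);
  for (size_t I = 0; I < A.size(); ++I) {
    R.push_back(A[I]);
    R.push_back(B[I]);
  }
  return R;
}

// Model of llvm.vector.deinterleave2: split into even and odd lanes.
static std::pair<Vec, Vec> deinterleave2(const Vec &V) {
  Vec Even, Odd;
  for (size_t I = 0; I < V.size(); I += 2) {
    Even.push_back(V[I]);
    Odd.push_back(V[I + 1]);
  }
  return {Even, Odd};
}

// Mirrors the Midpoint loop: halve the value count until one remains.
static Vec interleaveRecursive(std::vector<Vec> Vals) {
  for (size_t Midpoint = Vals.size() / 2; Midpoint > 0; Midpoint /= 2)
    for (size_t I = 0; I < Midpoint; ++I)
      Vals[I] = interleave2(Vals[I], Vals[Midpoint + I]);
  return Vals[0];
}

// Mirrors the NumVectors loop: double the value count until Factor.
static std::vector<Vec> deinterleaveRecursive(const Vec &Wide, size_t Factor) {
  std::vector<Vec> Vals(Factor);
  Vals[0] = Wide;
  for (size_t NumVectors = 1; NumVectors < Factor; NumVectors *= 2) {
    // Snapshot the live values before overwriting their slots.
    std::vector<Vec> Tmp(Vals.begin(), Vals.begin() + NumVectors);
    for (size_t J = 0; J < NumVectors; ++J) {
      auto Halves = deinterleave2(Tmp[J]);
      Vals[NumVectors * 0 + J] = Halves.first;
      Vals[NumVectors * 1 + J] = Halves.second;
    }
  }
  return Vals;
}
```

For a factor-4 group this corresponds to the two layers of llvm.vector.deinterleave2 calls visible in the CHECK lines of the updated test below.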
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses-cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses-cost.ll
index 564fba65e62389..5c6415d976539f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses-cost.ll
@@ -10,32 +10,123 @@ define void @test_masked_interleave(ptr noalias %A, ptr noalias %B, ptr noalias
; CHECK-LABEL: define void @test_masked_interleave(
; CHECK-SAME: ptr noalias [[A:%.*]], ptr noalias [[B:%.*]], ptr noalias [[C:%.*]]) #[[ATTR0:[0-9]+]] {
; CHECK-NEXT: [[ENTRY:.*]]:
-; CHECK-NEXT: br label %[[LOOP_HEADER:.*]]
-; CHECK: [[LOOP_HEADER]]:
-; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 16
+; CHECK-NEXT: br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; CHECK: [[VECTOR_PH]]:
+; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 16
+; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 252, [[TMP3]]
+; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
+; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[TMP4]], i64 [[TMP3]], i64 [[N_MOD_VF]]
+; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 252, [[TMP5]]
+; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP7:%.*]] = mul i64 [[TMP6]], 16
+; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[N_VEC]], 4
+; CHECK-NEXT: [[TMP9:%.*]] = call <vscale x 16 x i64> @llvm.stepvector.nxv16i64()
+; CHECK-NEXT: [[TMP10:%.*]] = mul <vscale x 16 x i64> [[TMP9]], splat (i64 4)
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 16 x i64> zeroinitializer, [[TMP10]]
+; CHECK-NEXT: [[TMP11:%.*]] = mul i64 4, [[TMP7]]
+; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 16 x i64> poison, i64 [[TMP11]], i64 0
+; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 16 x i64> [[DOTSPLATINSERT]], <vscale x 16 x i64> poison, <vscale x 16 x i32> zeroinitializer
+; CHECK-NEXT: [[TMP12:%.*]] = call <vscale x 16 x i32> @llvm.stepvector.nxv16i32()
+; CHECK-NEXT: [[TMP13:%.*]] = mul <vscale x 16 x i32> [[TMP12]], splat (i32 4)
+; CHECK-NEXT: [[INDUCTION1:%.*]] = add <vscale x 16 x i32> zeroinitializer, [[TMP13]]
+; CHECK-NEXT: [[TMP14:%.*]] = trunc i64 [[TMP7]] to i32
+; CHECK-NEXT: [[TMP15:%.*]] = mul i32 4, [[TMP14]]
+; CHECK-NEXT: [[DOTSPLATINSERT2:%.*]] = insertelement <vscale x 16 x i32> poison, i32 [[TMP15]], i64 0
+; CHECK-NEXT: [[DOTSPLAT3:%.*]] = shufflevector <vscale x 16 x i32> [[DOTSPLATINSERT2]], <vscale x 16 x i32> poison, <vscale x 16 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 16 x ptr> poison, ptr [[C]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 16 x ptr> [[BROADCAST_SPLATINSERT]], <vscale x 16 x ptr> poison, <vscale x 16 x i32> zeroinitializer
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT13:%.*]] = insertelement <vscale x 16 x ptr> poison, ptr [[B]], i64 0
+; CHECK-NEXT: [[BROADCAST_SPLAT14:%.*]] = shufflevector <vscale x 16 x ptr> [[BROADCAST_SPLATINSERT13]], <vscale x 16 x ptr> poison, <vscale x 16 x i32> zeroinitializer
+; CHECK-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK: [[VECTOR_BODY]]:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 16 x i64> [ [[INDUCTION]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[VEC_IND4:%.*]] = phi <vscale x 16 x i32> [ [[INDUCTION1]], %[[VECTOR_PH]] ], [ [[VEC_IND_NEXT5:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-NEXT: [[OFFSET_IDX:%.*]] = mul i64 [[INDEX]], 4
+; CHECK-NEXT: [[IV:%.*]] = add i64 [[OFFSET_IDX]], 0
; CHECK-NEXT: [[IV_1:%.*]] = or disjoint i64 [[IV]], 1
; CHECK-NEXT: [[GEP_A_1:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV_1]]
-; CHECK-NEXT: [[L_1:%.*]] = load i8, ptr [[GEP_A_1]], align 1
+; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <vscale x 64 x i8>, ptr [[GEP_A_1]], align 1
+; CHECK-NEXT: [[STRIDED_VEC:%.*]] = call { <vscale x 32 x i8>, <vscale x 32 x i8> } @llvm.vector.deinterleave2.nxv64i8(<vscale x 64 x i8> [[WIDE_VEC]])
+; CHECK-NEXT: [[TMP19:%.*]] = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } [[STRIDED_VEC]], 0
+; CHECK-NEXT: [[TMP20:%.*]] = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } [[STRIDED_VEC]], 1
+; CHECK-NEXT: [[STRIDED_VEC6:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[TMP19]])
+; CHECK-NEXT: [[STRIDED_VEC7:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[TMP20]])
+; CHECK-NEXT: [[TMP21:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC6]], 0
+; CHECK-NEXT: [[TMP22:%.*]] = icmp eq <vscale x 16 x i8> [[TMP21]], zeroinitializer
+; CHECK-NEXT: [[TMP23:%.*]] = add <vscale x 16 x i64> [[VEC_IND]], splat (i64 2)
+; CHECK-NEXT: [[TMP24:%.*]] = getelementptr i8, ptr [[A]], <vscale x 16 x i64> [[TMP23]]
+; CHECK-NEXT: [[TMP25:%.*]] = extractelement <vscale x 16 x ptr> [[TMP24]], i32 0
+; CHECK-NEXT: [[TMP26:%.*]] = getelementptr i8, ptr [[TMP25]], i32 -2
+; CHECK-NEXT: [[INTERLEAVED_MASK:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP22]], <vscale x 16 x i1> [[TMP22]])
+; CHECK-NEXT: [[INTERLEAVED_MASK8:%.*]] = call <vscale x 32 x i1> @llvm.vector.interleave2.nxv32i1(<vscale x 16 x i1> [[TMP22]], <vscale x 16 x i1> [[TMP22]])
+; CHECK-NEXT: [[INTERLEAVED_MASK9:%.*]] = call <vscale x 64 x i1> @llvm.vector.interleave2.nxv64i1(<vscale x 32 x i1> [[INTERLEAVED_MASK]], <vscale x 32 x i1> [[INTERLEAVED_MASK8]])
+; CHECK-NEXT: [[WIDE_MASKED_VEC:%.*]] = call <vscale x 64 x i8> @llvm.masked.load.nxv64i8.p0(ptr [[TMP26]], i32 1, <vscale x 64 x i1> [[INTERLEAVED_MASK9]], <vscale x 64 x i8> poison)
+; CHECK-NEXT: [[STRIDED_VEC10:%.*]] = call { <vscale x 32 x i8>, <vscale x 32 x i8> } @llvm.vector.deinterleave2.nxv64i8(<vscale x 64 x i8> [[WIDE_MASKED_VEC]])
+; CHECK-NEXT: [[TMP27:%.*]] = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } [[STRIDED_VEC10]], 0
+; CHECK-NEXT: [[TMP28:%.*]] = extractvalue { <vscale x 32 x i8>, <vscale x 32 x i8> } [[STRIDED_VEC10]], 1
+; CHECK-NEXT: [[STRIDED_VEC11:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[TMP27]])
+; CHECK-NEXT: [[STRIDED_VEC12:%.*]] = call { <vscale x 16 x i8>, <vscale x 16 x i8> } @llvm.vector.deinterleave2.nxv32i8(<vscale x 32 x i8> [[TMP28]])
+; CHECK-NEXT: [[TMP29:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC11]], 0
+; CHECK-NEXT: [[TMP30:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC12]], 0
+; CHECK-NEXT: [[TMP31:%.*]] = extractvalue { <vscale x 16 x i8>, <vscale x 16 x i8> } [[STRIDED_VEC11]], 1
+; CHECK-NEXT: [[TMP32:%.*]] = zext <vscale x 16 x i8> [[TMP31]] to <vscale x 16 x i32>
+; CHECK-NEXT: [[TMP33:%.*]] = shl <vscale x 16 x i32> [[TMP32]], splat (i32 2)
+; CHECK-NEXT: [[TMP34:%.*]] = zext <vscale x 16 x i8> [[TMP30]] to <vscale x 16 x i32>
+; CHECK-NEXT: [[TMP35:%.*]] = shl <vscale x 16 x i32> [[TMP34]], splat (i32 2)
+; CHECK-NEXT: [[TMP36:%.*]] = or <vscale x 16 x i32> [[TMP35]], [[TMP33]]
+; CHECK-NEXT: [[TMP37:%.*]] = zext <vscale x 16 x i8> [[TMP29]] to <vscale x 16 x i32>
+; CHECK-NEXT: [[TMP38:%.*]] = or <vscale x 16 x i32> [[TMP36]], [[TMP37]]
+; CHECK-NEXT: [[TMP39:%.*]] = shl <vscale x 16 x i32> [[TMP38]], splat (i32 2)
+; CHECK-NEXT: [[TMP40:%.*]] = or <vscale x 16 x i32> splat (i32 3), [[VEC_IND4]]
+; CHECK-NEXT: [[TMP41:%.*]] = or <vscale x 16 x i32> [[TMP39]], [[TMP40]]
+; CHECK-NEXT: [[TMP42:%.*]] = lshr <vscale x 16 x i32> [[TMP41]], splat (i32 2)
+; CHECK-NEXT: [[TMP43:%.*]] = lshr <vscale x 16 x i32> [[TMP32]], splat (i32 2)
+; CHECK-NEXT: [[TMP44:%.*]] = trunc <vscale x 16 x i32> [[TMP43]] to <vscale x 16 x i8>
+; CHECK-NEXT: call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP44]], <vscale x 16 x ptr> [[TMP24]], i32 1, <vscale x 16 x i1> [[TMP22]])
+; CHECK-NEXT: [[TMP45:%.*]] = lshr <vscale x 16 x i32> [[TMP32]], splat (i32 5)
+; CHECK-NEXT: [[TMP46:%.*]] = trunc <vscale x 16 x i32> [[TMP45]] to <vscale x 16 x i8>
+; CHECK-NEXT: call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP46]], <vscale x 16 x ptr> [[BROADCAST_SPLAT]], i32 1, <vscale x 16 x i1> [[TMP22]])
+; CHECK-NEXT: [[TMP47:%.*]] = trunc <vscale x 16 x i32> [[TMP42]] to <vscale x 16 x i8>
+; CHECK-NEXT: call void @llvm.masked.scatter.nxv16i8.nxv16p0(<vscale x 16 x i8> [[TMP47]], <vscale x 16 x ptr> [[BROADCAST_SPLAT14]], i32 1, <vscale x 16 x i1> [[TMP22]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP7]]
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 16 x i64> [[VEC_IND]], [[DOTSPLAT]]
+; CHECK-NEXT: [[VEC_IND_NEXT5]] = add <vscale x 16 x i32> [[VEC_IND4]], [[DOTSPLAT3]]
+; CHECK-NEXT: [[TMP48:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-NEXT: br i1 [[TMP48]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: [[MIDDLE_BLOCK]]:
+; CHECK-NEXT: br label %[[SCALAR_PH]]
+; CHECK: [[SCALAR_PH]]:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[TMP8]], %[[MIDDLE_BLOCK]] ], [ 0, %[[ENTRY]] ]
+; CHECK-NEXT: br label %[[LOOP_HEADER:.*]]
+; CHECK: [[LOOP_HEADER]]:
+; CHECK-NEXT: [[IV1:%.*]] = phi i64 [ [[IV_NEXT:%.*]], %[[LOOP_LATCH:.*]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-NEXT: [[IV_3:%.*]] = or disjoint i64 [[IV1]], 1
+; CHECK-NEXT: [[GEP_A_2:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV_3]]
+; CHECK-NEXT: [[L_1:%.*]] = load i8, ptr [[GEP_A_2]], align 1
; CHECK-NEXT: [[C_1:%.*]] = icmp eq i8 [[L_1]], 0
; CHECK-NEXT: br i1 [[C_1]], label %[[THEN:.*]], label %[[LOOP_LATCH]]
; CHECK: [[THEN]]:
-; CHECK-NEXT: [[IV_2:%.*]] = or disjoint i64 [[IV]], 2
+; CHECK-NEXT: [[IV_2:%.*]] = or disjoint i64 [[IV1]], 2
; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV_2]]
; CHECK-NEXT: [[L_2:%.*]] = load i8, ptr [[ARRAYIDX7]], align 1
; CHECK-NEXT: [[CONV8:%.*]] = zext i8 [[L_2]] to i32
; CHECK-NEXT: [[SHL:%.*]] = shl i32 [[CONV8]], 2
-; CHECK-NEXT: [[ADD9:%.*]] = or disjoint i64 [[IV]], 1
+; CHECK-NEXT: [[ADD9:%.*]] = or disjoint i64 [[IV1]], 1
; CHECK-NEXT: [[ARRAYIDX10:%.*]] = getelementptr i8, ptr [[A]], i64 [[ADD9]]
; CHECK-NEXT: [[TMP0:%.*]] = load i8, ptr [[ARRAYIDX10]], align 1
; CHECK-NEXT: [[CONV11:%.*]] = zext i8 [[TMP0]] to i32
; CHECK-NEXT: [[SHL12:%.*]] = shl i32 [[CONV11]], 2
; CHECK-NEXT: [[OR:%.*]] = or i32 [[SHL12]], [[SHL]]
-; CHECK-NEXT: [[B2:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV]]
+; CHECK-NEXT: [[B2:%.*]] = getelementptr i8, ptr [[A]], i64 [[IV1]]
; CHECK-NEXT: [[TMP1:%.*]] = load i8, ptr [[B2]], align 1
; CHECK-NEXT: [[CONV15:%.*]] = zext i8 [[TMP1]] to i32
; CHECK-NEXT: [[OR16:%.*]] = or i32 [[OR]], [[CONV15]]
; CHECK-NEXT: [[SHL17:%.*]] = shl i32 [[OR16]], 2
-; CHECK-NEXT: [[CONV19:%.*]] = trunc i64 [[IV]] to i32
+; CHECK-NEXT: [[CONV19:%.*]] = trunc i64 [[IV1]] to i32
; CHECK-NEXT: [[ADD20:%.*]] = or i32 3, [[CONV19]]
; CHECK-NEXT: [[DEST_0:%.*]] = or i32 [[SHL17]], [[ADD20]]
; CHECK-NEXT: [[SHR:%.*]] = lshr i32 [[DEST_0]], 2
@@ -49,9 +140,9 @@ define void @test_masked_interleave(ptr noalias %A, ptr noalias %B, ptr noalias
; CHECK-NEXT: store i8 [[CONV34]], ptr [[B]], align 1
; CHECK-NEXT: br label %[[LOOP_LATCH]]
; CHECK: [[LOOP_LATCH]]:
-; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV]], 4
-; CHECK-NEXT: [[EC:%.*]] = icmp ugt i64 [[IV]], 1000
-; CHECK-NEXT: br i1 [[EC]], label %[[EXIT:.*]], label %[[LOOP_HEADER]]
+; CHECK-NEXT: [[IV_NEXT]] = add i64 [[IV1]], 4
+; CHECK-NEXT: [[EC:%.*]] = icmp ugt i64...
[truncated]
If I understand correctly @hassnaaHamdi, you did try adding the VPlan add (the replacement for 'or disjoint') to the VPlan cost model via
LGTM. I am happy with this approach for now so that you can make progress on this patch. From what I understand, even if you implement VPInstruction::computeCost to properly cost all instructions in the VPlan, there is still a mismatch between the legacy and VPlan cost models. However, I still think it makes sense to pursue #125008 in parallel, with accurate costing for the add VPInstruction.
Please wait a day or so in case @fhahn has any objections.
I think I may be missing something. I tried to track down the other divergence in the cost model besides the one mentioned in #125008.
But when I tried something like #125434, there seem to be no other differences between the cost models with this patch.
Are there perhaps other cases where computing the cost for binary VPInstructions with underlying instructions isn't enough?
What you are saying is correct, but I did this patch because the opcode used by the VPlan-based cost model is 'Add' while the underlying instruction used by the legacy cost model is 'or'. Since they are ultimately different instructions, I can't guarantee that I will get the same costs from all backends.
@fhahn I tried your patch, and yes, it gave the same costs, and the patch is needed anyway for LoopVectorize itself.
Just out of interest @fhahn, do you have an idea of when this assert can be removed?
There shouldn't be a difference, as they should be interchangeable; otherwise we wouldn't have been able to generate them in the first place. I see the concern that there may be a difference between ADD and OR. I tried this on both X86 and AArch64, and at least for those it shouldn't make a difference in practice. My main concern is that the early bail-out is unrelated to the underlying issue (which could also happen without interleaving) and may hide other issues.
I marked it as ready.
There are still a number of cost computations that need to be updated, and for those the assert is still valuable I think (as this example also highlights, since it surfaced a case where the VPlan-based cost model missed the cost of some instructions). Now that we have branched for clang-20, I'll look into finishing the transition without risking introducing additional instability into the clang-20 release.
Okay, I see.
I have reverted the commit that was doing the bail-out for interleaving.
Sounds good to me, thanks!
…rlying values. (#125434) As exposed by llvm/llvm-project#125094, we are missing cost computation for some binary VPInstructions we created based on original IR instructions. Their cost should be considered. PR: llvm/llvm-project#125434
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/38/builds/2250
Here is the relevant piece of the build log for reference.
I tested that locally and it passed. Commits after mine are also working well, so I'll ignore this failure.
@hassnaaHamdi I am not sure whether new features can still be picked at this stage of the release (as they might introduce additional risk) or only bug fixes. You could either check with @tstellar first or create a PR and take it from there.
Failed to cherry-pick: f07cd36 /~https://github.com/llvm/llvm-project/actions/runs/13311289552 Please manually backport the fix and push it to your GitHub fork. Once this is done, please create a pull request.
Failed to cherry-pick: f07cd36 /~https://github.com/llvm/llvm-project/actions/runs/13311360045 Please manually backport the fix and push it to your GitHub fork. Once this is done, please create a pull request.
…es. llvm#125434 As exposed by llvm#125094, we are missing cost computation for some binary VPInstructions we created based on original IR instructions. Their cost should be considered. PR: llvm#125434 Author: Florian Hahn <flo@fhahn.com> Change-Id: Icf985b3f1cd40898a17faaf47b241e2651f9e8dd
@hassnaaHamdi Were you able to manually create the backport PR?
/cherry-pick e9a20f7
/pull-request #128389
This patch relands the changes from "[LV]: Teach LV to recursively (de)interleave." (#122989)
Reason for revert:
- The patch exposed an assert in the vectorizer related to a VF difference between
the legacy cost model and the VPlan-based cost model, caused by an uncomputed cost
for a VPInstruction created by VPlanTransforms as a replacement for an 'or disjoint'
instruction. VPlanTransforms performs this instruction change when there are memory
interleaving and predicated blocks; it did not cause problems before because in most
cases the cost difference between the legacy and new models is not noticeable.
- Issue is fixed by #125434
Original patch: #89018
Reviewed-by: paulwalker-arm, Mel-Chen