Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serial tests fixes #1195

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

shajmakh
Copy link
Member

please see commits for more details.

Problem:
Reboot tests fails frequently on teardown when comparing the quantaty of
the device resources. The root cause is from the sample-device plugin
but haven't yet fully investigated considering we **want** to move to
real devices/ sriov simulation and due to capacity.

What do you want?
Make the serial suite less noisy due to to known
issue, until we fix the root cause or implement the alternative:
https://issues.redhat.com/browse/CNF-12824

How are you fixing the problem in this PR?
similarly to memory resources deviation for reboot tests, allow devices
quantity to change after reboot but under the condition that the ratios
are the same before and after the test in the context of allocatables
and availables; That means capacities of the same device may change
before vs after the test, but unequal resources consumption is not
tolerated.

How did you test the change?
On a cluster with sample devices:
- run the reboot tests to verify we no
longer hit a failure but a warning:
```
make binary-e2e-serial
bin/e2e-nrop-serial.test --ginkgo.v --ginkgo.focus="reboot"
"
- run tier-0 tests.
```

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add and use the MCPInfo.ToString() and print out the built info.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Previously using the MCP object that we built manually wasn't wrong
because the specified config is empty anyway; still, do the correct use
by pulling the updated object and use it in the MCPInfo object.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
This shouldn't be an issue if the original MCP that is configured in
NROP is targeting worker nodes (MCP=worker); Still don't depend on this
assumption and use the NRT nodes.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
When GinkgoHelper() was present (openshift-kni@f56658a)
It was hard to follow the logs and understand which step failed while
waiting for MCP update. Adding unique identifiers to each call helps
track the start and end of each call and enhanced tracking down the root
cause. The default identifier is `time.Now()` as a string.
Additionally report config values and expectations of the call.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2025
@shajmakh shajmakh changed the title WIP: serail tests fixes Feb 19 WIP: serial tests fixes Feb 19 Feb 19, 2025
@openshift-ci openshift-ci bot requested review from mrniranjan and Tal-or February 19, 2025 10:51
Copy link
Contributor

openshift-ci bot commented Feb 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2025
With the latest selinux updates, NROP will no longer require to reboot
the nodes on CR updates unless a node group annottaion is enabled. The
new selinux policy is default starting 4.18. However this wasn't
backported to older versions so in this commit we adapt this change
based on the existance of the annotation and the openshift version.
This commit fixes the expected bahavior per each case and adds missing
waiting loops for MCP updating conndition.

The test passed with default config on 4.18 and on older versions.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@shajmakh shajmakh force-pushed the serial-fix-2-newversionof47674 branch 2 times, most recently from 067b1b6 to f94b584 Compare February 25, 2025 13:55
Upgrade `isDegradedInternalError` to `isDegradedWith` to check different
reasons instead of checking that the reason is particularly
`status.ReasonInternalError`.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
The package has already `isDegradedWith` which verifies if the degraded
reason and message is as expected, let's use it instead of duplicating
the same check.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@shajmakh
Copy link
Member Author

proposal for rewriting the test is found here: #1196

@shajmakh shajmakh changed the title WIP: serial tests fixes Feb 19 serial tests fixes Feb 19 Feb 26, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@shajmakh shajmakh changed the title serial tests fixes Feb 19 serial tests fixes Feb 26, 2025
@shajmakh
Copy link
Member Author

/retest

@@ -20,6 +20,7 @@ import (
"context"
"encoding/json"
"fmt"
"github.com/openshift/cluster-node-tuning-operator/pkg/performanceprofile/controller/performanceprofile/resources"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes are not needed in this commit, should be cleanedup

}

func resourceIsDevice(resName string) bool {
// TODO avoid duplicates with e2e/tests/internal/fixture/devices
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the todo need to be removed

return false
}

// TODO rename to EqualMemoryWithDeviation because it is dedicated for memory resource
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a cleanup

Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed bag. Some commits are clear improvements, some other raises some question.
I can see the reason for all the changes, but some implementations can use some polishing, probably.

Comment on lines +19 to +23
const (
DevType1EnvVar = "E2E_NROP_DEVICE_TYPE_1"
DevType2EnvVar = "E2E_NROP_DEVICE_TYPE_2"
DevType3EnvVar = "E2E_NROP_DEVICE_TYPE_3"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the benefit of moving here the constants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they were kept in test/ we cannot use them in the internal/ directory because of cycle imports because test/ directory already imports internal/. to solve this we move the common consts in internal and use them in test/ and internal helpers without issue. I believe it would help to have this change in separate commit. I'll split that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've grown less and less entusiast of constants-only packages (and yes, I'm guilty too) but this makes sense.

@@ -89,6 +102,59 @@ func EqualResourceInfo(resInfoA, resInfoB nrtv1alpha2.ResourceInfo, isRebootTest
return true, nil
}

func EqualDevicesWithWarning(a, b nrtv1alpha2.ResourceInfo) (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to use "WithWarning", code is free to call the logger

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem the right name indeed, will fix that. Thanks for the comment.

// sample devices; need a way to distinguish between real vs sample devices used in the ci
if isRebootTest && resourceIsDevice(resInfoA.Name) {
// TODO: Feb 2025: NROP serial is using sample-device plugin which has a known issue that after operations like node reboot or taints update,
// quantity of the device reaources is messed up, so this is a workaround for reboot tests should be removed once
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: resources

@@ -98,6 +97,10 @@ type mcpInfo struct {
sampleNode corev1.Node
}

func (i mcpInfo) ToString() string {
return fmt.Sprintf("name %q; config %q; sample node %q\n", i.mcpObj.Name, i.initialConfig, i.sampleNode.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stringer functions, either String() or our custom ToString() lesser format should never add the trailing "\n"

nodesNameSet := e2enrt.AccumulateNames(nrts)
// nrts > 1 is checked in the setup
targetedNodeName, _ := nodesNameSet.PopAny()
targetedNode := &corev1.Node{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a pointer? can be a local variable

@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration {
}

func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good idea but I'm puzzled by the implementation. We don't have GinkgoHelper() calls anymore, so failures should point to the specific test, should not they?

@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration {
}

func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) {
waitForMcpUpdateWithID(cli, ctx, mcpsInfo, updateType, time.Now().String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary, internal functions can (and should) change API freely

return err
})
}
return eg.Wait()
return eg.Wait() // will block the flow until goroutines are finished or return on first error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is well documented so this comment is redundant

@@ -1510,3 +1523,19 @@ func getLabelRoleWorker() string {
func getLabelRoleMCPTest() string {
return fmt.Sprintf("%s/%s", depnodes.LabelRole, roleMCPTest)
}

func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
GinkgoHelper()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so here we do want to call GinkgoHelper?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants