-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
serial tests fixes #1195
base: main
Are you sure you want to change the base?
serial tests fixes #1195
Conversation
Problem: Reboot tests fails frequently on teardown when comparing the quantaty of the device resources. The root cause is from the sample-device plugin but haven't yet fully investigated considering we **want** to move to real devices/ sriov simulation and due to capacity. What do you want? Make the serial suite less noisy due to to known issue, until we fix the root cause or implement the alternative: https://issues.redhat.com/browse/CNF-12824 How are you fixing the problem in this PR? similarly to memory resources deviation for reboot tests, allow devices quantity to change after reboot but under the condition that the ratios are the same before and after the test in the context of allocatables and availables; That means capacities of the same device may change before vs after the test, but unequal resources consumption is not tolerated. How did you test the change? On a cluster with sample devices: - run the reboot tests to verify we no longer hit a failure but a warning: ``` make binary-e2e-serial bin/e2e-nrop-serial.test --ginkgo.v --ginkgo.focus="reboot" " - run tier-0 tests. ``` Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add and use the MCPInfo.ToString() and print out the built info. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Previously using the MCP object that we built manually wasn't wrong because the specified config is empty anyway; still, do the correct use by pulling the updated object and use it in the MCPInfo object. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
This shouldn't be an issue if the original MCP that is configured in NROP is targeting worker nodes (MCP=worker); Still don't depend on this assumption and use the NRT nodes. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
When GinkgoHelper() was present (openshift-kni@f56658a) It was hard to follow the logs and understand which step failed while waiting for MCP update. Adding unique identifiers to each call helps track the start and end of each call and enhanced tracking down the root cause. The default identifier is `time.Now()` as a string. Additionally report config values and expectations of the call. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shajmakh The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
With the latest selinux updates, NROP will no longer require to reboot the nodes on CR updates unless a node group annottaion is enabled. The new selinux policy is default starting 4.18. However this wasn't backported to older versions so in this commit we adapt this change based on the existance of the annotation and the openshift version. This commit fixes the expected bahavior per each case and adds missing waiting loops for MCP updating conndition. The test passed with default config on 4.18 and on older versions. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
067b1b6
to
f94b584
Compare
Upgrade `isDegradedInternalError` to `isDegradedWith` to check different reasons instead of checking that the reason is particularly `status.ReasonInternalError`. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
The package has already `isDegradedWith` which verifies if the degraded reason and message is as expected, let's use it instead of duplicating the same check. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
proposal for rewriting the test is found here: #1196 |
/retest |
@@ -20,6 +20,7 @@ import ( | |||
"context" | |||
"encoding/json" | |||
"fmt" | |||
"github.com/openshift/cluster-node-tuning-operator/pkg/performanceprofile/controller/performanceprofile/resources" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes are not needed in this commit, should be cleanedup
} | ||
|
||
func resourceIsDevice(resName string) bool { | ||
// TODO avoid duplicates with e2e/tests/internal/fixture/devices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the todo need to be removed
return false | ||
} | ||
|
||
// TODO rename to EqualMemoryWithDeviation because it is dedicated for memory resource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a cleanup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mixed bag. Some commits are clear improvements, some other raises some question.
I can see the reason for all the changes, but some implementations can use some polishing, probably.
const ( | ||
DevType1EnvVar = "E2E_NROP_DEVICE_TYPE_1" | ||
DevType2EnvVar = "E2E_NROP_DEVICE_TYPE_2" | ||
DevType3EnvVar = "E2E_NROP_DEVICE_TYPE_3" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the benefit of moving here the constants?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they were kept in test/
we cannot use them in the internal/
directory because of cycle imports because test/
directory already imports internal/
. to solve this we move the common consts in internal and use them in test/
and internal helpers without issue. I believe it would help to have this change in separate commit. I'll split that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've grown less and less entusiast of constants-only packages (and yes, I'm guilty too) but this makes sense.
@@ -89,6 +102,59 @@ func EqualResourceInfo(resInfoA, resInfoB nrtv1alpha2.ResourceInfo, isRebootTest | |||
return true, nil | |||
} | |||
|
|||
func EqualDevicesWithWarning(a, b nrtv1alpha2.ResourceInfo) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to use "WithWarning", code is free to call the logger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't seem the right name indeed, will fix that. Thanks for the comment.
// sample devices; need a way to distinguish between real vs sample devices used in the ci | ||
if isRebootTest && resourceIsDevice(resInfoA.Name) { | ||
// TODO: Feb 2025: NROP serial is using sample-device plugin which has a known issue that after operations like node reboot or taints update, | ||
// quantity of the device reaources is messed up, so this is a workaround for reboot tests should be removed once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: resources
@@ -98,6 +97,10 @@ type mcpInfo struct { | |||
sampleNode corev1.Node | |||
} | |||
|
|||
func (i mcpInfo) ToString() string { | |||
return fmt.Sprintf("name %q; config %q; sample node %q\n", i.mcpObj.Name, i.initialConfig, i.sampleNode.Name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stringer functions, either String()
or our custom ToString()
lesser format should never add the trailing "\n"
nodesNameSet := e2enrt.AccumulateNames(nrts) | ||
// nrts > 1 is checked in the setup | ||
targetedNodeName, _ := nodesNameSet.PopAny() | ||
targetedNode := &corev1.Node{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a pointer? can be a local variable
@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration { | |||
} | |||
|
|||
func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good idea but I'm puzzled by the implementation. We don't have GinkgoHelper()
calls anymore, so failures should point to the specific test, should not they?
@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration { | |||
} | |||
|
|||
func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) { | |||
waitForMcpUpdateWithID(cli, ctx, mcpsInfo, updateType, time.Now().String()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary, internal functions can (and should) change API freely
return err | ||
}) | ||
} | ||
return eg.Wait() | ||
return eg.Wait() // will block the flow until goroutines are finished or return on first error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is well documented so this comment is redundant
@@ -1510,3 +1523,19 @@ func getLabelRoleWorker() string { | |||
func getLabelRoleMCPTest() string { | |||
return fmt.Sprintf("%s/%s", depnodes.LabelRole, roleMCPTest) | |||
} | |||
|
|||
func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool { | |||
GinkgoHelper() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so here we do want to call GinkgoHelper
?
please see commits for more details.