
topology-updater: "Scan failed: checking if pod in a namespace is watchable" #922

Closed
nmathew opened this issue Oct 17, 2022 · 4 comments · Fixed by #929
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@nmathew

nmathew commented Oct 17, 2022

Is this a bug? The timer is not retriggered.

/~https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/pkg/nfd-client/topology-updater/nfd-topology-updater.go#L137

This basically happens when a pod appears and disappears within a short time, e.g. a pod going from Pending to TopologyAffinityError. While Scan() is running, it issues a getPod call, and if the pod is already gone, it runs into this issue.

This is the error:
nfd-topology-updater.go:137] Scan failed: checking if pod in a namespace is watchable, namespace:default, pod name trex-sriov-intel-0-6: pods "trex-sriov-intel-0-6" not found
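
For context, here is a minimal, self-contained sketch of the kind of select-based loop being described. It is not the actual nfd-topology-updater code; scan, sleepInterval and the log calls are illustrative stand-ins. The point is that the time.After channel is only re-armed at the end of a successful iteration, so a continue on a Scan() error leaves it unarmed:

```go
// Illustrative sketch only; identifiers are assumptions, not the upstream code.
package main

import (
	"errors"
	"log"
	"time"
)

func scan() error {
	// Stand-in for the resource monitor Scan(); fails when a pod disappears mid-scan.
	return errors.New(`pods "trex-sriov-intel-0-6" not found`)
}

func main() {
	sleepInterval := 3 * time.Second

	crTrigger := time.After(0) // fire the first scan immediately
	for {
		select {
		case <-crTrigger:
			if err := scan(); err != nil {
				log.Printf("Scan failed: %v", err)
				continue // BUG: skips the re-arm below, so crTrigger never fires again
			}
			// ... aggregate and advertise the zone information ...

			// The timer is only re-armed here, after a successful iteration.
			crTrigger = time.After(sleepInterval)
		}
	}
}
```

With this pattern, a single failed scan leaves the loop blocked on a channel that will never deliver another value, which matches the behaviour reported above.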

Environment:

  • Kubernetes version (use kubectl version): 2.23.0
  • OS (e.g: cat /etc/os-release): SuSe
  • Kernel (e.g. uname -a): 5.3.18
nmathew added the kind/bug label on Oct 17, 2022
@marquiz
Contributor

marquiz commented Oct 17, 2022

ping @fromanirh @swatisehgal you know the code better than me

To me this just sounds like a race condition that can happen with short-lived pods. AFAIU this shouldn't be anything fatal and should be fixed (unless a new race occurs) on the next "round"

@ffromani
Contributor

> ping @fromanirh @swatisehgal you know the code better than me
>
> To me this just sounds like a race condition that can happen with short-lived pods. AFAIU this shouldn't be anything fatal and should be fixed (unless a new race occurs) on the next "round"

ack, will check

@marquiz
Contributor

marquiz commented Oct 17, 2022

Hmm, now that I took a quick look at the code I think we have a problem:

By calling continue we skip arming of the new timer and the event loop effectively stops. It would probably be better to start using time.Tick instead.
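
For illustration, a hedged sketch of that direction (using time.NewTicker rather than time.Tick so the ticker can be stopped; identifiers are placeholders, not the actual fix that landed in #929). A ticker channel keeps firing regardless of how an iteration exits, so continue on a scan error no longer stops the loop:

```go
// Sketch under assumed names; not the actual nfd-topology-updater code.
package main

import (
	"errors"
	"log"
	"time"
)

func scan() error {
	return errors.New("pod not found") // stand-in failure
}

func main() {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			if err := scan(); err != nil {
				log.Printf("Scan failed: %v", err)
				continue // safe now: the ticker fires again on the next interval
			}
			// ... aggregate and advertise the zone information ...
		}
	}
}
```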

@ffromani
Contributor

> Hmm, now that I took a quick look at the code I think we have a problem:
>
> By calling continue we skip arming of the new timer and the event loop effectively stops. It would probably be better to start using time.Tick instead.

Ah darn, that's bad indeed. Will send a PR ASAP.
Meanwhile we will sync internally about updating this codebase; we are experimenting in this area and there are possible improvements/refactorings we should be able to submit.
