
topology-updater: "Scan failed: checking if pod in a namespace is watchable" #922

Closed
nmathew opened this issue Oct 17, 2022 · 4 comments · Fixed by #929
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@nmathew

nmathew commented Oct 17, 2022

Is this a bug? The timer is not retriggered.

/~https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/pkg/nfd-client/topology-updater/nfd-topology-updater.go#L137

This basically happens when a pod appears and disappears within a short time, e.g. a pod going from Pending to TopologyAffinityError. While Scan() is running, it issues a getPod call, and if the pod is already gone, it runs into this issue.

This is the error:
nfd-topology-updater.go:137] Scan failed: checking if pod in a namespace is watchable, namespace:default, pod name trex-sriov-intel-0-6: pods "trex-sriov-intel-0-6" not found
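
For context, here is a minimal, self-contained sketch of the kind of select-based loop being described. It is not the actual nfd-topology-updater code; scan, sleepInterval and the log calls are illustrative stand-ins. The point is that the time.After channel is only re-armed at the end of a successful iteration, so a continue on a Scan() error leaves it unarmed:

```go
// Illustrative sketch only; identifiers are assumptions, not the upstream code.
package main

import (
	"errors"
	"log"
	"time"
)

func scan() error {
	// Stand-in for the resource monitor Scan(); fails when a pod disappears mid-scan.
	return errors.New(`pods "trex-sriov-intel-0-6" not found`)
}

func main() {
	sleepInterval := 3 * time.Second

	crTrigger := time.After(0) // fire the first scan immediately
	for {
		select {
		case <-crTrigger:
			if err := scan(); err != nil {
				log.Printf("Scan failed: %v", err)
				continue // BUG: skips the re-arm below, so crTrigger never fires again
			}
			// ... aggregate and advertise the zone information ...

			// The timer is only re-armed here, after a successful iteration.
			crTrigger = time.After(sleepInterval)
		}
	}
}
```

With this pattern, a single failed scan leaves the loop blocked on a channel that will never deliver another value, which matches the behaviour reported above.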

Environment:

  • Kubernetes version (use kubectl version): 2.23.0
  • OS (e.g: cat /etc/os-release): SuSe
  • Kernel (e.g. uname -a): 5.3.18
nmathew added the kind/bug label on Oct 17, 2022
@marquiz
Contributor

marquiz commented Oct 17, 2022

ping @fromanirh @swatisehgal you know the code better than me

To me this just sounds like a race condition that can happen with short-lived pods. AFAIU this shouldn't be anything fatal and should be fixed (unless a new race occurs) on the next "round"

@ffromani
Contributor

> ping @fromanirh @swatisehgal you know the code better than me
>
> To me this just sounds like a race condition that can happen with short-lived pods. AFAIU this shouldn't be anything fatal and should be fixed (unless a new race occurs) on the next "round"

ack, will check

@marquiz
Contributor

marquiz commented Oct 17, 2022

Hmm, now that I took a quick look at the code I think we have a problem:

By calling continue we skip arming of the new timer and the event loop effectively stops. It would probably be better to start using time.Tick instead.
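
For illustration, a hedged sketch of that direction (using time.NewTicker rather than time.Tick so the ticker can be stopped; identifiers are placeholders, not the actual fix that landed in #929). A ticker channel keeps firing regardless of how an iteration exits, so continue on a scan error no longer stops the loop:

```go
// Sketch under assumed names; not the actual nfd-topology-updater code.
package main

import (
	"errors"
	"log"
	"time"
)

func scan() error {
	return errors.New("pod not found") // stand-in failure
}

func main() {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			if err := scan(); err != nil {
				log.Printf("Scan failed: %v", err)
				continue // safe now: the ticker fires again on the next interval
			}
			// ... aggregate and advertise the zone information ...
		}
	}
}
```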

@ffromani
Contributor

> Hmm, now that I took a quick look at the code I think we have a problem:
>
> By calling continue we skip arming of the new timer and the event loop effectively stops. It would probably be better to start using time.Tick instead.

Ah darn, that's bad indeed. Will send a PR ASAP.
Meanwhile we will sync internally about updating this codebase; we are experimenting in this area and there are possible improvements/refactorings we should be able to submit.
