Give an indication in container events for probe failure as to whether the failure was ignored due to FailureThreshold #115823
Comments
/sig node |
/cc @RobertKielty |
/triage accepted
This would be a great improvement in user experience indeed. Thank you for providing details on exactly how this will be implemented. Once it is implemented, you can also contribute by updating the probes page https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ to mention this improvement, perhaps by starting a troubleshooting section for probes on that page.
/assign @intUnderflow
/good-first-issue |
let me work on this issue now please |
@ashutosh887 Is there any update on the issue? If not, can I take this up? Thanks |
Yes I'm working on it 🙂 |
/assign |
/assign |
Hey there, I have been stuck on this issue for a while now and just wanted to make sure I'm headed in the right direction. I'm looking at kubernetes/pkg/kubelet/prober/prober.go, lines 101 to 111 in 8ffbbe4,
where we would need to mention the keepGoing value in the |
/assign |
I would like to work on this issue. |
@saiaunghlyanhtet I have an already open PR, feel free to review it tho |
Thank you for your information. |
Probes of all kinds currently support FailureThreshold (and SuccessThreshold). These properties allow a user to specify that Kubernetes should not take action in response to a failed probe unless it fails a given number of times in succession.
This is useful for end-users as it allows them to mitigate the effects of any probes that "flake" by requiring successive failure.
When a probe fails in Kubernetes, we emit a container event indicating this here: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/prober.go#L110 and end-users can consume these events via the API for their own purposes. This event is emitted regardless of whether the FailureThreshold has been reached.
Currently, when a user consumes a probe failure event, they have no way of knowing whether the failure resulted in action on the control plane, because the failure can be ignored due to FailureThreshold and the event carries no information about this. This can lead users to assume there is a problem and that a container/pod was restarted when nothing actually happened.
I think we should expose the keepGoing value from https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/worker.go#L203 in the emitted event somehow. My preferred solution is to emit the probe failure event in the worker rather than in the prober, where it currently sits (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/prober/prober.go#L110). There is also the option of passing some information down the stack from the worker into the prober (such as making the FailureThreshold/SuccessThreshold decision in the prober), but I'm worried about separation of concerns. Happy to hear what other folks think :)
Also of note: FailureThreshold/SuccessThreshold is the only filter I can see where a probe can be ignored after it has run (and has therefore already emitted a container event).
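As a rough illustration of what the event could convey, the sketch below builds a failure message that says whether the failure was absorbed by the threshold. The function name `describeProbeFailure`, its parameters, and the message wording are all assumptions made up for this example; they are not existing kubelet code or an agreed event format.

```go
package main

import "fmt"

// describeProbeFailure is a hypothetical helper. Given the probe type, the
// probe output, the current run of consecutive failures and the configured
// FailureThreshold, it returns an event message that states whether the
// failure was ignored. The wording is illustrative only.
func describeProbeFailure(probeType, output string, consecutiveFailures, failureThreshold int) string {
	if consecutiveFailures < failureThreshold {
		return fmt.Sprintf("%s probe failed: %s (failure %d of %d, below FailureThreshold, no action taken)",
			probeType, output, consecutiveFailures, failureThreshold)
	}
	return fmt.Sprintf("%s probe failed: %s (failure %d of %d, FailureThreshold reached)",
		probeType, output, consecutiveFailures, failureThreshold)
}

func main() {
	// The first failure is ignored; the third reaches the threshold.
	fmt.Println(describeProbeFailure("Liveness", "connection refused", 1, 3))
	fmt.Println(describeProbeFailure("Liveness", "connection refused", 3, 3))
}
```

A helper like this would need both the failure count and the threshold, which is one reason emitting the event from the worker, where that state already lives, looks simpler than pushing the state down into the prober.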
I’m happy to write this PR once we’re confident in our approach :)