Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apimachinery & apiserver: use stacktrace in the stdlib #75853

Merged

Conversation

roycaihw
Copy link
Member

@roycaihw roycaihw commented Mar 29, 2019

use runtime.Stack() in the stdlib to log stack trace instead of using customized code, so that third party lib can parse the panic (like panicparse)

e.g. if we have a panic here, with the change, the tool is able to parse all three goroutines

without the change, only the first goroutine (finishRequest) gets parsed. The second goroutine (timeoutHandler) is missed because we drop a trailing newline in the one goroutine before. The third goroutine (HandleCrash) is missed because getCaller() is non-standard which only contains filename + line number.

Does this PR introduce a user-facing change?:

Use stdlib to log stack trace when a panic occurs

/cc @logicalhan @sttts

@k8s-ci-robot k8s-ci-robot requested a review from logicalhan March 29, 2019 00:40
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 29, 2019
@k8s-ci-robot k8s-ci-robot requested a review from sttts March 29, 2019 00:40
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 29, 2019
@liggitt
Copy link
Member

liggitt commented Mar 29, 2019

I think the size limit was to prevent excessively large logs (stdlib http server panic handling does the same thing)

I don't know why the custom caller code was added initially. @smarterclayton, do you know?

Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the size limit was to prevent excessively large logs (stdlib http server panic handling does the same thing)

That makes sense. Maybe we should add a comment if we revert to that behavior?

Personally I like this change. I find the callers thing quite odd.

@@ -92,23 +92,22 @@ func (t *timeoutHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
return
}

errCh := make(chan interface{})
// resultCh is used as both errCh and stopCh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I like the rename. It makes more sense.

@@ -62,27 +63,14 @@ func HandleCrash(additionalHandlers ...func(interface{})) {

// logPanic logs the caller tree when a panic occurs.
func logPanic(r interface{}) {
callers := getCallers(r)
callers := debug.Stack()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably better to rename this to stacktrace?

@@ -155,10 +143,10 @@ func GetCaller() string {
// handlers to handle errors and panics the same way.
func RecoverFromPanic(err *error) {
if r := recover(); r != nil {
callers := getCallers(r)
callers := debug.Stack()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same.

@smarterclayton
Copy link
Contributor

I think this code was added before Go had an easy way to get a stack dump. There may have been a security concern too (avoiding showing some portions of the stack). But it’s been too long.

@roycaihw
Copy link
Member Author

2012: runtime/debug.Stack was deprecated (golang/go#4070, golang/go@e49a183)
2012: stdlib http server switched from runtime/debug.Stack to runtime.Stack with 4KB buf size (golang/go#4060, golang/go@dd43bf8)
2014: the http server increased buf size from 4KB to 64KB golang/go@645a341
2015: runtime/debug.Stack was re-implemented and un-deprecated golang/go@2d697b2
2017-2019: some conversation comparing the two APIs golang/go#3845 (comment)

I can go with manually allocating buf size if excessively large log is a concern. Will update the PR soon

@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from 943a02b to 0e61b77 Compare April 1, 2019 18:11
@roycaihw
Copy link
Member Author

roycaihw commented Apr 1, 2019

updated. I'm working on adding a test to capture what's printed in klog

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 1, 2019
Copy link
Member

@logicalhan logicalhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the test!

/lgtm

callers := getCallers(r)
// Same as stdlib http server code. Manually allocate stack trace buffer size
// to prevent excessively large logs
const size = 64 << 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be worth extracting this logic somewhere, since we do the same thing in handlers/rest.go. Probably not though..

Copy link
Member Author

@roycaihw roycaihw Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I'm not sure on this

on one hand, I'd imagine extracting this logic would result in one more function call in the call stack

on the other hand, I guess there are more places in our code base need this logic (I was about to add the logic here until I saw #73137 (comment))

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2019
@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from 739be66 to 826cb90 Compare April 1, 2019 23:59
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 1, 2019
@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from 826cb90 to d5ca73a Compare April 2, 2019 00:53
@sttts sttts added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 2, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 2, 2019
@sttts sttts added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 2, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Apr 2, 2019
@sttts
Copy link
Contributor

sttts commented Apr 2, 2019

Looks good.

Can you post the difference in output from before and after this PR?

@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from d5ca73a to a2c59f5 Compare April 2, 2019 19:40
@roycaihw
Copy link
Member Author

roycaihw commented Apr 2, 2019

Can you post the difference in output from before and after this PR?

@sttts https://gist.github.com/roycaihw/b0977279000fb3f3048c3b77df89bec5

to reproduce: include this commit, bring up apiserver and look for the last occurrence of "Observed a panic" in apiserver log

@roycaihw
Copy link
Member Author

roycaihw commented Apr 3, 2019

/retest

@sttts
Copy link
Contributor

sttts commented Apr 3, 2019

@roycaihw do I read the gist correctly that we print all 3249 go routine stack traces now, while only one before?

@roycaihw
Copy link
Member Author

roycaihw commented Apr 3, 2019

@sttts we print stacktraces for 3 goroutines in both cases

after:

  • goroutine 3247 handlers.finishRequest
  • goroutine 3246 filters.(*timeoutHandler).ServeHTTP
  • goroutine 3277 runtime.logPanic

before:

  • goroutine 3277 handlers.finishRequest (without trailing newline)
  • goroutine 3276 filters.(*timeoutHandler).ServeHTTP
  • filename+line# (runtime.logPanic)

/retest

@roycaihw
Copy link
Member Author

roycaihw commented Apr 5, 2019

@liggitt could you approve this PR?

/retest

if match, _ := regexp.MatchString(`logPanic(.*)`, lines[2]); !match {
t.Errorf("mismatch symbolized function name: %s", lines[2])
}
if match, _ := regexp.MatchString(`runtime\.go:69 \+0x`, lines[3]); !match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a fragile test... is this format and line number stable across releases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question for the [running] format, etc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line #: it's fragile. I meant to have people update the number when changing our runtime.go, but it's not necessary

[running]: it comes from one of the statuses. Should be stable in this test

updated both to use regexp. The rest of the test should be stable enough

also added documentation about if golang stdlib changes its format


// captureStderr redirects stderr to result string, and then restore stderr from backup
func captureStderr(f func()) (string, error) {
bak := os.Stderr
Copy link
Member

@liggitt liggitt Apr 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defer func(){ os.Stderr = bak }()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

tried to defer closing the writable file before, and it behaved weirdly (stuck on flushing the buffer?)

@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from a2c59f5 to ca8fcfc Compare April 5, 2019 19:37
@roycaihw roycaihw force-pushed the fix/use-standard-stacktrace branch from ca8fcfc to 999a02c Compare April 5, 2019 19:50
@liggitt
Copy link
Member

liggitt commented Apr 5, 2019

/lgtm
/approve
/retest

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 5, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, roycaihw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 5, 2019
@k8s-ci-robot k8s-ci-robot merged commit f8d0b21 into kubernetes:master Apr 6, 2019
@lavalamp
Copy link
Member

I think this code was added before Go had an easy way to get a stack dump.

Yeah, I'm pretty sure I added this code, for this reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants