Adding a runner for automating sonobuoy runs. #557
Conversation
@TrevinTeacutter Thank you for the patch! Could you please sign off on your commit? I'm not sold on the UX of this approach.
Yeah, I don't have any opinions on the format of what it outputs to stdout, so long as we get the JSON object onto a single line, since logging events are typically separated by newlines (or maybe that should be configurable in some way).
@timothysc since you pinged in slack: I'd like an easy way to kick off tests and wait up to a user-specified timeout for them to complete. I'm ok scripting retrieving the results afterwards, or including it as an option using a flag. If the tests don't complete within the timeout, it might be useful to have it retrieve whatever diagnostic data it can, but that might be too specific of a use case.
I think I agree with @timothysc here: the use case is narrow enough that I'm not sure it belongs in this repo. I would love to see a combined "run-and-wait-until-done" command, but bundling in delete (non-optionally? what if you need to troubleshoot a failure?) and outputting a tar to stdout (how does it handle all the various files?) feels suspect to me.
Simplify its job a little, add a few more options for controlling flow, and (ideally) some unit tests, and I'll be happy to +1 this!
cmd/sonobuoy/app/runner.go (Outdated)

```go
	signals <- syscall.SIGUSR1
	return
case aggregation.CompleteStatus:
	// Results aren't immediately available even though status says complete=
```
extra `=` here
cmd/sonobuoy/app/runner.go (Outdated)

```go
	return
case aggregation.CompleteStatus:
	// Results aren't immediately available even though status says complete=
	time.Sleep(runnerflags.retrievalDelay)
```
I don't really like this fixed timeout. Why not just make a couple attempts at retrieval until it succeeds?
This is mostly because I don't like polling if I can avoid it; this is a race condition that seems to exist between sonobuoy saying things are complete and when results are actually ready to be retrieved. It felt more apt to just do something similar to this race condition: kubernetes/ingress-nginx#322
So then this is a race and if the sleep isn't long enough we don't try again ...? 👎
Kubernetes would retry with backoff in this situation.
cmd/sonobuoy/app/runner.go (Outdated)

```go
switch status.Status {
case aggregation.FailedStatus:
	signals <- syscall.SIGUSR1
```
I don't love this. I get that passing in SIGUSR1 makes a lot of sense for the end user, but using it internally feels like a hack.
cmd/sonobuoy/app/runner.go (Outdated)

```go
	delete()

	if open {
		close(signals)
```
why close here but not in the other case?
```go
runnerset.DurationVar(
	&cfg.timeout, "runner-timeout", 6*time.Hour,
	"Length of time to give the runner before giving up. (Default: 6h)",
```
where did these times come from?
Just an arbitrary time based on how long some of our Sonobuoy runs have taken. If you have a better default timeout I'm definitely open to suggestions.
cmd/sonobuoy/app/runner.go (Outdated)

```go
	}
}

func run(runCfg *client.RunConfig) bool {
```
why doesn't this just return an error, like the other methods?
cmd/sonobuoy/app/runner.go (Outdated)

```go
	return false
}

func retrieve(outDir string) bool {
```
same here, re:bool
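The suggested change from `bool` to `error` might look like the following sketch (the `RunConfig` stand-in and its checks are purely illustrative, not sonobuoy's actual `client.RunConfig`):

```go
package main

import (
	"errors"
	"fmt"
)

// RunConfig is a hypothetical stand-in for client.RunConfig.
type RunConfig struct{ Valid bool }

// run returns an error like the other helpers, so callers can wrap
// and propagate context instead of interpreting a bare bool.
func run(cfg *RunConfig) error {
	if cfg == nil {
		return errors.New("run: nil config")
	}
	if !cfg.Valid {
		return errors.New("run: invalid config")
	}
	return nil
}

func main() {
	// The caller gets a message to log or wrap, not just true/false.
	if err := run(&RunConfig{Valid: false}); err != nil {
		fmt.Println("run failed:", err) // prints "run failed: run: invalid config"
	}
}
```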
cmd/sonobuoy/app/runner.go (Outdated)

```go
	return nil
}

func printTarballResults(header *tar.Header, tarReader io.Reader) error {
```
are you just printing the untarred result to stdout?
Yes, with each file on a separate line so that it is recognized as a single event by logging daemons/drivers. Our current idea is to use logging as our way to persist the results of successful runs.
That sounds a bit terrible 😕
Kubernetes's own CI would not want to use this, we'd upload the tarball contents separately to persistent storage (GCS) rather than dump everything into the logs.
We do not have GCS; it would feel more apt to allow multiple ways to output the result to various backends, not just GCS. At the moment, though, we don't really have usage scenarios to test adding those different persistence stores as backends. Someone on our team also brought up that it would be nice to have a way to parse the results to know whether there were failures within the runs without having to manually scan the JSON files.
I don't think sonobuoy should upload to GCS, I think your automation should upload the output files to $storage ideally.
As for checking if there are failures, the exit code of the command should be sufficient without dumping the results to the logs.
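The exit-code approach could look something like this sketch (the status strings and code values are illustrative assumptions, not sonobuoy's actual statuses):

```go
package main

import "fmt"

// exitCodeFor maps a run outcome to a process exit code so CI can
// branch on it without parsing output. Hypothetical mapping only.
func exitCodeFor(status string) int {
	switch status {
	case "complete": // run finished, all tests passed
		return 0
	case "failed": // one or more tests failed
		return 1
	default: // aggregation error, timeout, etc.
		return 2
	}
}

func main() {
	// A real CLI would pass this to os.Exit; printed here for clarity.
	fmt.Println("failed ->", exitCodeFor("failed")) // prints "failed -> 1"
}
```

A script can then use plain `if sonobuoy-runner; then ...` logic instead of regexing logs.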
cmd/sonobuoy/app/runner.go (Outdated)

```go
func init() {
	cmd := &cobra.Command{
		Use:   "runner",
		Short: "Submits a sonobuoy run and manages the lifetime of that run",
```
I'd like to see a longer description here. What does 'manage the lifetime' mean? what are the expected outputs?
I can add options for whether or not to delete; that's fine. The entire point of this is to automate sonobuoy runs (and cleanup) without hacky methods that try to regex the output of sonobuoy commands (which is extremely fragile) to determine what to do next (which is why it made sense to do it as a single command). If it doesn't make sense to automate this outside of troubleshooting, I would like to know why so I can pass that along.
cmd/sonobuoy/app/runner.go (Outdated)

```go
	}
}()

time.Sleep(runnerflags.statusLoopDelay)
```
if you use a switch and ticker instead it can be more responsive to signals
(and cleanly exit / print something to the logs rather than just aborting on signal directly ..?)
IIRC the tarball should contain results that are not JSON, which are useful to retrieve. Why not make your retrieval-and-print mechanism a separate command that your tools call after waiting for the tests to finish? That's more focused/composable, and makes it easy for other users not to dump the tarball to the logs; I don't suspect that to be a popular option.
The problem is that there is still no automated way to check for completion, because Sonobuoy itself offers no output mechanism for any of its commands other than plain text (meaning regex is the only way to get info on whether it passed or not, because exit codes aren't useful here either). There is also no equivalent of a
Here is the story that may help explain why there is a desire to make this all one command: I don't want to write some hacky bash script that regexes the output of each sonobuoy run (if I had a choice, I'd much rather rely on jq'ing the results than doing magical regex), because that makes the automation extremely fragile. If you have recommendations outside of making this PR, I'll happily look into them.
Most CI / testing is built around exit codes AFAIK, so my suggested route:
Then your script is just:
Dumping the tarball to output seems like a huge hack, but also one that would be easy to implement in another tool.
Is the run considered failed if something bad happened, or if one of the e2e tests (or any other test) failed? That seems... problematic, because it still means someone has to manually pull that tarball and either run some separate (custom) command to interpret the results or interpret them manually.
- we can use different exit codes for the failure modes as necessary
- your small tool can pull the tarball and dump it to logs after run exits.
It doesn't need to be part of run though. Needing to wait for the run to end seems common; needing to dump the full results to logs, less so. The latter is also especially trivial and less error prone than cleanly blocking on the run.
I'd say that is more a result of sonobuoy sending results in a manner that isn't easily serialized into the client for processing than of processing the results as part of the same command. I'll just get rid of the dumping story for now.
Signed-off-by: Trevin Teacutter <trevin.teacutter@cerner.com>
It seems that now that the delete and retrieve functionality has been removed, the main thing this does is wait for the run to finish. This matches up with #437, which has the more desirable UX of just adding a flag to the run itself. Under the hood it would still poll for a terminal state, but it increases the scriptability of sonobuoy without adding new runners and complicating the UX. I'd be happy to get #437 included, which seems to benefit you as much as this PR with a much smaller code/UX change. Let me know if you want to handle that (preferably in a new PR to clarify history/discussion).
I have no preference, so if you want to do the change, that works for me.
#592 has merged so you now have the option of doing:
which should run (up to a day), get the results, then delete everything. Obviously you can add some more scripting in there if you want to otherwise handle errors/get logs etc. My plan is to ensure that
What this PR does / why we need it:

We were wanting to automate sonobuoy runs, so I added a "runner" command that does a few things:

- runs `sonobuoy run`
- polls `sonobuoy status` for completion of tests
- runs `sonobuoy delete`
- dumps the results to `stdout`, since at least for us, we are expecting to have our logging solution grab them and persist them.

Special notes for your reviewer:
I figured we wouldn't be the only ones interested in automating this process a bit more, so rather than maintain a fork I'm submitting upstream.