
Adding a runner for automating sonobuoy runs. #557

Closed
wants to merge 1 commit into from

Conversation

TrevinTeacutter

What this PR does / why we need it:
We wanted to automate sonobuoy runs, so I added a "runner" command that does a few things:

  • Kicks off a sonobuoy run
  • Polls sonobuoy status for completion of tests
  • If the status is bad, or something went wrong, runs sonobuoy delete
  • If the status is good and the run is complete, tries to retrieve the results and output them to stdout; in our case, we expect our logging solution to grab and persist them.

Special notes for your reviewer:
I figured we wouldn't be the only ones interested in automating this process a bit more, so rather than maintain a fork, I'm submitting upstream.

@timothysc
Contributor

@TrevinTeacutter Thank you for the patch! Could you please sign off on your commit?

I'm not sold on the UX of this approach.
/assign @liztio for review.

@TrevinTeacutter
Author

Yeah, I don't have any opinions on the format of what it outputs to stdout, so long as each JSON object ends up on a single line, since logging events are typically separated by newlines (or maybe that should be configurable in some way).

@ncdc
Contributor

ncdc commented Nov 9, 2018

@timothysc since you pinged in slack: I'd like an easy way to kick off tests and wait, up to a user-specified timeout, for them to complete. I'm OK with scripting the retrieval of results afterwards, or including it as an option behind a flag. If the tests don't complete within the timeout, it might be useful to have it retrieve whatever diagnostic data it can, but that might be too specific a use case.

Contributor

@liztio left a comment

I think I agree with @timothysc here: the use case is narrow enough that I'm not sure it belongs in this repo. I would love to see a combined "run-and-wait-until-done" command, but bundling in delete (non-optionally? what if you need to troubleshoot a failure?) and outputting a tar to stdout (how does it handle all the various files?) feels suspect to me.

Simplify its job a little, add a few more options for controlling flow and (ideally) some unit tests, and I'll be happy to +1 this!

signals <- syscall.SIGUSR1
return
case aggregation.CompleteStatus:
// Results aren't immediately available even though status says complete=
Contributor

extra `=` at the end of this comment

return
case aggregation.CompleteStatus:
// Results aren't immediately available even though status says complete=
time.Sleep(runnerflags.retrievalDelay)
Contributor

I don't really like this fixed timeout. Why not just make a couple attempts at retrieval until it succeeds?

Author

This is mostly because I don't like polling if I can avoid it, since this is a race condition that seems to exist between sonobuoy saying things are complete and the results actually being ready to retrieve. It felt more apt to do something similar to the fix for this race condition: kubernetes/ingress-nginx#322


So then this is a race and if the sleep isn't long enough we don't try again ...? 👎
Kubernetes would retry with backoff in this situation.


switch status.Status {
case aggregation.FailedStatus:
signals <- syscall.SIGUSR1
Contributor

I don't love this. I get that passing in SIGUSR1 makes a lot of sense for the end user, but using it internally feels like a hack.

delete()

if open {
close(signals)
Contributor

why close here but not in the other case?


runnerset.DurationVar(
&cfg.timeout, "runner-timeout", 6*time.Hour,
"Length of time to give the runner before giving up. (Default: 6h)",
Contributor

where did these times come from?

Author

Just an arbitrary time based on how long some of our Sonobuoy runs have taken. If you have a better default timeout, I'm definitely open to suggestions.

}
}

func run(runCfg *client.RunConfig) bool {
Contributor

why doesn't this just return an error, like the other methods?

return false
}

func retrieve(outDir string) bool {
Contributor

same here, re:bool

return nil
}

func printTarballResults(header *tar.Header, tarReader io.Reader) error {
Contributor

are you just printing the untarred result to stdout?

Author

Yes, with each file on a separate line so that each is recognized as a single event by logging daemons/drivers. Our current idea is to use logging as our way to persist the results of successful runs.


That sounds a bit terrible 😕
Kubernetes's own CI would not want to use this, we'd upload the tarball contents separately to persistent storage (GCS) rather than dump everything into the logs.

Author

We do not have GCS; it would feel more apt to allow multiple output backends rather than just GCS. At the moment, though, we don't really have usage scenarios to test adding those different persistence stores as backends. Someone on our team also brought up that it would be nice to have a way to parse the results and know whether there were failures within the runs, without having to manually scan the JSON files.


I don't think sonobuoy should upload to GCS; ideally, your automation should upload the output files to $storage.


As for checking if there are failures, the exit code of the command should be sufficient without dumping the results to the logs.

func init() {
cmd := &cobra.Command{
Use: "runner",
Short: "Submits a sonobuoy run and manages the lifetime of that run",
Contributor

I'd like to see a longer description here. What does 'manage the lifetime' mean? What are the expected outputs?

@TrevinTeacutter
Author

TrevinTeacutter commented Nov 12, 2018

I can add options for whether or not to delete; that's fine.
As for outputting files, my only requirement is that each JSON file within the tarball can be output to stdout in some way such that each file is treated as a separate event by a logging daemon or driver (such as fluentd); that's why each file is separated by a single newline here.

The entire point of this is to automate sonobuoy runs (and cleanup) without hacky methods that regex the output of sonobuoy commands (which is extremely fragile) to determine what to do next; that's why it made sense to do it as a single command. If automating this outside of troubleshooting doesn't make sense, I would like to know why so I can pass that along.

}
}()

time.Sleep(runnerflags.statusLoopDelay)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use a select and a ticker instead, it can be more responsive to signals


(and cleanly exit / print something to the logs rather than just aborting on signal directly ..?)

@BenTheElder

IIRC the tarball should contain results that are not JSON, which are useful to retrieve. Why not make your retrieval-and-print mechanism a separate command that your tools call after waiting for the tests to finish? That's more focused/composable, and it makes it easy for other users not to dump the tarball to the logs; I don't suspect that would be a popular option.

@TrevinTeacutter
Author

The problem is that there is still no automated way to check for completion, because Sonobuoy itself offers no output mechanism for any of its commands other than plain text (meaning regex is the only way to find out whether it passed, because exit codes aren't useful here either). There is also no equivalent of a rollout status command in sonobuoy to avoid having to poll.

@TrevinTeacutter
Author

Here is the story that may help explain why there is a desire to make this all one command:
We want to automate runs of sonobuoy on our clusters so that no human is starting them or sifting through results to check whether all tests passed. If a run fails, human interaction would of course be needed (what kind depends on the failure), which means grabbing a summary of which tests failed or grabbing the actual tarball, persisted somehow.

I don't want to write some hacky bash script that regexes the output of each sonobuoy run (given the choice, I'd much rather jq the results than do magical regex), because that makes the automation extremely fragile.

If you have recommendations outside of making this PR then I'll happily look into them.

@BenTheElder

AFAIK most CI/testing is built around exit codes, so my suggested route:

  • have the run command give a useful exit code
  • let some user specific tool handle persisting the tarball

Then your script is just:

  • obtain cluster
  • run sonobuoy, save the exit code
  • ask sonobuoy for the tarball, persist it via your preferred method
  • exit with the code / output a string based on the exit code for your logging tools to find

Dumping the tarball to output seems like a huge hack, but also one that would be easy to implement in another tool.
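That scripted route could look roughly like this. It is only a sketch: the `run_suite` wrapper is hypothetical, and the `SONOBUOY` variable is overridable so the flow can be exercised without a cluster.

```shell
# Sketch of the suggested automation wrapper.
run_suite() {
  local sonobuoy="${SONOBUOY:-sonobuoy}"

  # Run and save the exit code rather than parsing text output.
  "$sonobuoy" run --wait
  local rc=$?

  # Always attempt to grab the tarball so it can be persisted to
  # whatever storage the caller prefers (GCS, S3, a PVC, ...).
  "$sonobuoy" retrieve . || echo "retrieve failed" >&2

  # Clean up regardless of outcome.
  "$sonobuoy" delete --wait || echo "delete failed" >&2

  # Surface the run's exit code for CI / logging tools.
  return "$rc"
}
```

A CI job then calls `run_suite`, persists the tarball, and branches on the exit code.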

@TrevinTeacutter
Author

Is the run considered failed if something bad happened, or also if one of the e2e tests (or any other test) failed? That seems... problematic, because it still means someone has to manually pull that tarball and either run some separate (custom) command to interpret the results or interpret them manually.

@BenTheElder

BenTheElder commented Nov 15, 2018 via email

@TrevinTeacutter
Author

I'd say that is more a result of sonobuoy sending results in a form that isn't easily serialized into the client for processing than of processing the results as part of the same command. I'll just drop the stdout-dumping story for now.

Signed-off-by: Trevin Teacutter <trevin.teacutter@cerner.com>
@johnSchnake
Contributor

It seems that now that the delete and retrieve functionality has been removed, the main thing this does is wait for the run to finish.

This matches up with #437, which has the more desirable UX of just adding a flag to the run itself. Under the hood it would still poll for a terminal state, but it increases the scriptability of sonobuoy without adding new runners and complicating the UX.

I'd be happy to get #437 included, which seems to benefit you as much as this PR with a much smaller code/UX change. Let me know if you want to handle that (preferably in a new PR, to keep the history/discussion clear).

@TrevinTeacutter
Author

I have no preference, so if you wanna do the change that works for me.

@johnSchnake
Contributor

#592 has merged so you now have the option of doing:

sonobuoy run --wait && sonobuoy retrieve && sonobuoy delete --wait

which should run (for up to a day), get the results, then delete everything.

Obviously you can add some more scripting in there if you want to otherwise handle errors/get logs etc.

My plan is to ensure that sonobuoy retrieve gives output about the files it created, so that you can also easily get the tarball and run sonobuoy e2e to see the results. Once that is all done, I'll update the readme with the easier CLI flow.
