
Adding a runner for automating sonobuoy runs. #557

Closed
wants to merge 1 commit into from

Conversation

TrevinTeacutter

What this PR does / why we need it:
We wanted to automate sonobuoy runs, so I added a "runner" command that does a few things:

  • Kicks off a sonobuoy run
  • Polls sonobuoy status for completion of tests
  • If the status is bad, or something went wrong, runs sonobuoy delete
  • If the status is good and the run is complete, tries to retrieve the results and output them to stdout; in our case, we expect our logging solution to grab and persist them.

Special notes for your reviewer:
I figured we wouldn't be the only ones interested in automating this process a bit more, so rather than maintain a fork, I'm submitting upstream.

@timothysc
Contributor

@TrevinTeacutter Thank you for the patch! Could you please sign off on your commit?

I'm not sold on the UX of this approach.
/assign @liztio for review.

@TrevinTeacutter
Author

Yeah, I don't have any opinions on the format of what it outputs to stdout, so long as each JSON object ends up on a single line, since logging events are typically separated by newlines (or maybe that should be configurable in some way).

@ncdc
Contributor

ncdc commented Nov 9, 2018

@timothysc since you pinged in slack: I'd like an easy way to kick off tests and wait, up to a user-specified timeout, for them to complete. I'm OK with scripting the retrieval of results afterwards, or including it as an option behind a flag. If the tests don't complete within the timeout, it might be useful to have it retrieve whatever diagnostic data it can, but that might be too specific a use case.

Contributor

@liztio left a comment

I think I agree with @timothysc here: the use case is narrow enough that I'm not sure it belongs in this repo. I would love to see a combined "run-and-wait-until-done" command, but bundling in delete (non-optionally? what if you need to troubleshoot a failure?) and outputting a tar to stdout (how does it handle all the various files?) feels suspect to me.

Simplify its job a little, add a few more options for controlling flow and (ideally) some unit tests, and I'll be happy to +1 this!

signals <- syscall.SIGUSR1
return
case aggregation.CompleteStatus:
// Results aren't immediately available even though status says complete=
Contributor

extra `=` at the end of this comment

return
case aggregation.CompleteStatus:
// Results aren't immediately available even though status says complete=
time.Sleep(runnerflags.retrievalDelay)
Contributor

I don't really like this fixed timeout. Why not just make a couple attempts at retrieval until it succeeds?

Author

This is mostly because I don't like polling if I can avoid it, since this is a race condition that seems to exist between sonobuoy saying things are complete and the results actually being ready to retrieve. It felt more apt to do something similar to the fix for this race condition: kubernetes/ingress-nginx#322


So then this is a race and if the sleep isn't long enough we don't try again ...? 👎
Kubernetes would retry with backoff in this situation.


switch status.Status {
case aggregation.FailedStatus:
signals <- syscall.SIGUSR1
Contributor

I don't love this. I get that passing in SIGUSR1 makes a lot of sense for the end user, but using it internally feels like a hack.

delete()

if open {
close(signals)
Contributor

why close here but not in the other case?


runnerset.DurationVar(
&cfg.timeout, "runner-timeout", 6*time.Hour,
"Length of time to give the runner before giving up. (Default: 6h)",
Contributor

where did these times come from?

Author

Just an arbitrary time based on how long some of our Sonobuoy runs have taken. If you have a better default timeout, I'm definitely open to suggestions.

}
}

func run(runCfg *client.RunConfig) bool {
Contributor

why doesn't this just return an error, like the other methods?

return false
}

func retrieve(outDir string) bool {
Contributor

same here, re:bool

return nil
}

func printTarballResults(header *tar.Header, tarReader io.Reader) error {
Contributor

are you just printing the untarred result to stdout?

Author

Yes, with each file on a separate line so that each is recognized as a single event by logging daemons/drivers. Our current idea is to use logging as our way to persist the results of successful runs.


That sounds a bit terrible 😕
Kubernetes's own CI would not want to use this, we'd upload the tarball contents separately to persistent storage (GCS) rather than dump everything into the logs.

Author

We do not have GCS; it would feel more apt to allow multiple output backends rather than just GCS. At the moment, though, we don't really have usage scenarios to test adding those different persistence stores as backends. Someone on our team also brought up that it would be nice to have a way to parse the results and know whether there were failures within the runs, without having to manually scan the JSON files.


I don't think sonobuoy should upload to GCS; ideally, your automation should upload the output files to $storage.


As for checking if there are failures, the exit code of the command should be sufficient without dumping the results to the logs.

func init() {
cmd := &cobra.Command{
Use: "runner",
Short: "Submits a sonobuoy run and manages the lifetime of that run",
Contributor

I'd like to see a longer description here. What does 'manage the lifetime' mean? What are the expected outputs?

@TrevinTeacutter
Author

TrevinTeacutter commented Nov 12, 2018

I can add options for whether or not to delete; that's fine.
As for outputting files, my only requirement is that each JSON file within the tarball can be output to stdout in some way such that each file is treated as a separate event by a logging daemon or driver (such as fluentd); that's why each file is separated by a single newline here.

The entire point of this is to automate sonobuoy runs (and cleanup) without hacky methods that regex the output of sonobuoy commands (which is extremely fragile) to determine what to do next; that's why it made sense to do it as a single command. If automating this outside of troubleshooting doesn't make sense, I would like to know why so I can pass that along.

}
}()

time.Sleep(runnerflags.statusLoopDelay)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use a select and a ticker instead, it can be more responsive to signals


(and cleanly exit / print something to the logs rather than just aborting on signal directly ..?)

@BenTheElder

IIRC the tarball should contain results that are not JSON, which are useful to retrieve. Why not make your retrieval-and-print mechanism a separate command that your tools call after waiting for the tests to finish? That's more focused/composable, and it makes it easy for other users not to dump the tarball to the logs; I don't suspect that would be a popular option.

@TrevinTeacutter
Author

The problem is that there is still no automated way to check for completion, because Sonobuoy itself offers no output mechanism for any of its commands other than plain text (meaning regex is the only way to find out whether it passed, because exit codes aren't useful here either). There is also no equivalent of a rollout status command in sonobuoy to avoid having to poll.

@TrevinTeacutter
Author

Here is the story that may help explain why there is a desire to make this all one command:
We want to automate runs of sonobuoy on our clusters so that no human is starting them or sifting through results to check whether all tests passed. If a run fails, human interaction would of course be needed (what kind depends on the failure), which means grabbing a summary of which tests failed or grabbing the actual tarball, persisted somehow.

I don't want to write some hacky bash script that regexes the output of each sonobuoy run (given the choice, I'd much rather jq the results than do magical regex), because that makes the automation extremely fragile.

If you have recommendations outside of making this PR then I'll happily look into them.

@BenTheElder

AFAIK most CI/testing is built around exit codes, so my suggested route:

  • have the run command give a useful exit code
  • let some user specific tool handle persisting the tarball

Then your script is just:

  • obtain cluster
  • run sonobuoy, save the exit code
  • ask sonobuoy for the tarball, persist it via your preferred method
  • exit with the code / output a string based on the exit code for your logging tools to find

Dumping the tarball to output seems like a huge hack, but also one that would be easy to implement in another tool.
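That scripted route could look roughly like this. It is only a sketch: the `run_suite` wrapper is hypothetical, and the `SONOBUOY` variable is overridable so the flow can be exercised without a cluster.

```shell
# Sketch of the suggested automation wrapper.
run_suite() {
  local sonobuoy="${SONOBUOY:-sonobuoy}"

  # Run and save the exit code rather than parsing text output.
  "$sonobuoy" run --wait
  local rc=$?

  # Always attempt to grab the tarball so it can be persisted to
  # whatever storage the caller prefers (GCS, S3, a PVC, ...).
  "$sonobuoy" retrieve . || echo "retrieve failed" >&2

  # Clean up regardless of outcome.
  "$sonobuoy" delete --wait || echo "delete failed" >&2

  # Surface the run's exit code for CI / logging tools.
  return "$rc"
}
```

A CI job then calls `run_suite`, persists the tarball, and branches on the exit code.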

@TrevinTeacutter
Author

Is the run considered failed if something bad happened, or also if one of the e2e tests (or any other test) failed? That seems... problematic, because it still means someone has to manually pull that tarball and either run some separate (custom) command to interpret the results or interpret them manually.

@BenTheElder

BenTheElder commented Nov 15, 2018 via email

@TrevinTeacutter
Author

I'd say that is more a result of sonobuoy sending results in a form that isn't easily serialized into the client for processing than of processing the results as part of the same command. I'll just drop the stdout-dumping story for now.

Signed-off-by: Trevin Teacutter <trevin.teacutter@cerner.com>
@johnSchnake
Contributor

It seems that now that the delete and retrieve functionality has been removed, the main thing this does is wait for the run to finish.

This matches up with #437, which has the more desirable UX of just adding a flag to the run itself. Under the hood it would still poll for a terminal state, but it increases the scriptability of sonobuoy without adding new runners and complicating the UX.

I'd be happy to get #437 included, which seems to benefit you as much as this PR with a much smaller code/UX change. Let me know if you want to handle that (preferably in a new PR, to keep the history/discussion clear).

@TrevinTeacutter
Author

I have no preference, so if you wanna do the change that works for me.

@johnSchnake
Contributor

#592 has merged so you now have the option of doing:

sonobuoy run --wait && sonobuoy retrieve && sonobuoy delete --wait

which should run (for up to a day), get the results, then delete everything.

Obviously you can add some more scripting in there if you want to otherwise handle errors/get logs etc.

My plan is to ensure that sonobuoy retrieve gives output about the files it created, so that you can also easily get the tarball and run sonobuoy e2e to see the results. Once that is all done, I'll update the readme with the easier CLI flow.
