Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixing subprocess to use system buffer instead of being unbuffered #4803

Merged
merged 9 commits into from
Oct 29, 2019

Conversation

jdmeyer3
Copy link
Contributor

While testing large datasets in actions, I came across massive CPU spikes for a relatively small dataset (~8MB) being returned from the action. While investigating the issue I found two problems that were consuming the CPU. The first issue was documented in #4798. The second issue was with how the action was created. When the subprocess is created without the buffer size, it defaults to zero (unbuffered). According to the subprocess docs,

"If you experience performance issues, it is recommended that you try to enable buffering by setting bufsize to either -1 or a large enough positive value (such as 4096)." https://docs.python.org/2/library/subprocess.html.

The bufsize can be a discussion whether to use the system default or a large value. After making this change on my local machine I noticed a performance increase minus the MongoEngine issue which still persists.

@pull-request-size pull-request-size bot added the size/XS PR that changes 0-9 lines. Quick fix/merge. label Oct 17, 2019
@blag
Copy link
Contributor

blag commented Oct 18, 2019

How does this impact the live execution results?

@arm4b
Copy link
Member

arm4b commented Oct 18, 2019

+1 to @blag I'm wondering too if this change may affect the real-time live output from actions like st2 execution tail?

Performance wise, you may also benefit from disabling "live output" in your environment

st2/conf/st2.conf.sample

Lines 25 to 26 in a643ba7

# True to store and stream action output (stdout and stderr) in real-time.
stream_output = True

which also takes a lot of CPU cycles.

@jdmeyer3
Copy link
Contributor Author

@blag At a -1 value for buffering, it would depend on your system. Some systems will fully buffer while others will buffer 4KB - 8KB of memory before flushing. If the buff value is hardcoded to 4096 then the live output should be able to display the information in 4K increments. This is to avoid having the CPU process the output from the execution one byte at a time and instead buffer to reduce the compute overhead. I can change it to a hardcoded buffer value instead of the system default if that works better for you.

@Kami
Copy link
Member

Kami commented Oct 29, 2019

Yeah, IIRC, in the past we've seen some issues with real time output hanging on some distros and that's why we used unbuffered output.

I'm fine with this change as long as we document it and make it configurable via config option in st2.conf (e.g. actionrunner.stream_output_buffer_size).

This way users who would potentially have issues can change that value in the config instead of needing to directly edit the code.

I can make that change and merge that into master.

Kami added 3 commits October 29, 2019 15:57
st2common.util.green.shell.run_command() function.
value for "bufsize" argument inside the Python runner run_command
function call.

For performance reasons, default it to -1, but allow users to change it
via config if they are experiencing performance or similar issues.
…ize" argument don't hang

or block the Python action runner streaming output functionality.
Kami added a commit to StackStorm/st2docs that referenced this pull request Oct 29, 2019
@Kami
Copy link
Member

Kami commented Oct 29, 2019

I pushed the following changes:

  1. Add new actionrunner.stream_output_buffer_size config option and default it to -1 per @jdmeyer3 observations. 2f96d5d

  2. Only use bufsize=-1 for Python runner actions. Changing the default for all the st2common.util.green.shell.run_command() function calls would affect a lot of other code as well so it's better to keep the change small and more contained. 25ea404, 2f96d5d.

  3. Add some basic tests for it. 711436f

  4. Document this change in upgrade notes - Add changelog entry for changes StackStorm/st2@4803 st2docs#941.

We should still keep an eye on potential regressions on other distributions, but I think now that the value can be controlled via config option, it's safe enough to be merged into master.

If users experience issues, they can change the config option value to revert back to the old behavior.

@Kami
Copy link
Member

Kami commented Oct 29, 2019

@jdmeyer3^ let me know if you are OK with the changes above.

Kami added 3 commits October 29, 2019 19:34
This way we can troubleshoot various "fail to start" issues on CI.

Without this change, we have no visibility if service fails to start
before the actual service starts writting into it's own log file.

Sadly we can't rely on new version of screen which supports -Logfile
path.log flag so we need to use screen config per screen window so we
can use different log files for different processes.
@Kami Kami force-pushed the bug/thread_buffering_issue branch from 61dd42d to 0b76bd5 Compare October 29, 2019 19:36
@Kami
Copy link
Member

Kami commented Oct 29, 2019

There were some weird / random Travis test failures.

I need to push some debugging changes (0b76bd5) which will hopefully provide us with more information and context which will allow us to troubleshoot such issues in the future.

@jdmeyer3
Copy link
Contributor Author

@Kami Ya, those changes are fine with me.

@Kami
Copy link
Member

Kami commented Oct 29, 2019

Yay, all green!

Thanks again for the contribution. I will go ahead and merge it into master.

@jdmeyer3 When you get a chance, can you please test the version / changes from master and confirm it's working as expected?

@Kami Kami merged commit 28d7c50 into StackStorm:master Oct 29, 2019
@jdmeyer3 jdmeyer3 deleted the bug/thread_buffering_issue branch November 1, 2019 18:21
@blag blag added the v3.1.1 label Feb 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance service: action runner size/XS PR that changes 0-9 lines. Quick fix/merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants