Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bidirectional gRPC outputs + fix server streaming gRPC outputs #1241

Merged
merged 19 commits into from
Jun 29, 2020

Conversation

leodido
Copy link
Member

@leodido leodido commented May 29, 2020

What type of PR is this?

/kind bug
/kind design
/kind feature

Any specific area of the project related to this PR?

NONE

What this PR does / why we need it:

Long story short:
while debugging and profiling the Falco gRPC server because of #1126, @fntlnz and I identified that the existing gRPC outputs API (server streaming) had an architectural problem causing a high CPU consumption.

The implementation of that API was looping on all the remaining enqueued alerts at every process call to "keepalive" (pun intended) the stream.

An initial fix (announced by me during the community calls, very similar to #1237) of swapping the conditional operands into the if and the while and profit the short-circuiting did not lead to notable performance improvements in our tests.

Anyways, more details about this topic have been said and presented during the 2 past community calls.

Thus, with this PR we:

  • remove keepalive field from output request schema
  • introduce and implement a bi-direction gRPC API (falco.outputs.service/sub) to watch Falco alerts
  • fix the high CPU consumption of server streaming gRPC outputs
  • re-implement the server streaming gRPC outputs as (falco.outputs.service/get)
  • reproduce/debug end-of-stream errors

This PR contains breaking changes.

Some updates have to be carried on other Falco projects:

Which issue(s) this PR fixes:

Fixes #1126
Fixes #1240
Fixes #856

Special notes for your reviewers:

Here is a graph showing how the try_pop used to pop out Falco alerts does not govern anymore the whole thing.

Notice that this graph is related to the following situation:

  • a client flooding the server with continuous and parallel requests (~4MLN calls)
  • 13 alerts in 10 seconds of Falco execution

Click on the image and zoom it.

profile 1

While the thread_process (beacuse of the while on the try_pop) was occupying ~87.38% before (ref), now it's "only" occupying 37.74% even with a client deliberately flooding the server with requests.

Jun 5th 2020 update:

We did further improvements regarding memory allocation (and CPU) while allocating the response.
The latest change in this commit allows us to decrease the thread_process impact by 20% (it totals to ~17% now). This is very useful in very intensive scenarios or when the server is flooded by requests.

profile 2

How to try it out:

  1. Enable gRPC server and gRPC Outputs API (with Unix socket for simplicity).
diff --git a/falco.yaml b/falco.yaml
index dcd76f7..2051583 100644
--- a/falco.yaml
+++ b/falco.yaml
@@ -189,7 +189,7 @@ http_output:
 
 # gRPC server using an unix socket
 grpc:
-  enabled: false
+  enabled: true
   bind_address: "unix:///var/run/falco.sock"
   threadiness: 8
 
@@ -198,4 +198,4 @@ grpc:
 # By enabling this all the output events will be kept in memory until you read them with a gRPC client.
 # Make sure to have a consumer for them or leave this disabled.
 grpc_output:
-  enabled: false
+  enabled: true
  1. Run Falco
sudo ./build/userspace/falco/falco -r rules/falco_rules.yaml -c falco.yaml

3 (a). Try the /get method to obtain a stream of all the Falco alerts in the queue and shutdown.

sudo grpcurl -d '{}' --import-path userspace/falco/ --proto out
puts.proto --plaintext --keepalive-time 500000 --unix /var/run/falco.sock falco.outputs.service.get | jq

3 (b). Call the /sub method to obtain a stream of all the Falco alerts in the queue and wait online for (eventually) other alerts.

cat > /tmp/requests.txt
tail -f /tmp/requests.txt | sudo grpcurl -d @ --import-path userspace/falco/ --proto outputs.proto --plaintext --keepalive-time 500000 --unix /var/run/falco.sock falco.outputs.service.sub | jq

Write {} (on a new line each time) in the /tmp/requests.txt file.

Does this PR introduce a user-facing change?:

BREAKING CHANGE: server streaming gRPC outputs method is now `falco.outputs.service/get`
new: new bi-directional async streaming gRPC outputs (`falco.outputs.service/sub`) 
fix: high CPU usage when using server streaming gRPC outputs

leodido and others added 13 commits May 26, 2020 17:52
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Co-authored-by: Lorenzo Fontana <lo@linux.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
…ng method and a bidirectional method to obtain Falco alerts

Co-authored-by: Lorenzo Fontana <lo@linux.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Co-authored-by: Lorenzo Fontana <lo@linux.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Co-authored-by: Lorenzo Fontana <lo@linux.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
…e gRPC service

Co-authored-by: Lorenzo Fontana <lo@linux.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
if it is still running or not

Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
transitions

Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
Co-Authored-By: Leonardo Di Donato <leodidonato@gmail.com>
Signed-off-by: Lorenzo Fontana <lo@linux.com>
@leodido
Copy link
Member Author

leodido commented May 29, 2020

/milestone 0.24.0

@poiana poiana added this to the 0.24.0 milestone May 29, 2020
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
@poiana
Copy link
Contributor

poiana commented Jun 8, 2020

LGTM label has been added.

Git tree hash: 4c45770339ff16915d9e1b5081ca6f824db0c40d

@fntlnz
Copy link
Contributor

fntlnz commented Jun 10, 2020

/hold
Adding an hold on this so other Prs can be merged by poiana, feel free to remove once we have two approvals.

Copy link
Contributor

@mstemm mstemm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the question about the request being streaming. Otherwise looks like a great change!

// to `request` a stream of output `response`s.
service service {
// Subscribe to a stream of Falco outputs by sending a stream of requests.
rpc sub(stream request) returns (stream response);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the request have to be a stream? Isn't the big distinction that sub continues forever while get() returns outputs and stops?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we made that decision when we wrote up the original proposal /~https://github.com/falcosecurity/falco/pull/1241/files#diff-7399e66b7cefcefcb482bd41d0384dbb

I remember talking about the synchronous nature of the engine, but my memory is very bad as every one can tell you.

It looks like we have a really nice queue we can use now for outputs that would alleviate the need for us to solve asynchronicity (wow is that really a word?) at the gRPC level.

Design choice for the Kubernetes use case? I honestly can't remember.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the note in the description, but I didn't see anywhere in the server where a second request message was handled. (Maybe I missed it).

I thought the intent for this method was that you send a single request, and then you're subscribed to all responses, forever. What does a second request do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mstemm sorry I just saw your question.

The way it works is that each request has a stream of responses until the engine does not have anything else to send. Then another request is sent in the same stream.

The way it was before was slightly different. Only one request was made and a stream of responses was given which was put on hold in case the engine didn't have events. That hold was a bit problematic because it was a kind of pull mechanism which causes a lot of hits into the shared memory between all the threads. With the new approach we are using a push mechanism instead.

@@ -22,11 +22,10 @@ limitations under the License.

#include "formats.h"
#include "logger.h"
#include "falco_output_queue.h"
#include "falco_outputs_queue.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Knowing you @leodido you put a lot of thought into this.

Can you share your thinking behind the change. As I start digging more and more into the code I will be opening up more PRs and would like to understand your thinking before suggesting changes.

#include "tbb/concurrent_queue.h"

namespace falco
{
namespace output
namespace outputs
{
typedef tbb::concurrent_queue<response> response_cq;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like queue has some public methods/members. I know this is a nit, and is out of scope for this PR but some doc blocks that reference how we are supposed to use these methods would be useful.

The output queue is going to be very important as Falco gains adoptions so spending a few minutes to write down how others should use them might make sense.

Traditionally I have also made it a point to at the minimum do this in the public members of the header files, but however we decide to do it would be great!

Feel free to ignore this if we want to bring up a broader coding style/documentation discussion as a community. Just sharing thoughts as I look at the code.

@@ -36,7 +36,7 @@ class context
{
public:
context(::grpc::ServerContext* ctx);
~context() = default;
virtual ~context() = default;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with context, why virtual now? and what does context mean to you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationale is: context is a base class and instances of the derived classes get destructed.

When a base class destructor is not virtual and you have a variable pointing to an object derived from the base class, deleting the derived instance has undefined behaviour. Which can lead to memory leaks.

if(!m_stream_ctx->m_is_running)
{
m_state = request_context_base::FINISH;
m_res_writer->Finish(::grpc::Status::OK, this);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Finish() capitalized? I haven't seen many other capital methods in the code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -211,7 +224,7 @@ void falco::grpc::server::run()

while(server_impl::is_running())
{
sleep(1);
std::this_thread::sleep_for(std::chrono::milliseconds(100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see why we have a performance increase now :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see how this line can be related to that.

// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream

ctx.m_has_more = outputs::queue::get().try_pop(res);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new queue replaces the lua_pop() stuff right?

Also does the try language mean we should expect this to throw an exception?


// Start or continue streaming
// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we open an issue for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This todo is for future work on #857

@@ -1,3 +1,19 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you <3

@@ -1,3 +1,19 @@
/*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you <3

{ \
c.m_process_func = &server::impl; \
c.m_request_func = &svc::AsyncService::Request##rpc; \
c.start(this); \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am indifferent about using the this keyword (personally I like it, but I have only seen it in a few places).

Again - I think we should talk coding style at some point. I know it is a hot topic in other projects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which keyword would you use here?

Copy link
Contributor

@krisnova krisnova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my comments are questions involve style/convention - this is great code - but I think we would all agree it makes sense to starting to talk more and more about style/convention/etc as a team.

All of my questions are cosmetic, and overall this looks great to me!

Would love to see a response to @mstemm about the streaming output, but otherwise all systems go 🚀

@krisnova
Copy link
Contributor

/approve

@poiana
Copy link
Contributor

poiana commented Jun 29, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fntlnz, kris-nova

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@leodido
Copy link
Member Author

leodido commented Jun 29, 2020

/hold cancel

@poiana poiana merged commit 9eb0b7f into master Jun 29, 2020
@poiana poiana deleted the feat/bidi-grpc-outputs branch June 29, 2020 18:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants