bidirectional gRPC outputs + fix server streaming gRPC outputs #1241
Conversation
Commits co-authored by Lorenzo Fontana <lo@linux.com> and signed off by Leonardo Di Donato <leodidonato@gmail.com>.
/milestone 0.24.0
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
LGTM label has been added. Git tree hash: 4c45770339ff16915d9e1b5081ca6f824db0c40d
/hold
Just the question about the request being streaming. Otherwise looks like a great change!
// to `request` a stream of output `response`s.
service service {
	// Subscribe to a stream of Falco outputs by sending a stream of requests.
	rpc sub(stream request) returns (stream response);
Why does the request have to be a stream? Isn't the big distinction that sub continues forever while get() returns outputs and stops?
I think we made that decision when we wrote up the original proposal /~https://github.com/falcosecurity/falco/pull/1241/files#diff-7399e66b7cefcefcb482bd41d0384dbb
I remember talking about the synchronous nature of the engine, but my memory is very bad, as everyone can tell you.
It looks like we have a really nice queue we can use now for outputs that would alleviate the need for us to solve asynchronicity (wow is that really a word?) at the gRPC level.
Design choice for the Kubernetes use case? I honestly can't remember.
I see the note in the description, but I didn't see anywhere in the server where a second request message was handled. (Maybe I missed it).
I thought the intent for this method was that you send a single request, and then you're subscribed to all responses, forever. What does a second request do?
@mstemm sorry I just saw your question.
The way it works is that each request gets a stream of responses until the engine has nothing else to send. Then another request is sent on the same stream.
The way it was before was slightly different: only one request was made, and a stream of responses was given which was put on hold in case the engine didn't have events. That hold was a bit problematic because it was a kind of pull mechanism, causing a lot of hits on the memory shared between all the threads. With the new approach we use a push mechanism instead.
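The request/response cycle described here can be sketched roughly like this (hypothetical types and names, not the actual Falco classes): each request on the bidirectional stream drains whatever the engine has already pushed onto the queue, then the server stops writing and waits for the next request on the same stream.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Sketch of the "push" model: the engine pushes alerts onto a queue,
// and each client request simply drains what is already there.
struct alert_queue
{
	std::queue<std::string> alerts;

	// Non-blocking pop: returns false when there is nothing to send.
	bool try_pop(std::string& out)
	{
		if(alerts.empty())
			return false;
		out = alerts.front();
		alerts.pop();
		return true;
	}
};

// One "request" on the bidirectional stream: stream back everything
// currently enqueued, then stop and wait for the next request.
std::vector<std::string> handle_request(alert_queue& q)
{
	std::vector<std::string> responses;
	std::string alert;
	while(q.try_pop(alert))
		responses.push_back(alert);
	return responses;
}
```

Under this model the server never has to hold a response stream open while polling shared state; a second request from the client is what triggers the next drain.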
@@ -22,11 +22,10 @@ limitations under the License.

 #include "formats.h"
 #include "logger.h"
-#include "falco_output_queue.h"
+#include "falco_outputs_queue.h"
Knowing you, @leodido, you put a lot of thought into this.
Can you share your thinking behind the change? As I start digging more and more into the code I will be opening up more PRs, and I would like to understand your thinking before suggesting changes.
 #include "tbb/concurrent_queue.h"

 namespace falco
 {
-namespace output
+namespace outputs
 {
 typedef tbb::concurrent_queue<response> response_cq;
It looks like queue has some public methods/members. I know this is a nit, and out of scope for this PR, but some doc blocks describing how we are supposed to use these methods would be useful.
The output queue is going to be very important as Falco gains adoption, so spending a few minutes to write down how others should use it might make sense.
Traditionally I have made it a point to do this at a minimum for the public members of the header files, but however we decide to do it would be great!
Feel free to ignore this if we want to bring up a broader coding style/documentation discussion as a community. Just sharing thoughts as I look at the code.
@@ -36,7 +36,7 @@ class context
 {
 public:
 	context(::grpc::ServerContext* ctx);
-	~context() = default;
+	virtual ~context() = default;
Same here with context: why virtual now? And what does context mean to you?
The rationale is: context is a base class, and instances of the derived classes get destructed.
When a base class destructor is not virtual and you have a base-class pointer pointing at an object of a derived class, deleting the derived instance through that pointer has undefined behaviour, which can lead to memory leaks.
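A minimal illustration of the point (the class names here are made up for the example): with a virtual base destructor, deleting a derived object through a base pointer correctly runs the derived destructor; without `virtual`, that same delete would be undefined behaviour.

```cpp
#include <cassert>
#include <memory>

struct base
{
	// Without `virtual` here, the delete below would be undefined behaviour.
	virtual ~base() = default;
};

struct derived : base
{
	explicit derived(bool& flag) : m_flag(flag) {}
	// Must run when the object is deleted, even through a base pointer.
	~derived() override { m_flag = true; }
	bool& m_flag;
};

bool destroy_through_base(bool& flag)
{
	std::unique_ptr<base> p = std::make_unique<derived>(flag);
	p.reset(); // runs ~derived() only because ~base() is virtual
	return flag;
}
```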
if(!m_stream_ctx->m_is_running)
{
	m_state = request_context_base::FINISH;
	m_res_writer->Finish(::grpc::Status::OK, this);
Why is Finish() capitalized? I haven't seen many other capitalized methods in the code.
It is a method from the gRPC engine, not defined by us.
@@ -211,7 +224,7 @@ void falco::grpc::server::run()

 while(server_impl::is_running())
 {
-	sleep(1);
+	std::this_thread::sleep_for(std::chrono::milliseconds(100));
I can see why we have a performance increase now :)
I can't see how this line can be related to that.
// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream

ctx.m_has_more = outputs::queue::get().try_pop(res);
This new queue replaces the lua_pop() stuff, right?
Also, does the try in the name mean we should expect this to throw an exception?
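For what it's worth, `tbb::concurrent_queue::try_pop` does not throw on an empty queue: the `try` prefix means the call is non-blocking and reports success or failure through its `bool` return value. A mutex-based sketch mirroring that interface (this is not TBB itself, just an illustration of the contract):

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Mirrors the tbb::concurrent_queue::try_pop contract: never blocks,
// never throws for emptiness, signals failure via the return value.
template<typename T>
class concurrent_queue_sketch
{
public:
	void push(const T& item)
	{
		std::lock_guard<std::mutex> lk(m_mutex);
		m_items.push(item);
	}

	bool try_pop(T& out)
	{
		std::lock_guard<std::mutex> lk(m_mutex);
		if(m_items.empty())
			return false; // empty queue: report failure, no exception
		out = m_items.front();
		m_items.pop();
		return true;
	}

private:
	std::mutex m_mutex;
	std::queue<T> m_items;
};
```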
// Start or continue streaming
// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream
Should we open an issue for this?
This todo is for future work on #857
@@ -1,3 +1,19 @@
/*
Thank you <3
@@ -1,3 +1,19 @@
/*
Thank you <3
{ \
	c.m_process_func = &server::impl; \
	c.m_request_func = &svc::AsyncService::Request##rpc; \
	c.start(this); \
I am indifferent about using the this keyword (personally I like it, but I have only seen it in a few places).
Again, I think we should talk coding style at some point. I know it is a hot topic in other projects.
Which keyword would you use here?
All of my comments and questions involve style/convention. This is great code, but I think we would all agree it makes sense to start talking more and more about style/convention/etc. as a team.
All of my questions are cosmetic, and overall this looks great to me!
Would love to see a response to @mstemm about the streaming output, but otherwise all systems go 🚀
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: fntlnz, kris-nova. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
What type of PR is this?
/kind bug
/kind design
/kind feature
Any specific area of the project related to this PR?
NONE
What this PR does / why we need it:
Long story short:
While debugging and profiling the Falco gRPC server because of #1126, @fntlnz and I identified that the existing gRPC outputs API (server streaming) had an architectural problem causing high CPU consumption.
The implementation of that API was looping on all the remaining enqueued alerts at every process call to "keepalive" (pun intended) the stream.
An initial fix (announced by me during the community calls, very similar to #1237), swapping the conditional operands in the `if` and the `while` to profit from short-circuiting, did not lead to notable performance improvements in our tests. Anyway, more details about this topic have been said and presented during the two past community calls.
Thus, with this PR we:
- remove the `keepalive` field from the output request schema
- introduce a bidirectional streaming method (`falco.outputs.service/sub`) to watch Falco alerts
- fix the server streaming method (`falco.outputs.service/get`)

This PR contains breaking changes.
Some updates have to be carried out on other Falco projects:
Which issue(s) this PR fixes:
Fixes #1126
Fixes #1240
Fixes #856
Special notes for your reviewers:
Here is a graph showing how the `try_pop` used to pop out Falco alerts no longer governs the whole thing.
Notice that this graph is related to the following situation:
Click on the image and zoom it.
While the `thread_process` (because of the `while` on the `try_pop`) was occupying ~87.38% before (ref), now it's "only" occupying 37.74%, even with a client deliberately flooding the server with requests.
Jun 5th 2020 update:
We did further improvements regarding memory allocation (and CPU) while allocating the response.
The latest change in this commit allows us to decrease the `thread_process` impact by 20% (it totals ~17% now). This is very useful in very intensive scenarios or when the server is flooded by requests.
How to try it out:
sudo ./build/userspace/falco/falco -r rules/falco_rules.yaml -c falco.yaml
3 (a). Try the /get method to obtain a stream of all the Falco alerts in the queue and shutdown.
3 (b). Call the /sub method to obtain a stream of all the Falco alerts in the queue and wait online for (eventually) other alerts.
cat > /tmp/requests.txt
tail -f /tmp/requests.txt | sudo grpcurl -d @ --import-path userspace/falco/ --proto outputs.proto --plaintext --keepalive-time 500000 --unix /var/run/falco.sock falco.outputs.service.sub | jq
Write `{}` (on a new line each time) in the `/tmp/requests.txt` file.
Does this PR introduce a user-facing change?: