bidirectional gRPC outputs + fix server streaming gRPC outputs #1241
Conversation
Commits co-authored by Lorenzo Fontana <lo@linux.com> and signed off by Leonardo Di Donato <leodidonato@gmail.com>.
/milestone 0.24.0
Signed-off-by: Leonardo Di Donato <leodidonato@gmail.com>
LGTM label has been added. Git tree hash: 4c45770339ff16915d9e1b5081ca6f824db0c40d
/hold
Just the question about the request being streaming. Otherwise looks like a great change!
// to `request` a stream of output `response`s.
service service {
	// Subscribe to a stream of Falco outputs by sending a stream of requests.
	rpc sub(stream request) returns (stream response);
Why does the request have to be a stream? Isn't the big distinction that sub continues forever while get() returns outputs and stops?
I think we made that decision when we wrote up the original proposal /~https://github.com/falcosecurity/falco/pull/1241/files#diff-7399e66b7cefcefcb482bd41d0384dbb
I remember talking about the synchronous nature of the engine, but my memory is very bad, as everyone can tell you.
It looks like we have a really nice queue we can use now for outputs that would alleviate the need for us to solve asynchronicity (wow is that really a word?) at the gRPC level.
Design choice for the Kubernetes use case? I honestly can't remember.
I see the note in the description, but I didn't see anywhere in the server where a second request message was handled. (Maybe I missed it).
I thought the intent for this method was that you send a single request, and then you're subscribed to all responses, forever. What does a second request do?
@mstemm sorry I just saw your question.
The way it works is that each request gets a stream of responses until the engine has nothing else to send. Then another request is sent on the same stream.
The way it was before was slightly different: only one request was made, and a stream of responses was given which was put on hold in case the engine didn't have events. That hold was a bit problematic because it was a kind of pull mechanism, causing a lot of hits on the memory shared between all the threads. With the new approach we use a push mechanism instead.
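The request/response cycle described here can be sketched roughly like this (hypothetical types and names, not the actual Falco classes): each request on the bidirectional stream drains whatever the engine has already pushed onto the queue, then the server stops writing and waits for the next request on the same stream.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Sketch of the "push" model: the engine pushes alerts onto a queue,
// and each client request simply drains what is already there.
struct alert_queue
{
	std::queue<std::string> alerts;

	// Non-blocking pop: returns false when there is nothing to send.
	bool try_pop(std::string& out)
	{
		if(alerts.empty())
			return false;
		out = alerts.front();
		alerts.pop();
		return true;
	}
};

// One "request" on the bidirectional stream: stream back everything
// currently enqueued, then stop and wait for the next request.
std::vector<std::string> handle_request(alert_queue& q)
{
	std::vector<std::string> responses;
	std::string alert;
	while(q.try_pop(alert))
		responses.push_back(alert);
	return responses;
}
```

Under this model the server never has to hold a response stream open while polling shared state; a second request from the client is what triggers the next drain.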
@@ -22,11 +22,10 @@ limitations under the License.

 #include "formats.h"
 #include "logger.h"
-#include "falco_output_queue.h"
+#include "falco_outputs_queue.h"
Knowing you, @leodido, you put a lot of thought into this.
Can you share your thinking behind the change? As I start digging more and more into the code I will be opening up more PRs, and I would like to understand your thinking before suggesting changes.
 #include "tbb/concurrent_queue.h"

 namespace falco
 {
-namespace output
+namespace outputs
 {
 typedef tbb::concurrent_queue<response> response_cq;
It looks like queue has some public methods/members. I know this is a nit, and out of scope for this PR, but some doc blocks describing how we are supposed to use these methods would be useful.
The output queue is going to be very important as Falco gains adoption, so spending a few minutes to write down how others should use it might make sense.
Traditionally I have made it a point to do this at a minimum for the public members of the header files, but however we decide to do it would be great!
Feel free to ignore this if we want to bring up a broader coding style/documentation discussion as a community. Just sharing thoughts as I look at the code.
@@ -36,7 +36,7 @@ class context
 {
 public:
 	context(::grpc::ServerContext* ctx);
-	~context() = default;
+	virtual ~context() = default;
Same here with context: why virtual now? And what does context mean to you?
The rationale is: context is a base class, and instances of the derived classes get destructed.
When a base class destructor is not virtual and you have a base-class pointer pointing at an object of a derived class, deleting the derived instance through that pointer has undefined behaviour, which can lead to memory leaks.
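A minimal illustration of the point (the class names here are made up for the example): with a virtual base destructor, deleting a derived object through a base pointer correctly runs the derived destructor; without `virtual`, that same delete would be undefined behaviour.

```cpp
#include <cassert>
#include <memory>

struct base
{
	// Without `virtual` here, the delete below would be undefined behaviour.
	virtual ~base() = default;
};

struct derived : base
{
	explicit derived(bool& flag) : m_flag(flag) {}
	// Must run when the object is deleted, even through a base pointer.
	~derived() override { m_flag = true; }
	bool& m_flag;
};

bool destroy_through_base(bool& flag)
{
	std::unique_ptr<base> p = std::make_unique<derived>(flag);
	p.reset(); // runs ~derived() only because ~base() is virtual
	return flag;
}
```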
if(!m_stream_ctx->m_is_running)
{
	m_state = request_context_base::FINISH;
	m_res_writer->Finish(::grpc::Status::OK, this);
Why is Finish() capitalized? I haven't seen many other capitalized methods in the code.
It is a method from the gRPC engine, not defined by us.
@@ -211,7 +224,7 @@ void falco::grpc::server::run()

 while(server_impl::is_running())
 {
-	sleep(1);
+	std::this_thread::sleep_for(std::chrono::milliseconds(100));
I can see why we have a performance increase now :)
I can't see how this line can be related to that.
// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream

ctx.m_has_more = outputs::queue::get().try_pop(res);
This new queue replaces the lua_pop() stuff, right?
Also, does the try in the name mean we should expect this to throw an exception?
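For what it's worth, `tbb::concurrent_queue::try_pop` does not throw on an empty queue: the `try` prefix means the call is non-blocking and reports success or failure through its `bool` return value. A mutex-based sketch mirroring that interface (this is not TBB itself, just an illustration of the contract):

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Mirrors the tbb::concurrent_queue::try_pop contract: never blocks,
// never throws for emptiness, signals failure via the return value.
template<typename T>
class concurrent_queue_sketch
{
public:
	void push(const T& item)
	{
		std::lock_guard<std::mutex> lk(m_mutex);
		m_items.push(item);
	}

	bool try_pop(T& out)
	{
		std::lock_guard<std::mutex> lk(m_mutex);
		if(m_items.empty())
			return false; // empty queue: report failure, no exception
		out = m_items.front();
		m_items.pop();
		return true;
	}

private:
	std::mutex m_mutex;
	std::queue<T> m_items;
};
```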
// Start or continue streaming
// m_status == stream_context::STREAMING?
// todo(leodido) > set m_stream
Should we open an issue for this?
This todo is for future work on #857
@@ -1,3 +1,19 @@
/*
Thank you <3
@@ -1,3 +1,19 @@
/*
Thank you <3
{ \
	c.m_process_func = &server::impl; \
	c.m_request_func = &svc::AsyncService::Request##rpc; \
	c.start(this); \
I am indifferent about using the this keyword (personally I like it, but I have only seen it in a few places).
Again, I think we should talk coding style at some point. I know it is a hot topic in other projects.
Which keyword would you use here?
All of my comments and questions involve style/convention. This is great code, but I think we would all agree it makes sense to start talking more and more about style/convention/etc. as a team.
All of my questions are cosmetic, and overall this looks great to me!
Would love to see a response to @mstemm about the streaming output, but otherwise all systems go 🚀
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: fntlnz, kris-nova. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
What type of PR is this?
/kind bug
/kind design
/kind feature
Any specific area of the project related to this PR?
NONE
What this PR does / why we need it:
Long story short:
While debugging and profiling the Falco gRPC server because of #1126, @fntlnz and I identified that the existing gRPC outputs API (server streaming) had an architectural problem causing high CPU consumption.
The implementation of that API was looping on all the remaining enqueued alerts at every process call to "keepalive" (pun intended) the stream.
An initial fix (announced by me during the community calls, very similar to #1237), swapping the conditional operands in the `if` and the `while` to profit from short-circuiting, did not lead to notable performance improvements in our tests. Anyway, more details about this topic have been said and presented during the two past community calls.
Thus, with this PR we:
- remove the `keepalive` field from the output request schema
- introduce a bidirectional streaming method (`falco.outputs.service/sub`) to watch Falco alerts
- fix the server streaming method (`falco.outputs.service/get`)

This PR contains breaking changes.
Some updates have to be carried out on other Falco projects:
Which issue(s) this PR fixes:
Fixes #1126
Fixes #1240
Fixes #856
Special notes for your reviewers:
Here is a graph showing how the `try_pop` used to pop out Falco alerts no longer governs the whole thing.
Notice that this graph is related to the following situation:
Click on the image and zoom it.
While the `thread_process` (because of the `while` on the `try_pop`) was occupying ~87.38% before (ref), now it's "only" occupying 37.74%, even with a client deliberately flooding the server with requests.
Jun 5th 2020 update:
We did further improvements regarding memory allocation (and CPU) while allocating the response.
The latest change in this commit allows us to decrease the `thread_process` impact by 20% (it totals ~17% now). This is very useful in very intensive scenarios or when the server is flooded by requests.
How to try it out:
sudo ./build/userspace/falco/falco -r rules/falco_rules.yaml -c falco.yaml
3 (a). Try the /get method to obtain a stream of all the Falco alerts in the queue and shutdown.
3 (b). Call the /sub method to obtain a stream of all the Falco alerts in the queue and wait online for (eventually) other alerts.
cat > /tmp/requests.txt
tail -f /tmp/requests.txt | sudo grpcurl -d @ --import-path userspace/falco/ --proto outputs.proto --plaintext --keepalive-time 500000 --unix /var/run/falco.sock falco.outputs.service.sub | jq
Write `{}` (on a new line each time) in the `/tmp/requests.txt` file.
Does this PR introduce a user-facing change?: